Software Defined Infrastructure

Big Ideas, Big Challenges

At randrr, we have big ideas. We do everything in the cloud. We let Amazon Web Services worry about the pliers and wires so we can focus on product.

To bring big ideas to market at scale, we can't hire an army of engineers to click around in the Amazon Web Services console or log into our instances all the time to configure them.

We have better things to do, so we need automation that can handle this for us.


The Old Way is Not So Old

We could do it like this...

What's happening here?

  1. Name your servers cute and memorable names.
  2. Lovingly handcraft your configurations.
  3. Hire lots of engineers to deploy/maintain your infrastructure by logging into systems, copying files around, and tweaking things.

Benefits? It can be faster to get started. Problems? It doesn't scale: weekends lost to upgrading a single component, configuration drift, security holes, knowledge silos. The list goes on and on.

Who would do this?

EVERYBODY! Most organizations, including some of the biggest and most successful companies in the world, do exactly the process described above. The landscape is changing, but it takes time to rework the legacy.


This is nuts. We can do better.

Since we're starting from scratch during a tools revolution and we don't have any legacy code, we're automating everything from the start.


  • No knowledge silos. The departure of any team member should not cripple the organization, because the knowledge lives in the code, not in their head.
  • No drones. A small team of super smart engineers is better in every way than a huge team of... well... you know...
  • No pets. Our servers are not our pets. We don't name them or lovingly care for them. We replace them when they have issues. Bad for Fido, good for the cloud.
  • Built to fail. Stuff breaks, so the system should expect it and react accordingly with no downtime.
  • Built to scale. We're going to succeed, so we need to handle the load.

"Sounds great! How do I do it?"


Software Defined Infrastructure

Software defined infrastructure is a technique for declaring what our networks, server instances, database configurations, and container runtime environments should look like, then telling a system to enact those changes across the infrastructure, all without logging into each machine and intervening by hand. Done right, we can run 200 server instances across 3 continents and log into just one of them to maintain them all.
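The idea in miniature, as a hypothetical Ansible play (the package and service names here are purely illustrative): you declare the end state, and the tool converges every targeted host to it.

```yaml
# Illustrative sketch: a declarative play. Nothing here says *how* to install
# or start anything; it only states the desired end state, and Ansible makes
# each host match it, whether that's 2 hosts or 200.
- hosts: all
  become: yes
  tasks:
    - name: Ensure ntp is installed
      yum: name=ntp state=present
    - name: Ensure ntpd is running and starts on boot
      service: name=ntpd state=started enabled=yes
```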


How we do it at randrr

We use Ansible for our automation. It's simple, powerful, and extensible. We can spin up whole environments for development and testing in a few minutes, then tear them down later so we're not paying overnight for non-production environments nobody is using. We maintain every component through Ansible scripts, codifying our experience so we can react quickly to problems no matter who is on call. We can patch a whole environment, with all of its components, using a single command.
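The teardown half of that claim can be sketched with the same ec2 module used for provisioning. This is a hedged sketch, not our exact playbook: `instances_to_terminate` is an assumed variable holding the instance IDs gathered for the environment.

```yaml
# Sketch: terminating a whole non-production environment in one task.
# instances_to_terminate is assumed to be the list of instance IDs
# collected for the environment (e.g. by filtering on the Environment tag).
- name: Tear down all {{ env }} instances
  ec2:
    state: absent
    region: us-east-1
    instance_ids: "{{ instances_to_terminate }}"
    wait: true
```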

Here's how it works for our Kafka service:

  • Kafka Configuration Template - We think about this once when we are engineering the system.
  • Kafka Variables - We tune these as we learn about our usage profile.
  • Dynamic Inventory - We let Amazon tell us about our servers and where they are.

Here is a real (but simplified) sample of Kafka provisioning at work. We provision EC2 instances, download Kafka, create directories and volumes, and turn it on.

We create a file to hold our configurations and define the layout of our instances across availability zones. This would work for defining instances across regions as well.

instance_layout:
  - availability_zone: "us-east-1a"
    region: "us-east-1"
    count: 1
  - availability_zone: "us-east-1c"
    region: "us-east-1"
    count: 1
  - availability_zone: "us-east-1d"
    region: "us-east-1"
    count: 1
  - availability_zone: "us-east-1e"
    region: "us-east-1"
    count: 1
kafka_version: 0.10.2.1      # example value; set to the release you run
kafka_scala_version: 2.11
kafka_data_directory: /data/kafka
kafka_log_directory: /var/log/kafka

Now that we have codified our configuration, we tell Amazon Web Services to provision the instances we require for Kafka to run. This code assumes the variables defined in the configuration file above. Note: it can be used for all service types, not just Kafka.

- name: Provision EC2 for {{ env }} {{ type }}
  local_action:
    module: ec2
    key_name: "randrr-{{ env }}"
    group_id: "{{ dynamic_securitygroup_id }}"
    instance_type: "{{ ec2_instance_type }}"
    image: "{{ ec2_image }}"
    zone: "{{ item.availability_zone }}"
    vpc_subnet_id: "{{ item.subnet }}"   # subnet comes from the full (non-simplified) instance_layout
    region: "{{ item.region }}"
    instance_tags: '{"Name":"{{env}} {{type}}","Type":"{{type}}","Environment":"{{env}}","BoxID":"{{box_id}}"}'
    assign_public_ip: yes
    wait: true
    count: "{{ item.count }}"
    volumes:
      - device_name: /dev/xvda
        volume_type: gp2
        volume_size: "{{ ec2_root_volume_size }}"
        delete_on_termination: true
      - device_name: /dev/xvdb
        ephemeral: ephemeral0
  with_items: "{{ instance_layout }}"
  register: ec2_hosts
- name: Add all instance private IPs to host group
  add_host: hostname={{ item.private_ip }} groups=new_hosts
  when: not (public_ssh | default(false))
  with_items: "{{ ec2_hosts.results | sum(attribute='instances', start=[]) }}"
- name: Wait for the instances to boot via private IP checks
  wait_for: host={{ item.private_ip }} port=22 delay=60 timeout=320 state=started
  when: not (public_ssh | default(false))
  with_items: "{{ ec2_hosts.results | sum(attribute='instances', start=[]) }}"

At this point in the process the instances are provisioned and running. We tell each of them, in parallel, to configure themselves and fire up Kafka. This orchestration can be customized on a service-by-service basis.

- name: yum update
  tags: update
  yum: name=* state=latest
- name: Download and unzip Kafka
  tags: update
  unarchive:
    # kafka_mirror is an assumed variable; the original mirror URL is elided
    src: "{{ kafka_mirror }}/{{ kafka_version }}/kafka_{{ kafka_scala_version }}-{{ kafka_version }}.tgz"
    dest: /opt
    remote_src: yes
    creates: "/opt/kafka_{{ kafka_scala_version }}-{{ kafka_version }}"
- name: Create data directory
  tags: update
  file: path={{ kafka_data_directory }} state=directory
- name: Create log directory
  tags: update
  file: path={{ kafka_log_directory }} state=directory
- name: Generate Kafka init script
  tags: update,configure
  template: src=kafka.j2 dest=/etc/init.d/kafka mode=0755   # template name assumed
- name: Enable Kafka service
  tags: update
  service: name=kafka enabled=yes
- name: Start Kafka
  tags: update
  service: name=kafka state=restarted

We then execute the command to deploy the changes. Ansible generates the configuration files, pushes them to the servers in parallel, performs rolling restarts on Kafka to implement the changes, and notifies Slack when it's done.

$ ansible-playbook --private-key=~/.ssh/randrr-$ENVIRONMENT.pem \
   -e "init_cluster=true env=$ENVIRONMENT type=kafka" \
   -i localhost, $SCRIPTPATH/../kafka-provision.yml
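The rolling-restart-and-notify behavior described above can be sketched as its own play. `serial: 1` is what makes the restart rolling, and the Slack token variable is an assumption:

```yaml
# Sketch: restart one broker at a time so the cluster stays available,
# then post to Slack. slack_token is an assumed variable; the group name
# follows the EC2 dynamic inventory's tag_<key>_<value> convention.
- hosts: tag_Type_kafka
  serial: 1
  tasks:
    - name: Rolling restart of Kafka
      service: name=kafka state=restarted
- hosts: localhost
  tasks:
    - name: Notify Slack
      slack:
        token: "{{ slack_token }}"
        msg: "Kafka deploy complete in {{ env }}"
```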

We actually wrap this call in a shell script to provide a consistent interface across all the services we run. The result is that provisioning a Kafka cluster takes a single one-line command.

$ provision-kafka dev

One command to start up Kafka across multiple availability zones. We build all of our system's automation in this mold, from provisioning to runtime management to patching and maintenance.

This runs for five minutes or so and the result is a working, resilient Kafka cluster running across four availability zones, built to fail and built to scale. Our dev team doesn't need to know how the sausage is made to use the system, and the operations team doesn't need to be an army of 20 to keep it running.
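A quick smoke check fits the same mold. The sketch below assumes Kafka's default broker port of 9092 and the dynamic-inventory group naming described earlier:

```yaml
# Sketch: verify every broker in the new cluster is actually listening.
- hosts: tag_Type_kafka
  tasks:
    - name: Wait for the Kafka broker port to open
      wait_for: port=9092 timeout=60
```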

Kafka is just one of our components. Our stack is built on Kubernetes talking to Cassandra and Elasticsearch, queueing into Kafka, which feeds Spark Streaming against data-lake-backed machine learning models. All fully monitored, locked down, and running hot-hot across multiple regions and availability zones.


Serious Craftsmanship

It's serious craft and we need serious craftspeople to make it happen. If you think like us, if you're up to it, and if you are looking to work with some of the best around, drop us a line at [email protected].