I spent more than ten years developing code with little thought to how that code really made it into the hands of my end users. When I first began developing software in my undergraduate classes, delivering code was literally delivering code files. Eventually, it became building an installer, or packaging those files into a JAR file, or at the very least a zip file. After I joined the workforce, delivering code didn’t change too much, but it became a lot more frequent.

That frequency necessitated repeatability. Frequency and repeatability led to automation. The first time I built an installer from code checked into source control using Jenkins with automated unit tests, I thought I had reached the pinnacle of continuous integration. It may well be the pinnacle for industries where software delivery still means a physical installation on the end user’s side. It wasn’t until recently that I began deploying web applications into the cloud. It’s a very different world. There’s something quite different about pushing bits to a live website. It’s both exhilarating and terrifying.

Working for a company the size of Rally, I have learned a lot about the process of deploying code. Frequency and repeatability are even more important. Automation is more than just a necessity. It is a business requirement, and that has led us to automate more than just our software deployments. With Terraform and Amazon Web Services, we are automating our infrastructure.

Becoming Dev Ops

As a developer who never gave a second thought to how my users wound up using my software, I now spend my time putting my team’s code in the cloud and managing the servers where that code runs. Going from developer to operations, I was concerned I would have to learn about networking and server management, operating system updates and firewalls, subnets and VPNs. I was right. I have learned about all of these things, but the tools we use have made it much easier to grasp the concepts. With Terraform and AWS, the transition to operations was much simpler than I thought.

Terraform was designed to be declarative. Instead of simply wrapping the AWS CLI in some sort of DSL, Terraform forces us to think about state. Instead of thinking about how to set up our infrastructure, Terraform allows us to think about what it should look like. Its declarative nature allows us to identify relationships and dependencies with respect to servers, load balancers, firewalls, DNS records, databases, etc. With just a little bit of knowledge about Terraform and AWS, a newcomer can open the configuration files to discover how our infrastructure is put together.

The following Terraform code describes a few different resources in AWS. Even though I did not know what any of these were at the time, by reading carefully and doing just a little bit of research, I learned that the code describes how communication occurs between our tool network and our development network. Specifically, the code describes a subnet per availability zone, count = "${length(var.az)}". The subnets are then associated with route tables, and those route tables control the traffic on the development network. In order for traffic from the tool network to reach the development network via those route tables, the aws_route resource creates the connection through something called a peering connection.

resource "aws_subnet" "dev-private" {
  vpc_id = "${aws_vpc.dev.id}"
  count = "${length(var.az)}"
  cidr_block = "${cidrsubnet(var.vpc-cidr,4,count.index + length(var.az))}"
  availability_zone = "${element(var.az,count.index)}"
  tags {
    Name = "private.${element(split(" ",var.az),count.index)}.${var.namespace}"
  }
}

resource "aws_route_table_association" "dev-private" {
  count = "${length(var.az)}"
  subnet_id = "${element(aws_subnet.dev-private.*.id,count.index)}"
  route_table_id = "${element(aws_route_table.dev-private.*.id,count.index)}"
}

resource "aws_route_table" "dev-private" {
  vpc_id = "${aws_vpc.dev.id}"
  count = "${length(var.az)}"
  propagating_vgws = ["${var.propagating_vgws}"]
  tags {
    Name = "${element(var.az, count.index)}.private.${var.namespace}"
  }
}

resource "aws_route" "dev-tool-private" {
  count = "${length(var.az)}"
  route_table_id = "${element(split(" ", aws_route_table.dev-private.id),count.index)}"
  destination_cidr_block = "${aws_vpc.tool.cidr_block}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.tool-dev.id}"
}

Even though it took me a while to understand all of these concepts and how they truly work together, I can easily contribute to this code base. For example, if I want to add a connection between our tool network and our production network, I can follow the same pattern using the production route tables instead of the development route tables. Or, if I want resources in the development network to communicate with resources in the tool network, instead of just the one-way communication described above, I would add another aws_route for the tool-private aws_route_table resources.
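As a sketch of that second case, the reverse route would look something like the following. The aws_route_table.tool-private resources and var.tool-az are assumptions on my part about how the tool network is defined; the peering connection is the same one referenced above.

# Hypothetical reverse route: let the tool network's private route tables
# reach the development VPC through the existing peering connection.
resource "aws_route" "tool-dev-private" {
  count = "${length(var.tool-az)}"
  route_table_id = "${element(aws_route_table.tool-private.*.id,count.index)}"
  destination_cidr_block = "${aws_vpc.dev.cidr_block}"
  vpc_peering_connection_id = "${aws_vpc_peering_connection.tool-dev.id}"
}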

Certainly, I’m still learning. I was lucky that someone with much more experience and operations knowledge laid the groundwork for our team. Our Terraform configurations are readable and encapsulated. They are organized in such a way as to keep different concerns separate. We take advantage of modularization as much as possible to achieve code reuse and maintainability. We try to keep our variable names and resource names descriptive and self-documenting. In essence, our Terraform follows the rules of clean code, and that makes the transition from dev to devops smoother because I can apply the lessons learned from developing software for nearly a decade.

For example, we have a source repository for the Terraform files that define our network layout, one for the files that define our QA environments, one for the files that define our CI environment, and so on. We can identify the risk of a change by looking at the repository where the files are stored. The organization of the files helps us focus on the things that need changing because everything in the repository is related to a single aspect of the infrastructure.

In another example, Terraform gives us the tools to limit copy-pasta. We found the need to have multiple launch configurations because some servers needed more memory and others needed more CPU. Even though the servers need different hardware specifications, the initialization scripts are identical. We take advantage of template_file to define the scripts for the servers, a.k.a. the user data, and share the rendered result across launch configurations. By not repeating ourselves, we make our Terraform much more maintainable.
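A minimal sketch of that sharing, assuming a hypothetical bootstrap.ps1.tpl template and placeholder ami_id and environment variables; the instance types are only illustrative. Both launch configurations render the same template, shown here in its data source form.

# One shared user data template...
data "template_file" "user_data" {
  template = "${file("${path.module}/templates/bootstrap.ps1.tpl")}"
  vars {
    environment = "${var.environment}"
  }
}

# ...consumed by two launch configurations with different hardware profiles.
resource "aws_launch_configuration" "cpu-heavy" {
  name_prefix = "cpu-heavy-"
  image_id = "${var.ami_id}"
  instance_type = "c4.large"
  user_data = "${data.template_file.user_data.rendered}"
}

resource "aws_launch_configuration" "memory-heavy" {
  name_prefix = "memory-heavy-"
  image_id = "${var.ami_id}"
  instance_type = "r3.large"
  user_data = "${data.template_file.user_data.rendered}"
}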

Side note: When getting started with organizing our terraform files, we took this blog post pretty seriously. This approach makes our files easier to read and way safer to run.

Change Control

If we only considered code cleanliness, Terraform would not be our only choice for managing our infrastructure. Amazon Web Services can be treated as infrastructure as code on its own. With the AWS CLI, we could have written scripts to manage all of our resources. In fact, we do have scripts that help us identify when servers are healthy and ready for software. The APIs were designed with automation in mind. Terraform, however, builds on that: it abstracts away the API wrangling and helps us understand the changes we make.

Terraform compares the current state of resources in AWS to the desired state described in the configuration files. If it identifies differences, it makes the appropriate AWS API calls to reconcile them, creating, modifying, or destroying resources as appropriate. Even before applying the changes, Terraform lets us view a plan of those changes, so we can confirm that our edits to the configuration files will make the expected changes to the resources.

The following shows an example of a Terraform plan. It shows the changes required to enable sticky sessions on a load balancer. The load balancer’s listeners change from TCP on ports 80 and 443 to HTTP on port 80 and HTTPS on port 443, with a certificate configured for the HTTPS listener. The stickiness policy is tied to the load balancer’s port 443 with a cookie expiration time of one hour.

~ aws_elb.app
  listener.2249117627.instance_port: "" => "443"
  listener.2249117627.instance_protocol: "" => "https"
  listener.2249117627.lb_port: "" => "443"
  listener.2249117627.lb_protocol: "" => "https"
  listener.2249117627.ssl_certificate_id: "" => "arn:aws:iam::account:server-certificate/cert-id"
  listener.2974294026.instance_port: "80" => "0"
  listener.2974294026.instance_protocol: "tcp" => ""
  listener.2974294026.lb_port: "80" => "0"
  listener.2974294026.lb_protocol: "tcp" => ""
  listener.2974294026.ssl_certificate_id: "" => ""
  listener.3057123346.instance_port: "" => "80"
  listener.3057123346.instance_protocol: "" => "http"
  listener.3057123346.lb_port: "" => "80"
  listener.3057123346.lb_protocol: "" => "http"
  listener.3057123346.ssl_certificate_id: "" => ""
  listener.610193557.instance_port: "443" => "0"
  listener.610193557.instance_protocol: "tcp" => ""
  listener.610193557.lb_port: "443" => "0"
  listener.610193557.lb_protocol: "tcp" => ""
  listener.610193557.ssl_certificate_id: "" => ""

+ aws_lb_cookie_stickiness_policy.app
  cookie_expiration_period: "3600"
  lb_port: "443"
  load_balancer: "app-env"
  name: "app-anv"

The pull request to the configuration files was fewer than twenty lines. The change itself is not that complicated; it requires about three or four API calls or a dozen clicks in the AWS Console. But encapsulating those calls in a code change makes it much easier to validate, and the configuration files make it repeatable and reduce the chance of human error.
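For context, a change of that shape looks roughly like the following in configuration. This is a sketch rather than the actual pull request; the certificate variable, subnets, and names are placeholders.

resource "aws_elb" "app" {
  name = "app-env"
  subnets = ["${aws_subnet.dev-private.*.id}"]

  # Plain HTTP listener
  listener {
    lb_port = 80
    lb_protocol = "http"
    instance_port = 80
    instance_protocol = "http"
  }

  # HTTPS listener with the certificate attached
  listener {
    lb_port = 443
    lb_protocol = "https"
    instance_port = 443
    instance_protocol = "https"
    ssl_certificate_id = "${var.ssl_certificate_arn}"
  }
}

# Sticky sessions on the HTTPS port with a one-hour cookie expiration
resource "aws_lb_cookie_stickiness_policy" "app" {
  name = "app-env"
  load_balancer = "${aws_elb.app.id}"
  lb_port = 443
  cookie_expiration_period = 3600
}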

Push the Button

The repeatability that comes from using Terraform is both exciting and a little dangerous. With Terraform, we have a way to automatically create all the necessary resources to stand up a full application environment just as it is in production. We can literally push a button (maybe two) and within an hour have a fully functioning QA environment whose only difference from production is the size of the servers. The conformity of our environments gives us confidence in our QA process. If the code works in the testing environment, then it should work in production (ignoring any differences in data, but that’s a different problem). That is the exciting part. The danger arises because these resources are not free. If we push the button too many times, someone might come take that button away.

Even more dangerous is the fact that automation necessitates more and more automation. If we stand up a dozen environments, we have to make sure those environments do not drift from one another. Otherwise, we might end up chasing ghosts. We need an automated process to push changes across environments when the code change makes it to the master branch. We need monitors and automated tests to give us confidence in our changes across those environments. We are not quite there yet, but we continue to improve our monitoring. We continue to automate the flow of changes from developers’ keyboards to production. As any good devops engineer knows, the goal is to automate our jobs away.

A small clarification: while Terraform on its own can enable this kind of simplicity and repeatability, we use a few other tools to make it even better. We use TeamCity to manage our environments, manage our variables, and automate the Terraform runs. We use Octopus Deploy to deploy our software. These three tools together (Terraform, TeamCity, and Octopus Deploy) allow us to automatically spin up environments in the morning and spin them down at night with a few scheduled triggers. TeamCity kicks off the Terraform jobs that create and destroy the resources, and Octopus Deploy automatically deploys code to the servers when they register.

A Culture of Code

When our applications were hosted on more traditional infrastructure, that is, physically managed hardware, few on our team knew much about where or how the code made it to the end user. Now, with our transition to Terraform and AWS, it is much easier to spread the knowledge. Our infrastructure is documented in code. Approving changes to infrastructure is no different from approving changes to our software. We are approaching a point where development and operations are actually one thing: devops. Terraform and AWS are becoming just two more tools in our tool set for delivering user value. Our full-stack developers become more than just front-end and back-end experts. They can deliver the code and the servers where it runs.

Above I mentioned how we use Terraform, TeamCity, and Octopus Deploy to manage our environments for testing. Well, we also use Terraform to create the infrastructure that runs TeamCity and Octopus Deploy. Using Terraform in this way, we can easily test out new versions of our tools without impacting developers’ day-to-day work. With a change to a few variables, we stand up a copy of our TeamCity infrastructure, including servers, database, etc., and test changes to the TeamCity software before rolling it out to our developers. This also allows us to create similar infrastructure for other teams at Rally who are interested in using Octopus Deploy and TeamCity. We can create logical separation, but it is all based on the same code.
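As a rough sketch of what “a change to a few variables” means in practice, assume a hypothetical teamcity module that takes a namespace variable; instantiating it a second time with a different namespace yields an isolated copy of the same infrastructure.

# The real TeamCity infrastructure, parameterized by namespace.
module "teamcity" {
  source = "../modules/teamcity"
  namespace = "teamcity"
}

# A throwaway copy for testing a TeamCity upgrade before rolling it out.
module "teamcity-upgrade-test" {
  source = "../modules/teamcity"
  namespace = "teamcity-upgrade-test"
}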

Our team within Rally is not the first to adopt AWS, but we are the first to go all in on using Terraform to manage our resources in AWS. In our move to AWS, we had help from other teams within Rally. When they made changes to our infrastructure, they made them manually. We took each of those opportunities to document the changes in Terraform, revert the manual work, and apply it via Terraform. We are very protective of our AWS account. If anyone requires a change for any reason, it has to go through Terraform.

Managing Windows

I have not mentioned operating systems yet, which might seem strange considering I am discussing infrastructure management. Our team at Rally does most of its work in .NET, which means we are running our software on Windows. If someone is comfortable managing Windows servers on traditional, managed infrastructure, I imagine the transition to Terraform and AWS would not be a big one. Having never managed traditional infrastructure, I cannot comment on that transition, but I can say that keeping our servers up-to-date and secure is very straightforward.

Along with Terraform, we use a tool called Packer, also developed by HashiCorp. Just as Terraform allows us to declare what our infrastructure looks like, Packer allows us to use code to declare what a Windows server looks like in terms of installed software and OS configuration. Packer uses JSON to define what files and scripts will be copied to and run on an AWS EC2 instance. Once all the scripts run, we take an image of the server, an AMI. We then use that AMI to provision the servers in our infrastructure.
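On the Terraform side, one way to consume those Packer-built images is the aws_ami data source. This is only a sketch, assuming the images follow a naming convention like windows-app-*; the AMI id could just as easily be passed in as a plain variable.

# Look up the most recent Packer-built Windows image owned by this account.
data "aws_ami" "windows-app" {
  most_recent = true
  owners = ["self"]

  filter {
    name = "name"
    values = ["windows-app-*"]
  }
}

# Launch configurations then reference "${data.aws_ami.windows-app.id}" as their image_id.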

When we want to update our servers with the latest security patches, we simply need to use a different base AMI. There is no management of existing servers. They are truly ephemeral. If a server dies, we lose nothing. Logs and events are captured in Splunk and Datadog respectively. With the following Terraform option, we can roll out servers with the latest updates without any downtime.

lifecycle {
  create_before_destroy = true
}

Using that option on our autoscaling groups and our launch configurations allows us to stand up new servers before the existing ones are destroyed, which is the opposite of how Terraform manages destructive changes by default. With proper health checks on the servers, we can ensure that Terraform does not begin the destruction of the old servers until the new servers are ready to serve traffic. That, combined with the repeatability of the configuration, means we can create a testing environment based on those images, verify our code runs as expected on the new images, verify our code meets performance expectations on the new images, and roll out the new servers to production, all without users ever noticing.
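A sketch of how those pieces fit together, reusing the aws_ami lookup and ELB names from the earlier sketches; the sizes are illustrative. Interpolating the launch configuration’s name into the autoscaling group’s name forces a replacement group, and min_elb_capacity is one way to make Terraform wait for healthy, in-service instances before tearing down the old group.

# A new base AMI yields a new launch configuration, which yields a new
# autoscaling group, created before the old one is destroyed.
resource "aws_launch_configuration" "app" {
  name_prefix = "app-"
  image_id = "${data.aws_ami.windows-app.id}"
  instance_type = "m4.large"

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_autoscaling_group" "app" {
  name = "app-${aws_launch_configuration.app.name}"
  launch_configuration = "${aws_launch_configuration.app.name}"
  min_size = 2
  max_size = 4
  vpc_zone_identifier = ["${aws_subnet.dev-private.*.id}"]
  load_balancers = ["${aws_elb.app.name}"]
  min_elb_capacity = 2

  lifecycle {
    create_before_destroy = true
  }
}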

From the perspective of Terraform and AWS, managing Windows servers does not differ all that much from managing Linux servers. The biggest difference might be which ports have to be open to allow remote access to the servers. We still define servers in terms of number and type. We still host the servers behind load balancers. The differences come in the nuances of the operating system: updates, taking snapshots, running anti-virus software, cost, coordinating NTP, and provisioning times. Because we build our own base server images, on which we configure the software and tools our applications need, it takes up to three times longer for a new Windows server to become available than for a Linux server. The cost is about two times that of running Linux servers.

For that reason, we are exploring a transition to .NET Core running on Linux. (More on that in a separate blog post.) More than anything, it is our use of Terraform that has given us the confidence to transition one of our applications to running on Linux. Making and testing that change to our infrastructure will be repeatable. The pull request will be readable. The hardest part will be changing the actual application code, not standing up the infrastructure to run it.

A Gift and A Curse

Even with all of these benefits, Terraform does have its downsides. As a piece of software, it is still going through a lot of change. As of this writing, it is at version 0.8 with version 0.9 in beta. When we started using it less than a year ago, it was at version 0.6. It has its bugs, and each new version has brought its breaking changes. All of that leaves us cursing and praising Terraform, sometimes in the same sentence.

For all the things at which Terraform excels, there are a few things we find ourselves having to work around. For example, the following code creates DNS entries for the items in the databases list. Notice that the list is in alphabetical order. For readability, it sure would be nice to keep that list in alphabetical order, but we found that it is less painful to simply add new items to the end of the list. This is because of the way Terraform tracks resources created in a loop with the count parameter.

variable "databases" {
  default = [
    "database-a",
    "database-c",
    "database-e",
  ]
}

resource "aws_route53_record" "databases" {
  zone_id = "zone-id"
  count = "${length(var.databases)}"
  name = "${var.plumbing-environment}-${element(var.databases, count.index)}db"
  type = "CNAME"
  ttl = "30"
  records = [
    # don't put the endpoint's port in the DNS record
    "${element(split(":", "${aws_db_instance.sql-server.endpoint}"), 0)}"
  ]
}

Once this code runs, the Terraform state file will show that it created three aws_route53_record resources, tracked as aws_route53_record.databases.0, aws_route53_record.databases.1, and aws_route53_record.databases.2. If we add database-b in alphabetical order, Terraform will destroy aws_route53_record.databases.1 and aws_route53_record.databases.2, recreate them with new names, and create aws_route53_record.databases.3 with the value database-e. One reason Terraform destroys and creates is that the name of the given record changes, and the AWS API does not allow us to rename an existing record. The other reason is that Terraform does not realize the items in the list are simply shifting position; it does not compare items in the list. The change logically looks like the following.

Before Terraform Apply:
  aws_route53_record.databases.0 -> database-a
  aws_route53_record.databases.1 -> database-c
  aws_route53_record.databases.2 -> database-e

After Terraform Apply:
  aws_route53_record.databases.0 -> database-a
  aws_route53_record.databases.1 -> database-b
  aws_route53_record.databases.2 -> database-c
  aws_route53_record.databases.3 -> database-e

All that seems fine, right? We should end up with the proper DNS entries once it is all done, even if Terraform went through more steps than we would expect. Unfortunately, Terraform applies changes in parallel. The aws_route53_record.databases.3 record for database-e is created at the same time that the existing aws_route53_record.databases.2 is being deleted. The calls to AWS are simply “create a record with this name” and “delete a record with this name”. When deleting the record, AWS looks through the records it has and deletes any and all records with the specified name. AWS knows nothing about what Terraform is trying to accomplish in terms of the final state; Terraform simply makes the API calls based on what it knows has to change. It completes its work without complaining, but when we look at the state of AWS, we are very confused as to why the entry for database-e is nowhere to be found.

It’s little things like this that have come back to bite us. Some of them stem from how Terraform manages state as a graph of dependencies. Others stem from the eventual consistency of the AWS API. Thankfully, Terraform has gotten better at managing the consistency problem. The worst issues come from failed Terraform runs, which can happen if we configure Terraform to do something that is perfectly legal from a Terraform perspective but that AWS does not allow.

For example, Terraform allows us to define a launch configuration with an instance type of t2.medium and a spot price of $0.03 an hour. Terraform does not complain and happily creates an autoscaling group with this launch configuration. The apply succeeds, and we can go to the AWS console to watch the autoscaling group try its hardest to create a t2.medium at a spot price of $0.03. Unfortunately, it will never succeed, because AWS does not allow spot pricing for the t2 family. In other words, Terraform is not a replacement for understanding the rules of the cloud infrastructure we are using.
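A minimal sketch of that trap, with a hypothetical ami_id variable; both Terraform and the AWS API accept it, but the spot request can never be fulfilled.

# Valid Terraform, invalid in practice: t2 instances cannot be purchased
# on the spot market, so the autoscaling group will never launch a server.
resource "aws_launch_configuration" "doomed" {
  name_prefix = "doomed-"
  image_id = "${var.ami_id}"
  instance_type = "t2.medium"
  spot_price = "0.03"
}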

Where to Next

We are just getting started. Terraform has more to offer than just provisioning AWS resources. We have already started using it to define monitors in Datadog. Terraform has support for Docker, PostgreSQL, Bitbucket, GitHub, PagerDuty, and RabbitMQ. If we had to switch to Microsoft Azure or Google Cloud Platform, Terraform supports those too. The list goes on and continues to grow. In an ideal world, one we hope to reach, everything we need, from code repositories to monitors and alerts, is created via Terraform. That is the true disaster recovery plan: configure some variables for the accounts we own and where we store our backups, click a button, point Terraform in the right direction, and watch our resources come to life.
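To give a flavor of those non-AWS providers, here is roughly what a Datadog monitor defined in Terraform looks like; the query, threshold, and message are purely illustrative.

# An illustrative Datadog metric alert managed by Terraform.
resource "datadog_monitor" "app-cpu" {
  name = "app servers: high CPU"
  type = "metric alert"
  query = "avg(last_5m):avg:aws.ec2.cpuutilization{role:app} > 90"
  message = "CPU is running hot on the app servers. Check recent deploys."
}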

That will take time. If we were starting from scratch, it would certainly be easier. For now, we will make incremental progress toward the promised land. Hopefully, Terraform doesn’t leave us stranded, but with over sixty branches, more than eight hundred contributors, and hundreds of commits per month, it should be around for a while. As for AWS, I think it’s safe to assume they have a pretty long life ahead of them. We think Rally Health does too, and that’s why we chose Terraform and AWS.