This blog intends to capture Rally Engineering’s journey from a single monolithic web site to a SOA platform with 100+ microservices. During that time, we scaled our platform from 10K registered users to 17.5MM, and our WAU (Weekly Active Users) from 0.5K to 1.2MM. Meanwhile, the engineering team grew from 20 engineers to 300. We hope the community will benefit from our learnings and challenges. We definitely have not solved all the problems, and we are still actively investing in the effort. Our story is one way to reach the destination.
Also, let me make it clear that if you are not growing fast, some of the below will not apply to you. If your business is stable and your current app does the job, there is no reason to change it. I would not move to SOA as an experiment if there is no clear business and technical value.
The story - our Odyssey to SOA
The Rally Odyssey
Daunting reality - the initial landscape
Back in the day, the initial architecture of the Rally Engage product was a monolithic web site. It ended up there after a few transformations over a period of 2-3 years, and contained pieces of functionality and components that were re-used as initially written in order to deliver for the business as quickly as possible. Everything was optimized for speed and independence between teams. To give a high-level idea, here are some key characteristics:
- The site had a classic 3-tiered web architecture: the client talked to an internet-facing service layer, which in turn called a single service guarding the database
- Connections between the client and the web tier were sticky
- There was one database for everything, so any badly performing feature could potentially bring down the whole site
- Releases required shutting down the whole site because the code could not handle transient failures; handling of partial failures, retries, etc. was limited or non-existent
- The ops infrastructure was 100% manual; releases required five people working for 2+ hours in the middle of the night to get the code out
- Release coordination was required: we had to test all changes in our integration environment across all teams, which took ~3 days before a release could be certified
- Instrumentation/monitoring/alerting was minimal; in fact, we went live with our first customer with practically no visibility into the user experience from an engineering perspective, beyond basic health metrics at the VM instance level (e.g. CPU, memory utilization, network bandwidth)
The list is actually long, but the takeaway here is the level of manual intervention, inflexibility, interdependency, and inability to scale quickly.
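To make the transient-failure point concrete: even the most basic safety net, a retry with exponential backoff around a flaky remote call, was largely absent. A minimal sketch in Python of what was missing (the names here are illustrative, not from our codebase):

```python
import time


class TransientError(Exception):
    """Marker for failures worth retrying (timeouts, dropped connections)."""


def with_retries(operation, max_attempts=3, base_delay=0.1):
    """Call `operation`, retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: surface the failure to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Without this kind of wrapper, any partial failure during a deploy rippled up to the user, which is why the only safe release procedure was taking the whole site down.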
Business forces unleashed
The pressure from the business side increased dramatically. We had to scale the platform ~20x in a period of 9 months and ~40x in the following 3 months. Things definitely looked challenging! Our SLA guarantees required us to keep things running while re-architecting our platform.
The North Star
At that point, we had to take a step back and quantify how the ideal world would look for us. The key tenets we came up with are not surprising:
- Zero-downtime, push-button releases
- Releases of incremental features without coordination
- Maintain team efficiency and parallel development of features
- Push-button environment management
- Do the above while scaling our user base, with 99.9% availability
We came to the decision pretty quickly that we had to get to a microservice architecture and break things off over time; our monolith was increasingly slow to build and difficult to release. We also recognized that it was important to put a stake in the ground and develop new products in a SOA manner to stop the bleeding and avoid accruing more debt. In the initial phase, we took 3 parallel steps that were almost completely independent of each other.
We picked a new product and architected it in the “way of the future”, following SOA principles. We ensured that the new set of services would run with no overlap with our legacy platform. The new ecosystem of services abided by all the tenets outlined in the previous section. Think of it as 90% new vs. 10% old. It was risky, and to mitigate the risk we put our best engineers (a very small team initially) on it to get it right.
In the legacy stack, we started “chipping” off the rock by separating the most important component, our account system. That team was also staffed for success, with folks who had knowledge of the entire legacy code base.
In the legacy product, we decided to deprecate the things that would not allow us to scale. We had to meet the business requirements, so we could not leave the codebase in maintenance mode. We removed the sticky-connection architecture and created separate databases for features with similar traffic patterns. Last but not least, we deprecated code and features that were no longer in use.
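Conceptually, the database split amounted to routing each feature’s data access to its own store rather than the single shared database, so a runaway feature could only hurt its own data tier. A hypothetical sketch of that routing (feature names and connection strings are made up for illustration):

```python
# Hypothetical mapping from feature to its own database, replacing the single
# shared database. Features with similar traffic patterns share a store.
FEATURE_DATABASES = {
    "accounts":  "postgres://accounts-db/main",   # steady read-heavy traffic
    "messaging": "postgres://messaging-db/main",  # bursty writes
    "reporting": "postgres://reporting-db/main",  # batch/analytical load
}


def database_for(feature):
    """Resolve the connection string for a feature; fail fast on unknown features."""
    try:
        return FEATURE_DATABASES[feature]
    except KeyError:
        raise ValueError(f"no database registered for feature: {feature}")
```

The payoff is isolation: a slow reporting query no longer contends for the same database as account logins.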
From a project management perspective, we set clear target dates where all efforts would be in production.
For the new product, things went pretty well in the sense that there were minimal dependencies on the old world, so it was like building something new with all the previous knowledge. The team adopted best practices based on past learnings, picked new tech to bet on, and moved fast. In fact, they built a product that went live in 6 months with 4 people. The new product scaled better than our legacy system, and it was cheaper and more reliable. It was a real-world test for our new technology choices.
Separating our account system into a stand-alone service posed different challenges. The team kept ~70% of the old technologies and changed ~30%. We had to swap from the old system to the new one without downtime, so we used “dark mode” as a technique: the new and old systems ran live in production side by side. When a request came in, the code would hit both systems; the legacy one remained the source of truth, and behind the scenes we had code that validated that the two systems were consistent. Once we saw no errors or data inconsistencies for an extended period of time, we switched all the code to use the new account system and removed the legacy code and services.
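The dark-mode pattern above can be sketched as a thin wrapper: serve every request from the legacy system, mirror the call to the new one, and record any divergence for later investigation. A simplified Python illustration (the interfaces and names are hypothetical, not our actual account API):

```python
import logging

log = logging.getLogger("dark_mode")


def get_account(account_id, legacy, new, mismatches):
    """Read from both systems; the legacy system stays the source of truth.

    Any divergence is recorded in `mismatches` so the new system can be
    fixed before cutover. `legacy` and `new` are any objects with a
    dict-like .get(key) method.
    """
    legacy_result = legacy.get(account_id)
    try:
        new_result = new.get(account_id)
        if new_result != legacy_result:
            mismatches.append((account_id, legacy_result, new_result))
            log.warning("dark-mode mismatch for account %s", account_id)
    except Exception:
        # A failure in the shadow system must never affect the user.
        log.exception("dark-mode call failed for account %s", account_id)
    return legacy_result  # callers only ever see the legacy answer
```

The key properties are the ones described above: users are never exposed to the new system’s answers or failures, and the mismatch log tells you when it is safe to cut over.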
As we emerged from this phase, we documented and set the bar for all new work to follow microservices best practices across all of engineering. We got product agreement that new work had to be done the right way, so everyone was on the same page with regards to implementation cost and what quality means.
The next round of changes focused on (A) the devops infrastructure and (B) breaking the monolithic app into a manageable set of microservices.
On the devops infrastructure side, we took a step back and came up with a design/architecture that would transform our devops into a PaaS offering. We defined clear requirements for CI, deployment, version management, and secret management, stemming from our past learnings and industry best practices. Effectively, some of us played the product-manager role and treated devops as a product. The development effort was split conceptually into two phases. In the early phase, development was done in isolation from the product teams. The second phase started when the feature set was mature enough: we created a strike team with talent from devops and from the team that would consume the new platform, and we migrated the legacy ops scripts to the new PaaS product. Lastly, we ensured that all new services would use the new platform for their devops needs.
For the second work stream in this phase, we partnered closely with product. The key was to include the architectural work of separating the site’s functionality into independent microservices as part of the feature roadmap. To achieve that, we were clear about what the benefits would be (better performance, no downtime, and faster releases, for example), and thanks to our growth, in most cases the discussion was easy and led to quick agreement.
When re-architecting a component, it is critical to help the team step back, with clear technical requirements and a business deliverable as a forcing function to get to production. Without these three elements, it is very easy to go on a philosophical exploration during the design phase that leads nowhere. Developers will at times propose changes based on what they want to learn, what is new in the industry, or what they heard is “cool”; it is important to have senior technical leadership alignment, otherwise you will waste time and most likely have to rewrite the component yet again.
At this point, we had laid the foundation not only to transform the legacy code base but also to implement our future services in a decentralized way. In the next blog, we will explore the challenges we faced as we transitioned to SOA, magnified by the fact that we also grew in size, in both the number of engineers and the lines of code and products.
Stay tuned for part two!