Back in 2012, when we were building the first features of Rally Engage, we decided to use Lift as the underlying web framework. At the time, Lift was the most popular framework for Scala and provided many useful features out of the box. As we scaled to more users, more developers, and more services, Lift quickly became a performance bottleneck that required replacement. In 2014, we decided to migrate from Lift to Play and, as of May 1, 2018, Rally is no longer running any Lift-based applications in production. This blog post will discuss how we made this transition and the benefits we’ve seen.
In Engage, Lift was used in two layers of the stack: at the “middle” layer (where a service called CareLift served up dynamic templates that made up the front-end of the application) and at the back-end application layer (where a REST service managed the business logic and database communication). We started the “Lift Off” process at the CareLift layer, since that’s where we hoped to gain quick wins. Moving the backend API off of Lift took considerably longer since it involved splitting out each feature into its own microservice.
One big challenge for this project was to make this transition without any downtime. We had strict SLAs to meet and we wanted to make this transition as seamless as possible. We had several thousand users on the site at all points of the day and we had to support every feature of the site while simultaneously changing the internal web framework. The analogy of swapping out an airplane’s engine mid-flight is apt for this effort. Read on to see how we accomplished it!
We began the process of moving our user-facing CareLift application to a Play-based service, called ZenPlay. Simultaneously, we swapped out our front-end framework to use Angular, moving towards a Single Page Application (SPA) architecture. The combination of these changes shifted more of the processing towards the user’s browser. The stateless architecture of Play also reduced load on our back-end servers. The transition spanned several months, during which some features were fully ported over to Angular components with a Play middle layer while other features were still using Lift Snippets, which required server-side HTML rendering. As mentioned earlier, we needed to keep the live site fully functional while making this transition. To make this work, we used NGINX to route traffic to the correct server, based on the route path. Each feature that was ported over to Play used routes that had a prefix of “/play”. All other routes were assumed to still be served by Lift and were routed to the Tomcat Lift servers. Static assets were served by a CDN. See figure below for more details.
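The routing rule described above is simple enough to model in a few lines. The sketch below is plain Scala rather than NGINX configuration (the real rule lived in NGINX), and the upstream names `PlayServers` and `LiftServers` are hypothetical labels, not names from our infrastructure. It only illustrates the decision: anything under the “/play” prefix goes to Play, and everything else falls through to Lift.

```scala
// Hypothetical labels for the two server pools; the actual pool names are not given here.
sealed trait Upstream
case object PlayServers extends Upstream // features already ported to Play
case object LiftServers extends Upstream // everything still served by Lift on Tomcat

object PrefixRouter {
  // Plain prefix match, mirroring the rule described in the text:
  // routes starting with "/play" go to Play, all others are assumed to be Lift.
  def upstreamFor(path: String): Upstream =
    if (path.startsWith("/play")) PlayServers
    else LiftServers
}
```

A useful property of this scheme is that the default case is the old system: a feature only moves to Play once its routes are explicitly placed under the new prefix, so unported features keep working without any routing change.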
This approach had several advantages. The REST endpoints could be modularized as independent Play microservices. There was no outward-facing coupling between the UI and the Play services (since they were just serving JSON). This also meant that we only needed to maintain one set of HTML templates/files. Additionally, any features that were not using Lift’s snippets were served statically, either with the layout or dynamically with Angular, allowing them to benefit from any performance gains we got from moving away from Lift.
Stage 2 Separation
We finished up the Lift Off project towards the end of 2014. However, we were still using Lift in a stateless, RESTful backend service (simply called API) that was shared across multiple teams and features. There were several issues with this service, though none were as glaring as CareLift’s. Each endpoint was written in a synchronous, blocking manner where each request caused a thread to be blocked until completion. This frequently caused issues, especially for endpoints that called out to an unreliable third-party API. Having the middle layer running on Play while the back-end layer still ran on Lift caused confusion among developers and required separate deployment processes. The usage pattern of this API was also different from CareLift’s: the API was called by other internal and external services. While any downtime in the API would eventually result in a bad user experience, it was much harder to detect than with CareLift, which failed more loudly since the application would be unusable without front-end assets.
Around this same time, Rally made the decision to move to a Microservices architecture, which required us to break apart API into separate, independently-deployable services. Each team moved over the feature(s) they owned into new Play services. We were also in the process of moving towards running each service in containers, which required further changes to the deployment process. Since the transition work was split across many teams, some features were transitioned earlier than others.
The actual transition of the RESTful API from Lift to Play involved the following steps:
- Implement the business logic for the service with new Play framework features
a. This involved modifying the logic from the old Lift service to use Play’s asynchronous, non-blocking Actions
b. This also involved updating the serialization/deserialization to use Play’s JSON library
- Write API tests that ensure that the contract supported by the API remained unchanged
- Deploy the new service to a lower environment and divert traffic to it for integration testing
a. Downstream services were required to update their configuration (usually just an update to the hostname) to allow them to reach the new service
- Run load tests against the new service to ensure no performance regressions were introduced
- Deploy the new service to the production environment in “dark-mode”, where no traffic was actually received by the service yet
- Once we confirmed that the new service was passing all health checks and sending metrics and logs correctly, we started routing traffic to the new Play service
- After we confirmed that the new service was fully functional and that no traffic was being served by the old Lift service, we brought it down, deleted the code and never spoke of it again (till now).
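The first two steps above (rewriting blocking logic as asynchronous code, then checking that the contract did not change) can be sketched with nothing but the Scala standard library. This is an illustration under stated assumptions, not our production code: `fetchScoreBlocking`, `handleSync`, and `handleAsync` are hypothetical names, and a real Play endpoint would return a `Future` from an `Action.async` block rather than a bare `Future[String]`.

```scala
import scala.concurrent.{ExecutionContext, Future, blocking}

object AsyncMigrationSketch {
  // Hypothetical stand-in for a slow call to a third-party API.
  def fetchScoreBlocking(userId: Long): Int = {
    Thread.sleep(50)
    (userId % 100).toInt
  }

  // Old shape: the calling thread is held until the slow call returns.
  def handleSync(userId: Long): String =
    s"""{"userId":$userId,"score":${fetchScoreBlocking(userId)}}"""

  // New shape: returns a Future immediately. The slow call still blocks a thread,
  // but it is a thread from the supplied ExecutionContext, not the caller's.
  // A client that is non-blocking end to end would return a Future itself.
  def handleAsync(userId: Long)(implicit ec: ExecutionContext): Future[String] =
    Future(blocking(fetchScoreBlocking(userId))).map { score =>
      s"""{"userId":$userId,"score":$score}"""
    }
}
```

A contract test in this spirit simply asserts that both versions produce the same payload for the same input, for example by awaiting `handleAsync(42L)` and comparing it with `handleSync(42L)`.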
Independently transitioning a feature’s REST endpoints from the Lift framework to the Play framework allowed each team to make several structural improvements and clean up some tech debt items. These included:
- Modularizing certain common components, such as health checks and logging using Play Modules
- Implementing common authentication and authorization primitives using Play Action Composition
- Encapsulating common metrics gathering and header manipulation concerns using Play Filters
- Improving test coverage using Play’s ScalaTest+Play integration library
- Writing targeted, automated API tests that run as part of an external process to support a Continuous Delivery release cycle
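To give a feel for why Action Composition and Filters helped, here is a minimal model of the underlying idea in plain Scala. It does not use Play’s actual `ActionBuilder` or `Filter` types: `Request`, `Response`, `Handler`, `authenticated`, and `withRequestId` are all simplified, hypothetical stand-ins, and the header names are placeholders. The point is only that cross-cutting concerns become wrappers around a handler instead of code repeated in every endpoint.

```scala
import scala.concurrent.{ExecutionContext, Future}

object CompositionSketch {
  // Simplified stand-ins; Play's real request and result types carry far more.
  final case class Request(path: String, headers: Map[String, String])
  final case class Response(status: Int, body: String, headers: Map[String, String] = Map.empty)

  type Handler = Request => Future[Response]

  // Composition idea: wrap a handler so the auth check is written once.
  // The "Authorization" header and the isValid check are placeholders.
  def authenticated(isValid: String => Boolean)(inner: Handler): Handler =
    request =>
      request.headers.get("Authorization") match {
        case Some(token) if isValid(token) => inner(request)
        case _                             => Future.successful(Response(401, "unauthorized"))
      }

  // Filter idea: decorate every response, regardless of which handler produced it.
  def withRequestId(id: String)(inner: Handler)(implicit ec: ExecutionContext): Handler =
    request => inner(request).map(r => r.copy(headers = r.headers + ("X-Request-Id" -> id)))
}
```

With wrappers like these, an endpoint is assembled as `withRequestId(id)(authenticated(check)(handler))`, and each concern can be tested on its own.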
Early on, we saw signs of trouble. CareLift was built as a stateful, monolithic service that used Lift’s Snippets for rendering views. It also managed a user’s session in memory - each time the user logged in, their credentials were saved in memory. All subsequent requests from this user went to the same server. This meant we did not have to re-validate the user on every request, but it limited our ability to scale up this service to support more users. At the time, CareLift was only able to support 40 users per node, a slow memory leak in production required regular restarts of the service, and our debugging efforts were mostly fruitless. We decided it was time to evaluate alternate web frameworks and settled on the Play Framework. We immediately realized three important gains:
Smoother performance measurements and fewer spikes in slowness and unreliability
The graph below shows the load average on the front-end servers before and after CareLift was decommissioned on production (around 11/27). The flat curve shows the increase in stability and predictability that was achieved with the Lift Off project.
Memory concerns alleviated
One of the biggest concerns with CareLift was memory usage. The graph below shows the amount of memory consumed by both CareLift (blue) and ZenPlay (green). With CareLift, we saw memory usage grow until we proactively restarted the servers. While ZenPlay did require more memory, it provided a more stable memory usage curve, which made capacity planning easier and out-of-memory issues less likely to occur.
Faster response times
In comparing round trip times (RTT) between CareLift and ZenPlay, we can see a sharp drop-off after Lift Off. Most of this can be attributed to removing snippet/template rendering from CareLift and moving to a SPA. With more functionality off-loaded to the client browser, the servers have less work to do and response times are consistently fast.
The project to move from Lift to Play was a major architectural change that impacted every feature of Engage and required significant time and resources to implement. It was an important step in helping Engage scale, both in the number of users and developers. Amidst the transition, the application continued to function without any major downtime or user impact. At the time of this writing, more than 20 Play services and more than 20 developers are supporting the Engage product. We started out with a messy codebase that was frequently maligned as “Legacy” and we Lift-ed ourselves out and towards a much better Play-ce both architecturally and organizationally!