Part 2

New world, new problems

Like any change in nature, social structures, or systems, the new SOA world has brought us a new set of challenges that we continue to work through as we evolve. From an engineering perspective, the metrics show that we are in a better place, since they all moved in the right direction. However, SOA does not come pain-free, and it requires continuous investment in order to maintain developer efficiency, agility, and speed without compromising quality. Here is our experience.

Technology challenges

Repo hell

As we started breaking the code into microservices, we ended up with 300 repos in GitHub. That created productivity challenges ranging from “how do I find all the places in the code base where a function I am changing is used?” to “who owns each repo?”. The biggest problem was the latter. Without clear ownership of a repo, you have no accountability. Engineers will evolve the code without necessarily paying attention to unit tests, and your APIs will evolve without strategy and morph into a labyrinth of overloaded calls. It becomes impossible to decide what to deprecate and what to maintain. We do have owners for our repos, but we are still struggling with evolving our shared code.
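To make the first problem concrete, here is a hedged sketch of the kind of tooling that helps: a script that scans every checked-out repo under one directory for usages of an identifier you plan to change. The directory layout and the `find_usages` helper are invented for this illustration, not part of our actual tooling.

```python
import os
import tempfile

def find_usages(checkouts_root, identifier, extensions=(".scala", ".java", ".py")):
    """Walk every repo checked out under checkouts_root and report
    the files that mention the given identifier."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(checkouts_root):
        for name in filenames:
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="ignore") as f:
                if identifier in f.read():
                    hits.append(os.path.relpath(path, checkouts_root))
    return sorted(hits)

# Tiny self-contained demo: two fake repos, one of which calls the
# function we are about to change.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "repo-a"))
    os.makedirs(os.path.join(root, "repo-b"))
    with open(os.path.join(root, "repo-a", "Service.scala"), "w") as f:
        f.write("val token = legacyAuth(user)")
    with open(os.path.join(root, "repo-b", "Other.scala"), "w") as f:
        f.write("val x = somethingElse()")
    print(find_usages(root, "legacyAuth"))  # only repo-a shows up
```

With 300 repos, even a crude text scan like this beats opening repos one by one; a real setup would lean on indexed code search instead.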

Code sharing - libraries

In a monolithic app, things are easy. Changes in shared code can easily be fixed everywhere. With 300 repos, some of which contain shared code, this quickly becomes a huge challenge due to entropy. Circular dependencies and evictions of the desired version of a library were some of the problems that hit us daily. Making changes across the code base was hurting productivity a lot. We have improved the biggest pain points with version pinning, communication of breaking changes, and build improvements for automatic version management. We are still battling issues around code sharing today, though, and we have a long way to go.
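Our shared libraries live in the Scala/sbt world, but the pinning idea translates to any ecosystem. As a hedged illustration (the data shape and the `find_conflicts` helper are invented for this sketch), a small check can flag shared libraries that different repos pin to different versions, which is exactly where eviction surprises come from:

```python
def find_conflicts(pins_by_repo):
    """pins_by_repo maps repo name -> {library: pinned_version}.
    Returns the libraries pinned to more than one version across repos,
    along with which repos pin which version."""
    versions = {}  # library -> {version: [repos]}
    for repo, pins in pins_by_repo.items():
        for lib, ver in pins.items():
            versions.setdefault(lib, {}).setdefault(ver, []).append(repo)
    return {lib: vers for lib, vers in versions.items() if len(vers) > 1}

# Made-up pins for two hypothetical services.
pins = {
    "billing-service": {"shared-auth": "2.1.0", "shared-json": "1.4.0"},
    "profile-service": {"shared-auth": "1.9.3", "shared-json": "1.4.0"},
}
print(find_conflicts(pins))
# shared-auth is pinned to two different versions; shared-json is consistent
```

Running a report like this in CI turns “we found out in production” into “the build told us”.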

Integration environment squatting

With around 30 microservices trying to go to production at any given time, getting integration testing done efficiently is hard, especially when it involves external partner sign-off and we do not control when the partner needs access to the environment. We are addressing this with the ability to spin up an environment identical to production in less than 10 minutes and then spin it back down. That is part of our devops infrastructure investments, which I plan to cover in detail in another blog post.

Another direction we are exploring that looks promising (no results yet) is contract-driven testing for backend services. Contract tests can be fully automated and, most of the time, do not require validation in the “integration” environment. Assuming the rules around backwards compatibility are not broken, one could argue that, theoretically, services could always go to production without integration testing.
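A minimal sketch of what a contract test means here, with an invented provider response for illustration: the consumer declares the fields and types it depends on, and the check runs with no integration environment at all.

```python
# The consumer's contract: the fields it reads and the types it expects.
CONSUMER_CONTRACT = {"user_id": str, "active": bool, "plan": str}

def satisfies_contract(response, contract):
    """True if the provider response carries every field the consumer
    relies on, with the expected type. Extra fields are fine, because
    additions are backwards compatible."""
    return all(
        field in response and isinstance(response[field], expected_type)
        for field, expected_type in contract.items()
    )

# A canned provider response, standing in for a fixture the provider
# team verifies on their side.
provider_response = {"user_id": "u-42", "active": True, "plan": "gold", "beta": False}
print(satisfies_contract(provider_response, CONSUMER_CONTRACT))  # True
print(satisfies_contract({"user_id": "u-42"}, CONSUMER_CONTRACT))  # False: fields dropped
```

Real contract-testing tools add provider-side verification and versioned contract storage on top of this idea, but the core check is this simple.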

CI/CD breakdown

As the number of engineers grew and the number of microservices followed that growth, the CI infrastructure became a bottleneck. It is easy to imagine why this would become a problem. In our case, it was unreliable to begin with: issues with Jenkins being slow or unavailable were on the order of 3-4 per week. We addressed this by creating a team focused on improving it and by selecting CI as one of the first candidates to migrate over to the on-demand devops platform.

Service functionality duplication

When the monolith breaks into the pieces that make up a platform, teams have to move fast, and you end up with more than one service that does the same thing. Some engineering managers will not even think that this is a problem. It feels natural because “product said we have to do it”. Given that, you will end up with 3 recommenders, 2 different ways to track analytics, 2 libraries to manage secrets, and so on. As an engineering leader, you have to accept that this is normal and pick the strategic pieces on which to align the company. You cannot, and do not want to, win every battle here. Speed is important, and so is vision. Your judgment will tell you where to insist versus where to let go. Since we are all human, you will probably make some mistakes. You will definitely upset your engineers because you are stepping in. As a leader, you need to get comfortable with the fact that people will not like you all the time. That is okay!

Client libraries - to be or not to be

In a single code base, having a client library to call a component is a no-brainer. It improves productivity, abstracts the object model, and makes everyone faster. In an SOA environment, where a team owns a service, there is a philosophical debate around “do we need to write a client library or not”. Today at Rally we are in a hybrid world: for some services there is a client library, and for others there is auto-generated API documentation that the service emits. In my opinion, long-term client library maintenance does not work. The only exception would be a service doing something very sensitive (say, secret management), where there is a need to make sure everyone codes the same way to reduce the probability of an erroneous usage pattern.
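For that secret-management exception, the value of a client library is that it can force a safe usage pattern rather than hope for one. A hedged sketch, with all names invented: a thin client that audits every read, so that teams cannot skip the sensitive part even by accident.

```python
class SecretsClient:
    """Hypothetical wrapper around a secrets service. By funneling every
    read through one method, auditing cannot be skipped by callers."""

    def __init__(self, transport, audit_log):
        self._transport = transport  # callable: secret name -> value (e.g. an HTTP call)
        self._audit_log = audit_log

    def get_secret(self, name, caller):
        self._audit_log.append((caller, name))  # audited on every access, no exceptions
        value = self._transport(name)
        if value is None:
            raise KeyError(f"unknown secret: {name}")
        return value

# A dict stands in for the real transport, to keep the sketch self-contained.
store = {"db-password": "s3cret"}
audit = []
client = SecretsClient(store.get, audit)
print(client.get_secret("db-password", caller="billing-service"))
print(audit)  # [('billing-service', 'db-password')]
```

With auto-generated API docs instead, every consumer would make the raw call themselves, and the audit step becomes a convention rather than a guarantee; that is the trade-off the debate is about.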

Versioning and metering

Version everything. From build numbers to object models, APIs, and schemas: the whole nine yards. And do not only version; also track who is calling what. Tracking which endpoints on your microservice are called by which consumer has value in itself, but it also helps you “know yourself”: you can manage upgrades and estimate impact better. Metering, especially, is one of the areas we continue to work on and improve.
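A minimal sketch of the metering side, assuming each request carries a consumer identifier (how that identifier arrives, e.g. a header, is left out): count calls per (consumer, endpoint, version) so the impact of an upgrade can be read off the data instead of guessed.

```python
from collections import Counter

class CallMeter:
    """Tracks which consumer calls which endpoint, at which API version."""

    def __init__(self):
        self.calls = Counter()

    def record(self, consumer, endpoint, api_version):
        self.calls[(consumer, endpoint, api_version)] += 1

    def consumers_of(self, endpoint, api_version):
        """Who would deprecating this endpoint/version impact?"""
        return sorted({c for (c, e, v) in self.calls if e == endpoint and v == api_version})

meter = CallMeter()
meter.record("web-app", "/v1/users", "v1")
meter.record("mobile-app", "/v1/users", "v1")
meter.record("web-app", "/v2/users", "v2")
print(meter.consumers_of("/v1/users", "v1"))  # ['mobile-app', 'web-app'] are still on v1
```

In production this would live in request middleware and feed a metrics store, but the question it answers stays the same: can we retire v1 yet, and if not, who do we talk to?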

Scalability planning and testing

This goes back to the need for metering: when you want to plan your capacity for the peak traffic season, doing it without metering is a time-consuming theoretical exercise with a lot of complicated spreadsheets. In our case, we have tracked a lot of metrics since day one, so with that long historical record it was relatively simple to create models. Predicting load for shared services (think authentication or authorization) is particularly complex to model, given cross-product user flows and the new traffic patterns that emerge after every sprint. We have overcome this by manually over-provisioning and by baking the requirement for elastic expansion of the devops infrastructure into our new platform. That does not mean we do not need to close our aforementioned metering gaps.
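With long historical data, the modeling can stay simple. A hedged sketch of the kind of model we mean (all numbers are made up): fit a straight line to weekly peak request rates, project the next peak season, and then over-provision on top of the projection.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit, returning (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return slope, mean_y - slope * mean_x

# Made-up history: weekly peak requests/sec over eight weeks.
weeks = [1, 2, 3, 4, 5, 6, 7, 8]
peaks = [110, 118, 131, 139, 152, 160, 171, 181]

slope, intercept = linear_fit(weeks, peaks)
projected = slope * 12 + intercept   # projected peak at week 12
with_headroom = projected * 1.5      # manual over-provisioning buffer
print(round(projected), round(with_headroom))  # 222 333
```

Shared services are exactly where this breaks down: a straight line assumes traffic patterns stay stable, which cross-product flows violate, hence the elastic-expansion requirement in the new platform.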

Environment consistency

We had handcrafted snowflake environments for every team. This issue existed even in the monolithic world: the integration environment was different from the performance environment, and both were different from production! The new devops infrastructure solves this problem by providing an easy way to always spin up identical environments for a team and its dependencies. In this way, every team has a golden image that it offers to other teams for testing and can spin up in less than 10 minutes.

New technology creep

You can imagine that when there is separation at the service/domain level, technology choices also decouple. In your old, single-web-site world, everything was easy to align when it came to new technology. In a distributed world, teams need to move fast, so they may pick tools they like, they may pick a new DB, and so on. The key here is to enable the right new technology, not any technology. At Rally we have a body of senior engineering staff reviewing (a) new technology adoption proposals and (b) decisions to upgrade existing technologies to newer versions. In this way, the technical leadership gets a chance to ask the right questions early enough. To get there, we went through a growth phase where certain things slipped through the cracks.

It is important to enable teams to make their technology decisions quickly and to encourage adoption of new tech, while maintaining some sort of governance. Any new technology has a learning curve. If your team has expertise, the curve will be short. Most of the time, folks want to adopt based on a paper or some Google research, and they certainly cannot become masters of a technology by reading blogs or looking at code snippets. We brought the live site down several times with new technologies that we did not yet understand well enough; it took us a long time to conform to industry best practices for MongoDB in terms of index creation and migration writing. You can adjust the level of control to your risk profile as a business. The choice is really yours.

Moving to the new version of “something” across the board

Think of the Big Bang! The start of the universe! The same chemical elements we have on Earth probably exist on 99% of all planets. The same thing happens when you break one monolith into smaller pieces. In our case, we ended up with a large MongoDB footprint (arguably even for data sets for which Mongo is not ideal), and everyone uses Play for the web tier. Even today, we cannot move as fast as we would like to upgrade these across all teams. I will not offer any pieces of wisdom here, other than that we have tried a number of things and are still struggling to execute coordinated upgrades in a predictable and efficient way. Although there is a process, there is alignment between engineering and product, and there is alignment at the team level, we still have a hard time orchestrating it the right way. We run into unforeseen hidden costs and code dependencies we did not know existed, which always delay our releases. I promise to post an addendum to this blog when we solve it.

People challenges

Tech god syndrome

This issue comes with growth: in the old world, there was a small group of people who knew everything in the code base. As the services break into smaller parts, these people will want to be involved in every design discussion, because that is the way things were. These folks are an encyclopedia of experience; they are assets and bring tremendous value in terms of historical context about why something was done a certain way. They know what worked and what did not. However, having them in every single design conversation will not scale, so as a leader you have to find the best match between their passion and the business need, and help them go “deep”. Breadth is typically present in a startup; depth is lacking and is what the microservices phase requires.

Releasing control applies to you too, as the leader of the organization. The question is not “if” but “when”, and you have to use your judgement on how to best implement it and whom to rely on.

The balance between strategy and tactical execution

So imagine yourself in the middle of your SOA transformation. The teams are executing, everyone has embraced the vision, and things move in parallel. You feel good, you have a bar for what quality means for a new microservice, and things are in full swing. Suddenly, on your horizon, the dark rain clouds of business pressure show up. Your developers will make compromises, and that is the right thing to do. However, what you want to look for is that (a) they are still marching towards the agreed blueprint, and if not, there is a good reason to change it, and (b) they do not accumulate more debt than they tackle. The expression we often use for the latter is: “please, do not leave more garbage in the workshop when you finish your work than when you started.”

In order to address the above, some of the processes and systems we adopted that helped were:

  • Adopt a way of working for cross-team development. Whether you do your own thing or go all-in on Spotify-style tribes does not matter; you can tailor it to your organization. Process has to fit people, NEVER the other way around!
  • Let communication flow from the exec level all the way down to every individual contributor. Weekly status reports, monthly engineering-wide checkpoints, and publishing of engineering metrics are some techniques we use. In addition, we do Ask-Me-Anything sessions, which encourage dialog more than one-way interaction. Knowing the key directional points of the business helps everyone in their day-to-day decisions and ensures alignment.
  • Come up with a scalable architecture review process that works independently while keeping the bar at the same level for all teams. At Rally we have been constantly evolving it, and we are proud to be on the 4th iteration! Although it seems to have been working for the past six months, we understand that as we evolve and grow, we will change it again!