Over the past few months, I have found myself presenting the Rally Engineering story. There was a lot of interest and curiosity about the secret sauce behind our delivery and execution, so I decided to share the high-level points that define and characterize how we design, code, test, ship, and maintain our live site health. This is one implementation and definitely not the only way to do things. It is meant to offer ideas and to spark thoughts and insights for the interested reader.
This post covers the following high-level topics:
- Development cycle at Rally
- How we use metrics at Rally
- Engineering organization and culture
Development cycle at Rally
The development cycle at Rally starts with the high-level business requirements as they flow from the sales teams or the business owners, based on industry trends and business development opportunities. The product teams look at these requirements and scenarios and map them to a set of enhancements to existing features or, on rare occasions, to a brand-new product. Next, the UX team gets involved to ensure an “awesome” user experience, working in collaboration with product and engineering. The overall plan is mapped into phases so that the new functionality can be built and “go live” incrementally. The working teams consist of representatives from all departments (e.g. Product, Project Management, UI/UX, Engineering, QA, DevOps, etc.), and they operate in sprints following Agile methodology. The product team produces a well-defined specification that encompasses all aspects of the feature. The working team then pulls stories from the backlog based on the agreed priority, which is set by product for features and by engineering for necessary debt and infrastructure changes. One of the key elements of success is the great collaboration between product/UX, engineering, and project management when it comes to the engineering investments that must happen to support the product. Engineering is an equal partner in the conversations, and there is alignment across all disciplines on the importance of quality, stability, and performance: we pay down our debt.
All code changes go through peer code review and pass through our CI system to ensure that nothing on the feature branch is broken. Once a feature is completed, the release candidate moves to the integration environment for final testing before being pushed to the production site. During this phase, we allow partners to test risky changes that require extra attention and test scenarios, and we validate existing functionality as well as new functionality. QA signs off after the full regression suite has passed. Our releases to production happen with the push of a button, they require no downtime, and they are invisible to the users of the live site. We have a true service-oriented architecture (SOA), which allows releases to go out quickly and minimizes the outage risk. Once the new code is out in the wild, we have to validate the health and success of the release. For that purpose, we use “synthetic” monitoring probes that simulate real users, we detect new errors in our log files (or a change in error velocity), and we also have a set of automated alerts per service.
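To make the "new errors or changed error velocity" check concrete, here is a minimal sketch in Python. It is an illustration only and not Rally's actual tooling: the function name `compare_error_rates`, the `Finding` record, the error signatures, and the default thresholds are all hypothetical. It assumes log lines have already been reduced to error signatures (for example, an exception class name) for a window before the release and a window after it.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Finding(NamedTuple):
    signature: str        # hypothetical error signature, e.g. an exception name
    kind: str             # "new" or "velocity"
    baseline_rate: float  # errors per minute before the release
    current_rate: float   # errors per minute after the release


def compare_error_rates(
    baseline: Iterable[str],
    current: Iterable[str],
    baseline_minutes: float,
    current_minutes: float,
    ratio_threshold: float = 2.0,  # assumed threshold, tune per service
    min_count: int = 3,            # assumed noise floor, tune per service
) -> List[Finding]:
    """Flag errors that are new, or whose per-minute rate grew by the threshold."""
    before = Counter(baseline)
    after = Counter(current)
    findings = []
    for signature, count in sorted(after.items()):
        if count < min_count:
            continue  # ignore one-off noise in the post-release window
        current_rate = count / current_minutes
        baseline_rate = before.get(signature, 0) / baseline_minutes
        if baseline_rate == 0:
            # Never seen before the release: a new error.
            findings.append(Finding(signature, "new", 0.0, current_rate))
        elif current_rate >= ratio_threshold * baseline_rate:
            # Seen before, but now arriving noticeably faster.
            findings.append(
                Finding(signature, "velocity", baseline_rate, current_rate)
            )
    return findings


if __name__ == "__main__":
    # One hour of pre-release signatures vs. ten minutes after the release.
    pre_release = ["TimeoutError"] * 6 + ["ValidationError"] * 12
    post_release = (
        ["TimeoutError"] * 9 + ["ValidationError"] * 2 + ["NullReference"] * 4
    )
    for finding in compare_error_rates(pre_release, post_release, 60, 10):
        print(finding)
```

Comparing rates per minute rather than raw counts lets a short post-release window be judged against a longer baseline. A real setup would likely also need signature normalization and per-service thresholds, which this sketch leaves out.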
We believe in a very strong live-site culture. The health of our production site falls on the shoulders of the engineers who are closest to the code. Issues in production always take precedence, and every team has dedicated support outside the sprints of the development cycle. That gives us the flexibility to address issues quickly and minimize the impact on users. Everyone on the engineering team is on live-site rotation as long as they have enough experience and knowledge. It is a dimension on the engineering career ladders, and excellence in this area is acknowledged and celebrated across the company. In the absence of live-site issues, engineers are free to work on their own projects, technical debt, tools, or anything that sparks their interest.
From an execution perspective, it is also interesting to briefly cover the scenario where execution spans the boundaries of one team or product. There are two ways to solve the problem: either execute in a vacuum and go through an integration phase, or build a team that consists of members from both teams and execute together. The first approach needs a lot of upfront coordination around scenarios, definitions of APIs, and generally setting the “interfaces” and “integration points”. It works well when a team is integrating with an external entity that has its own processes and moves in a more waterfall model. The second approach is a more modern practice and works well when you can afford to move fast. It adds incremental value and leads to higher engagement, ownership, and pride in the final result. We try hard to run cross-product projects with virtual teams, following the latter approach.
How we use metrics at Rally
We believe in metrics, and we use them a lot to make engineering decisions and set direction. Metrics are the unemotional reflection of where a team excels and where there is opportunity. Metrics open our eyes to reality and are mirrors for our blind spots. At this point, I will cross-promote a great blog post written by Sam Freund, as it definitely captures our philosophy around tracking metrics: http://engineering.rallyhealth.com/process/engineering/2017/09/01/red-metrics-ok.html
Engineering organization and culture
Rally is organized in functions (e.g. all the Product Managers roll up to a Chief Product Officer) and Engineering follows the same pattern. However, the teams are formed based on what they work on and they work as one unit. They have a sense of pride, belonging, and ownership. The engineering organization sets goals at the company level around quality, stability, execution, handling debt, etc., and the engineering managers translate these into each team’s context. There is a lot of autonomy during that mapping, and the goals become something concrete and relevant for each team rather than some high-power mandate that bears no connection to reality. The engineering managers work with product managers on building one roadmap that balances new features against engineering investments. The engineering managers are technical enough to understand the architecture of their components/services/ecosystem as well as their internal and external dependencies, and they can represent the team as needed in discussions with partners or non-technical folks in the company. Most importantly, though, among the engineering management there is solidarity and a desire to help each other when one faces a challenge. The team operates in a truly collaborative way where we put the company first, above anything else (e.g. team size). That is also reflected in compensation. Folks get rewarded or promoted because they delivered and scaled something that moved the needle, not because they increased their team size. Also from a compensation perspective, performance in previous years does not dictate future compensation.
As described, there is a lot of autonomy. That does not mean we lack alignment as an engineering organization. We achieve alignment through two mechanisms: common processes/guidelines and technical forums that span teams. More specifically, here are some of the common processes and guidelines that are adopted across the organization. Think of these as our axioms:
- Coding standards and best practices
- Code review process
- New service building guidelines
- New library building guidelines
- DevOps checklist guidelines
- New tech adoption process
In parallel, we operate the following forums. Each one consists of senior engineering folks who join based on interest and free cycles. Each forum has a focused mission/vision, concrete goals, and supporting metrics. We have found that the forums also allow engineers to voice their technical opinions, collaborate more effectively across organizational boundaries, and cross-pollinate.
- UI Engineering: Ensures we have the same UI standards across all products, and covers UI technologies and how/when they should evolve.
- DB Engineering & DevOps: Pretty self-explanatory. The main goal here is to share learnings, help each other when there are challenges with scale/performance, evaluate new technology or upgrades to the existing technology, and keep pushing our Platform as a Service DevOps infrastructure across the company.
- Architecture Review Committee: This group mainly informs the rest of Engineering about new components that are being architected and/or modified. It is an opportunity for folks to ask for help if they need it and we have a specific process that every team follows. It covers the technical aspects of the SDLC and integrates the guidelines mentioned above into the flow. It is truly a review/infoshare of peers and not a verdict from astronauts.
- Tech Leadership: This is a small group that focuses on strategic technical problems/investments, technical standards, and overall technical direction. Its output is pushed to the executing teams, either through one team or through the engineering management team, in order to get executed. Often, the output of this group causes updates to the architecture review process.
- Eng Management Leadership: This group is focused on solving management challenges across the organization. It also acts as a forum for information sharing and flow from the executive level down.
At Rally, we definitely do not have the solution to everything. All of the above is simply what has worked for us so far, and it will change as the business and the platform evolve. At a high level, stepping back, evaluating efficiency, and making the necessary modifications to maintain execution velocity is a constant exercise that everyone has to perform on a regular basis, especially in phases of hyper growth. As a leader, you always want to stay ahead of the changes so you can manage them instead of reacting to them.