When you think about microservice architecture, you may picture one of those diagrams generated in the early planning phases of a project. Nice little boxes representing services, and arrows representing dependencies. Most of the boxes are attached to a limited number of arrows, and most of the arrows are unidirectional. Beautiful.

Often, these diagrams do not stay so beautiful as a project matures. “No plan of operations extends with any certainty beyond the first contact with the main hostile force.”[1] In the case of software systems, let’s just say the “hostile force” is reality.

At Rally, we actively maintain an architectural graph of all services across the company. All explicit dependencies are documented. For example, if service A uses service B’s API, then there’s an arrow pointing from A to B. Sometimes, arrows are added when new requirements dictate it. And sometimes, with great fanfare and revelry, arrows are removed.

Most of these dependencies are easy to track. When you start consuming a service’s API for the first time, the graph needs a new arrow.

What’s harder to track are the implicit dependencies that crop up over time between services, because implicit dependencies can exist in the absence of direct API calls! The magnitude of this implicit coupling is also very hard to measure, even though almost every developer on the project can point to a few places where it exists.

Undocumented Coupling

I wanted to quantify this insidious coupling. My focus was on the product I work on: Rally Connect, which follows a microservice architecture. As with many microservice systems, one of the design goals is increased cohesion through the matching of domain boundaries with service boundaries. This is related to the Single Responsibility Principle as described by Robert Martin: “each software module should have one and only one reason to change.”[2] In a different wording (also from Uncle Bob): “Gather together the things that change for the same reasons. Separate those things that change for different reasons.”

Ideally, services should be cohesive, and we shouldn’t have to change more than one service when we’re adding new functionality to the system or fixing a bug. Even in a healthy system, some updates may require changes to multiple services. But this should be the exception rather than the rule. Thus, if we can identify the services that frequently change together, perhaps we can track implicit (and potentially excessive) coupling.

This type of analysis is not novel. Michael Feathers has demonstrated a technique for analyzing source control history to plot what he calls “temporal correlation” between classes in a repository. (n.b. the similar term “temporal coupling” means something completely different.) As he describes, temporal correlation is the property arising from frequent “Shotgun Surgery”: “the code smell that you have [when] you find that adding features requires you to make changes spread across wide areas of the code base.”[3]

A similar analysis can be done across code repositories by identifying commits added for the same reason. Lucky for me, every commit affecting production code at Rally requires a ticket ID in the commit message. In theory, commits in different repositories with the same ticket ID represent changes for the same reason.[4] Shotgun surgery!

The Graphs

Connect’s 15 microservices each live in their own Git repo. By breaking out a bit of Bash-fu, Scala-fu, and JavaScript-fu, I turned the ticket IDs in our commit messages into an edge-list representation of a graph, and that graph into D3.js force-directed diagrams.
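
The original scripts aren’t reproduced here, but the extraction step boils down to listing each repo’s commit subjects and pulling out anything that looks like a ticket key. Here is a minimal Scala sketch of that idea; the ticket-ID pattern, the side-by-side repo layout, and the helper names are illustrative assumptions rather than our actual tooling.

  import scala.sys.process._
  import java.io.File

  object TicketExtraction {
    // Hypothetical ticket-key pattern; adjust to your tracker's project keys.
    private val TicketId = raw"[A-Z][A-Z0-9]+-\d+".r

    /** Ticket IDs mentioned in a repo's commit subjects, plus the repo's total commit count. */
    def ticketsFor(repoDir: File): (Set[String], Int) = {
      val subjects = Process(Seq("git", "log", "--pretty=%s"), repoDir).!!.split("\n").toSeq
      val tickets  = subjects.flatMap(TicketId.findAllIn(_)).toSet
      (tickets, subjects.size)
    }

    /** One entry per service, assuming all service repos are checked out side by side under root. */
    def byService(root: File): Map[String, (Set[String], Int)] =
      root.listFiles().filter(dir => new File(dir, ".git").isDirectory)
        .map(dir => dir.getName -> ticketsFor(dir))
        .toMap
  }

If your commit convention puts ticket IDs in the body rather than the subject line, swapping --pretty=%s for --pretty=%B scans the full message instead.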

Graph 1: Static weights

In both graphs, node size is proportional to the total number of commits to the service, serving as a proxy for its size and complexity. In the first graph, each edge weight is the absolute number of common ticket IDs between the edge’s two services. The minimum value is 1 and the maximum value is ~400. This representation favors older, bigger repositories because they have more commits overall. (Note: these graphs are live D3.js visualizations. Click and drag on a node to reposition it.)
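
Concretely, the static weights fall straight out of pairwise set intersections over the per-service ticket sets. A rough sketch, reusing the hypothetical ticketsFor/byService helpers above:

  object StaticEdges {
    /** Undirected edges (serviceA, serviceB, sharedTicketCount). */
    def edgeList(ticketsByService: Map[String, Set[String]]): Seq[(String, String, Int)] =
      ticketsByService.keys.toSeq.sorted
        .combinations(2)
        .collect { case Seq(a, b) =>
          (a, b, ticketsByService(a).intersect(ticketsByService(b)).size)
        }
        .filter { case (_, _, shared) => shared > 0 } // drop pairs with no common tickets
        .toSeq
  }

Serializing these triples as JSON gives D3.js everything it needs for the force-directed layout: one node per service (sized by its commit count) and one link per nonzero intersection.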

This graph indicates a core of three services that change together very frequently, and a set of satellite services that change together less frequently. To me, the most striking characteristic of the graph is its completeness. This really doesn’t come as much of a surprise though – we have a few libraries shared by nearly all of our services and updates to those libraries are occasionally propagated to every consuming service at the same time.

Graph 2: Dynamic weights

This graph is more interesting. In this one, each edge weight is determined by a percentage:

                               shared tickets
  weight = –––––––––––––––––––––––––––––––––––––––––––––––––––––––––––
             minimum( service A total commits, service B total commits )

A 100% weight would mean the service with fewer total commits is maximally coupled to the other service. 0% means no coupling.
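
In code this is a one-liner; the names below are illustrative, and the commit totals come from the same per-repo counts gathered earlier.

  object DynamicWeight {
    /** Percentage coupling between two services, normalized by the smaller commit total. */
    def percent(sharedTickets: Int, totalCommitsA: Int, totalCommitsB: Int): Double =
      100.0 * sharedTickets / math.min(totalCommitsA, totalCommitsB)
  }

  // For example, 40 shared tickets against commit totals of 120 and 900:
  // DynamicWeight.percent(40, 120, 900)  // ≈ 33.3, i.e. the smaller service is about a third coupled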

At first glance you can see the strong ties between some of these services. Lots of those thicker edges indicate over 50% coupling, which would seem to be a big problem! Let’s take a closer look.

A Small, Old Service

Service 11 appears the least independent – it shares over 50% of its commits with eight other services. However, you can see in the static graph that this service doesn’t have that many common tickets overall. If this were a new service, it would probably indicate we’re off to a bad start. But since it’s an older service that’s been in maintenance mode for a while, the coupling is more indicative of the many library and framework changes that have been rolled out to every service since active development on Service 11 came to an end.

What this highlights is the extra effort needed to keep this service up to date even though its business logic doesn’t change often. If possible, it may be worth rolling it into a closely related service to save some of that maintenance effort.

A Small, New Service

Service 03 is very small and has only been around for a few months, and so far we’ve managed to keep all coupling connections below 20%. Only two services rise above 5% coupling. Seems like a pretty good start.

Bigger, Active Services

The most interesting connections are between the larger services. Service 04 is one of the largest and Service 07 is mid-sized; both are under active development, and they serve very different purposes. As such, the 25% coupling connection between them is somewhat curious. A deeper analysis may reveal design decisions that haven’t evolved with the domain, resulting in concepts that cross service boundaries when they should remain in one service’s bounded context. For core services like these, excessive coupling may represent significant accidental complexity that slows down development and makes everyone’s life a little harder.

Meanwhile, the largest service (Service 05) doesn’t have any other service that’s especially coupled to it. That seems like a win.

Architecture Honesty

What a dev team does with this kind of information is up to them. There’s no such thing as a perfect architecture, and there are no silver bullets for improving one. However, having a clear and correct view of the current design is a prerequisite for good decision making. This technique of comparing changes across services is hardly the only way to analyze the health of a system’s design, but it can nudge you in the right direction.




Footnotes

  1. https://en.wikiquote.org/wiki/Helmuth_von_Moltke_the_Elder
  2. https://8thlight.com/blog/uncle-bob/2014/05/08/SingleReponsibilityPrinciple.html
  3. http://michaelfeathers.typepad.com/michael_feathers_blog/2011/09/temporal-correlation-of-class-changes.html
  4. A complication here was that some teams make extensive use of sub-tickets, meaning multiple ticket IDs per user story, and thus breaking the obvious link between commits. To manage this issue, my ticket ID gathering script used the JIRA API to replace sub-ticket IDs with their parent ticket ID.