The previous installment in this series covered anti-pattern solutions for updating data across two or more databases. This second installment discusses workable solutions to this difficult and common problem. The problem is important to solve correctly because data is an organization's most valuable asset.

Reliable messaging

Reliable messaging is the foundational concept for eliminating the synchronous distributed transactions discussed in the previous installment. Reliable messaging is one way a client service can ask a source service to modify data the source service owns. Reliable messaging can also be used by a source service to signal a downstream service that data has changed so the downstream service can act accordingly.

Reliable messaging can be implemented over HTTP or with messaging middleware. It is a common application framework feature, and custom code can be added to make HTTP webhooks reliable. Even a cron job that retries failed requests can be a quick way to wire up reliable message delivery between two services.
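
As a minimal sketch of that last, cron-style approach (the outbox table, columns, and endpoint URLs are assumptions made for illustration, not prescribed by any framework), a sending service can write each outgoing message to local storage in the same transaction as its own data change, and a periodic job can deliver whatever has not yet been acknowledged:

```python
"""Hypothetical cron job that retries undelivered webhook messages.

Assumes an "outbox" table (id, url, payload, delivered) that the sending
service writes to in the same local transaction as its own data change.
"""
import sqlite3
import urllib.request


def deliver_pending(db_path: str = "outbox.db") -> None:
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT id, url, payload FROM outbox WHERE delivered = 0"
    ).fetchall()
    for row_id, url, payload in rows:
        request = urllib.request.Request(
            url,
            data=payload.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(request, timeout=10) as response:
                if 200 <= response.status < 300:
                    # Only mark the row delivered once the recipient acknowledged it.
                    conn.execute(
                        "UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,)
                    )
                    conn.commit()
        except OSError:
            # Leave the row in place; the next run of the job will retry it.
            pass
    conn.close()


if __name__ == "__main__":
    deliver_pending()
```

The important property is that a message is not marked delivered until the receiving end responds successfully, so a crash or network failure simply leaves it to be retried on the next run.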

There are many ways to implement reliable messaging. A developer or architect will choose the mechanism based on three primary criteria (in priority order):

  1. Requirement trade-offs: What are the trade-offs and penalties for different failure scenarios between the different services?
  2. Service capabilities: What are the capabilities and data structures of the recipient service?
  3. Existing resources: What are the infrastructure and development tools available to the organization?

A developer or architect should first determine the penalties and trade-offs if data are not logically consistent across the different services. The developer should then investigate the penalties and trade-offs if the delivery of a message fails or is delayed. Contractual obligations dictating service-level agreements are powerful requirements drivers. Failing to account for them in the early phases of a system's design can force expensive architecture redesigns after the organization has paid monetary damages and lost credibility with a client. User-interface designs that promise immediate data updates frustrate end users when the underlying service makes no such guarantee of immediacy. The option to serve stale data, along with a decision about how stale that data can acceptably be, also keeps a system's uptime from being entirely dependent on the uptime of a downstream service.
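
That acceptable-staleness decision can be written down in code rather than left implicit. The following is a hypothetical sketch; the fifteen-minute window, the cache shape, and the fetch_live callable are assumptions used only for illustration:

```python
"""Hypothetical fallback that serves cached data when a dependency is down."""
import time
from dataclasses import dataclass
from typing import Callable, Optional

# Agreed with the business: how stale served data may acceptably be.
MAX_STALENESS_SECONDS = 15 * 60


@dataclass
class CacheEntry:
    value: dict
    fetched_at: float  # seconds since the epoch


def get_activity(fetch_live: Callable[[], dict],
                 cached: Optional[CacheEntry]) -> dict:
    try:
        return fetch_live()
    except ConnectionError:
        # The dependent service is unreachable; serve cached data if it is
        # within the agreed staleness window, otherwise surface the failure.
        if cached and time.time() - cached.fetched_at < MAX_STALENESS_SECONDS:
            return cached.value
        raise
```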

After ascertaining the penalties and trade-offs, a developer can begin to investigate the capabilities and limitations of the recipient service. A developer should be trying to answer questions such as (but not limited to):

  1. Typical use: What is the preferred way to submit requests to the service? Does the caller’s technology stack have good support for submitting requests to the service?
  2. Idempotency: Are requests idempotent? What is the behavior if a caller sends the exact same request to a service twice? (See the sketch after this list.)
  3. Message IDs: Are there uniqueness criteria between requests? If so, does the caller generate the unique ID for a request or will the service provide a unique ID?
  4. Message status: Does the caller have to keep track of the transaction status or is there a way to request the status of any pending request?
  5. Messaging semantics: If the caller suspects that the service did not receive the request, should the caller send the message again? Callers should investigate whether messages passed to the service should follow at-most-once or at-least-once semantics (because exactly-once semantics are impossible).
  6. Sync vs async: Is the service synchronous or asynchronous? If a caller submits a request, is that request immediately fulfilled or does it transition to a “pending” status? For a good discussion of whether to deliver messages over HTTP or a message bus, see this post from Jeremy Miller.
  7. Subscription options: Is the caller required to subscribe to callback-style updates to the request? If not required, are these updates available as options?
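
Several of these questions (idempotency, message IDs, and messaging semantics) come together in how a receiving service deduplicates requests. The sketch referenced from question 2 is below; it is hypothetical, and the table and column names are assumptions. The caller generates the message ID, delivery is at-least-once, and the handler records the ID and applies the change in one local transaction so a redelivered request has no additional effect:

```python
"""Hypothetical receiving-service handler that deduplicates by message ID."""
import sqlite3


def handle_step_update(conn: sqlite3.Connection, message_id: str,
                       user_id: str, steps: int) -> str:
    try:
        # One local transaction: record the message ID and apply its effect.
        with conn:
            conn.execute(
                "INSERT INTO processed_messages (message_id) VALUES (?)",
                (message_id,),
            )
            conn.execute(
                "UPDATE activity SET steps = steps + ? WHERE user_id = ?",
                (steps, user_id),
            )
        return "applied"
    except sqlite3.IntegrityError:
        # message_id is the table's primary key; a duplicate insert means this
        # request was already processed, so acknowledge without reapplying it.
        return "duplicate"
```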

Armed with requirements and expectations, a developer can start architecting a solution given the constraints and resources available.

Messages versus commands versus events

A “message” in a networked computing system is data delivered as a single unit from one service to another. Messages are an abstraction over lower-level bytes and protocols that most developers don't have to think about day to day. When working with messages between services, it's helpful to categorize those messages as either “commands” or “events”.

Commands are messages that a caller sends when it wants something to happen or something to change. Commands should only be handled by a single bounded context, and they can be rejected by the receiving service. When talking out loud about them, a developer will usually speak of them in the imperative: “Then the user registers for the mission”. In a larger system composed of several services, only one service should be responsible for accepting a given type of command. The responsible service has the ability to accept or reject the command: “No, the user cannot register for the mission because they failed to meet some criteria”. Callers need to respect the response of the handler service.

Events, however, are messages indicating that something happened in the past. Downstream services must account for the data in event messages because they are historical facts, not requests for change. Events are spoken about in the past tense: “The user registered for the mission on June 28th”. Event data should originate from the bounded context responsible for the data (i.e. similar to the single writer principle), but any other service should be able to account for those events in their own bounded contexts. Events can’t be rejected like commands can be, because events indicate that something has happened.
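
To make the distinction concrete, here is a minimal, hypothetical sketch of the two message shapes; the names and the accept/reject mechanism are assumptions for illustration, not part of any particular framework:

```python
"""Hypothetical command and event shapes illustrating the distinction."""
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class RegisterForMission:
    """A command: a request for change; the owning service may reject it."""
    user_id: str
    mission_id: str


@dataclass
class UserRegisteredForMission:
    """An event: a fact about the past; downstream services cannot reject it."""
    user_id: str
    mission_id: str
    registered_at: datetime


def meets_criteria(user_id: str, mission_id: str) -> bool:
    # Placeholder for whatever business rules the owning bounded context applies.
    return True


def handle_register(command: RegisterForMission) -> Optional[UserRegisteredForMission]:
    # Only the bounded context that owns registrations handles this command,
    # and it is allowed to say no.
    if not meets_criteria(command.user_id, command.mission_id):
        return None  # command rejected
    # Accepting the command produces an event other services can react to.
    return UserRegisteredForMission(
        command.user_id, command.mission_id, datetime.utcnow()
    )
```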

Here’s an example of a flow that demonstrates the give and take between services transmitting commands and events.

Sequence diagram walking through a multi-service transaction

  1. App -> ActivityAPI: The Rally app on a user’s mobile device detects that the user has walked an additional 1,000 steps today. It makes a request to the Rally Activity API indicating the steps the user has taken in the past hour. The Activity API is the service that owns the activity data for all of our users and is the only service that can ultimately accept or reject any such modifications to user activity data. The mobile app’s request to the Activity API is a command because the request can be rejected by the bounded context that owns the data.
  2. ActivityAPI -> MissionsAPI: The Activity API updates its records regarding the user's activity. Those records now show the user having walked over 12,000 steps today. The Missions API is subscribed to activity updates via a webhook subscription, so it receives an update about the user's 12,000 daily steps. The message published over the webhook is an event. The Activity API owns activity information, so it keeps track of all the user's activity and provides updates to any other interested service. The Missions API cannot reject the facts about the user's activity; it can only ignore the information if it's not pertinent to any of its own workflows.
  3. MissionsAPI -> RewardsAPI: The Missions API determines that the user is enrolled in a daily step goal mission, which the user has met for today by walking over 12,000 steps. The Missions API then makes a request to the Rewards API informing it of the step goal completion. In this case the Missions API created a derivative event from the original event (sketched in code after this list). As before, the facts that the user walked 12,000 steps today and met the criteria for the daily step goal are unarguable events that happened in the past.
  4. RewardsAPI records data: The Rewards API determines that, upon completion of the daily step mission, the user is rewarded with 10 Rally coins. The Rewards API records an update to that user's account. The balance of each Rally coin account is information that the Rewards API owns. There are no commands or events produced here. The Rewards API is the bounded context that owns coin account information, so it can simply update its own database and optionally publish any updates.
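
Here is the hypothetical sketch of step 3 referenced above: the Missions API consumes the activity event and, when the daily goal is met, produces a derivative event for the Rewards API. The function name, payload fields, goal value, and the publish callable are all assumptions made for illustration:

```python
"""Hypothetical Missions API handler for the Activity API's webhook event."""
from typing import Callable

DAILY_STEP_GOAL = 10_000  # assumed mission criterion for illustration


def on_activity_updated(event: dict, publish: Callable[[dict], None]) -> None:
    # The incoming event is a fact from the Activity API; it cannot be
    # rejected, only ignored if it is not pertinent to any mission.
    if event["daily_steps"] < DAILY_STEP_GOAL:
        return
    # Derivative event: also a fact, phrased in the past tense.
    publish({
        "type": "DailyStepGoalCompleted",
        "user_id": event["user_id"],
        "date": event["date"],
        "steps": event["daily_steps"],
    })
```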

Modeling to prepare for when things go wrong

The above example walks through a common scenario in which nothing goes wrong. Updating two or more data sources is complicated because of what can happen when something does go wrong. A small list of things that could go wrong:

  1. Mobile connectivity: The user's mobile device could be in a connectivity dead zone and unable to make the request.
  2. Expired authentication: The user's mobile device could have an expired authentication token, and the user may need to provide a username and password again.
  3. Bugs: A recent update to the Activity API could have introduced a bug which causes errors for some incoming requests.
  4. Backend connectivity: A network connectivity blip between the Missions API and the Rewards API at the time of the webhook publish could prevent the event from reaching the Rewards API.

When working within a service-oriented system, the developer needs to understand a workflow's chain of custody. The chain of custody is a recipient service's acknowledgement and guarantee that it will not lose a message. A recipient service in a chain of custody takes responsibility for completing its portion of a distributed transaction and indicating success or failure. The chain of custody is an architectural technique that allows a large software system and a large software team to scale. When this technique is used, the sending service can confidently stop spending its own resources wondering whether it should do something about the message it sent; it can be reasonably confident the rest of the workflow will continue to progress. If the transaction fails, it should be clear which service is at fault, and knowing the single service at fault means the fault can be quickly remediated (e.g. the one hand to shake or one throat to choke principle). Receiving services should not acknowledge receipt of a message until it is safely recorded or completely handed off to another service (and acknowledged). Receiving services, when taking custody of a transaction, also take on the responsibility for ensuring failed transactions are retried or escalated, if necessary.

Common types of acknowledgement are:

  1. HTTP responses
  2. RabbitMQ client/server acknowledgements
  3. Published application-level events
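
As a hedged illustration of the acknowledge-only-when-safe rule using the second option above, the sketch below uses RabbitMQ's pika client with manual acknowledgements; the queue name and the record_message function are assumptions. A crash between receiving the message and storing it leaves the message in the queue, which is exactly what the chain of custody requires:

```python
"""Hypothetical RabbitMQ consumer that only acknowledges after persisting."""
import pika


def record_message(body: bytes) -> None:
    # Placeholder: durably store the message, or hand it off to another
    # service and wait for that service's acknowledgement.
    raise NotImplementedError


def on_message(channel, method, properties, body):
    try:
        record_message(body)
    except Exception:
        # Do not acknowledge; requeue so the message is not lost.
        channel.basic_nack(delivery_tag=method.delivery_tag, requeue=True)
        return
    # Custody accepted: tell the broker it can forget the message.
    channel.basic_ack(delivery_tag=method.delivery_tag)


connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.basic_consume(queue="rewards.step-goals", on_message_callback=on_message)
channel.start_consuming()
```

In a real system a message that fails repeatedly would be routed to a dead-letter queue for escalation rather than requeued forever.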

If a service is party to a complex transaction with a chain of custody involving other services, then the only thing left for the service to keep track of is the proper state of any transactions that aren't complete. Bugs will occur if transaction states are missing or not properly accounted for. As an example, consider the user interface for making a retail stock trade. The retail trading user interface will probably be a simple interface with no more than an option to “buy” or “sell” a particular quantity of stock shares. But when the user clicks “submit” on their trade, that order will then transition through some of the fourteen different possible order states trading systems use to keep track of a single trade order, such as “pending new”, “partially filled”, and “rejected”.
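
As a small, hypothetical illustration of accounting for every state, a service can model a pending transaction's lifecycle explicitly and reject any transition it never expects. The state names below borrow from the trading example, but the transition table itself is an assumption, not a real exchange's rules:

```python
"""Hypothetical explicit state model for a pending transaction."""
from enum import Enum


class OrderState(Enum):
    PENDING_NEW = "pending new"
    NEW = "new"
    PARTIALLY_FILLED = "partially filled"
    FILLED = "filled"
    REJECTED = "rejected"


# Transitions the service considers legal; anything else is a bug, i.e. a
# transaction state that was missing or not properly accounted for.
ALLOWED_TRANSITIONS = {
    OrderState.PENDING_NEW: {OrderState.NEW, OrderState.REJECTED},
    OrderState.NEW: {OrderState.PARTIALLY_FILLED, OrderState.FILLED, OrderState.REJECTED},
    OrderState.PARTIALLY_FILLED: {OrderState.PARTIALLY_FILLED, OrderState.FILLED},
    OrderState.FILLED: set(),
    OrderState.REJECTED: set(),
}


def transition(current: OrderState, requested: OrderState) -> OrderState:
    if requested not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {requested.value}")
    return requested
```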

Do I really need to do this all the time in my service-oriented system?

Given the above process for thoroughly updating two or more databases, a developer might ask whether all this design and up-front thinking is necessary. One of the earliest parts of this article asks the developer to determine the penalties and trade-offs for when the data are not correctly updated. The answer to that question drives how much time a developer should invest in this exercise. If losing a small amount of trivial or ephemeral data is acceptable, then quickly skimming over this process is likely acceptable. If substantial money or market credibility is at stake, then the data need to be correct, and a process to detect and rectify incorrect data (i.e. reconciliation) needs to be present as well.
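
A reconciliation process can start as simply as periodically comparing two systems' views of the same records and flagging every mismatch for repair; the sketch below is hypothetical, and the record shapes are assumptions:

```python
"""Hypothetical reconciliation pass comparing two services' views of balances."""
from typing import Dict, List, Tuple


def reconcile(rewards_balances: Dict[str, int],
              ledger_balances: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Return (user_id, rewards_value, ledger_value) for every mismatch."""
    mismatches = []
    for user_id in rewards_balances.keys() | ledger_balances.keys():
        rewards_value = rewards_balances.get(user_id, 0)
        ledger_value = ledger_balances.get(user_id, 0)
        if rewards_value != ledger_value:
            mismatches.append((user_id, rewards_value, ledger_value))
    return mismatches
```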

Developers should understand the difference between “essential” complexity and “accidental” complexity. In our example, essential complexity is the code and data storage required to track user activity and reward users for completing missions. This can be complex due to requirements from clients, edge cases, the sophistication of our product, the demands of the market, and so on. An example of accidental complexity is the tax on an organization's delivery speed when it prematurely moves from a monolithic code base to a service-oriented system. The move isn't to be taken on a whim. If the organization makes the leap too soon, the resulting complexity is accidental because it isn't required by the product or the domain.
