Overview

In a previous post, Rally Health’s very own engineer Jeff Sherman talked about Corporate Wellness Programs and the Rally Incentives Engine. If you consider the Rally Incentives Engine to be the “brain”, then in this post I will describe the arms and the legs, the eyes and the ears that make up the rest of the Rally Incentives Ecosystem.

Rally Incentives Ecosystem

As you may recall, Corporate Wellness Programs powered by Rally enable companies to reward their employees for taking healthy actions. Taken from the previous blog post:

Examples of activities include:

  • Complete a Health Assessment Survey
  • Get a physical examination this year
  • Obtain a body fat percentage of 27
  • Connect with a nutritionist
  • Participate in a company-sponsored fitness event

Example rewards include:

  • Deposit to employee’s Health Savings Account (HSA) or Flexible Spending Account (FSA)
  • Gift Card sent to employee

In his post, Commander Sherman talked about the rules that govern the administration of the reward plans, as well as an algorithm used for reward reversals. But how does user activity data make it to the Rally Incentives Engine, and once a decision is made to pay, how does Rally deliver the money to the user?

Fig. 1 - Rally Incentives Ecosystem

The process of getting data into and out of the Rally Incentives Engine poses some interesting challenges, which we have addressed through the Rally Incentives Ecosystem. The Rally Incentives Ecosystem consists of dozens of microservices, each with a specific role in processing an incentives event for the user. At the center of it all is the Rewards Event Processor, which is a daemon that listens on a message bus for events that come from internal sources (e.g. Rally Health Survey) and external sources (e.g. claims). The Rewards Event Processor performs preprocessing on each event, resolves the user’s identity and eligibility through the Rally Eligibility Service, and submits the event to the Rally Incentives Engine for processing. Should the incentives engine determine that a payment should be made, the Rewards Event Processor makes a credit request to the Rally Ledger service, which acts as the source of truth for all credits and debits and as a facade to financial fulfillment services.

In the example of “Get a physical examination this year and earn a $200 deposit into your Health Savings Account (HSA),” there are a number of checks that need to pass in order for the user to get paid. These checks are distributed throughout the Rally Incentives Ecosystem:

  1. Once the user leaves the doctor’s office, a claim is made to the insurance company for the visit.
  2. The insurance company sends the claim to Rally via file or API call, at which point the Claims Microservice validates that the claims code evaluates to an annual physical exam.
  3. The Rewards Event Processor listens for the annual physical exam event.
  4. The Rewards Event Processor resolves the user’s eligibility to help determine which Corporate Wellness program the user belongs to.
  5. The Rally Incentives Engine evaluates the annual physical exam event against the rules configured for the employer’s Corporate Wellness Program to make the pay/don’t-pay decision.
  6. The Rewards Event Processor submits a credit request to the Ledger service.
  7. The Ledger service records the credit and routes the request to the HSA fulfillment API.
  8. Funds are deposited into the user’s Health Savings Account.
  9. The user is informed via email or push notification that the bank has successfully deposited the funds.
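To make steps 3 through 6 concrete, here is a minimal Python sketch of what the Rewards Event Processor does with a single event. Every name in it (Event, resolve_program, decide_payout, the lookup tables, the in-memory ledger list) is a hypothetical stand-in for illustration, not Rally's actual API, and the real services talk over a message bus rather than through direct function calls.

```python
from dataclasses import dataclass
from typing import Optional

# All names below are hypothetical stand-ins, not Rally's real services.

@dataclass(frozen=True)
class Event:
    event_id: str
    member_id: str
    activity: str  # e.g. "annual_physical", as validated by the claims step


# Stand-in for the Eligibility Service: member -> wellness program.
ELIGIBILITY = {"member-1": "acme-wellness"}

# Stand-in for the rules configured per program: activity -> reward in cents.
RULES = {"acme-wellness": {"annual_physical": 20_000}}


def resolve_program(member_id: str) -> Optional[str]:
    """Step 4: find the Corporate Wellness program the user belongs to."""
    return ELIGIBILITY.get(member_id)


def decide_payout(program: str, activity: str) -> Optional[int]:
    """Step 5: pay/don't-pay decision; None means don't pay."""
    return RULES.get(program, {}).get(activity)


def process_event(event: Event, ledger: list) -> bool:
    """Steps 3-6: returns True if a credit request was submitted."""
    program = resolve_program(event.member_id)
    if program is None:
        return False  # user is not eligible for any program
    amount_cents = decide_payout(program, event.activity)
    if amount_cents is None:
        return False  # no rule rewards this activity
    # Step 6: submit the credit request (here, append to an in-memory list).
    ledger.append({
        "event_id": event.event_id,
        "member_id": event.member_id,
        "amount_cents": amount_cents,
        "account": "HSA",
    })
    return True
```

Calling process_event with an "annual_physical" event for "member-1" appends one 20,000-cent credit request to the list, while an event for an unknown member is dropped at the eligibility step.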

Fault Tolerance

But what happens if one of these checks fails erroneously or if one of the internal or external systems is unavailable? In the world of distributed systems we have to assume that any given service is unreliable and any given data is faulty. A payment transaction could fail for any of the following reasons:

  1. System outage (e.g. gift card API is down)
  2. Data error (e.g. an incorrect insurance policy number was used to identify the member)
  3. Error in configuration or business logic (e.g. we meant to pay the user $100 but we accidentally configured $50)

Categories #2 and #3 are handled by the reversal algorithm outlined in the previous blog post, combined with operational tools. An entire blog post could be written about the defensive programming techniques and business processes we use to handle those errors, but the remainder of this post focuses on Category #1, system outages. It’s mission-critical that we deliver funds to the user reliably and accurately. To do so, we follow these guiding principles:

  • Ensure that all operations are idempotent.
  • Employ immediate and delayed retries.
  • Provide automated and manual remediation.

Ensuring idempotency means making sure that each microservice can receive the same message twice with no undesired side effects. For example, the Ledger service can receive the same credit transaction twice without double-paying the user. We generally achieve this by defining a set of attributes that together define a unique transaction, and if two transactions exist with the same set of attributes we consider them to be duplicates. This enables us to rely on at-least-once delivery, which means that if we think an event may have failed to be processed downstream, we can retry it.
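As a rough illustration of the duplicate check, here is a minimal in-memory sketch in Python. The CreditRequest fields, the choice of attributes in dedup_key, and the Ledger class itself are assumptions made for the example; the post does not say which attributes the real Ledger service uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreditRequest:
    # Hypothetical fields; the real uniqueness attributes are not given.
    member_id: str
    program_id: str
    activity: str
    period: str
    amount_cents: int

    def dedup_key(self) -> tuple:
        # Assumption: these four attributes together identify one transaction.
        # The amount is left out so a retry is still recognized as a duplicate.
        return (self.member_id, self.program_id, self.activity, self.period)


class Ledger:
    """Toy ledger: records each unique credit at most once."""

    def __init__(self) -> None:
        self._credits = {}  # dedup key -> CreditRequest

    def credit(self, request: CreditRequest) -> bool:
        """Record the credit; return False if it was already recorded."""
        key = request.dedup_key()
        if key in self._credits:
            return False  # duplicate delivery: no side effects
        self._credits[key] = request
        return True

    def balance_cents(self, member_id: str) -> int:
        return sum(
            r.amount_cents
            for r in self._credits.values()
            if r.member_id == member_id
        )
```

Submitting the same CreditRequest twice leaves the balance unchanged after the second call, which is what makes it safe for an upstream service to resend a message it is unsure about.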

While we prefer automated solutions such as automated retries for failed messages, it’s also important to give our support team operational tools. One example is the Rewards Activity Finder tool, which gives a support agent visibility into each of the above checks throughout the Rally Incentives Ecosystem and the ability to replay events where necessary.
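The combination of immediate retries, delayed retries and a manual fallback might look something like the Python sketch below. The function name, the TransientError type, the attempt counts and delays, and the parked list are all hypothetical; the list simply stands in for wherever failed events wait until a support agent replays them. The approach is only safe because the receiving service is idempotent.

```python
import time
from typing import Callable, Optional, Sequence


class TransientError(Exception):
    """Hypothetical error type for failures worth retrying (e.g. an outage)."""


def deliver_with_retries(
    send: Callable[[str], None],
    message: str,
    immediate_attempts: int = 3,
    delays_seconds: Sequence[float] = (1.0, 5.0),
    sleep: Callable[[float], None] = time.sleep,
    parked: Optional[list] = None,
) -> bool:
    """Try to deliver a message; return True once any attempt succeeds."""
    # Immediate retries: back-to-back attempts for brief blips.
    for _ in range(immediate_attempts):
        try:
            send(message)
            return True
        except TransientError:
            continue
    # Delayed retries: wait before each further attempt for longer outages.
    for delay in delays_seconds:
        sleep(delay)
        try:
            send(message)
            return True
        except TransientError:
            continue
    # Automated options are exhausted: park the message for manual replay.
    if parked is not None:
        parked.append(message)
    return False
```

Passing sleep as a parameter keeps the sketch easy to test without real waiting, and only TransientError is retried so that genuine bugs surface immediately instead of being retried.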

Why All of This Matters

As a software engineer, in addition to playing with the latest technologies, I want to know that what I am building solves a problem that matters. We live in an age of rising healthcare costs: 86% of that spending goes to treating chronic conditions, many of which are preventable. Research has shown that we see better outcomes when we catch problems earlier and that people are more likely to do something when you incentivize them for it. For every X preventive screenings, Y% catch a problem early that could have developed into a chronic condition. For every X people who drop that smoking habit or lose that first 10 lbs, Y% end up avoiding a cardiovascular condition that they would otherwise have developed. Now imagine the impact the Rally Incentives Ecosystem can have on the tens of millions of people with access to Rally when we award hundreds of millions of dollars annually in the name of prevention and wellbeing.

Conclusion

The Rally Incentives Ecosystem solves interesting challenges in distributed systems in order to simplify the healthcare experience and provide value to the user. It also solves a massive systems integration problem: aggregating data from partners in the healthcare system, applying business logic, and seamlessly integrating with banks and fulfillment providers. Through idempotency, retries and operational tools, we ensure at-least-once delivery in a microservice architecture across potentially unreliable networks. All of this cool technology allows us to make a positive impact by using incentives to encourage healthy behavior, and it lets our users focus on prevention and wellbeing.