Releasing Microservices Efficiently and Reliably at Scale
Here at HomeAway, we strive to provide a highly available hybrid cloud platform to ease the operations burden for product-focused developers. The platform currently supports three distinct runtime environments (test, stage, production), each containing a set of physically isolated data centers defined as regions. While maintaining an ecosystem with such a large amount of physical and logical isolation enables important features such as high availability deployments and geo-aware routing, it is difficult to interact with such a distributed set of services.
As a quick end-user example, consider how one might deploy an active-active setup across six availability zones in two regions. Each scheduler is logically restricted to its resource pool within its availability zone, so deploying the requested application requires six calls to six distinct container schedulers. Doing this manually for every new versioned release would quickly become untenable, especially for development teams that release updates multiple times per day. And having each development team roll its own automation around this kind of management quickly erodes the stability of the platform as a whole.
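The fan-out described above can be sketched as a simple loop over regions and availability zones. All names here (the region map, `submit_to_scheduler`) are hypothetical placeholders, not the platform's actual API:

```python
"""Sketch: fanning a single release out to every regional scheduler.

The region/AZ layout and submit_to_scheduler are illustrative
assumptions; the real platform details differ.
"""

REGIONS = {
    "us-east-1": ["az-a", "az-b", "az-c"],
    "us-west-2": ["az-d", "az-e", "az-f"],
}

def submit_to_scheduler(region, az, app, version):
    # Placeholder for a call to the container scheduler owning this AZ's pool.
    return f"deployed {app}:{version} to {region}/{az}"

def deploy_everywhere(app, version):
    # One scheduler call per availability zone: 2 regions x 3 AZs = 6 calls.
    return [
        submit_to_scheduler(region, az, app, version)
        for region, azs in REGIONS.items()
        for az in azs
    ]

results = deploy_everywhere("my-service", "1.4.2")
print(len(results))  # 6 scheduler calls for one release
```

Centralizing this loop in one service, rather than letting every team reimplement it, is exactly the motivation for MoT below.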
In order to centralize some of the deployment-level management inherent to a cloud platform, we created the Ministry of Truth.
The Ministry of Truth (MoT) consists of a collection of microservices running in every region responsible for distributing and consolidating events pertinent to deployments and container orchestration within a hybrid-cloud platform. I’ll provide insight into the rules of the game and how we accomplish this task in a production-isolated infrastructure.
Given a central datastore, a central API, and several collections of microservices, provide conventions so that messages can be relayed among all three parts in an eventually consistent manner. This post covers how MoT forwards user requests to regional agents, sends messages between microservices, and pushes data through a persistence flow to be stored. We avoid implementation specifics and application details where possible, though we include some for the sake of a comprehensive example.
In general, we strive to build our agents to perform one simple, lightweight action, triggered by an event from a data source and potentially publishing the result to a corresponding data sink. For example, we gather data from Consul, our platform’s service-discovery component. In order to accomplish a part of the data gathering, the InstanceConsulStateAgent does the following:
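An agent of this shape can be sketched as a filter that fingerprints the observed state and publishes only when it changes. The hashing approach, class name, and stubbed sink below are assumptions for illustration, not the agent's actual implementation:

```python
import hashlib
import json

def fingerprint(state):
    # Stable hash of the observed service payload, used to detect change.
    blob = json.dumps(state, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

class ConsulStateAgent:
    """Emit an event only when the observed service state differs
    from what we saw last time (a hypothetical sketch)."""

    def __init__(self, sink):
        self.last_seen = {}   # service name -> fingerprint
        self.sink = sink      # e.g. a Kafka producer; stubbed as a list here

    def on_event(self, service, state):
        fp = fingerprint(state)
        if self.last_seen.get(service) != fp:
            self.last_seen[service] = fp
            self.sink.append({"service": service, "state": state})

sink = []
agent = ConsulStateAgent(sink)
agent.on_event("checkout", {"instances": 3})
agent.on_event("checkout", {"instances": 3})  # unchanged -> filtered out
agent.on_event("checkout", {"instances": 4})  # changed -> emitted
print(len(sink))  # 2
```

Keeping the change-detection filter this small is what makes the agent cheap to run on every event.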
In short, this lets us know if the Consul service data has changed, separating the filter from the downstream persistence components. We have similar flows for some of the other primary platform services, Marathon and Mesos. The MoT microservices are collections of semantically similar agents bundled into Dropwizard apps and grouped with other utilities or models shared among the agents. Some examples include:
The microservices themselves can be bundled as well. Let’s take a look at a high level overview of the three major MoT layers.
You can think of MoT as having three primary components with special communication channels between them:
Some flows, such as configuration updates, only ever reside in the service layer, as there is no need to interact with the regional services or deployments. Others, such as enabling traffic for an existing deployment, heavily involve all three components of MoT.
In order to transfer data between the channels, we use Kafka topics as the primary event bus. Let’s take a brief glimpse at the rules of thumb, and then perform a deep dive into the flow of a deployment request to get an example of how it all integrates.
In a multi-region architecture with a centralized API, we must forward information across regions. We opted to incur the cross-region penalty when consuming from topics in a remote cluster. More specifically, we follow the three rules below.
When an event has been processed by an agent, we want to push the result record to a topic as quickly as possible, and move on to the next record. This rule keeps us from making unnecessary connections to foreign regions and adding latency between each processed event.
There are many topics within the MoT architecture that serve communication between microservices or agents in the regional Kafka cluster. If the data stream doesn't require information from other regions, stay local.
Some messages must be propagated to the other regions. In this particular instance, we append a suffix of-
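Rules two and three together amount to a topic-routing decision. The `.global` suffix in this sketch is a hypothetical stand-in, since the actual suffix used at HomeAway isn't reproduced here:

```python
# Sketch of the topic-naming convention behind rules two and three.
# ".global" is an assumed suffix for illustration only.

GLOBAL_SUFFIX = ".global"

def destination_topic(topic, cross_region):
    """Keep intra-region traffic on the plain topic; mark records that
    must be propagated to other regions with a suffix that downstream
    mirroring or remote consumers can pick out."""
    return topic + GLOBAL_SUFFIX if cross_region else topic

print(destination_topic("mot-instance-state-events", cross_region=False))
# mot-instance-state-events
print(destination_topic("mot-deployment-requests", cross_region=True))
# mot-deployment-requests.global
```

The point of a naming convention like this is that producers never need to know which regions will consume a record; they only declare whether it may leave the region.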
Let’s take a look at how creating a new deployment in test-us-east-1 takes place throughout the system.
Requesting a deployment:
As a short aside, let us clarify the reasoning for including the stage-us-east-1 Kafka cluster at all. We elected to mirror topics from the central production cluster to a pseudo-central cluster in the non-production environments. This limits the number of firewall exceptions that violate the production | non-production boundary. Furthermore, using mirrormakers at all violates our first rule of Kafka etiquette, as we must mirror the source record to a foreign region’s cluster. Breaking this rule from a central cluster to the non-prod central cluster limits the number of foreign regions for which we break Kafka etiquette.
Great! So now we’ve successfully taken a launch request from a user hitting an API in production-us-east-1 and piped it to a launch request against the regional scheduler in test-us-east-1. The record produced to the mot-deployment-complete-events topic marks the end of the deployment request flow as triggered by the user’s request. However, the story does not stop there. If we were to take a look at the deployment’s dashboard for the app, there would not be much to see. The only data persisted so far was a metadata shell storing some information pulled from the initial request in the DeploymentOperationAgent’s business logic. Let’s take a look at the persistence of instance data and deployment state based on data collected from the regional systems and services.
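That metadata shell might look something like the sketch below; every field name is an illustrative assumption, not the real schema used by the DeploymentOperationAgent:

```python
# Sketch: the minimal metadata "shell" persisted when the launch request
# first lands, before any regional state has been collected.

def metadata_shell(request):
    return {
        "deployment_id": request["deployment_id"],
        "app": request["app"],
        "version": request["version"],
        "target": request["target"],   # e.g. "test-us-east-1"
        "instances": [],               # filled in later by the persistence flow
        "state": "REQUESTED",
    }

shell = metadata_shell({
    "deployment_id": "checkout-42",
    "app": "checkout",
    "version": "1.4.2",
    "target": "test-us-east-1",
})
print(shell["state"])  # REQUESTED
```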
Persisting deployment state to the datastore:
Awesome! Now if a user does a GET on the supported deployment API endpoints, they will see the aggregated result of any instance and deployment state collected and persisted by the above flow. We've now seen patterns for both dispersing information from our central region to our regional services and collecting it from the regional services back into the central region.
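The aggregation a GET performs can be sketched as merging the deployment's metadata with the separately persisted instance records. Field names here are illustrative, not the real API schema:

```python
# Sketch: assembling the deployment view a GET might return from
# separately persisted deployment metadata and instance records.

def deployment_view(metadata, instance_records):
    healthy = [i for i in instance_records if i.get("status") == "healthy"]
    return {
        "deployment": metadata["id"],
        "region": metadata["region"],
        "instances": len(instance_records),
        "healthy": len(healthy),
    }

meta = {"id": "checkout-42", "region": "test-us-east-1"}
instances = [
    {"id": "i-1", "status": "healthy"},
    {"id": "i-2", "status": "healthy"},
    {"id": "i-3", "status": "starting"},
]
view = deployment_view(meta, instances)
print(view["healthy"], "/", view["instances"])  # 2 / 3
```

Because the metadata shell and the instance records arrive through different flows, the view is eventually consistent: a fresh deployment shows zero instances until the regional data catches up.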
Now that we’ve looked at how MoT propagates data over the production | non-production boundary, we can take a look at the simpler flow of requests between two production regions.
Production-to-production communication no longer necessitates the use of mirrormakers. Each time a record needs to be pulled between Kafka clusters, the destination region’s consumer can poll the source cluster and localize the data for further processing.
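This "localize then process" step can be sketched as draining a remote topic into a local one. The clients here are stubbed as simple queues; a real implementation would use a Kafka consumer pointed at the source cluster and a producer on the local cluster:

```python
# Sketch of localizing records between production clusters without
# MirrorMaker: the destination region polls the source cluster directly
# and republishes into its own local topic for downstream agents.

from collections import deque

def localize(remote_topic, local_topic, max_records=100):
    """Drain up to max_records from the remote topic into the local one."""
    moved = 0
    while remote_topic and moved < max_records:
        record = remote_topic.popleft()   # poll the remote cluster
        local_topic.append(record)        # produce into the local cluster
        moved += 1
    return moved

remote = deque([{"deployment": "checkout-42"}, {"deployment": "search-7"}])
local = []
print(localize(remote, local))  # 2 records localized
```

After localization, every downstream agent in the destination region reads purely local data, so the cross-region cost is paid exactly once per record.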
The data persistence flow between production regions is likewise simplified:
I hope this blog has served you well. In decoupling the persistence, service, and regional layers, we’ve allowed for reduced blast radii in outage scenarios and minimized the amount of data we must send between regions. Best of luck with your journeys in the hybrid cloud!
Supporting Multi-Region Deployments in the Hybrid Cloud was originally published in HomeAway Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.