In 2018, Etsy migrated its service infrastructure from self-managed data centers to cloud provisioning. (We blogged about it at the time.) The change opened up opportunities for improvements to technical processes across the company. For the Search team, the flexible scaling that comes with a cloud environment allowed us to completely reevaluate a somewhat cumbersome […]
In 2018, Etsy migrated its service infrastructure from self-managed data centers to cloud provisioning. (We blogged about it at the time.) The change opened up opportunities for improvements to technical processes across the company. For the Search team, the flexible scaling that comes with a cloud environment allowed us to completely reevaluate a somewhat cumbersome deployment process. Inspired by the existing architectural pattern of canary rollouts, we wrote a new custom tool to supplement our existing deployment infrastructure.
What we ended up with, after three months of effort, was a more scalable, more developer-friendly, and ultimately a more robust way to roll out changes to Search.
Historically, we deployed our stack on two separate sets of hosts, in what’s known as a blue-green strategy. At any one time, only one set of hosts is live; the other set, or “side,” is dark. Both sides were always fully scaled and ready to serve traffic, but only the live side was accessible to the public internet.
While simple, a blue-green deployment strategy has some very useful features:
We refer to the two sets of hosts as “flip” and “flop”, named after the circuit that is a fundamental building block of modern computers. We point our monolithic PHP web application to whichever side should be active via some lines of code in a configuration file.
A diagram of what our infrastructure previously looked like. One side (flop, in this example) was always live, and during a deployment we’d move all traffic at once to the other side (flip in this example).
This blue-green deployment method was “lifted and shifted” during Etsy’s migration to the cloud three years ago. The Search team moved the search application to Google Kubernetes Engine (GKE), and flip and flop became two separate production Kubernetes namespaces.
That change aside, things worked as they always had: deploying Search instantly triggered a redirect of all traffic from one namespace—the live side—to the same services running in the other namespace. To ensure the dark side would always be ready to go, we continued maintaining 200% capacity at all times (100% in each production namespace), just as we had done when we were on-premises.
This original deployment strategy was immensely useful to the team, especially for giving us a secure place to test and prepare major software updates and infrastructural changes. But it wasn’t without its painful aspects. Abruptly shifting all traffic between sides gave us no room to test changes on small amounts of production traffic before going all in. Even when things went right, deploying was stressful. And when things went wrong, engineers had to triage to decide whether to fully revert to the former side. On top of all that, having to permanently maintain at double capacity was expensive and inefficient.
Thanks to the flexibility provided by the cloud, once we were safely on GKE we had an opening to rethink our blue-green strategy and address these longstanding issues.
Our first thought was to adopt a canary deployment strategy. During a “canary rollout”, a small subset of traffic is sent to the new version of a service to determine it is “safe” before switching over all traffic to the new service.
Why the name? Coal miners used to use canaries to detect carbon monoxide at a level that could hurt a small bird, but not yet hurt a human. Software engineers have adopted a similar—albeit more humane—model to build confidence that new software is safe to serve traffic.
Although Kubernetes’ architecture and flexible scaling mean canary rollouts are a very popular deployment solution, the design of Etsy’s search system meant we couldn’t use any off-the-shelf canary release solutions. We had to build something new for ourselves, a sort of Canary Lite.
We had two key limitations when looking to re-architect our deployment process to incorporate a canary component.
First, we had no single load balancer or API endpoint where we could control the amount of incoming traffic going to flip versus flop. This made it impossible to do basic canary rollouts using Kubernetes labels on a single Kubernetes deployment for any search service, because Search is made of many disparate Kubernetes deployments. There was no place we could put routing logic to check the labels and route to the canary pods accordingly.
However, Etsy’s PHP web application is the search application’s only client. This is a common pattern at Etsy, and as a result, configuration load balancing is often managed directly within the web application itself. Any new deployment solution would either have to manage traffic from the web application to Search from within the web application, or implement some sort of entirely new mesh network (like Istio) to catch and direct all traffic from the web application to Search. Neither of these options were viable in the time frame allotted for this project.
The second limitation was that the search application assumes any single web request will be served by the same version of all search services in the request path. As a result, deployment of any new solution would need to ensure that in-flight search requests would finish being served by the old version of all search services. Even sophisticated canary rollout solutions like Istio require your application to handle version mismatches between different services, which we couldn’t guarantee.
So how could we create a gradual rollout for a new version of all search services, while simultaneously managing load-balancing from the web application to all parts of the rollout AND guaranteeing search services only ever talked to other search services of the same version? There were no off-the-shelf solutions that could solve such an Etsy-specific problem. So we built an entirely new tool, called Switchboard.
Switchboard’s primary function is to manage traffic: it rolls a deployment out to production by gradually increasing the percentage served to the new live side, and proportionally decreasing the amount going to the old one.
Deployment stages with predefined traffic ratios are hardcoded into the system, and when all pods added during the current rollout stage are fully created and healthy, Switchboard transitions to the next stage. It does this by editing and committing the new traffic percentages to a configuration file within the web application. The web app re-checks this file on every new request and uses the information to load balance search traffic between two different production Kubernetes namespaces, both still called flip and flop.
Example of a side switch using Switchboard. Smoke tests are running at 16:57 and 17:07.
Switchboard largely automates the migration of traffic from one search side to the other during a deployment. Smoke tests run at different phases of the deployment, sending both artificially-created and real historical search query requests to the new side. Developers just need to monitor the graphs to make sure the rollout went smoothly.
The engineer driving the deploy manages Switchboard through a user interface that shows the current state of the rollout and also provides options for pausing or rolling back the deployment.
With Switchboard, we largely rely on Kubernetes’ built-in autoscaling to scale the new cluster during the deployment. We have found that we only need to prescale the cluster to serve 25% of our current capacity before we start sending production traffic to it. Kubernetes’ built-in autoscaling is reactive, and therefore necessarily slower than if we force Search to scale before it needs the extra capacity. As a result, it helps to prescale the new live side so it responds faster to the initial shift as that side goes live and starts to receive traffic. From there, Switchboard lets Kubernetes manage its own autoscaling, simply monitoring the Kubernetes rollout to make sure all services are healthy at the current stage before making the decision to ramp up.
We designed Switchboard to improve the resource consumption of our Search system, and it has done that. But the stepped deployment approach has also resulted in a number of nice workflow improvements for developers.
Switchboard allows us to keep our overall search VM capacity at or close to 100% rather than the 200% capacity we’d been supporting before. We no longer need to provision double capacity when traffic to Search increases. It’s now much easier to adapt to variations in traffic, since any additional reactive (automatic) or proactive (manual) scaling only needs to reserve compute services for our true capacity instead of double. As a result, there was a noticeable improvement in our cloud VM utilization during the period in which we released Switchboard.
Cloud costs per search request (cloud billing total/number of requests) over several months showing our improved utilization efficiency post-Switchboard.
The second big win from Switchboard is that it has made deploys to our staging environment consistently faster. Our first attempt to move away from the legacy double provisioning approach was to fully scale down the unused search cluster between deploys and and then preemptively rescale it as the first step in the next deploy. One problem with this approach was that developers had to wait for all the services inside our Search system to be scaled up enough to accept traffic before they could test in our staging environment.
As you can see in the graph below, deploys to staging have become less bursty since we adopted Switchboard. Switchboard’s stepped scaling means we can send staging traffic to the new live side much faster. In the worst-case scenarios, provisioning a completely new fleet of nodes in the cloud was taking 20 minutes or more. That is 20 minutes that a developer needs to wait before being able to see their changes in staging.
Time elapsed per staging environment deploy. Each individual line is a single deploy.
Overall, after implementing Switchboard we saw similar increased utilization to our intermediate solution, but without having to compromise on slower deploy times. Switchboard even improved on the utilization efficiency of the intermediate solution.
It’s also easier now to spot and respond to issues during a deploy. Search deploys technically take longer than they did when we maintained two fully scaled clusters, but that additional time is caused by the gradualness of the automated traffic rollout process. A human search deployer typically passively monitors rollout stages without interacting at all. But if they need to, they can and will pause a deploy to examine current results. Search deployers use Switchboard at least once a month to pause a rollout. This is an option that simply wasn’t available to us before. Due to Switchboard’s gradual rollouts and its ability to pause, individual deploys have become safer and more reliable.
In the end, rearchitecting our blue-green deployment process to include a canary-like gradual traffic ramp-up via Switchboard allowed us to make our system more scalable and efficient while also designing for a better developer experience. We were able to successfully adapt our search application’s architecture to take advantage of the flexibility of our Kubernetes and cloud environment.