Service Level Objectives for On-call Peace of Mind

My team, Developer Experience (DEx), owns the Continuous Integration (CI) experience at Gusto. When I joined the team about a year ago, we were getting pressure from our customers (Gusto’s Product Engineers) to make our CI tool “better”.

We framed our work to make CI “better” as an experiment with defined questions and metrics from the beginning. Could we use metrics to improve the experienced quality of CI? Once we knew that we were meeting some threshold of quality, could we use these metrics to prioritize future work? We gave ourselves a month to answer the first question and hoped that we’d have a while to test the second.

Advantages of where we ended up

By defining SLOs and monitors around the performance that we expected from our CI setup in Buildkite, we:

  • Ensured that we would know about potential degradation in performance before our customers saw it
  • Removed the need for a human to look at a dashboard
  • Automated prioritization of new features vs “keeping the lights on”
  • Gave ourselves a needle that we could see moving up and down with changes, as opposed to needing customer feedback more often

Defining “better”

We set out to understand which CI issues were actually frustrating Gusto’s engineers. To gather this information, we opened up a couple of channels for feedback.

We asked for opinions in our #engineering Slack channel. We also scheduled live feedback sessions with product engineers from each group or mission. In these conversations, we asked:

  • What is good about our CI?
  • Is speed an issue?
  • When are you frustrated with our CI?

We grouped the answers into buckets. The most prominent were:

  • Builds fail because something goes wrong with docker or mysql or something else outside my control as a Product Engineer.
  • I hate when I have to wait for 15 minutes before the build even starts.
  • We have so many flaky tests that I assume any rspec (ruby test suite) failure is caused by them.
  • The test suite just takes too long to run.

With all of this in mind, we knew what “better” would look like for our customers, but we weren’t sure that we were the only team who should be working on solutions to all of those problems.

We drew some lines around what our team should fix and what the developers themselves should fix. We defined the boundary as follows:

  • Individual tests are under the control of Product Engineering. Therefore, flaky tests should be fixed and slow test suites sped up by the teams who know the code best.
  • The machines that run the tests are our responsibility. Therefore, we needed to make sure that CI builds didn’t fail because the machines were set up wrong or were too slow to stand up.

If you haven’t heard of Service Level Indicators (SLIs) before, Google has written about them a couple of times. The basics are that an SLI is a measurable ratio of good events to total events. It goes down when performance degrades, and it goes up when performance improves. We turned our areas of responsibility into two SLIs, one for stability and one for speed.

Note that for stability, we were not measuring the total number of failed builds. We specifically counted exit statuses other than 1 (which implies a test suite failure) and 0 (which isn’t a failure at all). Similarly, we were intentionally not interested in the total time for a build to finish, just the amount of time a build waited to start. We made this distinction because we had control over how many machines were idle and ready to pick up new jobs. If the number of machines was too low, we needed to scale smarter.
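
A minimal sketch of how the two SLIs described above could be computed from a list of job records. The `Job` record, its field names, and the sample data are hypothetical; this is not Buildkite's API or the exact query behind our dashboards.

```python
from dataclasses import dataclass


@dataclass
class Job:
    exit_status: int      # 0 = passed, 1 = test suite failure, anything else = infra failure
    wait_seconds: float   # time the job waited before a machine picked it up


def is_infra_failure(job: Job) -> bool:
    # Exit statuses 0 and 1 are outside this SLI; everything else counts as "bad".
    return job.exit_status not in (0, 1)


def stability_sli(jobs: list[Job]) -> float:
    """Fraction of jobs that did not hit an infra failure."""
    if not jobs:
        return 1.0  # assumption: no jobs means nothing went wrong
    good = sum(1 for job in jobs if not is_infra_failure(job))
    return good / len(jobs)


def average_wait_minutes(jobs: list[Job]) -> float:
    """Mean time jobs spent waiting to start, in minutes."""
    if not jobs:
        return 0.0
    return sum(job.wait_seconds for job in jobs) / len(jobs) / 60


# Hypothetical sample data.
jobs = [
    Job(exit_status=0, wait_seconds=30),
    Job(exit_status=1, wait_seconds=90),     # failing test: not an infra failure
    Job(exit_status=255, wait_seconds=600),  # made-up infra failure status
    Job(exit_status=0, wait_seconds=120),
]

print(f"stability SLI: {stability_sli(jobs):.2%}")
print(f"average wait: {average_wait_minutes(jobs):.1f} min")
```

Note that the job with exit status 1 still counts as "good" for stability: a failing test is the Product Engineer's signal, not an infrastructure problem.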

Implementing Service Level Objectives

With these SLIs, we negotiated with our customers to define Service Level Objectives (SLOs) for our two themes – speed and stability. Put simply, an SLO is a target value or range for an SLI, for example SLI ≥ target. In essence, we were deciding what percent of the time our SLIs needed to be above our target.

SLO targets should be below 100% because you won’t have a perfect product. We live in a world with humans, weather, and uncertainty, which means downtime is inevitable, so we should plan for it. By accounting for downtime in the product experience, we also buy ourselves learning and development time, during which we can take risks without breaching our SLO.
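
To make "planning for downtime" concrete, here is a minimal sketch of the error-budget arithmetic that a target below 100% implies. The function name and the job counts are made up for illustration.

```python
def error_budget(target: float, total_events: int, bad_events: int) -> dict:
    """Return how many bad events an SLO target allows and how many remain.

    target is the SLO as a fraction, e.g. 0.995 for 99.5%.
    """
    allowed = (1 - target) * total_events
    return {
        "allowed_bad_events": allowed,
        "remaining": allowed - bad_events,
        "within_budget": bad_events <= allowed,
    }


# Hypothetical month: 20,000 jobs, 60 of which failed for infrastructure reasons.
budget = error_budget(target=0.995, total_events=20_000, bad_events=60)
print(budget)
```

With these made-up numbers, a 99.5% target allows roughly 100 bad jobs, so about 40 are still available to spend on riskier changes.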

After talking with customers, we decided to focus on stability first and settled on these SLOs:

  • 99.5% of Buildkite jobs in the past 30 days should not experience “infra failures”
  • 95% of the time, the Average Buildkite Job Latency in the past 30 days should be less than 4 minutes
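
The latency objective is time-based: it asks what share of time windows had an acceptable average latency. A hedged sketch of that check, assuming latency is sampled as one average per fixed window. The window size, function name, and sample values are hypothetical, and Datadog may evaluate its SLO widgets differently.

```python
def latency_slo_met(window_averages_min: list[float],
                    threshold_min: float = 4.0,
                    target: float = 0.95) -> tuple[float, bool]:
    """Return (fraction of good windows, whether the SLO target is met).

    A window is "good" when its average job latency is below threshold_min.
    """
    if not window_averages_min:
        raise ValueError("need at least one window to evaluate the SLO")
    good = sum(1 for avg in window_averages_min if avg < threshold_min)
    fraction = good / len(window_averages_min)
    return fraction, fraction >= target


# Hypothetical data: 20 windows, one of which averaged a 9-minute wait.
windows = [1.5] * 19 + [9.0]
fraction, met = latency_slo_met(windows)
print(f"{fraction:.0%} of windows under 4 minutes; SLO met: {met}")
```

One slow window out of twenty sits exactly on the 95% line, which is why a single bad afternoon does not have to turn the widget red.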

We gave ourselves one month to experiment. At the end of the month, whether we had reached our goals or not, we planned to sit down with customers again and see if sentiment toward Buildkite stability and speed had improved. After all, the point was to make engineers more productive. Short of tracking their time, the best we can do to measure productivity is talk to them.

Make Experimentation Easy

Now that we knew what we were aiming for, we needed to get real data and make it easy to see progress. At the time, Datadog had just introduced an SLO widget, allowing us to turn the words we were using above into cohesive dashboard pieces.

The centerpiece of our dashboard was the SLOs themselves. At the time, our speed SLO was green for “the past 7 days” and “the past 30 days”. Our stability SLO was red across the board, as anticipated. Here’s a screenshot of what these widgets look like today, almost a year after the experiment began:

[Screenshot: the speed and stability SLO widgets in Datadog]

Further down the board, we added more granular measurements for specific error types. We made these viewable in just about every way possible – a total count of each error type, ratios of individual failure types to the total number of failures, change week over week for each error type, a graph of errors over time, etc.

Until we interacted with the data, we did not know which view was most helpful. So we set up multiple ways to see the same data, and once we found the one that was useful, we removed the others. By the end of the experiment, we had a concise and easy-to-understand board that has made debugging easier ever since.

Experimenting

Change small things first

Once we were confident that we were counting everything our SLI deemed “bad”, we were ready to experiment. Some of these “experiments” were low-hanging fruit like upgrading our version of Buildkite, changing environment variables, and removing redundant setup code.

Regroup when needed

At the start of the experiment we thought we understood how our failures should be grouped. As we continued increasing stability, though, we realized that the biggest bucket of error types could be broken down further. Once we rearranged buckets by identical failure messages, not just the same exit codes, we gained clearer priorities.
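
The regrouping step can be sketched as two counts over the same failures, one keyed by exit code and one keyed by failure message. The exit statuses and messages below are invented for illustration and are not taken from our logs.

```python
from collections import Counter

# Hypothetical infra failures as (exit_status, failure_message) pairs.
failures = [
    (255, "docker: cannot connect to the Docker daemon"),
    (255, "mysql: connection refused"),
    (255, "docker: cannot connect to the Docker daemon"),
    (-1, "agent lost"),
    (255, "docker: cannot connect to the Docker daemon"),
]

# Grouping by exit status hides distinct causes inside one large bucket.
by_exit_status = Counter(status for status, _ in failures)

# Grouping by identical message splits that bucket into actionable causes.
by_message = Counter(message for _, message in failures)

print(by_exit_status.most_common())
print(by_message.most_common())
```

In this made-up sample, exit status 255 looks like one big problem, while the message view shows that most of it is a single Docker issue worth fixing first.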

Roll with the punches

One of our changes was purely reactive. While we were working on another failure cause, a binding.pry slipped into a commit on a feature branch. You may think this should not be a big deal since it didn’t get out into the development or master branches. But since all of the machines for a queue are in the same pool, it tied up machines, caused other jobs on those machines to fail when the machines were marked unhealthy, and drove wait times through the roof. The fix was simple – set a global environment variable (DISABLE_PRY=1).

Big swings

A bigger change that we wanted to tackle was upgrading our machines from Ubuntu 14 (Trusty) to Ubuntu 18 (Bionic). By the time we were considering this change, we were green for our 7-day and 30-day stability SLO, so we had space to take bigger swings.

The short story is that this change didn’t work out within a month. Though we didn’t complete this big upgrade within the experiment time, we were able to try and fail because we had SLOs that kept us accountable to our customers. Without that built-in allowance for downtime, we might have been too scared to take a big risk and get feedback so quickly. As it was, we were able to identify the scaling and shutdown problems quickly.

After the experiment

The experiment ended with us reaching our goals across the board – all of our SLOs were met for Buildkite’s speed and stability.

A few months after the experiment, our stability SLI dipped below its target. We didn’t hear a single complaint from engineers about stability in Buildkite while we were investigating the cause. After a week of this, we decided to talk to our customers and lower our SLO target from 99.5% to 95%. We made this decision on the assumption that we had set the bar too high for ourselves in the first place. Other issues were more variable and more intrusive to engineers, so we could ease back without causing undue slowness.

On a related note, we haven’t changed the speed SLO at all since implementing it. This SLI has only slipped under its target on rare occasions, which were quickly remedied and received little notice from customers.

Concluding Thoughts

If there’s a place your customers want improvement, start with SLIs and SLOs for your product. This allows you to focus on moving a needle you can see, as opposed to one that is unmeasurable or just in someone’s head. This also allows you to prioritize issues automatically. For example, if all SLOs are met, the complaint that one build is slow can be put on the back burner and not distract from regularly scheduled work.

If you feel like you’re drowning in maintenance or operational work, embrace SLOs.

Source: Gusto