At Curalate, our service and API traffic is fairly tightly coupled to e-commerce traffic, so any increase is reasonably predictable. We expect an increase in request rate towards the beginning of November each year, with traffic peaking at 10x our steady rate on Black Friday and Cyber Monday.
Curalate works directly with retail brands to drive traffic to their sites. The holiday shopping period is the most important time of the year for most of them, and we need to ensure that our experiences continue to operate at a high standard throughout.
More generally, though, load testing is critical for services and APIs, especially when load is expected to increase. It uncovers potential points of failure during business hours, and hopefully prevents someone from being woken up at 2 a.m. on a weekend.
When a load increase is expected, it’s important to understand as much as possible before diving in. There are a few questions to ask:
In our case, there were four main services that we were interested in testing against expected load, separated into on-site (APIs and services called directly from our clients’ sites) and off-site (our custom-built, Curalate-hosted services). This distinction works well for us because we expected a 10x increase for on-site experiences, but only a 2-3x increase for off-site ones – brands focus on driving traffic to their own e-commerce sites.
Now, there are many tools out there for load testing. For our purposes, I used Vegeta, for its robust set of options and extensibility. It was easy to script around, allowing a steadily increasing request rate against either a single target or lazily generated targets. The output functionality is also well thought out: it supports top-line latency stats along with some basic charting capabilities.
Let’s assume we had a service that we wanted to test up to 1000 RPS, both against a single target, and against multiple targets – to work around any caching in place.
The setup was fairly simple:
Spin up a couple of AWS EC2 m3.2xlarge instances.
SSH to the instances, create a load_testing folder, and fetch the Vegeta binary.
Put together a simple, quick script to handle steadily increasing the request rate, and then hold steady at the max rate.
#!/bin/bash
# Usage: ./rate_increasing.sh <target> <maxRate> <rateInc> <incDuration> <startAt> <hitType>
target=$1
maxRate=$2
rateInc=$3
incDuration=$4
startAt=$5
currentRate=$startAt
hitType=$6

while [ $currentRate -le $maxRate ]
do
  if [ $currentRate -eq $maxRate ]
  then
    # At the max rate, omit the duration: run until manually killed.
    echo $target | ./vegeta attack -rate=$currentRate > reel-$maxRate-$currentRate-$hitType-test.bin
  else
    echo $target | ./vegeta attack -rate=$currentRate -duration=$incDuration > reel-$maxRate-$currentRate-$hitType-test.bin
  fi
  currentRate=$((currentRate+rateInc))
done
Basically, if it hasn’t yet hit the max rate, run Vegeta at the current rate for the specified duration, then increase the rate by the increment and loop again. Once the max rate is hit, don’t specify a duration – run until manually killed. The multi-target script is similar, but reads its targets from a targets.txt file instead of taking a single target argument:
#!/bin/bash
# Usage: ./rate_increasing_multi.sh <maxRate> <rateInc> <incDuration> <startAt> <hitType>
maxRate=$1
rateInc=$2
incDuration=$3
startAt=$4
currentRate=$startAt
hitType=$5

while [ $currentRate -le $maxRate ]
do
  if [ $currentRate -eq $maxRate ]
  then
    # At the max rate, omit the duration: run until manually killed.
    ./vegeta attack -rate=$currentRate -targets=targets.txt > reel-$maxRate-$currentRate-$hitType-test.bin
  else
    ./vegeta attack -rate=$currentRate -duration=$incDuration -targets=targets.txt > reel-$maxRate-$currentRate-$hitType-test.bin
  fi
  currentRate=$((currentRate+rateInc))
done
Aside: I was unable to get the -lazy flag to work properly with Vegeta, so I went with brute force and just generated a ton of targets to a file. I’m convinced it could have been more elegant, but sometimes the easy solution works just as well.
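A brute-force generator can be as simple as the following sketch (the base URL and query parameter name here are placeholders, not our real endpoints); giving each target a unique query string also helps defeat any caching between the test box and the service:

```shell
#!/bin/bash
# Sketch: generate 10000 unique Vegeta targets into targets.txt.
# The base URL and cache-busting parameter are hypothetical.
base="https://api.example.com/v1/widgets"
: > targets.txt
for i in $(seq 1 10000)
do
  # Vegeta targets are "METHOD URL", one per line.
  echo "GET ${base}?cb=${i}" >> targets.txt
done
```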
With the setup complete, it’s as simple as putting whatever monitoring you want on a display or two and firing off the scripts:
sh ./rate_increasing_multi.sh 1000 50 120s 50 uncached
This ramps up to 1000 RPS in increments of 50, holding each rate for 2 minutes, starting at 50 RPS, and tags the output files with "uncached".
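Before pointing that at production, it can be worth dry-running the loop arithmetic to see how long the ramp will take – a quick sketch using the parameters above:

```shell
#!/bin/bash
# Dry run of the ramp schedule: maxRate=1000, rateInc=50, startAt=50.
maxRate=1000; rateInc=50; currentRate=50
steps=0
while [ $currentRate -le $maxRate ]
do
  steps=$((steps+1))
  currentRate=$((currentRate+rateInc))
done
# 20 rate steps in total: the first 19 run for 120s each (38 minutes of
# ramp), and the 20th holds at 1000 RPS until manually killed.
echo "$steps steps"
```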
For each results file generated, ./vegeta report -inputs "out.txt" will output something like the following (this example is for 250 RPS):
Requests      [total, rate]            66177, 249.98
Duration      [total, attack, wait]    4m24.783548697s, 4m24.731999487s, 51.54921ms
Latencies     [mean, 50, 95, 99, max]  64.885905ms, 57.516245ms, 107.88721ms, 730.867162ms, 2.309337436s
Bytes In      [total, mean]            943011144, 14249.83
Bytes Out     [total, mean]            0, 0.00
Success       [ratio]                  100.00%
Status Codes  [code:count]             200:66177
Error Set:
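When a run produces a pile of result files, a small loop (a sketch, assuming the reel-*-test.bin naming from the scripts above) can turn each one into a plain-text report:

```shell
#!/bin/bash
# Sketch: write a text report next to every Vegeta result file.
# Assumes the reel-*-test.bin naming convention from the scripts above.
for f in reel-*-test.bin
do
  [ -e "$f" ] || continue  # no result files yet; skip the literal glob
  ./vegeta report -inputs "$f" > "${f%.bin}.txt"
done
```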
As different tests are kicked off and rates increase, keep an eye on your monitoring dashboards and any alerts that fire, and be ready to bail out of a test early. From there, logging should help in diagnosing what failed, and tickets can be filed at each step of the way. Once those issues are resolved, you can pick testing back up until you hit your goal rate, then hold it long enough to be confident in the result.
It should go without saying, but when testing against a live, production environment, it’s always nice to give the current on-call engineers a heads-up, and to keep them in the loop the entire way through.
As for Curalate’s load testing, on Cyber Monday we experienced record-breaking traffic numbers – even exceeding our 10x estimates slightly – to our services, and the on-call engineers slept soundly through Thanksgiving weekend.