Load Testing: an unexpected journey

At FiftyThree, we love to inspire. Currently we do this by featuring the great work that our community creates on platforms like Facebook, Twitter and Tumblr. But we wanted to bring inspiration even closer to the creative process, so we built a Made With Paper stream directly into Paper. After working in stealth mode for […]

At FiftyThree, we love to inspire. Currently we do this by featuring the great work that our community creates on platforms like Facebook, Twitter and Tumblr. But we wanted to bring inspiration even closer to the creative process, so we built a Made With Paper stream directly into Paper.

After working in stealth mode for several months to build and test the stream, we still had to ensure that it could scale. This is the story of our load testing adventure, through the peaks and valleys of endless graphs, to the stream of new creations that you see in Paper every day.

This is how the stream works. So easy, even your dog can use it.

The Goal

When we began we had no idea how much traffic we would get. So we did what any experienced engineering team would do when faced with potentially huge load and millions of dollars on the line: we guessed.

Based on the total number of Paper users, downloads per month, average time spent in the app, and our API polling frequency, we estimated that we could face peak traffic of around 1000 requests per second, and meltdown traffic of over 5000 requests per second. These numbers became our goal.

Fortunately, our initial stream required a minimal, read-only API that serves only one, infrequently-changing set of data. This simplified things for us considerably, but we still had our work cut out for us.

Our Stack

For the technically inclined, our web service is built on Node.js and runs on Heroku. It provides a REST API over HTTPS, which is what Paper consumes.

We persist our data in a high-availability Neo4j cluster, and we load-balance that cluster with HAProxy, all deployed to Amazon EC2 instances.

At the start of this process, we also fronted Neo4j with Apache for HTTPS and authentication, and we ran a single Redis instance for some basic caching.

Our starting stack. A blueprint for success!

To gain visibility into our systems, we use a number of instrumentation and monitoring tools. At the Neo4j and HAProxy levels, we employ Nagios and Datadog for ops trends and alerting.

With Node.js, we integrate Nodetime and New Relic for application-level performance metrics. We view Heroku’s dyno runtime metrics through log2viz and aggregate in Librato via l2met for additional platform data.

For debugging, we send our logs to both Loggly and Papertrail.

Smooth Start: The Database and Load Balancer

We began by testing our database for the baseline performance of our foundational layer. Using open-source tools like siege and httperf, we recorded dismal metrics under 200 reqs/sec. We worked with the Neo4j support team, however, to fine-tune our Neo4j configuration. We were then able to achieve close to 2000 reqs/sec across the cluster.

We were encouraged by a quick but important discovery: for our type of traffic, high-CPU instances were much more effective than high-memory ones. Our database fits into memory and will for quite a while, even at scale. Neo4j runs in a JVM with a memory size fixed at service start, so we had maximized memory settings to use all available RAM. Our experiments showed, however, that we could get the same effect with less memory and more CPU.

Once we knew the baseline numbers for requests directly to our database, we tested with HAProxy in front and saw no performance loss. Since we knew our caching layer would only improve our ability to handle traffic, we were satisfied with our core components.

Special thanks to Neo4j engineer Max De Marzi for lending us his expertise.

Moving Parts: The App

Fueled by the success of our database tests, we moved on to testing our API, this time via distributed load testing services Blitz.io and Loader.io. To our disappointment, however, we saw our throughput flatline at just 250 reqs/sec — a fraction of what we knew our database could handle!

After digging into our metrics and finding no obvious problems, we reached out to Heroku support for ideas. It turns out that when you enable HTTPS over a custom domain on Heroku, your deployment gets its own Amazon ELB, and Amazon ELBs need to be pre-warmed to handle bursts in traffic. Heroku took care of pre-warming our ELB for us, and our results instantly began to look more reasonable.

We learned our first valuable lesson here: get support involved early and often with third-party systems like Heroku. It’s easier to make progress and keep context in an ongoing discussion than to labor alone.

Special thanks to Heroku engineers Brett, Chris, and especially Phil, for their time and help.

Et Tu, Redis? The Cache

Seeing progress, we continued with another round of testing against our API. Again, we encountered timeouts and errors, this time at 400 reqs/sec.

It took digging, but we finally found something interesting in our Nodetime metrics. Our Redis cache was slowing down significantly at load. We were avoiding the inconvenient possibility that Redis was our bottleneck, but when we tried a test with Redis caching disabled, we were amazed to see our results improve to almost 1000 reqs/sec.

Rather than spend a great deal of time working with our Redis provider to debug the cause, we simplified and switched to in-memory caching within each dyno. This worked for us since we didn’t need a shared, consistent cache for our stream. We plan on returning to Redis, but memory caching met our needs beautifully for this release.

An Afterthought: The Proxy

With Redis out of the picture and memory caching in its place, we thought our API would hum, but we were still seeing an unacceptable error rate.

On a whim, to rule out our code as a cause, we conducted an isolated load test against a static “hello world” endpoint. No database queries, nothing to cache, just an immediate response.

We expected that endpoint to fare better, but the results were orders of magnitude better. By correlating multiple data points, we realized that the failure wasn’t at a specific request threshold. Instead, it happened when individual dyno caches expired and hit the database for new data.

Deeper analysis into our Neo4j/Apache/HAProxy monitoring revealed that this was due to CPU spikes caused by Apache, so we switched to Nginx. Moving from Apache to a leaner proxy was something we wanted to eventually do anyway, so we were happy to get it out of the way.

Unusual Suspects: The Tools

We knew we were close, but we had yet to face what would ultimately be our biggest mystery.

We could now get to roughly 2,000 reqs/sec, but anything beyond that yielded timeouts and errors again. Worse, the errors looked similar to the ones we saw originally — before we even got Heroku support involved.

Exasperated, we tried throwing everything at the problem: we maxed out to 100 dynos; we pre-warmed our ELB for ridiculous levels of traffic; we even upped our cache time to hours. Nothing worked.

Eventually, our methodical nature prevailed, and we conducted a series of scientific tests across three variables we knew could shed insight: our actual stream API vs. the “hello world” endpoint, our custom domain vs. Heroku’s domain, and HTTP vs. HTTPS.

And like true scientists, we recorded the results rigorously. Here they are, verbatim:

Resource Protocol Domain Hits Errors Timeouts Avg Resp Time (ms) (Timeout + Error) rate
Hello World HTTP Heroku 289,421 15 0 63 0.01%
Hello World HTTP FiftyThree 287,344 16 0 48 0.01%
Hello World HTTPS Heroku 98,404 9 0 4,120 0.01%
Hello World HTTPS FiftyThree 92,774 7 0 4,130 0.01%
Stream HTTP Heroku 159,749 2,789 844 798 2.22%
Stream HTTP FiftyThree 165,201 3,174 585 1,140 2.22%
Stream HTTPS Heroku 43,046 8,871 0 9,750 17.09%
Stream HTTPS FiftyThree 39,283 9,286 1 10,870 19.12%

Right away, we found the smoking gun: HTTPS. We expected it to be a bit slower since it adds some overhead, but seeing the “hello world” response time increase by a factor of 100x wasn’t normal. More importantly, it was causing an increase in errors for our stream rather than timeouts.

We reported these results to Heroku. They ran their own tests and concluded that it might be the tools. While Blitz.io and Loader.io had worked well for us up until this point, they might have run into resource limits. The Heroku team volunteered to reach out to both to investigate further.

In the meantime, Heroku suggested we give BlazeMeter a try. We did, and our API was now sustaining the heaviest load we’d seen yet! Finally, we had hit our biggest breakthrough and had reason to celebrate.

We learned another valuable lesson here that the tools themselves can influence test results. Use a mixture of tools to expose their weaknesses.

T-Minus: The Pre-Launch

With our launch imminent, we pre-warmed our ELB and scheduled one last test. Unlike the others, we sustained this test for an hour, and we randomized the input to mimic real-world data.

We set the controls to maximum and watched the numbers fly upward… accompanied by a 2% error rate. It wasn’t the end of the world, but we knew we had done better. What happened?

We investigated our logs again and, after some head-scratching, we found one pattern: all of our errors were coming from the same two dynos.

May 08 15:43:53 … heroku/router: at=error code=H12 desc="Request timeout" … dyno=web.9 connect=1ms service=30000ms status=503 bytes=0

May 08 15:43:54 … heroku/router: at=error code=H12 desc="Request timeout" … dyno=web.9 connect=0ms service=30000ms status=503 bytes=0

May 08 15:43:54 … heroku/router: at=error code=H12 desc="Request timeout" … dyno=web.9 connect=1ms service=30002ms status=503 bytes=0

A snippet of our logs from that test.

We reached out to Heroku with this information, and they saw that those dynos had run out of memory and were swapping. From this, we learned to watch our logs for R14 “Out of Memory” errors and restart the incriminating dynos.

We ran our now very familiar tests again, held our breath…

…And hit upward of 10,000 reqs/sec with sub-100ms response times!

Mission accomplished.


A New Hope: The Launch

Having finally reached a point where we could exceed our target throughput with real-world traffic, we declared our quest a success. We made contingency plans for traffic spikes and failures, such as increasing cache times and adding more database and load balancer instances. And then we launched…

…to much less traffic than we expected. We had over-prepared. Fortunately, that’s turned out to be a benefit: we’ve had no fires to put out so far, and we can serve Paper worldwide from just a few Heroku dynos.

Our final stack. Moar internet!

Looking back, our journey led us through every part of our stack, and required us to correlate memory, CPU, network I/O, response time, error rate, connection reuse and more. We isolated wherever we could, and we tested even the simplest hypotheses — especially when we ran out of ideas.

Today, our API runs with a sub-0.01% error rate and a sub-50ms median response time. We still have plenty of work on our plate as we look to expand this service, but we’re proud of where we stand.

We hope this post inspires you to persevere as you build and test your own web services. We welcome your comments and questions at @FiftyThreeTeam.

Source: Fiftythree