Here is a tale of how we leverage redundant datacenters, redundant code, and multi-tiered fallbacks in the quest for uptime.
High availability is important for any site operating at scale, but at bitly
it is particularly important; people expect bitly links to work, no matter
what. We have enterprise customers who rely on them for key metrics, users who
share them on social networks, and websites with custom short domains that
trust us to serve requests with their name on them. A bitly link not working in
any of these scenarios would make our users look bad, so it is something we take very seriously.
No matter how redundant, distributed, and fault tolerant your main
infrastructure is, things can always go wrong. Recently even Google Apps and Search,
probably the most distributed infrastructure in existence, suffered outages.
There are unknowns
everywhere, and ultimately you have to plan for any part of your infrastructure
breaking for unknown reasons. Under failure, a distributed system should
degrade gracefully, not suddenly. This is why we created Z Proxy.
Z Proxy is an application that serves decodes (this is what we call redirecting from a
short bitly link to its long URL, and what happens every time you click on a bitly
link) without relying on any other part of the bitly infrastructure. This means
that it does not use our primary database of urls, or any of our other servers, to do
lookups. So how does it work?
Z Proxy is essentially a self-contained wrapper around S3, written in Go. When
all of bitly is running properly, every time a link is shortened, a message is
put on NSQ, which a queuereader later grabs. A queuereader then
writes the short and long urls into S3 so that Z Proxy can perform lookups against S3
by short url, get the long url, and serve a 301 or 302 redirect. To the browser,
nothing looks any different.
There are multiple hosts running Z Proxy in EC2. This location provides proximity to S3,
high availability, and, most importantly, a failure profile different from that of the
main decode infrastructure, which lives outside of AWS. EC2 and S3 can have problems, but
the chance of them failing at the same time as our other datacenter is extremely low, and
running in both gives us flexibility.
Each host has a local memcached instance used to cache the slow S3 lookups.
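The caching pattern here is a standard cache-aside: check the fast local store, and only on a miss pay for the slow S3 round trip, populating the cache on the way back. A minimal sketch, with a function standing in for the S3 GET and a map for memcached (neither is bitly's actual code):

```go
// Cache-aside sketch: consult a local cache (memcached in the real
// system) before falling back to the slow S3 fetch. Names are
// illustrative assumptions.
package main

import (
	"fmt"
	"sync"
)

type CachedStore struct {
	mu    sync.Mutex
	cache map[string]string               // stands in for local memcached
	slow  func(key string) (string, bool) // stands in for the S3 GET
}

func (c *CachedStore) Get(key string) (string, bool) {
	c.mu.Lock()
	if v, ok := c.cache[key]; ok {
		c.mu.Unlock()
		return v, true // cache hit: no S3 round trip
	}
	c.mu.Unlock()
	v, ok := c.slow(key)
	if ok {
		c.mu.Lock()
		c.cache[key] = v // populate the cache for the next request
		c.mu.Unlock()
	}
	return v, ok
}

func main() {
	calls := 0
	s := &CachedStore{
		cache: map[string]string{},
		slow: func(key string) (string, bool) {
			calls++ // count simulated S3 round trips
			return "https://example.com/some/long/path", true
		},
	}
	s.Get("abc123")
	s.Get("abc123") // second lookup is served from the cache
	fmt.Println("slow lookups:", calls)
}
```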
Usually there are many more steps
involved in a decode, but Z Proxy skips any that are not strictly essential, such as
spam checking. Because it has fewer
features than the main decode path, and because it is written in optimized Go,
this is a lightweight way to serve our decodes (thousands a second)
in a failure scenario. We keep sufficient capacity on these systems
to be ready for a failure at any time.
Because we use NSQ, even if the primary infrastructure is down, hosts running Z
Proxy (we call these “lastresort” hosts) can create and queue messages
corresponding to each decode request. That means when everything is back
up and running, the primary infrastructure will process messages from these
hosts. Info+ pages will be updated with clicks that happened when everything
was down, ratelimits will be adjusted, realtime will find new
trends based on these clicks, and more.
Z Proxy also records metrics for internal use. It sends data to graphite
recording response times, types of requests, etc., but of course since it makes
no assumptions about anything in our infrastructure working, graphite included,
it also aggregates some stats locally.
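Local aggregation of this kind can be as simple as in-process counters that keep accumulating whether or not a graphite send succeeds. A minimal sketch (illustrative names, not bitly's code):

```go
// Sketch of local stat aggregation: counters live in-process, so
// metrics keep accumulating even if graphite is unreachable.
package main

import (
	"fmt"
	"sync"
)

type Stats struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewStats() *Stats { return &Stats{counts: map[string]int{}} }

func (s *Stats) Incr(name string) {
	s.mu.Lock()
	s.counts[name]++
	s.mu.Unlock()
}

// Snapshot returns a copy of the counters; a background goroutine
// could try to ship a snapshot to graphite and simply keep
// aggregating locally if the send fails.
func (s *Stats) Snapshot() map[string]int {
	s.mu.Lock()
	defer s.mu.Unlock()
	out := make(map[string]int, len(s.counts))
	for k, v := range s.counts {
		out[k] = v
	}
	return out
}

func main() {
	st := NewStats()
	st.Incr("decodes.301")
	st.Incr("decodes.301")
	st.Incr("decodes.404")
	fmt.Println(st.Snapshot()["decodes.301"])
}
```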
Normally our DNS points to our load balancers, which send requests off to
frontend webservers. Nginx on each frontend webserver is configured to handle local
timeouts and failures by transparently retrying the request against a lastresort host. Nginx on
each lastresort host then sends the request to one of a few local Z Proxy
instances. This is great because it allows failovers on a per-request basis,
but if our frontend servers or load balancers are taken out (i.e. we lose datacenter connectivity),
it doesn't help. In this case, we can point DNS for all of our domains directly at the lastresort hosts.
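The per-request failover described above is a standard nginx pattern. A hypothetical sketch, not bitly's actual configuration (hostnames, ports, and the exact retry conditions are assumptions):

```nginx
# Hypothetical sketch: retry failed requests against a lastresort host.
upstream decode_backends {
    server frontend1.internal:8080;
    server frontend2.internal:8080;
    # Marked "backup": only receives traffic when the servers above fail.
    server lastresort1.example.com:8080 backup;
}

server {
    listen 80;
    location / {
        proxy_pass http://decode_backends;
        # Transparently retry the next upstream on errors and timeouts,
        # so a single failed backend never surfaces to the browser.
        proxy_next_upstream error timeout http_502 http_503;
    }
}
```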
The trust-nobody approach of Z Proxy makes it very stable, but ultimately it
could still break, so even the Go app isn’t enough.
As an additional level of safety, the S3 key is the short link, but the
value isn't actually the long url itself. The S3 value is an HTML blob
containing a meta refresh to the
destination url. This allows Z Proxy to parse out the long url,
but also allows nginx on lastresort hosts to proxy the 200 responses directly from S3 if Z
Proxy goes down.
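The exact markup of that blob isn't published, but the idea can be sketched in Go: render an HTML page whose meta refresh points at the destination, and parse the destination back out of it. Both the markup and the helper names below are assumptions for illustration:

```go
// Sketch of the S3 value format described above: an HTML blob with a
// meta refresh. Z Proxy can parse the destination back out to serve a
// real redirect, while nginx can proxy the blob as-is with a 200.
package main

import (
	"fmt"
	"regexp"
)

// blobFor renders the HTML value stored under the short-link key.
func blobFor(longURL string) string {
	return fmt.Sprintf(
		`<html><head><meta http-equiv="refresh" content="0; url=%s"></head></html>`,
		longURL)
}

// urlRe pulls the destination back out of the blob.
var urlRe = regexp.MustCompile(`url=([^"]+)"`)

func parseLongURL(blob string) string {
	m := urlRe.FindStringSubmatch(blob)
	if m == nil {
		return ""
	}
	return m[1]
}

func main() {
	blob := blobFor("https://example.com/some/long/path")
	fmt.Println(parseLongURL(blob))
}
```

Storing a browser-renderable value means the deepest fallback tier needs no application code at all, just something that can serve bytes from S3.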
This multi-tier approach to failures gives us increasing
levels of availability with decreasing levels of features, metrics, and performance.
This system gives us confidence that we can serve decodes with high availability,
and in the event of an outage or failure, it gives us options for where to send traffic.
Because our link resolution dataset is immutable, S3 is an invaluable tool. While we
might take slightly different approaches with dynamic data, layering
fallbacks on S3 and transparently retrying across datacenters is a simple and effective
way to provide high availability.