Remedying the API gateway

To expose our internal services to the outside world, we use what is
known as an API Gateway. This is a central point of contact for the
outside world to access the services Envato Market uses behind the
scenes. Taking this approach allows authors to leverage the information
and functionality Envato provides on its marketplaces within their own
applications without duplicating or managing it themselves. It also
benefits customers who want to programmatically interact with Envato
Market for their purchases instead of using a web browser.

The old API gateway

The previous generation API gateway was a bespoke NodeJS application
hosted in AWS. It was designed to be the single point of contact for
authentication, authorisation, rate limiting and proxying of all API
requests. This solution was conceived one weekend as a proof of concept
and was quickly made ready for production in the weeks that followed.

This solution worked well and allowed Envato to expose a number of
internal services via a single gateway, removing the need for consumers
to know which underlying service they were talking to or how to query
it correctly.
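
To make that single-point-of-contact design concrete, here is a minimal
sketch (in TypeScript, using Express and commonly available middleware)
of how a gateway like this ends up owning authentication, rate limiting
and proxying in one process. The routes, hosts and limits are
hypothetical illustrations, not the actual gateway’s configuration.

```typescript
// A single process handling every gateway concern, as the old gateway did.
// All names below (hosts, limits, token handling) are illustrative only.
import express from "express";
import rateLimit from "express-rate-limit";
import { createProxyMiddleware } from "http-proxy-middleware";

const app = express();

// Authentication: reject requests that carry no bearer token at all.
app.use((req, res, next) => {
  const token = req.headers.authorization?.replace(/^Bearer /, "");
  if (!token) return res.status(401).json({ error: "missing token" });
  next();
});

// Rate limiting: one shared policy applied to every route.
app.use(rateLimit({ windowMs: 60_000, max: 100 }));

// Proxying: route prefixes map onto internal backends (hypothetical hosts).
app.use("/v1/market", createProxyMiddleware({ target: "http://catalog.internal", changeOrigin: true }));
app.use("/v1/user", createProxyMiddleware({ target: "http://accounts.internal", changeOrigin: true }));

app.listen(8080);
```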

Here is an overview of how the infrastructure looked:

Whilst building a Ruby client for the Envato API, I noticed some
niggling issues that I fixed internally. However, throughout the whole
process I was getting intermittent empty responses from the gateway.
This was annoying, but at the time I didn’t think much of it since my
internet connection could have been to blame and there wasn’t any
evidence of this being a known issue.

March 2016 saw Envato experience a major outage on the private API
endpoints due to a change that incorrectly evaluated the authorisation
step, resulting in all requests getting a forbidden response. You can
read the PIR for full details, but during this outage many of our
authors got in touch and conveyed their justified frustrations. Due to
this incident, we implemented a number of improvements to the API and
created some future tasks to address issues that weren’t user facing
but would help us answer some questions we had about the reliability of
our current solution.

Following on from these discussions, in April a couple of our elite
authors
got in touch regarding some ongoing connectivity issues with
the API. They were experiencing random freezes in requests that would
eventually time out without a response or warning. During these
conversations they also mentioned they would occasionally see an empty
body in the responses. We spent a great deal of time investigating these
reports and working with the elite authors to mitigate the issue as much
as possible. We finally managed to trace down some problematic requests
and began trying to replicate the issue locally.

Even though we were eventually able to reproduce the issue locally, it
was very difficult to isolate the exact cause of the problem for a
number of reasons:

  • The single API gateway application had so many responsibilities that
    tracing a request showed it crossing concerns at every turn.
  • We were using third party libraries for various pieces of
    functionality, but the versions we were running were quite old and
    included many custom patches we had added along the way to fit our
    needs.
  • The proxying functionality (used for sending requests to the
    backends) didn’t perform a simple passthrough. There was a great
    deal of code covering discrepancies in behaviour between backends,
    and the content was rewritten at various stages to conform to
    certain expectations (a sketch of this kind of rewriting follows
    this list).
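
To illustrate why that last point made debugging so hard, here is a
hedged sketch of a proxy handler that rewrites backend responses rather
than passing them straight through. The backend URL and field names are
hypothetical; the point is that every rewrite step is another place a
response can be emptied or mangled on its way to the client.

```typescript
// Illustrative only: a proxy route that reshapes a backend response before
// returning it, instead of a simple passthrough. Requires Node 18+ for fetch.
import express from "express";

const app = express();

app.get("/v1/market/:item", async (req, res) => {
  // Forward the request to a (hypothetical) internal backend.
  const upstream = await fetch(`http://catalog.internal/items/${req.params.item}`);
  const payload = (await upstream.json()) as {
    item_id: string;
    display_name: string;
  };

  // Rewrite the body to match the shape external clients expect.
  // Each transformation like this is a spot where content can go missing.
  res.status(upstream.status).json({
    id: payload.item_id,            // rename backend fields
    name: payload.display_name,
    url: `https://example.com/item/${payload.item_id}`, // synthesise new fields
  });
});

app.listen(8080);
```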

All of the above was made even more difficult by the fact that we have
very little in-house support for NodeJS, and those who are familiar with
it primarily work on front end components, not the backend, so this was
unfamiliar territory for them too.

After spending a few weeks trying to diagnose the issue, we realised we
weren’t making enough headway and needed a better strategy. We got a few
engineers together and started working on some proposals to solve this
for good. During the meeting we decided that, going forward, NodeJS
wasn’t going to work for us and needed to be replaced with a solution
that handled our production workload more effectively and that we knew
how to run at scale.

The meeting created the following action items:

  • Throw more hardware into the mix, with the aim of reducing the
    chance of hanging requests by balancing the load over a larger fleet
    of instances. While this wouldn’t solve the issue entirely, it would
    mean our consumers hit it less often.
  • Find a replacement for the NodeJS gateway. It needed to be better
    supported, designed in a way that gave us better visibility, highly
    scalable and fault tolerant.

The new API gateway

Immediately after the meeting we scaled out the API gateway fleet and
saw a drop off in hanging requests. While the issue wasn’t solved, we
saw significantly fewer occurrences and the pressure eased.

We started assessing our requirements for the new API gateway and came
up with a list of things that we set as bare minimums before a solution
was considered viable:

  • Must isolate responsibilities. If a single component of the service
    was impaired, it should not impact the rest.
  • Must be able to be managed in version control. This was important for
    us since we are big fans of infrastructure as code and all of our
    services take this approach to ensure we can rebuild our
    infrastructure reliably each time, every time.
  • Must maintain 100% backwards compatibility with existing clients so
    that our consumers don’t need to rework their applications to fit
    our changes.
  • Must have great in-house support. If something goes pear-shaped, we
    have the skills to solve the problem.

Following some trialling of PaaS and in-house solutions, we landed on
AWS API Gateway. It met all of our criteria and employed many AWS
products we were already familiar with, which made the transition far
smoother. One problem for us, however, was that much of the
functionality we needed was still under development by AWS; for a long
time we were building against a private beta of the service and hit
various bugs that were still being addressed by the AWS teams.
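
One requirement worth showing in code is the version control one. Below
is a hedged sketch, assuming AWS CDK in TypeScript, of how a gateway
stage can be described as code and kept in a repository; the stack name,
routes, throttling numbers and backend URL are hypothetical examples,
not our actual configuration.

```typescript
// Describing the gateway declaratively so it lives in version control and can
// be rebuilt reliably. Everything named here is an illustrative placeholder.
import * as cdk from "aws-cdk-lib";
import * as apigateway from "aws-cdk-lib/aws-apigateway";

class ApiGatewayStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // A REST API with throttling configured as data rather than hand-written
    // application code.
    const api = new apigateway.RestApi(this, "MarketApi", {
      deployOptions: {
        stageName: "v1",
        throttlingRateLimit: 100,
        throttlingBurstLimit: 200,
      },
    });

    // Each route proxies through to an internal backend over HTTP.
    api.root
      .addResource("market")
      .addMethod("GET", new apigateway.HttpIntegration("http://catalog.internal/items"));
  }
}

const app = new cdk.App();
new ApiGatewayStack(app, "ApiGatewayStack");
app.synth();
```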

We finally managed to ship a private beta of the service to a select few
elite authors in late November and after ironing out a few bugs we
found, we dark launched the new gateway to public use in January.

Here is what the infrastructure and request flow looks like (as of this
writing):

This new infrastructure has allowed us to meet all the requirements we
set out to meet, while also removing much of the confusion around which
components are responsible for what. When we go to perform changes to a
piece of this infrastructure, we know exactly what the impact will be
and how best to mitigate it. The move has also given us a number of
improvements around scalability and resiliency. Now, if we experience a
request surge, the gateway infrastructure is able to scale to meet
demand instead of throwing errors because all the available resources
have been exhausted.

While it’s still early days, we are far more confident in the API
Gateway’s reliability. Since the move we have full visibility into each
component, something that was lacking before and was a major cause of
frustration. Consequently, we are able to measure availability and act
quickly when a component fails.
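
As an example of what measuring availability and acting quickly can look
like, here is a hedged sketch, again assuming AWS CDK in TypeScript, of
an alarm on the standard API Gateway 5XXError CloudWatch metric. The API
name and thresholds are hypothetical.

```typescript
// Alarm when the gateway starts returning server errors, so a failing
// component is noticed quickly. Names and thresholds are placeholders.
import * as cdk from "aws-cdk-lib";
import * as cloudwatch from "aws-cdk-lib/aws-cloudwatch";

class GatewayAlarmsStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string) {
    super(scope, id);

    // 5XX responses recorded by API Gateway for a given API.
    const serverErrors = new cloudwatch.Metric({
      namespace: "AWS/ApiGateway",
      metricName: "5XXError",
      dimensionsMap: { ApiName: "market-api" }, // hypothetical API name
      statistic: "Sum",
      period: cdk.Duration.minutes(1),
    });

    new cloudwatch.Alarm(this, "GatewayServerErrors", {
      metric: serverErrors,
      threshold: 5,
      evaluationPeriods: 1,
      alarmDescription: "API gateway is returning 5XX responses",
    });
  }
}

const app = new cdk.App();
new GatewayAlarmsStack(app, "GatewayAlarmsStack");
app.synth();
```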

P.S. If you haven’t already, why not check out the Envato API?

Source: Envato