How we tracked down Ruby heap corruption in amongst 35 million daily requests

Back in November 2015, one of the Envato Market developers made a startling discovery – our exception tracker was overrun with occurrences of undefined method exceptions with the target classes being NilClass and FalseClass. These type of exceptions are often a symptom that you’ve written some Ruby code and not accounted for a particular case […]

Back in November 2015, one of the Envato Market developers made a
startling discovery – our exception tracker was overrun with occurrences
of undefined method exceptions with the target classes being
NilClass and FalseClass. These type of exceptions are often a
symptom that you’ve written some Ruby code and not accounted for a
particular case where the data you are accessing is returning nil or
false. For our users, this would manifest itself as our robot error
page letting you know that we encountered an issue. This was a
particularly hairy scenario to be in because the exceptions we were
seeing were not legitimate failures and replaying the requests never
reproduced the error and code inspection showed values could never be
set to nil or false.

It’s worth noting that during the assessment of these exceptions we were
able to confirm that the security of Envato Market and that of our
community’s data was not impacted. This was a notion we continuously
challenged and ensured was still true throughout every step of our
investigations and if at any stage that was not clear, we stopped and
tightened third party checks and monitoring to make sure we were
certain.

These exceptions were harder to track down than the regular errors that
manifest in a single point of the application and our error tracker
showed the errors had begun sometime in October, although we could not
isolate a specific deployment or change that matched that timeframe.
Initially we tried upgrading and downgrading newly introduced gem
version changes and when none of these stopped the errors, we also
rolled back our Gemfile to previous months with no success. We also
read close to every article on the internet related to Ruby heap
corruption available and tried all the proposed solutions without any
luck.

To add to the problem this issue was only occurring in production. Any
attempts to replicate this locally or in our staging environments never
created the error – even attempting to replay the production traffic
through another environment didn’t create the issue. In production this
was very difficult to create a reproducible test case as we did not know
exactly where the problem originated from. About mid investigation we
did think we had a script to reproduce it however it turned out that it
was just running into the exceptions much like our usual traffic was.

Our suspects

Premature garbage collection

From the early investigations, we suspected a premature GC due to valid
pointers disappearing mid call. This became more suspicious as upgrading
to MRI 2.2.x and introducing generational GC made our situation worse.
To dive further into this we instrumented Ruby GC stats using
Graphite and watched for any unusual
change in object allocations and heap health. However, we could never
find anything that pointed us to a GC problem despite spending a large
amount of our time investigating and tuning this Ruby’s GC behaviour.

Unicorn

Envato Market runs unicorn as the
application server and suspicion was raised after we traced a series of
requests back to the same parent process. Since unicorn uses forking and
copy on write we thought
there could be a chance that the parent process was becoming corrupt and
then passing it onto its children processes. We lowered the worker count
to 1 and attached rbtrace to the
parent and child processes but came up with nothing that looked like
corruption and never managed to capture a segmentation fault in the
watched processes so we were able to rule this out.

C extensions

We use a handful of gems that rely on C extensions to perform the low
level work in a speedy and efficient manner. Using C is a double edged
sword – using it well results in excellent performance whereas using it
poorly (especially unintentionally!) results in very difficult to
diagnose issues unless you are well versed in the language and
environment. For the most part we relied on
valgrind and manual review to identify anything
that we thought could be the cause. To get a list of all gems that we
needed to inspect we use the following snippet which returns a list of
the gems which have a ext directory.

1
puts `bundle show --paths`.split("n").select { |dep| File.directory?("#{dep}/ext") }.map { |dep| dep.split('/').last }.join("n")

Once we had a list of potential culprits we set off reviewing each one
individually and running it through valgrind looking for memory leaks
and anything that didn’t quite look right that might lead us to a heap
corruption scenario. Our investigations here
didn’t solve our specific issue however we did submit
some
upstream
patches to the projects that
we identified potential issues with.

Our Ruby build process

Before the Envato Market application got onto MRI 2.x we ran a custom
Ruby build with a series of patches that required us to build and
maintain our own fork of Ruby. Ruby is in a far better place now so we
rarely need to patch it but we still maintain this process as it allows
us to package up our version of Ruby and ship it to our hosts as a
single package and eliminate attempting to download and build on
individual hosts.

During the investigations there were concerns that the version of Ruby
we were building and deploying was being corrupted at some stage and we
were seeing that as the segmentation faults in production. To
troubleshoot this further, we tried to build our custom version with
additional debugging flags but quickly discovered they were not being
applied as we thought they were in the past. After spending a bit of
time digging into the issue we identified the cause was the position we
were passing in the flags and inadvertently ended up getting trampled on
by ruby-build default options
instead. We fixed this within our packaging script and shortly after we
were able to verify that our version of Ruby matched the upstream Ruby
once it had been packaged up manually and deployed to our hosts.

Getting help

After a few months of trying and eliminating everything we could think
of, we reached out for external help. We had spent a bit of time
analysing our C extensions but we didn’t feel we went deep enough with
our investigations. To get more insight into the C side of things we
teamed up with James Laird-Wah and started
going through our core dumps and gems that relied on C extensions.

After many long hours of debugging and stepping through core dumps, we
had found a smoking gun in the form of a Nokogiri
release
.
Nokogiri RC3 introduced a
patch
for a nasty edge case where if you have libxml-ruby and Nokogiri
loaded in the same application, you could encounter segmentation
faults due to the way some fields were managed. Updating to Nokogiri
RC3 saw us drop the number of segmentation faults to a third of their
previous counts! Looking at this behaviour further we identified the
cause of this edge case was the way libxml-ruby managed the fields
despite the ownership on them. In order to address the remaining
segmentation faults, we got together with James and formulated a patch
that would ensure libxml-ruby would only manage the fields it
explicitly owned. We tested it and deployed it to production where we
monitored it closely for 24 hours. Lo and behold, 0 segmentation
faults! We’d finally found the cause to our issues! Excited with our
discovery we pushed the patch upstream and it’s now available at
xml4r/libxml-ruby#118.

Ensuring this doesn’t happen again

Like most regressions, proactive monitoring and insights into your
normal application behaviour are the best solution for avoiding long
running issues like this. To make sure we are not stung with this again,
we are taking the following measures:

  • Implementing alerting based on any occurrences of segmentation faults
    within our applications. Our applications should not ever be hitting
    segmentation faults and if they are, we need to mark them as a high
    priority and assign engineers to resolve them.
  • Dependency graph review on each bundle update to see when we have a
    lingering dependency and possible removal. The gem we were using that
    leveraged libxml-ruby has since been refactored out but the
    relationships and dependencies were never cleaned up once it was
    removed.
  • Better monitoring and roll ups of exceptions on a team by team basis.
    We are aiming to better integrate our developer tooling with our
    exception tracker to ensure we can quickly identify an increase in
    exceptions and work on a resolution before they become a bigger issue.

Source: Envato