Catching bugs on the client-side: how we developed our error tracking system
Our team develops several products, including Badoo and Bumble, two of the world’s largest dating and connection applications. For both, we have a web version (desktop and mobile) and mobile applications (Android and iOS). With millions of users, it’s important for us to gather client-side errors, and for this we use an in-house system code-named Gelato. For the last two years I have been involved in developing its server side, and throughout this time I have discovered a lot of new things about developing error tracking systems that I would like to share with you in this article.
What we will cover:
Firstly, and most obviously, we track errors in production. Nobody is safe from these errors, hence the importance of tracking them, finding out how many users have been affected, and promptly fixing the most critical ones.
Secondly — we conduct error analysis.
At Bumble we release new versions of applications quite often:
Error analysis is always one of the steps in releasing a new version of the application. For this, the release manager needs a summary report listing the errors in that version. This enables them to decide whether it is safe to deploy the build to production or whether the build contains a bug that eluded our QA, in which case the report makes it clear that the broken feature needs removing from the release.
Thirdly — having all the error information available in one place simplifies the work of developers and QA engineers.
Historically, we used two systems to collect client errors: HockeyApp for collecting crash reports from native applications, and our own system for collecting JS errors (written in PHP).
HockeyApp met our needs perfectly until it was acquired by Microsoft in 2014. Microsoft changed HockeyApp’s usage policy and began encouraging people to switch to their new system, AppCenter. At that time AppCenter did not meet our requirements: it was still in active development, and some of the functionality we needed was missing, in particular the deobfuscation of Android application stack traces using DexGuard mapping files, without which error grouping is impossible.
I’ll look at deobfuscation in detail later, so if this is the first time you’ve come across the term, that section should help.
A deadline was set: by October 16th, 2019, all HockeyApp users were to have migrated to AppCenter. However, support for DexGuard mapping files would only be added to AppCenter at the end of December 2019, a few months after the official termination of HockeyApp.
In addition, we ran into a problem with HockeyApp calculating the total number of errors incorrectly. Since no further development was to be done on HockeyApp, we had to start duplicating this information into our internal analytics system to see the real number of errors.
As for the system for collecting JS errors that we had developed in-house, it worked flawlessly for many years despite having only basic functionality.
The architecture was quite simple:
In 2017, our frontend development team approximately doubled in size. The system began to be used more actively and the developers soon became increasingly aware of its limitations.
Having collected and analysed all the requirements our team would ideally like to have, we realised that it was going to require more than just a little hard work to improve the current solution. Developed back in 2014, the system was now obsolete, and the cost of refactoring would exceed the cost of implementing a new solution.
So, the decision was made to gradually switch to a new system that would cover all the existing functionality and meet all our requirements.
Of course, before writing our solution we analysed existing systems on the market.
There are a lot of SaaS solutions out there for tracking and monitoring errors, and this is not surprising: fast detection and fixing of errors is a key aspect of modern development. Among the most popular services are Bugsnag, TrackJS, Raygun, Rollbar and Airbrake. All of them have rich functionality and generally meet our requirements, but we did not consider cloud solutions. Migration to a new solution is a rather complicated and lengthy procedure, and we were concerned that the pricing and usage policies could well change over time, as happened with HockeyApp.
With open-source systems, things were not so rosy. Most of them either stopped developing or never emerged from the development stage and were not recommended for use in production.
In fact, only Sentry continued to evolve and had most of the functionality we needed. But at that time (early 2018), the eighth version of the service did not suit us for the following reasons:
In July 2018, the ninth version of Sentry was released. It introduced integration with issue trackers and laid the foundation for key improvements (in my opinion) — the transition to ClickHouse for storing events (I recommend this series of articles on this). But unfortunately, at the time of our research, none of this even figured in the plans. Therefore, we decided that the best option in our case would be the implementation of our own system, customised for our processes and therefore easy to integrate with other in-house tools.
So, a system codenamed Gelato (General Error Logs And The Others) was born, the development of which is discussed further below.
As they say, it is better to see once than hear a hundred times, so first I will show what our system can do now so that it becomes clear how we work with errors. This is important for understanding the architecture of the system: how data is used determines how it should be stored.
The main page contains a list of applications and general error statistics for a given criterion.
By clicking on a particular application, we are taken to a page with its release statistics.
By clicking on a particular version, we are taken to a page listing error groups.
Here we can see what errors occurred, how many there were, how many users were affected, when the error first occurred, and its most recent occurrence. Also, we can sort the data by most of the fields and create a ticket in Jira for any error.
This is how this page looks for native applications:
By clicking on a particular error, we are taken to a page giving detailed error information.
Here you can see general information about the error (1), a graph of the total number of events (2), and various analytics (3).
It also contains information about specific events, which is mainly used to analyse the problem.
Another interesting feature that I have never seen in similar systems is release comparison. This makes it quite easy to detect errors that have appeared in the new version, as well as those that were fixed in previous releases but later began to appear again (regressions).
Select releases to compare:
And we get to a page with a list of errors that are in one version but not in the other:
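At its core, a comparison like this can be thought of as a set difference over grouping keys. The following is only a minimal sketch of that idea, not the way Gelato implements it; the function names and the sample keys are hypothetical.

```python
from typing import Dict, Iterable, Set


def compare_releases(old_keys: Iterable[str], new_keys: Iterable[str]) -> Dict[str, Set[str]]:
    """Split grouping keys into those seen only in the new or only in the old release."""
    old, new = set(old_keys), set(new_keys)
    return {"only_in_new": new - old, "only_in_old": old - new}


def regressions(older_keys: Iterable[str], previous_keys: Iterable[str],
                new_keys: Iterable[str]) -> Set[str]:
    """Keys that existed in an older release, vanished in the previous one, and are back."""
    return (set(new_keys) - set(previous_keys)) & set(older_keys)


# Hypothetical grouping keys for three consecutive releases.
v1 = ["a1", "b2", "c3"]
v2 = ["b2", "c3"]
v3 = ["a1", "c3", "d4"]

diff = compare_releases(v2, v3)
print(sorted(diff["only_in_new"]))      # ['a1', 'd4']
print(sorted(diff["only_in_old"]))      # ['b2']
print(sorted(regressions(v1, v2, v3)))  # ['a1']
```

In a real system the key sets would come from a query against the event store rather than from in-memory lists.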
As you may have noticed, we have implemented a basic set of functions that covers most use cases. We do not intend to stop here, though, and will shortly be adding many useful features that expand the capabilities of the system, including:
Now let’s go under the hood to see how everything works. The scheme is pretty standard and consists of three stages:
This can be depicted schematically as follows:
Let’s get started with data collection.
We proceed from the assumption that the developers of the client application have already taken care of error handling on their side, and all that is required from our service is to provide an API for sending error information in a certain format.
What does the API have to do?
Why do we need an intermediate queue?
If we know we have a fairly low EPS (errors per second), and that all parts of our system will work in a stable fashion all the time, then we can significantly simplify the system and make the whole process synchronous.
But you and I know that the real world is not like this: at any stage, at the most inopportune moment, something unexpected can happen, and our system has to be ready for it. For example, an error in one of the application’s external dependencies can make the application start crashing, which leads to an increase in EPS (as was the case with the iOS Facebook SDK on July 10, 2020). As a result, the load on the entire system increases significantly, and with it the processing time for each request.
Or, for example, the database might become temporarily unavailable, so the system will simply be unable to save the data. There can be many reasons for this: problems with network equipment, a data centre employee accidentally touching a cable so that the server switches off, or the disk running out of space.
Therefore, to reduce the risk of data loss and make data collection as fast as possible (so that the client does not have to wait a long time for a response), we save all incoming data to an intermediate queue, which is processed by a separate script in our cloud.
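The split between a fast collecting API and a separate processing script can be sketched as follows. This is an illustration only: Python’s in-memory queue.Queue stands in for a durable queue, and the names handle_request and drain, as well as the status codes chosen, are assumptions rather than parts of our actual service.

```python
import json
import queue

# In-memory stand-in for the intermediate queue (hypothetical; a real
# deployment would need a durable queue that survives restarts).
event_queue: "queue.Queue[str]" = queue.Queue()


def handle_request(raw_body: str) -> int:
    """API side: do only a cheap validity check, enqueue, and answer immediately."""
    try:
        json.loads(raw_body)
    except json.JSONDecodeError:
        return 400
    event_queue.put(raw_body)
    return 202  # accepted for later processing


def drain(store: list) -> int:
    """Worker side: move everything currently queued into the store."""
    processed = 0
    while True:
        try:
            raw = event_queue.get_nowait()
        except queue.Empty:
            return processed
        store.append(json.loads(raw))
        processed += 1


status = handle_request('{"message": "boom"}')
stored: list = []
drain(stored)
print(status, stored)  # 202 [{'message': 'boom'}]
```

The point of the design is that the client gets its response before any slow work (storage, deobfuscation, symbolication) happens, and a temporarily unavailable store only delays the worker instead of losing events.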
What can be used as a queue?
Here there are two questions we need to answer: “Where to store?” (database) and “How to store?” (data model).
When implementing a prototype of the system, we settled on two options: Elasticsearch and ClickHouse.
Among the main pros of Elasticsearch, I would highlight the following:
Of course, like any system, Elasticsearch also has cons:
The pros of ClickHouse are:
But at the beginning of 2018, ClickHouse was missing some of the functions we needed:
Actually, we could have circumvented all the above restrictions (there were even several articles on this topic on the Web), but we wanted to implement a prototype of our system at minimal cost. For this, we needed a more flexible database, so we opted for Elasticsearch.
Of course, in terms of write performance, Elasticsearch is inferior to ClickHouse, but for us, this was not critical. Much more important was the support of the functionality we needed and scalability out of the box. The fact that we already had an Elasticsearch cluster, which we were using to collect logs from daemons, was also significant — this meant there was no need for us to set up the infrastructure.
Now let’s talk a little about how we store events.
All our data is divided into several groups and stored in separate indices:
Data is isolated for a specific application (separate index) — this allows us to customise the index settings depending on the load profile. For example, we can keep data of unpopular applications on warm nodes in a cluster (we use a hot-warm-cold architecture).
In order to store both JS errors and crash reports of native applications in the same system, we moved everything that is used to compute general statistics to the top level (error occurrence time, the release in which it occurred, user information, grouping key). Whatever is unique to each type of error is stored in the nested field attributes, which has its own mapping.
The actual idea was borrowed from Sentry and slightly modified during operation. In Sentry, an event has base fields, field tags for data that needs to be searchable, and the extra field for all other specific data.
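To make the shape of such a document more concrete, here is a small sketch. All field names and values below are hypothetical and chosen only for illustration; they are not our actual index mapping.

```python
from typing import Any, Dict


def build_event(app: str, release: str, user_id: str, grouping_key: str,
                occurred_at: str, attributes: Dict[str, Any]) -> Dict[str, Any]:
    """Keep fields shared by all error types at the top level, the rest under 'attributes'."""
    return {
        "app": app,
        "release": release,
        "user_id": user_id,
        "grouping_key": grouping_key,
        "occurred_at": occurred_at,
        # Type-specific payload: browser details for JS errors,
        # thread dumps for native crash reports, and so on.
        "attributes": attributes,
    }


js_event = build_event(
    app="example-web",
    release="1.2.3",
    user_id="u-42",
    grouping_key="9f2c",
    occurred_at="2020-07-10T12:00:00Z",
    attributes={"user_agent": "Mozilla/5.0", "url": "/encounters"},
)
print(sorted(js_event))
# ['app', 'attributes', 'grouping_key', 'occurred_at', 'release', 'user_id']
```

With this layout, general statistics only ever touch the top-level fields, so they can be computed the same way for every type of error.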
So, now we come to what I consider to be the most interesting thing in developing a system for collecting client errors — data processing. Without it, the information that we collected in the previous step will be useless and we will be unable to receive anything except a signal that something went wrong in our application. But our goal is to be able to track and fix the most critical errors as quickly as possible.
Let’s start with a simpler case.
To reduce the size of the application as much as possible, it is customary in the Android world to use special utilities during the build process, which:
You can learn more about this from the official documentation.
There are several popular utilities today:
If the application is built using obfuscation mode, then the stack trace will look something like this:
o.imc: Error loading resources: Security check required
at o.jij$c.apply(Unknown Source:0)
at java.lang.reflect.Method.invoke(Native Method)
Not much can be understood from it, except for the error message. To extract useful information from such a stack trace, we first need to decode it. The process of restoring the original names of obfuscated classes and methods is called deobfuscation; it requires a special file called mapping.txt, which is generated when the application is built. Here is a snippet of such a file:
AllGoalsDialogFragment -> o.a:
java.util.LinkedHashMap goals -> c
kotlin.jvm.functions.Function1 onGoalSelected -> e
java.lang.String selectedId -> d
AllGoalsDialogFragment$Companion Companion -> a
54:73:android.view.View onCreateView(android.view.LayoutInflater,android.view.ViewGroup,android.os.Bundle) -> onCreateView
76:76:int getTheme() -> getTheme
79:85:android.app.Dialog onCreateDialog(android.os.Bundle) -> onCreateDialog
93:97:void onDestroyView() -> onDestroyView
Therefore, we need a service to which we can feed the obfuscated stack trace and the mapping file, and get the original stack trace as output.
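To give a feel for what such a service does, here is a deliberately simplified sketch that restores class names only. It assumes that class lines in the mapping file are unindented and have the form "original -> obfuscated:", while member lines are indented. The class names in the example are made up, and restoring method names, which needs signatures and line ranges, is left out entirely.

```python
import re
from typing import Dict

CLASS_LINE = re.compile(r"^(\S+) -> (\S+):$")
TOKEN = re.compile(r"[\w$.]+")


def parse_class_mapping(mapping_text: str) -> Dict[str, str]:
    """Build {obfuscated class name: original class name} from the class lines."""
    classes: Dict[str, str] = {}
    for line in mapping_text.splitlines():
        if line.startswith((" ", "\t")):  # indented lines describe members
            continue
        match = CLASS_LINE.match(line.strip())
        if match:
            original, obfuscated = match.groups()
            classes[obfuscated] = original
    return classes


def deobfuscate_classes(trace: str, classes: Dict[str, str]) -> str:
    """Replace obfuscated class names in a trace; method names stay untouched."""
    def restore(match: "re.Match[str]") -> str:
        token = match.group(0)
        if token in classes:
            return classes[token]
        owner, _, member = token.rpartition(".")
        if owner in classes:
            return f"{classes[owner]}.{member}"
        return token
    return TOKEN.sub(restore, trace)


# Hypothetical mapping and trace, for illustration only.
mapping = (
    "com.example.ResourceError -> o.imc:\n"
    "com.example.Loader$Callback -> o.jij$c:\n"
    "    void apply() -> apply\n"
)
trace = (
    "o.imc: Error loading resources: Security check required\n"
    "    at o.jij$c.apply(Unknown Source:0)\n"
    "    at java.lang.reflect.Method.invoke(Native Method)"
)
print(deobfuscate_classes(trace, parse_class_mapping(mapping)))
```

Frames from classes that are not in the mapping, such as java.lang.reflect.Method, pass through unchanged.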
We were not able to find a suitable ready-made solution in the public domain (maybe we did not look very hard), but fortunately for us the engineers behind ProGuard (we use DexGuard for obfuscation) were looking out for developers and made the ReTrace utility publicly available; it implements all the functionality needed for deobfuscation.
Using this, our Android developers wrote a simple service in Kotlin which:
Crash reports from the iOS application contain quite a lot of useful information, including the stack traces of all threads launched at the time of the crash (read more about the crash report format here and here). But there’s a catch: stack traces contain only information on the memory addresses where classes and methods are located.
0 libsystem_kernel.dylib 0x00000001bf3468b8 0x1bf321000 + 153784
1 libobjc.A.dylib 0x00000001bf289de0 0x1bf270000 + 105952
2 Badoo 0x0000000105c9c6f4 0x1047ec000 + 21694196
3 Badoo 0x000000010657660c 0x1047ec000 + 30975500
4 Badoo 0x0000000106524e04 0x1047ec000 + 30641668
5 Badoo 0x000000010652b0f8 0x1047ec000 + 30667000
6 Badoo 0x0000000105dce27c 0x1047ec000 + 22946428
7 Badoo 0x0000000105dce3b4 0x1047ec000 + 22946740
8 Badoo 0x0000000104d41340 0x1047ec000 + 5591872
The process of mapping a memory address to a function name is called symbolication. To symbolicate a crash report you need special archives with debug symbols (dSYM), generated at the time of building the application, along with software that can work with these archives.
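The address arithmetic behind symbolication can be illustrated with a toy example. The sketch below assumes we already have a table of function start offsets extracted from the debug symbols; that table and the function names in it are entirely hypothetical, and a real tool reads this information from the dSYM archive.

```python
import bisect
from typing import List, Tuple


def symbolicate(address: int, load_address: int, symbols: List[Tuple[int, str]]) -> str:
    """Map an absolute address to 'name + offset within the function'.

    `symbols` holds (start_offset, name) pairs sorted by start_offset.
    """
    offset = address - load_address  # position inside the binary image
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, offset) - 1  # last function starting at or before offset
    if index < 0:
        return f"<unknown> + {offset}"
    start, name = symbols[index]
    return f"{name} + {offset - start}"


# Hypothetical symbol table; the addresses are taken from frame 2 of the report above.
symbol_table = [
    (21690000, "-[ChatViewController viewDidLoad]"),
    (30970000, "-[ProfileLoader start]"),
]
print(symbolicate(0x0000000105C9C6F4, 0x1047EC000, symbol_table))
# -[ChatViewController viewDidLoad] + 4196
```

The subtraction reproduces the "+ 21694196" offset shown in the report, and the lookup then finds the function whose range contains that offset.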
What can be used for symbolication?
We opted for the last option and wrote a service in Golang which, under the hood, interacts with Symbolic via cgo.
Another aspect that we need to look at is error grouping, because the better errors are grouped, the more quickly you can detect the most critical errors among all other events.
Someone unfamiliar with how error handling systems work might imagine that they use complex algorithms to determine string similarity. In reality, all popular systems use a fingerprint for grouping, because it is easy to implement and covers most cases. In the most basic case, it can be a hash of the error message and stack trace. This is not suitable for all types of errors, so some systems allow you to specify which fields you want to use to calculate the grouping key (or to pass the key explicitly).
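The most basic variant described above, a hash of the error message and stack trace, might look like the sketch below. The choice of SHA-1, the inclusion of the error type and the way the fields are joined are assumptions made for the example, not a description of any particular system.

```python
import hashlib
from typing import Sequence


def fingerprint(error_type: str, message: str, frames: Sequence[str]) -> str:
    """Derive a grouping key from the error type, message and stack frames."""
    payload = "\n".join([error_type, message, *frames])
    return hashlib.sha1(payload.encode("utf-8")).hexdigest()


frames = ["app.js:10 render", "app.js:42 main"]
key_a = fingerprint("TypeError", "x is undefined", frames)
key_b = fingerprint("TypeError", "x is undefined", frames)
key_c = fingerprint("TypeError", "y is undefined", frames)
print(key_a == key_b, key_a == key_c)  # True False
```

The example also shows the weakness of this approach: any variable part of the message (an identifier, a URL, a timestamp) produces a new group, which is why some systems let you choose the fields or pass the key explicitly.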
We decided not to complicate our system and settled on grouping by hash:
The journey from an idea to a fully-fledged transition to a new system took us almost two years, but we are pleased with the result and already have plans to improve the system and integrate it with our other internal products.
If you are planning to start collecting and processing client errors and don’t know which tool to use, then I highly recommend taking a closer look at Sentry, since this service is actively developing and is among the market leaders.
But if you decide to follow our example and develop your own system, then this article gives you the main points you need to bear in mind.