Migrating Mountains of Mongo Data

At Addepar, we’re building the world’s most versatile financial analytics engine. To feed the calculations that give our clients an unprecedented view into their portfolios, we need data — from as many sources, vendors, and intermediaries as possible. Our market and portfolio data pipelines ingest benchmarks, security terms, accounting, and performance data from hundreds of integration partners. […]

At Addepar, we’re building the world’s most versatile financial analytics engine. To feed the calculations that give our clients an unprecedented view into their portfolios, we need data — from as many sources, vendors, and intermediaries as possible. Our market and portfolio data pipelines ingest benchmarks, security terms, accounting, and performance data from hundreds of integration partners. Behind this data pipeline is a database. And like every database, ours requires maintenance and care.

Maintaining an obsolete database instance is challenging due to lack of support, inferior performance, and a dwindling developer community. As our dataset grew and we faced increased stability and performance requirements, the engineering group at Addepar decided it was time to upgrade our venerable Mongo 2.4 (TokuMX 2.0) database to the latest and greatest Mongo 3.4. Every organization has at least one database saga, and we’re excited to share one of ours.

Dependency management extends to databases

Upgrading a database is a tricky business. Like other dependencies that support a product, databases have a tendency to fall out of date. Upgrades are deferred until that imagined future where everything is stable, clients have exhausted their feature request lists, and there’s not much to do in the office other than play foosball and exchange memes. It’s an understandable decision! Databases are complicated, leaky abstractions that inevitably form an implicit extension of our application logic. Their data types, atomicity guarantees, and transactionality semantics define the constraints we place on data and drive how we store application state. And because these guarantees (or lack thereof) tend to change (in ways that are sometimes undocumented!) between database versions, moving to the latest release is a risky proposition.

Motivated by our need for an extremely reliable datastore that could handle complex and evolving schemas, we chose Mongo as the sole database for Addepar’s data ingestion pipeline five years ago. We use Mongo not only as a document store for persisting state, but also as a blob-store for capturing intermediate, heterogenous results from the pipeline. As each transformation stage in the ETL pipeline succeeds, its result is decorated with indexable metadata and placed in Mongo. This use-case does not require transaction isolation but benefits from high write volumes, flexible schema definition, multiple indices, easy replication, and automatic failover. However, when we initially decided to use Mongo, the database only supported very coarse locks at the collection (i.e. table) level, limiting write throughput. We were not the only ones frustrated by Mongo’s limited support for concurrent operations. TokuMX, an open-source fork with MVCC semantics and index and document level locking, had been developed to fill some of Mongo’s gaps and we were happy to adopt it.

As our daily data intake grew, it became clear that TokuMX, long since deprecated by its new parent company, was no longer meeting our needs. Conflicting access patterns necessitated indices that were less than optimal for most queries. TokuMX’s index range locks would freeze large chunks of our collections, blocking critical jobs and yielding a worrisome number of operation timeouts. These performance issues made it even more critical that we be able to explore query plans and identify bottlenecks, but diagnostic tools for inspecting the database engine were limited in early versions of the software. Nobody enjoyed wondering why certain operations would sometimes take ten times longer than usual. We needed a new database.

Fortunately, since our adoption of TokuMX, the mainline Mongo distribution had made impressive gains. The newest version of the database offered a much-improved replacement database engine with optimistic document-level locking and radically improved support for server-side aggregation and transformation of documents. I was thrilled to have access to more bulk-write options that could improve the performance of some of our most “chatty” database operations. Upgrading would resolve some of our performance concerns in the short term, resolve bugs that had long-since been patched in newer versions of the software, and allow us to continue scaling with our clients.

Planning the great migration

The decision made, we got started with preparation for our great migration. We needed to be sure that our highly-parallel data ingestion pipeline would continue to operate correctly with the new database. We approached the problem from several directions. To catch low-hanging fruit, we beefed up our existing unit tests around our application-level data access abstractions and checked for compatibility with the new database version. To verify expected behavior of the entire pipeline-database system at a small scale, we repurposed our existing integration test suite, which provides very broad coverage of our data pipeline and utilizes multiple concurrent clients.

We also decided to run both TokuMX and Mongo 3.4 in parallel in a production-like environment. Using Java’s convenient dynamic proxy support, we transparently intercepted each application-level database request, performed the operation against both databases, and compared both the document results and operation times using a separate thread pool so as not to block the application threads on additional computation. With the help of a few scripts and our log aggregation provider, Sumo Logic, we built a detailed view of the results (correctness) and performance characteristics (latency) of all of our database operations, broken down by service, data provider, and other attributes. Pairwise comparison of all requests showed that Mongo 3.4’s operations were on average faster, with less variance in their duration. While (expected and infrequent) transient failures complicated this simple approach, we could confidently proceed with the changeover knowing that the new database would perform at least as well as the old.

The portion of our product reliant on the Mongo database is not client facing or driven by external requests. We were able to take advantage of a weekend maintenance period to migrate our database offline. TokuMX’s on-disk format is incompatible with Mongo 3.4’s, ruling out an in-place swap. While it is possible to use a combination of database snapshots and the Mongo family’s “oplog” (akin to the binlog for our MySQL friends) to perform the migration with only several minutes of downtime, looser business requirement allowed us to use the simpler offline process. Still, we needed to be sure that our database would be back up and running before the end of the weekend. Without the database there would be no pipeline, and without the pipeline, our clients would not have their data in time for Monday’s market open.

Rehearsal makes execution all that much easier

We set up a production-like environment with a fresh Mongo 3.4 cluster, loaded data into that environment, and rehearsed the entire migration procedure several times. This let us experiment with different settings for extracting the data from TokuMX to serialized BSON and re-importing it to Mongo. Tweaking the import configuration for our environment shaved several hours off the whole procedure. (We found that turning off the journal, disabling index restoration, and deferring the creation of the replication set until after the data had been restored reduced time to completion by half.) Careful rehearsal also helped us weed out surprises that would have been quite unwelcome during the production migration. Writing out precisely how the collection index specifications needed to be adjusted for Mongo 3.4 (the primary index is now implied to be unique and the concept of clustering no longer applies) was as important as determining how long it would take to move data between machines and which service users would be authorized to take each action.

When the migration weekend rolled around, the collaboration between our software engineering and DevOps teams paid off. We rolled up our prep work into a detailed script with expected run-times and pre-considered contingencies for possible hiccups. The migration went off with only a few bumps (remember to adjust your configuration management tool to avoid having services unexpectedly restart!). We started processing data again with time to spare until the next business day and with confidence that everything would proceed smoothly. We retired a deprecated, yet critical, dependency, strengthened the relationship between two of our teams, and set the stage for further performance improvements for our clients.

Our upgrade work set the stage for us to server larger clients more reliably than we could before. Features in the latest Mongo distributions are allowing us to tighten up our operational standards while also evolving our models to help our clients make the most of their financial data. We’re looking forward to persisting our next billion documents, recording data on hundreds of billions of dollars of assets, even faster and more reliably than ever!

Migrating Mountains of Mongo Data was originally published in [email protected] on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Addepar