Building a Version-Controlled Data Aquarium

Benchling is a cloud platform for scientific research. Our customers generate a tremendous amount of scientific data, and with all of this data in one place, they’re able to answer questions to inform the next steps in their scientific process — questions like, “Where are the bottlenecks in my scientific process?” or, “Which of these drug candidates […]

Benchling is a cloud platform for scientific research. Our customers generate a tremendous amount of scientific data, and with all of this data in one place, they’re able to answer questions to inform the next steps in their scientific process — questions like, “Where are the bottlenecks in my scientific process?” or, “Which of these drug candidates are most effective?”.

In addition to the scientific data itself, we also care about data about the data. In order for our Product and Engineering teams to build a best-in-class product, we need to know how our customers are interacting with the product as they create and analyze their data. As a mature SaaS company, we also have to ensure that our business is operationally efficient, and provide visibility to our Sales, Marketing, and Finance teams. We need to be able to answer questions like, “How has usage of our molecular biology product changed over time?”, and “Is customer X using all of their purchased licenses?”.

To answer these questions, we set out to build a robust data platform to capture information from a variety of sources: event data about actions that users take within the product, data from Salesforce about our Marketing and Sales funnels, and more. In this post, we’ll talk about how we designed our “data aquarium” to handle all of this data at scale.

Out with the old…

In our early days, we really only cared about a few simple product metrics. We gathered these metrics by ingesting customer event data into our warehouse via a simple data platform, and we would point BI tools at this warehouse to generate usage reports.

Our original data platform

This was sufficient at the time; our product surface area was significantly smaller, and we didn’t have the complex operational needs that we have today. Our original data platform was built in service of answering simple questions like, “How many customers do we have?”. As the product expanded and our business matured, these questions became more involved — now we wanted to know “How many Notebook entries were created by customer X?”, which then grew into “How does Notebook entry creation improve license utilization for customer X?”.

As our questions grew more complex, so did the data we ingested. The number of data sources increased, the data ontology became more complicated, and queries became unwieldy. Over the course of 2020, we identified opportunities for improving our current infrastructure by boosting reporting performance, optimizing dashboard development velocity, and improving dashboard health.

Reporting performance
When evaluating our reporting performance, we found two key areas for improvement:

  1. Our BI tool operated on top of raw tables. Because it wasn’t operating on top of a coherent data model, it performed a lot of redundant joins and aggregations on the fly. This meant that a lot of reports were bottlenecked on a few tables.
  2. Our warehouse was a cluster that coupled storage and compute. This meant that these slow report-generating queries also competed for resources with ad-hoc queries made by folks across the business. This resource contention exacerbated the effects of these slow queries, resulting in unacceptable reporting performance.

While we could have temporarily resolved these issues by adding more compute resources, the number of schemas and tables in our warehouse was growing too quickly to do this in a sustainable way, as our product broadened and customer volume increased.

Dashboard development velocity
The addition of these new data sources resulted in a decrease in dashboard development velocity. Developers were left with the tricky problem of aggregating information about our customers across event tracking, application database data, sales data for account management, and other data sources. There was a clear opportunity here to leverage a different data transformation framework in order to have more consistent and maintainable code.

Dashboard health
Because we had no testing framework or version-controlled code, it was difficult to evaluate the impact of code changes to our dashboards. Without tests, it wasn’t clear if a code change would cause a regression, and when a regression inevitably surfaced, it was nearly impossible to identify which code change caused it without version control. Our mounting data integration use cases made it clear that we needed to further invest in our release process so that we could easily roll back code changes to lessen the “worst case” impact of an outage.

In service of these goals, it became clear that we needed to add some new components as the number of data sources and data integration tasks in front of us rapidly increased.

…in with the new!

As we set out to design our data aquarium, we had three guiding principles in mind:

  1. Buy rather than build. Speed was of the essence (and still is!) given where we were as a company, and the opportunity cost of rolling out our own solution was too high.
  2. Analysts ought to own the end-to-end implementation of their pipelines. As the engineers owning the underlying data infrastructure, our job was to increase the analysts’ velocity.
  3. It was imperative to ensure our platform complied with a strict set of best governance practices. This meant ensuring strong role-based access controls, strong authentication, and auditability.

We worked with our data analysts, data scientists, and their internal stakeholders to define the goals of our data aquarium. We categorized our problems into the following areas:

Here, we’ll focus on the storage and compute components of our data platform, the continuous integration layer, and how both of these together make version control tractable. (Stay tuned for our next post on data resource management and the data development process!)

Scaling our Data Platform with Snowflake

A key decision was which technology to leverage for our data platform. We selected Snowflake, which was advantageous for a variety of reasons. Snowflake separates its storage and compute; data is stored in Amazon S3, and compute units can be scaled vertically. This means we can designate different workloads to use an appropriate amount of resources — for instance, user queries won’t need to compete for resources when transformation workflows kick off.

Our initial Snowflake setup roughly mirrored Fishtown Analytics’ approach.

Data Version Control

Snowflake makes data version control possible via its “time travel” feature. We now have the ability to access historical checkpoints at both a temporal level and a transactional level. The transactional-level view of the warehouse state is particularly interesting, as it provides us with a way to attribute state snapshots of the warehouse to queries run by our pipelines. Now, if a data regression is found, it’s possible to replay the warehouse state, transaction by transaction, to figure out which query was responsible for it.

Continuous Integration

While version-controlled data was a big step up from our previous implementation, we couldn’t leverage it to its fullest potential without having version-controlled code. Without any kind of continuous integration, we had raw data dumps into Snowflake happening in parallel with data transformations. Not only did we have tables being created from scheduled ETL, we also had many one-off tables created by analysts to answer ad-hoc questions. This led to a few problems:

  • We couldn’t guarantee what version of code generated the tables and views.
  • It was difficult to know whether a change in the transformation code would have consequences on downstream tables.
  • Testing changes to the transformation code was not possible without affecting the data that powered our reports.
  • Analysts didn’t really have a development environment for testing code — they were running ad-hoc transformations on data being directly brought into Snowflake.

For application engineering, the answer is clear — we need separate development, staging, and production environments for our data pipeline code. But what does a development or staging environment look like? In a traditional web application like the Benchling product, staging and production containers are generally identical with different configurations. In data engineering, however, everything is transformable data. This is in part due to the complexity of having multiple data sources: there is no concept of “staging” vs. “production” environments for the operational data sources such as Salesforce (it doesn’t really make sense to have “staging” revenue numbers), so testing the code that integrates these data sources needs to be done on a copy of the data brought into our warehouse.

Luckily, Snowflake allows us to create zero-copy clones of data. This means that we can have duplicate objects without creating any physical copies or incurring additional storage costs. Now, we have the ability to think about separate environments by creating copies of our data across each environment.

With this solution in hand, we designed a snowflake warehouse that ingests the source data into a “raw” database, from where it is cloned into three separate databases daily: “dev”, “staging”, and “stable”. All of these databases live within our snowflake warehouse and are subject to the high standard of data access control that we have for all data at Benchling.

Our continuous integration pipeline

Now, analysts can develop and test their code against data within the “dev” database, which is then released into the “staging” database. From there, after integration tests are run against the affected tables, the code is re-deployed to our production pipeline executor, which generates a “stable” database that our BI tools consume from. While this release pipeline doesn’t completely guarantee that the “stable” database never breaks, it does dramatically reduce the risk of data quality issues, as we can now check our code against known data before pushing that code into production.

The New and Improved Data Aquarium

Our new and improved data aquarium

Zooming out, we can see that our data aquarium isn’t all that different structurally from our original data platform. By making just a few key changes to our data infrastructure, we were able to design and build a system that can support the scale and complexity of our data, and dramatically improve the health of our reports:

  1. Separating storage and compute in our data platform makes it easy to scale our data workloads.
  2. Data versioning not only helps with disaster recovery, but also enables us to have version-controlled data at a transactional level.
  3. Use of a coherent data model means that our dashboards are more performant because we don’t perform a lot of redundant joins and aggregations on the fly.
  4. Creating three separate data environments allows us to test code changes before release.

As a result, we’ve seen noticeable improvements in query performance in our reports, greatly increased our dashboard development velocity, and removed the brittleness and instability of our dashboards — enabling us to better answer questions and provide new insights about our customer data.

Stay tuned for our next post, where we’ll discuss how we were able to leverage dbt, Airflow 2.0, and ECS Fargate in our data development process!

We’re hiring!

If you’re interested in solving data engineering problems at scale, check out our careers page or contact us!

Building a Version-Controlled Data Aquarium was originally published in Benchling Engineering on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Benchling