How Airbnb Built “Wall” to prevent data bugs

Gaining trust in data with extensive data quality, accuracy and anomaly checks As shared in our Data Quality Initiative post, Airbnb has embarked on a project of massive scale to ensure trustworthy data across the company. To enable employees to make faster decisions with data and provide better support for business metric monitoring, we introduced Midas, […]

Gaining trust in data with extensive data quality, accuracy and anomaly checks

As shared in our Data Quality Initiative post, Airbnb has embarked on a project of massive scale to ensure trustworthy data across the company. To enable employees to make faster decisions with data and provide better support for business metric monitoring, we introduced Midas, an analytical data certification process that certifies all important metrics and data sets. As part of that process, we made robust data quality checks and anomaly detection mandatory requirements to prevent data bugs propagating through the data warehouse. We also created guidelines on which specific data quality checks need to be implemented as part of the data model certification process. Adding data quality checks in the pipeline has become a standard practice in our data engineering workflow, and has helped us detect many critical data quality issues earlier in the pipelines.

In this blog post we will outline the challenges we faced while adding a massive number of data checks (i.e. data quality, accuracy, completeness and anomaly checks) to prevent data bugs company-wide, and how that motivated us to build a new framework to easily add data checks at scale.

Challenges

When we first introduced the Midas analytical data certification process, we created recommendations on what kind of data quality checks need to be added, but we did not enforce how they were to be implemented. As a result, each data engineering team adopted their own approach, which presented the following challenges:

1. Multiple approaches to add data checks

In Airbnb’s analytical data ecosystem, we use Apache Airflow to schedule ETL jobs or data pipelines. Hive SQL, Spark SQL, Scala Spark, PySpark and Presto are widely used as different execution engines. However, because teams started building similar data quality checks in different execution engines, we encountered other inherent issues:

  • We did not have any centralized way to view the data check coverage across teams.
  • A change in data check guidelines would require changes in multiple places in the codebase across the company.
  • Future-proof implementations were nearly impossible to scale. Teams kept re-inventing the wheel and duplicated code spread across the codebase.

2. Redundant efforts

Different teams often needed to build tools to meet their own requirements for different data checks. Each Data Engineering (DE) team started to build data check tools in silos. Although each of these teams were building solid tools to meet their individual business needs, this approach was problematic for a few reasons:

  • We started to build multiple frameworks in parallel.
  • Data check frameworks became costly to maintain and introduced operational overhead.
  • Missing features and lack of flexibility/extensibility made these frameworks difficult to reuse across the company.

3. Complicated Airflow DAG code

Each check was added as a separate task in Airflow as part of the ETL pipeline. Airflow DAG files soon became massive. The operational overhead for these checks grew to the point that it became hard to maintain, because of a few different factors:

  • There was no support for blocking vs non-blocking checks. Minor check failures or false alarms often blocked the SLA of critical data pipelines.
  • ETL logic and data checks became tightly coupled and not reusable.
  • Maintenance became operationally challenging, as we tracked the dependencies manually, which also made it difficult to add more checks.

Defining the Requirements

To address these tooling gaps, we set out to build a unified data check framework that would meet the following requirements and ensure greater usability overtime:

  • Extensible : Unify data check methodologies in use at Airbnb
  • Configuration-driven: Define the checks as YAML-formatted files for faster development
  • Easy to use: Provide a simplified interface to promote faster adoption company wide

Introducing Wall Framework

Wall is the paved path for writing offline data quality checks. It is a framework designed to protect our analytical decisions from bad data bugs and ensure trustworthy data across Airbnb.

Wall Framework is written in Python on top of Apache Airflow. Users can add data quality checks to their Airflow DAGs by writing a simple config file and calling a helper function in their DAG.

  • Wall provides most of the quality checks and anomaly detection mechanisms currently available in the company under a common framework, making data checks a lot easier to standardize.
  • It supports templated custom SQL-based business logic, accuracy checks, and an extensible library of predefined checks.
  • Wall is config driven — no code is required to add checks.
  • Checks can be used in the ETL pipeline in a Stage-Check-Exchange pattern or as standalone checks.
  • The framework is extensible — any team can add their team-specific checks to Wall quite easily following the open source model (as per the Data Engineering Paved Path team’s approval).
  • Business users can easily add quality checks without creating any airflow DAG or tasks for each check.
  • Wall takes care of SQL-based checks and anomaly detection task creations. It also takes care of stage and exchange task creations and setting the appropriate dependency on the checks in a decoupled manner. Hence, after migrating to Wall, ETL pipelines were drastically simplified and we’ve seen cases where we were able to get rid of more than 70% of DAG code.

Wall Architecture

Following our key requirements, this framework was designed to be extensible. It has three major components — WallApiManager, WallConfigManger and WallConfigModel..

Wall internal architecture

WallApiManager

The Wall Api Manager is the public interface to orchestrate checks and exchanges using Wall. Wall users only use this from their DAG files. It takes a config folder path as input and supports a wide variety of ETL operations such as Spark, Hive etc.

WallConfigManager

The Wall Config Manager parses and validates the check config files and then calls the relevant CheckConfigModels to generate a list of Airflow tasks. Wall primarily uses Presto checks to generate data checks.

CheckConfigModel

Each Wall check is a separate class that derives from BaseCheckConfigModel. CheckConfigModel classes are primarily responsible for validating check parameters and generating Airflow tasks for the check. CheckConfigModel makes the framework extensible. Different teams can add their own CheckConfigModel if existing models do not support their use cases.

Key Features

Wall framework provided the following key features to address the requirements we mentioned above.

Flexibility

  • Wall configs can be located in the same repository where teams are already defining their data pipeline DAGs — teams or DAG owners can decide where they’re located. Teams can either use a separate YAML file for each table or a single YAML file for a group of tables to define checks.
  • Each check config model can define an arbitrary set of parameters and it can override parameters if needed. The same check configs can be orchestrated and run differently based on running context. i.e. as part of ETL’s stage-check-exchange or as pre/post checks.
  • A check property can be hierarchical (i.e. it can be defined at team level, file level, table level or at check level). Lower level property values override upper level values. Teams can define their team level defaults in a shared YAML file instead of duplicating the same configurations and checks in different YAML files.
  • In the case of stage-check-exchange checks, users can specify blocking and non-blocking checks. It makes Wall more flexible while onboarding new checks.

Extensibility

  • It’s easy to onboard a new type of check model. Wall is able to support commonly used data checks/validations mechanisms.
  • Each check config model is decoupled from each other and it can define its own set of params, validations, check generation logic, pre-processing etc.
  • Check config models can be developed by the data engineering community with the collaboration of the Data Engineering Paved Path team.

Simplicity

  • Easy to copy-paste to apply similar checks in different tables or contexts.
  • Check models are intuitive.
  • Checks are decoupled from DAG definition and ETL pipeline so that they can be updated without updating ETL.
  • Easy to test all the checks at once.

Adding a Wall check

At the every high level, users need to write a yaml config and invoke Wall’s API from their DAG to orchestrate their ETL pipeline with data checks.

High level diagram of how users interact with Wall.

As an example of how easy it is to add a new data quality check, let’s assume you’d like to add a data quality check — verifying that a partition is not empty — to a table named foo.foo_bar in the wall_tutorials_00 DAG. It can be done by following these two steps:

  1. Decide on a folder to add your wall checks configs i.e. projects/tutorials/dags/wall_tutorials_00/wall_checks. Create a check config file (i.e. foo.foo_bar.yml ) with the following contents in your wall check config folder:
primary_table: foo.foo_bar
emails: ['[email protected]']
slack: ['#subu-test']
quality_checks:
- check_model: CheckEmptyTablePartition
name: EmptyPartitionCheck

Update the DAG file (i.e. wall_tutorials_00.py) to create checks based on the config file.

from datetime import datetime
from airflow.models import DAG
from teams.wall_framework.lib.wall_api_manager.wall_api_manager import WallApiManager
args = {
"depends_on_past": True,
"wait_for_downstream": False,
"start_date": datetime(2020, 4, 24),
"email": ["[email protected]",],
"adhoc": True,
"email_on_failure": True,
"email_on_retry": False,
"retries": 2,
}
dag = DAG("wall_tutorials_00", default_args=args)
wall_api_manager = WallApiManager(config_path="projects/tutorials/dags/wall_tutorials_00/wall_checks")
# Invoke Wall API to create a check for the table.
wall_api_manager.create_checks_for_table(full_table_name="foo.foo_bar", task_id="my_wall_task", dag=dag)

Validate and Test

Now if you check the list of tasks of the wall_tutorials_00 you’ll see the following tasks created by the Wall Framework:

<Task(NamedHivePartitionSensor): ps_foo.foo_bar___gen>
   <Task(SubDagOperator): my_wall_task>

Wall created a SubDagOperator task and a NamedHivePartitionSensor task for the table in the primary DAG (i.e. wall_tutorials_00). Wall encapsulated all the checks inside the sub-dag. To get the check tasks list you would need to look at the sub-dag tasks i.e. run list_tasks for the wall_tutorials_00.my_wall_task dag. It returns the following list of tasks for this case:

<Task(WallPrestoCheckOperator): EmptyPartitionCheck_foo.foo_bar>
   <Task(DummyOperator): group_non_blocking_checks>
      <Task(DummyOperator): foo.foo_bar_exchange>
<Task(DummyOperator): group_blocking_checks>
   <Task(DummyOperator): foo.foo_bar_exchange>
<Task(PythonOperator): validate_dependencies>

Note: You probably noticed that Wall created a few DummyOperator tasks and a PythonOperator task in the sub-DAG. It was required to maintain control flows i.e. blocking vs non-blocking checks, dependencies, validation etc. You can ignore those tasks and don’t need to take dependencies on these tasks since they may change or can be deleted in future.

Now you can test your check tasks just like any airflow tasks i.e.

airflow test wall_tutorials_00.my_wall_task EmptyPartitionCheck_foo.foo_bar {ds}

Wall in Airbnb’s Data Ecosystem.

Integrating Wall with other tools in Airbnb’s data ecosystem was critical for its long term success. To allow other tools to integrate easily, we publish results from the “check” stage as Kafka events, to which other tools can subscribe. The following diagram shows how other tools are integrated with Wall:

Wall in Airbnb’s data ecosystem

Conclusion

Wall ensures a high standard for data quality at Airbnb and that the standard does not deteriorate over time.

Through enabling standardized but extensible data checks that can be easily propagated across our distributed data engineering organization, we continue to ensure trustworthy, reliable data across the company. As a result all of Airbnb’s critical business and financial data pipelines are using Wall and we have hundreds of data pipelines running thousands of Wall checks every day.

If this type of work interests you, check out some of our related positions:

Senior Data Engineer

Staff Data Scientist- Algorithms, Payments

and more at Careers at Airbnb!

You can also learn more about our Journey Toward High Quality by watching our recent Airbnb Tech Talk.

With special thanks to Nitin Kumar, Bharat Rangan, Ken Jung, Victor Ionescu, Siyu Qiu for being key partners while evangelizing this framework.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only. Use of these names, logos, and brands does not imply endorsement.


How Airbnb Built “Wall” to prevent data bugs was originally published in The Airbnb Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Airbnb