The path to cloud efficiency begins with a cost data foundation
by Anna Matlin and Tamar Eterman
Introduction
Business profitability and sustainability are powerful reasons to invest in infrastructure efficiency, but it is easy to feel lost about how to actually reduce costs. A foundation of robust and actionable data is essential for a successful efficiency program. At Airbnb, building this foundation made it possible to prioritize savings opportunities and ushered in a wave of cost reductions, summarized in a previous post.
More importantly, cost data has become a lever for long-term control. The team can react quickly before a spike wreaks havoc on the monthly bill and plan ahead when a big new project could become expensive. At the company scale, visibility into cost and usage has sparked a cultural shift. When savings can be measured, they can be recognized, and cost efficiency projects become exciting opportunities. As of early 2021, the most viewed dashboard at Airbnb is a dashboard of AWS costs.
We hope that sharing our approach will enable more companies to achieve AWS cost savings. Though Airbnb’s cost data foundation is built with one cloud provider in mind, our learnings from building a pipeline, defining metrics, and designing visualizations apply regardless of the cloud provider.
In the early days of Airbnb’s cost efficiency efforts, the team relied on the Cost Explorer dashboard in the AWS console. Cost Explorer represented a significant improvement over the monthly invoice because it was possible to see data before the end of the month, but it did not provide detailed insights because it was not connected to Airbnb’s data tools. Most teams at Airbnb rely on the data warehouse (i.e., Apache Airflow, Apache Hive, Apache Spark) and extensive analytics infrastructure (i.e., Minerva, Apache Druid, DataPortal, Apache Superset, SLA monitoring) to make data-informed decisions. To take full advantage of the available resources, our team built a pipeline on top of the AWS Cost & Usage Report (CUR), a rich source of raw data.
The pipeline transforms and modifies the CUR data, as illustrated below. We call this pipeline the “Airbnb CUR Pipeline”, and the resulting tables are collectively called the “Airbnb CUR Foundation.” This is because the pipeline enriches CUR data with Airbnb-specific business logic and naming conventions.
The loading and transformation of raw CUR files into the Airbnb CUR Foundation is performed in an Airflow pipeline, which runs daily. We describe this pipeline in more detail below.
Here are some suggestions, drawn from our experience, that helped make the Airbnb CUR Foundation robust and accurate:
Design with downstream use cases in mind. Before building anything, establish the requirements for your pipeline. How will the pipeline service-level agreement (SLA) align with the lag of the raw data? What are the top-line metrics from a financial and engineering perspective, and how will these metrics be interpreted? What are the dimensions, or grouping variables, that will be used to cut and categorize these metrics? We reduced the number of dimensions from ~200 in the raw CUR to the ~30 most useful ones for the Airbnb CUR Foundation. This simplicity makes the downstream tables more usable.
Build for retroactive adjustments. Usage and cost data change retroactively over the course of a monthly billing cycle. This constraint informed architectural decisions. We designed a data model with two types of tables: one that is overwritten with retroactive adjustments and one with immutable historical snapshots. The first kind of table underlies the cost program dashboards, while the second kind of table ensures reproducible calculations for anomaly detection and attribution.
Study the options for obtaining raw data. There is a menu of options for creating a new Cost & Usage Report in the AWS Console. We configured several reports before finally identifying which settings worked best for our downstream requirements. Airbnb’s CUR includes refreshes, versions, hourly data, and resource IDs. The file format was important for successfully ingesting data into the warehouse (via Spark), but companies using Amazon Redshift or Amazon Athena can ingest data without additional processing.
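The two table types described under "Build for retroactive adjustments" can be illustrated with a minimal in-memory sketch. This is not the Airbnb CUR Pipeline itself: the `CostTables` class, its field names, and the sample figures are hypothetical, and a real implementation would write to warehouse tables rather than Python dictionaries.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, Tuple

# Hypothetical row key: (usage_date, service). Values are costs in dollars.
CostRows = Dict[Tuple[date, str], float]


@dataclass
class CostTables:
    """Toy stand-in for the two kinds of tables in the data model."""

    # Overwritten on every load, so it always reflects the latest adjustments.
    latest: CostRows = field(default_factory=dict)
    # One frozen copy per load date, never modified after it is written.
    snapshots: Dict[date, CostRows] = field(default_factory=dict)

    def load(self, load_date: date, rows: CostRows) -> None:
        # Retroactive adjustments replace earlier values for the same key.
        self.latest.update(rows)
        # Snapshots are write-once so that past calculations stay reproducible.
        if load_date in self.snapshots:
            raise ValueError(f"snapshot for {load_date} already exists")
        self.snapshots[load_date] = dict(self.latest)


tables = CostTables()
usage_day = date(2021, 3, 1)
tables.load(date(2021, 3, 2), {(usage_day, "AmazonEC2"): 100.0})
# A later load revises the cost reported for the same usage day.
tables.load(date(2021, 3, 3), {(usage_day, "AmazonEC2"): 92.5})
```

After the second load, `tables.latest` holds the revised value, while the snapshot taken on the first load date still holds the value that was visible at the time. That separation is what lets dashboards show current numbers while anomaly detection reruns against the data it originally saw.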
We recognize that not every company will want to build and maintain a cost data pipeline. There are also many third-party vendors that perform analytics using the CUR. Airbnb’s decision to build versus buy was motivated by the availability of internal resourcing, the need to incorporate custom logic (e.g., discounting), and the opportunity to integrate with the internal data tooling.
Thanks to a close partnership between data science, finance, technical program management, and engineering, the Cost Efficiency Team developed a set of key metrics and dimensions that are immediately actionable when they are surfaced in charts and dashboards. Aligning on important definitions enabled weekly monitoring, capacity purchasing, budgeting, opportunity sizing, and savings measurement. In the section below, we will describe our approach to structuring cost data for maximum insight and impact.
The best metrics for cost efficiency work are simple and well-understood by partner teams.
Top-Line Metrics: The primary metric of the Airbnb CUR data is Cost, in dollars, which incorporates amortization, discounting, and blending as described above. Cost per booking captures the impact of AWS costs on business margins.
AWS Product-Specific Usage Metrics: Unlike cost metrics, usage metrics differ from one product to another. For example, we have defined a vCPU-Hours metric which measures compute usage at the fleet level, accounting for instance size. Usage metrics often reveal growth trends that are not apparent in cost data because of pricing terms. This is especially true for S3 storage, which we measure in GB-Months. Prices for cold storage options such as Amazon Glacier and Glacier Deep Archive are much lower than for Standard Storage, so looking only at cost data could lead us to overlook usage growth in these cold storage categories.
Program Success Metrics: Our Percent Savings Plan Coverage Utilized metric highlights excess or insufficient compute usage compared to the pre-committed Savings Plan amount. This coverage metric is also relevant for other AWS products with reserved instances, such as Relational Database Service (Amazon RDS).
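The three kinds of metrics above reduce to simple arithmetic once the inputs are available. The sketch below is illustrative only: the function names, the small `VCPUS` lookup, and the sample numbers are assumptions, and the utilization formula is one plausible reading of the Percent Savings Plan Coverage Utilized metric rather than Airbnb's exact definition.

```python
# Illustrative lookup of vCPUs per instance type (a real pipeline would
# derive this from instance metadata rather than a hard-coded dict).
VCPUS = {"m5.large": 2, "m5.xlarge": 4, "c5.4xlarge": 16}


def cost_per_booking(total_cost: float, bookings: int) -> float:
    """Top-line metric: dollars of cloud cost per booking."""
    if bookings <= 0:
        raise ValueError("bookings must be positive")
    return total_cost / bookings


def vcpu_hours(usage) -> float:
    """Usage metric: sum of instance hours weighted by instance size.

    `usage` is an iterable of (instance_type, instance_hours) pairs.
    """
    return sum(VCPUS[instance_type] * hours for instance_type, hours in usage)


def pct_of_commitment_used(eligible_usage: float, commitment: float) -> float:
    """Program metric (assumed definition): eligible compute usage as a
    percentage of the pre-committed Savings Plan amount, both in dollars.
    Above 100 means usage exceeds the commitment; below 100 means part of
    the commitment went unused.
    """
    if commitment <= 0:
        raise ValueError("commitment must be positive")
    return 100.0 * eligible_usage / commitment


fleet_usage = [("m5.large", 24), ("c5.4xlarge", 10)]
fleet_vcpu_hours = vcpu_hours(fleet_usage)          # 2*24 + 16*10
unit_cost = cost_per_booking(5000.0, 2000)
utilization = pct_of_commitment_used(120.0, 100.0)
```

Weighting by vCPU count is what makes the usage metric comparable across a mixed fleet: ten hours on a large instance count for more than ten hours on a small one.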
Below are some examples of dimensions that surface meaningful insights from the Airbnb CUR Foundation and metrics.
Other dimensions which we have found to be valuable include Instance Type Family, Instance Type, Usage Type, Storage Class, and Operation. Some highlight general trends, while others are useful for deeper data exploration. For more information about these dimensions, please visit the AWS CUR Documentation.
Below are three notable cost data visualizations, with simulated data.
The Line Item Description field is useful for cost data detective work. In the chart below, grouping the Cost metric by the Line Item Description dimension revealed that a spike in CloudTrail costs was due to data events rather than log data. This finding directed us to look at S3 request patterns and started a conversation with the team owning this data. Ultimately, this investigation reduced daily CloudTrail costs significantly.
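The kind of breakdown described here, grouping a cost metric by a single dimension, can be sketched in a few lines. The row keys (`product`, `description`, `cost`), the `cost_by` helper, and every figure below are made up for illustration; they are not real CUR column names or Airbnb data.

```python
from collections import defaultdict

# Simulated cost rows with simplified, hypothetical field names.
rows = [
    {"product": "AWSCloudTrail", "description": "data events", "cost": 40.0},
    {"product": "AWSCloudTrail", "description": "data events", "cost": 35.0},
    {"product": "AWSCloudTrail", "description": "management events", "cost": 5.0},
    {"product": "AmazonS3", "description": "standard storage", "cost": 60.0},
]


def cost_by(rows, dimension, product=None):
    """Sum cost per value of `dimension`, optionally for one product,
    returning (value, total_cost) pairs sorted from largest to smallest."""
    totals = defaultdict(float)
    for row in rows:
        if product is not None and row["product"] != product:
            continue
        totals[row[dimension]] += row["cost"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


breakdown = cost_by(rows, "description", product="AWSCloudTrail")
for description, total in breakdown:
    print(f"{description}: ${total:.2f}")
```

Sorting the totals in descending order puts the dominant line item first, which is usually the one worth investigating when a spike appears.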
Though having a data foundation opens a world of opportunities, just having the data is not enough. Below we have included a selection of tips to get value out of the data once it exists.
Developing a trustworthy and interpretable cost data foundation set Airbnb up for long-term success in cloud cost management. But data alone is not enough to achieve cost savings. Leadership commitment to savings goals, effective program management, contract management, and technical excellence across Airbnb made the success of the program possible. Engineers pop into office hours to ask about their costs on the company-wide dashboard, and teams proudly share the results of their efficiency projects.
We hope the foundational details and learnings shared in this post will demystify this domain and inspire practitioners at other companies to pursue a data-informed path toward cost efficiency.
Are you passionate about cloud efficiency, or inspired by unique data challenges? We’re always looking for talented individuals to join the team!
The Airbnb CUR Foundation was made possible with the support of many people. We are grateful to Stephen Zielinski, Krishna Bhupatiraju, Tingting Ma, Jinyang Li, Jian Chen, Jon Tai, Yi Chen, Yuhe Xu, Melanie Cebula, and Mingzhu Liu for their technical contributions and architectural advice. Thanks to David Morrison for his thoughtful and constructive feedback reviewing this post. We were fortunate to have support from many managers who have championed this work: Ari Siegel, Jen Rice, Guang Yang, Swaroop Jagadish, Reid Andersen, Brian Wallace, Jason Sobel, and Bharat Rangan.
We would like to express our gratitude to the AWS account team, who have worked with us at every step on our cost efficiency journey: Dan Facchetti, Amulya Sharma, Nathan Perry, and Jeff Maxin. Thank you as well to the many cost efficiency practitioners at other companies who generously shared their experiences.
Amazon Web Services, EC2, Amazon RDS, Amazon Redshift, Amazon Athena, Amazon Glacier, Amazon Elastic Compute Cloud, AWS CloudTrail and Amazon S3 are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.
Apache Airflow, Apache Hive, Apache Spark, Apache Druid, Apache Superset, and Apache are either registered trademarks or trademarks of The Apache Software Foundation in the United States and/or other countries.
Kubernetes is the registered trademark of The Linux Foundation in the United States and/or other countries.
All trademarks are the properties of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.
Achieving Insights and Savings with Cost Data was originally published in Airbnb Engineering & Data Science on Medium.