by Jen Rice and Anna Matlin
Airbnb was born in the cloud. In 2008, while many companies were operating data centers, a few clicks on the AWS console brought Airbnb to life. From our first Ruby on Rails app, to our more recent adoption of service-oriented architecture, the ability to instantly spin up compute and storage has enabled our teams to move quickly and meet the growing demands of our business. However, the business value this nimbleness affords can quickly be offset by cloud computing costs, unless the organizational capability to efficiently use these resources is developed.
In the early days of Airbnb, our primary goal was growing the business. Technology teams were focused on growth, and we paid little attention to the cost of running our infrastructure. Several years ago, we noticed AWS monthly cost growth was outpacing revenue growth. We had a problem, but we lacked an in-depth understanding of how teams use AWS resources, and how planned architectural and infrastructure changes would impact our future AWS costs.
After this realization, our technology teams aligned on several areas of work.
In partnership with finance, our technology teams have made tremendous progress toward our operational efficiency goals, and continue to build amazing products in service of the Airbnb business.
Recognizing we had a problem was the easy part. Deciding what to do proved more challenging. You cannot improve what you do not measure, so we started tracking our monthly “Cost of Infrastructure”. We set a goal to hold our infrastructure “costs per night booked” steady, and we brought together technology leaders to regularly review the data. We thought building awareness would solve our problem. Several quarters passed, and we watched as our costs continued to grow. A more focused effort was clearly needed to help us identify actionable steps we could take to change our trajectory.
Knowing your culture is an important consideration before starting any major change. Airbnb’s engineering culture is one of “you build it, you operate it”, and we pride ourselves on making data-informed decisions. This made two things clear. First, adding significant friction for our engineers would be met with heavy resistance. Second, we needed to invest more in our AWS cost and attribution data to develop actionable insights.
Where is the money going?
When starting out, we relied on existing systems to enable our program. We have an internal employee directory as the source of truth for teams. System ownership is defined in an internal tool, Scry. We use Apache Superset, a data exploration and visualization platform designed to be intuitive and interactive. We leverage Terraform as our configuration-as-code solution, which supports most of our tagging by ensuring an AWS Resource is attributed to a Project. Some of our AWS Resources are not created via Terraform, and for these, we created an alternative mechanism directly in our codebase.
We began to ingest the Cost and Usage Report (CUR), the most comprehensive source of AWS billing data available. Building on top of Airbnb’s robust data warehouse infrastructure, the team combined the cost and usage data with teams data and system ownership data to develop an evolving picture of Airbnb’s cost footprint which we call the “Airbnb CUR”. The Airbnb CUR powers a suite of Superset dashboards and metrics which support every pillar of the cost efficiency program, and also the downstream world of consumption attribution. We will share our technical approach for building this data in an upcoming post.
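As a rough illustration of the join described above, here is a minimal sketch that combines billing line items with ownership data to get cost per team. The field names (`resource_id`, `project`, `cost_usd`), the sample rows, and the `UNATTRIBUTED` bucket are all hypothetical. The real CUR has many more columns, and this is not the Airbnb CUR pipeline.

```python
from collections import defaultdict

# Hypothetical, simplified rows: real CUR line items carry many more columns.
cur_line_items = [
    {"resource_id": "i-001", "project": "search", "cost_usd": 120.0},
    {"resource_id": "i-002", "project": "search", "cost_usd": 80.0},
    {"resource_id": "bucket-a", "project": "payments", "cost_usd": 50.0},
    {"resource_id": "vol-9", "project": None, "cost_usd": 10.0},  # untagged resource
]

# Hypothetical project -> team mapping, standing in for system ownership data.
project_owner = {"search": "Search Infra", "payments": "Payments"}


def attribute_costs(line_items, owners):
    """Sum cost per owning team; untagged or unowned spend goes to UNATTRIBUTED."""
    totals = defaultdict(float)
    for item in line_items:
        team = owners.get(item["project"], "UNATTRIBUTED")
        totals[team] += item["cost_usd"]
    return dict(totals)


print(attribute_costs(cur_line_items, project_owner))
```

Keeping an explicit unattributed bucket makes tagging gaps visible instead of silently dropping that spend.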
Over the last 18 months, our finance and technology teams, in partnership with AWS, have developed a trusted working relationship that ensures we purchase what we need and use what we purchase. The importance of this was highlighted during the collapse of travel due to Covid-19, when AWS was a great partner during a tumultuous time. The centralized nature of the team is also key to effective purchasing, as individual teams often do not have broader context. By looking at our footprint in aggregate, we can identify overall needs even as usage rises or falls in individual areas. Getting your purchasing strategy right may be easier at a small company with fewer services, but at Airbnb we have hundreds of services. A centralized cost efficiency team with a bird’s-eye view of the entire Airbnb ecosystem can observe changes and make centralized purchasing decisions accordingly.
AWS announced their Savings Plan in late 2019. We have realized its benefits and now have most of our compute resources covered under this arrangement. We monitor our Savings Plan utilization regularly to minimize On Demand charges and maximize usage of the Savings Plan we have purchased. It can be challenging to predict our compute needs, so flexibility is essential. Today we have a set of prepared responses that move certain workloads on and off Savings Plan to keep utilization healthy. For example, we enhanced our Continuous Integration environment so that it can leverage Spot Instances; with a small configuration change, we can easily dial up our use of Spot Instances if we observe On Demand charges. When we are under-utilizing our Savings Plan, we move over EBS in our data warehouse to EC2 to ensure we stay close to maximum utilization of our Savings Plan.
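The monitoring loop described above can be sketched as a small calculation. It assumes a Savings Plan is modeled as an hourly dollar commitment and that eligible usage is valued at Savings Plan rates. The function names, the 0.95 threshold, and the action strings are hypothetical illustrations, not our actual tooling.

```python
def savings_plan_hour(commitment_usd, eligible_usage_usd):
    """Return (utilization, uncovered_usage_usd) for one hour.

    commitment_usd: hourly Savings Plan commitment.
    eligible_usage_usd: eligible compute usage that hour, valued at Savings Plan rates.
    """
    covered = min(commitment_usd, eligible_usage_usd)
    utilization = covered / commitment_usd
    uncovered = eligible_usage_usd - covered  # this portion would be billed On Demand
    return utilization, uncovered


def suggest_action(utilization, uncovered, low_utilization=0.95):
    """Pick a prepared response; the 0.95 threshold is illustrative only."""
    if uncovered > 0:
        return "shift flexible workloads (such as CI) to Spot"
    if utilization < low_utilization:
        return "move flexible workloads onto Savings Plan-eligible compute"
    return "no action"


for commitment, usage in [(100.0, 120.0), (100.0, 80.0), (100.0, 100.0)]:
    util, over = savings_plan_hour(commitment, usage)
    print(f"utilization={util:.0%} uncovered=${over:.2f} -> {suggest_action(util, over)}")
```

Both failure modes cost money: usage above the commitment is billed On Demand, and usage below it leaves paid-for commitment idle.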
The nimble engineering culture at Airbnb enables engineers to build and improve services autonomously, especially as AWS introduces more advanced offerings. We purchase a 3-year convertible Savings Plan to give us the flexibility to migrate to new instance types. For us, this flexibility outweighs the potential savings from instance-specific Savings Plan purchases. In addition to Savings Plan for our compute capacity, we leverage Reserved Instances for RDS and ElastiCache.
Purchasing the right amount of Savings Plan requires ongoing communication and evaluation. Before we had a cost efficiency team, there was minimal evaluation of whether a spike was due to a short-term usage increase or a permanent increase we should factor into our purchasing strategy. As a result, it was easy to make uninformed purchases. We now project overall usage before making Savings Plan purchases by keeping in touch with dozens of engineering teams. Knowing ahead of time whether major services are going to scale down or up helps ensure we don’t over-purchase or under-purchase. The efficiency team also works with engineering teams to stagger operations that require temporary compute so that total usage doesn’t create high On Demand costs. Constant vigilance is critical for capacity planning success.
With tagging and attribution maturing, we were able to identify our highest areas of spend. Amazon S3 storage costs have historically been one of our top areas of spend. By implementing data retention policies, leveraging more cost-effective storage tiers, and cleaning up unused warehouse storage, we have brought our monthly S3 costs down considerably.
When choosing the most cost-effective storage tier, you need to consider the access pattern for the data along with the file size and number of objects in the S3 bucket, as there can be unexpected costs. Take Glacier as an example. For each object stored in Glacier, S3 stores an additional 32 KB of data in the “Standard” storage class. So if you store a 1 KB object in Glacier, S3 puts an extra 32 KB in Standard, and each portion is charged at its own rate. While Glacier costs only about 10% as much as the Standard storage class, the total cost for many small objects can be higher than simply storing the data in Standard.
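The small-object effect is easy to check with a short calculation. The sketch below uses only the figures quoted above (Glacier at roughly 10% of the Standard price, 32 KB of per-object overhead billed at the Standard rate). Prices are expressed relative to Standard rather than in dollars, and the function names are hypothetical. Actual overhead sizes and rates should be confirmed against current AWS pricing before acting on the numbers.

```python
def monthly_cost_standard(size_kb, standard_price_per_kb):
    """Cost of keeping an object of size_kb in Standard."""
    return size_kb * standard_price_per_kb


def monthly_cost_glacier(size_kb, standard_price_per_kb,
                         glacier_ratio=0.10, overhead_kb=32):
    """Object bytes at the Glacier rate plus per-object overhead at the Standard rate."""
    glacier_price_per_kb = standard_price_per_kb * glacier_ratio
    return size_kb * glacier_price_per_kb + overhead_kb * standard_price_per_kb


def break_even_size_kb(glacier_ratio=0.10, overhead_kb=32):
    """Object size above which Glacier becomes cheaper than Standard."""
    return overhead_kb / (1 - glacier_ratio)


price = 1.0  # relative unit: cost of 1 KB in Standard
for size in (1, 32, 64, 1024):
    std = monthly_cost_standard(size, price)
    gla = monthly_cost_glacier(size, price)
    print(f"{size:>5} KB  standard={std:8.1f}  glacier={gla:8.1f}")
print(f"break-even at about {break_even_size_kb():.1f} KB per object")
```

Under these assumptions a 1 KB object costs about 32 times more in Glacier than in Standard, and Glacier only wins once objects exceed roughly 36 KB. This is why object count and size distribution matter as much as total bytes.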
Compute costs are the single largest line item on our monthly bill, and cost efficiencies in this area have a big impact on our bottom line. While working to control our AWS costs, we are concurrently building new capabilities and improving our technology stack for the future. As part of this modernization, we are moving to Kubernetes. During our effort to eliminate waste, we found a number of large services that were not using the Horizontal Pod Autoscaler (HPA), and others that were using HPA but with settings that kept it from ever scaling the service effectively (a high minReplicas or a low maxReplicas). A focused effort around service tuning improved our utilization, and also maximized the impact of the cluster auto-scaling work, which will be discussed next.
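To see why a high minReplicas or a low maxReplicas defeats autoscaling, here is a simplified sketch of the replica calculation documented for the Kubernetes HPA: the ceiling of current replicas times the ratio of the current metric to the target metric, clamped to the configured bounds. It deliberately ignores the controller's tolerance band, stabilization windows, and pod readiness handling. The function name and the numbers are illustrative.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas, max_replicas):
    """Simplified HPA rule: scale by metric ratio, then clamp to [min, max]."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# A service running 20 replicas at 20% CPU with a 60% target could shrink to 7...
print(desired_replicas(20, 20, 60, min_replicas=3, max_replicas=50))   # 7
# ...but a minReplicas of 20 pins it at 20, so the idle capacity is never released.
print(desired_replicas(20, 20, 60, min_replicas=20, max_replicas=50))  # 20
# A low maxReplicas has the opposite effect: load calls for 15 replicas, the cap allows 10.
print(desired_replicas(10, 90, 60, min_replicas=3, max_replicas=10))   # 10
```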
Before our migration to Kubernetes, each service was manually provisioned to have the necessary compute capacity available. Teams would often increase capacity in response to external traffic or bugs that consumed unnecessary resources, but actual usage was not monitored closely, and capacity was rarely de-provisioned even when it was no longer necessary. With our move to Kubernetes, we were able to leverage the Cluster Autoscaler, a tool that automatically adjusts the size of the Kubernetes cluster when nodes in the cluster have been underutilized for an extended period of time. By tuning the resource requests of each pod to more closely match actual usage, and then autoscaling clusters, we were able to make a step-function improvement in our compute costs. Integrating this capability into our infrastructure took about 6 months, but it has saved a tremendous amount of money by eliminating unused compute resources.
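The interaction between resource requests and cluster size can be shown with a rough estimate. The sketch assumes identical pods, CPU as the only constraint, and perfect packing. It ignores memory, system-reserved capacity, daemon pods, and scheduling constraints, so it is a simplified lower bound and not how the Cluster Autoscaler actually decides. The function name and the numbers are hypothetical.

```python
import math


def nodes_needed(pod_count, cpu_request_per_pod, allocatable_cpu_per_node):
    """Lower-bound node count for identical pods when only CPU requests matter."""
    pods_per_node = math.floor(allocatable_cpu_per_node / cpu_request_per_pod)
    if pods_per_node < 1:
        raise ValueError("a single pod's request does not fit on one node")
    return math.ceil(pod_count / pods_per_node)


# 200 pods each requesting 2 CPUs on nodes with 16 allocatable CPUs
print(nodes_needed(200, 2.0, 16))  # 25
# The same pods after tuning requests down to 0.5 CPU to match observed usage
print(nodes_needed(200, 0.5, 16))  # 7
```

Because scheduling is driven by requests and not by actual usage, over-sized requests keep nodes looking full. Tuning requests first is what gives the autoscaler nodes to remove.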
We have a robust data warehouse and mature data visualization tools, so while we briefly looked at third-party vendors, we opted to build our attribution and reporting capabilities internally. This aspect of the program was a huge body of work and should not be underestimated. Our approach to consumption attribution was to give teams the information they need to make appropriate tradeoffs between cost and other business drivers, so that they can keep their spend within a certain growth threshold. With visibility into cost drivers, we incentivize engineers to identify architectural design changes that reduce costs, and also to identify potential cost headwinds.
We started with a dashboard providing a view into how Airbnb’s overall AWS spend is distributed across different services. This enabled our monitoring track of work, discussed next. Our initial quick-and-dirty attribution was aimed at identifying high-cost areas where efficiency opportunities could have the most impact. This approach was effective for the first 9–12 months of our cost saving work. It became clear, however, that for the long term we needed a consistent pipeline architecture and a scalable attribution approach so that all services could plug into a generalized attribution framework. Additionally, the first version focused on identifying the direct cost of operating our systems. The second version focused on how resources were consumed across systems to operate our site. This helped unlock key insights. For example, the best way for a team to reduce their costs may not be to micro-optimize their resource usage, but to work with an upstream caller to call them less frequently.
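One simple way to picture consumption attribution is to spread a shared service's direct cost across its upstream callers in proportion to how much each one uses it. The sketch below uses request counts as the usage measure. That allocation rule, the function name, and the sample data are assumptions for illustration, not a description of our attribution framework.

```python
def attribute_to_callers(service_cost_usd, calls_by_caller):
    """Split a service's cost across callers in proportion to their call volume."""
    total_calls = sum(calls_by_caller.values())
    if total_calls == 0:
        return {}
    return {
        caller: service_cost_usd * calls / total_calls
        for caller, calls in calls_by_caller.items()
    }


# Hypothetical monthly cost of a shared service and call counts from its callers
shares = attribute_to_callers(1000.0, {"search": 700, "checkout": 200, "batch": 100})
print(shares)
```

Seen this way, a caller that halves its request volume lowers its attributed share without the service owner changing anything, which is the kind of upstream conversation the second version of attribution made possible.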
Good data is invaluable, but only if you look at it. Operational rigor in looking for changes and following up on unexpected costs is one of the most critical parts of a successful cloud efficiency program. With a solid foundation of cost data in place, we have a small group of people with broad subject matter expertise who meet weekly to review the entire cost footprint. Having the right people involved in this area is very important. Our monitoring work is successful due to the curiosity, willingness to dig deep into root causes, and personal accountability these people demonstrate.
In the earlier days of the cost efficiency team, consumption monitoring meetings involved considerable firefighting. The data would surface an anomalous spike in cost for a particular usage type and the monitoring group would begin a quick investigation to understand the root cause, reaching out to other teams to learn more. Over time, the group developed relationships with other teams at Airbnb and built a knowledge base of common pitfalls in cost management. Though spikes still happen, they are smaller in magnitude and less frequent than before.
As our program matures, we are also designating AWS cost champions throughout all product development organizations to replicate the operational review forums and efficiency efforts at the local level, with the central cost team supporting their efforts.
When developing new product capabilities, we want to ensure our AWS spend is still within the allotted budget. Our ability to accurately forecast the anticipated costs of new capabilities is one of our least mature areas, and one where we will need to make a concerted effort to improve in the future. Much of our success to date has come from executing on efficiency projects and reducing the response time on cost incidents. This is reactive, though, and we need to move toward more proactive management of our costs.
In addition to the various technical and organizational efforts to manage AWS costs, we saw a profound cultural change toward cost awareness and management. This shift was both top-down and grassroots. Leaders mentioned the company-wide cost goal during all-hands meetings. The finance team created a company-wide award for financial discipline, presented by the CFO, which recognized employees who had driven important cost savings initiatives. In scrappy Airbnb style, the Infrastructure organization held a cost savings hackathon which spawned a number of impactful efficiency projects. Engineers learn best practices from one another and discuss new savings opportunities in a Slack channel. Upon launch, the AWS Attribution Dashboard became the most viewed dashboard at Airbnb and has since remained in the top list. Seeing this cultural change, we are optimistic that the recent cost reductions Airbnb achieved are not a one-off, but rather a new muscle that we will only strengthen with time.
In the nine months that ended on Sept. 30, Airbnb saw a $63.5 million year-over-year decrease in hosting costs, which contributed to a 26% decline in Airbnb’s cost of revenue. The changes stemmed from better contract management and utilization of our third-party cloud services.
At our scale, cloud efficiency is a massive cross-functional and cross-organizational effort. It requires technologists, data scientists and finance experts to collaborate, develop shared goals and track progress continuously. We maximized our effectiveness by building a core team dedicated to developing a centralized view of cloud efficiency. However, this program would not be successful with only a core team. Our continued success depends on distributing responsibility for cost efficiency to the individual teams who are closest to the cost/benefit tradeoffs.
This work was only possible through a massive amount of support across our entire organization. Special thanks to our AWS cost champions, and the core cost team — Anna Matlin, Ari Siegel, Bharat Rangan, Jian Chen, Jon Tai, Liyin Tang, Melanie Cebula, Stephen Zielinski, Swaroop Jagadish, Tamar Eterman, and Xinrui Hua.
Work like this, and many other exciting things, are always happening at Airbnb. If you want to join us, check out our Airbnb Careers page.
Amazon Web Services, EC2, Amazon RDS, ElastiCache, and Amazon S3 are trademarks of Amazon.com, Inc. or its affiliates in the United States and/or other countries.
“Ruby on Rails” is the registered trademark of David Heinemeier Hansson.
Apache Superset, Apache, and Superset are either registered trademarks or trademarks of The Apache Software Foundation in the United States and/or other countries.
Terraform is the trademark of HashiCorp.
Kubernetes and K8s are the registered trademarks of The Linux Foundation in the United States and/or other countries.
All trademarks are the properties of their respective owners. Any use of these is for identification purposes only and does not imply sponsorship or endorsement.
Our Journey Towards Cloud Efficiency was originally published in Airbnb Engineering & Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.