How Airbnb Measures Future Value to Standardize Tradeoffs


The propensity score matching model powering how we optimize for long-term decision-making

By Mitra Akhtari, Jenny Chen, Amelia Lemionet, Dan Nguyen, Hassan Obeid, Yunshan Zhu

At Airbnb, we have a vision to build a 21st century company by operating over an infinite time horizon and balancing the interests of all stakeholders. To do so effectively, we need to be able to compare, in a common currency, both the short and long-term value of actions and events that take place on our platform. These actions could be a guest making a booking or a host adding amenities to their listing, to name just two examples.

Though randomized experiments measure the initial impact of some of these actions, others, such as cancellations, are difficult to evaluate using experiments due to ethical, legal, or user experience concerns. Metrics in experiments can be hard to interpret as well, especially if the experiment affects opposing metrics (e.g., bookings increase but so do cancellations). Additionally, regardless of our ability to assess causal impact with A/B testing, experiments are often run only for a short period of time and do not allow us to quantify impact over an extended period.

So what did we build to solve this problem?

Introducing Future Incremental Value (FIV)

We are interested in the long-term causal effect or “future incremental value” (FIV) of an action or event that occurs on Airbnb. We define “long-term” as 1 year, though our framework can adjust the time period to be as short as 30 days or as long as 2 years.

To use a concrete example, assume we would like to estimate the long-term impact of a guest making a booking. Denote the n1 users who make a booking within a month as i ∈ a1 and the n0 users who do not make a booking in that time period as i ∈ a0. In the following year, each of these users generates revenue (or any other outcome of interest), denoted by y_i. The naive approach to computing the impact of making a booking would be to simply look at the average difference between users who made a booking and those who did not:

Naive(a) = (1/n1) Σ_{i ∈ a1} y_i − (1/n0) Σ_{i ∈ a0} y_i

However, these two groups of users are very different: those who made a booking “selected” into doing so. This selection bias obscures the true causal effect of the action, FIV(a).

Our goal is to exclude the bias from the naive estimate to identify FIV(a).

The Science Behind FIV

To minimize selection bias in estimating the FIV of an action, we need to compare observations from users or listings that are similar in every way except for whether or not they took or experienced an action. The well-documented, quasi-experimental methodology we have chosen for this problem is propensity score matching (PSM). We start by separating users or listings into two groups: observations from those that took the action (“focal”) during a given timeframe and observations from those that did not (“complement”). Using PSM, we construct a “counterfactual” group, a subset of the complement that matches the characteristics of the focal as much as possible, except that these users or listings did not take the action. The assumption is that “assignment” into focal versus counterfactual is as good as random.

Figure 1. Overview of methodology behind FIV

The specific steps we take for eliminating bias from the naive method are:

  1. Generate the Propensity Score: Using a set of pre-treatment or control features describing attributes of the user or listing (e.g., number of past searches), we build a binary, tree-based classifier to predict the probability that the user or listing took the action. The output here is a propensity score for each observation.
  2. Trim for Common Support: We remove from the dataset any observations that have no “matching twin” in terms of propensity score. After splitting the distribution of propensity scores into buckets, we discard observations in buckets where either the focal or complement have little representation.
  3. Match Similar Observations: To create the counterfactual, we use the propensity score to match each observation in the focal to a counterpart in the complement. Various matching strategies can be used, such as matching in bins or via nearest neighbors.
  4. Results: To get the FIV, we compute the average of the outcome or target feature in the focal minus the average in the counterfactual.
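As a rough sketch, these four steps might look like the following on toy data (using scikit-learn, which is not necessarily Airbnb's production stack; the data, model settings, and trimming rule are all illustrative):

```python
# Illustrative sketch of the four PSM steps on synthetic data.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Toy data: X are pre-treatment control features, took_action marks the focal
# group, y is the outcome (e.g., revenue over the following year).
n = 2000
X = rng.normal(size=(n, 3))
took_action = (X[:, 0] + rng.normal(size=n) > 0).astype(int)  # selection on X
y = 5.0 * took_action + 2.0 * X[:, 0] + rng.normal(size=n)    # true effect = 5

# 1. Generate the propensity score with a tree-based binary classifier.
clf = GradientBoostingClassifier(random_state=0).fit(X, took_action)
ps = clf.predict_proba(X)[:, 1]

# 2. Trim for common support: keep only scores where both groups appear.
lo = max(ps[took_action == 1].min(), ps[took_action == 0].min())
hi = min(ps[took_action == 1].max(), ps[took_action == 0].max())
keep = (ps >= lo) & (ps <= hi)

# 3. Match each focal observation to its nearest complement by score.
focal = np.where(keep & (took_action == 1))[0]
comp = np.where(keep & (took_action == 0))[0]
matched = comp[np.abs(ps[comp][None, :] - ps[focal][:, None]).argmin(axis=1)]

# 4. FIV = mean outcome in the focal minus mean outcome in the counterfactual.
fiv = y[focal].mean() - y[matched].mean()
print(round(fiv, 2))  # substantially closer to 5 than the naive difference
```

Nearest-neighbor matching is just one of the strategies mentioned in step 3; matching in propensity-score bins follows the same pattern.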

Evaluation

In a supervised machine learning problem, as more data becomes available and future outcomes are actualized, the model is either validated or revised. This is not the case for FIV. The steps above give us an estimate of the incremental impact of an action, but the “true” incremental impact is never revealed. In this world, how do we evaluate the success of our model?

Common Support: One of the assumptions of using PSM for causal inference is “common support”:

0 < P(D = 1 | X) < 1,

where D = 1 denotes observations in the focal group and X are the control features. This assumption rules out “perfect predictability”, guaranteeing that observations with the same X values have a positive probability of belonging to both groups and can therefore be matched together to provide valid comparisons. Plotting the distribution of propensity scores for the focal and the complement group allows for a visual inspection of this assumption. Interestingly, in the case of causal inference with PSM, a high Area Under the Curve (AUC), a desirable feature for most prediction models, means that the model distinguishes between focal and complement observations too well, reducing our matching quality. In such cases, we assess whether those control features are confounders that affect the output metrics and eliminate them.
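A quick numeric version of this check might look as follows (toy propensity scores; the 10-bucket grid and any thresholds are illustrative choices, not Airbnb's):

```python
# Sketch: a very high AUC signals the model separates focal from complement
# too well, which means poor overlap; a bucket-level overlap rate makes the
# usual visual histogram check numeric.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(1)
ps = rng.uniform(size=500)         # toy propensity scores
d = rng.binomial(1, ps)            # focal membership driven by the score

auc = roc_auc_score(d, ps)

# Share of propensity-score buckets where both groups are represented.
bins = np.linspace(0, 1, 11)
f_counts, _ = np.histogram(ps[d == 1], bins)
c_counts, _ = np.histogram(ps[d == 0], bins)
overlap = np.mean((f_counts > 0) & (c_counts > 0))
print(f"AUC={auc:.2f}, bucket overlap={overlap:.0%}")
```

A very high AUC paired with low bucket overlap would point toward dropping non-confounding features that separate the two groups too cleanly.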

Matching Evaluation: Observations are considered “similar” if the distributions of key features in the focal closely match the distributions of those in the counterfactual. But how close is close enough? To quantify this, we compute three metrics to assess the quality of the matching, as described in Rubin (2001). These metrics identify whether the propensity score and key control features have similar distributions in the focal and counterfactual groups. Additionally, we are currently investigating whether to apply an additional regression adjustment to correct for any remaining imbalance in the key control features. For instance, after the matching stage, we could run a regression that directly controls for key features that we want an almost exact match for.
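For illustration, one such balance diagnostic, the standardized difference in means of a feature between groups, could be sketched as follows (the metrics in Rubin (2001) differ in detail; the numbers here are synthetic):

```python
# Sketch of a Rubin (2001)-style balance diagnostic: difference in means
# scaled by the pooled standard deviation, before and after matching.
import numpy as np

def standardized_difference(focal_x, counter_x):
    """Mean difference of a feature, scaled by the pooled standard deviation."""
    pooled_sd = np.sqrt((focal_x.var(ddof=1) + counter_x.var(ddof=1)) / 2)
    return (focal_x.mean() - counter_x.mean()) / pooled_sd

rng = np.random.default_rng(2)
# Synthetic feature values: badly imbalanced before matching, balanced after.
before = standardized_difference(rng.normal(1.0, 1, 1000), rng.normal(0.0, 1, 1000))
after = standardized_difference(rng.normal(0.05, 1, 1000), rng.normal(0.0, 1, 1000))
print(round(before, 2), round(after, 2))  # matching should shrink this toward 0
```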

Past experiments: Company-wide, we run experiments to test various hypotheses on how to improve the user experience, potentially leading to positive outcomes such as a significant increase in bookings. These experiments generate a source of variation in the likelihood of guests making a booking that does not suffer from selection bias, due to the randomization of treatment assignment in the experiment. By tracking and comparing the users in the control group to users in the treatment groups of these experiments, we observe the “long-term impact of making a booking”, which we can compare to our FIV estimate for “guest booking”. While the FIV estimate is a global average and experiments often estimate local average treatment effects, we can still use experimental benchmarks as an important gut check.

Adapting FIV for Airbnb

While PSM is a well-established method for causal inference, we must also address several additional challenges, including the fact that Airbnb operates in a two-sided marketplace. Accordingly, the FIV platform must support computation from both the guest and the listing perspective. Guest FIV estimates the impact of actions based on activity a guest generates on Airbnb after experiencing an action, while listing FIV is from the lens of a listing. We are still in the process of developing a “host-level” FIV. One challenge in doing so will be sample size: we have fewer unique hosts than listings.

To arrive at a “platform” or total FIV for an action, we cannot simply add guest and listing FIVs together because of double counting. We simplify the problem and only count the value from the guest side or the listing side, depending on which mechanisms we believe drive the majority of the long-term impact.

Another feature of our two-sided market is cannibalization, especially on the supply-side: if a listing gets more bookings, some portion of this increase is taking away bookings from similar listings. In order to arrive at the true “incremental” value of an action, we apply cannibalization haircuts to listing FIV estimates based on our understanding of the magnitude of this cannibalization from experimental data.

The Platform Powering FIV

FIV is a data product and its clients are other teams within Airbnb. We provide an easy-to-use platform to organize, compute, and analyze actions and FIVs at scale. As part of this, we have built components that take in input from the client, construct and store the necessary data, productionize the PSM model, compute FIVs, and output the results. The machinery, orchestrated through Airflow and invisible to the client, looks as follows:

Figure 2. Overview of FIV Platform

Client Input

Use cases begin with a conversation with the client team to understand the business context and technical components of their desired estimate. An integral part of producing valid and useful FIV estimates is establishing well-defined focal and complement groups. Additionally, there are cases when the FIV tools are not applicable, such as when there is limited observational data (e.g., a new feature) or small group sizes (e.g., a specific funnel or lever).

The client submits a configuration file defining their focal and complement groups; this is essentially the only task the client performs in order to use the FIV platform. For the FIV of “guest booking”, the focal group is visitors who booked a home on our site and the complement is visitors who did not book a home.
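As an illustration, a config along these lines might look like the following (the field names beyond cohort, filter_query, and action_event_query, and the queries themselves, are hypothetical sketches rather than the production schema):

```yaml
# Hypothetical FIV config sketch for the "guest booking" action.
action_name: guest_booking
version: 1
cohort: |                # maximum set of users to consider
  SELECT id_visitor FROM all_visitors
filter_query: |          # users removed from consideration
  SELECT id_visitor FROM experience_bookings
action_event_query: |    # users assigned to the focal group
  SELECT id_visitor FROM home_bookings
```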

The cohort identifies the maximum set of users to consider (in this case, all visitors to Airbnb’s platform), some of whom are removed from consideration by the filter_query (in this case, users who also booked an Airbnb experience). From the remaining set of users, the action_event_query allocates users to the focal group, with the remainder automatically assigned to the complement.

After the client’s config is reviewed, it is merged into the FIV repository and automatically ingested into our pipelines. We assign a version to each unique config to allow for iteration while storing historical results.

We have designed the platform to be as easy to use as possible. No special knowledge of modeling, inference, or complex coding is needed. The client only needs to provide a set of queries to define their groups and we take care of the rest!

Data Pipeline

The config triggers a pipeline to construct the focal and complement, join them with control and target features, and store this in the Data Warehouse. Control features will later serve as inputs into the propensity score model, whereas target features will be the outcomes that FIV is computed over. Target features are what allow us to convert actions from different contexts and parts of Airbnb into a “common currency”. This is one of FIV’s superpowers!

Leveraging Zipline as our feature management system, we currently have approximately 1,000 control features across both guests and listings, such as region, cancellations, or past searches. Though we have the capability to compute FIV in terms of numerous target features, we have a few target features that give us a standardized output, such as revenue, cost, and bookings.

Figure 4. Steps to compute the raw data needed for FIV, after taking in client input

The version of the config is also used here to automate backfills, significantly decreasing manual errors and intervention. There are multiple checks on the versioning to ensure that the data produced is always aligned with the latest config.

Modeling Pipeline

Because the focal and complement groups can be very large and costly to use in modeling, we downsample to a subset of our total observations. To account for sampling noise, we take multiple samples from the output of our data pipeline and feed each sampling round into our modeling pipeline. Sampling improves our SLA, ensures each group has the same cardinality, and lets us gauge sampling noise. We also remove outliers to limit the noisiness of our estimates.
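A minimal sketch of this repeated, balanced downsampling (group sizes and data are synthetic, and a simple difference in means stands in for the full modeling pipeline):

```python
# Sketch: draw several balanced downsamples and look at the spread of the
# resulting estimates to gauge sampling noise.
import numpy as np

rng = np.random.default_rng(4)
focal_y = rng.normal(10.0, 3, 100_000)   # outcomes for the focal group
comp_y = rng.normal(8.0, 3, 500_000)     # outcomes for the complement

k = 5_000  # same cardinality for both groups in each round
round_estimates = [
    rng.choice(focal_y, k, replace=False).mean()
    - rng.choice(comp_y, k, replace=False).mean()
    for _ in range(10)
]
print(f"mean={np.mean(round_estimates):.2f}, "
      f"std across rounds={np.std(round_estimates):.3f}")
```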

The PSM model is built on top of Bighead, Airbnb’s machine learning platform. After fetching the sampled data, we perform feature selection, clean the features, and run PSM to produce FIVs in terms of each target feature before finally writing our results into the Data Warehouse. In addition to the FIVs themselves, we also collect and store evaluation metrics as well as metrics such as feature importance and runtime.

Figure 5. Modeling steps needed to compute FIV, after the raw data has been generated

On top of the modeling pipeline we have built the ability to prioritize actions and rate-limit the number of tasks we launch, giving us a big-picture view of the resources being used.

FIVs!

Next we pull our FIVs into a Superset dashboard for easy access by our clients. FIV point estimates and confidence intervals (estimated by bootstrapping) are based on the last 6 months of available data to smooth over seasonality or month-level fluctuations. We distinguish between the value generated by the action itself (tagged as “Present” below) and the residual downstream value (“Future”) of the action.
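The bootstrap step can be sketched as follows (synthetic outcomes; the exact resampling scheme used in production is not specified in this post):

```python
# Sketch: bootstrapped 95% confidence interval for a matched difference
# in means between the focal group and its counterfactual.
import numpy as np

rng = np.random.default_rng(3)
focal_y = rng.normal(10.0, 2, 500)    # outcomes in the focal group
counter_y = rng.normal(8.0, 2, 500)   # outcomes in the matched counterfactual

estimates = []
for _ in range(2000):
    f = rng.choice(focal_y, size=focal_y.size, replace=True)
    c = rng.choice(counter_y, size=counter_y.size, replace=True)
    estimates.append(f.mean() - c.mean())

lo, hi = np.percentile(estimates, [2.5, 97.5])
print(f"FIV point estimate {focal_y.mean() - counter_y.mean():.2f}, "
      f"95% CI [{lo:.2f}, {hi:.2f}]")
```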

Figure 6. Snapshot of the dashboard as seen by clients

FIV as a Product

Airbnb’s two-sided marketplace creates interesting but complicated tradeoffs. To quantify these tradeoffs in a common currency, especially when experimentation is not possible, we have built the FIV framework. This has allowed teams to make standardized, data-informed prioritization decisions that account for both immediate and long-term payoffs.

Currently, we have scaled to work with all teams from across the company (demand-side, supply-side, platform teams like Payments and Customer Support, and even the company’s most recent team, Airbnb.org) and computed over 150 FIV action events from the guest and listing perspective. Use cases range from return on investment calculations (what is the monetary value of a “perfect” stay?) to determining the long-term value of guest outreach emails that may not always generate immediate output metrics. We have also used FIV to inform the overall evaluation criteria in experiments (what weights do we use when trading off increased bookings and cancellations?) and rank different listing levers to understand what to prioritize (what features or amenities are most useful for a host to adopt?).

In the absence of a centralized, scalable FIV platform, each individual team would need to create their own framework, methodology, and pipelines to assess and trade off long-term value, which would be inefficient and leave room for errors and inconsistencies. We have boiled down this complex problem into essentially writing two queries with everything else done behind the scenes by our machinery.

Yet our work is not done: we plan to continue improving the workflow experience and exploring new models to sharpen our estimates. The future of FIV at Airbnb is bright!

Acknowledgments

FIV has been an effort spanning multiple teams and years. We’d like to especially thank Diana Chen and Yuhe Xu for contributing to the development of FIV, and the teams who have onboarded and placed their trust in FIV.


How Airbnb Measures Future Value to Standardize Tradeoffs was originally published in The Airbnb Tech Blog on Medium.
