Using Sentiment Score to Assess Customer Service Quality

How AI-based Sentiment Models Complement Net Promoter Score

By Shuai Shao, Mia Zhao, Yuanyuan Ni

Net Promoter Score (NPS) is a well-accepted measurement of customer satisfaction in most customer-facing industries. We leverage NPS at Airbnb to help measure how well we serve our community of guests and hosts through our customer service. But NPS has two major drawbacks: 1) NPS is sparse, because only a fraction of users respond to the survey, and 2) NPS is slow, because it takes at least a week for results to show up. Airbnb uses A/B testing heavily across our core products and customer service offerings. In the A/B testing world, the longer it takes to see and interpret experiment results, the longer it takes to iterate on the quality of our customer service. This is why we needed a more sensitive and robust metric.

To address these limitations, Airbnb has developed an AI-based sentiment model to complement NPS. Sentiment models process messages users send to customer support (CS) representatives to extract signals reflecting users’ sentiment. Compared to NPS, the sentiment score has the following advantages:

  • Higher coverage: we are not limited to those who submit a survey, and therefore more users in a given experiment register a value for this metric;
  • Better sensitivity: it takes much less time to reach statistical significance while running an experiment;
  • Causal relationship with long-term customer loyalty: we can ‘translate’ user sentiment scores into long-term business value.

This blog post describes how we developed the sentiment model and the metric that aggregates raw sentiment scores to measure customer sentiment. We used Entropy Balancing (Hainmueller, 2012) to construct a counterfactual group so that we could estimate the relationship between the sentiment metric and future revenue. In our study, the sentiment metric detected shifts in user sentiment faster than NPS.

Sentiment Model Development

Sentiment analysis is a useful method to gauge how consumers feel about a particular product or service. In Airbnb’s customer support, sentiments from our guests and hosts are important signals for us to build better products and services, and to ship changes with our community in mind.

There are two main challenges we face when developing sentiment models in the customer support domain.

  • Skewed Data: Most text inputs are negative in sentiment. Unlike when leaving reviews or messaging with hosts, guests typically contact customer support when they are experiencing an issue with Airbnb.
  • Multilingual Input: More than 14 languages are supported by Airbnb’s customer service. Hosts and guests might be communicating in different languages in the same support ticket.

To make a sentiment model tailored to our use case, we developed customized rating guidelines for customer support messages to make our model aware of domain-specific knowledge and contextual information. Examples below illustrate how the same messages are labelled differently when presented as a CS message versus a Social Media post or App Store review. In the CS domain, we focus on how well customers “think” the issue gets solved as a positive indication and how frustrating they “feel” the issue is as a negative indication.

We address data skewness through multiple iterations of sampling data for human annotation with the help of an ML model, then retraining the model on the newly labelled data. The first round of annotation is based on random sampling, while subsequent annotation datasets are stratified on the existing model’s predictions. This leads to a more balanced dataset for training.
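The stratified sampling step can be sketched roughly as follows. This is only an illustration under stated assumptions: `stratified_annotation_sample`, `toy_predict`, and the `per_bucket` parameter are hypothetical names, and the keyword rule stands in for a trained model; the actual pipeline is not described in this post.

```python
import random
from collections import defaultdict


def stratified_annotation_sample(messages, predict, per_bucket, seed=0):
    """Pick messages for labelling, balanced across predicted sentiment classes.

    messages:   list of message strings
    predict:    hypothetical callable mapping a message to a predicted label
    per_bucket: maximum number of messages drawn from each predicted class
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for msg in messages:
        buckets[predict(msg)].append(msg)

    sample = []
    for label in sorted(buckets):
        pool = buckets[label]
        # Draw the same number from every class (or the whole class if small).
        sample.extend(rng.sample(pool, min(per_bucket, len(pool))))
    return sample


def toy_predict(msg):
    # Toy stand-in for the current model (assumption, for illustration only).
    return "positive" if "thank" in msg.lower() else "negative"


# Skewed toy input: 5 positive messages versus 95 negative ones.
messages = ["Thanks, that solved it"] * 5 + ["Still no refund"] * 95
batch = stratified_annotation_sample(messages, toy_predict, per_bucket=5)
```

Drawing an equal number per predicted class means the rare class is over-represented in the annotation batch relative to random sampling, which is the balancing effect described above.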

We built and tested two deep learning architectures, both of which support multilingual inference:

  • WIDeText uses a CNN-based architecture to process text channels, while all categorical features are processed through the WIDe channel.
  • XLM-RoBERTa uses a transformer-based architecture and leverages a pre-trained multilingual model, which we train on CS messages in 14 languages.

WIDeText Architecture

Transformer Architecture

Transformer-based models achieve slightly better performance on English sentiment analysis and much better performance on less frequently used languages. We chose the transformer-based classifier for the production inference pipeline.

Sentiment Metrics Development

From the raw sentiment scores, we developed the sentiment metric aiming to optimize the following criteria:

  • Strong correlation with NPS
  • Sensitivity in experimentation
  • Demonstrable causal relationship with long-term business gains

Correlation with NPS

Despite the limitations of NPS, it is still considered the gold standard for measuring users’ sentiment. We therefore want the sentiment metric, which is more sensitive and robust, to correlate well with NPS. We tested various ways of designing the metric by aggregating the message-level raw sentiment scores (e.g., mean, cutoff, slope) and checking how well each aggregate correlates with NPS.
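As a rough illustration of what such aggregations could look like, the sketch below computes three candidate ticket-level metrics from a list of message-level scores. The exact definitions used in production are not given in this post, so the function names, the `threshold` parameter, and the reading of "cutoff" (share of messages at or above a threshold) and "slope" (least-squares trend over message order) are assumptions.

```python
from statistics import mean


def mean_score(scores):
    """Average message-level sentiment over a ticket."""
    return mean(scores)


def cutoff_share(scores, threshold=0.0):
    """Share of messages whose score is at or above a threshold (assumed definition)."""
    return sum(s >= threshold for s in scores) / len(scores)


def slope(scores):
    """Least-squares slope of score against message index (assumed definition).

    A positive slope means sentiment improves over the course of the ticket.
    """
    n = len(scores)
    if n < 2:
        return 0.0
    x_bar = (n - 1) / 2
    y_bar = mean(scores)
    num = sum((i - x_bar) * (s - y_bar) for i, s in enumerate(scores))
    den = sum((i - x_bar) ** 2 for i in range(n))
    return num / den


# Hypothetical ticket: the user starts frustrated and ends satisfied.
ticket = [-0.8, -0.4, 0.0, 0.4]
summary = {
    "mean": mean_score(ticket),
    "cutoff": cutoff_share(ticket),
    "slope": slope(ticket),
}
```

Each candidate aggregate can then be correlated against NPS over the same period to decide which one tracks it best.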

The two charts below illustrate that sentiment scores and NPS correlate well for guest and host sentiment.

NPS (green) vs Sentiment Metrics (orange) on guest sample

NPS (green) vs Sentiment Metrics (orange) on host sample

Sensitivity in Experimentation

We revisited two types of past experiments (Scenarios 1 and 2) to compare the sensitivity of NPS and the sentiment metric in experimentation. The goal was to determine whether the sentiment metric can provide quicker or more accurate feedback in response to a shift in user sentiment.

Scenario 1

In the first type of experiment, user research indicated that a new product/service feature hurt the user experience (e.g., the service required extra steps to contact a support agent), yet the feature did not produce any statistically significant change in NPS.

For example, in one of our Interactive Voice Response (IVR) experiments, we successfully reduced contact rate by adding more questions to our automated phone messaging system. However, this also increased friction for users trying to reach customer support. At the end of the experiment, NPS trended negative but was not statistically significant after running for 30 days.

When we applied the sentiment metric to this experiment, the change in the metric reached statistical significance within 5 days.

Scenario 2

The second type of experiment involves product/service features that hurt the user experience and did impact NPS in a statistically significant way. For example, one of our chatbot experiments decreased both NPS and the sentiment metric. NPS reached statistical significance at day 10, while the sentiment metric converged much faster, detecting a shift by day 5.
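One simple way to compare how quickly two metrics "reach significance" is to recompute a test on the cumulative data each day and record the first day the p-value drops below a threshold. The sketch below does this with a two-sample z-test under a normal approximation; it is an illustration, not the experimentation platform's actual method, and the function names and toy data are assumptions.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sided_p_value(control, treatment):
    """p-value for a difference in means, using a normal approximation."""
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    if se == 0:
        return 1.0  # no variation at all: treat as no detectable difference
    z = (mean(treatment) - mean(control)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def first_significant_day(daily_control, daily_treatment, alpha=0.05):
    """Return the first day (1-based) with cumulative p < alpha, else None."""
    control, treatment = [], []
    for day, (c, t) in enumerate(zip(daily_control, daily_treatment), start=1):
        control.extend(c)
        treatment.extend(t)
        if len(control) > 1 and len(treatment) > 1:
            if two_sided_p_value(control, treatment) < alpha:
                return day
    return None


# Made-up daily metric values; the treatment group is shifted down by 0.2.
daily_control = [[0.1, -0.1, 0.2, -0.2]] * 10
daily_treatment = [[-0.1, -0.3, 0.0, -0.4]] * 10
day = first_significant_day(daily_control, daily_treatment)
```

Note that checking significance every day inflates the false-positive rate unless a sequential correction is applied, so a comparison like this is best used to rank metrics by sensitivity on past experiments rather than as a stopping rule.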

Relationship with Long-term Customer Loyalty

Because Airbnb is a low-frequency marketplace, one of the challenges in our experimentation framework is evaluating the effect of product iterations on long-term customer loyalty, such as user churn rate and future booking revenue. For customer support teams, our products have an especially large impact on users’ experience. Experimentation should help decision makers answer the question “Should we launch a product/service feature if it reduces cost but hurts users’ satisfaction levels?”

Our third assessment quantifies the future booking impact of customer service using the sentiment score metric.

It would be very expensive, if possible at all, to run A/B tests with two distinct pools of agents who provide different standards of service to different groups of users. Instead, we use a novel causal inference technique to detect sentiment effects on a user’s future one-year booking revenue with observational data.

We divide the users into two groups: a control group, with comparatively lower sentiment scores, and a treatment group, with higher, more positive sentiment. We need to control for the fact that these two groups may be fundamentally different from each other in many ways, such as their tolerance to different levels of service quality, loyalty to our platform, and historical booking experience.

Analysis workflow of establishing relationship between sentiment score and future revenue

To evaluate the long-term effects of providing good customer service more reliably, we established a procedure to: 1) find confounding factors, 2) control for these covariates using entropy balancing, and 3) evaluate treatment effects using the weighted data.

Confounding Variable Selection

It took several rounds of iteration before we were able to narrow down the appropriate confounding variables and generate the covariate matrix. We first listed all possible confounding variables that should be taken into account. These covered multiple areas, including user account information, previous booking behaviors, and customer contact habits. We then selected the variables that correlated with both sentiment and future booking. For example, users with more previous bookings tend to book more and are more positive when communicating with customer support agents. Finally, we cross-checked correlations among all variables to remove redundant ones. This gave us a short list of confounding variables.
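The screening steps described above could be sketched as below. The thresholds `min_corr` and `max_redundancy`, the function names, and the toy variables are assumptions chosen for illustration; the post does not state which correlation measure or cutoffs were used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if degenerate)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_confounders(candidates, sentiment, future_booking,
                       min_corr=0.1, max_redundancy=0.9):
    """candidates maps a variable name to its per-user values."""
    # Step 1: keep variables correlated with BOTH treatment and outcome.
    related = [
        name for name, vals in candidates.items()
        if abs(pearson(vals, sentiment)) >= min_corr
        and abs(pearson(vals, future_booking)) >= min_corr
    ]
    # Step 2: drop variables that nearly duplicate one already kept.
    selected = []
    for name in related:
        if all(abs(pearson(candidates[name], candidates[kept])) < max_redundancy
               for kept in selected):
            selected.append(name)
    return selected


# Made-up per-user data for six users.
sentiment = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
future_booking = [1, 2, 2, 3, 5, 6]
candidates = {
    "past_bookings": [0, 1, 1, 2, 3, 4],
    "past_nights": [0, 2, 2, 4, 6, 8],     # redundant with past_bookings
    "unrelated_flag": [1, -1, 0, 0, -1, 1],  # uncorrelated with sentiment
}
shortlist = select_confounders(candidates, sentiment, future_booking)
```

In this toy data only `past_bookings` survives: `unrelated_flag` fails the first step and `past_nights` is dropped as redundant.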

Entropy Balancing

We use Entropy Balancing to achieve covariate balance. Entropy Balancing is a maximum entropy reweighting scheme that creates balanced samples satisfying a set of constraints. Here are the two most important features of the scheme:

1. Equalized moments of the covariate distributions. By assigning a weight w_i to each sample unit, we want the moments of the covariate distributions (e.g., mean, variance, and skewness) of the treatment group and the reweighted control group to be equal (defined in equation 2). A typical balance constraint is formulated with m_r containing the r-th order moment of a given variable X_j from the treatment group, whereas the moment functions are specified for the control group as c_ri(X_ij).

2. Minimized distance from base weights. We also want to minimize the distance between the estimated weights w_i and the base weights q_i (usually set to 1/n_0, i.e., uniformly distributed) to retain as much information as possible (defined in equation 3).
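For reference, the two properties above can be written as a single optimization problem. The block below follows the formulation in Hainmueller (2012), using the symbols from the text; D = 0 denotes control units and R is the number of balance constraints. It is a restatement for convenience and may not match the numbering or exact notation of the equations originally shown with this post.

```latex
\begin{aligned}
\min_{w_i}\quad & H(w) = \sum_{\{i \mid D = 0\}} w_i \log\!\left(\frac{w_i}{q_i}\right) \\
\text{subject to}\quad
& \sum_{\{i \mid D = 0\}} w_i \, c_{ri}(X_{ij}) = m_r, \qquad r = 1, \dots, R, \\
& \sum_{\{i \mid D = 0\}} w_i = 1, \qquad w_i \ge 0 \ \text{ for all } i \text{ with } D = 0 .
\end{aligned}
```

The objective is the Kullback-Leibler divergence between the estimated weights and the base weights, so with uniform q_i the solution stays as close to equal weighting as the balance constraints allow.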

Compared to the more frequently used Propensity Score Matching, Entropy Balancing has several proven advantages:

  • It balances covariates well, even for higher-order moments. Most other preprocessing methods involve multiple rounds of manual adjustment of both the model and the matching until the results are balanced, which often fails on high-dimensional samples. Entropy Balancing instead searches directly for weights that achieve exact covariate balance in finite samples. It significantly improves on the balance that can be obtained by other methods, as validated on an insurance use case by Matschinger et al. (2020).
  • It retains valuable information without discarding units. Entropy Balancing retains valuable information by allowing the unit weights to vary smoothly across units, so that we don’t have to throw away any unmatched data.
  • It is versatile. The resulting weights can be used with almost any standard estimator of treatment effects, such as a weighted mean or a weighted regression.
  • It is computationally inexpensive. It takes only a couple of seconds to get balanced results for over 1M records.
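To make the reweighting concrete, here is a deliberately small sketch that balances a single covariate on its first moment only, with uniform base weights. Under those assumptions the entropy balancing solution has the form w_i proportional to exp(lam * x_i), and the scalar lam can be found with Newton's method. This is not the implementation used in our analysis; the function name and toy data are hypothetical, and a real application would balance many covariates and moments with an established library.

```python
from math import exp


def entropy_balance_weights(control_x, target_mean, tol=1e-10, max_iter=100):
    """Reweight control units so their weighted mean of x equals target_mean.

    Single covariate, first-moment constraint, uniform base weights.
    target_mean must lie strictly between min(control_x) and max(control_x).
    """
    lam = 0.0
    weights = [1.0 / len(control_x)] * len(control_x)
    for _ in range(max_iter):
        # Subtract the largest exponent for numerical stability.
        shift = max(lam * x for x in control_x)
        raw = [exp(lam * x - shift) for x in control_x]
        total = sum(raw)
        weights = [r / total for r in raw]

        w_mean = sum(w * x for w, x in zip(weights, control_x))
        gap = w_mean - target_mean
        if abs(gap) < tol:
            break
        w_var = sum(w * (x - w_mean) ** 2 for w, x in zip(weights, control_x))
        if w_var == 0:
            break  # all x identical: the constraint cannot be moved
        # Newton step: the derivative of the weighted mean w.r.t. lam
        # equals the weighted variance.
        lam -= gap / w_var
    return weights


# Made-up example: control users have 1..5 previous bookings,
# and the treatment group averages 4.0 previous bookings.
control_x = [1, 2, 3, 4, 5]
weights = entropy_balance_weights(control_x, target_mean=4.0)
balanced_mean = sum(w * x for w, x in zip(weights, control_x))
```

Every control unit keeps a positive weight, which reflects the point above that no unmatched data is discarded; units that resemble the treatment group simply count for more.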

Evaluating Treatment Effects

We were able to reach balanced results for all confounding variables after entropy reweighting.

With the weighted results, we found that guests on Airbnb with higher sentiment (potentially good CS experiences, with sentiment metric >= 0.1) produce significantly more revenue in the subsequent 12 months. This result can be applied to trade-off analysis whenever a change moves cost and the CS sentiment score in opposite directions (for example, it reduces cost but lowers sentiment), helping us make the right launch decision with long-term revenue taken into consideration.
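Once control weights are available, the simplest estimator is a difference between the treated mean and the reweighted control mean. The sketch below shows that calculation; the function names are hypothetical and the revenue figures and weights are invented purely to show the arithmetic, not taken from our results.

```python
def weighted_mean(values, weights):
    """Mean of values under the given (not necessarily normalized) weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


def att_estimate(treated_outcomes, control_outcomes, control_weights):
    """Effect on the treated: treated mean minus reweighted control mean."""
    treated_mean = sum(treated_outcomes) / len(treated_outcomes)
    return treated_mean - weighted_mean(control_outcomes, control_weights)


# Invented 12-month revenue per user, for illustration only.
treated_revenue = [120.0, 100.0, 140.0]  # higher-sentiment group
control_revenue = [80.0, 100.0, 120.0]   # lower-sentiment group
control_weights = [0.2, 0.3, 0.5]        # e.g., from entropy balancing

effect = att_estimate(treated_revenue, control_revenue, control_weights)
```

The same weights can also be passed to a weighted regression if additional adjustment or standard errors are needed.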


In this blog post, we provided details of sentiment model development and the framework of assessing sentiment metrics.

For ML practitioners, the success of a sentiment analysis depends on domain-specific data and annotation guidelines. Our experiments show transformer-based classifiers perform better than CNN-based architectures, especially in less frequently used languages.

For customer service providers who have struggled with the limitations of NPS, sentiment analysis has shown promising results in describing customers’ satisfaction levels. If you have user communication text, exploring sentiment analysis may address the long-standing sparsity and latency problems of NPS. If you only have phone call recordings, audio-to-text transcription could be a good starting point before exploring emotion detection in audio.

For data analysts and data scientists, the framework for developing metrics from a new signal (model output) is reusable: because many user feedback metrics are either slow or sparse, data professionals can assess new signals on coverage, sensitivity, and causal relationship with business value. For causal analysis challenges, it is worth spending some time exploring Entropy Balancing, which may save you time compared with Propensity Score Matching.


Thanks to Zhiying Gu and Lo-hua Yuan for providing important knowledge support on causal inference. Thanks to Mitral Akhtari and Jenny Chen for knowledge sharing on Airbnb’s Future Incremental Value system. We would also like to thank Bo Zeng for sentiment modeling guidance, Mariel Young for the metrics iteration, and Aashima Paul, Evan Lin, and Keke Hu for their hard work on labelling the sentiment data. Last but not least, we appreciate Joy Zhang, Nathan Triplett, and Shijing Yao for their guidance.


  1. Jens Hainmueller (2012) Entropy Balancing for Causal Effects: A Multivariate Reweighting Method to Produce Balanced Samples in Observational Studies. Political Analysis, 20: 25–46. doi:10.1093/pan/mpr025
  2. Herbert Matschinger, Dirk Heider, Hans-Helmut König (2020) A Comparison of Matching and Weighting Methods for Causal Inference Based on Routine Health Insurance Data, or: What to do If an RCT is Impossible. Gesundheitswesen, 82(S 02): S139–S150. doi:10.1055/a-1009-6634

Further Reading

WIDeText: Multimodal Deep Learning Framework, and its application on Room Type Classification goes into the details of the deep learning framework used at Airbnb.

Bighead: A Framework-Agnostic, End-to-End Machine Learning Platform (DSAA 2019) goes into the details of the Airbnb machine learning infrastructure.

How Airbnb Measures Future Value to Standardize Tradeoffs goes into the details of how Airbnb optimizes for long-term decision-making through the propensity score matching model.

Using Sentiment Score to Assess Customer Service Quality was originally published in The Airbnb Tech Blog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Airbnb