By: Enrique Cruz
At Foursquare, we pride ourselves on empowering our community to explore the world around them. Our consumer app, Foursquare City Guide, is a location-based recommendation engine and exploration app. One of the primary actions for our users is to write tips (or a short, public blurb of text) attached to a venue that often serves as a quick review or suggestion. Over the years, Foursquare users have written more than 95 million tips. While these tips are valuable, they provide a ton of information for users to sift through. This is why determining which tips are “better” than others for a given venue is an important task within the Foursquare app ecosystem.
A few months ago, we revamped our strategy for selecting the best tips for a given venue. Our new ranking model greatly improves on our prior approaches by leveraging contextual, text-based, and social signals, allowing us to select the tips that provide our users with the most informative, relevant, and high-quality content. In this post, we'll go over our new methodology as well as how the model's introduction yielded significant positive results as measured by various A/B tests across different use cases.
Historically, Foursquare has used a few different mechanisms for sorting and selecting the best tips at a venue — but we felt none of them were fully satisfactory on their own.
Let’s discuss a few of the most prominent ranking strategies previously used and review their challenges:
Popularity: This is a measure of the positive interactions a tip has garnered since its creation, such as “upvotes”. While popularity is a reasonable proxy for relevance or usefulness, ranking by it tends to favor content that is old or stale, creating a feedback loop in which highly-ranked tips are more prominently exposed (and thus gain even more popularity). Continuously showing old tips can make our apps appear outdated, failing to leverage the very active user community we have that continuously provides us with awesome new tips.
Recency: This is a measure of the amount of time that has passed since the tip was created. This measurement does a great job at showcasing the vibrancy of the Foursquare community, yet it offers no guarantee of quality or relevance.
For our new tip ranker, we wanted to build on the successes of prior approaches and develop a system that not only balanced popularity and recency, but also allowed us to factor in other nuanced signals that help differentiate a bad tip from a great one.
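To make the popularity–recency tradeoff concrete, here is a minimal sketch of a blended score with an exponential recency decay. The weights, half-life, and log-scaling below are illustrative assumptions, not Foursquare's actual formula:

```python
import math

def tip_score(upvotes, age_days, half_life_days=180.0, w_pop=0.7, w_rec=0.3):
    """Blend popularity with an exponential recency decay.

    Illustrative only: the weights, half-life, and log-scaling
    are assumptions, not the production scoring function.
    """
    popularity = math.log1p(upvotes)                  # damp runaway popularity
    recency = 0.5 ** (age_days / half_life_days)      # halves every half_life_days
    return w_pop * popularity + w_rec * recency
```

With a blend like this, a tip's score decays as it ages even if its upvote count is frozen, which breaks the pure-popularity feedback loop described above.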
In addition to popularity and recency as defined above, we included the following features in our revamped tip ranking model:
Language Identification: This is a language classifier built using an ensemble of open source and home-grown solutions in order to avoid serving tips in languages that a user does not understand.
Content Richness: These are several signals that track more general attributes and metadata about the tip beyond the actual information contained within the tip itself. Among these factors is the presence or absence of a photo, links to external sources, as well as the number of words the tip contains.
Author Trust: These are author statistics such as tenure as a Foursquare City Guide user, total popularity, and other aggregate facts around the user’s previously written tips. These signals attempt to capture a user’s trustworthiness as a tip author.
Global Quality: This is a set of scores from statistical classifiers trained to identify specific traits. For example, a sentiment classifier is trained on the explicit “like” and “dislike” ratings that users provided for a venue on the same day they wrote a tip there; Natural Language Processing (NLP) techniques then learn which words and phrases best predict each class of tips. Similarly, the likelihood of a tip being reported as spam is modeled by looking at past tips reported as spam and learning the attributes that correlate most strongly with those reports.
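Taken together, the signals above can be pictured as a single numeric feature vector per tip. The sketch below is purely illustrative — the field names, feature set, and encodings are assumptions, not the production schema:

```python
from dataclasses import dataclass

@dataclass
class Tip:
    text: str
    upvotes: int
    age_days: float
    has_photo: bool
    author_tenure_days: float
    language: str
    sentiment_score: float   # output of an upstream classifier (assumed)
    spam_score: float        # output of an upstream classifier (assumed)

def featurize(tip: Tip, user_language: str = "en") -> list:
    """Flatten one tip into a feature vector like the one described above."""
    return [
        float(tip.upvotes),                             # popularity
        float(tip.age_days),                            # recency
        1.0 if tip.language == user_language else 0.0,  # language match
        1.0 if tip.has_photo else 0.0,                  # content richness
        float(len(tip.text.split())),                   # word count
        float(tip.author_tenure_days),                  # author trust
        tip.sentiment_score,                            # global quality
        tip.spam_score,
    ]
```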
In order to train our model using these new features, we generated some training data by leveraging existing crowdsourcing platforms. To collect our data, we first determined the top 1,000 most popular venues by user views and proceeded to randomly sample 100 distinct pairs of tips from each of these venues. After accounting for some language filtering and de-duplicating, this yielded a dataset of 75,000 tip pairs.
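The sampling step can be sketched as drawing distinct unordered tip pairs per venue. The helper below is an illustrative reconstruction of that procedure, not the actual pipeline (which also applied the language filtering and de-duplication mentioned above):

```python
import random
from itertools import combinations

def sample_tip_pairs(venue_tips, pairs_per_venue=100, seed=7):
    """Randomly sample distinct tip pairs per venue.

    `venue_tips` maps a venue id to its list of tip ids.
    Using combinations() guarantees each unordered pair appears at most once.
    """
    rng = random.Random(seed)
    dataset = []
    for venue, tips in venue_tips.items():
        all_pairs = list(combinations(tips, 2))  # every distinct pair, once
        rng.shuffle(all_pairs)
        for a, b in all_pairs[:pairs_per_venue]:
            dataset.append((venue, a, b))
    return dataset
```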
We then created labels for this data by designing a job on Figure Eight (formerly CrowdFlower, a crowdsourcing platform for tasks similar to Amazon Mechanical Turk) where the judges would be shown a tip pair from our sample pool alongside the relevant venue. The judges were then asked the question, “If you were currently at this venue or considering visiting this venue, which of the following pieces of content is more informative?” We designed the test so that the tips would be shown in a similar context to the way they are displayed in the City Guide app, exposing our judges to all the same contextual information that affects the way our real users view a tip. The outcome of our Figure Eight job yielded around 50,000 labeled pairs of tips which we divided into training and evaluation data.
To train our new tip ranker, we explored a variety of learning-to-rank algorithms, including LambdaMART, Coordinate Ascent, and RankBoost. After evaluating the results, we settled on SVMrank (a pairwise ranking formulation of Support Vector Machines) as our supervised learning algorithm. The objective is to minimize the number of disordered tip pairs relative to our crowdsourced training labels.
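The pairwise objective can be sketched with the standard reduction to difference vectors: for each labeled pair, the model should score the winner above the loser. In the sketch below, a simple perceptron-style update stands in for the SVMrank solver — the reduction is the same, but the optimizer is only an illustration:

```python
def train_pairwise_ranker(pairs, n_features, epochs=50, lr=0.1):
    """Learn weights w so that score(winner) > score(loser) for each pair.

    `pairs` is a list of (winner_vector, loser_vector). A perceptron-style
    update on difference vectors stands in here for the SVM solver.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for winner, loser in pairs:
            diff = [a - b for a, b in zip(winner, loser)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair is misordered: nudge w toward the winner
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, x):
    """Linear scoring function: a single tip's rank score."""
    return sum(wi * xi for wi, xi in zip(w, x))
```

Once trained, the same linear `score` ranks any set of tips at a venue, not just the pairs seen in training.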
As we iterated and tuned our new ranker, we evaluated its performance against a “held out” dataset, comparing it against some baseline metrics. We also evaluated the rankers qualitatively with a new side-by-side tool to look at the best tips for a venue chosen by each model.
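One natural held-out metric, mirroring the training objective, is the fraction of labeled pairs the model orders correctly. A minimal sketch:

```python
def pairwise_accuracy(score_fn, held_out_pairs):
    """Fraction of held-out (winner, loser) pairs ordered correctly.

    Fewer disordered pairs means a higher score; 1.0 is a perfect ordering.
    """
    correct = sum(1 for w, l in held_out_pairs if score_fn(w) > score_fn(l))
    return correct / len(held_out_pairs)
```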
In the final model (Tip Ranker with text features), these were the features with the highest weight:
The features with the least amount of predictive power turned out to be:
After the encouraging results of the newly-trained tip ranker on our held out dataset, we brought the model into production to be used on our entire venue corpus and leveraged it in various touch points within the Foursquare ecosystem. Below are some of the places we experimented with the new ranker, and the results from running A/B tests with a 50% split of our user base.
There are a few areas of work left to explore that could yield further improvements in the way we select tips by incorporating new features into the model.
Some of these include:
All in all, it’s critical for us to continuously evaluate the way we process, track and showcase user feedback — which contributes to our active user base and influx of location-based insights. Through analyzing past approaches and experimenting with new techniques, we are able to serve our community with the most valuable information possible.
A Word of Advice: Revamping Foursquare’s Tip Ranking Methodology was originally published in Foursquare on Medium.