How Instacart built a crowdsourced data labeling process (and how you can too!)
By: Neel Ajjarapu and Omar Alonso
Organizations that develop technologies rooted in information retrieval, machine learning, recommender systems, and natural language processing depend on labels for modeling and experimentation. Humans provide these labels in the context of a specific task, and the data collected is used to construct training sets and evaluate the performance of different algorithms.
How do we collect human labels? Crowdsourcing has emerged as a practical way to collect labels at scale. Popular services like Amazon Mechanical Turk or Figure Eight are examples of platforms where one can create tasks, upload data sets, and pay for work. However, homework needs to be done before a data set is ready to be labeled. This is even more important for new domains where there are no existing training sets or other benchmarks… domains like grocery!
At Instacart, we are revolutionizing how people search, discover, and purchase groceries at scale. Every day, our users conduct millions of searches on our platform, and we return hundreds of millions of products for them to choose from. In such a unique domain, collecting human labels at scale has allowed us to augment Instacart search and generate best practices that we hope to share.
Introducing our “Pre-flight Checklist” for implementing large-scale crowdsourcing tasks. This list is independent of any specific crowdsourcing platform and can be adapted to any domain.
Before we jump in, a note on terminology: we use the terms rater, evaluator, and worker interchangeably to refer to a human who is completing a task. In a task, humans are asked to provide answers to one or more questions. This process is usually called labeling, evaluation, or annotation, depending on the domain.
The first step to approaching human evaluation is to understand what your organization has already done. Make sure to ask the following questions:
If your organization has already collected human evaluated data, make sure to understand existing processes. Do you have vendors with whom you already work? Is there an established way to store human-labeled data? Existing approaches can influence how you design your crowdsourcing task, so it’s important to take stock. Understand what went well in previous projects and what lessons were learned.
If you’re starting from scratch, focus on an area that the organization would like to know more about. For example, you may not know how good your top-k organic results are and want to quantify that metric.
At Instacart, we had previously completed a few ad-hoc projects, but now that we are beginning to run large-scale projects, we are revising the methodology.
Creating human evaluated data is often a costly and time-consuming process. Make sure to ask yourself:
Your data could be used as general training and evaluation data, as a way to quality test the output of your model, or as a reference collection to benchmark current and future models. Each of these use cases may require different approaches, which you should keep in mind.
Moreover, make sure that your use cases will genuinely benefit from human labeling. Crowdsourced tasks require proper setup and a budget and should only be reserved for tasks requiring human input.
At Instacart, we wanted to measure the relevance of our search results. Labeled data helps us understand how relevant the products we show are to the query a user enters in the search bar. This data can be used for training and evaluating models and for measuring the quality of our search results.
Familiarity with your product and the data generated is crucial. As you spend time looking at the data, you will begin to understand if you have all the data a rater needs to complete a task, how complex a task it will be, and potential gray areas. Ask yourself:
This understanding is imperative, as it sets the groundwork for your task design. Without spending the appropriate amount of time here, you’ll be in for many surprises and labels that don’t meet your expectations.
At Instacart, our goal was to measure how relevant our search results were to our queries — thoroughly considering all the associated data helped us avoid pitfalls later on. For example, we initially assumed that displaying product names and images would suffice in describing our products. However, as we internally tried to evaluate some data, we ran into trouble evaluating queries that specified product size, such as “six pack beer” or “bulk candy.” By revisiting Instacart search, we recalled that Instacart “Item Cards” display the product’s size and quantity under the product name. We made sure to present the same information in our human evaluation task. Had we not performed the internal exercise and found this discrepancy, raters definitely would have been confused on measurement-specific queries, and we would have been in for a surprise with our labels!
Ultimately, our search results return groceries — this is entirely different from airline flight times or restaurant reviews — and has its own set of complexities. You will want to do the same legwork to understand the complexities of your product and data.
After defining your use cases and data, you will want to design and implement your Human Intelligence Task (HIT) — the actual task you want your rater to complete. We’ll be using “task” and “HIT” interchangeably from now on. In designing your task, make sure to ask:
Your task should try to answer a single or a small set of questions. Avoid conditional or layered tasks, where the rater needs to answer multiple questions, as this adds additional cognitive overhead. Often raters may have language barriers or optimize their work around the volume of tasks they complete, and so complex multi-layered tasks may put you at risk for low-quality results. If you plan on multilingual tasks, such as evaluating both English and French-language products, make sure you design for the product’s native language version first (in Instacart’s case, that’s English), and then expand to other languages.
At Instacart, there are many ways to try to measure the relevance of our search results. With the “query-slate” model, a rater could be presented with the query and an array of products, which they rate as a whole for relevance. Alternatively, we could try to capture the relationship between the query and product, for example, whether the product is an ingredient of the query or complementary to the query — which we could then map to a relevance score.
Ultimately, we decided that the simplest approach would be to ask: “How relevant is this product to this query?” — in which a rater evaluates a single query-product pair. This was the most straightforward task we could present to a rater while still addressing the most critical question we wanted answered.
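To make the query-product pairing concrete, here is a minimal sketch of what a single relevance HIT record might look like. The field names and the class are illustrative assumptions, not Instacart’s actual schema; the point is that each HIT bundles the query with everything the rater needs to judge the product (including size, per the Item Card lesson above).

```python
from dataclasses import dataclass

# Illustrative sketch of a single query-product HIT record;
# field names are assumptions, not Instacart's real schema.
@dataclass
class RelevanceHIT:
    query: str          # the raw search query shown to the rater
    product_name: str   # product title, as it appears on the Item Card
    product_size: str   # size/quantity shown under the product name
    image_url: str      # product image displayed to the rater

    def question(self) -> str:
        # The single question the rater answers for this HIT.
        return f'How relevant is "{self.product_name}" to the query "{self.query}"?'

hit = RelevanceHIT(
    query="six pack beer",
    product_name="Pilsner Lager",
    product_size="6 x 12 fl oz",
    image_url="https://example.com/lager.png",
)
print(hit.question())
```

Keeping the record flat like this mirrors the one-question-per-HIT principle: the rater sees one query, one product, and one question.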
A simple task was especially crucial for us, since search relevance for food is already such a complex area. Grocery searches need to consider brands, dietary restrictions, ingredients, complements, and more — all of which we needed to capture in our guidelines!
Creating labeling guidelines is a bit more art than science. Your guidelines need to walk a fine line between being so broad that your labels are imprecise and so prescriptive that a rater is forced to abandon the intuitive judgment that makes human labeling valuable.
Developing your guidelines will be an iterative process. You will need to collect information from your team and users to understand and codify how raters evaluate your data. The following are methods and resources that you can use to develop your guidelines:
At Instacart, we went through all of the above. Beginning with a barebones set of criteria, our team evaluated hundreds of query-product pairs. Our ratings had disagreements, and we had to ask ourselves interesting questions such as: when users search for “gluten-free pasta,” how relevant is wheat pasta? Or for searches like “hot dog buns,” is it okay for us to show complementary results like hot dog wieners? For a brand search like “Coke,” how relevant is a competitor’s product, like “Pepsi”? Once we had identified these types of cases, we incorporated input from our User Research team on how existing users think about these types of search results. We went through this process iteratively until we had a set of guidelines that gave us the precision we needed without being over-prescriptive.
It doesn’t matter how straightforward your task is if you can’t clearly convey how you want raters to label that data. As the designer of the task, you likely have an amorphous set of rules laid out, which aren’t easily codified. Packaging that information into a digestible set of instructions is a challenge. Make sure to ask yourself:
Creating instructions for raters will require creating a document that encapsulates the criteria you want them to understand and implement. These instructions can be in the form of a booklet, a slide deck, or any other medium. In these instructions, make sure to communicate the criteria step-by-step and present plenty of examples along the way. As you create these instructions, show them to people who aren’t working on your crowdsourcing project. At this point, you are likely intimately familiar with the task and guidelines and will benefit from the feedback of people who’ve never seen the project before.
When presenting the information to a rater, think about how a user would interact with that same information in your product. Your rater’s UI should resemble your product’s UI — including text size, fonts, image quality — as closely as possible.
At Instacart, we created a set of slides that walked the rater through our criteria, including quizzes that confirm their understanding of the key concepts we presented. We made sure to display the information as close as possible to a typical Instacart Item Card UI with all the associated product information.
By creating a simple HIT, a clear and understandable set of guidelines, and a well-communicated set of instructions — you’ve built the foundation for high-quality results. These additional methods will help you measure and maintain rater quality during the evaluation period. Ask yourself:
In selecting your raters, it is important to know who your raters are. You may want control and visibility into your raters’ specific demographic information, including spoken language, nationality, gender, and age. This can help set up tasks where you want your rater pool to match your users’ demographic makeup. It can also help improve quality — for example, if you are rating English-only results, you likely want only English speakers evaluating that data. Gating by demographics can potentially increase the cost of your ratings, but it can be well worth it for the quality improvement.
You also need to make sure that your raters understand the task. After communicating your task guidelines to raters, test them on a series of hand-chosen HITs that reflect the guidelines’ complexity. This will confirm that they understand the task and its intricacies before they can rate actual data. Only allow raters to rate your data if they score above the threshold you’ve chosen on your test.
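The qualification gate described above can be sketched in a few lines. The function name, the answer-key format, and the 0.8 threshold are all illustrative assumptions; the idea is simply to score a rater against hand-chosen test HITs before admitting them.

```python
# Sketch of a qualification gate: a rater must score above a chosen
# threshold on hand-picked test HITs before rating real data.
# The 0.8 threshold here is an illustrative choice.
def passes_qualification(rater_answers, answer_key, threshold=0.8):
    """rater_answers and answer_key both map HIT id -> label."""
    correct = sum(rater_answers.get(hit_id) == label
                  for hit_id, label in answer_key.items())
    return correct / len(answer_key) >= threshold

answer_key = {"q1": "relevant", "q2": "irrelevant", "q3": "relevant",
              "q4": "relevant", "q5": "irrelevant"}
rater = {"q1": "relevant", "q2": "irrelevant", "q3": "relevant",
         "q4": "irrelevant", "q5": "irrelevant"}
print(passes_qualification(rater, answer_key))  # 4/5 = 0.8 -> True
```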
You will want to trust the labels that you get back from your platform. That is, the data needs to be reliable. Data is reliable if workers agree on the answers. Different workers produce similar results if they understand the instructions that we have provided to them. If two or more workers agree on the same answer, there is a high probability that the final label is correct. At Instacart, we had five raters evaluate each task and took the consensus rating (3 or more raters in agreement) as the final score.
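The consensus rule above (five raters per task, final label when three or more agree) can be sketched as a simple majority vote; the function name is an illustrative assumption.

```python
from collections import Counter

# Sketch of the consensus rule described above: with five raters per
# task, accept the majority label when at least three raters agree.
def consensus_label(labels, min_agreement=3):
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None  # None -> no consensus

print(consensus_label(["relevant"] * 4 + ["irrelevant"]))
# Five raters split 2-2-1: no label reaches 3, so no consensus.
print(consensus_label(["relevant", "relevant", "somewhat",
                       "somewhat", "irrelevant"]))
```

Tasks that come back as `None` can be escalated for re-rating or internal review rather than silently assigned a label.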
Inter-rater agreement reliability measures the extent to which independent raters assess a task and produce the same answer. One of the most widely used statistics to compute agreement is Cohen’s kappa (κ), a chance-adjusted measure of agreement between two raters. A generalization for n raters is Fleiss’ kappa. Both statistics are available in standard libraries and packages such as R or scikit-learn. We strongly recommend using inter-rater statistics to measure reliability on every data set.
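For transparency, here is Cohen’s kappa computed from scratch for two raters: observed agreement adjusted by the agreement expected by chance from each rater’s label frequencies. In practice you would reach for scikit-learn’s `cohen_kappa_score` (or an R package) rather than hand-rolling it; this sketch just shows what the statistic does.

```python
from collections import Counter

# Cohen's kappa for two raters: (p_o - p_e) / (1 - p_e), where p_o is
# the observed agreement and p_e the agreement expected by chance.
def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    return (p_o - p_e) / (1 - p_e)

a = [1, 1, 0, 1, 0, 1, 1, 0]
b = [1, 1, 0, 0, 0, 1, 1, 1]
print(round(cohens_kappa(a, b), 4))  # 0.4667
```

A kappa of 1.0 means perfect agreement, 0 means no better than chance; values around 0.4–0.6 are often read as moderate agreement, though the thresholds are conventions, not laws.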
Another common strategy to ensure high-quality work is to include predefined gold standard data in the data set at random so that you can test how workers perform. This technique is known as “honey pots”, “gold data”, or “verifiable answers”. If you know the correct labels for a set of HITs, you can use that precomputed information to test workers. By interleaving honey pots in the data set, it is possible to identify workers who might be performing poorly. If all workers perform poorly on a particular honey pot, this may also indicate a mismatch between your intended label and how workers are interpreting your guidelines.
How do we build a set of honey pots? As part of your internal labeling exercise, identify cases where you and your team have reached consensus. Those cases can serve as your precomputed honey pots. You can then randomly add the gold data into the data set that needs to be labeled, so raters evaluate the honey pot tasks the same as your unevaluated data set.
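The interleave-and-score loop can be sketched as below. Both function names and the id-to-label dictionaries are illustrative assumptions; the mechanics are just shuffling gold tasks into the batch and grading each worker on the gold subset afterward.

```python
import random

# Sketch: mix precomputed honey pots into the unlabeled batch so raters
# cannot tell them apart from ordinary tasks.
def interleave_honeypots(tasks, honeypots, seed=0):
    combined = tasks + honeypots
    random.Random(seed).shuffle(combined)  # fixed seed for reproducibility
    return combined

# Sketch: grade a worker on the honey-pot subset only.
def worker_accuracy(worker_labels, gold):
    """worker_labels and gold both map task id -> label."""
    scored = [worker_labels[task_id] == label
              for task_id, label in gold.items() if task_id in worker_labels]
    return sum(scored) / len(scored) if scored else None
```

Workers whose honey-pot accuracy falls below your bar can be removed from the pool, and their labels discarded or down-weighted.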
In some cases, the HIT may have poor data, such as an incorrect product image or a severely misspelled term. For cases like these, it can help to offer the rater an “I Don’t Know” option instead of having them guess. Going one step further, you may want to ask raters who select the option to explain why they cannot evaluate the task. You can provide a list of reasons from which they select or add a text field. These options can help you diagnose your information quality and have the added benefit of deterring excessive use of the option. Additional safeguards, such as rate-limiting the usage of the “I Don’t Know” option, can also be used to ensure that raters don’t abuse the option.
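A minimal version of that rate-limiting safeguard might look like the following; the 10% cap is an illustrative assumption, not a recommendation from the post.

```python
# Sketch: flag raters who over-use the "I Don't Know" option,
# assuming a simple per-rater cap (the 10% figure is illustrative).
def overuses_idk(labels, max_idk_fraction=0.10):
    idk = sum(1 for label in labels if label == "I Don't Know")
    return idk / len(labels) > max_idk_fraction
```

Flagged raters can be warned, re-trained on the guidelines, or removed, depending on how severe the over-use is.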
Now that you’ve completed the pre-flight checklist, you’re almost ready to label large data sets!
With your task and process clearly defined, it shouldn’t be too difficult to find a crowdsourcing platform that will satisfy your needs. The next step is to create processes for the data you want to evaluate, by sampling, partitioning, and preparing the data for continuous evaluation.
Crowdsourcing-based labeling is a good alternative for collecting data for evaluation and for constructing training data sets. That said, there is little information on how to set up this type of project and the amount of time and preparation needed. Many projects tend to underestimate the preparation steps and focus only on the specific crowdsourcing platform. Shortcuts like these can, unfortunately, lead to subpar results, as crowdsourcing is more than just the choice of platform. We believe that the checklist is useful for making sure that the project is successful and that the collected labels are of good quality.
In our next post for this series, we will focus on the details of running crowdsourcing tasks continuously and at scale.
We thank our team members Jonathan Bender, Nicholas Cooley, Jeremy Diaz, Valery Karpei, Aurora Lin, Jeff Moulton, Angadh Singh, Tyler Tate, Tejaswi Tenneti, Aditya Subramanian, and Rachel Zhang. Thanks to Haixun Wang for providing additional feedback.
If you are interested in learning more, there is a dedicated conference, HCOMP (Human Computation), that brings together many disciplines such as artificial intelligence, human-computer interaction, economics, social computing, policy, and ethics. These bonus reads offer a good introduction to these topics:
Want to design large-scale data projects like these? Our Algorithms team is hiring! Go to instacart.com/careers to see our current openings.