A simple search query correction heuristic for the resource-constrained At Instacart, our search engine is one of the most important tools customers rely on to quickly find their favorite grocery items in the digital aisles. But, oftentimes, some of the most common grocery goods are the trickiest to spell. We’re not all spellers or digital […]
At Instacart, our search engine is one of the most important tools customers rely on to quickly find their favorite grocery items in the digital aisles. But, oftentimes, some of the most common grocery goods are the trickiest to spell. We’re not all spellers or digital natives — if you mistype or forget how to spell 🥑 (it’s Avocado by the way) you’re not alone. But, that typo shouldn’t automatically mean you have a poor search experience.
Some of our most popular misspelled queries are:
During the early days of Instacart our team was small and scrappy, and we used to manually add correction terms for commonly misspelled queries. But, this approach doesn’t scale. My first project as the first Machine Learning Engineer on Instacart’s Search & Discovery team 5 years ago, was to fix this problem. In technical terms, we wanted to improve the Recall of our search engine for misspelled queries.
In general, we encounter two main query correction problems:
Non-word query corrections can be solved by using a dictionary and some form of distance function to map the incorrect word with a correct word from the dictionary. Real-word query corrections are much harder problems and require more state-of-the-art techniques like building a language model. Peter Norvig wrote a great post on “How to Write a Spelling Corrector,” which walks through some of these concepts.
Given the short one-word queries we typically see with grocery searches, query correction is not an easy problem. There are thousands of papers on this topic and this is still a very active area of research. For us, the non-word query corrections were the most prevalent and accounted for most of the spelling errors that resulted in zero-result queries in the early days of Instacart.
For any given problem, there may be standard “state-of-the-art” solutions that are not always trivial to implement. In the early days, we had to solve a number of technical problems when building the Instacart marketplace, and it was impossible to prioritize each of these problems. So, we had to get scrappy vs. state-of-the-art when tackling these issues. Our commonly used approach to problem-solving in those days was — “can we use a simple approach that solves 80%-90% of the problem in a short period of time?” We often went for a quick, yet very effective, heuristic. It not only solved most of the incorrect spelling problems but also gave us the added benefit of helping with the query reformulation problem, which we’ll explain below.
After entering a search query, a customer will take one of the following actions:
For every query, each of these can be considered a different state the customer can move to in a Markov Chain. We are interested in helping the customer add an item to their basket (conversion state) as efficiently as possible.
If we take the misspelled query “avacado” for example, the transition probabilities look something like this:
If the most probable state the user transitions to next is a conversion, then we don’t have to correct the query. If it is any other state, then it most likely indicates a spelling mistake and we want to help the customer reach a converting state in a frictionless manner. By looking at historical search query logs, we can build this Markov model and correct bad queries and lead the user to a successful query.
We came up with the following heuristic to build this Markov Chain and easily generate query correction pairs:
Using this heuristic, we were able to correct all of our commonly misspelled queries:
We also realized that by looking at pairs whose Levenshtein Distance is greater than the chosen threshold (say 2), you could infer potential query reformulations as well. For example:
>>> correct_query(‘organic ground pork’)
>>> correct_query(‘canned soup’)
In the above cases, customers didn’t necessarily misspell their queries, but we did not have relevant items for what they searched for. They themselves tried a different alternative for which we did carry relevant items and converted. We used that historical data to automatically rewrite their query to a more suitable one which would yield the most relevant results.
Using the heuristic above, we auto-generated thousands of query correction and reformulation pairs and used them to automatically redirect customers to the query with the highest conversion probability when the original query did not yield any results. The resulting experience for customers looks like this:
We were able to bootstrap our query correction and reformulation heuristic using our own data. In other words, we crowd-sourced our first search query correction heuristic 🙂
It’s not always necessary to start with the most sophisticated approach to solve a problem, especially when you’re just getting started. Often simple heuristics give you a lot of mileage on most of the problems. Over time, we deployed more advanced Machine Learning techniques to improve on this heuristic. Stay tuned for future blog posts on this topic!
Want to work on the next generations of Instacart Search? Our Algorithms team is hiring! Go to instacart.com/careers to see our current openings.