Red Means Stop. Green Means Go: A Look into Quality Assessment in Instacart’s Knowledge Graph


Author: Thomas Grubb

In the modern age of big data it is important to remember that data quality is just as important as data quantity. I’m a Machine Learning PhD intern at Instacart where I’ve spent the last three months working on various projects regarding data assessment. As I complete my time at Instacart, I want to share how Instacart is taking a proactive approach towards data quality by laying the foundations for a reliability scoring system in the Instacart Knowledge Graph.

The Instacart Knowledge Graph

The Instacart Knowledge Graph (KG) is a central data store of millions of contextualized facts regarding the grocery industry. These facts are varied in nature and help us understand food items as tangible, real world objects, not just as data points in an inventory catalog. For example, our internal taxonomy classifies grocery items into one of several thousand categories such as “couscous,” “risotto,” or “ice cream.” The internal tree structure on our taxonomy tells us that “couscous” and “risotto” are much more closely related than “couscous” and “ice cream.” Moving from categorical to quantitative data, our wide knowledge of nutritional information tells us that pasta products are high in carbohydrates, whereas nuts are high in proteins and fats. This in turn tells us to avoid the former and recommend the latter for a keto diet.

The Instacart KG doesn’t exist in isolation. It is continually rebuilt from a variety of data sources, and information from the KG is frequently packaged and published for downstream applications such as search. New data sources are frequently incorporated into the KG’s build process over time. Because of this, the KG team has implemented a reliability scoring system which automatically rates the data it ingests on its accuracy and consistency. This system accomplishes two key tasks. First, it helps us preemptively discover and flag flaws in our data which can then be corrected at the source. Second, it acts as a basic guardrail which prevents noisy and unreliable data from being published and corrupting downstream processes.

In implementing this reliability system we are banking on the old saying that “an ounce of prevention is worth a pound of cure.” Minor inaccuracies in our data can compound and surface in unforeseen ways which produce friction down the road. A pack of ground beef with an incorrect taxonomy classification will not be received well by a vegan customer browsing our “meat alternatives” category. Missing or incorrect product descriptions make it harder for Instacart shoppers to locate the correct product in the store. Typos in numeric data lead to noisy training sets, which in turn lead to less precise predictions from our machine learning models. For a peek into how Instacart uses these models, and why their accuracy is important, see this previous blog post on product availability in stores.

Because the knowledge graph centralizes all of our knowledge regarding grocery products, we are able to take a contextualized view of a data point in relation to closely related points. This makes anomalies and inconsistencies in the data much more apparent, and makes the KG a prime spot for performing quality assessment.

Reliability Scores in Knowledge Graphs

Sophisticated models for data cleaning in knowledge graphs exist in the literature; for a survey, see [1]. These models can label information with a score from 0 to 1, which can be interpreted as the probability that a certain fact holds. However, there are several downsides to relying solely on ML for data quality assessment. These include high upfront costs for implementing the algorithms in the first place, as well as difficulties in running these algorithms at scale. Moreover, reliance on techniques such as graph embeddings can lead to black box results which are hard to interpret.

As a result of these downsides, we decided to get our project off the ground using a much simpler approach to quality assessment. Our current reliability system uses a series of unit tests to assign a discrete score to our data: red, yellow, or green. Meant to evoke images of a traffic light, the scores convey the following information:

  • Green: We have no issues with this data point. Go ahead and use it!
  • Yellow: The reliability of this data point is questionable. Use with caution!
  • Red: We think this data point is actually incorrect. Stop and don’t use it!

What this system lacks in sophistication it makes up for in ease of implementation and explainability. The unit tests in our system are quick to implement and run at scale. Moreover, they are transparent: each piece of data which is labelled as questionable or unreliable comes with a list of explicit reasons explaining the score. This allows for easier debugging and correction of errors at the source.

Types of Tests

The unit tests in our reliability system are packaged into groups which test a specific slice of the KG at a time. For example, the nutritional unit tests measure the reliability of our knowledge of a product’s calories, fats, proteins, and carbohydrates per serving. Other slices that are tested include our knowledge of brands and taxonomic classifications of products, or our knowledge of categorical attributes of products such as “Vegan” or “Gluten Free”. To ensure that the system as a whole remains transparent, each individual unit test is restricted to a highly specific piece of knowledge. We describe several of these tests below.

The most basic unit tests employed in our quality assessment system apply to numeric data. Prime examples are outlier detection schemes, which compare a data point to “similar” data points around it. This allows us to, for instance, flag products with abnormally large protein per serving. See the chart below, where in red we have flagged the unique ice cream product in our catalog with over twenty grams of protein per serving.

Although these tests are straightforward in nature, it is important to remember not to apply them blindly; context always matters. Twenty grams of protein per serving in a product would be an abnormality if the product was a carton of ice cream, but would be perfectly normal if the product was a can of tuna. Thus outlier detection schemes are not applied to our entire catalog of products at once, but instead on a class-by-class basis corresponding to their location in our taxonomy. For a detailed analysis of using outlier detection methods to discover numerical inaccuracies in knowledge graphs, see [2].
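As a rough illustration of class-by-class outlier detection, here is a minimal Python sketch. It is not our production code: the field names (`taxonomy_class`, `protein_g`), the sample values, and the choice of a median-based modified z-score with a 3.5 cutoff are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def flag_outliers_by_class(products, field="protein_g", threshold=3.5):
    """Return ids of products whose `field` is an outlier within its own taxonomy class."""
    by_class = defaultdict(list)
    for product in products:
        by_class[product["taxonomy_class"]].append(product)

    flagged = []
    for members in by_class.values():
        values = [p[field] for p in members]
        med = median(values)
        # Median absolute deviation: a spread estimate that one bad value cannot dominate.
        mad = median(abs(v - med) for v in values)
        if mad == 0:
            continue  # no spread in this class, so nothing to compare against
        for p in members:
            # 0.6745 rescales the MAD so the score is comparable to a z-score.
            score = 0.6745 * abs(p[field] - med) / mad
            if score > threshold:
                flagged.append(p["id"])
    return flagged


# Hypothetical sample data: protein per serving, in grams.
ice_cream = [2, 3, 3, 4, 5, 21]
tuna = [20, 21, 22, 23, 24]
products = (
    [{"id": f"ice{i}", "taxonomy_class": "ice cream", "protein_g": g} for i, g in enumerate(ice_cream)]
    + [{"id": f"tuna{i}", "taxonomy_class": "canned tuna", "protein_g": g} for i, g in enumerate(tuna)]
)
print(flag_outliers_by_class(products))
```

With this sample, only the ice cream with 21 grams of protein is flagged; the tuna products with similar values are left alone because they are compared only against other tuna.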

Once numeric testing has been done, we move to logical rule-based testing schemes. These tests invoke ground truths that our data should follow. For example, the Instacart KG contains information about food products at the chemical level. It knows that carbohydrates are a constituent of food, and that sugar is a specific type of carbohydrate. This allows us to apply tests such as the following:

  • A product tagged as “sugar-free” must have 0 grams of sugar per serving.
  • Since sugar is a carbohydrate, the grams of sugar per serving cannot be larger than the grams of carbohydrate per serving in a product.
  • One gram of fat yields 9 calories of energy when broken down. One gram of either protein or carbohydrate yields 4 calories of energy when broken down. Thus the calories per serving in a product should be approximately 9*fat + 4*protein + 4*carbs (in practice one must also take dietary fiber, sugar alcohols, and ethanol content into account).

Applying the second rule above allows us to flag the red points below, as they contain more sugar per serving than carbohydrates:

Rule-based tests are a gold standard of quality assessment. Their simplicity allows them to run quickly at scale with minimal implementation costs (as long as one has the requisite domain knowledge to understand the ground truths the data should follow). Moreover, the specificity of each rule-based test gives precise reasons for why data was scored a certain way.
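The three nutrition rules listed above can be sketched as a single Python function that returns the names of the rules a product breaks. The field names, the rule names, and the 20% calorie tolerance are assumptions for illustration; the calorie check also ignores fiber, sugar alcohols, and ethanol, so it is deliberately loose.

```python
def nutrition_rule_failures(product, calorie_tolerance=0.2):
    """Return the names of the rule-based tests that `product` fails."""
    failures = []

    # Rule 1: a "sugar-free" tag implies 0 grams of sugar per serving.
    if "sugar-free" in product.get("tags", ()) and product["sugar_g"] > 0:
        failures.append("sugar_free_but_has_sugar")

    # Rule 2: sugar is a carbohydrate, so it cannot exceed total carbohydrates.
    if product["sugar_g"] > product["carbs_g"]:
        failures.append("sugar_exceeds_carbs")

    # Rule 3: calories should roughly match 9*fat + 4*protein + 4*carbs.
    expected = 9 * product["fat_g"] + 4 * product["protein_g"] + 4 * product["carbs_g"]
    if abs(product["calories"] - expected) > calorie_tolerance * max(expected, 1.0):
        failures.append("calories_inconsistent_with_macros")

    return failures


# Hypothetical products, values per serving.
consistent = {"fat_g": 10, "protein_g": 5, "carbs_g": 20, "sugar_g": 8, "calories": 190, "tags": []}
suspicious = {"fat_g": 0, "protein_g": 0, "carbs_g": 3, "sugar_g": 5, "calories": 12, "tags": ["sugar-free"]}

print(nutrition_rule_failures(consistent))   # []
print(nutrition_rule_failures(suspicious))   # ['sugar_free_but_has_sugar', 'sugar_exceeds_carbs']
```

Because each rule appends its own name, a failing product carries an explicit list of reasons, which is the transparency property described earlier.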

The next class of tests that Instacart uses to assess data reliability are based on clustering and anomaly detection. You may know of these tests as the reason your credit card gets declined when you try to buy gas on a road trip far away from your home. Humans are creatures of habit, and when established patterns are broken there is often reason for suspicion. This same rule applies to companies and the products they create. To quantify this, we can assign each brand B a characteristic vector v_B whose ith entry is the proportion of products of brand B in taxonomy class i. In brands with a wide variety of products offered, it is natural to expect a power-law phenomenon to appear when this vector is sorted. Below is a plot of a large brand’s sorted taxonomy characteristic vector, next to the function .5*(1+x)^(-1.3):

The power law is certainly not a hard and fast rule that brands must follow, but it can highlight areas that are worth verifying by hand. For example, if a beverage company has 99% of its products listed as “Beverages” and a single product listed as a “Plumbing Fixture,” that single product is likely misclassified.
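A minimal Python sketch of the brand characteristic vector follows. It computes the sorted class proportions for one brand and then applies a simple threshold heuristic to surface rare classes. Note that this is a stand-in for illustration, not a power-law fit, and the `dominant_share` and `rare_share` cutoffs are assumed values.

```python
from collections import Counter


def taxonomy_vector(classes):
    """Sorted (class, proportion) pairs for one brand's products, largest share first."""
    counts = Counter(classes)
    total = sum(counts.values())
    return sorted(
        ((cls, n / total) for cls, n in counts.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )


def rare_classes(classes, dominant_share=0.9, rare_share=0.02):
    """Classes worth checking by hand: tiny shares in a brand dominated by one class."""
    vector = taxonomy_vector(classes)
    if not vector or vector[0][1] < dominant_share:
        return []  # no single dominant class, so the heuristic does not apply
    return [cls for cls, share in vector[1:] if share <= rare_share]


# Hypothetical beverage brand: 99 beverages and one stray product.
brand_classes = ["Beverages"] * 99 + ["Plumbing Fixture"]
print(taxonomy_vector(brand_classes))
print(rare_classes(brand_classes))  # ['Plumbing Fixture']
```

The output of `rare_classes` is a list of candidates for manual review, not a verdict; a brand with a genuinely diverse catalog returns an empty list here.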

Clustering techniques are particularly useful when combined with ideas from Natural Language Processing, such as the use of word embeddings. This allows us to mathematically manipulate strings of words, such as product titles or descriptions. For example, we may take the average word vector of a product title and run a k-Nearest Neighbors algorithm over our space of products. This allows us to discover that Product123: “Orange 2 Liter” is probably closely related to Product173: “Orange Soda 2 Liter”, Product735: “2 Liter Orange Juice”, and Product223: “Orange 3 Liter”.

Having discovered this relationship, the KG can then relate the taxonomy classes of the four products above. If Product123 is classified as a “Fruit” whereas Product173, Product735, and Product223 are all labelled as “Beverages,” then the taxonomic classification of Product123 might need to be inspected more closely. While this technique can prove very useful, it is difficult to scale at face value. Thus in practice one must use ideas such as approximate nearest neighbor searching, or downsampling techniques on the search space.
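The idea can be sketched in a few lines of Python. Everything numeric here is made up: the three-dimensional `TOY_EMBEDDINGS` stand in for real word vectors, `Product900` is a hypothetical extra product added for contrast, and the search is brute force, which is exactly the part that would need an approximate nearest neighbor index at scale.

```python
import math
from collections import Counter

# Toy word vectors, invented for this example; real embeddings have hundreds of dimensions.
TOY_EMBEDDINGS = {
    "orange": (1.0, 0.2, 0.2), "liter": (0.0, 1.0, 0.0), "soda": (0.1, 1.0, 0.0),
    "juice": (0.3, 0.9, 0.1), "navel": (0.6, 0.0, 1.0), "bag": (0.0, 0.0, 1.0),
}
TITLES = {
    "Product123": "Orange 2 Liter", "Product173": "Orange Soda 2 Liter",
    "Product735": "2 Liter Orange Juice", "Product223": "Orange 3 Liter",
    "Product900": "Navel Orange Bag",  # hypothetical product, not from the catalog
}
CLASSES = {
    "Product123": "Fruit", "Product173": "Beverages", "Product735": "Beverages",
    "Product223": "Beverages", "Product900": "Fruit",
}


def title_vector(title, embeddings):
    """Average the vectors of the title's known words; None if no word is known."""
    vectors = [embeddings[t] for t in title.lower().split() if t in embeddings]
    if not vectors:
        return None
    return tuple(sum(v[i] for v in vectors) / len(vectors) for i in range(len(vectors[0])))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_neighbors(query_id, vectors, k=3):
    """Brute-force k-Nearest Neighbors by cosine similarity."""
    scored = [(cosine(vectors[query_id], v), pid) for pid, v in vectors.items() if pid != query_id]
    return [pid for _, pid in sorted(scored, reverse=True)[:k]]


def disagrees_with_neighbors(query_id, neighbors, classes):
    """True when the product's class differs from the majority class of its neighbors."""
    majority, _ = Counter(classes[n] for n in neighbors).most_common(1)[0]
    return classes[query_id] != majority


vectors = {pid: title_vector(t, TOY_EMBEDDINGS) for pid, t in TITLES.items()}
vectors = {pid: v for pid, v in vectors.items() if v is not None}
neighbors = nearest_neighbors("Product123", vectors)
print(neighbors, disagrees_with_neighbors("Product123", neighbors, CLASSES))
```

With these toy vectors, the three nearest neighbors of Product123 are the three beverages, so its “Fruit” label is flagged for a closer look.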

The last test employed in our KG quality assessment deals less with the data itself and more with meta-information regarding the data. These are provenance-based reliability tests, and they involve analyzing the source (or sources) of data points. Instacart obtains product-level information from a variety of sources, including retailer inventory files, third-party labelling services, and admin-level spot edits. By aggregating information regarding our confidence in each data source, as well as measuring the level of agreement or disagreement between data sources, we can obtain a resulting reliability level for the underlying data point. Because the source of each data point is tracked, this also allows us to give directed feedback to partners in order to give them a better sense of any inaccuracies in the data they provide.
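One simple way to combine source confidence with agreement is a weighted vote, sketched below in Python. This is only an illustration of the idea: the source names mirror the ones mentioned above, but the confidence weights, the default weight of 0.5 for unknown sources, and the voting scheme itself are assumptions.

```python
from collections import defaultdict

# Hypothetical confidence weights per source; real values would need to be estimated.
SOURCE_CONFIDENCE = {"retailer_file": 0.6, "labelling_service": 0.9, "admin_edit": 0.8}


def provenance_agreement(reports, source_confidence):
    """Given (source, value) reports for one fact, return (best_value, agreement).

    `agreement` is the share of total source confidence backing the best value,
    so 1.0 means every source reported the same value.
    """
    if not reports:
        raise ValueError("at least one report is required")
    weight_by_value = defaultdict(float)
    for source, value in reports:
        weight_by_value[value] += source_confidence.get(source, 0.5)
    total = sum(weight_by_value.values())
    best_value = max(weight_by_value, key=weight_by_value.get)
    return best_value, weight_by_value[best_value] / total


# Three sources report protein per serving for the same product; one disagrees.
reports = [("retailer_file", 12), ("labelling_service", 12), ("admin_edit", 21)]
best, agreement = provenance_agreement(reports, SOURCE_CONFIDENCE)
print(best, round(agreement, 2))
```

A low agreement value could then feed into the yellow or red scoring, and the dissenting source is known, which makes the directed feedback to partners possible.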

Infrastructure

The tests described above give a local picture of how individual data points are tested in the KG. We will now describe the bigger picture regarding how these tests are coordinated together into a more global KG reliability system. As discussed previously, the Instacart KG is continuously rebuilt using a series of Extract, Transform, Load (ETL) pipelines. These pipelines are run to create a staging version of the KG; while data is not scored during this phase of the build process, the ETLs perform a critical task of standardizing and contextualizing data.

For example, strings of information regarding product sizes are parsed into a numeric literal value and a standardized unit of measurement (we use the QUDT ontology for units, available here). This allows us to convert these sizes into a given base unit, such as grams or milliliters, in order to calculate distributions of sizes across certain classes of products.
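A minimal Python sketch of this kind of size standardization is shown below. The small `UNIT_TO_BASE` table is a hand-written stand-in for the QUDT ontology, and the regular expression only handles simple strings of the form "number unit"; both are assumptions made for the example.

```python
import re

# unit -> (base unit, multiplier); a tiny hand-written stand-in for a units ontology.
UNIT_TO_BASE = {
    "g": ("g", 1.0), "kg": ("g", 1000.0), "oz": ("g", 28.349523125), "lb": ("g", 453.59237),
    "ml": ("ml", 1.0), "l": ("ml", 1000.0), "liter": ("ml", 1000.0),
    "fl oz": ("ml", 29.5735295625),  # US fluid ounce
}
SIZE_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-z ]+?)\s*$", re.IGNORECASE)


def parse_size(text):
    """Parse a size string into (value in base unit, base unit), or None if unparseable."""
    match = SIZE_PATTERN.match(text)
    if not match:
        return None
    value = float(match.group(1))
    unit = " ".join(match.group(2).lower().split())  # normalize case and inner whitespace
    if unit not in UNIT_TO_BASE:
        return None
    base_unit, factor = UNIT_TO_BASE[unit]
    return value * factor, base_unit


print(parse_size("2 Liter"))    # (2000.0, 'ml')
print(parse_size("1.5 kg"))     # (1500.0, 'g')
print(parse_size("one bunch"))  # None
```

Returning `None` for strings that cannot be parsed keeps unparseable sizes out of the per-class size distributions instead of silently guessing a unit.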

Once the staging KG has been built, the scoring process begins. To do this, we query the graph using SPARQL to obtain the data in question as well as any relevant contextual information. For example, our nutrition unit tests query for product nutritional information, as well as product taxonomy class and relevant categorical tags, such as “sugar-free”. Once this information has been collected, a class of unit tests is run in sequence. If a unit test fails, we add an entry to a log file which keeps track of which tests failed for which data points.

Based on the set of tests that a data point passed or failed, we then score that data as green (reliable), yellow (questionable), or red (unreliable). This information is then fed back into the KG using a SPARQL update statement. As our data is scored discretely, we use named graphs to store this reliability information. In essence this means we have chopped up our one staging graph into three separate graphs, each of which contains the facts of a given reliability level. One can then query these graphs individually to obtain facts above a certain reliability threshold. While the named graph approach has its limitations, it saves on storage space and is easier to query against when compared to other solutions.
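The scoring step can be sketched in Python as follows. The sketch assumes that each unit test carries a severity (yellow or red) and that the worst failed severity determines the score; that mapping, the test names, and the in-memory dictionary standing in for the three named graphs are all assumptions for illustration rather than a description of the actual SPARQL update.

```python
SEVERITY_RANK = {"green": 0, "yellow": 1, "red": 2}


def score_data_point(record, tests):
    """Run tests in sequence; return (score, names of failed tests).

    `tests` is a list of (name, severity, predicate), where the predicate
    returns True when the record passes.
    """
    failures = [(name, severity) for name, severity, passes in tests if not passes(record)]
    score = max((severity for _, severity in failures), key=SEVERITY_RANK.get, default="green")
    return score, [name for name, _ in failures]


# Hypothetical tests and severities.
TESTS = [
    ("sugar_exceeds_carbs", "red", lambda r: r["sugar_g"] <= r["carbs_g"]),
    ("missing_description", "yellow", lambda r: bool(r.get("description"))),
]

records = {
    "p1": {"sugar_g": 2, "carbs_g": 10, "description": "Vanilla ice cream"},
    "p2": {"sugar_g": 2, "carbs_g": 10, "description": ""},
    "p3": {"sugar_g": 12, "carbs_g": 10, "description": "Cola"},
}

# One bucket per reliability level, mimicking the three named graphs.
graphs = {"green": [], "yellow": [], "red": []}
failure_log = []
for record_id, record in records.items():
    score, failed = score_data_point(record, TESTS)
    graphs[score].append(record_id)
    failure_log.extend((record_id, name) for name in failed)

print(graphs)       # {'green': ['p1'], 'yellow': ['p2'], 'red': ['p3']}
print(failure_log)  # [('p2', 'missing_description'), ('p3', 'sugar_exceeds_carbs')]
```

Querying for facts above a reliability threshold then amounts to reading from a subset of the buckets, for example green only, or green and yellow together.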

A key benefit to the use of unit tests which test highly specific pieces of information is that it allows us to track the quality of our data at a granular level. After the data has been scored, we maintain logs detailing which tests failed and with what frequency. These logs can be aggregated over classes of products, such as taxonomy categories, to look for areas with hot spots of errors. This information can be passed to upstream data providers to make it easier to find and correct data inaccuracies at the source. All of this can further be tracked over time as a means of measuring progress or degradation in our KG’s quality with each subsequent build.
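Aggregating the failure logs to find hot spots can be as simple as counting, as in the Python sketch below. The log entry shape (a dictionary with `taxonomy_class` and `test` keys) is an assumption for the example.

```python
from collections import Counter


def failure_hot_spots(log_entries, top_n=3):
    """Count failures per (taxonomy class, test) pair and return the most frequent ones."""
    counts = Counter((entry["taxonomy_class"], entry["test"]) for entry in log_entries)
    return counts.most_common(top_n)


# Hypothetical log entries from one KG build.
log_entries = [
    {"product": "p3", "taxonomy_class": "soda", "test": "sugar_exceeds_carbs"},
    {"product": "p7", "taxonomy_class": "soda", "test": "sugar_exceeds_carbs"},
    {"product": "p9", "taxonomy_class": "soda", "test": "sugar_exceeds_carbs"},
    {"product": "p2", "taxonomy_class": "ice cream", "test": "missing_description"},
]

print(failure_hot_spots(log_entries, top_n=1))  # [(('soda', 'sugar_exceeds_carbs'), 3)]
```

Storing these counts per build makes it possible to compare them across builds and see whether a given class of errors is shrinking or growing.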

Moving Forward

The methods described above have provided the Instacart KG with a baseline reliability system that more advanced techniques can be built on top of; see [3] for a great overview. Promising directions of exploration include techniques from statistical relational learning, which give a much more refined notion of reliability scoring in knowledge graphs [4]. Methods such as DeFacto would allow us to invoke the “wisdom of the crowd” to verify KG facts by querying the Web [5]. And despite the black box nature of certain ML techniques, methods such as KG embeddings provide promising techniques for analyzing information inside of a knowledge graph [6].

Acknowledgements

Many thanks to the other members of the Instacart KG team for their constant support, advice, and feedback regarding this project: Omar Alonso, Bill Andersen, Chuan Lei, Wei Peng, and Rachel Zhang. Additionally, thank you to Haixun Wang, Jonathan Newman, Min Xie, Vamsi Madabhushi, and Saurav Manchanda for useful discussions throughout the project.

Interested in working on problems like these? Instacart is hiring Machine Learning PhD interns in San Francisco and Toronto.

References

[1] Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A Review of Relational Machine Learning for Knowledge Graphs. Proceedings of the IEEE, 104: 11–33, 2016.

[2] Dominik Wienand and Heiko Paulheim. Detecting Incorrect Numerical Data in DBpedia. In The Semantic Web: Trends and Challenges: 504–518. Springer, 2014.

[3] Heiko Paulheim. Knowledge Graph Refinement: A Survey of Approaches and Evaluation Methods. Semantic Web Journal 8 (3): 489–508, 2017.

[4] Angela Kimmig, Stephen H. Bach, Matthias Broecheler, Bert Huang, and Lise Getoor. A Short Introduction to Probabilistic Soft Logic. Proceedings of NIPS Workshop on Probabilistic Programming: Foundations and Applications. 2012.

[5] Jens Lehmann, Daniel Gerber, Mohamed Morsey, and Axel-Cyrille Ngonga Ngomo. DeFacto — Deep Fact Validation. In Proceedings of the 11th International Semantic Web Conference (ISWC 2012): 312–327. Springer, 2012.

[6] Wen Zhang, Shumin Deng, Han Wang, Qiang Chen, Wei Zhang, and Huajun Chen. XTransE: Explainable Knowledge Graph Embedding for Link Prediction with Lifestyles in e-Commerce. In Semantic Technology, JIST 2019: 78–87. Springer Singapore, 2019.


Red Means Stop. Green Means Go: A Look into Quality Assessment in Instacart’s Knowledge Graph was originally published in tech-at-instacart on Medium.

Source: Instacart