Don’t let the crows guide your routes

Replacing Haversine distance using a simple Machine Learning model Authors: Reza Faturechi & Jagannath Putrevu At the core of Instacart’s grocery delivery service is our Logistics engine that matches our customers’ orders with our shoppers. In any given metro area, we can fulfill from hundreds of store locations, delivering within as fast as two hours for […]

Replacing Haversine distance using a simple Machine Learning model

Authors: Reza Faturechi & Jagannath Putrevu

At the core of Instacart’s grocery delivery service is our Logistics engine that matches our customers’ orders with our shoppers. In any given metro area, we can fulfill from hundreds of store locations, delivering within as fast as two hours for a majority of our orders. To do this effectively, our Logistics engine processes billions of computations every minute to find the most optimal routes to assign to our shoppers. More specifically, we solve a Vehicle Routing Problem, which we introduced before in our post Space, Time, and Groceries. Very often our algorithm attempts to group together orders of multiple customers who live close to each other to suggest efficient routes to shoppers and reduce the number of miles traveled on a trip.

An illustration of how we group orders together to assign to our shoppers

If our algorithm makes poor decisions with the grouping of orders, it could delay the timely delivery of our customers’ groceries. For the algorithm to make the best decisions, we need to pass inputs that are as accurate as possible.

Types of distances

One of the most important inputs required is the distance between any two origin and destination pairs. Computing distance is not as trivial as we think. Obviously, it’s hard to predict how exactly someone would drive from point A to point B on a map but with enough data, we can get to a reasonable approximation of the distance traveled and time taken. But when we started out, we didn’t have the luxury of large amounts of data to infer the right path between two points. So we had to rely on other methods to compute the distance.

Haversine Distance

The most commonly used formula for distance computation is called Haversine Distance, or more idiomatically, as the crow flies.

A crow would fly in the shortest path between two points, unlike how a car would have to drive

This can be computed easily using a mathematical formula:

We started off using Haversine distance as the key input formula for all our distance computations and designed most of our machine learning models and algorithms using it. But this way of computing distance can be pretty inaccurate. For example, when we compared Haversine distances to the actual distances our shoppers traversed in a region like Orange County, we observed a 33% mean absolute percentage error (MAPE).

Mapbox Generated Distance

The other way of computing distance is through a third-party API like Mapbox, whom we partner with. We found the Mapbox estimated distances to be much more accurate than Haversine with a mean absolute percentage error (also in Orange County) of only 5%.

Comparing the error distributions of Haversine distance and Mapbox distance

While this can be pretty accurate, it can also be pretty expensive. Since we go through billions of computations before selecting the best set of routes, we would have to query these third-party services billions of times every day, which is not viable. But we can still rely on this data to a certain extent through caching our calls to these services. Every time we offer a trip to our shoppers, we call Mapbox to show the route in our Shopper App.

When we make this call, we also cache the distances returned from Mapbox for each of the origin and destination pairs in the route. So the next time we come across the same pair of locations, we don’t have to re-query the service and just fetch the distance from our cache.

However, this still doesn’t cover all the potential combinations we have to consider before choosing the best set. Every day, new customers sign up on Instacart for the first time, we keep adding new retailers to our platform, and our shoppers can be located anywhere in a city we deliver from, which makes this a very dynamic problem.

So we decided to leverage our ever-increasing historical data of cached distances and actual distances traversed by our shoppers to build a Machine Learning model to predict distance.

A simple model to predict distance

To build a model, we fetched all the historical shopper trips data and extracted the origin & destination location coordinates, the actual distance traveled by our shoppers, and the cached Mapbox distance if available.

We also observed that the actual distance data can be noisy since it relies on anonymized GPS data from mobile phones. So we also had to use some heuristics to prune out some outliers from the actual distance data. For the bad outliers, we replaced that data with the cached Mapbox data wherever available.

We used this dataset to build an XGBoost model which performed best, both in terms of error and robustness, compared to other models such as Regularized Linear Regression and Random Forest. We built the model with five features: latitude and longitude information of origin and destination points and their corresponding Haversine distance. The model was trained to predict the corrected actual distance as the target variable. Given the semi-static nature of distance, we set up an automatic job to retrain the model on a weekly basis.

Architecture

High-level overview of the systems architecture

Sample Code

Sample code for training the ML model

Performance

The average MAPE of predicted distance we observed in a region like Orange County was 11% which is much smaller than that of Haversine distance (33%). In the chart below, you can see the comparative distribution of predicted and haversine distance errors:

Comparing the error distributions of Haversine distance and Predicted distance

We can also see that the Mapbox distance is still more accurate than the model predicted distance. So wherever we have cached Mapbox distances available, we continue to use those and supplement missing distances from cache with the ML model predicted distances.

Impact

As we mentioned earlier, accurate distance computation is very important for us to plan good routes and keep our customers and shoppers happy. With a simple model described above, we were able to improve the accuracy of our distance computation and were able to avoid planning some bad routes we would’ve created otherwise using only Haversine distance. Using this model, we were able to reduce the distance traveled by our shoppers for multi-order trips by 9% across all the areas Instacart operates in, leading to millions of miles saved per year.

Want to work on the next generation of Instacart Logistics? Our Marketplace team is hiring! Go to instacart.com/careers to see our current openings.


Don’t let the crows guide your routes was originally published in tech-at-instacart on Medium, where people are continuing the conversation by highlighting and responding to this story.

Source: Instacart