A/B Testing in Search
I joined the Search team in February as their Product Manager after 2 years in the Data Science department. I’ve always admired how deeply the team has integrated analytics into the Search algorithm. As I began to ramp up in the role, I spent hours reading the documentation of the various A/B tests that have been run since GIPHY started – and there were a LOT! In this blog post, I’ll delve into how A/B testing works for our Search team, explain how we evaluate tests, and discuss some test examples. But first, a little bit about me.
As a data scientist, I partnered with teams to help them understand how their users interacted with our products so they could make data-driven decisions. One of the reasons I wanted to switch to product was to act on these insights I’d been drafting for so long!
Since this is my first product role, moving to a team with a highly data-driven mission has made the transition exciting. Every day, I start my morning by monitoring our dashboards to look into data fluctuations – with such a massive user base, it’s generally easy to see if there are immediate trends or data inconsistencies we need to address. Even better, my team does the same, so we are all speaking the same language about our product’s impact.
We use both the aggregate data and A/B testing results to inform our product development and prioritization. If we start to notice a specific metric lagging, we’ll first hypothesize why that may be happening, then brainstorm solutions and possible product iterations – then A/B test them! My background in Data Science has come in handy, as I can pull this analysis myself and understand what our engineers are seeing.
For starters, what is an A/B test? Most PMs may be familiar with the idea of tweaking various components of a website or an app, and running experiments to see which variant improves user engagement, retention, or another business KPI.
Our search tests run in the same manner – we test out different versions of our algorithm on small sets of users. GIPHY’s search algorithm serves hundreds of millions of users every day across many integrations. Even small tweaks to our algorithm can affect the user experience on a huge scale, so limiting these tests helps us to measure impact before releasing to everyone.
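To make the idea of "small sets of users" concrete, here is a minimal sketch of one common way to assign users to a test: hashing the user ID into a stable bucket. This is an illustration only, not a description of GIPHY's actual system; the function name `assign_variant`, the `exposure` parameter, and the even control/variant split are all assumptions.

```python
import hashlib


def assign_variant(user_id: str, experiment: str, exposure: float = 0.05) -> str:
    """Return 'variant', 'control', or 'not_in_test' for a user.

    Hypothetical helper: `exposure` is the fraction of all users enrolled
    in the experiment, split evenly between control and variant.
    """
    if not 0.0 <= exposure <= 1.0:
        raise ValueError("exposure must be between 0 and 1")

    # Hash the experiment name together with the user ID so the same user
    # can land in different buckets for different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()

    # Map the first 8 hex digits (32 bits) to a number in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000

    if bucket < exposure / 2:
        return "variant"
    if bucket < exposure:
        return "control"
    return "not_in_test"


if __name__ == "__main__":
    # The same user always gets the same assignment for a given experiment.
    print(assign_variant("user-123", "happy_birthday_rerank", exposure=0.10))
```

Because the assignment depends only on the user ID and the experiment name, a user sees a consistent experience for the duration of the test, and no assignment table needs to be stored.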
As an example, let’s say we made a change to our algorithm that affected searches for “happy birthday” that we hypothesized improved user engagement. All users right now may be seeing the control version of the algorithm, where the cake GIF is served first. In the variant, the digitized text GIF is served first.
You may be thinking that since our main product is a Search algorithm, our main KPI would be click-through rate (CTR), which measures how often a user clicks on a GIF for a particular search query. Though CTR is certainly a primary team KPI, we use a holistic set of metrics that more accurately captures our core values as a Search team.
CTR is a good proxy for engagement, as it tells us whether our users are clicking on our results. However, we also measure relevance by looking at the ranking position in which the clicked content was served. As you can imagine, scrolling through hundreds of GIFs to find the best one certainly isn’t an ideal user experience! Additionally, we’ve developed more GIPHY-specific metrics to measure two of our core values – diversity and serendipity. These are only a few examples; we’re constantly evaluating additional metrics to quantify how well our algorithm is serving users.
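As a rough illustration of the first two metrics, the sketch below computes CTR and the average position of clicked results from a toy search log. The `SearchEvent` record and both function names are hypothetical, and the definitions used here (clicks per search, 1-based rank of the clicked GIF) are assumptions rather than GIPHY's exact formulas.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SearchEvent:
    """One search; hypothetical log record used only for this example."""
    query: str
    clicked_position: Optional[int]  # 1-based rank of the clicked GIF, None if no click


def click_through_rate(events: List[SearchEvent]) -> float:
    """Fraction of searches that ended in a click."""
    if not events:
        return 0.0
    clicks = sum(1 for e in events if e.clicked_position is not None)
    return clicks / len(events)


def mean_click_position(events: List[SearchEvent]) -> Optional[float]:
    """Average rank of clicked results; lower means less scrolling."""
    positions = [e.clicked_position for e in events if e.clicked_position is not None]
    if not positions:
        return None
    return sum(positions) / len(positions)


if __name__ == "__main__":
    log = [
        SearchEvent("happy birthday", 1),
        SearchEvent("happy birthday", None),
        SearchEvent("cat", 3),
        SearchEvent("dog", None),
    ]
    print(click_through_rate(log))   # 0.5
    print(mean_click_position(log))  # 2.0
```

Looking at both numbers together matters: two versions of an algorithm can have the same CTR while one of them makes users scroll much further to find the GIF they click.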
Once we know which metrics we’re attempting to influence, we develop a hypothesis. The most common example is that by introducing a new feature to the algorithm, we expect CTR to increase. Once the test is launched on our sampled users, we can monitor the differences in CTR (as well as all other metrics) to see if there is a statistically significant effect. Statistical significance means that the difference in CTR is unlikely to be explained by random chance alone, so we can reasonably attribute it to the algorithm change, since we’ve compared the control vs. the variant on a large enough pool of sample data.
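For readers who want to see what a significance check can look like, here is a minimal sketch of a two-proportion z-test, one standard way to compare CTR between control and variant. The post does not say which statistical test the Search team uses, so treat the choice of test, the function name, and the example counts as assumptions.

```python
import math
from typing import Tuple


def two_proportion_z_test(
    clicks_control: int, searches_control: int,
    clicks_variant: int, searches_variant: int,
) -> Tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference in CTR."""
    if searches_control <= 0 or searches_variant <= 0:
        raise ValueError("both groups need at least one search")

    ctr_control = clicks_control / searches_control
    ctr_variant = clicks_variant / searches_variant

    # Pooled CTR under the null hypothesis that both groups share one rate.
    pooled = (clicks_control + clicks_variant) / (searches_control + searches_variant)
    std_err = math.sqrt(
        pooled * (1 - pooled) * (1 / searches_control + 1 / searches_variant)
    )
    if std_err == 0:
        raise ValueError("CTR is 0 or 1 in both groups; the test is undefined")

    z = (ctr_variant - ctr_control) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    # Made-up counts: 10.0% CTR in control vs. 11.0% in the variant.
    z, p = two_proportion_z_test(1000, 10000, 1100, 10000)
    print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value (commonly below 0.05, though the threshold is a choice) suggests the observed CTR difference would be unlikely if the two versions truly performed the same.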
After the test runs, we draft documentation of the test, the hypothesis, and the results. As a team, we then discuss whether to launch the experiment to our entire user base.
Let’s return to our “happy birthday” example, where we made an update to our algorithm that we thought would have a positive impact on user engagement. To determine whether it was a success, we’d measure whether there was a statistically significant lift in CTR among users who received the variant. If we saw that the test was a winner, we’d ship it to everyone. Since “happy birthday” is such a popular search term, restricting this test to a small group of users helps us to gain insights quickly, without affecting all our users’ experiences.
Sometimes, it’s not so cut and dried, as I’ll show in two real-world examples.
Our trending feed is a pre-search experience, meaning that before you even type out what you’re searching for, we display some GIFs we think you might like. We’ve been iterating on our model to both improve user engagement and promote timely, diverse partner content. We recently ran an experiment to update our trending feed model. With this specific test, we hoped to see a negligible impact on CTR, while improving our content diversity scores.
After we ran the analysis, we saw a decrease in CTR. Because this was a worse outcome than we had hypothesized, we looked to our other business metrics to decide whether to launch. We saw large, statistically significant improvements in our content diversity scores. Even though our overall Search KPI of CTR dropped, we determined that the trade-off was worth it to improve the other metrics. If you’re an avid GIPHY user, take a peek to see what our new trending feed looks like!
The GIPHY Signals team focuses on building out various machine learning models to enrich our data. One such model they built used computer vision to detect actions occurring in GIFs – such as high fiving, running, smiling, and more. We tested incorporating this signal into our algorithm, hypothesizing that it would improve CTR.
After the test, we didn’t observe any statistically significant changes to CTR, nor did we see any positive impact on other business metrics. We decided not to ship the feature. Ultimately, if algorithm updates increase model complexity without improving the user experience, we generally choose not to ship them.
While some may have seen this as a disappointment for the Signals team, it turned into a learning opportunity to tune the model in the future. The team is constantly iterating based on testing feedback to build out models that have an impact on our user experience, and this was no exception.
One of my biggest roles on the team is to take technical concepts and explain them to the rest of the company. Not everyone understands (nor needs to understand!) what statistical significance is. What they should know, though, is that by improving our search algorithm, we’ll help users to find GIFs, Stickers, and Clips more easily. As the PM, it’s already been so rewarding to see our products benefiting users making searches as well as our integration partners developing apps using our API.
A/B tests are only one part of the product development life cycle, but I’ve learned that on the Search team, they are a critical piece of our algorithm iterations. Hopefully, you’ll be inspired to incorporate some form of A/B testing into your own process!
— Alex Anderson, Product Manager, Search team