How to define experiments with KPIs
Hello all, and welcome to Teach Me How 2 Product. I won’t help you dance, but I will help you make better products. I’m glad you could join us!
I have been struggling with the kinds of topics that I would cover here for a long time. As I’ve been discussing the role and function of product managers with others, however, the topics started falling out. It’s clear that materials about how to do the job of product ownership and management are still in demand, which is awesome! I hope these posts help you, dear reader, build and manage better products.
This first topic comes from my very sharp brother-in-law at Uber, and it’s a great topic to start with:
“How should you prioritize which experiments to run? And how should you choose KPIs affected?”
Both are great questions! I’m actually going to answer them in reverse, and I’m going to add a post in the middle to discuss a relevant topic: picking metrics that ultimately determine success or failure of your business.
Each subject requires pretty deep thought, and I was finding it difficult to go deep on each one without separating them a bit. So, welcome to the three-part series that you didn’t know you were a part of!
We’re going to use this framework to help frame each step in the process and talk about how all of this feeds into your wonderful testing roadmap.
So, how should you choose KPIs affected by your experiment? Let’s start by taking a look at the first step in our framework: experiment definition.
Defining an experiment
The very first thing you should be doing as a product person for any experiment is defining the experiment’s observations and hypothesis. These two critical pieces will ultimately help you determine what the key metrics should be for your test. Let’s dive into each component.
Observations are things that you have observed that made you decide to run the experiment in question in the first place. They can be things like:
Learnings that came as a result of previous experiments that you ran. An example of this could be “oh crap, I tanked the number of users entering our checkout when I made the button purple.” That’s a learning! Put it in there champ!
Insights from user behavior analysis of your experience. “Man, if people click on this button that doesn’t go anywhere, they convert at a rate 50% higher than average.” Winner! That totally counts!
Insights found in usability study or observed in user feedback. “We were letting users use our experience in the usability laboratory, and this one user was confused by the screen we showed her. So, improving it should positively impact our key metrics.” Totally cool and awesome to use here.
I would recommend not using things like:
“Because our CEO told me to.” Not a good look, especially when it’s not founded in any sort of data or fact that would materially benefit the user or the business. Use your brain friend.
Insights unrelated to the test at hand. Again, use your brain. If you’ve gotten this far, you probably have a pretty good one in your skull there. If you’re using an observation from a social channel study like “people use the Facebook button to share more over the Google button” to determine how often someone will use Apple Pay in checkout, that’s not going to, like, work.
Really old insights. The way you A/B test has likely changed dramatically since then. You have probably also changed the product materially over time. So, a test you ran four years ago may not win or lose the same way if you ran it again today. Thus, really old insights likely won’t apply. I’d recommend a cutoff of two years to start, but that could depend on how fast your product is moving.
Your hypothesis is a statement about what you think will happen if you run this experiment, and how it will change user behavior and, ultimately, your key metrics.
Hypotheses usually materialize using the following format:
By ______, users will ______ and we will ________.
In this format, you want to capture the following things that are important as far as what you decide to measure:
What exactly you are going to be doing in your experiment
What you anticipate users will do when they encounter your experiment
What outcome, business or otherwise, you expect to see as a result of users interacting with your experiment
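If it helps to see the pieces laid out, here is a minimal sketch of an experiment definition as a small Python structure. The class name, field names, and the filled-in values are all made up for illustration (the observation is borrowed from the usability example earlier); this is just one way to keep the observations and the three blanks of the hypothesis together, not a required template.

```python
from dataclasses import dataclass


@dataclass
class ExperimentDefinition:
    """Hypothetical container for an experiment's observations and hypothesis."""

    observations: list[str]  # why you are running the experiment at all
    change: str              # what exactly you are going to do
    user_behavior: str       # what you anticipate users will do
    outcome: str             # the business (or other) outcome you expect

    def hypothesis(self) -> str:
        # Fill in the "By ___, users will ___ and we will ___." format.
        return (
            f"By {self.change}, users will {self.user_behavior} "
            f"and we will {self.outcome}."
        )


# Illustrative values only; none of this is real data.
definition = ExperimentDefinition(
    observations=[
        "A user in the usability laboratory was confused by the screen we showed her.",
    ],
    change="simplifying the screen",
    user_behavior="complete the step more often",
    outcome="see an increase in conversion",
)

print(definition.hypothesis())
```

Writing it down this way makes it obvious when one of the three blanks is still empty, which usually means you are not ready to pick key metrics yet.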
Defining Your Experiment’s Key Metrics
Finally, the moment you’ve all been waiting for. “WHY WON’T HE ANSWER THE QUESTION ALREADY?!” This is why dear readers.
Your key metrics for a test should capture the user behavior and, ultimately, the objective behind the hypothesis you have laid out for your test.
Without the work you did beforehand for your observations and hypothesis, your key metrics will just be a random guess at why your experiment is important to begin with. Thus, it’s really important to do the work upfront before you get here.
Okay, you’re here now! I find that examples really help me understand how to apply frameworks like the one above. So, let’s talk about what I mean by the statement about key metrics above with an example.
Example #1: Search Bar on the Vrbo iOS Home Screen
Since it’s my blog, I’m going to go with an example that I’m personally familiar with: adding a search bar to the top of our Vrbo iOS app home screen. For those of you that don’t know, Vrbo is a vacation rental marketplace app on iOS and Android where you can find great places to book when traveling with family and friends.
(For the record, this is not a test that we ever ran. But go with me here.) Visual example below.
Why are we considering running a test like that? Well, let’s say we learned a few things. We learned that:
When we ran a previous test that put the same search bar at the top of our search results screen (also hasn’t happened… yet! But go with me), we found that users ran 5% more searches in the variant than they did in the control
Some analysis that your boss forwarded you showed that the more searches a user did, the more likely that user was to convert into a booking
When we had the iOS app in a usability study, we saw that users struggled to figure out how to search when they got to the home screen.
TADAAAA!! Those are observations. You’ll have to figure out how to phrase them well, but there they are. They are the basis for why you are even considering running this test.
Okay, so now I want to construct the hypothesis for this test that is based on the observations above. Given the above and the test I’m considering, I would probably construct my hypothesis as follows:
“By putting a search bar at the top of the home screen, more users will view any property from search results, and we will see an increase in booking conversion.”
Whoa hang on Ajay, why did you change the format of the hypothesis?! Why did you not use searches performed per your observations???
First, the ad lib format is a construct. I changed the construct a little bit so the sentence used proper grammar and I didn’t sound like an idiot. But the basic framework still applies.
More importantly, the second part about not using searches performed. Here are the reasons I am not using that metric:
It’s a bit obvious that searches performed should go up based on the observations. Setting that as your key metric will not give you a sense of whether the experience changed in a materially better way for the user. If more searches are performed AND a user sees any property from search results, NOW we’re cookin’. I now know that the traveler has done another action that is a critical engagement action, along with what I had already expected. You want to ensure that the key metric you select ultimately captures the user’s intent that you are looking for.
The more key metrics you select, the more statistical corrections need to be applied to your results when you do statistical hypothesis testing. It’s complicated, but a good rule of thumb is that your p-value goal will have to be divided by the number of success metrics you use in your test. It’s a real bummer. That doesn’t mean you can’t use other metrics as observations for your next test, but it does mean you can’t go HAM and pick 100 metrics.
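To put numbers on that rule of thumb, here is a quick sketch. Dividing your p-value goal by the number of success metrics is commonly known as the Bonferroni correction. The function name is mine, and the 0.05 goal is an assumption (a common default, not something every team uses).

```python
def per_metric_threshold(p_value_goal: float, num_metrics: int) -> float:
    """Divide the overall p-value goal evenly across the success metrics."""
    if num_metrics < 1:
        raise ValueError("You need at least one success metric.")
    return p_value_goal / num_metrics


# Assumed overall p-value goal of 0.05; swap in whatever your team uses.
for num_metrics in (1, 2, 5, 100):
    threshold = per_metric_threshold(0.05, num_metrics)
    print(f"{num_metrics} metric(s): each one needs p < {threshold:.4f}")
```

With 100 metrics, each one would need to clear a bar 100 times stricter than a single metric would, which is why picking one or two key metrics keeps the test winnable.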
Given the observations and the hypothesis, I’d propose that we set the following key metrics:
Visitors that view any property from search results
Okay, let’s do this again with another example from my brother-in-law’s company, Uber.
Example #2: Recent searches in the Uber iOS app
Here’s another example. Let’s say we’re working on the Uber rider team, and the app didn’t have recent searches yet. In that case, you would have to search for every destination that you wanted to travel to manually. In the variant, we would add in recent searches to your experience that you could tap on to book a ride to your destination. Visual example below.
Why are we considering running a test like that? Well, again, let’s say we learned a few things. We learned that:
Watching users in the usability laboratory have to type in every destination they want to go to every time is super painful and makes our eyes bleed. Excellent observation and very true. Sounds super painful.
Other travel apps and shopping apps include recent searches in their typeahead experience. A good observation since they likely have figured something out about that experience that can be tested and, ultimately, leveraged.
In our data, the percentage of users that initiate the flow to select a destination and actually end up searching for a destination is 40%. Based on additional analysis, we believe that can be improved. Great observation.
Based on the above observations and given the test I want to run to help these poor souls, I’ve come up with the following hypothesis:
“By placing recent searches on the empty state of the destination search experience in the Uber app, more users will select the ride type they want and we will see an increase in booked ride conversion.”
Here’s the screen after a destination is selected from a recent search you have performed:
Per our previous example, in this case selecting the kind of ride type you want is the next step. If we did our jobs properly, we should see more users confirming the type of ride they want in the next step that they need to take before booking a ride.
“But Ajay, what about the cases where people don’t try to select a different ride?” Well dear reader, the default selection should be included in the data set if we pick the right KPI. So, if someone does nothing, their selection of the default should be included.
Users might also need to do things like change their credit card or schedule a ride before booking one. This would be captured in the booked ride conversion metric.
Given the observations and the hypothesis, I’d propose that we set the following key metrics:
Visitors that selected a destination and selected a ride type before booking a ride
Booked ride conversion
For that first number, what we’re really looking for is changes in the conversion funnel for Uber. We’re not necessarily looking at changes to absolute percentages. Rather, we’re looking at conditional numbers, i.e. numbers that depend on what a user did previously in the funnel before they got to the next step.
Here’s a visual to show you what I mean. Below is a depiction of our funnel. The dotted line pieces on top of the bars in the funnel are the changes we are hoping to see when we run the test. In the funnel, each step depends on the previous step. For example, a user has to initiate destination selection by tapping on the search box before they actually select a destination. That’s why the KPIs are written the way they are homie.
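If you want to see the conditional math, here is a rough sketch. The step names and counts are made up for illustration (I only kept the 40% from the observation above for the first step), and the function name is mine. Each rate is computed against the step right before it, not against everyone who opened the app.

```python
def step_conversions(funnel):
    """Return (step name, share of the previous step's users who reached it)."""
    rates = []
    for (_, prev_count), (name, count) in zip(funnel, funnel[1:]):
        # Conditional rate: users at this step divided by users at the prior step.
        rate = count / prev_count if prev_count else 0.0
        rates.append((name, rate))
    return rates


# Hypothetical counts, ordered from the top of the funnel to the bottom.
control_funnel = [
    ("initiated destination selection", 1000),
    ("selected a destination", 400),
    ("selected a ride type", 300),
    ("booked a ride", 240),
]

for name, rate in step_conversions(control_funnel):
    print(f"{name}: {rate:.0%} of the previous step")
```

You would run the same calculation for the variant and compare step by step, which is what the dotted lines in the funnel visual represent.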
And there you have it! That’s how you use the observations and hypothesis framework to set key metrics for your experiment.
From here, we’re going to move on to scoping, impact sizing, and understanding the key metrics that are the engine for your business. See you in the next post!
You might also be wondering about test duration, i.e. how long you can run your test and why. That’s a whole other post by itself, but it has to do with baseline conversion rates (BCR), minimum detectable effect (MDE), and the resulting sample size. Mind bender, but I can talk about it if you’re interested, let me know!
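As a teaser, here is a rough sketch of how BCR and MDE turn into a sample size and a duration. It uses a common normal-approximation formula for comparing two conversion rates. The assumptions are all mine: the MDE is treated as a relative lift, the defaults are a 0.05 significance level and 80% power, traffic is split evenly across two variants, and the function names are hypothetical. Your testing tool may calculate this differently.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(bcr, mde_relative, alpha=0.05, power=0.8):
    """Approximate users needed in EACH variant (two-sided test)."""
    p1 = bcr                         # baseline conversion rate
    p2 = bcr * (1 + mde_relative)    # rate you want to be able to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def estimated_days(users_per_variant, daily_visitors, variants=2):
    """Days needed if daily_visitors are split evenly across the variants."""
    return math.ceil(users_per_variant * variants / daily_visitors)


# Hypothetical inputs: 10% baseline conversion, 10% relative lift, 5,000 visitors a day.
needed = sample_size_per_variant(bcr=0.10, mde_relative=0.10)
print(needed, "users per variant")
print(estimated_days(needed, daily_visitors=5000), "days")
```

The thing to notice is the squared term on the bottom: halve the lift you want to detect and the sample size you need roughly quadruples.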
P-value is something pretty important to understand. The p-value is the probability that, when the null hypothesis is true, the test statistic would be at least as extreme as the one you actually observed. Basically, it tells you how likely you would be to see a result like yours from statistical noise alone, if your change really made no difference. The lower the value, the harder it is to write off your result as noise. Complicated I know, but your understanding of it will get better the more you use it.
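If you want to see one calculated, here is a minimal sketch using a two-proportion z-test, which is one common way (not the only way) to get a p-value for a conversion metric. The function name and the counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_p_value(conv_control, n_control, conv_variant, n_variant):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_control = conv_control / n_control
    rate_variant = conv_variant / n_variant
    # Under the null hypothesis both groups share one pooled conversion rate.
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    std_error = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = (rate_variant - rate_control) / std_error
    # Probability of a z-score at least this extreme in either direction.
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical results: 100 of 1,000 converted in control, 150 of 1,000 in the variant.
p_value = two_proportion_p_value(100, 1000, 150, 1000)
print(f"p-value: {p_value:.4f}")
```

You would compare that number to your p-value goal (after any correction for multiple key metrics) to decide whether to call the result.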
A/B testing is a tool in the tool belt, but it isn’t the only tool. Hit me up if you’d like to hear more about a framework there.
This is my first blog post in a long time, would love to hear how y’all like it!