A guide to running A/B tests

I have wanted to write about experimentation for some time now, and just earlier this week I had an epiphany (I might be exaggerating a bit 😅): running continuous experimentation (A/B testing) is one important trait successful product and marketing managers share. So I took this as my starting point to write about experimentation.

Coming from the product side of things, I thrive on running experiments around activation, monetization, or engagement. And despite having watched marketing and acquisition teams run and test thousands of campaigns over the past years, it only clicked this week how many similarities there are (to be fair, I had not really thought about it before, but I recently spent a bit more time than usual working on the acquisition side).

There are experiments, both for acquisition campaigns and product changes, where you do your due diligence on defining the hypothesis, expected impact, allocation, and so forth. But there are also experiments that you simply need to run because, as Socrates put it, “I know that I know nothing”. Here are two examples of the latter:

  1. Using testimonials as social proof works incredibly well for certain apps, but not at all for others. Similarly, certain creatives will work tremendously well when advertising some apps, but not at all for others.
  2. Pricing of mobile apps on the product side, and bidding on ads on the marketing side. You can approach both in an incredibly structured way (and you should, to some extent), but there comes a point where you simply need to experiment to figure out the best solution.

Running small experiments and their caveats

Everyone has heard the tales of how changing the button color from 🔵 to 🔴 increased conversion rates by xx%. Should you also experiment with changing button colors? Is this really meaningful? The answer is: it depends…

⏳ The truth is that many startups or early-stage companies run experiments that they should not spend a minute working on. Why? Because in the beginning you need to be more radical. If you only get a few hundred visitors/installs every day, any experiment will take ages to reach statistical significance (the sketch below shows just how long). So it is not worthwhile to run experiments where you only change minor things such as button colors or single lines of copy.
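
To make the “ages” point concrete, here is a minimal sketch of the math using Python’s statsmodels; the baseline conversion rate, uplift, and traffic numbers are assumptions for illustration, not figures from any real app.

```python
# Rough sample-size estimate: how long would a small site need to detect a +10%
# relative uplift on a 5% baseline conversion rate? (All numbers are assumptions.)
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_cr = 0.05        # current conversion rate (assumed)
relative_uplift = 0.10    # smallest uplift worth detecting (assumed)
daily_visitors = 300      # visitors per day across both variants (assumed)

# Effect size (Cohen's h) between the uplifted and the baseline conversion rate
effect = proportion_effectsize(baseline_cr * (1 + relative_uplift), baseline_cr)
n_per_variant = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8)

days_needed = 2 * n_per_variant / daily_visitors
print(f"~{n_per_variant:,.0f} users per variant, i.e. roughly {days_needed:,.0f} days of traffic")
```

With a few hundred visitors a day, a subtle change like a button color can easily demand months of runtime, which is exactly why radical changes are the better use of that traffic.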

💪 Instead, you need to focus on more radical ideas, in particular if you have collected feedback or insights suggesting that your current screen/page is not working as well as intended. Try a vastly different design approach instead of just changing one button or the CTA copy. If it does not work, revert and try another approach.

🚄 Companies with high traffic and well-oiled experimentation processes can run multiple experiments at the same time, including minor ones around changing lines of copy or button colors. I have seen such small changes contribute to decent increases in conversion rates, and they need little to no development effort. These small changes, however, should never take up the majority of your experimentation backlog; they should only complement it. Otherwise, you risk that innovation grinds to a halt.

The most popular areas for running experiments

In the past, teams have reached out to me asking to work on something other than onboarding for a change, after running one experiment after another on the onboarding and activation funnel. The truth, however, was that we needed to keep going: we were still making significant progress on the activation journey, leading to higher conversion rates, while the product’s retention remained stable.

🧪 Running experiments (A/B tests) is crucial for improving acquisition, activation, monetization, and retention, and thereby the profitability of your service. In my experience, activation journeys and pricing in particular are prime candidates for (heavy) experimentation. See below for a few examples 👇

  1. Activation: This is where you have the largest volume of users to experiment with, which not only reduces the time it takes your experiments to reach statistical significance but also makes them extremely impactful. Any uplift you accomplish during activation automatically raises the starting point for retention.
  2. Pricing: There are so many people out there afraid of increasing their prices. At the same time, pricing is one of the largest levers you have for increasing your product’s profitability. Run a willingness-to-pay survey or simply change your pricing and see what happens. I managed to increase prices by 4x for an app that was already considered more expensive than the competition while only minimally reducing conversion rates.
  3. Retention: Find features and feature usage that correlate with increased retention, and then experiment with increasing usage of these features or introducing related features. Instead of doing the correlation math yourself, you can use Amplitude’s Compass feature, or run the math yourself with the sketch shown after this list.
  4. CRM: Do not forget to experiment with CRM and messaging. Test different copy, send messages at different times, introduce new campaigns, etc.
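
If you do want to run the correlation math from point 3 yourself, a minimal sketch with pandas could look like the one below; the file name and column names are assumptions about how your own analytics export might be structured.

```python
# Correlate week-1 feature usage with day-30 retention, one row per user.
import pandas as pd

df = pd.read_csv("user_week1_usage.csv")  # hypothetical export from your analytics tool

feature_flags = ["used_search", "created_project", "invited_teammate"]  # assumed columns
correlations = df[feature_flags].corrwith(df["retained_d30"]).sort_values(ascending=False)
print(correlations)  # features at the top are candidates for a retention experiment
```

Keep in mind this only surfaces correlation; the experiment is what tells you whether pushing a feature actually causes better retention.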

Free templates & what to do before starting an experiment

Has it ever happened to you that you got so excited about a new idea that you ran an experiment without putting together a proper experiment brief? I am guilty of it. But I have learned from that mistake, and to make life easier for you, I am sharing my template for defining experiments here. At a minimum, always take the time to define the hypothesis (and background), the expected business impact, and the target audience. See details 👇

🔮 Never start a new experiment without defining a clear hypothesis. To craft a meaningful hypothesis, follow, for instance, a structure similar to the one below. And do not forget to add a statement about how confident you are in this hypothesis, so that you can prioritize it among other experimentation ideas using the (R)ICE framework or similar (a toy scoring sketch follows the hypothesis template below).

Because <some insight/learnings/data/evidence we have> we believe that <doing this thing> will result in <some change to this metric>.
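
For the prioritization part, here is a toy sketch of (R)ICE scoring; the two example ideas and their numbers are made up purely to show how the confidence statement feeds into the score.

```python
# RICE = (Reach * Impact * Confidence) / Effort — higher scores get prioritized first.
def rice_score(reach: float, impact: float, confidence: float, effort: float) -> float:
    return reach * impact * confidence / effort

# Hypothetical ideas with assumed inputs (reach in users/quarter, effort in person-weeks).
ideas = {
    "Redesign the paywall": rice_score(reach=20_000, impact=2.0, confidence=0.7, effort=4.0),
    "Change the CTA copy":  rice_score(reach=20_000, impact=0.25, confidence=0.5, effort=0.5),
}
for name, score in sorted(ideas.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:,.0f}")
```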

📈 If you have been following my posts, you will have noticed that I pay close attention to the expected business impact. When you are defining experiments, it is no different. You need to draw a line in the sand for how you expect the key metric(s) to move once the new variant is introduced. Also write down how you are going to track the expected impact (who checks, when, and with which tools). I have published a free template for calculating the expected business impact of your experiment that you can refer to as part of your experimentation brief.
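
The template itself does the heavy lifting, but a back-of-the-envelope version of the calculation looks roughly like this (all numbers below are placeholders, not taken from the template):

```python
# Expected business impact of an experiment, estimated from assumed inputs.
monthly_visitors = 50_000        # users entering the funnel per month (assumed)
baseline_cr = 0.04               # current conversion rate (assumed)
expected_relative_uplift = 0.08  # the line in the sand: +8% relative (assumed)
revenue_per_conversion = 30.0    # average revenue per converted user (assumed)

extra_conversions = monthly_visitors * baseline_cr * expected_relative_uplift
extra_revenue = extra_conversions * revenue_per_conversion
print(f"Expected impact: ~{extra_conversions:.0f} extra conversions, ~${extra_revenue:,.0f}/month")
```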

🕵️ It is important that you take some time to think about who should be exposed to the experiment so that you can draw the right conclusions. Define which audiences and platforms are targeted at what point in time.
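
In the brief, this can be as simple as a short structured block; the fields and values below are just one hypothetical way to write it down.

```python
# Hypothetical targeting definition for the experiment brief (all values are examples).
targeting = {
    "audience": "new users on the free plan",
    "platforms": ["iOS", "Android"],      # web excluded until the variant reaches parity
    "rollout": [                          # who is exposed at what point in time
        {"week": 1, "traffic_share": 0.10},
        {"week": 2, "traffic_share": 0.50},
        {"week": 3, "traffic_share": 1.00},
    ],
}
```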

When you should NOT run A/B tests

There are a few cardinal sins you should avoid when designing and conducting experiments. Not every new feature or change should be A/B tested. See below for some pitfalls to avoid 👇

  1. A successful experiment begins with a clear hypothesis. Just as scientists don't conduct experiments without hypotheses, you shouldn't run A/B tests without well-defined expectations.
  2. Do not run A/B tests when your audience is too small. It will take ages to get statistically significant results. Instead, be more radical: move your full user base to the new variant and observe what happens.
  3. While I am a fan of A/B testing, not all changes warrant it. Low-risk adjustments in particular can be made without extensive testing to save time. Since you will be observing the change anyway, you can always revert.
  4. Sometimes you will simply not have enough time or capacity to run an A/B test. Instead of falling into analysis-paralysis or shying away from implementing any changes at all, go ahead and ship the change without experimentation, and slowly start building out a culture of experimentation.
  5. When you are getting started, do not run multiple A/B tests on the same audience and the same feature; it keeps the impact analysis simple.
  6. Do not forget secondary metrics. While your goal is to improve your key metric(s), you need secondary metrics acting as guardrails. For example, you do not want to maximize activation conversion rate at the cost of retention.

What to consider while running an A/B test

Once you have completely defined and launched your experiment, the work is done, right? 🤡

I suggest performing the following tasks while running an experiment 👇

↗ Gradually ramp up user assignment. Always start by exposing only a small subset (e.g. 10%) of your user base to a new experiment, and only increase the exposure to the required volume once everything goes as expected.
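
One common way to implement such a ramp-up is deterministic bucketing: hash each user into a fixed bucket and only expose the buckets below the current rollout percentage. A minimal sketch (function and experiment names are assumptions):

```python
# Deterministic ramp-up: hash the user ID into a 0-99 bucket and only expose
# users whose bucket falls below the current rollout percentage.
import hashlib

def in_experiment(user_id: str, experiment: str, rollout_percent: int) -> bool:
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent

# Start at 10%, then raise rollout_percent once metrics look healthy;
# the same users stay in the experiment because the hash is stable.
print(in_experiment("user_42", "new_paywall", rollout_percent=10))
```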

🕵️ Continuously monitor the experiment and its impact on your key and secondary metrics. In some cases you may want to pull the plug even before statistical significance has been reached.
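
For monitoring the primary metric mid-flight, a quick significance check could look like the sketch below (the counts are assumed); just remember that repeatedly peeking at p-values inflates false positives, so treat it as monitoring rather than a stopping rule.

```python
# Two-proportion z-test on the conversions observed so far (control vs. variant).
from statsmodels.stats.proportion import proportions_ztest

conversions = [480, 530]      # converted users: control, variant (assumed)
exposed = [10_000, 10_000]    # users exposed so far per group (assumed)

z_stat, p_value = proportions_ztest(conversions, exposed)
print(f"z = {z_stat:.2f}, p = {p_value:.3f}")
```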

🔄 Share and discuss initial findings with your colleagues to keep them in the loop even before the experiment has been finalized. This keeps transparency high, may spark new ideas, and shows appreciation for the hard work everyone put into delivering the experiment.

Your experiment failed - what now?

You conducted an A/B test and it failed - what now? 😱

First of all, you are not alone. This happens all the time. And it is okay that it happens. What matters is how you deal with it.

❓ So what should you do when your A/B test fails? Take a step back, revisit your initial hypothesis, and think it through once again now that you are armed with new data and insights from actually running the experiment. Ask yourself whether your hypothesis still makes sense and whether you targeted the right audience.

🕵️ By asking these questions, you may conclude that the A/B test rightfully failed and the hypothesis was wrong. Again, this is fine and it happens. Armed with your new insights, you may also come up with a slightly adapted hypothesis or even a completely different idea. In any case, you are closing the loop and turning the failed experiment into a learning that you share with your team.
