A/B Testing: Top 4 Mistakes With Business Cases & Fixes

young kwon
9 min read · Nov 15, 2020

Introduction

Steve Jobs, answering a tough question in 1997, said,

“You’ve got to start with the customer experience and work backward to the technology. You can’t start with the technology and try to figure out where to sell it.”

I believe A/B testing is based precisely on this idea. Most innovative companies have moved on from HiPPO (the highest paid person's opinion) to data-driven decision-making. They spend heavily on digital experiments to improve both the customer experience and organizational decision-making.

Speaking about Facebook's investment in its huge testing framework, Mark Zuckerberg said in an interview,

“At any given point in time, there’s not just one version of Facebook running in the world. There’re probably 10,000 versions running.”

Jeff Bezos also once said,

“Our success at Amazon is a function of how many experiments we do per year, per month, per week, per day.”

But despite proper budget and effort, some avoidable mistakes creep in during implementation. This article points out the top 4 mistakes that commonly happen during A/B testing and go unnoticed.

I’ll include the following as a part of the article’s framework:

  • I’ll elaborate on the mistakes with real or hypothetical business cases to help you understand the ideas clearly.
  • I’ll also suggest suitable fixes for them.

I promise to make it very interesting and easy to understand for you. So, brew a mug of hot coffee and grab your favorite armchair.

Let’s dive right in.

1. Too Many Test Variants

When you test a hypothesis to compare 2 variations, you perform a statistical test at a certain significance level. Suppose your organization decides on a 5% significance level for a certain experiment. What does this mean?

This means that there’s a 5% chance that your test result will be due to random chance alone and that you will declare a wrong winner.

e.g. your test will say that Option B is better than Option A about 5 times out of 100 when that is not the case. In statistical terms, there’s a 5% probability of getting a false positive (a Type I error).

That was easy, right? Let’s go a step further now. Let’s extend this concept to more than 1 test variant.

The general formula for the probability of getting a false positive is as follows:

False Positive Rate = 1-(1-α)ⁿ
α → significance level
n → total number of test variants (excluding the control)

For a single test variant (n = 1), the false positive rate is simply the significance level:

1-(1-0.05)¹ = 0.05

Now, as the number of test variants increases, the probability of a Type I error increases. The following chart depicts this clearly:

Fig.1 Variation of False Positive Rate with the number of test variants (Image by Author)

This is called the ‘Multiple Comparison Problem’.
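To see the numbers behind the chart, here is a minimal Python sketch that computes the family-wise false positive rate from the formula above; the variant counts are only illustrative.

# Family-wise false positive rate: 1 - (1 - alpha)^n
alpha = 0.05  # per-test significance level

for n in [1, 2, 5, 10, 20, 41]:  # illustrative numbers of test variants
    fpr = 1 - (1 - alpha) ** n
    print(f"{n:2d} variants -> false positive rate = {fpr:.1%}")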

Let me build on this with a famous experiment by Google, known as the ‘41 Shades of Blue’ test.

In 2009, Google wanted to decide which shade of blue would generate the maximum clicks on its search results page. So, it carried out ‘1% experiments’ to test 41 different shades of blue, showing one shade to 1% of users, another shade to another 1%, and so on.

And that was how the blue color you see in the advertising links in Gmail and Google search was chosen. Interestingly, it earned the company an extra $200m a year in revenue.

That was fascinating, right? Now, let’s come back to the ‘Multiple Comparison Problem’. How do we deal with it? What would Google have done?

FIX:

The number of variations you should test depends on your organization’s business requirements, organizational efficiency, and factors like conversions, revenue, and traffic. Still, testing too many variations should generally be avoided.

Statistically, there are multiple techniques to handle this problem. I’ll explain a technique called ‘Bonferroni Correction’.

By now, you know that as the number of hypotheses tested increases, the Type I error rate increases. How does ‘Bonferroni Correction’ help with this?

‘Bonferroni Correction’ compensates for this increase in error by testing each hypothesis at a significance level of α/n.

e.g. if an experiment is testing 40 hypotheses, with the desired significance level of 0.05, then ‘Bonferroni Correction’ would test each hypothesis at α=0.05/40=0.00125.

So, now you know that to maintain a 95% confidence level for the ‘41 Shades of Blue’ experiment, Google would have tested each hypothesis at roughly a 99.875% confidence level.
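Here is a minimal sketch of a Bonferroni-style check, assuming you already have one p-value per variant-versus-control comparison; the p-values below are made up for illustration.

# Bonferroni correction: compare each p-value against alpha / n
alpha = 0.05
p_values = [0.012, 0.0009, 0.034, 0.0004]  # hypothetical, one per variant vs. control

n = len(p_values)
adjusted_alpha = alpha / n  # 0.0125 for 4 comparisons

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < adjusted_alpha else "not significant"
    print(f"Variant {i}: p = {p:.4f} -> {verdict} at adjusted alpha = {adjusted_alpha:.4f}")

If you prefer a library call, the same logic is available via method='bonferroni' in statsmodels’ multipletests function.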

2. Ignoring Interaction Effects

It is important to be mindful of interaction effects when multiple experiments target the same audience. But what is the interaction effect?

It is a situation in which the simultaneous influence of two variables on the success metric being measured is not additive. Let’s understand this through an example.

Suppose Amazon is working on its outbound customer communication to improve the conversion rate. The e-Commerce Analytics team is performing an A/B test on an ‘abandoned cart’ push notification. At the same time, the Marketing Analytics team is also carrying out an A/B test on a ‘recommendations email’ to be sent out to customers.

The following figures show the conversion rates obtained during the tests:

Fig. 3 Conversion rates obtained in the individual A/B tests (Image by Author)

Fig. 4 Conversion rates obtained in the combined A/B test (Image by Author)

This is strange, right? Both new features do well in their respective experiments, so why does the combined test tank?

This is due to the interaction effect. Amazon has gone overboard with its outbound customer communication. Combined, the two features that were doing great individually have annoyed customers, and the conversion rate has gone down.
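As a quick numeric illustration, here is a minimal sketch that checks whether the lifts are additive in a 2×2 layout; all conversion rates are hypothetical and chosen only to mirror the scenario above.

# Hypothetical conversion rates from a 2x2 (push x email) layout
baseline   = 0.050  # neither feature
push_only  = 0.062  # 'abandoned cart' push notification only
email_only = 0.060  # 'recommendations email' only
combined   = 0.048  # both features together

# If the effects were purely additive, we'd expect:
expected_additive = baseline + (push_only - baseline) + (email_only - baseline)

interaction = combined - expected_additive
print(f"Expected (additive): {expected_additive:.3f}")
print(f"Observed (combined): {combined:.3f}")
print(f"Interaction effect:  {interaction:+.3f}")  # negative -> the features hurt each other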

FIX:

There’s a two-fold approach you can take to prevent the interaction effect from distorting the experiment’s success metric:

  • Firstly, look out for any possible interaction effect between two new features rolled out at the same time. If two teams are involved, someone in your organization who acts as a link between them and understands how both work can be a helpful resource.
  • Secondly, when such an interaction effect is identified, don’t carry out both the A/B tests concurrently. Instead, test them out sequentially.

3. Ignoring Customer Value

Sometimes, organizations focus only on the performance of the main KPIs, like conversion rate or revenue per visit, and miss segmentation based on customer value. This can lead to flawed experiment results. Let’s see this with an example.

Suppose Walmart Grocery redesigns its home page, changing the location of the search bar. The team carries out an experiment spanning 2 weeks but finds that the conversion rate and revenue per visit have gone down. So, it concludes that the old design is better.

Everything looks good, right? But is it? No.

The team has missed the important fact that loyal customers tend to respond much less favorably than new customers; it takes loyal customers longer to warm up to a new design. Let’s understand this with a more relatable example.

Suppose you go shopping at your nearest brick-and-mortar Walmart store and find that they have completely rearranged the entire store. You find the electronics section where the groceries used to be, the clothes section where the household essentials used to be, and so on. If this is your first visit, you wouldn’t notice the difference, and you would buy what you came for. But if you are a frequent shopper there, you would be confused, and you might even walk out if you are in a hurry.

I think you get the point. Similar behavior is even more likely online.

FIX:

I think you would agree with me now that there’s an inverse relationship between customer value and a positive response to a page design change, where customer value is a function not just of lifetime revenue but also of recency and frequency.

In other words, the upper quartile in an RFM model (customers with the highest Recency, Frequency, and Monetary scores) is the least likely to respond favorably to the new design.

Let me quickly demonstrate customer segmentation by RFM quartiles with a small sample of 4 customers.

R (Recency) → Days since last conversion
F (Frequency) → Number of days with conversions
M (Monetary) → Total money spent

Fig. 5 Customer segmentation using RFM model (Image by Author)
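Here is a minimal pandas sketch of this kind of RFM quartile scoring; the customer table, column names, and values below are all hypothetical.

import pandas as pd

# Hypothetical customer-level data
df = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"],
    "recency_days": [3, 45, 12, 90, 7, 30, 60, 15],        # days since last conversion
    "frequency": [25, 4, 12, 1, 18, 6, 2, 9],               # number of days with conversions
    "monetary": [1200, 150, 540, 40, 980, 260, 90, 430],    # total money spent
})

# Quartile scores 1-4 (4 = best). Recency is inverted: fewer days since the last
# conversion is better. rank(method="first") avoids duplicate bin edges on small samples.
df["R"] = pd.qcut(df["recency_days"].rank(method="first"), 4, labels=[4, 3, 2, 1]).astype(int)
df["F"] = pd.qcut(df["frequency"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
df["M"] = pd.qcut(df["monetary"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)

df["RFM_score"] = df["R"].astype(str) + df["F"].astype(str) + df["M"].astype(str)
print(df[["customer_id", "R", "F", "M", "RFM_score"]])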

Thus, understanding the customer mix and segmenting based on customer value is very important. Segment customers at least into new and regular. It’s even better if you form segments based on RFM quartiles (RFM segmentation).

Fig. 6 Sample customer segments based on RFM quartiles (Image by Author)

4. Incorrect Post-Test Segmentation

After the experiment is complete, you start dissecting the data into segments such as traffic size, new vs. loyal customers, device type, etc. You want to compare them based on your success metrics to dig out useful business insights. This is called post-test segmentation.

But, you need to be cautious here. Be mindful of the following two problems:

  • Small sample size of segments: The segments you form after the test can end up being very small. Thus, the business insights you draw by comparing different segments of the tested variations may not be statistically significant (a quick sample-size check is sketched after this list).
  • Multiple Comparison Problem: Remember this? Yes, you are right. We covered it in the very first point of this article. If you compare too many segments, the probability of a Type I error increases.
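As a sanity check for the first problem, here is a minimal sketch of a per-segment sample-size calculation using statsmodels’ power utilities; the baseline rate and target lift are assumptions for illustration.

from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical: 5% baseline conversion, and we want to detect a lift to 6%
effect_size = proportion_effectsize(0.06, 0.05)

# Required sample size per group within each segment, at alpha = 0.05 and 80% power
n_per_group = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
print(f"Required users per group per segment: {n_per_group:,.0f}")

If a segment is far smaller than this, any difference you see within it should be treated as directional at best.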

FIX:

So, how do you handle this?

The best way to handle this is to opt for stratified sampling and design targeted tests. Divide the samples into homogeneous brackets so that the variability within each bracket is minimal. These brackets or customer segments can be based on attributes like device category, traffic source, and demographics, according to business needs and budget. Then, conduct the experiment and compare the corresponding brackets of the variations being tested.

To give you an example from the industry, Netflix uses stratified sampling to maintain homogeneity across a set of key metrics, of which country and device type (e.g. smart TV, game console) are the most crucial.
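Here is a minimal sketch of stratified random assignment with pandas, assuming a hypothetical user table with a device_category column.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical user table
users = pd.DataFrame({
    "user_id": range(1, 1001),
    "device_category": rng.choice(["mobile", "desktop", "tablet"], size=1000, p=[0.6, 0.3, 0.1]),
})

# Assign 50% of users to the treatment *within each stratum*,
# so control and treatment end up with the same device mix.
treatment_idx = (
    users.groupby("device_category", group_keys=False)
    .sample(frac=0.5, random_state=42)
    .index
)
users["variant"] = np.where(users.index.isin(treatment_idx), "treatment", "control")

# Check the device mix per variant
print(pd.crosstab(users["variant"], users["device_category"], normalize="index").round(3))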

Conclusion

Although there are many A/B testing mistakes, my effort has been to point out the more subtle ones, which often go uncaught in the industry.

I hope I was able to deliver on my promise of making the article interesting and easy to understand. And I hope you found it useful.

To end in Thomas Alva Edison’s spirit of celebrating mistakes and the lessons learned from them,

“I have not failed 10,000 times. I have successfully found 10,000 ways that will not work.”

Note: If you want to have a look at a recent case study I did on A/B testing, you can find the code and presentation on my GitHub account.
https://github.com/younghai/A-BTesting

Feel free to have a look at an article that I’ve written on data analytics in e-commerce retail.


young kwon

Bizmatrixx Director, former KYODO Group CEO, data analysis expert, ex IBM GBS and Deloitte