A/B testing is a popular term in data science. It seemed fairly self-explanatory to me at first glance, but then a question arises: what is the difference between the A/B testing people use nowadays and the hypothesis testing I learned in statistics class?
In this post, I include some key points for understanding hypothesis testing and A/B testing, as well as some takeaways from other posts. In short, A/B testing is a version of two-sample hypothesis testing for a randomized controlled experiment with two groups, control and treatment.
Hypothesis Testing
- Specify the null ($H_0$) and alternative ($H_1$) hypotheses
- Calculate the test statistic from the observed sample
- Reject or fail to reject the null hypothesis
In the hypothesis testing, we have some knowledge of a population and observe a representative sample from that population. By computing the test statistic, we check whether our knowledge of the population is supported by the observed sample.
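As a minimal sketch of these three steps for a binary outcome (the click counts and the hypothesized rate of 10% below are made up for illustration; `binomtest` requires scipy 1.7+):

```python
import scipy.stats as scs

# Hypothetical observation: 45 clicks out of 400 impressions.
clicks, impressions = 45, 400

# Step 1: H0: p = 0.10 versus H1: p != 0.10 (assumed baseline rate).
p0 = 0.10

# Steps 2-3: compute the test statistic (an exact binomial test here) and
# compare the resulting p-value against the chosen significance level.
result = scs.binomtest(clicks, impressions, p0, alternative="two-sided")
print(result.pvalue)  # reject H0 at alpha = 0.05 if the p-value is below 0.05
```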
A post by Jérôme Spielmann gives a gentle introduction to the mathematics behind statistical testing and the transition from generic hypothesis testing to A/B testing.
In that post, the author gives an example of the results of an A/B test.
|  | Click on Link | Did not Click |
|---|---|---|
| Design A | 33 | 362 |
| Design B | 42 | 283 |

In the hypothesis testing, we focus on the results from one design and have $H_0: p = p_0$ against $H_1: p \neq p_0$, where $p_0$ is a hypothesized click-through rate.
In the A/B testing, we compare the results from two designs and have $H_0: p_A = p_B$ against $H_1: p_A \neq p_B$.
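For instance, the two-design comparison can be run directly on the 2×2 table above with scipy (a sketch only; the original post's choice of test may differ):

```python
import scipy.stats as scs

# 2x2 contingency table from the example above:
#             clicked   did not click
# Design A      33          362
# Design B      42          283
table = [[33, 362], [42, 283]]

# Fisher's exact test of H0: p_A = p_B against H1: p_A != p_B.
odds_ratio, p_value = scs.fisher_exact(table, alternative="two-sided")
print(p_value)

# A chi-squared test of independence gives an approximate alternative.
chi2, p_approx, dof, expected = scs.chi2_contingency(table)
print(p_approx)
```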
A/B Testing
- Define a measure of interest (i.e., the treatment) and a metric
- Specify the control and treatment groups (i.e., Groups A and B)
- Construct the null hypothesis that the treatment group is the same as the control group
- Calculate the test statistic from the control and treatment groups
- Reject or fail to reject the null hypothesis
In the A/B testing, we have a hypothesis about a measure (the treatment) and we are interested in its impact. In particular, we design a randomized experiment with two groups that are identical prior to the treatment: one group is treated, while the other is not.
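A minimal sketch of the randomization step, assuming a hypothetical list of user IDs and numpy:

```python
import numpy as np

# Hypothetical set of user IDs to be split before the experiment starts.
rng = np.random.default_rng(seed=42)
user_ids = np.arange(10_000)

# Each user is assigned to control or treatment at random with equal
# probability, so the two groups are identical (in expectation) prior
# to the treatment.
assignment = rng.choice(["control", "treatment"], size=user_ids.size)
```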
Test Statistics
The A/B testing includes applications of statistical hypothesis testing or "two-sample hypothesis testing" as used in statistics. The test statistics vary with the metrics that are used for comparing the two groups.
- Z-test is appropriate for comparing means under stringent conditions regarding normality and a known standard deviation.
- Student's t-test is appropriate for comparing means under relaxed conditions when less is assumed.
- Welch's t-test assumes the least and is therefore the most commonly used test in a two-sample hypothesis test where the mean of a metric is to be optimized.
- Fisher's exact test is used in a comparison of two binomial distributions such as a click-through rate.
| Assumed Distribution | Example Case | Standard Test | Alternative Test |
|---|---|---|---|
| Gaussian | Average revenue per user | Welch's t-test (Unpaired t-test) | Student's t-test |
| Binomial | Click-through rate | Fisher's exact test | Barnard's test |
| Poisson | Transactions per paying user | E-test | C-test |
| Multinomial | Number of each product purchased | Chi-squared test | |
| Unknown | | Mann–Whitney U test | Gibbs sampling |
Reference Link: https://en.wikipedia.org/wiki/A/B_testing#Common_test_statistics
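As a sketch of the first two rows in the table, scipy's `ttest_ind` switches between Student's and Welch's t-test through the `equal_var` flag (the revenue samples below are simulated):

```python
import numpy as np
import scipy.stats as scs

# Simulated average-revenue-per-user samples for the two groups.
rng = np.random.default_rng(0)
revenue_a = rng.normal(loc=5.0, scale=2.0, size=500)
revenue_b = rng.normal(loc=5.3, scale=2.5, size=500)

# Student's t-test assumes the two groups share the same variance.
t_student, p_student = scs.ttest_ind(revenue_a, revenue_b, equal_var=True)

# Welch's t-test drops that assumption and is the usual default in A/B tests.
t_welch, p_welch = scs.ttest_ind(revenue_a, revenue_b, equal_var=False)
print(p_student, p_welch)
```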
Test Significance Level and Statistical Power
Oftentimes, a problem will be given with a desired confidence level (e.g., 95 percent) instead of the significance level ($\alpha$); the two are related by $\alpha = 1 - \text{confidence level}$.
An experiment in A/B testing is typically set up with a minimum statistical power of 80 percent. In other words, we want at least an 80 percent probability that the test detects the treatment effect when the treatment truly works.
By construction, if we increase the sample size for both the A and B groups, the sampling distributions under $H_0$ and $H_1$ become narrower (the standard error shrinks), so the two distributions overlap less and the statistical power increases.
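As a quick sketch of this effect under the normal approximation (the click-through rates below are hypothetical; the underlying power formula is spelled out in the next section):

```python
import numpy as np
import scipy.stats as scs

# Hypothetical true click-through rates for control (A) and treatment (B).
p_a, p_b = 0.10, 0.12
alpha = 0.05
z_crit = scs.norm.ppf(1 - alpha / 2)

# Larger per-group sample sizes shrink the standard error of the estimated
# difference, which increases the power to detect the same effect.
for n in (500, 1000, 2000, 5000):
    se = np.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    power = scs.norm.cdf(abs(p_b - p_a) / se - z_crit)
    print(n, round(power, 3))
```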
Test Sample Size
The key is to determine the minimum sample size for the experiment, because it is directly related to how quickly you can complete the test and deliver statistically significant results.
There are two calculators available on the web:
- Sample Size Calculator (Evan’s Awesome A/B Tools) (evanmiller.org)
- A/B Test Sample Size Calculator (optimizely.com)
You will need the baseline conversion rate and the minimum detectable effect, which is the minimum difference between the control and treatment group.
For detailed discussions on statistical power and sample size, you can also check these two notes:
According to the lecture notes from Stanford, we have this formula:

All-purpose power formula:

$$\text{Power} = \Phi\left(\frac{|\delta|}{\text{s.e.}(\hat{\delta})} - z_{1-\alpha/2}\right),$$

where $\delta$ is the true difference between the two groups, $\hat{\delta}$ is its estimate, $\Phi$ is the standard normal CDF, and $z_{1-\alpha/2}$ is the critical value at significance level $\alpha$.

Standard error (s.e.) of the estimated difference $\hat{\delta}$:

$$\text{s.e.}(\hat{\delta}) = \sqrt{\frac{\sigma_A^2}{n_A} + \frac{\sigma_B^2}{n_B}}$$

Subsequently, this formula can be adapted to a case with a binary metric:

$$\text{s.e.}(\hat{\delta}) = \sqrt{\frac{p_A(1 - p_A)}{n_A} + \frac{p_B(1 - p_B)}{n_B}}$$
A corresponding Python script is provided in a post by Nguyen Ngo:
```python
import scipy.stats as scs
```
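Only the import is shown above; below is a sketch of what the rest of such a script might look like, inverting the binary-metric power formula with a pooled-variance approximation (the function and parameter names are illustrative and may differ from the original post):

```python
import scipy.stats as scs

def min_sample_size(bcr, mde, power=0.8, sig_level=0.05):
    """Minimum per-group sample size for a binary metric.

    bcr: baseline conversion rate of the control group
    mde: minimum detectable effect (absolute lift over the baseline)
    """
    z_beta = scs.norm.ppf(power)               # z-score for the desired power
    z_alpha = scs.norm.ppf(1 - sig_level / 2)  # z-score for the significance level

    # Pooled success probability across the control and treatment groups.
    pooled_prob = (bcr + (bcr + mde)) / 2

    return (2 * pooled_prob * (1 - pooled_prob) * (z_beta + z_alpha) ** 2
            / mde ** 2)

# Example: 10% baseline conversion rate, 2% minimum detectable effect.
print(min_sample_size(0.10, 0.02))  # minimum number of users per group
```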
Refresher on A/B Testing
This part summarizes the article published in Harvard Business Review.
How Does A/B Testing Work?
A/B testing, at its most basic, is a way to compare two versions of something to figure out which performs better. "The A/B test can be considered the most basic kind of randomized controlled experiment," Fung says. "In its simplest form, there are two treatments and one acts as the control for the other." As with all randomized controlled experiments, you must estimate the sample size you need to achieve statistical significance, which will help you make sure the result you're seeing "isn't just because of background noise," Fung says.
Lots of managers run sequential tests — e.g., testing size first (large versus small), then testing color (blue versus red), then testing typeface (Times versus Arial) — because they believe they shouldn’t vary two or more factors at the same time. But according to Fung, that view has been debunked by statisticians. And sequential tests are suboptimal because you're not measuring what happens when factors interact. For example, it may be that users prefer blue on average but prefer red when it’s combined with Arial. This kind of result is regularly missed in sequential A/B testing because the typeface test is run on blue buttons that have “won” the prior test.
Using mathematics you can "smartly pick and run only certain subsets of those treatments; then you can infer the rest from the data." This is called "multivariate" testing in the A/B testing world and often means you end up doing an A/B/C test or even an A/B/C/D test. In the example above with colors and size, it might mean showing different groups: a large red button, a small red button, a large blue button, and a small blue button.
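For the size-and-color example, the treatment arms of such a multivariate test are just the Cartesian product of the two factors; a tiny sketch:

```python
from itertools import product

# The four treatment arms of the size-color example: the full factorial
# of the two factors, tested together rather than sequentially.
sizes = ["large", "small"]
colors = ["red", "blue"]
arms = list(product(sizes, colors))
# [('large', 'red'), ('large', 'blue'), ('small', 'red'), ('small', 'blue')]
```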
What Mistakes Do People Make When Doing A/B Tests?
- First, he says, too many managers don't let the tests run their course.
- The problem is that, because of randomization, it’s possible that if you let the test run to its natural end, you might get a different result.
- The second mistake is looking at too many metrics.
- The problem is that if you're looking at such a large number of metrics at the same time, you’re at risk of making what statisticians call "spurious correlations."
- In proper test design, you should decide on the metrics you’re going to look at before you execute an experiment and select a few. The more you're measuring, the more likely that you're going to see random fluctuations.
- Lastly, few companies do enough retesting.
- Even though there may be little chance that any given A/B result is driven by random chance, if you do lots of A/B tests, the chances that at least one of your results is wrong grow rapidly (see the quick calculation after this list).
- The smaller the improvement, the less reliable the results.
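To quantify the retesting point, here is a back-of-the-envelope calculation assuming independent tests, each run at a significance level of 0.05:

```python
# Probability of at least one false positive across k independent A/B tests
# when every tested treatment actually has no effect.
alpha = 0.05
for k in (1, 5, 20, 100):
    print(k, round(1 - (1 - alpha) ** k, 3))
```

With 20 such tests, the chance of at least one spurious "win" is already around 64 percent, which is why retesting promising results matters.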
It's clear that A/B testing is not a panacea. There are more complex kinds of experiments that are more efficient and will give you more reliable data. But A/B testing is a great way to gain a quick understanding of a question you have.