Hypothesis Testing vs. A/B Testing
2021-04-05

A/B testing is a popular term in Data Science. It seems fairly self-explanatory at first glance, but a question then arises: what is the difference between the A/B testing people use nowadays and the hypothesis testing I learned in statistics class?

In this post, I include some key points for understanding hypothesis testing and A/B testing, as well as some takeaways from other posts. In short, A/B testing is a version of two-sample hypothesis testing for a randomized controlled experiment with two groups, control and treatment.

Hypothesis Testing

  • Specify the null ($H_0$) and alternative ($H_1$) hypotheses
  • Calculate the test statistic according to the observed sample
  • Reject or fail to reject the null hypothesis

In hypothesis testing, we have some knowledge of a population and observe a representative sample from it. By computing the test statistic, we check whether our knowledge of the population is supported by the observed sample.

A post by Jérôme Spielmann gives a gentle introduction to the mathematics behind statistical testing and the transition from generic testing to A/B testing.

In that post, the author gives an example of the results of an A/B test.

             Click on Link    Did not Click
  Design A        33               362
  Design B        42               283

In hypothesis testing, we focus on the results from one design and have

$$H_0: p = p_0 \quad \text{vs.} \quad H_1: p \neq p_0,$$

where $p$ is the click-through rate of the design and $p_0$ is a hypothesized value.

In A/B testing, we compare the results from two designs and have

$$H_0: p_A = p_B \quad \text{vs.} \quad H_1: p_A \neq p_B,$$

where $p_A$ and $p_B$ are the click-through rates of Design A and Design B.
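As a concrete sketch, the two-sample hypothesis above can be checked on the click counts with a pooled two-proportion z-test. The calculation below simply follows the standard formula; it is not taken from the referenced post.

import numpy as np
import scipy.stats as scs

# Click data from the table above
clicks_a, total_a = 33, 33 + 362
clicks_b, total_b = 42, 42 + 283

p_a, p_b = clicks_a / total_a, clicks_b / total_b

# Pooled proportion under H0: p_A = p_B
p_pool = (clicks_a + clicks_b) / (total_a + total_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))

# Two-sided z-test for the difference in click-through rates
z = (p_b - p_a) / se
p_value = 2 * (1 - scs.norm.cdf(abs(z)))

print(round(p_a, 4), round(p_b, 4), round(z, 3), round(p_value, 4))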

A/B Testing

  • Define a change of interest (i.e., the treatment) and a metric to evaluate it
  • Specify the control and treatment groups (i.e., Group A and Group B)
  • Construct the null hypothesis that the treatment group is the same as the control group
  • Calculate the test statistic from the control and treatment groups
  • Reject or fail to reject the null hypothesis

In A/B testing, we have a hypothesis about a treatment and are interested in its impact on the metric. In particular, we design a randomized experiment with two groups that are identical prior to the treatment. During the test, one group receives the treatment while the other does not.
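As a minimal sketch of this setup (the group labels, metric, and conversion rates below are made up purely for illustration), random assignment might look like this:

import numpy as np

rng = np.random.default_rng(seed=42)

n_users = 10_000
# Randomly assign each user to control (A) or treatment (B) with equal probability
group = rng.choice(["A", "B"], size=n_users)

# Hypothetical "true" click-through rates, used only to simulate outcomes here
true_ctr = {"A": 0.10, "B": 0.12}
clicked = rng.random(n_users) < np.where(group == "A", true_ctr["A"], true_ctr["B"])

# Observed metric per group
for g in ["A", "B"]:
    mask = group == g
    print(g, mask.sum(), "users, observed CTR =", round(clicked[mask].mean(), 4))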

Test Statistics

A/B testing includes applications of statistical hypothesis testing, or "two-sample hypothesis testing" as it is known in statistics. The test statistic varies with the metric used to compare the two groups.

  • The Z-test is appropriate for comparing means under stringent conditions regarding normality and a known standard deviation.
  • Student's t-test is appropriate for comparing means under relaxed conditions, when less is assumed.
  • Welch's t-test assumes the least and is therefore the most commonly used test in a two-sample hypothesis test where the mean of a metric is to be optimized.
  • Fisher's exact test is used to compare two binomial distributions, such as a click-through rate.
  Assumed Distribution   Example Case                       Standard Test                      Alternative Test
  Gaussian               Average revenue per user           Welch's t-test (unpaired t-test)   Student's t-test
  Binomial               Click-through rate                 Fisher's exact test                Barnard's test
  Poisson                Transactions per paying user       E-test                             C-test
  Multinomial            Number of each product purchased   Chi-squared test
  Unknown                                                   Mann–Whitney U test                Gibbs sampling

Reference Link: https://en.wikipedia.org/wiki/A/B_testing#Common_test_statistics
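As a quick illustration of how the test choice follows the metric (the per-user revenue below is simulated data, not real numbers), the same two groups can be compared with Welch's t-test or, when even less is assumed about the distribution, the Mann–Whitney U test:

import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(seed=0)

# Simulated per-user revenue for the two groups (a continuous, roughly Gaussian metric)
revenue_a = rng.normal(loc=5.0, scale=2.0, size=500)
revenue_b = rng.normal(loc=5.3, scale=2.0, size=500)

# Welch's t-test: compares means without assuming equal variances
t_stat, p_welch = scs.ttest_ind(revenue_a, revenue_b, equal_var=False)

# Mann-Whitney U test: distribution-free alternative when normality is doubtful
u_stat, p_mwu = scs.mannwhitneyu(revenue_a, revenue_b, alternative="two-sided")

print(round(p_welch, 4), round(p_mwu, 4))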

Test Significance Level and Statistical Power

Oftentimes, a problem is given with a desired confidence level instead of the significance level ($\alpha$). A typical 95% confidence level for an A/B test corresponds to a significance level of $\alpha = 0.05$.

An experiment in A/B testing is typically set up with a minimum statistical power of 80 percent. In other words, if the treatment truly has an effect, we want the test to detect it with a probability of at least 80 percent.

By construction, if we increase the sample size for both the A and B groups, the sampling distributions under $H_0$ and $H_1$ become much narrower, and hence the statistical power increases.
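As a rough sketch of this relationship (the 10% and 12% conversion rates are made-up inputs), the power of a two-sided two-proportion z-test can be approximated analytically and watched grow with the per-group sample size:

import numpy as np
import scipy.stats as scs

def approx_power(p_a, p_b, n_per_group, sig_level=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    d = abs(p_b - p_a)
    pooled = (p_a + p_b) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z_alpha = scs.norm.ppf(1 - sig_level / 2)
    # Probability that the test statistic lands beyond the critical value under H1
    return scs.norm.cdf(d / se - z_alpha)

for n in [1000, 2000, 4000, 8000]:
    print(n, round(approx_power(0.10, 0.12, n), 3))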

Test Sample Size

The key is to determine the minimum sample size for the experiment, because it is directly related to how quickly you can complete the test and deliver statistically significant results.

There are sample size calculators available on the web. You will need the baseline conversion rate and the minimum detectable effect, which is the minimum difference between the control and treatment groups that the test should detect.

For detailed discussions of statistical power and sample size, you can also check the two notes referenced below: the Stanford lecture notes and the post by Nguyen Ngo.

According to the lecture notes from Stanford, we have this formula:

$$n_1 = \frac{r + 1}{r} \cdot \frac{\sigma^2 (z_{\beta} + z_{\alpha/2})^2}{d^2},$$

where $n_1$ is the sample size of the smaller group, $r$ is the ratio of the larger group to the smaller group, $\sigma$ is the standard deviation of the outcome variable, $(z_{\beta} + z_{\alpha/2})$ represents the critical values at the desired power and significance level (two-sided testing), and $d$ is the size of the treatment effect (i.e., the difference in means).

All-purpose power formula:

$$\frac{d}{\text{s.e.}(d)} = z_{\beta} + z_{\alpha/2}$$

Standard error (s.e.) of $d$:

$$\text{s.e.}(d) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Subsequently, this formula can be adapted to a case with a binary metric:

$$\sigma^2 = \bar{p}(1 - \bar{p}), \quad \text{where } \bar{p} = \frac{p_A + p_B}{2}.$$

For a test with two equal-size groups, $r = 1$ and the size of one group is $n$, given by

$$n = \frac{2\,\bar{p}(1 - \bar{p})(z_{\beta} + z_{\alpha/2})^2}{d^2}.$$

Thus, the total sample size for the two groups is $2n$.

A corresponding Python script is provided in a post by Nguyen Ngo:

import scipy.stats as scs

def min_sample_size(bcr, mde, power=0.8, sig_level=0.05):
    """Returns the minimum sample size to set up a split test

    Arguments:
        bcr (float): probability of success for control, sometimes
            referred to as baseline conversion rate

        mde (float): minimum change in measurement between control
            group and test group if alternative hypothesis is true,
            sometimes referred to as minimum detectable effect

        power (float): probability of rejecting the null hypothesis when the
            null hypothesis is false, typically 0.8

        sig_level (float): significance level often denoted as alpha,
            typically 0.05

    Returns:
        min_N: minimum sample size (float)

    References:
        Stanford lecture on sample sizes
        http://statweb.stanford.edu/~susan/courses/s141/hopower.pdf
    """
    # standard normal distribution to determine z-values
    standard_norm = scs.norm(0, 1)

    # find Z_beta from desired power
    Z_beta = standard_norm.ppf(power)

    # find Z_alpha
    Z_alpha = standard_norm.ppf(1 - sig_level / 2)

    # average of probabilities from both groups
    pooled_prob = (bcr + bcr + mde) / 2

    min_N = (2 * pooled_prob * (1 - pooled_prob) * (Z_beta + Z_alpha)**2
             / mde**2)

    return min_N
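As a usage sketch (the baseline rate and detectable effect below are illustrative inputs), note that the returned value corresponds to the size of one group, so the total traffic needed is roughly twice that:

# Hypothetical inputs: 10% baseline conversion rate, 2-percentage-point lift
n_per_group = min_sample_size(bcr=0.10, mde=0.02)

print(round(n_per_group))      # per-group sample size, roughly 3,842
print(round(2 * n_per_group))  # total sample size across both groups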

Refresher on A/B Testing

This part summarizes the article published in Harvard Business Review.

How Does A/B Testing Work?

A/B testing, at its most basic, is a way to compare two versions of something to figure out which performs better. "The A/B test can be considered the most basic kind of randomized controlled experiment," Fung says. "In its simplest form, there are two treatments and one acts as the control for the other." As with all randomized controlled experiments, you must estimate the sample size you need to achieve statistical significance, which will help you make sure the result you're seeing "isn't just because of background noise," Fung says.

Lots of managers run sequential tests — e.g., testing size first (large versus small), then testing color (blue versus red), then testing typeface (Times versus Arial) — because they believe they shouldn’t vary two or more factors at the same time. But according to Fung, that view has been debunked by statisticians. And sequential tests are suboptimal because you're not measuring what happens when factors interact. For example, it may be that users prefer blue on average but prefer red when it’s combined with Arial. This kind of result is regularly missed in sequential A/B testing because the typeface test is run on blue buttons that have “won” the prior test.

Using mathematics, you can "smartly pick and run only certain subsets of those treatments; then you can infer the rest from the data." This is called "multivariate" testing in the A/B testing world and often means you end up doing an A/B/C test or even an A/B/C/D test. In the example above with colors and size, it might mean showing different groups: a large red button, a small red button, a large blue button, and a small blue button.
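As a tiny sketch of what those combined treatments look like (the factor names and levels are simply the ones from the example), the full set of variants can be enumerated like this:

from itertools import product

# Factors and levels from the example above
sizes = ["large", "small"]
colors = ["red", "blue"]

# Full factorial: every combination of size and color becomes one variant
variants = [f"{size} {color} button" for size, color in product(sizes, colors)]
print(variants)
# ['large red button', 'large blue button', 'small red button', 'small blue button']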

What Mistakes Do People Make When Doing A/B Tests?

  • First, he says, too many managers don't let the tests run their course.
    • The problem is that, because of randomization, it’s possible that if you let the test run to its natural end, you might get a different result.
  • The second mistake is looking at too many metrics.
    • The problem is that if you're looking at such a large number of metrics at the same time, you’re at risk of making what statisticians call "spurious correlations."
    • In proper test design, you should decide on the metrics you’re going to look at before you execute an experiment and select a few. The more you're measuring, the more likely that you're going to see random fluctuations.
  • Lastly, few companies do enough retesting.
    • Even though there may be little chance that any given A/B result is driven by random chance, if you run lots of A/B tests, the chance that at least one of your results is wrong grows rapidly (see the quick calculation after this list).
    • The smaller the improvement, the less reliable the results.
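To make the retesting point concrete, here is a quick back-of-the-envelope calculation, assuming independent tests that are each run at a 5 percent significance level with no true effect present:

sig_level = 0.05  # per-test false positive rate when there is no real effect

# Probability of at least one false positive across k independent tests
for k in [1, 5, 10, 20]:
    at_least_one = 1 - (1 - sig_level) ** k
    print(k, round(at_least_one, 2))
# With 20 such tests, the chance of at least one spurious "win" is already about 64%.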

It's clear that A/B testing is not a panacea. There are more complex kinds of experiments that are more efficient and will give you more reliable data. But A/B testing is a great way to gain a quick understanding of a question you have.