Hypothesis Testing vs. A/B Testing
2021-04-05

A/B testing is a popular term in Data Science. It seems fairly self-explanatory at first glance, but a question then arises: what is the difference between the A/B testing people use nowadays and the hypothesis testing I learned in statistics class?

In this post, I include some key points for understanding hypothesis testing and A/B testing, as well as some takeaways from other posts. In short, A/B testing is a version of two-sample hypothesis testing for a randomized controlled experiment with two groups, control and treatment.

Hypothesis Testing

  • Specify the null ($H_0$) and alternative ($H_1$) hypotheses
  • Calculate the test statistic according to the observed sample
  • Reject or fail to reject the null hypothesis

In hypothesis testing, we have some knowledge of a population and observe a representative sample from it. By computing the test statistic, we check whether our knowledge of the population is supported by the observed sample.

A post by Jérôme Spielmann gives a gentle introduction to the mathematics behind statistical testing and the transition from generic testing to A/B testing.

In that post, the author gives an example of the results of an A/B test.

             Click on Link    Did not Click
  Design A        33               362
  Design B        42               283

In hypothesis testing, we focus on the results from one design and have

$$H_0: p = p_0 \quad \text{vs.} \quad H_1: p \neq p_0,$$

where $p$ is the click-through rate of the design and $p_0$ is a hypothesized value.

In A/B testing, we compare the results from two designs and have

$$H_0: p_A = p_B \quad \text{vs.} \quad H_1: p_A \neq p_B,$$

where $p_A$ and $p_B$ are the click-through rates of Design A and Design B.
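As a concrete sketch, the two-sample hypothesis above can be checked on the click counts with a pooled two-proportion z-test. The calculation below simply follows the standard formula; it is not taken from the referenced post.

import numpy as np
import scipy.stats as scs

# Click data from the table above
clicks_a, total_a = 33, 33 + 362
clicks_b, total_b = 42, 42 + 283

p_a, p_b = clicks_a / total_a, clicks_b / total_b

# Pooled proportion under H0: p_A = p_B
p_pool = (clicks_a + clicks_b) / (total_a + total_b)
se = np.sqrt(p_pool * (1 - p_pool) * (1 / total_a + 1 / total_b))

# Two-sided z-test for the difference in click-through rates
z = (p_b - p_a) / se
p_value = 2 * (1 - scs.norm.cdf(abs(z)))

print(round(p_a, 4), round(p_b, 4), round(z, 3), round(p_value, 4))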

A/B Testing

  • Define a change of interest (i.e., the treatment) and a metric to evaluate it
  • Specify the control and treatment groups (i.e., Group A and Group B)
  • Construct the null hypothesis that the treatment group is the same as the control group
  • Calculate the test statistic from the control and treatment groups
  • Reject or fail to reject the null hypothesis

In A/B testing, we have a hypothesis about a treatment and are interested in its impact on the metric. In particular, we design a randomized experiment with two groups that are identical prior to the treatment. During the test, one group receives the treatment while the other does not.
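As a minimal sketch of this setup (the group labels, metric, and conversion rates below are made up purely for illustration), random assignment might look like this:

import numpy as np

rng = np.random.default_rng(seed=42)

n_users = 10_000
# Randomly assign each user to control (A) or treatment (B) with equal probability
group = rng.choice(["A", "B"], size=n_users)

# Hypothetical "true" click-through rates, used only to simulate outcomes here
true_ctr = {"A": 0.10, "B": 0.12}
clicked = rng.random(n_users) < np.where(group == "A", true_ctr["A"], true_ctr["B"])

# Observed metric per group
for g in ["A", "B"]:
    mask = group == g
    print(g, mask.sum(), "users, observed CTR =", round(clicked[mask].mean(), 4))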

Test Statistics

A/B testing includes applications of statistical hypothesis testing, or "two-sample hypothesis testing" as it is known in statistics. The test statistic varies with the metric used to compare the two groups.

  • The Z-test is appropriate for comparing means under stringent conditions regarding normality and a known standard deviation.
  • Student's t-test is appropriate for comparing means under relaxed conditions, when less is assumed.
  • Welch's t-test assumes the least and is therefore the most commonly used test in a two-sample hypothesis test where the mean of a metric is to be optimized.
  • Fisher's exact test is used to compare two binomial distributions, such as a click-through rate.
  Assumed Distribution   Example Case                       Standard Test                      Alternative Test
  Gaussian               Average revenue per user           Welch's t-test (unpaired t-test)   Student's t-test
  Binomial               Click-through rate                 Fisher's exact test                Barnard's test
  Poisson                Transactions per paying user       E-test                             C-test
  Multinomial            Number of each product purchased   Chi-squared test
  Unknown                                                   Mann–Whitney U test                Gibbs sampling

Reference Link: https://en.wikipedia.org/wiki/A/B_testing#Common_test_statistics
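As a quick illustration of how the test choice follows the metric (the per-user revenue below is simulated data, not real numbers), the same two groups can be compared with Welch's t-test or, when even less is assumed about the distribution, the Mann–Whitney U test:

import numpy as np
import scipy.stats as scs

rng = np.random.default_rng(seed=0)

# Simulated per-user revenue for the two groups (a continuous, roughly Gaussian metric)
revenue_a = rng.normal(loc=5.0, scale=2.0, size=500)
revenue_b = rng.normal(loc=5.3, scale=2.0, size=500)

# Welch's t-test: compares means without assuming equal variances
t_stat, p_welch = scs.ttest_ind(revenue_a, revenue_b, equal_var=False)

# Mann-Whitney U test: distribution-free alternative when normality is doubtful
u_stat, p_mwu = scs.mannwhitneyu(revenue_a, revenue_b, alternative="two-sided")

print(round(p_welch, 4), round(p_mwu, 4))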

Test Significance Level and Statistical Power

Oftentimes, a problem is given with a desired confidence level instead of the significance level ($\alpha$). A typical 95% confidence level for an A/B test corresponds to a significance level of $\alpha = 0.05$.

An experiment in A/B testing is typically set up with a minimum statistical power of 80 percent. In other words, if the treatment truly has an effect, we want the test to detect it with a probability of at least 80 percent.

By construction, if we increase the sample size for both the A and B groups, the sampling distributions under $H_0$ and $H_1$ become much narrower, and hence the statistical power increases.
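As a rough sketch of this relationship (the 10% and 12% conversion rates are made-up inputs), the power of a two-sided two-proportion z-test can be approximated analytically and watched grow with the per-group sample size:

import numpy as np
import scipy.stats as scs

def approx_power(p_a, p_b, n_per_group, sig_level=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    d = abs(p_b - p_a)
    pooled = (p_a + p_b) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n_per_group)
    z_alpha = scs.norm.ppf(1 - sig_level / 2)
    # Probability that the test statistic lands beyond the critical value under H1
    return scs.norm.cdf(d / se - z_alpha)

for n in [1000, 2000, 4000, 8000]:
    print(n, round(approx_power(0.10, 0.12, n), 3))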

Test Sample Size

The key is to determine the minimum sample size for the experiment, because it is directly related to how quickly you can complete the test and deliver statistically significant results.

There are sample size calculators available on the web. You will need the baseline conversion rate and the minimum detectable effect, which is the minimum difference between the control and treatment groups that the test should detect.

For detailed discussions of statistical power and sample size, you can also check the two notes referenced below: the Stanford lecture notes and the post by Nguyen Ngo.

According to the lecture notes from Stanford, we have this formula:

$$n_1 = \frac{r + 1}{r} \cdot \frac{\sigma^2 (z_{\beta} + z_{\alpha/2})^2}{d^2},$$

where $n_1$ is the sample size of the smaller group, $r$ is the ratio of the larger group to the smaller group, $\sigma$ is the standard deviation of the outcome variable, $(z_{\beta} + z_{\alpha/2})$ represents the critical values at the desired power and significance level (two-sided testing), and $d$ is the size of the treatment effect (i.e., the difference in means).

All-purpose power formula:

$$\frac{d}{\text{s.e.}(d)} = z_{\beta} + z_{\alpha/2}$$

Standard error (s.e.) of $d$:

$$\text{s.e.}(d) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

Subsequently, this formula can be adapted to a case with a binary metric:

$$\sigma^2 = \bar{p}(1 - \bar{p}), \quad \text{where } \bar{p} = \frac{p_A + p_B}{2}.$$

For a test with two equal-size groups, $r = 1$ and the size of one group is $n$, given by

$$n = \frac{2\,\bar{p}(1 - \bar{p})(z_{\beta} + z_{\alpha/2})^2}{d^2}.$$

Thus, the total sample size for the two groups is $2n$.

A corresponding Python script is provided in a post by Nguyen Ngo:

import scipy.stats as scs

def min_sample_size(bcr, mde, power=0.8, sig_level=0.05):
    """Returns the minimum sample size to set up a split test

    Arguments:
        bcr (float): probability of success for control, sometimes
            referred to as baseline conversion rate

        mde (float): minimum change in measurement between control
            group and test group if alternative hypothesis is true,
            sometimes referred to as minimum detectable effect

        power (float): probability of rejecting the null hypothesis when the
            null hypothesis is false, typically 0.8

        sig_level (float): significance level often denoted as alpha,
            typically 0.05

    Returns:
        min_N: minimum sample size (float)

    References:
        Stanford lecture on sample sizes
        http://statweb.stanford.edu/~susan/courses/s141/hopower.pdf
    """
    # standard normal distribution to determine z-values
    standard_norm = scs.norm(0, 1)

    # find Z_beta from desired power
    Z_beta = standard_norm.ppf(power)

    # find Z_alpha
    Z_alpha = standard_norm.ppf(1 - sig_level / 2)

    # average of probabilities from both groups
    pooled_prob = (bcr + bcr + mde) / 2

    min_N = (2 * pooled_prob * (1 - pooled_prob) * (Z_beta + Z_alpha)**2
             / mde**2)

    return min_N
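As a usage sketch (the baseline rate and detectable effect below are illustrative inputs), note that the returned value corresponds to the size of one group, so the total traffic needed is roughly twice that:

# Hypothetical inputs: 10% baseline conversion rate, 2-percentage-point lift
n_per_group = min_sample_size(bcr=0.10, mde=0.02)

print(round(n_per_group))      # per-group sample size, roughly 3,842
print(round(2 * n_per_group))  # total sample size across both groups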

Refresher on A/B Testing

This part summarizes the article published in Harvard Business Review.

How Does A/B Testing Work?

A/B testing, at its most basic, is a way to compare two versions of something to figure out which performs better. "The A/B test can be considered the most basic kind of randomized controlled experiment," Fung says. "In its simplest form, there are two treatments and one acts as the control for the other." As with all randomized controlled experiments, you must estimate the sample size you need to achieve statistical significance, which will help you make sure the result you're seeing "isn't just because of background noise," Fung says.

Lots of managers run sequential tests — e.g., testing size first (large versus small), then testing color (blue versus red), then testing typeface (Times versus Arial) — because they believe they shouldn’t vary two or more factors at the same time. But according to Fung, that view has been debunked by statisticians. And sequential tests are suboptimal because you're not measuring what happens when factors interact. For example, it may be that users prefer blue on average but prefer red when it’s combined with Arial. This kind of result is regularly missed in sequential A/B testing because the typeface test is run on blue buttons that have “won” the prior test.

Using mathematics, you can "smartly pick and run only certain subsets of those treatments; then you can infer the rest from the data." This is called "multivariate" testing in the A/B testing world and often means you end up doing an A/B/C test or even an A/B/C/D test. In the example above with colors and size, it might mean showing different groups: a large red button, a small red button, a large blue button, and a small blue button.
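As a tiny sketch of what those combined treatments look like (the factor names and levels are simply the ones from the example), the full set of variants can be enumerated like this:

from itertools import product

# Factors and levels from the example above
sizes = ["large", "small"]
colors = ["red", "blue"]

# Full factorial: every combination of size and color becomes one variant
variants = [f"{size} {color} button" for size, color in product(sizes, colors)]
print(variants)
# ['large red button', 'large blue button', 'small red button', 'small blue button']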

What Mistakes Do People Make When Doing A/B Tests?

  • First, he says, too many managers don't let the tests run their course.
    • The problem is that, because of randomization, it’s possible that if you let the test run to its natural end, you might get a different result.
  • The second mistake is looking at too many metrics.
    • The problem is that if you're looking at such a large number of metrics at the same time, you’re at risk of making what statisticians call "spurious correlations."
    • In proper test design, you should decide on the metrics you’re going to look at before you execute an experiment and select a few. The more you're measuring, the more likely that you're going to see random fluctuations.
  • Lastly, few companies do enough retesting.
    • Even though there may be little chance that any given A/B result is driven by random chance, if you run lots of A/B tests, the chance that at least one of your results is wrong grows rapidly (see the quick calculation after this list).
    • The smaller the improvement, the less reliable the results.
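To make the retesting point concrete, here is a quick back-of-the-envelope calculation, assuming independent tests that are each run at a 5 percent significance level with no true effect present:

sig_level = 0.05  # per-test false positive rate when there is no real effect

# Probability of at least one false positive across k independent tests
for k in [1, 5, 10, 20]:
    at_least_one = 1 - (1 - sig_level) ** k
    print(k, round(at_least_one, 2))
# With 20 such tests, the chance of at least one spurious "win" is already about 64%.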

It's clear that A/B testing is not a panacea. There are more complex kinds of experiments that are more efficient and will give you more reliable data. But A/B testing is a great way to gain a quick understanding of a question you have.