Short Notes on A/B Testing
2021-09-16

Motivation

  • Understand what drives your business and provide insights for business decisions
  • Understand causal relationships

Prerequisites

  • The control and testing groups can be clearly defined
  • Metrics of interest can be quantified
  • Data can be collected in a timely manner

Five Stages in Practice

  • Define a study question
  • Make hypotheses (variations) and identify metrics
  • Identify variables
    • Determine the data to be collected
  • Run experiments and collect data
    • Determine a detectable difference (i.e., how small a difference you would like to detect, for example, a 10% increase in your metric of interest)
    • Calculate the proper sample size using power analysis (see the sketch after this list)
    • Determine what fraction of traffic can be used in the treatment
    • Conduct an A/A test beforehand to check for unfavorable impacts on the business, and run a simultaneous A/A test to track seasonality and any systematic biases/trends
  • Measure results
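
For a rough sense of how the detectable difference, sample size, and traffic fraction fit together, here is a minimal sketch in base R. The baseline rate, relative lift, daily traffic, and traffic fraction below are hypothetical numbers, not recommendations; power.prop.test is discussed further in the Power Analysis section.

# Sketch: from a detectable difference to a sample size and experiment duration.
# All numbers below (baseline rate, lift, traffic, traffic fraction) are made up.
baseline <- 0.10                        # current conversion rate
lift     <- 0.10                        # smallest relative increase worth detecting
target   <- baseline * (1 + lift)       # 0.11

res <- power.prop.test(p1 = baseline, p2 = target, sig.level = 0.05, power = 0.80)
n_per_group <- ceiling(res$n)           # required users in each of control and treatment

daily_visitors   <- 20000               # hypothetical site traffic
traffic_fraction <- 0.20                # share of traffic diverted to the experiment
days_needed <- ceiling(2 * n_per_group / (daily_visitors * traffic_fraction))

n_per_group
days_needed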

Some Details in A/B Testing

Common Web Analytics Metrics

  • Count: Page View, Visits / Return Visits, Click, Visitor / Unique Visitor, (Daily / Monthly) Active Users
  • Conversion: Click Thru Rate, Click Thru Probability, User Click Probability, Bounce Rate
  • Time: Active Time, Page View Duration
  • Business: Revenue, Member, Order

In addition, we can also consider a composite metric.
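
For illustration, here is a small sketch in R with made-up event data that shows the difference between click-through rate (clicks per page view) and click-through probability (unique users who click per unique visitor), plus one possible composite metric:

# Hypothetical event log: one row per page view, with a user id and a click flag.
events <- data.frame(
  user  = c("u1", "u1", "u2", "u3", "u3", "u3"),
  click = c(1,     0,    0,    1,    1,    0)
)

click_thru_rate <- sum(events$click) / nrow(events)               # clicks / page views
click_thru_prob <- length(unique(events$user[events$click == 1])) /
                   length(unique(events$user))                    # clicking users / unique users

# A composite metric, e.g., revenue per unique visitor (revenue figure is made up).
revenue <- 42.0
revenue_per_visitor <- revenue / length(unique(events$user))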

Examples of Changes for Testing

  • Page Contents
    • Headlines, Sub Headlines, Font Size
    • Background Image, Background Color
    • Paragraph Text
    • Page Layout
  • Call-to-Action
    • Button Place, Button Color, Button Size
    • Text

Notes: Change only one element between a control and treatment pair, so that any observed difference can be attributed to that single change.

Experiment Settings

  • Target Audience

    • Country, Region
    • Demographics
  • Sample Size

  • Experiment Period (Time)

  • Percentage of Traffic for A/B Testing

  • Split for Control and Treatment

    Notes: Users who visit the page at different times or from different devices might see different variants during the test. These users fall into a mixed group, belonging to neither A nor B. To address this, we may need to split users (rather than visits) evenly and consistently between the control and treatment groups, for example by assigning each user deterministically from a stable user id (see the sketch after this list). Theoretically, the percentage of mixed-group users in A and B should then be similar.

  • A/A Test

    • Run a small A/A test shortly before the A/B test to check the change in the metrics of interest and whether there are any unfavorable impacts on the business

    • Run an A/A test simultaneously to track the systematic trend during the A/B test period
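
One way to avoid the mixed group is to assign each user to a bucket deterministically from a stable user id, so the same user always sees the same variant regardless of time or device. Below is a toy sketch in base R; a production system would typically hash the user id together with a per-experiment salt (the salt value and the pseudo-hash here are only for illustration).

# Toy deterministic bucketing: the same user id always maps to the same group.
# A real system would use a proper hash of user_id + experiment salt.
assign_group <- function(user_id, salt = "exp-001", treatment_share = 0.5) {
  codes  <- utf8ToInt(paste0(user_id, salt))
  bucket <- sum(codes * seq_along(codes)) %% 100     # pseudo-hash into 0..99
  if (bucket < treatment_share * 100) "treatment" else "control"
}

assign_group("user-123")   # always returns the same group for this id
assign_group("user-456")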

Power Analysis

  • False Positive (Type I Error): Falsely reject the null hypothesis H0
    • The false positive rate (α, e.g., 5%) is the significance level of a statistical test
  • False Negative (Type II Error): Fail to reject H0 (i.e., we should reject it but we did not)
    • The false negative rate (β, e.g., 20%) is used in calculating the power of a test, i.e., power = 1 − β

Discuss the values of α and β with business partners.
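
For reference, a common closed-form approximation for the per-group sample size of a two-proportion test (the kind of calculation the R function below performs, up to small refinements) is

\[
n \approx \frac{\left(z_{1-\alpha/2} + z_{1-\beta}\right)^{2}\,\bigl[p_1(1-p_1) + p_2(1-p_2)\bigr]}{(p_1 - p_2)^{2}},
\]

where p1 is the baseline rate and p2 is the rate under the detectable difference.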

# Here is a function in R
# https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/power.prop.test
# power.prop.test(n = NULL, p1 = NULL, p2 = NULL,
#                 power = NULL, sig.level = 0.05,
#                 alternative = c("two.sided", "one.sided"),
#                 strict = FALSE, tol = .Machine$double.eps^0.25)

# Examples
power.prop.test(p1 = 0.5, p2 = 0.75, power = 0.90)   ## => n = 76.7 in each group
power.prop.test(p1 = 0.5, p2 = 0.501,
                power = 0.90, sig.level = .001)      ## => n = 10451937 in each group
power.prop.test(n = 50, p1 = 0.5, p2 = 0.75)         ## => power = 0.740
power.prop.test(n = 50, p1 = 0.5, power = 0.90)      ## => p2 = 0.8026
power.prop.test(n = 50, p1 = 0.5, p2 = 0.9,
                power = 0.90, sig.level = NULL)      ## => sig.l = 0.00131

Result Evaluation

Group             Control - A   Variation - B
Unique Visitor    500           500
Unique Click      50            60
Conversion Rate   10%           12%
  • Calculate the 95% confidence interval for the difference in conversion rates (see the sketch after this list)
    • The interval contains 0, so we cannot reject H0
  • Suppose the visitor and click counts become 10 times larger
    • The interval now excludes 0, so we reject H0
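
A quick way to reproduce this check in R, using the counts in the table above, is a two-sample proportion test, which reports both a p-value and the 95% confidence interval for the difference:

# Two-sample proportion test for the table above: 50/500 vs 60/500 conversions.
prop.test(x = c(50, 60), n = c(500, 500))
# The 95% confidence interval for the difference contains 0 -> cannot reject H0.

# Same conversion rates with counts ten times larger: 500/5000 vs 600/5000.
prop.test(x = c(500, 600), n = c(5000, 5000))
# The 95% confidence interval now excludes 0 -> reject H0.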

Some Challenges in A/B Testing

Tradeoff between α and β

  • By definition, α is the false positive rate, representing the chance that we falsely reject H0. In contrast, β is the false negative rate, representing the chance that we should reject H0 but didn't.

  • Since resources and time are limited, we need to put our effort into the projects that improve the business most significantly and have the largest favorable impact.

  • As a result, we might emphasize keeping α small at the expense of a larger β. Also, remember to reach an agreement with business partners on these values before the test.

Insignificant Treatment Effect

  • It is worth noting that even when the difference between control and treatment is statistically insignificant, the tested feature can still be helpful in the long run.
  • Generate a line plot of the metric over time for both groups and check whether one stays above the other most of the time, even if the difference is statistically insignificant; such a plot can also provide additional insights (see the sketch below).
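
A minimal sketch of such a line plot in base R, with made-up daily conversion rates for the two groups:

# Hypothetical daily conversion rates over a 14-day experiment.
set.seed(42)
days      <- 1:14
control   <- 0.10 + rnorm(14, sd = 0.005)
treatment <- 0.11 + rnorm(14, sd = 0.005)

plot(days, treatment, type = "l", col = "blue",
     ylim = range(c(control, treatment)),
     xlab = "Day", ylab = "Conversion rate",
     main = "Daily conversion rate by group")
lines(days, control, col = "red")
legend("bottomright", legend = c("Treatment", "Control"),
       col = c("blue", "red"), lty = 1)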

Multi-armed Bandit Approach

We want to achieve two goals at the same time: (1) find the best variant over the course of the experiment and (2) maximize revenue during the experiment period as well.

  • Solution: Adjust the fraction of (new) users assigned to treatment/control according to which group appears to be doing better, as in the sketch below.
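
One common implementation of this idea is Thompson sampling with Beta posteriors. A minimal simulation sketch in base R (the true conversion rates are made up, and the loop stands in for users arriving one at a time):

# Thompson sampling for a two-armed bandit (arm 1 = control, arm 2 = treatment).
set.seed(1)
true_rate <- c(0.10, 0.12)   # made-up "true" conversion rates for simulation
success   <- c(0, 0)
failure   <- c(0, 0)

for (i in 1:10000) {
  # Draw a plausible conversion rate for each arm from its Beta posterior.
  draws <- rbeta(2, success + 1, failure + 1)
  arm   <- which.max(draws)            # route the next user to the better-looking arm
  converted <- runif(1) < true_rate[arm]
  success[arm] <- success[arm] + converted
  failure[arm] <- failure[arm] + !converted
}

success + failure             # traffic gradually shifts toward the better arm
success / (success + failure) # observed conversion rates per arm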