Randomized Experiments Playbook¶
%load_ext autoreload
import sys, os
import logging
import datetime as dt
from datetime import datetime
import numpy as np
import scipy.stats as st
import plotly.graph_objects as go
import plotly.express as px
import plotly.figure_factory as ff
from plotly.subplots import make_subplots
from scipy import stats
from numpy.typing import ArrayLike
from scipy.integrate import quad
from scipy.stats import norm, binom
s_logger = logging.getLogger('py4j.java_gateway')
s_logger.setLevel(logging.ERROR)
c_logger = logging.getLogger('py4j.clientserver')
c_logger.setLevel(logging.ERROR)
logging.basicConfig(
level=logging.INFO, format="%(asctime)s - %(name)s - %(levelname)s - %(message)s"
)
logger = logging.getLogger(__name__)
Introduction¶
This notebook is a practical, code-first primer on hypothesis testing and experiment design for product and engineering experiments. It covers classical fixed-horizon inference and sample-size planning: confidence intervals, one-sample vs. two-sample tests, and paired (matched) designs for both continuous outcomes (e.g., latency, time-on-page) and dichotomous outcomes (e.g., conversion). It then introduces sequential testing concepts (e.g., always-valid ideas / mSPRT-style thinking) and briefly discusses multi-armed bandits as an alternative decision-making framework. The goal throughout is to translate statistical inputs—significance level $\alpha$, power $1-\beta$, variability ($\sigma$ or $p$), and either a margin-of-error target (for estimation) or a minimum detectable effect (for hypothesis testing)—into a concrete experiment plan: required sample size per arm (and total), with optional adjustment for expected attrition/unusable data. The Python helpers compute critical values using exact Normal quantiles (via SciPy), so results can differ slightly from hand calculations that use rounded Z values.
Keywords: A/B testing, experiment design, hypothesis testing, sample size, sequential testing (mSPRT), multi-armed bandits
Fixed Horizon Testing¶
Test Statistics¶
Consider a random variable from an arbitrary distribution with standard deviation $\sigma$. If the population mean is $\mu$, the means $\tilde {X}^{(n)}$ of samples of size $n$ are random variables that are (approximately, by the central limit theorem) normally distributed as $\tilde {X}^{(n)} \sim \mathcal N(\mu, \frac{\sigma^2}{n})$. This is called the sampling distribution. When $\sigma$ is replaced by the sample standard deviation $s$, the scaled variable $T = \frac{\tilde {X}^{(n)} - \mu}{\frac {s}{\sqrt n}}$ has a Student t distribution; for large $n$ (or when $\sigma$ is known) we use $Z = \frac{\tilde {X}^{(n)} - \mu}{\frac {\sigma}{\sqrt n}}$, which has a standard normal distribution $\mathcal N(0, 1)$. This is called the Z statistic. From this, we can see that the random variable $\tilde X$ can be written as $\tilde {X}^{(n)} = \mu + Z \frac{\sigma}{\sqrt n}$. Now for a sample $X^{(n)} = \lbrace x_1, x_2, ..., x_n \rbrace$ with average $\bar X$, we can estimate $\mu$ by $\bar X$ and rewrite the equation as $\tilde {X}^{(n)} = \bar X + Z \frac{\sigma}{\sqrt n}$. In this case a confidence interval is defined as $\bar X \pm Z\frac{\sigma}{\sqrt n}$. Note that due to symmetry, instead of the actual algebraic value of Z at the left end of the interval (negative), we use the positive value from the right end and write a $\pm$ sign.
Now when we have two independent samples of sizes $n_1$ and $n_2$ from two populations, the difference between the sample means $\tilde {X} _1^{(n_1)} - \tilde {X} _{2}^{(n_2)}$ is normally distributed $\tilde {X} _{1}^{(n_1)} - \tilde {X} _{2}^{(n_2)} \sim \mathcal N (\mu_1 - \mu_2, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})$. In this case the statistic $Z = \frac{(\tilde {X} _{1}^{(n_1)} - \tilde {X} _{2}^{(n_2)}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$ has a standard normal distribution $\mathcal N(0, 1)$.
In a similar manner as above, for two samples $X_1^ {(n_1)} = \lbrace x_{11}, x_{12}, ..., x_{1n_1} \rbrace$ with average $\bar X_1$ and $X_2^ {(n_2)} = \lbrace x_{21}, x_{22}, ..., x_{2n_2} \rbrace$ with average $\bar X_2$, we can estimate $\mu_1 - \mu_2$ by $\bar X_1 - \bar X_2$ and rewrite the equation as $\tilde {X} _{1}^{(n_1)} - \tilde {X} _{2}^{(n_2)} = \bar X_1 - \bar X_2 + Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. The confidence interval in this case is $\bar X_1 - \bar X_2 \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$.
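The following cell is a minimal sketch of these two confidence intervals using exact Normal quantiles from SciPy; the sample means, standard deviations, and sample sizes are made-up values used purely for illustration.
# Illustrative sketch (made-up numbers): one-sample and two-sample confidence intervals.
import numpy as np
import scipy.stats as st
z = st.norm.ppf(1 - 0.05 / 2)  # two-sided 95% critical value, ~1.96
# One sample: x_bar +/- z * sigma / sqrt(n)
x_bar, sigma, n = 120.0, 30.0, 64
half_width = z * sigma / np.sqrt(n)
print(f"one-sample 95% CI: [{x_bar - half_width:.2f}, {x_bar + half_width:.2f}]")
# Two samples: (x1_bar - x2_bar) +/- z * sqrt(s1^2/n1 + s2^2/n2)
x1_bar, s1, n1 = 120.0, 30.0, 64
x2_bar, s2, n2 = 112.0, 28.0, 80
diff = x1_bar - x2_bar
half_width = z * np.sqrt(s1**2 / n1 + s2**2 / n2)
print(f"two-sample 95% CI for the difference: [{diff - half_width:.2f}, {diff + half_width:.2f}]")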
Classical Fixed-Horizon Tests¶
For reference and further reading, see [Sullivan, 2017].
# Run this cell to start collecting the examples that are given in the text and run them later.
# --- Examples (incremental build) ---
fixed_horizon_examples = {}
Estimation of Mean¶
In what follows in this section, we are interested in estimating the mean of a population or the differences between the means of two populations.
If $X_1,\dots,X_n$ is a random sample from a population with mean $\mu$ and standard deviation $\sigma$, then (using a Normal approximation) a confidence interval for $\mu$ has the form $\bar X \pm Z\,\frac{\sigma}{\sqrt{n}}$, where $Z$ is the standard-normal critical value for the chosen confidence level (e.g., $Z \approx 1.96$ for 95%). In many practical settings $\sigma$ is unknown and is replaced by a sample-based estimate, but the dependence on $n$ is the same: larger samples produce tighter intervals.
The half-width of the interval, $E = Z\,\frac{\sigma}{\sqrt{n}}$, is called the margin of error. When designing a study, we often choose a target margin of error $E$ and solve for the required sample size. For example, suppose we want to estimate the mean page load time and would like the 95% confidence interval to be within $\pm 50$ ms. If prior monitoring suggests the standard deviation is about $\sigma = 200$ ms, then we choose $n$ so that $E \le 50$ ms.
Solving the margin-of-error expression for $n$ gives: $E = Z\frac{\sigma}{\sqrt{n}} \quad\Longrightarrow\quad n = \Big(\frac{Z\sigma}{E}\Big)^2$.
The formula above applies to one-sample studies in which we choose the sample size to control the margin of error of the estimate.
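As a quick check of the page-load planning example above ($\sigma = 200$ ms, target $E = 50$ ms at 95% confidence), the next cell applies the formula directly with the exact Normal quantile.
# Quick check of the margin-of-error formula for the page load planning example above.
import numpy as np
import scipy.stats as st
sigma, E, alpha = 200.0, 50.0, 0.05     # ms, ms, two-sided significance level
z = st.norm.ppf(1 - alpha / 2)          # ~1.96
n = int(np.ceil((z * sigma / E) ** 2))  # round up to the next whole observation
print(f"required n = {n}")              # 62 with the exact quantile (~61.5 before rounding)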
One Sample¶
In these studies we want to estimate the mean of an outcome variable in a single population. The required sample size is $n = \Big(\frac{Z\sigma}{E}\Big)^2$. We round up because $n$ must be an integer.
Continuous outcome: The required sample size is $n = \Big(\frac{Z\sigma}{E}\Big)^2$. For example 1, suppose we’re estimating the population mean time-on-page (seconds) and we want a 95% confidence interval ($Z=1.96$) whose margin of error is at most 5 seconds. From prior cohorts we believe the standard deviation of time-on-page is in the range 12–18 seconds. To be conservative, we plan using the larger value $\sigma=18$. The required sample size is $n = \Big(\frac{1.96(18)}{5}\Big)^2 \approx 50$.
Dichotomous outcome: The required sample size is $n = p(1 - p)\Big(\frac{Z}{E}\Big)^2$. For example 2, suppose we want to estimate the proportion of users who enable a new privacy setting (a binary outcome). How many users should we sample to get a 95% confidence interval whose margin of error is at most 4 percentage points ($E=0.04$) of the true proportion? For a known proportion $p$ in the population, the standard deviation is $\sqrt {p(1-p)}$. If the true rate $p$ is unknown, a conservative choice is $p = 0.5$, which maximizes the standard deviation. The required sample size is $n = p(1-p) \Big(\frac{Z}{E} \Big)^2 = 0.5(1-0.5) \Big(\frac{1.96}{0.04} \Big)^2 \approx 601$.
# Example 1: one-sample estimation, continuous
fixed_horizon_examples["example 1"] = {
"type": "one sample estimation of mean (continuous)",
"mode": "ME",
"outcome": "continuous",
"one_sample": True,
"two_sample": False,
"sides": 2,
"alpha": 0.05,
"me": 5,
"sigma": 18,
"expected_n": 50,
}
# Example 2: one-sample estimation, dichotomous
fixed_horizon_examples["example 2"] = {
"type": "one sample estimation of proportion (dichotomous)",
"mode": "ME",
"outcome": "dichotomous",
"one_sample": True,
"two_sample": False,
"sides": 2,
"alpha": 0.05,
"me": 0.04,
"p": 0.5, # conservative
"expected_n": 601,
}
Two Independent Samples¶
In these studies we want to estimate the difference between the means of two independent populations. Assuming equal variances, the required sample size per group is $n = 2\Big(\frac{Z\sigma}{E}\Big)^2$, where $\sigma$ is the standard deviation of the outcome of interest, which can be estimated by the pooled standard deviation of the two samples:
$n = 2\Big(\frac{ZS_p}{E}\Big)^2$ where the unbiased $S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2-2}}$ or, for large $n_1, n_2$, $S_p = \sqrt{\frac{n_1S_1^2 + n_2S_2^2}{n_1 + n_2}}$. In other words, instead of treating the two standard deviations separately, we assume the two populations share a common standard deviation given by the pooled estimate.
For large samples ($n_1, n_2 > 30$), the confidence interval for the difference has the form $\bar X_1 - \bar X_2 \pm Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$, which for equal sample sizes and a common variance reduces to $\bar X_1 - \bar X_2 \pm Z \sigma \sqrt{\frac{2}{n}}$.
Refresher: For two independent random variables, if $Var(X_1) = \frac{\sigma_1^2}{n_1}$ and $Var(X_2) = \frac{\sigma_2^2}{n_2}$ then $Var(X_1 - X_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$
Continuous outcome with known common variance: For example 3, suppose we want to evaluate the impact of a new onboarding flow on time-on-page (seconds) using a two-arm randomized experiment (control vs treatment). We want a 95% confidence interval ($Z=1.96$) for the difference in means whose half-width (margin of error) is at most 4 seconds. From past reports, we estimate a common standard deviation of 24 seconds for time-on-page for any onboarding flow. We also expect about 10% of sampled users to be unusable for analysis (e.g., missing telemetry / early exits). With equal allocation, the per-arm sample size is $2\Big(\frac{1.96(24)}{4}\Big)^2 \approx 277$ which after adjusting for 10% dropout, the per arm size is $277/0.9 \approx 308$. So the total sample size is $2 \times 308 = 616$.
Continuous outcome with pooled variance: For example 4, suppose we want to compare two onboarding experiences (Flow A vs Flow B) for new users and estimate the difference in mean for 7-day activation time (hours) between the two groups with 95% confidence. Assume we expect about a 15% unusable-data/dropout rate (users who churn before day 7, missing telemetry, etc.), and we want the margin of error for the difference in means to be at most $E=3$ hours. Suppose we find two prior studies on a related population of legacy users that we can use to approximate the variability. The first study reports a standard deviation of 12 hours for 160 users for a process that is very similar to Flow A, and the second study reports a standard deviation of 13.5 hours for 120 users for a process that is very similar to flow B. Using these values, we can estimate the pooled standard deviation $S_p = \sqrt{\frac{(160 - 1)12^2 + (120 - 1)13.5^2}{160 + 120-2}} \approx 12.7$. The sample size per arm is $2\Big(\frac{1.96(12.7)}{3}\Big)^2 \approx 138$. Correcting for 15% attrition, we get $\frac{138}{0.85} \approx 163$, hence the total sample size is $2 \times 163 = 326$.
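The pooled standard deviation used in example 4 can be checked directly; the next cell is a small sketch of that computation.
# Pooled standard deviation for example 4 (prior studies with 160 and 120 users).
import numpy as np
n1, s1 = 160, 12.0   # study similar to Flow A
n2, s2 = 120, 13.5   # study similar to Flow B
s_pooled = np.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
print(f"S_p = {s_pooled:.2f}")  # ~12.66, rounded to 12.7 in the text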
Continuous outcome with matched samples: The required sample size is $n = \Big(\frac{Z\sigma_d}{E}\Big)^2$. Notice the difference: $\sigma_d$ is the standard deviation of the differences of the scores. In the previous cases, we had independent samples $X = \lbrace x_1, x_2, ..., x_n\rbrace$ and $Y = \lbrace y_1, y_2, ..., y_m\rbrace$. The indices in each set represent different members. The $\sigma_x$ and $\sigma_y$ are the standard deviations of the populations from which X and Y were drawn, which can be called $\sigma$ if the variances are equal. In the matched case, we are referring to the sample $Z = \lbrace x_1 - y_1, x_2 - y_2, ..., x_n - y_n \rbrace$ where indices represent the same member (x and y are pre and post treatment) and $\sigma_d$ is the standard deviation of the paired differences. For example 5, suppose we want to measure the effect of a performance optimization on the same users by comparing their page load time (ms) before vs after the change. We sample $n$ users, record each user's average load time during a baseline week $x_i$, then record the same user's average load time during the week after rollout $y_i$. This is a matched (paired) design because $x_i$ and $y_i$ come from the same user. Define the paired differences $d_i = x_i - y_i$, where a positive $d_i$ means the page got faster after the change. We want a 95% confidence interval for the mean improvement $\mu_d$ with margin of error at most $E=10$ ms. From a pilot study, we estimate the standard deviation of the paired differences to be $\sigma_d = 45$ ms. The required sample size is $n = \Big(\frac{Z\sigma_d}{E}\Big)^2 = \Big(\frac{1.96\cdot 45}{10}\Big)^2 \approx 78$. So we should measure about 78 users (paired before/after) to achieve the desired precision.
Dichotomous outcome: You can use the same equation $n = 2\Big(\frac{ZS_p}{E}\Big)^2$ for the required sample size by using $S_1^2 = p_1(1 - p_1)$ and $S_2^2 = p_2(1 - p_2)$ in the pooled standard deviation. For equal sample sizes $n_1 = n_2$, it simplifies to $\Big[p_1(1 - p_1) + p_2(1 - p_2) \Big] \Big(\frac{Z}{E} \Big)^2$ . For common prevalences, it further simplifies to $\Big[2p(1 - p) \Big] \Big(\frac{Z}{E} \Big)^2$, where $p$ is the common or average prevalence. If the prevalences are unknown, we can use p = 0.5 for the most conservative estimate of the sample size. For example 6, suppose you want to estimate the effect of a new calibration procedure on the defect rate (defective vs. not defective) for two production lines A (current process) and B (new calibration process). You plan on doing an experiment with equal sample sizes from each line. Based on engineering judgment, you expect the defect prevalence to be around $p \approx 0.08$ in this product family. You want 95% confidence and a margin of error $E=0.03$ for the estimated difference in defect rates. Using the common-prevalence formula, the required sample size is $2(0.08)(0.92)\left(\frac{1.96}{0.03}\right)^2 \approx 629$ per line, or 1258 total. If you truly don’t know the defect prevalence, using $p=0.5$ gives the most conservative (largest) sample size $2(0.5)(0.5)\left(\frac{1.96}{0.03}\right)^2 \approx 2135$ per line or 4270 total.
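The next cell is a quick check of example 6, both with the planned prevalence and with the conservative $p = 0.5$ choice.
# Check of example 6: per-line sample size for estimating the difference in defect rates.
import numpy as np
import scipy.stats as st
z, E = st.norm.ppf(1 - 0.05 / 2), 0.03
for p in (0.08, 0.5):  # planned prevalence vs. most conservative choice
    n_per_line = int(np.ceil(2 * p * (1 - p) * (z / E) ** 2))
    print(f"p = {p}: n per line = {n_per_line}, total = {2 * n_per_line}")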
# Example 3: two-sample estimation, continuous (common sigma), with dropout adjustment
fixed_horizon_examples["example 3"] = {
"type": "two sample estimation of mean difference (continuous, common sigma)",
"mode": "ME",
"outcome": "continuous",
"one_sample": False,
"two_sample": True,
"sides": 2,
"alpha": 0.05,
"me": 4,
"sigma": 24,
"dropout": 0.10, # unusable fraction
"expected_n_raw_per_group": 277,
"expected_n_adj_per_group": 308,
"expected_total_adj": 616,
}
# Example 4: two-sample estimation, continuous (pooled sigma), with dropout adjustment
fixed_horizon_examples["example 4"] = {
"type": "two sample estimation of mean difference (continuous, pooled sigma)",
"mode": "ME",
"outcome": "continuous",
"one_sample": False,
"two_sample": True,
"sides": 2,
"alpha": 0.05,
"me": 3,
"sigma": 12.7, # pooled S_p (computed outside)
"dropout": 0.15,
"expected_n_raw_per_group": 138,
"expected_n_adj_per_group": 163,
"expected_total_adj": 326,
}
# Example 5: matched/paired estimation, continuous (sigma_d)
fixed_horizon_examples["example 5"] = {
"type": "matched samples estimation of mean difference (continuous)",
"mode": "ME",
"outcome": "continuous",
"one_sample": True, # effectively one set of paired diffs
"two_sample": False,
"sides": 2,
"alpha": 0.05,
"me": 10,
"sigma": 45, # sigma_d
"expected_n": 78,
}
# Example 6: two-sample estimation, dichotomous (common prevalence p)
fixed_horizon_examples["example 6"] = {
"type": "two sample estimation of proportion difference (dichotomous, common p)",
"mode": "ME",
"outcome": "dichotomous",
"one_sample": False,
"two_sample": True,
"sides": 2,
"alpha": 0.05,
"me": 0.03,
"p": 0.08, # common/avg prevalence
"expected_n_per_group": 629,
"expected_total": 1258,
}
Hypothesis Testing¶
So far our calculated sample sizes control type I error by setting the significance level $\alpha = P(\text{Type I error}) = P(\text{reject } H_0 \mid H_0 \text{ is true})$. But we should also be concerned about type II error, where $\beta = P(\text{Type II error}) = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})$. The quantity $1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})$ is called power. In this section we first present the formulas and examples and then derive the equations.
One sample¶
The goal is to compare the mean of an outcome variable in a single population to a known mean, so, the hypotheses of interest are:
$H_0: \mu = \mu_0$ and $H_1: \mu \neq \mu_0$, where $\mu_0$ is the known mean (e.g., a historical control). In practice we often consider $H_1: \mu = \mu_1$ as the alternative, based on a minimum detectable effect $\delta = \big \vert \mu_1 - \mu_0 \big \vert$. The required sample size is $n = \Big(\frac{Z_{1 - \alpha / 2} + Z_{1 - \beta}}{ES} \Big)^2$ where $ES = \frac{\big \vert \mu_1 - \mu_0 \big \vert}{\sigma}$ for continuous variables and $ES = \frac{\big \vert p_1 - p_0 \big \vert}{\sqrt{p_0(1 - p_0)}}$ for dichotomous variables. This setup is commonly known as a two-sided test. There is another setup called a one-sided test, for which $H_0: \mu \ge \mu_0 \;\text{vs}\; H_1: \mu < \mu_0$ for a left-tailed test (testing for a decrease), and $H_0: \mu \le \mu_0 \;\text{vs}\; H_1: \mu > \mu_0$ for a right-tailed test (testing for an increase). For a one-sided test, the effect sizes are the same but the required sample size is $n = \Big(\frac{Z_{1 - \alpha} + Z_{1 - \beta}}{ES} \Big)^2$.
- Continuous outcome: For example 7(a), we want to test whether installation of a new workstation in the plant changed (for better or worse) the mean assembly time with 95% confidence ($Z_{1-\alpha/2}=1.96$) and 80% power ($Z_{1-\beta}=0.84$). The baseline mean assembly time is 120 seconds and we consider a 12 second change of assembly time to be significant. If the standard deviation of the assembly time for this plant is known to be 30 seconds, the effect size is $ES = \frac{\big \vert 120 - 108 \big \vert}{30} = 0.4$ and the required sample size is $n = \Big(\frac{1.96 + 0.84}{0.4} \Big)^2 \approx 49$. Now in example 7(b), if instead of detecting a change, we were interested to see if the new installation has meaningfully reduced the assembly time by 12 seconds, we would use the one sided formula with ($Z_{1-\alpha} = 1.645$): $n = \Big(\frac{1.645 + 0.84}{0.4} \Big)^2 \approx 39$.
- Dichotomous outcome: For example 8, an online retailer launches a new checkout page and wants to test whether its conversion rate exceeds a known historical benchmark. Assume the historical benchmark conversion rate is $p_0 = 0.03$ and the conversion rate worth detecting is $p_1 = 0.033$, hence a difference bigger than $\Delta = p_1 - p_0 = 0.003$ is significant. For 95% confidence and 80% power, the one-sided $ Z_{1-\alpha} = 1.645$ and $Z_{1-\beta} = 0.84$. We calculate the effect size as $ES = \frac{\big \vert 0.003 \big \vert}{\sqrt{0.03(1 - 0.03)}} \approx 0.0176$. The required sample size is $\Big(\frac{1.645+0.84}{0.0176}\Big)^2 \approx 19936$.
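The next cell checks these three examples directly. It uses exact Normal quantiles, so the results differ slightly from the hand calculations above, which round $Z$ to 1.96, 1.645, and 0.84 (hence 49 and 19936); the small helper function is just a local convenience for this check.
# Direct check of examples 7a, 7b, and 8 with exact Normal quantiles.
import numpy as np
import scipy.stats as st
def n_one_sample(es, alpha, beta, sides=2):
    z_a = st.norm.ppf(1 - alpha / sides)  # Z_{1-alpha/2} for two-sided, Z_{1-alpha} for one-sided
    z_b = st.norm.ppf(1 - beta)           # Z_{1-beta}
    return int(np.ceil(((z_a + z_b) / es) ** 2))
print("example 7a:", n_one_sample(12 / 30, alpha=0.05, beta=0.20, sides=2))            # ~50
print("example 7b:", n_one_sample(12 / 30, alpha=0.05, beta=0.20, sides=1))            # ~39
print("example 8 :", n_one_sample(0.003 / np.sqrt(0.03 * 0.97), 0.05, 0.20, sides=1))  # ~19991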
# Example 7a: one-sample hypothesis test (two-sided), continuous
fixed_horizon_examples["example 7a"] = {
"type": "one sample hypothesis test (continuous, two-sided)",
"mode": "MDE",
"outcome": "continuous",
"one_sample": True,
"two_sample": False,
"sides": 2,
"alpha": 0.05,
"beta": 0.20,
"mde": 12,
"sigma": 30,
"expected_n": 49,
}
# Example 7b: one-sample hypothesis test (one-sided), continuous
fixed_horizon_examples["example 7b"] = {
"type": "one sample hypothesis test (continuous, one-sided)",
"mode": "MDE",
"outcome": "continuous",
"one_sample": True,
"two_sample": False,
"sides": 1,
"alpha": 0.05,
"beta": 0.20,
"mde": 12,
"sigma": 30,
"expected_n": 39,
}
# Example 8: one-sample hypothesis test (one-sided), dichotomous vs benchmark p0
fixed_horizon_examples["example 8"] = {
"type": "one sample hypothesis test (dichotomous, one-sided vs benchmark)",
"mode": "MDE",
"outcome": "dichotomous",
"one_sample": True,
"two_sample": False,
"sides": 1,
"alpha": 0.05,
"beta": 0.20,
"p0": 0.03, # benchmark under H0 (used for sigma)
"mde": 0.003,
"expected_n": 19936, # note: tiny rounding differences are possible
}
Two Independent Samples¶
The goal is to perform a test of hypothesis comparing the means of an outcome variable in two independent populations, where the hypotheses of interest are $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$, and $\mu_1$ and $\mu_2$ are the means in the two comparison populations. The minimum detectable effect in this case is $\delta = \big \vert \mu_1 - \mu_2 \big \vert$. This is the experimental setup commonly known as A/B testing in the industry. As before, we can also perform one-sided tests: $H_0: \mu_2 \ge \mu_1 \;\text{vs}\; H_1: \mu_2 < \mu_1$ for a left-tailed test (testing for a decrease), and $H_0: \mu_2 \le \mu_1 \;\text{vs}\; H_1: \mu_2 > \mu_1$ for a right-tailed test (testing for an increase). One-sided A/B tests are less common because teams usually want to detect any meaningful change, including the possibility that the variant makes things worse, and report results in a direction-agnostic way. They are also avoided because opting for a one-sided test can look like “fishing” for significance, so two-sided tests are the more transparent default. The following examples therefore consider only two-sided tests.
Continuous outcome: The required sample size is $n = 2\Big(\frac{Z_{1 - \alpha / 2} + Z_{1 - \beta}}{ES} \Big)^2$ where $ES = \frac{\big \vert \mu_1 - \mu_2 \big \vert}{\sigma}$ is the effect size, and $\sigma$ is the standard deviation of the outcome of interest which can be estimated by the pooled standard deviation of the samples $S_p = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2-2}}$. For example 9, a software team wants to evaluate whether a new caching strategy reduces page load time (in milliseconds). Incoming sessions are randomly assigned to group A (control, current caching) and group B (treatment, new caching). A difference of 40 ms in mean load time is considered practically significant. Based on prior monitoring data, the standard deviation of page load time is known to be approximately 300 ms. Assume they use a two-sided 5% significance level and 80% power for this test. We can now calculate the effect size as $ES=\frac{40}{300} \approx 0.133$ and the required sample size (per group, with equal group sizes) as $n=2\Big(\frac{1.96+0.84}{0.133}\Big)^2 \approx 887$. The total sample size for both groups is therefore 1774.
Continuous outcome with matched samples: Here the hypotheses of interest are $H_0: \mu_d = 0$ and $H_1: \mu_d \neq 0$, where $\mu_d$ is the mean difference. The sample size and the effect size are $n = \Big(\frac{Z_{1 - \alpha / 2} + Z_{1 - \beta}}{ES} \Big)^2$ and $ES = \frac{\mu_d}{\sigma_d}$, where $\sigma_d$ is the standard deviation of the difference outcomes (e.g., the difference based on measurements over time or the difference between matched pairs). Refer to the matched-samples case in the estimation of the mean above. For example 10, a platform team wants to test whether a new kernel/network tuning reduces mean request latency under a standardized load test. They select $n$ servers from the fleet (or a test cluster). For each server $i$, they run the same load test first with the baseline configuration and then with the new tuning applied. Define the difference $d_i = (\text{latency before new tuning})_i - (\text{latency after new tuning})_i$. A mean reduction of 5 ms is considered practically meaningful, so $\mu_d = 5$ ms. From a small pilot study, the standard deviation of the paired differences is known to be $\sigma_d = 12$ ms. For a two-sided test with 5% significance level and 80% power, the effect size is $ES=\frac{5}{12} \approx 0.417$ and the required sample size (number of paired servers) is $n=\Big(\frac{1.96+0.84}{0.417}\Big)^2 \approx 46$.
Dichotomous outcome: The hypotheses of interest are $H_0: p_1 = p_2$ and $H_1: p_1 \neq p_2$, where $p_1$ and $p_2$ are the proportions in the two comparison populations. The required sample size and effect size are $n = 2\Big(\frac{Z_{1 - \alpha / 2} + Z_{1 - \beta}}{ES} \Big)^2$ and $ES = \frac{\big \vert p_1 - p_2 \big \vert}{\sqrt{p(1 - p)}}$. For example 11, an online retailer wants to test whether a new checkout design (B) changes the purchase conversion rate compared to the current design (A). Incoming visitors are randomly assigned 50/50 to A or B, and the outcome is whether the visitor completes a purchase (yes/no). Assume the baseline conversion rate for (A) to be $p_1 = 0.06$. A 20% relative increase is considered practically significant, hence $p_2 = 1.2 \times 0.06 = 0.072$. We take the pooled proportion $p = \frac{0.06 + 0.072}{2} = 0.066$. For a two-sided test with 5% significance and 80% power, we get the effect size $ES=\frac{|0.072-0.06|}{\sqrt{0.066(1-0.066)}} \approx 0.048$ and the required sample size (per group, equal group sizes) is $n = 2\Big(\frac{1.96+0.84}{0.048}\Big)^2 \approx 6806$. So the total required sample size is $2 \times 6806 = 13612$.
# Example 9: two-sample hypothesis test (A/B), continuous
fixed_horizon_examples["example 9"] = {
"type": "two sample hypothesis test (continuous A/B, two-sided)",
"mode": "MDE",
"outcome": "continuous",
"one_sample": False,
"two_sample": True,
"sides": 2,
"alpha": 0.05,
"beta": 0.20,
"mde": 40,
"sigma": 300, # SD of outcome (per group)
# "expected_n_per_group": 887, # text used ES≈0.133; exact with sigma=300 gives 882
}
# Example 10: matched/paired hypothesis test, continuous (server before/after)
fixed_horizon_examples["example 10"] = {
"type": "matched samples hypothesis test (continuous, two-sided)",
"mode": "MDE",
"outcome": "continuous",
"one_sample": True,
"two_sample": False,
"sides": 2,
"alpha": 0.05,
"beta": 0.20,
"mde": 5, # mu_d
"sigma": 12, # sigma_d
"expected_n": 46,
}
# Example 11: two-sample hypothesis test (A/B), dichotomous
fixed_horizon_examples["example 11"] = {
"type": "two sample hypothesis test (dichotomous A/B, two-sided)",
"mode": "MDE",
"outcome": "dichotomous",
"one_sample": False,
"two_sample": True,
"sides": 2,
"alpha": 0.05,
"beta": 0.20,
"p1": 0.06,
"p2": 0.072,
# mde omitted intentionally; runner will set mde = |p2-p1|
# "expected_n_per_group": 6806, # this depends on rounding; exact inputs will differ slightly
}
Python Functions for Fixed-Horizon Sample Sizes¶
Run the following cell to get the Python functions for calculating fixed-horizon sample sizes.
import numpy as np
import scipy.stats as st
def _z_critical(alpha: float, sides: int) -> float:
"""
Compute the Normal critical value for a total type-I error rate alpha.
Two-sided: z = Z_{1 - alpha/2}
One-sided: z = Z_{1 - alpha}
Args:
alpha: Total significance level (e.g., 0.05).
sides: 2 for two-sided, 1 for one-sided.
Returns:
Critical value z.
Raises:
ValueError: if alpha not in (0,1) or sides not in {1,2}.
"""
if not (0 < alpha < 1):
raise ValueError(f"alpha must be in (0, 1), got {alpha}")
if sides not in (1, 2):
raise ValueError(f"sides must be 1 or 2, got {sides}")
return st.norm.ppf(1 - alpha / 2) if sides == 2 else st.norm.ppf(1 - alpha)
def get_sample_size_ME(
sigma: float,
alpha: float,
me: float,
two_sample: bool = False,
sides: int = 2,
attrition: float = 0.0,
) -> int:
"""
Sample size for estimation with margin of error (ME) using Normal approximation.
One sample (or paired differences treated as one sample):
n = (z * sigma / ME)^2
Two independent samples with equal allocation and common variance:
n_per_group = 2 * (z * sigma / ME)^2
Implemented by using sigma_eff = sqrt(2)*sigma when two_sample=True.
Dichotomous outcomes:
Provide sigma = sqrt(p*(1-p)) (or pooled/benchmark p as appropriate).
Matched/paired outcomes:
Set two_sample=False and provide sigma = sigma_d (SD of paired differences).
Attrition:
If attrition > 0, returned n is inflated by 1/(1-attrition).
Args:
sigma: Standard deviation of outcome (or sqrt(p*(1-p)) for dichotomous).
alpha: Total type-I error rate (e.g., 0.05).
me: Desired margin of error (half-width).
two_sample: If True, returns per-group n for two independent groups.
sides: 2 (two-sided) or 1 (one-sided).
attrition: Fraction unusable (e.g., 0.10). Default 0.0.
Returns:
Required sample size (ceil).
- If two_sample=False: n
- If two_sample=True: n_per_group
"""
if sigma <= 0:
raise ValueError(f"sigma must be > 0, got {sigma}")
if me <= 0:
raise ValueError(f"me must be > 0, got {me}")
if not (0 <= attrition < 1):
raise ValueError(f"attrition must be in [0, 1), got {attrition}")
z = _z_critical(alpha=alpha, sides=sides)
sigma_eff = (np.sqrt(2) * sigma) if two_sample else sigma
n = (z * sigma_eff / me) ** 2
n = np.ceil(n)
if attrition > 0:
n = np.ceil(n / (1 - attrition))
return int(n)
def get_sample_size_MDE(
sigma: float,
alpha: float,
beta: float,
mde: float,
two_sample: bool = False,
sides: int = 2,
attrition: float = 0.0,
) -> int:
"""
Sample size for hypothesis testing with power (MDE) using Normal approximation.
One sample (or paired differences treated as one sample):
n = ((z_alpha + z_beta) * sigma / MDE)^2
Two independent samples with equal allocation and common variance:
n_per_group = 2 * ((z_alpha + z_beta) * sigma / MDE)^2
Implemented by using sigma_eff = sqrt(2)*sigma when two_sample=True.
Dichotomous outcomes:
Provide sigma = sqrt(p*(1-p)) using:
- p0 for one-sample benchmark tests, or
- pooled p for two-sample proportion comparisons.
Matched/paired outcomes:
Set two_sample=False, provide sigma = sigma_d and mde = mu_d.
Attrition:
If attrition > 0, returned n is inflated by 1/(1-attrition).
Args:
sigma: Standard deviation of outcome (or sqrt(p*(1-p)) for dichotomous).
alpha: Total type-I error rate (e.g., 0.05).
beta: Type-II error rate (e.g., 0.20 for 80% power).
mde: Minimum detectable effect (absolute difference).
two_sample: If True, returns per-group n for two independent groups.
sides: 2 (two-sided) or 1 (one-sided).
attrition: Fraction unusable (e.g., 0.10). Default 0.0.
Returns:
Required sample size (ceil).
- If two_sample=False: n
- If two_sample=True: n_per_group
"""
if sigma <= 0:
raise ValueError(f"sigma must be > 0, got {sigma}")
if mde <= 0:
raise ValueError(f"mde must be > 0, got {mde}")
if not (0 < beta < 1):
raise ValueError(f"beta must be in (0, 1), got {beta}")
if not (0 <= attrition < 1):
raise ValueError(f"attrition must be in [0, 1), got {attrition}")
z_a = _z_critical(alpha=alpha, sides=sides)
z_b = st.norm.ppf(1 - beta)
sigma_eff = (np.sqrt(2) * sigma) if two_sample else sigma
n = (((z_a + z_b) * sigma_eff) / mde) ** 2
n = np.ceil(n)
if attrition > 0:
n = np.ceil(n / (1 - attrition))
return int(n)
# ---------------- Runner helpers ----------------
def _resolve_sigma_from_example(ex: dict) -> float:
"""
Resolve 'sigma' for an example dict.
- continuous: expects ex["sigma"]
- dichotomous:
* if ex has "p": sigma = sqrt(p*(1-p))
* elif ex has "p0": sigma = sqrt(p0*(1-p0))
* elif ex has "p1" and "p2": sigma = sqrt(p_pool*(1-p_pool)),
where p_pool defaults to (p1+p2)/2 unless ex["p_pooled"] is provided
* elif ex has "sigma": uses it directly
"""
outcome = ex.get("outcome")
if outcome == "continuous":
return float(ex["sigma"])
if outcome == "dichotomous":
if "sigma" in ex:
return float(ex["sigma"])
if "p" in ex:
p = float(ex["p"])
return float(np.sqrt(p * (1 - p)))
if "p0" in ex:
p0 = float(ex["p0"])
return float(np.sqrt(p0 * (1 - p0)))
if "p1" in ex and "p2" in ex:
p1, p2 = float(ex["p1"]), float(ex["p2"])
p_pool = float(ex.get("p_pooled", (p1 + p2) / 2))
return float(np.sqrt(p_pool * (1 - p_pool)))
raise ValueError(f"Cannot resolve sigma for example: {ex}")
def _resolve_mde_from_example(ex: dict) -> float:
"""
Resolve 'mde' for an example dict.
- If ex has "mde": use it.
- If dichotomous and has p1,p2: mde = |p2 - p1|
"""
if "mde" in ex:
return float(ex["mde"])
if ex.get("outcome") == "dichotomous" and "p1" in ex and "p2" in ex:
return float(abs(float(ex["p2"]) - float(ex["p1"])))
raise ValueError(f"Cannot resolve mde for example: {ex}")
def _format_expected_fields(ex: dict) -> str:
"""
Return a suffix like:
" | expected_n=50, expected_total_adj=616"
including any keys that start with 'expected'.
"""
expected_items = [(k, ex[k]) for k in ex.keys() if k.startswith("expected")]
if not expected_items:
return ""
expected_items.sort(key=lambda kv: kv[0]) # stable order
parts = [f"{k}={v}" for k, v in expected_items]
return " | " + ", ".join(parts)
def run_fixed_horizon_examples(examples: dict) -> None:
"""
Iterate over the examples dict and print sample sizes.
Conventions:
- mode == "ME": prints n (or n_per_group for two_sample). If dropout present, prints raw + adjusted.
- mode == "MDE": prints n (or n_per_group for two_sample).
- accepts either "dropout" or "attrition" in the example dict.
"""
for name, ex in examples.items():
mode = ex["mode"]
two_sample = bool(ex.get("two_sample", False))
sides = int(ex.get("sides", 2))
alpha = float(ex["alpha"])
attrition = float(ex.get("attrition", ex.get("dropout", 0.0)))
suffix = _format_expected_fields(ex)
sigma = _resolve_sigma_from_example(ex)
if mode == "ME":
me = float(ex["me"])
# raw
n_raw = get_sample_size_ME(
sigma=sigma,
alpha=alpha,
me=me,
two_sample=two_sample,
sides=sides,
attrition=0.0,
)
if attrition > 0:
n_adj = get_sample_size_ME(
sigma=sigma,
alpha=alpha,
me=me,
two_sample=two_sample,
sides=sides,
attrition=attrition,
)
if two_sample:
print(f"{name}: n_raw_per_group={n_raw}, n_adj_per_group={n_adj}, total_adj={2*n_adj}{suffix}")
else:
print(f"{name}: n_raw={n_raw}, n_adj={n_adj}{suffix}")
else:
if two_sample:
print(f"{name}: n_per_group={n_raw}, total={2*n_raw}{suffix}")
else:
print(f"{name}: n={n_raw}{suffix}")
elif mode == "MDE":
beta = float(ex["beta"])
mde = _resolve_mde_from_example(ex)
n_raw = get_sample_size_MDE(
sigma=sigma,
alpha=alpha,
beta=beta,
mde=mde,
two_sample=two_sample,
sides=sides,
attrition=0.0,
)
if attrition > 0:
n_adj = get_sample_size_MDE(
sigma=sigma,
alpha=alpha,
beta=beta,
mde=mde,
two_sample=two_sample,
sides=sides,
attrition=attrition,
)
if two_sample:
print(f"{name}: n_raw_per_group={n_raw}, n_adj_per_group={n_adj}, total_adj={2*n_adj}{suffix}")
else:
print(f"{name}: n_raw={n_raw}, n_adj={n_adj}{suffix}")
else:
if two_sample:
print(f"{name}: n_per_group={n_raw}, total={2*n_raw}{suffix}")
else:
print(f"{name}: n={n_raw}{suffix}")
# Run:
# run_fixed_horizon_examples(fixed_horizon_examples)
Run the following cell to see the results. Small differences from the hand-calculated values are expected due to rounding in manual calculations.
run_fixed_horizon_examples(fixed_horizon_examples)
example 1: n=50 | expected_n=50
example 2: n=601 | expected_n=601
example 3: n_raw_per_group=277, n_adj_per_group=308, total_adj=616 | expected_n_adj_per_group=308, expected_n_raw_per_group=277, expected_total_adj=616
example 4: n_raw_per_group=138, n_adj_per_group=163, total_adj=326 | expected_n_adj_per_group=163, expected_n_raw_per_group=138, expected_total_adj=326
example 5: n=78 | expected_n=78
example 6: n_per_group=629, total=1258 | expected_n_per_group=629, expected_total=1258
example 7a: n=50 | expected_n=49
example 7b: n=39 | expected_n=39
example 8: n=19991 | expected_n=19936
example 9: n_per_group=883, total=1766
example 10: n=46 | expected_n=46
example 11: n_per_group=6720, total=13440
Derivations of Sample Size Equations for Classical Fixed-Horizon Testing¶
One Sample¶
As stated above, consider $H_0: \mu = \mu_0$ and $H_1: \mu = \mu_1$. Consider the sampling distribution $\tilde {X}^{(n)} \sim \mathcal N(\mu, \frac{\sigma^2}{n})$, which we can write as $\tilde {X}^{(n)} = \mu + Z \frac{\sigma}{\sqrt n}$. Let's replace the superscript $(n)$, which we don't need here, by $(\mu)$ to indicate that the samples of size $n$ are taken from the distribution with mean $\mu$, that is, $\tilde {X}^{(\mu)} = \mu + Z \frac{\sigma}{\sqrt n}$. We can reorder and write this as $\tilde {X}^{(\mu)} - \mu = Z \frac{\sigma}{\sqrt n}$. When testing the hypothesis for the two values of the mean, the standard deviation and the sample size are the same because we are hypothesizing about the mean of a single population using a single sample. Assume $\mu_1 > \mu_0$ and let the minimum detectable effect be $\delta = \mu_1 - \mu_0$. From Figure 1, which shows the distributions of $\tilde {X}^{(\mu_0)}$ and $\tilde {X}^{(\mu_1)}$, we can see that the minimum detectable effect can be calculated from the location of point C, the rejection boundary: under the null curve, C sits at the critical value, $Z(C) = Z_{1 - \alpha/2}$ (the upper-tail area beyond C is $\alpha/2$ for a two-sided test), and under the alternative curve the lower-tail area up to C is the type II error, $P(\tilde {X}^{(\mu_1)} \le C) = \beta$, so $Z(C) = Z_{\beta}$. These tail regions are shaded in the figure.
$\delta = \overline{\mu_0C} + \overline{C\mu_1} = \big(\tilde {X}^{(\mu_0)}(C) - \mu_0 \big) + \big(\mu_1 - \tilde {X}^{(\mu_1)}(C) \big) = \big(\tilde {X}^{(\mu_0)}(C) - \mu_0 \big) - \big(\tilde {X}^{(\mu_1)}(C) -\mu_1 \big)$ which we can express using the corresponding Z values
$= Z_{1 - \alpha/2}\frac{\sigma}{\sqrt n} - Z_{\beta}\frac{\sigma}{\sqrt n} = \big( Z_{1 - \alpha/2} + Z_{1 - \beta}\big)\frac{\sigma}{\sqrt n}$
Note that $Z \sim \mathcal N(0, 1)$, and for the small values of $\beta$ we typically use (such as 0.2), $Z_{\beta} < 0$ with $P(Z \le Z_{\beta}) = \beta$, while $Z_{1 - \beta} > 0$ and $Z_{1 - \beta} = - Z_{\beta}$ by symmetry.
which corresponds to the formula for the sample size for the one-sample comparison: solving $\delta = \big( Z_{1 - \alpha/2} + Z_{1 - \beta}\big)\frac{\sigma}{\sqrt n}$ for $n$ gives $n = \Big(\frac{Z_{1 - \alpha / 2} + Z_{1 - \beta}}{ES} \Big)^2$ with $ES = \frac{\delta}{\sigma}$.
# Figure 1
import numpy as np
from scipy.stats import norm
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = "jupyterlab"
X = np.arange(-60, 140.1, 0.1)
mu0, mu1 = 20, 70
sigma = 15 # note this is actually sigma / sqrt(n), we don't care about n here for illustration, so we just use this value.
x_c = 47
Y0 = norm.pdf(X, mu0, sigma)
Y1 = norm.pdf(X, mu1, sigma)
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Figure 1: Sampling distribution of X_bar for some imaginary sample size under the null and alternative hypothesis",
# 'yaxis_range': [-0.5, 1.5],
"height": 500}, )
fig.add_scatter(x=X, y=Y0, name="N(mu_0, sigma^2/n)", mode="lines", line={"color": "blue", "width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=Y1, name="N(mu_1, sigma^2/n)", mode="lines", line={"color": "red", "width": 0.75}, row=1, col=1)
fig.add_scatter(x=[x_c,], y=[0.0,], name="C", mode="markers", line={"color": "black", "width": 5}, row=1, col=1)
fig.add_vline(x=x_c, name="C", line_width=1, line_dash="dot", line_color="black", row=1, col=1)
fig.add_vline(x=mu0, name="mu0", line_width=1, line_dash="dot", line_color="blue", row=1, col=1)
fig.add_vline(x=mu1, name="mu1", line_width=1, line_dash="dot", line_color="red", row=1, col=1)
fig.add_annotation(
x=x_c, y=0, # position
text="C", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
fig.add_annotation(
x=mu0, y=0, # position
text="mu0", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
fig.add_annotation(
x=mu1, y=0, # position
text="mu1", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
X_fill0 = np.arange(x_c, 140.1, 0.1)
Y_fill0 = norm.pdf(X_fill0, mu0, sigma)
X_fill1 = np.arange(-60, x_c+0.1, 0.1)
Y_fill1 = norm.pdf(X_fill1, mu1, sigma)
fig.add_trace(go.Scatter(x=X_fill0, y=Y_fill0, fill="tozeroy", name="alpha"))
fig.add_trace(go.Scatter(x=X_fill1, y=Y_fill1, fill="tozeroy", name="beta"))
fig.show()
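As a numerical sanity check of this derivation, the next cell plugs the example 7a inputs into the formula and then simulates the sampling distribution under $H_1$ to confirm that the resulting test has roughly the planned power; the seed and simulation count are arbitrary choices for illustration.
# Check of the derivation: with n from the formula, simulated power should be close to 1 - beta.
import numpy as np
import scipy.stats as st
rng = np.random.default_rng(0)
alpha, beta = 0.05, 0.20
mu0, mu1, sigma = 120.0, 108.0, 30.0  # example 7a values
delta = abs(mu1 - mu0)
z_a, z_b = st.norm.ppf(1 - alpha / 2), st.norm.ppf(1 - beta)
n = int(np.ceil(((z_a + z_b) * sigma / delta) ** 2))
n_sim = 20_000
x_bar = rng.normal(mu1, sigma / np.sqrt(n), size=n_sim)  # sampling distribution of the mean under H1
z_stat = (x_bar - mu0) / (sigma / np.sqrt(n))
power = np.mean(np.abs(z_stat) > z_a)                    # two-sided rejection rule
print(f"n = {n}, simulated power = {power:.3f}")         # roughly 0.8 (slightly above, since n is rounded up)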
Two Independent Samples¶
For this case $H_0: \mu_1 = \mu_2$ and $H_1: \mu_1 \neq \mu_2$, and we use the sampling distribution $\tilde {X} _{2}^{(n_2)} - \tilde {X} _{1}^{(n_1)} \sim \mathcal N (\mu_2 - \mu_1, \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2})$. Again let's drop the sample size from the superscript and write the equivalent form $\tilde {X} _{2}^{(\mu_2)} - \tilde {X} _{1}^{(\mu_1)} = \mu_2 - \mu_1 + Z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$. Note that here we are concerned with the distribution of the difference of the means, so let's introduce $\tilde {D}^{(d)} = \tilde {X} _{2}^{(\mu_2)} - \tilde {X} _{1}^{(\mu_1)}$ and $d = \mu_2 - \mu_1$. By assuming equal sample size $n$ and a common standard deviation $\sigma$, we can write the previous equation in the simplified form $\tilde {D}^{(d)} = d + Z\sigma\sqrt{\frac{2}{n}}$ with the equivalent hypotheses $H_0: d = 0$ and $H_1: d = \delta$, where $\delta = \mu_2 - \mu_1, \mu_2 > \mu_1$ is the minimum detectable effect. We can now proceed in a similar manner as the one-sample case, except that the distribution of interest here is $\tilde {D}^{(d)}$ instead of $\tilde {X}^{(\mu)}$. Figure 2 represents the distribution of $\tilde {D}^{(d)}$, with $d_0 = 0$ and $d_1 = \delta$. Again, the minimum detectable effect can be calculated based on the location of the point C as
$\delta = \overline{d_0C} + \overline{Cd_1} = \big(\tilde D^{(d_0)}(C) - d_0 \big) + \big(d_1 - \tilde D^{(d_1)}(C) \big) = \big(\tilde D^{(d_0)}(C) - d_0 \big) - \big(\tilde D^{(d_1)}(C) - d_1 \big)$
$= Z_{1 - \alpha/2}\sigma\sqrt{\frac{2}{n}} - Z_{\beta}\sigma\sqrt{\frac{2}{n}} = \big( Z_{1 - \alpha/2} + Z_{1 - \beta}\big)\sigma\sqrt{\frac{2}{n}}$
which corresponds to the formula for the sample size for the two-sample comparison: solving $\delta = \big( Z_{1 - \alpha/2} + Z_{1 - \beta}\big)\sigma\sqrt{\frac{2}{n}}$ for $n$ gives $n = 2\Big(\frac{Z_{1 - \alpha / 2} + Z_{1 - \beta}}{ES} \Big)^2$ per group, with $ES = \frac{\delta}{\sigma}$.
# Figure 2
X = np.arange(-80, 120.1, 0.1)
d0, d1 = 0, 50
sigma = 15
x_c = 27
Y0 = norm.pdf(X, d0, sigma)
Y1 = norm.pdf(X, d1, sigma)
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Figure 2: Sampling distribution of D_bar for some imaginary sample size under the null and alternative hypothesis",
# 'yaxis_range': [-0.5, 1.5],
"height": 500}, )
fig.add_scatter(x=X, y=Y0, name="N(d_0=0, 2*sigma^2/n)", mode="lines", line={"color": "blue", "width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=Y1, name="N(d1=delta, 2*sigma^2/n)", mode="lines", line={"color": "red", "width": 0.75}, row=1, col=1)
fig.add_scatter(x=[x_c,], y=[0.0,], name="C", mode="markers", line={"color": "black", "width": 5}, row=1, col=1)
fig.add_vline(x=x_c, name="C", line_width=1, line_dash="dot", line_color="black", row=1, col=1)
fig.add_vline(x=d0, name="d0", line_width=1, line_dash="dot", line_color="blue", row=1, col=1)
fig.add_vline(x=d1, name="d1", line_width=1, line_dash="dot", line_color="red", row=1, col=1)
fig.add_annotation(
x=x_c, y=0, # position
text="C", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
fig.add_annotation(
x=d0, y=0, # position
text="d0=0", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
fig.add_annotation(
x=d1, y=0, # position
text="d1=delta", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
X_fill0 = np.arange(x_c, 120.1, 0.1)
Y_fill0 = norm.pdf(X_fill0, d0, sigma)
X_fill1 = np.arange(-80, x_c+0.1, 0.1)
Y_fill1 = norm.pdf(X_fill1, d1, sigma)
fig.add_trace(go.Scatter(x=X_fill0, y=Y_fill0, fill="tozeroy", name="alpha"))
fig.add_trace(go.Scatter(x=X_fill1, y=Y_fill1, fill="tozeroy", name="beta"))
fig.show()
Sequential Testing¶
- A standard randomized fixed horizon test may take too long
- We cannot just naïvely monitor p-values as we observe data and stop when $p_n < \alpha$. This modifies the test, and the type I and type II errors will not be $\alpha$ and $\beta$ anymore [Johari et al 2019].
Methods for Sequential Testing¶
There are two main methods for sequential testing [Johari et al. 2019].
- Sequential probability ratio test (SPRT), which is based on the original SPRT method of [Wald, 1947].
- Multi-armed bandits [Scott, 2014].
Sequential Probability Ratio Test (SPRT)¶
A sequential test is performed as follows. At each step of the sequential test, we have three choices: accept $H_0$, reject $H_0$ (hereafter we call it accept $H_1$ for simplicity), or continue. We denote the expected value of $n$ if $H_0$ is true by $E_0(n)$, and the expected value of $n$ if $H_1$ is true by $E_1(n)$.
As in fixed procedures, type I and type II errors may be committed in sequential tests. We say two sequential tests $S$ and $S'$ have equal strength if $\alpha$ and $\beta$ are equal. We say test $S$ is stronger than $S'$ if $\alpha \lt \alpha'$ and $\beta \le \beta'$, or if $\alpha \le \alpha'$ and $\beta \lt \beta'$.
If $S$ and $S'$ have equal strength, we shall say $S'$ is better than $S$ if either $E_0(n \mid S') \lt E_0(n \mid S)$ and $E_1(n \mid S') \le E_1(n \mid S)$, or $E_0(n \mid S') \le E_0(n \mid S)$ and $E_1(n \mid S') \lt E_1(n \mid S)$.
The sequential test $S^{\ast}$ is called the best if for any sequential tests $S$ of equal strength, $E_0(n \mid S^{\ast}) \le E_0(n \mid S)$ and $E_1(n \mid S^{\ast}) \le E_1(n \mid S)$.
It has not been proved that a sequential test exists in which both $E_0(n)$ and $E_1(n)$ are minimized, but for most practical purposes, the SPRT outlined in the following sections can be considered the best.
We therefore need a definition of an optimum test. For most practical purposes, the optimum test can be defined as the test $S^{\ast}$ for which $\max[E_0(n \mid S^{\ast}), E_1(n \mid S^{\ast})] \le \max[E_0(n \mid S), E_1(n \mid S)]$ for any $S$ of equal strength.
The efficiency of a sequential test $S$ is then defined by the ratio $\frac{\max[E_0(n \mid S^{\ast}), E_1(n \mid S^{\ast})]}{\max[E_0(n \mid S), E_1(n \mid S)]}$.
Neyman-Pearson Lemma¶
It is the foundation of most, if not all, sequential probability ratio tests.
Definition:
- Let $X$ be a random variable with probability distribution $f(x)$
- Let the hypothesis $H_0$ to be tested be the statement that the distribution of $X$ is $f_0(x)$
- Suppose $H_0$ is to be tested against $H_1$ which states that the distribution of $X$ is $f_1(x)$
According to the Neyman-Pearson theory for testing hypotheses, the most powerful critical (rejection) region $W_N$ for testing $H_0$ against $H_1$ on the basis of $N$ independent observations $x_1, x_2, ..., x_N$ on $X$ is given by the inequality
$\frac{f_1(x_1)f_1(x_2)...f_1(x_N)}{f_0(x_1)f_0(x_2)...f_0(x_N)} \geq k$
The quantity $k$ is a constant and is chosen so that the size of the critical region, i.e., the probability of a type I error, has the required value $\alpha$.
The proof is outside the scope of this notebook, but notice that the left side is just the ratio of the likelihoods of the observations. This intuitively makes sense because, to be able to reject $H_0$, the observations should be more likely under the alternative hypothesis. Recall that the likelihood function $L(\theta)$ is the joint probability density (or mass) function of the sample $x_1, x_2, ..., x_N$, that is
$L(\theta; x_1, x_2, ..., x_N) = L(\theta \mid x_1, x_2, ..., x_N) = f(x_1 \mid \theta)f(x_2 \mid \theta)...f(x_N \mid \theta)$. Note that $L(\theta)$ is written as $L(\theta ; x_1, x_2, ..., x_N)$ instead of $L(x_1, x_2, ..., x_N \mid \theta)$ to emphasize that it is a function of $\theta$, the parameter(s) of the distribution.
SPRT Decision Rules¶
How do we get the decision rule for accepting $H_0$, accepting $H_1$ or to continue drawing new samples?
Suppose that before the sample is drawn, there exists a priori a probability that $H_0$ is true. Denote this prior probability by $g_0$. Then the prior probability that $H_1$ is true is given by $g_1 = 1 - g_0$, since it is assumed that $H_0$ and $H_1$ exhaust all possibilities. After $m$ observations $x_1, x_2, ..., x_m$ are made, the posterior probability that $H_i \ (i = 0, 1)$ is true changes. Let $g_{0m}$ be the posterior probability that $H_0$ is true and $g_{1m}$ be the posterior probability that $H_1$ is true. Now from Bayes' theorem:
$g_{0m} = \frac{g_0p_{0m}(x_1, x_2, ..., x_m)}{g_0p_{0m}(x_1, x_2, ..., x_m) + g_1p_{1m}(x_1, x_2, ..., x_m)}$ (1)
$g_{1m} = \frac{g_1p_{1m}(x_1, x_2, ..., x_m)}{g_0p_{0m}(x_1, x_2, ..., x_m) + g_1p_{1m}(x_1, x_2, ..., x_m)}$ (2)
where $p_{im}(x_1, x_2, ..., x_m) (i = 0, 1)$ is the probability density of the $m$ dimensional sample space calculated under $H_i (i = 0, 1)$, which we will abbreviate by $p_{im}$.
Let $\frac{1}{2} \lt d_0, d_1 \lt 1$ be two positive numbers. We want to construct the sequential test such that the conditional probability of correctly accepting $H_i \ (i = 0, 1)$ when $H_i$ is true is greater than or equal to $d_i \ (i = 0, 1)$. Then the following sequential process seems reasonable:
At each step, if $g_{1m} \geq d_1$ accept $H_1$. If $g_{0m} \geq d_0$ accept $H_0$. If $g_{1m} \lt d_1$ and $g_{0m} \lt d_0$, draw an additional observation. We cannot have both $g_{1m} \gt d_1$ and $g_{0m} \gt d_0$ simultaneously because $g_{0m} + g_{1m} = 1$.
(1) and (2) together with inequalities $g_{1m} \geq d_1$ and $g_{0m} \geq d_0$ can be re-arranged as:
$\frac{p_{1m}}{p_{0m}} \geq \frac{g_0}{g_1}\frac{d_1}{1-d_1}$ (3)
$\frac{p_{1m}}{p_{0m}} \le \frac{g_0}{g_1}\frac{1 - d_0}{d_0}$ (4)
If the prior probabilities do not exist or are unknown, these inequalities suggest the following sequential test. At each step, calculate $\frac{p_{1m}}{p_{0m}}$:
Accept $H_1$ if $\frac{p_{1m}}{p_{0m}} \geq A$ (5)
Accept $H_0$ if $\frac{p_{1m}}{p_{0m}} \le B$ (6)
Continue to observe if $B \lt \frac{p_{1m}}{p_{0m}} \lt A$ (7)
The constants $A$ and $B$ are chosen so that $0 \lt B \lt A$ and the sequential test has the desired values of $\alpha$ and $\beta$ for type I and type II errors respectively.
A fundamental relationship exists between $A$, $B$, $\alpha$ and $\beta$. The proof is outside the scope of this notebook; we simply state the result here.
$A(\alpha, \beta) \le \frac{1 - \beta}{\alpha}$ (8)
$B(\alpha, \beta) \geq \frac{\beta}{1 - \alpha}$ (9)
SPRT Process¶
When $\alpha$ and $\beta$ are the reasonably small desired type I and type II error rates, for most practical purposes choose the constants $A$ and $B$ as:
$A = \frac{1 - \beta}{\alpha}$
$B = \frac{\beta}{1 - \alpha}$
And then follow this procedure:
Accept $H_1$ if $\frac{p_{1m}}{p_{0m}} \geq \frac{1 - \beta}{\alpha}$ (10)
Accept $H_0$ if $\frac{p_{1m}}{p_{0m}} \le \frac{\beta}{1 - \alpha}$ (11)
Continue to observe if $\frac{\beta}{1 - \alpha} \lt \frac{p_{1m}}{p_{0m}} \lt \frac{1 - \beta}{\alpha}$ (12)
where $p_{im} (i = 0, 1)$ is the probability density of the $m$ observations under $H_i (i = 0, 1)$.
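Before specializing to the binomial case below, the next cell is a small generic sketch of this decision rule in terms of the cumulative log likelihood ratio, with thresholds $\log B$ and $\log A$ taken from equations (10)-(12); the Bernoulli usage at the end uses made-up data with a hypothetical true rate of 0.65.
# Generic SPRT sketch: accumulate the log likelihood ratio and compare to log(B) and log(A).
import numpy as np
def sprt_decision(log_lr_increments, alpha, beta):
    """Walk through per-observation log likelihood ratio terms and apply rules (10)-(12)."""
    log_A = np.log((1 - beta) / alpha)  # accept H1 at or above this cumulative log LR
    log_B = np.log(beta / (1 - alpha))  # accept H0 at or below this cumulative log LR
    log_lr = 0.0
    for i, increment in enumerate(log_lr_increments, start=1):
        log_lr += increment
        if log_lr >= log_A:
            return "accept H1", i
        if log_lr <= log_B:
            return "accept H0", i
    return "continue", len(log_lr_increments)
# Illustrative usage with Bernoulli observations: H0: p = 0.4 vs H1: p = 0.6, true p = 0.65.
rng = np.random.default_rng(1)
p0, p1 = 0.4, 0.6
xs = rng.binomial(1, 0.65, size=200)
increments = np.where(xs == 1, np.log(p1 / p0), np.log((1 - p1) / (1 - p0)))
print(sprt_decision(increments, alpha=0.05, beta=0.2))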
One-Sided Test of a Simple Hypothesis - Case of a Binomial Distribution¶
Example from manufacturing [Wald, 1947].
Let $p$ denote the probability of having a defective unit in a lot. If $m$ units are drawn at random, the probability of observing $d$ defective units is $\frac{m!}{d!(m-d)!}p^d(1-p)^{m-d}$
Risk tolerance for making the wrong decision: It is possible to specify a value of $p'$ so that if $p \le p'$, we prefer to accept the lot, and if $p \gt p'$, we prefer to reject the lot. Also, when $p \le p'$, our preference for acceptance is stronger the smaller the $p$ is, and when $p \gt p'$ our preference for rejection is stronger the larger the $p$ is. Thus, it is possible to select a value $p_0 \lt p'$ and a value $p_1 \gt p'$ such that the error is considered serious if we reject the lot when $p \le p_0$ (type I) or if we accept the lot when $p \ge p_1$ (type II). Let the probabilities associated with these risks be $\alpha$ and $\beta$ respectively. Thus the tolerated risk is characterized by four quantities $p_0, p_1, \alpha$ and $\beta$.
We can translate the above requirement into the following hypothesis testing over quantities $p_0, p_1, \alpha$ and $\beta $. Let $H_0: p = p_0$ and $H_1: p = p_1$. Consider the sequential probability ratio test T for testing $H_0$ against $H_1$ in which $\alpha$ is the probability of accepting $H_1$ when $H_0$ is true and $\beta$ is the probability of accepting $H_0$ when $H_1$ is true. This test will satisfy our risk requirements because for this test the probability of accepting the lot (accepting $H_0$) is $\le \beta$ whenever $p \ge p_1$ and the probability of rejecting the lot (accepting $H_1$) is $\le \alpha$ whenever $p \le p_0$
The sequential probability ratio can be written as:
$\frac{p_{1m}}{p_{0m}} = \frac{p_1^{d_m}(1-p_1)^{m-d_m}}{p_0^{d_m}(1-p_0)^{m-d_m}}$ (13)
where the subscript in $d_m$ is introduced to emphasize that this is the number of defective units in the $m$ units that have been inspected. If we plug equation (13) into inequalities 10 to 12, take the logarithms, and solve for $d_m$ we get the following inequalities:
$d_m \ge \frac{\log \frac{1 - \beta}{\alpha}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} + m \frac{\log \frac{1 - p_0}{1 - p_1}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}}$ (14)
$d_m \le \frac{\log \frac{\beta}{1 - \alpha}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} + m \frac{\log \frac{1 - p_0}{1 - p_1}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}}$ (15)
$\frac{\log \frac{\beta}{1 - \alpha}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} + m \frac{\log \frac{1 - p_0}{1 - p_1}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} \lt d_m \lt \frac{\log \frac{1 - \beta}{\alpha}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} + m \frac{\log \frac{1 - p_0}{1 - p_1}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}}$ (16)
Equations 14 to 16 represent three decision regions that correspond to the three decision rules of the SPRT process defined by Eq. 10-12. These regions can be graphically represented by two lines with equal slopes that define the boundaries for accepting or rejecting the null hypothesis after $m$ observations:
$A_m = \frac{\log \frac{\beta}{1 - \alpha}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} + m \frac{\log \frac{1 - p_0}{1 - p_1}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}}$ (17), i.e., the acceptance line $L_0$, and
$R_m = \frac{\log \frac{1 - \beta}{\alpha}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}} + m \frac{\log \frac{1 - p_0}{1 - p_1}}{\log \frac{p_1}{p_0} - \log \frac{1 - p_1}{1 - p_0}}$ (18), i.e., the rejection line $L_1$.
The terms acceptance line and rejection line are with respect to the null hypothesis $H_0$.
We can now graphically carry out the sequential test as follows. The number $m$ is measured along the abscissa axis, and $d$ is measured along the ordinate axis. We draw lines $L_0$ and $L_1$ and at each observation $m$, we plot the point $(m, d_m)$. The test continues as long as the point $(m, d_m)$ lies in the region between $L_0$ and $L_1$ and is terminated as soon as the point crosses either of these two lines. $H_0$ is rejected (the lot is rejected) if the point lies on $L_1$ or above, and it is accepted if the point lies on $L_0$ or below.
Simulated Example¶
class Lot():
def __init__(self, p, seed=42):
self.rng = np.random.default_rng(seed)
self.p = p
def observe_unit(self):
return self.rng.binomial(1, self.p)
class SPRT():
def __init__(self, p0, p1, alpha, beta):
assert p0 < p1, "p0 and p1 cannot be equal (div 0) and p1 must be stricktly bigger than p0"
assert 0 < p0 < 1 and 0 < p1 < 1, "p0 and p1 cannot be 0 or 1"
a = np.log(beta / (1 - alpha))
b = np.log((1 - beta) / alpha)
c = np.log(p1 / p0)
d = np.log((1 - p1) / (1 - p0))
e = np.log((1 - p0) / (1 - p1))
denom = c - d
self.bias0 = a / denom
self.bias1 = b / denom
self.slope = e / denom
assert self.slope > 0, "slope cannot be negative"
assert self.bias0 <= 0, "L0's bias must be negaive"
assert self.bias1 >= 0, "L1's bias must be positive"
def get_l0(self, m):
return self.bias0 + self.slope*m
def get_l1(self, m):
return self.bias1 + self.slope*m
p0, p1 = 0.4, 0.6
alpha, beta = 0.05, 0.2
sprt = SPRT(p0, p1, alpha, beta)
# consider three different lots
p_lot1, p_lot2, p_lot3 = 0.3, 0.5, 0.7
lot1, lot2, lot3 = Lot(p_lot1), Lot(p_lot2), Lot(p_lot3)
n_observe = 50
L0 = sprt.get_l0(0), sprt.get_l0(n_observe)
L1 = sprt.get_l1(0), sprt.get_l1(n_observe)
sample1 = np.cumsum([lot1.observe_unit() for m in range(n_observe)])
sample2 = np.cumsum([lot2.observe_unit() for m in range(n_observe)])
sample3 = np.cumsum([lot3.observe_unit() for m in range(n_observe)])
X = [m for m in range(1, n_observe+1)]
fig = make_subplots(rows=3, cols=1,)
fig.update_layout({"title": f"Simulated SPRT for hypothesis H0: p = {p0}, H1: p ={p1}, for three different lots with p = {p_lot1}, {p_lot2}, {p_lot3}",
# 'yaxis_range': [-0.5, 1.5],
"height": 1500}, )
fig.add_scatter(x=(0, n_observe), y=L0, name="Accept H0", mode="lines", line={"color": "blue", "width": 0.75, "dash": "dot"}, row=1, col=1)
fig.add_scatter(x=(0, n_observe), y=L1, name="Accept H1", mode="lines", line={"color": "red", "width": 0.75, "dash": "dot"}, row=1, col=1)
fig.add_scatter(x=X, y=sample1, name="Sample from Lot1", mode="lines+markers", row=1, col=1)
fig.add_scatter(x=(0, n_observe), y=L0, showlegend = False, mode="lines", line={"color": "blue", "width": 0.75, "dash": "dot"}, row=2, col=1)
fig.add_scatter(x=(0, n_observe), y=L1, showlegend = False, mode="lines", line={"color": "red", "width": 0.75, "dash": "dot"}, row=2, col=1)
fig.add_scatter(x=X, y=sample2, name="Sample from Lot2", mode="lines+markers", row=2, col=1)
fig.add_scatter(x=(0, n_observe), y=L0, showlegend = False, mode="lines", line={"color": "blue", "width": 0.75, "dash": "dot"}, row=3, col=1)
fig.add_scatter(x=(0, n_observe), y=L1, showlegend = False, mode="lines", line={"color": "red", "width": 0.75, "dash": "dot"}, row=3, col=1)
fig.add_scatter(x=X, y=sample3, name="Sample from Lot3", mode="lines+markers", row=3, col=1)
Comparison with fixed horizon sample size calculation
p0, p1 = 0.4, 0.6
alpha, beta = 0.05, 0.2
# Here we assume p0 is the baseline
sigma = np.sqrt(p0 * (1 - p0))
mde = abs(p1 - p0)
n_per_group = get_sample_size_MDE(
sigma=sigma,
alpha=alpha,
beta=beta,
mde=mde,
two_sample=True,
sides=2,
attrition=0.0
)
print("n_per_group =", n_per_group, "total =", 2 * n_per_group)
n_per_group = 95 total = 190
Compare the above number with the graphs. Although any single run may produce a false positive or a false negative (crossing a boundary when it should not) because of the chosen $\alpha$ and $\beta$ levels, the sequential test reaches a decision much sooner here (about 30 observations) than the 95 observations per group required by the fixed horizon design. Note that false positives and false negatives also occur in fixed horizon experiments. Either claim is easy to check by simulation; the sketch below does so for the sequential test.
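As a rough check of that last claim, the sketch below (my own illustration; it assumes the Lot and SPRT classes defined above are in scope, and the helper name run_sprt_once, the cap max_obs, and the number of runs are arbitrary choices) repeatedly runs the sequential test on lots whose true p equals p0 or p1 and tallies how often the wrong boundary is crossed. The empirical rates should be roughly bounded by the chosen alpha and beta.
def run_sprt_once(lot, sprt, max_obs=500):
    # Run one sequential test: accumulate successes d and stop at the first boundary crossing.
    d = 0
    for m in range(1, max_obs + 1):
        d += lot.observe_unit()
        if d >= sprt.get_l1(m):
            return "H1"          # rejection line crossed: reject H0
        if d <= sprt.get_l0(m):
            return "H0"          # acceptance line crossed: accept H0
    return "inconclusive"        # no decision within max_obs observations

n_runs = 2000
sprt_check = SPRT(p0=0.4, p1=0.6, alpha=0.05, beta=0.2)
# Empirical false positive rate: lots truly at p0 for which H1 ends up accepted
fp = sum(run_sprt_once(Lot(0.4, seed=s), sprt_check) == "H1" for s in range(n_runs)) / n_runs
# Empirical false negative rate: lots truly at p1 for which H0 ends up accepted
fn = sum(run_sprt_once(Lot(0.6, seed=s), sprt_check) == "H0" for s in range(n_runs)) / n_runs
print(f"empirical alpha ~ {fp:.3f} (target 0.05), empirical beta ~ {fn:.3f} (target 0.2)")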
Mixture Sequential Probability Ratio Test (mSPRT)¶
In SPRT we need to specify both the null hypothesis and the alternative hypothesis. Specifying an alternative hypothesis is not required in fixed horizon tests when only the Type I error probability constraint $\alpha$ is imposed, as in the margin-of-error approach to estimating population means. One method to overcome the need to specify the alternative hypothesis is the mixture sequential probability ratio test (mSPRT).
The mSPRT was first introduced by [Robbins 1970] as an extension of Wald's SPRT. Its first application to A/B testing was by [Johari et al 2015], and it can be outlined as follows, as noted in [Stenberg, 2019]. Let $\theta$ represent the parameter that measures the difference between the two groups in the experiment. Further let $\lbrace X_i\rbrace_{i=1}^n \sim f_\theta(x_n)$ be a sequence of iid random variables, $f_{\theta_0}(x_n)$ and $f_{\theta_1}(x_n)$ be two probability distributions, e.g. representing the null and alternative hypotheses, and $\Lambda_n = \frac {f_{\theta_1}(x_n)}{f_{\theta_0}(x_n)}$ be their likelihood ratio, which is a martingale under the null hypothesis $f_{\theta_0}(x_n)$. In mSPRT, we may not know the exact value of the parameter $\theta_1$. Instead, we use a prior distribution $\pi(\theta) > 0$, called a mixture, to represent the true difference between the two groups. This is sometimes easier than trying to specify a minimum detectable effect as required by SPRT or fixed horizon tests.
The likelihood ratio in mSPRT is defined as [Stenberg 2019]
$\displaystyle \tilde \Lambda_n = \int_{\theta \in \Theta} \Lambda_n \pi(\theta)d\theta = \int_{\theta \in \Theta} \Big(\prod_{i=1}^n \frac {f_{\theta}(x_i)}{f_{\theta_0}(x_i)}\Big)\pi(\theta)d\theta$ (19)
which simply states that in mSPRT, we integrate the SPRT test statistic over a prior distribution for the parameter $\theta$ under $H_1$. It is further shown that since $\tilde {\Lambda}_n$ is a martingale under the null hypothesis, it satisfies Doob's inequality:
$\displaystyle P_{\theta_0}\big[\tilde {\Lambda}_n \ge b \text{ for some } n \ge 1\big] \le \frac{1}{b}$ for any $b \gt 0$; hence a natural stopping rule for mSPRT is $\inf \lbrace n: \tilde {\Lambda}_n \ge \alpha^{-1}\rbrace$.
It further follows from Johari et al. [2015] that if after $n$ observations we define the p-value as
$p_n = \min\big[1, \min_{t \le n} \tilde {\Lambda}_t^{-1}\big]$ (20)
then the probability of the p-value falling below $\alpha$ at any time during the experiment is at most $\alpha$ itself:
$P_{\theta_0}[p_n \le \alpha] \le \alpha $ for any $n$.
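In code, this always-valid p-value is just a running minimum of the truncated inverse likelihood ratios. A minimal sketch (the lambdas sequence below is made up purely for illustration):
lambdas = np.array([0.8, 1.2, 3.0, 2.5, 7.0, 30.0])   # an illustrative sequence of Lambda_t values
p_raw = np.minimum(1.0, 1.0 / lambdas)                 # truncate 1 / Lambda_t at 1
p_always_valid = np.minimum.accumulate(p_raw)          # Eq. (20): running minimum over t <= n
print(p_always_valid)                                  # monotonically non-increasing by construction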
A closed form solution for Gaussian mSPRT is provided by [Stenberg 2019]. Denote $X_n = \lbrace X_1, X_2, ..., X_n\rbrace$ and $Y_n = \lbrace Y_1, Y_2, ..., Y_n \rbrace$ as observations drawn independently from the two normal distributions for each group, $Z_n = Y_n - X_n$, $Z_n \sim N(\theta, 2\sigma^2)$ and $\pi(\theta) \sim N(\theta_0, \tau^2)$, then the solution to
$\displaystyle \tilde \Lambda_n = \int_{\theta \in \Theta} \Big(\prod_{i=1}^n \frac {f_{\theta}(z_i)}{f_{\theta_0}(z_i)}\Big)\pi(\theta)d\theta$ (21)
is
$\displaystyle \tilde \Lambda_n = \sqrt \frac {2 \sigma^2}{2 \sigma^2 + n \tau^2} exp\Big\lbrace \frac{\tau^2n^2(\bar {Y}_n - \bar {X}_n - \theta_0)^2}{4 \sigma^2 (2 \sigma^2 + n \tau^2)} \Big\rbrace$ (22)
in which $X$ and $Y$ are observations from the control and the treatment groups. Also according to [Stenberg 2019], for large $n$, the following formula provides a good approximation for binary data in A/B tests:
$\displaystyle \tilde \Lambda_n = \sqrt \frac {V_n}{V_n + n \tau^2} exp\Big\lbrace \frac{\tau^2n^2(\bar {Y}_n - \bar {X}_n - \theta_0)^2}{2V_n (V_n + n \tau^2)} \Big\rbrace$ (23)
in which X and Y are binary and $V_n = \bar {X}_n(1 - \bar{X}_n) + \bar {Y}_n(1 - \bar{Y}_n)$. If we wanted to use the exact form of (19) with the Binomial likelihood (13), we would need a Beta prior, and the resulting integral may need to be solved numerically.
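As a rough illustration of that remark, the sketch below evaluates the exact mixture likelihood ratio for a single Bernoulli stream tested against a point null p0, with a Beta(a, b) mixing prior, by numerical integration. Note that here the mixture is over the Bernoulli success probability p itself (a one-sample test of p against p0), not over the difference between two groups as in (21)-(23); the function name, the Beta(1, 1) default, and the sample size are my own assumptions for illustration.
from scipy.integrate import quad
from scipy.stats import beta as beta_dist

def bernoulli_msprt_lambda(x, p0, a=1.0, b=1.0):
    # Exact mixture likelihood ratio: integrate the Bernoulli likelihood ratio
    # of the whole sample against a Beta(a, b) prior on p (illustrative sketch).
    x = np.asarray(x)
    n, s = len(x), x.sum()
    def integrand(p):
        log_ratio = s * (np.log(p) - np.log(p0)) + (n - s) * (np.log(1 - p) - np.log(1 - p0))
        return np.exp(log_ratio) * beta_dist.pdf(p, a, b)
    value, _ = quad(integrand, 0.0, 1.0)
    return value

rng = np.random.default_rng(0)
obs = rng.binomial(1, 0.6, size=40)            # data generated with true p = 0.6
print(bernoulli_msprt_lambda(obs, p0=0.4))     # compare with 1/alpha for the stopping rule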
One challenge in applying (23) to A/B testing is the need to set the prior variance $\tau^2$ for the mixture. It is very common to assume a normal mixing distribution $\pi(\theta) \sim N(\theta_0, \tau^2)$ because this is often observed in practice and, for large sample sizes, the distribution of the effect is normal. The benefit of assuming a mixture model instead of specifying a single alternative hypothesis is that these experiments usually target very small effects, and a misspecified minimum detectable effect can reduce the power of both t-tests and the SPRT, whereas mSPRT is robust to a misspecified mixture variance [Stenberg 2019].
Note again that the parameter $\theta$ in the above closed form solutions represents the difference between the groups; that is, if we denote $X_n = \lbrace X_1, X_2, ..., X_n\rbrace$ and $Y_n = \lbrace Y_1, Y_2, ..., Y_n \rbrace$ as observations drawn independently from the two distributions for each group, then $Z_n = Y_n - X_n$ with $Z_n \sim N(\theta, 2\sigma^2)$ is the random variable for which the likelihood ratio is calculated. This includes the convenient case of $\theta_0 = 0$ under the null hypothesis. See the appendix of [Stenberg 2019] for details.
Can we solve mSPRT numerically?¶
It is tempting to compare the closed form mSPRT solution above with a numerical solution. This might be needed in the case of the exact Binomial likelihood, which requires a Beta prior. For this purpose, we rewrite the mSPRT equation (21) in the following numerically more stable form and solve it.
$\displaystyle \tilde \Lambda_n = \int_{\theta \in \Theta} e^{\ln\Big(\prod_{i=1}^n \frac {f_{\theta}(z_i)}{f_{\theta_0}(z_i)}\Big)}\pi(\theta)d\theta = \int_{\theta \in \Theta} e^{\sum_{i=1}^n \ln \frac {f_{\theta}(z_i)}{f_{\theta_0}(z_i)}}\pi(\theta)d\theta$ (24)
One question that may arise when computing (24) is how the elements $z_i$ are calculated. The closed form (22) only involves $\bar X$ and $\bar Y$, which can be calculated separately for each sample. It can be shown with a little algebra that when the two variables are independent and the sample sizes are equal, the $n$ pairwise differences $z_i = x_i - y_i$ yield the same average and variance as the $n^2$ cross differences $z_{ij} = x_i - y_j$. This means that the following two ways of calculating $\bar Z$ and $Var(Z)$ are identical. If the samples were not independent, we would naturally have to consider matched pairs of X and Y; if they are independent, we can pair them any way we want and calculate the statistics.
$\displaystyle \bar Z = \frac{\sum_{i,j = 1}^n (x_i - y_j)}{n^2} = \frac{\sum_{i = 1}^n (nx_i - n \bar Y)}{n^2} = \frac{n^2 \bar X - n^2 \bar Y}{n^2} = \bar X - \bar Y$, which can equally be obtained from $\displaystyle \frac{\sum_{i = 1}^n (x_i - y_i)}{n}$.
$\displaystyle Var(Z) = \frac{\sum_{i,j = 1}^n \Big [(x_i - y_j) - (\bar X - \bar Y) \Big]^2}{n^2}$
$\displaystyle = \frac{\sum_{i,j = 1}^n \Big [(x_i - \bar X)^2 + (y_j - \bar Y)^2 -2(x_i - \bar X)(y_j - \bar Y) \Big]}{n^2}$
$\displaystyle = \frac{n \sum_{i = 1}^n (x_i - \bar X)^2 + n \sum_{j = 1}^n (y_j - \bar Y)^2 - 2 \sum_{i = 1}^n(x_i - \bar X) \sum_{j = 1}^n(y_j - \bar Y)}{n^2}$
Note that in deriving this formula, either factor of the cross-product term is zero (the deviations from each sample mean sum to zero), so using all $n^2$ pairs eliminates any dependency between X and Y. Hence, this sum is equivalent to the pairwise case only if X and Y are independent.
$\displaystyle = \frac{n \sum_{i = 1}^n (x_i - \bar X)^2 + n \sum_{j = 1}^n (y_j - \bar Y)^2}{n^2}$
$\displaystyle = \frac{n^2Var(X) + n^2Var(Y)}{n^2} = Var(X) + Var(Y)$. This always holds when we sum over all $n^2$ pairs, but only if X and Y are independent does $Var(X) + Var(Y) = Var(X-Y)$.
We can equally obtain the same result from $\displaystyle \frac{\sum_{i = 1}^n \Big [(x_i - y_i) - (\bar X - \bar Y) \Big]^2}{n}$ when X and Y are independent.
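A quick numerical check of these identities (the sample size and distributions below are arbitrary choices of mine):
rng = np.random.default_rng(0)
n = 200
x = rng.normal(1.0, 2.0, n)
y = rng.normal(0.5, 3.0, n)
zij = x[:, None] - y[None, :]     # all n^2 cross differences
zi = x - y                        # n pairwise differences (arbitrary pairing)
print(zij.mean(), zi.mean(), x.mean() - y.mean())   # all three means agree exactly
print(zij.var(), x.var() + y.var(), zi.var())       # first two agree exactly; the pairwise
# variance matches only up to sampling noise in the (near-zero) sample covariance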
from numpy.typing import ArrayLike
from scipy.integrate import quad
# from scipy.stats import norm, binom
class GaussianSample():
def __init__(self, loc, scale, seed=42):
self.rng = np.random.default_rng(seed)
self.loc = loc
self.scale = scale
def draw(self, size=None):
return self.rng.normal(self.loc, self.scale, size)
class BinomialSample():
def __init__(self, p, seed=42):
self.rng = np.random.default_rng(seed)
self.p = p
def draw(self, n=1, size=None):
return self.rng.binomial(n, self.p, size)
class mSPRT():
"""
mSPRT parent class containing closed form solution
and helper function for numerical solutions
"""
def __init__(self, theta0, tau):
self.theta0 = theta0
self.tau = tau
# def get_likelihood(self, f_theta: ArrayLike, f_theta0: ArrayLike):
# log_likelihood = np.sum(np.log(f_theta) - np.log(f_theta0))
# return np.exp(log_likelihood)
def get_lambda_normal_cf(self, X, Y, sigma):
"""
Calculate lambda hat from closed form.
"""
if sigma == 0:
print("invalid data")
return None
n = len(X)
X, Y = np.array(X), np.array(Y)
Z_bar = np.mean(Y) - np.mean(X)
n2 = n**2
sigma2 = sigma**2
tau2 = self.tau**2
a = np.sqrt(2*sigma2 / (2*sigma2 + n*tau2))
b = tau2*n2*(Z_bar - self.theta0)**2
c = 4*sigma2*(2*sigma2 + n*tau2)
return a*np.exp(b/c)
def get_lambda_bernoulli_cf(self, X, Y):
"""
Calculate lambda hat from closed form.
"""
n = len(X)
X, Y = np.array(X), np.array(Y)
X_bar, Y_bar = np.mean(X), np.mean(Y)
Z_bar = Y_bar - X_bar
# V_n can be 0, especially in the beginning when there is not enough data.
# The obvious case is a single sample, but it can also happen
# whenever the expression below evaluates to zero.
Vn = X_bar*(1 - X_bar) + Y_bar*(1 - Y_bar)
if Vn == 0:
return None
n2 = n**2
tau2 = self.tau**2
a = np.sqrt(Vn / (Vn + n*tau2))
b = tau2*n2*(Z_bar - self.theta0)**2
c = 2*Vn*(Vn + n*tau2)
return a*np.exp(b/c)
def get_likelihood(self, theta: float, Z: ArrayLike, sigma):
"""
Function of theta for integration
Z is the array of Gaussian observations
"""
f_theta = norm.pdf(Z, theta, np.sqrt(2)*sigma)
f_theta0 = norm.pdf(Z, self.theta0, np.sqrt(2)*sigma)
pi_theta = norm.pdf(theta, self.theta0, self.tau)
log_likelihood = np.sum(np.log(f_theta) - np.log(f_theta0))
return np.exp(log_likelihood)*pi_theta
def get_lambda_normal_nm(self, X, Y, sigma):
"""
Calculate lambda hat numerically.
"""
if sigma == 0:
print("invalid data")
return None
X, Y = np.array(X), np.array(Y)
Z = Y - X
# We integrate over theta0 +/- 3*tau (a 6-tau-wide interval), which covers more
# than 99.7 % of the area under the normal prior density.
a, b = self.theta0 - 3*self.tau, self.theta0 + 3*self.tau
y, err = quad(self.get_likelihood, a, b, args=(Z, sigma))
return y
Let's first verify that the values of lambda are similar in both the closed form and the numerical calculation for Gaussian samples. We can see that, due to exponentiation, when the true difference between the means is large and sigma is relatively small, the values of lambda get extremely large and we run into numerical issues when comparing the two calculations; but for cases where the values are small, the results are similar. This may need further investigation to make sure it does not affect the tests, but for now we skip it.
# compare numeric and closed form solutions for different samples sizes
N = 50
p1, p2, sigma = 0.3, 0.6, 1
theta0 = 0.0 # Diff of means under H0
tau = 2 # variance of mixture
msprt = mSPRT(theta0, tau)
sampler0 = GaussianSample(p1, sigma)
sampler1 = GaussianSample(p2, sigma)
lambda_cf, lambda_nm = [], []
for n in range(1, N+1): # n represents sample size
X = np.array(sampler0.draw(n))
Y = np.array(sampler1.draw(n))
lambda_cf.append(msprt.get_lambda_normal_cf(X, Y, sigma))
lambda_nm.append(msprt.get_lambda_normal_nm(X, Y, sigma))
# the closed form and numerical solutions are very close (the printed value is the mean difference)
print("RMS", np.mean(np.array(lambda_cf) - np.array(lambda_nm)))
print("------------------------------------")
N = 50
p1, p2, sigma = 100.0, 130.0, 10
theta0 = 0.0 # Diff of means under H0
tau = 2 # variance of mixture
msprt = mSPRT(theta0, tau)
sampler0 = GaussianSample(p1, sigma)
sampler1 = GaussianSample(p2, sigma)
lambda_cf, lambda_nm = [], []
for n in range(1, N+1): # n represents sample size
X = np.array(sampler0.draw(n))
Y = np.array(sampler1.draw(n))
lambda_cf.append(msprt.get_lambda_normal_cf(X, Y, sigma))
lambda_nm.append(msprt.get_lambda_normal_nm(X, Y, sigma))
# here the closed form and numerical solutions diverge wildly due to overflow (the printed value is the mean difference)
print("RMS", np.mean(np.array(lambda_cf) - np.array(lambda_nm)))
RMS 3.4458893599653705e-09 ------------------------------------ RMS 4.674640557812788e+22
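One simple mitigation for the overflow noted above (a standard numerical trick, not something done in the references) is to work with the logarithm of the closed form (22); the stopping rule $\tilde\Lambda_n \ge \alpha^{-1}$ then becomes $\log\tilde\Lambda_n \ge -\log\alpha$. A minimal sketch mirroring get_lambda_normal_cf:
def log_lambda_normal_cf(X, Y, sigma, theta0, tau):
    # log of Eq. (22); finite even when Lambda itself would overflow
    n = len(X)
    z_bar = np.mean(Y) - np.mean(X)
    s2, t2 = sigma**2, tau**2
    return 0.5 * np.log(2 * s2 / (2 * s2 + n * t2)) + (t2 * n**2 * (z_bar - theta0)**2) / (4 * s2 * (2 * s2 + n * t2))

# X and Y here are the last (n = 50) Gaussian samples from the loop above
print(log_lambda_normal_cf(X, Y, sigma=10, theta0=0.0, tau=2))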
Let's now simulate two different tests. One is an A/B test and the other is a Gaussian test. We use similar parameter values only to show how the tests converge differently; otherwise these two tests are not the same: in the first, the observations are drawn from a Bernoulli distribution, whereas in the second they are drawn from a Gaussian distribution.
# A/B test
N = 100
p1, p2 = 0.3, 0.6
theta0 = 0.0 # Diff of means under H0
tau = 0.05 # variance of mixture
msprt = mSPRT(theta0, tau)
# do not use the same random seed for the two samplers, otherwise they will be in sync
samplerb0 = BinomialSample(p1, seed=11)
samplerb1 = BinomialSample(p2, seed=37)
lambda_cf1 = []
X, Y = [], []
# We draw one at a time to simulate sequential testing
for i in range(1, N+1):
X.append(samplerb0.draw(n=1))
Y.append(samplerb1.draw(n=1))
lambda_cf1.append(msprt.get_lambda_bernoulli_cf(X, Y,))
#----------------------------------
# Gaussian test
N = 100
mu1, mu2, sigma = 0.3, 0.6, np.sqrt(0.45*(1-0.45)) # for average p (pooled)
theta0 = 0.05 # Diff of means under H0
tau = 0.1 # variance of mixture
msprt = mSPRT(theta0, tau)
# do not use the same random seed for the two samplers, otherwise they will be in sync
samplerg0 = GaussianSample(mu1, sigma, seed=11)
samplerg1 = GaussianSample(mu2, sigma, seed=37)
lambda_cf2, lambda_nm2 = [], []
X, Y = [], []
# We draw one at a time to simulate sequential testing
for i in range(1, N+1):
X.append(samplerg0.draw())
Y.append(samplerg1.draw())
lambda_cf2.append(msprt.get_lambda_normal_cf(X, Y, sigma))
lambda_nm2.append(msprt.get_lambda_normal_nm(X, Y, sigma))
Let's calculate p-values
lambda_cf1 = [x for x in lambda_cf1 if x is not None]
lambda_cf2 = [x for x in lambda_cf2 if x is not None]
lambda_nm2 = [x for x in lambda_nm2 if x is not None]
p_cf1 = [min(1, 1/x) for x in lambda_cf1]
p_cf2 = [min(1, 1/x) for x in lambda_cf2]
p_nm2 = [min(1, 1/x) for x in lambda_nm2]
# mean and RMS of the difference between the closed form and numerical lambdas
print(np.mean(np.array(lambda_cf2) - np.array(lambda_nm2)))
print(np.sqrt(np.mean((np.array(lambda_cf2) - np.array(lambda_nm2))**2)))
0.01769053394626333 0.028664705511410888
print(np.mean(np.array(p_cf2) - np.array(p_nm2)))
print(np.sqrt(np.mean((np.array(p_cf2) - np.array(p_nm2))**2)))
-0.0011365351459868937 0.001591464082717296
from copy import deepcopy
def fmin(a):
"""
Forward copy the minimum element in a list, e.g.
[4, 3, 5, 6, 2, 4, 1, 3, 6] --> [4, 3, 3, 3, 2, 2, 1, 1, 1]
"""
L = deepcopy(a)
if not L:
return None
mn = L[0]
for i in range(len(L)):
if L[i] < mn:
mn = L[i]
else:
L[i] = mn
return L
pv_cf1 = fmin(p_cf1)
pv_cf2 = fmin(p_cf2)
pv_nm2 = fmin(p_nm2)
print(np.mean(np.array(pv_cf2) - np.array(pv_nm2)))
print(np.sqrt(np.mean((np.array(pv_cf2) - np.array(pv_nm2))**2)))
-0.0016721060447630182 0.00212154120582936
X1, X2 = [m for m in range(len(p_cf1))], [m for m in range(len(p_cf2))]
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"P Raw truncated at 1 for mSPRT for Gaussian and Bernoulli tests for verification only (these are not p values)",
# 'yaxis_range': [-0.5, 1.5],
"height": 500}, )
fig.add_scatter(x=X1, y=p_cf1, name="Bernoulli", mode="lines", line={"color": "blue", "width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X2, y=p_cf2, name="Gaussian", mode="lines", line={"color": "red", "width": 0.75}, row=1, col=1)
X1, X2 = [m for m in range(len(pv_cf1))], [m for m in range(len(pv_cf2))]
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"p-values for mSPRT for Gaussian and Bernoulli tests",
# 'yaxis_range': [-0.5, 1.5],
"height": 500}, )
fig.add_scatter(x=X1, y=pv_cf1, name="Bernoulli", mode="lines", line={"color": "blue", "width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X2, y=pv_cf2, name="Gaussian", mode="lines", line={"color": "red", "width": 0.75}, row=1, col=1)
fig.add_hline(y=0.05, line_width=0.75, line_dash="dot", line_color="black")
fig.add_annotation(
x=100.0, y=0.05, # position
text="alpha=0.05", # text
#showarrow=True,
arrowhead=3, arrowsize=1, arrowwidth=2, arrowcolor="black"
)
fig.show()
We can see that after about 50 samples, the Bernoulli (A/B) test converged, with its p-value dropping below the $\alpha = 0.05$ threshold. The fixed horizon sample size for this case would be:
# TODO is mSPRT equivalent to a two-sample test? From the closed form solution and the use of 2*sigma^2 it looks like it is.
p0 = 0.45
# Here we assume p0 is the baseline
sigma = np.sqrt(p0 * (1 - p0)) # baseline SD, no extra sqrt(2)
mde = 0.3
n_per_group = get_sample_size_MDE(
sigma=sigma,
alpha=0.05,
beta=0.2,
mde=mde,
two_sample=True, # applies sqrt(2) internally for equal-size two-sample
sides=2, # set to 1 if you want one-sided
attrition=0.0
)
print(n_per_group)
44
At first glance, the mSPRT does not seem to reduce the test size. But one should keep in mind that here the comparison is made against a mixture of alternatives, not a single alternative. Also, the power of mSPRT is 1, that is, the mSPRT eventually detects the effect if there is one. The mixture variance $\tau^2$ plays an important role in the length of the experiment. We will not simulate it here, but as $\tau^2$ increases, the time for the p-values to converge increases. Careful selection of the mixture variance is therefore important.
X1, X2 = [m for m in range(len(pv_cf2))], [m for m in range(len(pv_nm2))]
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"comparing p values between closed form and numberic for for Normal case",
# 'yaxis_range': [-0.5, 1.5],
"height": 500}, )
fig.add_scatter(x=X1, y=pv_cf2, name="Closed Form", mode="lines", line={"color": "blue", "width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X2, y=pv_nm2, name="Numerical", mode="lines", line={"color": "red", "width": 0.75}, row=1, col=1)
We can see that the closed form and the numerical form are almost equivalent, as the curves fully overlap.
Multi-Arm Bandit Tests¶
Problem Statement¶
In a typical setting, there are $K$ actions or arms. Arm $a$ is associated with an unknown quantity $v_a$ that gives the value of that arm. Assume the rewards come from the probability distribution $f_a(y \mid \bf\Theta)$, where $a$ indexes the action that is taken, $y$ is the observed reward, and $\bf\Theta$ is the vector of unknown parameters that characterize $f_a$ and is to be learned from the experiment.
Examples:
- In the slot machine problem (binomial bandit), we have ${\bf\Theta} = [\theta_1, \theta_2, ..., \theta_K]^T$ and $v_a({\bf\Theta}) = \theta_a (a = 1, 2, ..., K)$.
- In a two-factor experiment for maximizing the conversion rate of a website, where we have two dummy variables $X_c$ (color red=0 or blue=1) and $X_p$ (position left=0 or right=1), $\bf\Theta$ could be the set of logistic regression coefficients in $logit(Pr(conversion)) = \theta_0 + \theta_1X_c + \theta_2X_p + \theta_3X_cX_p$. The action $a$ is any realization of the vector of design variables ${\bf X} = [1, X_c, X_p, X_cX_p]^T$ with the value $v_a({\bf\Theta}) = logit^{-1}({\bf\Theta}^T {\bf X}_a)$. In general, ${\bf\Theta} = [\theta_0, \theta_1, ...., \theta_p]^T$ and ${\bf X} = [1, X_1, X_2, ..., X_p]$ is a vector representation of the actions.
Thompson Sampling¶
For the binomial bandit, the reward of an arm has a Bernoulli distribution $Bernoulli(\theta)$, where $\theta$ is the probability of success (reward = 1) of that arm. If we play this arm $n$ times, the total reward (the number of successes) is $y_{\theta} \sim Binomial(n, \theta)$. We seek the posterior distribution $P(\theta \mid y)$ as we observe rewards, which by Bayes' rule is $P(\theta \mid y) \varpropto P(y \mid \theta)P(\theta)$. Assume a prior $\theta \sim Beta(\alpha, \beta)$ for the arm ($\alpha$ and $\beta$ count successes and failures), so we can write the posterior distribution as:
$P(\theta \mid y) \varpropto P(y \mid \theta)P(\theta) = {n \choose y}\theta^y(1 - \theta)^{n - y} \frac 1 {B(\alpha, \beta)}\theta^{\alpha - 1}(1 - \theta)^{\beta - 1} \varpropto \theta^{y+\alpha-1}(1 - \theta)^{n-y+\beta-1} \varpropto Beta(y + \alpha, n - y + \beta)$ (25)
The Beta family is conjugate to the Binomial family, meaning that if we start from a Beta prior, we end up with a Beta posterior. Note that $y$ is the number of successes and $n-y$ is the number of failures, so to model $P(\theta)$ we start by setting $\alpha$ and $\beta$ to 1 for all arms and then increment them as we observe a success or a failure for that arm. In the long run, the optimal arm will have a higher rate of successes and hence a higher probability of being selected via Thompson sampling, as sketched below.
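A minimal sketch of this conjugate update combined with Thompson sampling (the arm probabilities and the number of plays are illustrative choices; a fuller simulation follows later in the notebook):
rng = np.random.default_rng(0)
true_p = np.array([0.58, 0.60, 0.62])     # unknown to the algorithm
alpha_ = np.ones(3)                       # Beta prior: 1 pseudo-success per arm
beta_ = np.ones(3)                        # Beta prior: 1 pseudo-failure per arm

for t in range(1000):
    theta = rng.beta(alpha_, beta_)       # one draw per arm from the current Beta posteriors
    a = int(np.argmax(theta))             # Thompson sampling: play the arm with the largest draw
    reward = rng.binomial(1, true_p[a])   # observe a Bernoulli reward
    alpha_[a] += reward                   # conjugate Beta-Binomial update, Eq. (25)
    beta_[a] += 1 - reward

print(np.round(alpha_ / (alpha_ + beta_), 3))   # posterior means after 1000 plays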
Stopping Criteria for Thompson Sampling¶
We can apply Thompson sampling to determine the winning arm in the multi-armed bandit problem as follows. Let ${\bf y}_t$ be the set of rewards observed by time $t$. The probability that arm $a$ is optimal can be defined by:
$\omega_{at} = Pr(a \space is \space optimal \mid {\bf y}_t) = \int I(a = argmax \space v_a({\bf\Theta}))P({\bf \Theta} \mid {\bf y}_t)d{\bf\Theta}$ (26)
which simply states that the action $a$ has the highest probability of being optimal if for all possible values of the parameter vector it has the maximum value most of the time. $I$ is the indicator function. Note that $\omega_{at}$ is not the same as the belief in $P({\bf\Theta})$ in the experiment.
The Thompson heuristic assigns observation at time $t+1$ to arm $a$ with the probability $\omega_{at}$. We can compute $\omega_{at}$ from the Monte-Carlo sample ${\bf \Theta}^{(1)}, {\bf \Theta}^{(2)}, ..., {\bf\Theta}^{(G)}$ simulated from $P({\bf\Theta} \mid {\bf y}_t)$ using
$\omega_{a_t} \approx \frac 1 G \sum_{g=1}^GI(a = argmax \space v_a({\bf\Theta}^{(g)}))$ (27)
which is the proportion of times that action $a$ has the maximum value in the sample.
Note 1: The algorithm where one first computes $\omega_{at}$ and then generates an action from the discrete set $\omega_{1t}, \omega_{2t}, ...,\omega_{Kt},$ is equivalent to generating a single ${\bf\Theta}_{t} \sim P({\bf \Theta} \mid {\bf y}_t)$ and selecting the action $a$ with maximum value $v_a({\bf\Theta}_t)$.
Thus Thompson sampling can be implemented using a single draw from $P({\bf \Theta} \mid \bf y)$, but calculating $\omega_{at}$ produces useful statistics, including a decision rule to stop the experiment. In practice, in single-draw Thompson sampling we can still compute $\omega_{at}$, but we need to keep track of the history of the number of draws and the number of times each action had the maximum value across all draws. Using G draws to calculate this probability is easier.
As an example, consider the three-arm case $\bf\Theta = (\theta_a, \theta_b, \theta_c)^T$ below, and assume that at time $t$ we have (13, 9) successes and failures for arm $a$, (21, 11) for arm $b$, and (31, 11) for arm $c$. Suppose we get the following Monte-Carlo samples ${\bf\Theta}^{(g)}$:
| draw(g) | $\theta_a \sim Beta(13, 9)$ | $\theta_b \sim Beta(21, 11)$ | $\theta_c \sim Beta(31, 11)$ | Winner |
|---|---|---|---|---|
| 1 | 0.54 | 0.73 | 0.74 | c |
| 2 | 0.55 | 0.66 | 0.73 | c |
| 3 | 0.53 | 0.85 | 0.80 | b |
| 4 | 0.57 | 0.50 | 0.65 | c |
| 5 | 0.52 | 0.67 | 0.83 | c |
| ... | ... | ... | ... | ... |
| G | 0.65 | 0.84 | 0.63 | b |
From this sample we can calculate the probabilities that each action is optimal, ${\bf\Omega}_t = (\omega_{at}, \omega_{bt}, \omega_{ct})$. Assume that in 2/3 of the rows above arm $c$ is the winner, and in 1/3 of the rows arm $b$ is the winner. So at this time step $t$, the probability of being optimal for each action is ${\bf\Omega}_t = (0.0, 0.333, 0.667)$.
The probability of being the optimal arm, $\omega_{at}$, can be used as a stopping criterion to end the experiment. For example, we can decide to end the experiment once one of the arms has a probability of being optimal of at least 95%.
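For the three-arm example above, the optimality probabilities in Eq. (27) can be estimated in a few lines (a sketch; the Beta parameters come from the table, while the number of draws G and the variable names are arbitrary):
rng = np.random.default_rng(0)
G = 10000
# G posterior draws per arm at time t: Beta(13, 9), Beta(21, 11), Beta(31, 11)
draws = np.column_stack([rng.beta(13, 9, G), rng.beta(21, 11, G), rng.beta(31, 11, G)])
winners = np.argmax(draws, axis=1)               # which arm has the maximum value in each draw
omega = np.bincount(winners, minlength=3) / G    # Eq. (27): proportion of draws each arm wins
print(np.round(omega, 3))                        # arm c should carry most of the probability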
From the Monte-Carlo method that is used to calculate $\omega_{at}$ we can also calculate the estimated regret and use it as a second condition to end the experiment, in conjunction with the 95% confidence of finding the optimal arm. Let ${\bf \Theta}_0$ be the true value of ${\bf \Theta}$ and $a^* = argmax_a v_a({\bf \Theta}_0)$ be the true optimal arm. Then the regret at time $t$ is $r_t = v_{a^*}({\bf \Theta}_0) - v_{a_t^*}({\bf \Theta}_0)$, which is the value difference between the truly optimal arm and the apparently optimal arm at time $t$.
In practice, the true regret is unknown but it can be estimated as $r_t^{(g)} = v_*({\bf \Theta}^{(g)}) - v_{a_t^*}({\bf \Theta}^{(g)})$ (28)
where $v_*({\bf \Theta}^{(g)}) = max_a v_a({\bf \Theta}^{(g)})$ is the current maximum value in the Monte-Carlo draw of ${\bf \Theta}^{(g)}$ from $P({\bf \Theta} \mid {\bf y}_t)$ in draw $g$ and
$v_{a_t^*}({\bf \Theta}^{(g)})$ the value of the arm that is deemed best across all Monte-Carlo draws, in draw $g$.
Note the emphasis on draw $g$ here. This means that we first find the best action $a_t^*$ across all draws, and then use its value in the current draw. Hence, the difference in Eq. (28) is often zero and sometimes positive.
In practice, it may be useful to use the relative (percentage) regret instead of its absolute value, defined by
$\rho_t^{(g)} = \frac {v_*({\bf \Theta}^{(g)}) - v_{a_t^*}({\bf \Theta}^{(g)})}{v_{a_t^*}({\bf \Theta}^{(g)}) }$ (29)
The second criterion to end the experiment can now be defined as stopping the experiment when the potential value remaining (PVR) defined by (29) is below a practical significance level or, if no practical significance level is given, when the remaining regret is less than 1%.
Let's use the previous example to demonstrate how regret is calculated. As evident from its expected value, the best arm so far at time $t$ is arm c. To calculate the regret in each draw, i.e., each row, we subtract the value of this arm in that row from the maximum value in the row. For example, the maximum value of row 1 is 0.74, which is actually the value of arm c in this row, hence the regret is 0. In row 3, arm c does not have the maximum value; the maximum value is 0.85 and the value of arm c is 0.80, resulting in a regret of 0.05.
| draw(g) | $\theta_a \sim Beta(13, 9)$ | $\theta_b \sim Beta(21, 11)$ | $\theta_c \sim Beta(31, 11)$ | Regret |
|---|---|---|---|---|
| 1 | 0.54 | 0.73 | 0.74 | 0.0 |
| 2 | 0.55 | 0.66 | 0.73 | 0.0 |
| 3 | 0.53 | 0.85 | 0.80 | 0.05 |
| 4 | 0.57 | 0.50 | 0.65 | 0.0 |
| 5 | 0.52 | 0.67 | 0.83 | 0.0 |
| ... | ... | ... | ... | ... |
| G | 0.65 | 0.84 | 0.63 | 0.21 |
The numbers in the above table show that there is still some potential value remaining in the experiment, as the regret exceeds 1% in some draws, so there is a chance that some action is better than c. But if instead we had the following results:
| draw(g) | $\theta_a \sim Beta(13, 9)$ | $\theta_b \sim Beta(21, 11)$ | $\theta_c \sim Beta(31, 11)$ | Regret |
|---|---|---|---|---|
| 1 | 0.54 | 0.73 | 0.74 | 0.0 |
| 2 | 0.55 | 0.66 | 0.73 | 0.0 |
| 3 | 0.53 | 0.805 | 0.80 | 0.005 |
| 4 | 0.57 | 0.50 | 0.65 | 0.0 |
| 5 | 0.52 | 0.67 | 0.83 | 0.0 |
| ... | ... | ... | ... | ... |
| G | 0.62 | 0.639 | 0.63 | 0.009 |
The maximum value of the regret across all draws is 0.009, which indicates that there is less than 1% potential value remaining in the experiment. If we also determine that arm c has more than a 95% probability of being the optimal arm, we can end this experiment.
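Continuing with the draws and omega arrays from the sketch above, the per-draw regret of Eq. (28) and the potential value remaining of Eq. (29) can be computed as follows (the 95% / 1% thresholds are the ones discussed in the text):
best_arm = int(np.argmax(omega))                   # apparently optimal arm a_t^*
regret = draws.max(axis=1) - draws[:, best_arm]    # Eq. (28): zero whenever a_t^* wins the draw
pvr = regret / draws[:, best_arm]                  # Eq. (29): potential value remaining per draw
stop = (omega[best_arm] >= 0.95) and (pvr.max() < 0.01)
print(np.round(omega[best_arm], 3), np.round(pvr.max(), 3), stop)
# With the Beta(13, 9) / Beta(21, 11) / Beta(31, 11) posteriors, some value typically remains,
# so stop is expected to be False here, consistent with the first regret table above.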
An alternative way to track the regret is to estimate its expected value. Let the expected value of an action be defined by $\tilde{v}_a({\bf \Theta}) = E[v_a({\bf \Theta})]$.
The estimated expected regret is then defined by:
$\tilde{r_t} = E[v_{a_t^*}(\bf \Theta_t)] - E[v_{a_t}(\bf \Theta_t)]$ (30)
where $a_t^*$ is the best known action at time $t$ and $a_t$ is the selected action at time $t$. Note that we have defined the above equation for single-draw Thompson sampling at each time step $t$, but it is equally applicable to each draw $g$ of $G$ Monte-Carlo draws with some changes in notation.
The expected regrets for the above example are shown below. The expected values for the three actions are 13/22 = 0.59, 21/32 = 0.656 and 31/42 = 0.738.
| draw(g) | $\theta_a \sim Beta(13, 9)$, $E[\theta_a]=0.59$ | $\theta_b \sim Beta(21, 11)$, $E[\theta_b]=0.656$ | $\theta_c \sim Beta(31, 11)$, $E[\theta_c]=0.738$ | Expected Regret |
|---|---|---|---|---|
| 1 | 0.54 | 0.73 | 0.74 | 0.0 |
| 2 | 0.55 | 0.66 | 0.73 | 0.0 |
| 3 | 0.53 | 0.805 | 0.80 | 0.082 |
| 4 | 0.57 | 0.50 | 0.65 | 0.0 |
| 5 | 0.52 | 0.67 | 0.83 | 0.0 |
| ... | ... | ... | ... | ... |
| G | 0.62 | 0.639 | 0.63 | 0.082 |
To understand the calculations: the drawn values are only used to find the arm with the maximum value (the selected arm), and the regret is calculated using the expected values. For example, in row 1, arm c is both the best known action (highest expected value) and the action with the maximum drawn value in this row, so the regret is 0.738 - 0.738 = 0.0. In row 3, arm b has the maximum drawn value, so the expected regret is the difference between the expected values of the best known arm c and the selected arm b: 0.738 - 0.656 = 0.082. We get the same result in row G because, even though the drawn values differ from row 3, the selected arm and the best known arm are the same, hence the difference between the expected values is the same.
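The 0.082 figure can be reproduced directly from the Beta posterior means (a quick check; indices 0, 1, 2 stand for arms a, b, c):
expected = np.array([13 / 22, 21 / 32, 31 / 42])     # posterior means of arms a, b, c at time t
best_known = int(np.argmax(expected))                # arm c (index 2)
selected = 1                                         # in rows 3 and G the maximum drawn value belongs to arm b
print(round(expected[best_known] - expected[selected], 3))   # 0.738 - 0.656 = 0.082, as in Eq. (30)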
Note on the difference between time step $t$ and draws $G$: at each time step we must have at least one draw, but having more draws (G > 1) is optional. The regret definitions are per single draw, so it does not matter how many draws we have; when G > 1 we can aggregate the regrets in some convenient way if desirable. In single-draw sampling we just calculate one value for the regret, from either the sampled values or the expected values. The probability of being the optimal arm (26) can also be estimated using single draws per time step; in that case the summation in (27) runs over the $t$ time steps instead of the $G$ draws. In practice, however, this becomes more challenging because we now have to keep track of the history of the number of time steps and the number of times each arm was selected, and this tracking gets even more complicated in batch computation. Hence it is easier to draw G > 1 samples and calculate the probabilities of being the optimal arm on the spot, without keeping track of any history.
Also, the draws $G$ should not be confused with batch processing. The time step in batch processing, $\tau$, counts the number of batches, whereas the time step $t$ counts each time the arm is played (shown to the user). In batch processing we need to average the regrets. Based on all of this, the most efficient Monte-Carlo sampling scheme for batch mode is as follows: during service, do a single draw at each time step $t$; during the posterior update, calculate the average expected regret for the batch, then draw G samples and calculate the probabilities of being optimal. A rough sketch of this recipe follows.
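The sketch below reuses the Beta-Bernoulli setup from above; the batch size, number of batches, and variable names are my own choices, and the per-batch regret here is an approximation that uses the posterior means available at the end of each batch.
rng = np.random.default_rng(0)
true_p = np.array([0.58, 0.60, 0.62])            # unknown to the algorithm
alpha_, beta_ = np.ones(3), np.ones(3)           # Beta(1, 1) priors
G, batch_size, n_batches = 1000, 200, 20

for tau_step in range(n_batches):                # tau counts batches
    successes, plays = np.zeros(3), np.zeros(3)
    for _ in range(batch_size):                  # t counts individual plays within the batch
        theta = rng.beta(alpha_, beta_)          # single Thompson draw per play during service
        a = int(np.argmax(theta))
        r = rng.binomial(1, true_p[a])
        successes[a] += r
        plays[a] += 1
    alpha_ += successes                          # posterior update once per batch
    beta_ += plays - successes
    means = alpha_ / (alpha_ + beta_)
    avg_regret = np.sum(plays * (means.max() - means)) / plays.sum()   # average expected regret for the batch
    draws = rng.beta(alpha_, beta_, size=(G, 3))                       # G Monte-Carlo draws for omega
    omega_b = np.bincount(np.argmax(draws, axis=1), minlength=3) / G
    # a stopping check would look at omega_b.max() and avg_regret here

print(np.round(omega_b, 3), round(avg_regret, 4))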
Simulated Example¶
class Arm():
"""
The simulated reward environment for a Bernoulli arm with an unknown true
probability p.
"""
def __init__(self, p: float, seed: int=42):
self.rng = np.random.default_rng(seed)
self.p = p
def get_reward(self):
return self.rng.binomial(1, self.p)
class ArmModel():
"""
Model is Beta distribution
for Bernoulli bandit.
Params: alpha, beta
"""
def __init__(self, seed: int=42):
"""
"""
self.rng = np.random.default_rng(seed)
self.alpha = 1
self.beta = 1
self.value = 0.5
def reset(self):
self.alpha = 1
self.beta = 1
self.value = 0.5
def posterior_update(self, reward: int):
self.alpha += reward
self.beta += 1 - reward
@property
def expected_value(self):
return self.alpha / (self.alpha + self.beta)
# we don't need the setter method for the expected_value property
def draw(self):
self.value = self.rng.beta(self.alpha, self.beta)
return self.value
class MAB():
"""
For the purpose of arm selection, we draw one sample from
the posterior instead of drawing G samples and selecting the most probable arm
based on omega probabilities. Hence this MAB forces more exploration. However, we do support,
as an option, drawing G samples for calculating the omega probabilities used for
stopping.
"""
def __init__(self, arms: list, models: list, G: int =100, seed: int=42):
"""
Multi-Arm Bandit class
Arguments:
arms (list): List of Bernoulli arms
models (list): List of models for the arms
G (int): Number of Monte-Carlo draws
"""
self.rng = np.random.default_rng(seed)
self.K = len(arms)
self.arms = arms
self.models = models
self.G = G
def draw(self):
return [model.draw() for model in self.models]
def observe(self, selected_arm):
return self.arms[selected_arm].get_reward()
def sample_omega(self):
"""
Estimate the probability of being optimal for each arm model
by taking G Monte-Carlo draws and calculating the frequency
of being the winner for each arm.
"""
assert self.G > 1, "Monte-Carlo sample needs G > 1"
omega = np.array([0]*self.K)
for g in range(self.G):
sample = self.draw()
idx = np.argmax(sample)
omega[idx] += 1
omega = omega / self.G
return omega
def select(self, sample):
return np.argmax(sample)
def regret_from_expected_value(self, sample):
expected_values = [model.expected_value for model in self.models]
selected_arm = self.select(sample)
best_arm = self.select(expected_values)
regret = expected_values[best_arm] - expected_values[selected_arm]
regret_percent = regret / expected_values[best_arm]
return regret, regret_percent
def regret_from_value(self, sample, omegas=None):
omegas = omegas or self.sample_omega()
selected_arm = self.select(sample)
best_arm = self.select(omegas)
regret = sample[selected_arm] - sample[best_arm]
regret_percent = regret / sample[best_arm]
return regret, regret_percent, omegas
def posterior_update(self, selected_arm, reward):
self.models[selected_arm].posterior_update(reward)
class ExperimentResult():
"""Placeholder for results
'norm' prefix stands for normalized regret, i.e., regret percentage.
'avg' prefix stands for running average of the regret at the time step.
We use the range [1, T], which corresponds to the number of trials.
"""
def __init__(self, K: int, T: int, regret_method: str):
self.K = K
self.X = [i for i in range(1, T + 1)]
self.regret_method = regret_method
self.drawn = [0]*self.K
self.rewards = []
self.regret = []
self.norm_regret = []
self.avg_regret = []
self.avg_norm_regret = []
self.omegas = []
def format(self):
self.omegas = np.round(self.omegas, 3)
self.regret = np.round(self.regret, 3)
self.avg_regret = np.round(self.avg_regret, 3)
self.norm_regret = np.round(self.norm_regret, 3)
self.avg_norm_regret = np.round(self.avg_norm_regret, 3)
class RegretMethod():
def __init__(self, name):
self.name = name
def calc_step_average(self, x_list, x_t):
"""
Running average of a sequence of values
"""
if not x_list:
return x_t
t = 1 + len(x_list)
x_t_minus_1 = x_list[-1]
return x_t_minus_1 + (1 / t) * (x_t - x_t_minus_1)
def append_results(
self,
mab: MAB,
reward: int,
sample: list,
selected_arm: int,
result: ExperimentResult):
raise NotImplementedError
class ERegret(RegretMethod):
"""
Expected regret
"""
def __init__(self, name):
super(ERegret, self).__init__(name)
def append_results(
self,
mab: MAB,
reward: int,
sample: list,
selected_arm: int,
result: ExperimentResult
):
result.rewards.append(self.calc_step_average(result.rewards, reward))
result.drawn[selected_arm] += 1
step_omegas = list(np.array(result.drawn) / sum(result.drawn))
a, b = mab.regret_from_expected_value(sample)
result.regret.append(a)
result.norm_regret.append(b)
result.avg_regret.append(self.calc_step_average(result.avg_regret, a))
result.avg_norm_regret.append(self.calc_step_average(result.avg_norm_regret, b))
result.omegas.append(step_omegas)
class TRegret(RegretMethod):
"""
Time step based regret
"""
def __init__(self, name):
super(TRegret, self).__init__(name)
def append_results(
self,
mab: MAB,
reward: int,
sample: list,
selected_arm: int,
result: ExperimentResult
):
"""
In the first run the selected_arm has the highest omega in step_omegas, which is the way it should be.
This also shows why we need two conditions for optimality convergence: in the beginning
we may have an arm with a 100% probability of being optimal while the regret is still large. If, after
the regret gets sufficiently small, an arm reaches p_optimal > 0.95, then we have found
the optimal arm.
"""
result.rewards.append(self.calc_step_average(result.rewards, reward))
result.drawn[selected_arm] += 1
step_omegas = list(np.array(result.drawn) / sum(result.drawn))
a, b, _ = mab.regret_from_value(sample, omegas=step_omegas)
result.regret.append(a)
result.norm_regret.append(b)
result.avg_regret.append(self.calc_step_average(result.avg_regret, a))
result.avg_norm_regret.append(self.calc_step_average(result.avg_norm_regret, b))
result.omegas.append(step_omegas)
class GRegret(RegretMethod):
"""
G sample based regret.
"""
def __init__(self, name):
super(GRegret, self).__init__(name)
def append_results(
self,
mab: MAB,
reward: int,
sample: list,
selected_arm: int,
result: ExperimentResult
):
result.rewards.append(self.calc_step_average(result.rewards, reward))
result.drawn[selected_arm] += 1
a, b, step_omegas = mab.regret_from_value(sample)
result.regret.append(a)
result.norm_regret.append(b)
result.avg_regret.append(self.calc_step_average(result.avg_regret, a))
result.avg_norm_regret.append(self.calc_step_average(result.avg_norm_regret, b))
result.omegas.append(step_omegas)
class Experiment():
def __init__(
self,
p_list: list,
regret_method: RegretMethod,
T: int,
G:int,
seed: int=42
):
arms = [Arm(p, seed=seed) for p in p_list]
models = [ArmModel(seed=seed+1) for i in range(len(p_list))]
self.mab = MAB(arms, models, G, seed=seed+2)
self.regret_method = regret_method
self.T = T
self.result = ExperimentResult(self.mab.K, self.T, self.regret_method.name)
def run(self):
# time steps
# We use the range [1, T], which corresponds to the number of trials.
for t in range(1, self.T + 1):
sample = self.mab.draw()
selected_arm = self.mab.select(sample)
reward = self.mab.observe(selected_arm)
self.regret_method.append_results(
self.mab,
reward,
sample,
selected_arm,
self.result)
self.mab.posterior_update(selected_arm, reward)
self.result.format()
return self.result
p_list=[0.58, 0.60, 0.62]
T = 7000
G = 100
seed = 42
experiment1 = Experiment(p_list=p_list, regret_method=ERegret("E Regret"), T=T, G=G, seed=seed)
result1 = experiment1.run()
experiment2 = Experiment(p_list=p_list, regret_method=TRegret("T Regret"), T=T, G=G, seed=seed)
result2 = experiment2.run()
experiment3 = Experiment(p_list=p_list, regret_method=GRegret("G Regret"), T=T, G=G, seed=seed)
result3 = experiment3.run()
X = result1.X
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Regret Values for Different Methods", "height": 500, "yaxis_range": [0.0, 0.5],}, )
fig.add_scatter(x=X, y=result1.regret, name=result1.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result2.regret, name=result2.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result3.regret, name=result3.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Avg Regret Values for Different Methods", "height": 500, "yaxis_range": [0.0, 0.2],}, )
fig.add_scatter(x=X, y=result1.avg_regret, name=result1.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result2.avg_regret, name=result2.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result3.avg_regret, name=result3.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Norm Regret Values for Different Methods", "height": 500, "yaxis_range": [0.0, 0.5],}, )
fig.add_scatter(x=X, y=result1.norm_regret, name=result1.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result2.norm_regret, name=result2.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result3.norm_regret, name=result3.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Avg Norm Regret Values for Different Methods", "height": 500, "yaxis_range": [0.0, 0.5],}, )
fig.add_scatter(x=X, y=result1.avg_norm_regret, name=result1.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result2.avg_norm_regret, name=result2.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result3.avg_norm_regret, name=result3.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
result1.norm_regret[-1], result2.norm_regret[-1], result3.norm_regret[-1]
(np.float64(0.0), np.float64(0.0), np.float64(0.0))
result1.drawn, result2.drawn, result3.drawn
([522, 1024, 5454], [522, 1024, 5454], [468, 1012, 5520])
result1.omegas[-1], result2.omegas[-1], result3.omegas[-1]
(array([0.075, 0.146, 0.779]), array([0.075, 0.146, 0.779]), array([0. , 0.03, 0.97]))
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Probability of being optimal for the true best arm", "height": 500, "yaxis_range": [0.0, 1.0],}, )
fig.add_scatter(x=X, y=result1.omegas[:, -1], name=result1.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result2.omegas[:, -1], name=result2.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result3.omegas[:, -1], name=result3.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig = make_subplots(rows=1, cols=1,)
fig.update_layout({"title": f"Average Rewards", "height": 500, "yaxis_range": [0.0, 1.0],}, )
fig.add_scatter(x=X, y=result1.rewards, name=result1.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result2.rewards, name=result2.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
fig.add_scatter(x=X, y=result3.rewards, name=result3.regret_method, mode="lines", line={"width": 0.75,}, row=1, col=1)
Discussion¶
From the above charts, we can see that there is not much difference between the regret calculation methods (expected, time step, G sample), although the expected regret seems to converge a bit faster. The effect of averaging on the smoothness of the curves is obvious. In the non-averaged methods, the expected regret fluctuates more in the later steps of the experiment, but that is expected because the Beta distributions for the optimal arms are narrower in late steps and the differences between expected values are crisper than those between the sampled values. Also note that after decreasing in the early steps, the regret stays almost constant, fluctuating between zero and non-zero values. The regret will never remain zero forever, because there is always some remaining variance in the model which gets close to zero but never becomes exactly zero. So seeing vertical spikes even after the experiment has converged is normal, although as time continues these spikes get more and more distant.
In terms of the probability of being optimal, we can see that the G sampling method converges faster than the time-step sampling method (note: both ERegret and TRegret use the same method, time-step sampling, for omega, so they have exactly the same curves in the plot). The time-step sampling method for computing omega is slower than the G sampling method because it takes the entire history of arm selections into account, which contains many non-optimal choices from the beginning. The G sampling method, on the other hand, relies on the current posterior distributions and hence shows updated results much more quickly, without taking into account the uninformative data from the early steps. So this should be the method of choice for calculating the omegas. Also, a nice property of the G sampling method is that it can be implemented independently of the arm selection logic and used merely to compute optimality probabilities.
# TESTS
rm = RegretMethod("rm")
vals = [1, 2, 3, 4, 5]
avgs = []
for val in vals:
avgs.append(rm.calc_step_average(avgs, val))
assert avgs == [1, 1.5, 2, 2.5, 3.0]
avgs
[1, 1.5, 2.0, 2.5, 3.0]
References¶
Sullivan, Lisa M. (n.d.). Power and Sample Size Determination. Boston University School of Public Health (teaching module).
PDF copy accessed Jan 26, 2026: https://paragstatistics.wordpress.com/wp-content/uploads/2017/01/sample-size.pdf
Original BU OTLT URL (now redirects): https://sphweb.bumc.bu.edu/otlt/mph-modules/bs/bs704_power/bs704_power_print.html
Johari, Ramesh, Leo Pekelis, and David Walsh (2015). “Always valid inference: Bringing sequential analysis to A/B testing”.
Johari, Ramesh et al. (2017). “Peeking at A/B Tests: Why It Matters, and What to Do About It”. KDD ’17. Halifax, NS, Canada: ACM, pp. 1517–1525.
Robbins, Herbert (1970). “Statistical Methods Related to the Law of the Iterated Logarithm”. The Annals of Mathematical Statistics 41.5, pp. 1397–1409.
Wald, A. (1945). "Sequential Tests of Statistical Hypotheses". The Annals of Mathematical Statistics 16.2, pp. 117–186.
Stenberg, Erik (2019). "Sequential A/B Testing Using Pre-Experiment Data". Master's Thesis, Uppsala Universitet.
Scott, S. (2015). "Multi-armed bandit experiments in the online service economy". Applied Stochastic Models in Business and Industry 31(1), Special Issue: Actual Impact and Future Perspectives on Stochastic Modelling in Business and Industry.