hypothesis testing
because they do not actually explain what a p-value is in intro statistics classes
It is my considered opinion as a mathematician that I learned the fundamentals of statistics much better from my Privacy and Fairness class than I did from AP Statistics, or than Roofon did from her college statistics class. The latter two explained how to perform a bunch of specific hypothesis tests, but only in my Privacy and Fairness class did I learn what a hypothesis test actually is. I figured it was worth writing out this knowledge so I can solidify it a little better.
Apologies to any actual statisticians in my audience.
You can probably skip this part if “a random variable is a function from a sample space $\Omega$ to some nice space like a finite set or $\mathbb{R}$ or whatever” is good enough for you.
A probability space is a measure space $(\Omega, P)$ such that $P(\Omega) = 1$. The set $\Omega$ is the sample space, and contains possible samples $\omega$. An event $E \subseteq \Omega$ is any measurable subset of $\Omega$. We will denote the set of measurable subsets of $\Omega$ with $\varphi(\Omega)$.[^1] The measure $P$, a function $P : \varphi(\Omega) \to [0, 1]$, is the probability measure, and determines how probable any given sample $\omega \in \Omega$ or event $E \subseteq \Omega$ is. Sometimes we write $P[\,]$ instead of $P(\,)$ if it looks nicer.
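For a concrete example: a fair six-sided die is the probability space with $\Omega = \{1, 2, 3, 4, 5, 6\}$, $\varphi(\Omega)$ the full power set of $\Omega$, and $P(E) = |E| / 6$. The event "the roll is even" is $E = \{2, 4, 6\}$, which has $P(E) = 1/2$.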
A measurable function is a function such that the preïmages of measurable sets are measurable.
Let $\mathcal{X}$ be a measure space.[^2] An $\mathcal{X}$-valued random variable is a measurable function $X : \Omega \to \mathcal{X}$. Common choices of $\mathcal{X}$ are finite sets (like $\{\text{Heads}, \text{Tails}\}$ or $\{1, 2, 3, 4, 5, 6\}$), $\mathbb{R}$, or intervals like $[0, 1]$. A random variable induces a measure on $\mathcal{X}$, which we also write $P$, given by

$$P(A) = P(\{\omega \in \Omega : X(\omega) \in A\})$$

for any measurable $A \subseteq \mathcal{X}$.
If $\mathcal{X}$ is, for example, $\mathbb{R}$, we often write stuff like

$$P(X > 3) = P(\{x \in \mathbb{R} : x > 3\}),$$
or in general, $P(\psi(X)) = P(\{x \in \mathcal{X} : \psi(x)\})$ for any (measurable) predicate $\psi : \mathcal{X} \to \{0, 1\}$. This notation can be seen as sort of like conflating a function with the output of that function—similar to how you might think of the formula “$\sin(x^2)$” as referring to the function $f$ such that $f(x) = \sin(x^2)$.
Suppose we have a random variable $X : \Omega \to \mathcal{X}$, and we want to use hypothesis testing to learn about $X$. Often, the random variable is a sequence of individual random variables $X = (X_1, X_2, X_3, \ldots, X_n)$, and so $X$ takes on values $x = (x_1, x_2, x_3, \ldots, x_n)$ in $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{X}_3 \times \cdots \times \mathcal{X}_n$. In the common case that each $X_i$ is real-valued, we have $\mathcal{X} = \mathbb{R}^n$.
A hypothesis is a proposition about $X$. Propositions about $X$ can be given by predicates on the set of $\mathcal{X}$-valued random variables. That is, a hypothesis is a function

$$H : \{\text{$\mathcal{X}$-valued random variables}\} \to \{0, 1\}.$$
The null hypothesis is some specific hypothesis $H_0$.
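For example, "the coordinates of $X$ are i.i.d. fair coin flips" is a hypothesis: it maps a random variable $X$ to 1 exactly when the measure $X$ induces on $\{0, 1\}^n$ is the uniform one.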
A test statistic is a measurable function $T : \mathcal{X} \to Y$,[^3] where $Y$ is a measurable ordered[^4] set such as $\mathbb{R}$. We say that $x$ is a more extreme outcome than $x'$ (according to $T$) when $T(x) > T(x')$.
The p-value of a sample $x$ under a test-statistic $T$ is given by

$$p(x) = P[T(X_0) \ge T(x)],$$
where $X_0$ is a random variable compatible with $H_0$. You want to choose $H_0$ and $T$ such that which $X_0$ you use does not matter; $H_0$ should be enough information to compute a p-value.
In English, the p-value of a sample $x$ is the probability of an outcome at least as extreme as $x$, assuming that the null hypothesis holds.
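If you can sample from the null distribution, you can always estimate a p-value by brute force, even when no closed form is convenient. A minimal sketch (the function name and the coin-flip numbers here are invented for illustration):

```python
import random

def monte_carlo_p_value(x, sample_null, T, trials=100_000):
    """Estimate P[T(X0) >= T(x)] by drawing many samples from the
    null distribution and counting how often the test statistic is
    at least as extreme as the observed one."""
    t_observed = T(x)
    hits = sum(T(sample_null()) >= t_observed for _ in range(trials))
    return hits / trials

# Illustration: 100 fair-coin flips as the null, an observed sample
# with 62 heads, and "number of heads" as the test statistic.
observed = [1] * 62 + [0] * 38
sample_null = lambda: [random.random() < 0.5 for _ in range(100)]
print(monte_carlo_p_value(observed, sample_null, T=sum))  # ≈ 0.01
```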
For a given alternative hypothesis $H_A$ and significance level $\alpha$, the power of a hypothesis test is the probability, assuming $H_A$, that $p < \alpha$. That is, let $y_\alpha$ be the largest element of $Y$ such that

$$P[T(X_0) \ge y_\alpha] \ge \alpha$$
under the null hypothesis. Then, the power is given by

$$P[T(X_A) > y_\alpha],$$

where $X_A$ is a random variable compatible with $H_A$.
You generally want your hypothesis test to be as powerful as possible given a particular value of $\alpha$.
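Power is also easy to estimate by simulation when both $H_0$ and $H_A$ pin down distributions you can sample from. A sketch under the same invented conventions as above (the empirical quantile is only an approximation of $y_\alpha$):

```python
import random

def estimate_power(sample_null, sample_alt, T, alpha=0.05, trials=100_000):
    """Estimate the power of the test based on T at level alpha:
    approximate the rejection threshold y_alpha from null samples,
    then count how often an alternative sample lands beyond it."""
    null_stats = sorted(T(sample_null()) for _ in range(trials))
    # Approximate y_alpha, the largest y with P[T(X0) >= y] >= alpha,
    # by the empirical (1 - alpha) quantile of the null statistics.
    y_alpha = null_stats[int((1 - alpha) * trials)]
    rejections = sum(T(sample_alt()) > y_alpha for _ in range(trials))
    return rejections / trials

# Illustration: null = fair coin, alternative = a coin that lands
# heads 60% of the time, 100 flips, T = number of heads.
n = 100
sample_null = lambda: [random.random() < 0.5 for _ in range(n)]
sample_alt = lambda: [random.random() < 0.6 for _ in range(n)]
print(estimate_power(sample_null, sample_alt, T=sum))  # ≈ 0.6
```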
Example
Suppose that a gamer, Dream, might be cheating in a Minecraft speedrun. Specifically, you have a livestream VOD, and you want to check if Dream’s blaze rod drop rate has been altered.
The null hypothesis is that the drop rate is normal, 1/2. More specifically, it is that the blaze rod drops are some Bernoulli process $X_0$ where each trial has probability 1/2.[^5]
The sample $x = (x_1, x_2, x_3, \ldots, x_n)$ is a boolean sequence where $x_i = 1$ if the $i$th killed blaze dropped a rod, and 0 if it did not.
The test-statistic $T : \mathbf{2}^n \to \mathbb{N}$ is given by $T(x) = \sum x_i$, and measures the number of dropped blaze rods.
The p-value is given by

$$p(x) = P[T(X_0) \ge T(x)] = \sum_{k = T(x)}^{n} \binom{n}{k} 2^{-n}.$$
If $p$ is sufficiently small, you should be suspicious of the null hypothesis.
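For this particular null hypothesis the p-value has the closed form above, so no simulation is needed. A sketch, with the drop counts invented for illustration (they are not Dream's actual numbers):

```python
from math import comb

def blaze_rod_p_value(rods, blazes, q=0.5):
    """Exact one-sided p-value: the probability of at least `rods`
    successes out of `blazes` Bernoulli(q) trials."""
    return sum(comb(blazes, k) * q**k * (1 - q)**(blazes - k)
               for k in range(rods, blazes + 1))

# Made-up numbers: 42 rods dropped out of 60 blazes killed.
print(blaze_rod_p_value(42, 60))  # ≈ 0.0013: quite suspicious
```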
You can use literally whatever the hell test-statistic you want,[^6] so long as you know how to correctly compute a p-value given a null hypothesis. The specific hypothesis tests they tell you about in school are just a collection of convenient ones for commonly occurring types of random variables and null hypotheses.
[^1]: Because φ is kind of like p, and φ(Ω) is kind of like the power set of Ω, and I don't feel like assigning every event space a whole-ass calligraphic letter.

[^2]: Technically, it only needs to be a measurable space, because we don't need to assume it has a measure yet—just that it has like, a nice topology or some other structure that lets us talk about which subsets of it are measurable.

[^3]: Usually you want a function $T$ that's defined regardless of the sample size $n$, so really its domain can be a superset of $\mathcal{X}$.

[^4]: Probably a total preörder is good enough? I would find it amusing to try to do hypothesis testing with only a partial order. Slightly trollish definition that might work: a one-sided test-statistic is a measurable function $T : \mathcal{X} \to \mathbb{R}$, where $\mathbb{R}$ is given the partial order where $0 < x$ for all $x$, positive and negative numbers are incomparable to each other, and $x < x'$ if $|x| < |x'|$ when $x$ and $x'$ have the same sign. The corresponding two-sided test-statistic uses the order where $x \le x'$ whenever $|x| \le |x'|$, regardless of sign.

[^5]: We, in fact, know that the null hypothesis is false. This is why we must adjust for putative Shifty Sams and sample biases and so forth.

[^6]: Though you ought to choose your test-statistic before you observe $x$, or if that fails maybe make an effort to pick a Schelling hypothesis test—broadly, the point is to avoid giving yourself many degrees of freedom for p-hacking.