hypothesis testing
because they do not actually explain what a p-value is in intro statistics classes
It is my considered opinion as a mathematician that I learned the fundamentals of statistics much better from my Privacy and Fairness class than I did from AP Statistics, or than Roofon did from her college statistics class. The latter two explained how to perform a bunch of specific hypothesis tests, but only in my Privacy and Fairness class did I learn what a hypothesis test actually is. I figured it was worth writing out this knowledge so I can solidify it a little better.
Apologies to any actual statisticians in my audience.
You can probably skip this part if “a random variable is a function from a sample space $\Omega$ to some nice space like a finite set or $\mathbb{R}$ or whatever” is good enough for you.
A probability space is a measure space $(\Omega, P)$ such that $P(\Omega) = 1$. The set $\Omega$ is the sample space, and contains possible samples $\omega$. An event $E \subseteq \Omega$ is any measurable subset of $\Omega$. We will denote the set of measurable subsets of $\Omega$ with $\varphi(\Omega)$.[^1] The measure $P$, a function $P : \varphi(\Omega) \to [0, 1]$, is the probability measure, and determines how probable any given sample $\omega \in \Omega$ or event $E \subseteq \Omega$ is. Sometimes we write $P[\,]$ instead of $P(\,)$ if it looks nicer.
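For a concrete example: a fair six-sided die is the probability space with $\Omega = \{1, 2, 3, 4, 5, 6\}$, $\varphi(\Omega)$ the full power set of $\Omega$, and $P(E) = |E| / 6$. The event "the roll is even" is $E = \{2, 4, 6\}$, which has $P(E) = 1/2$.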
A measurable function is a function such that the preïmages of measurable sets are measurable.
Let $\mathcal{X}$ be a measure space.[^2] An $\mathcal{X}$-valued random variable is a measurable function $X : \Omega \to \mathcal{X}$. Common choices of $\mathcal{X}$ are finite sets (like $\{\text{Heads}, \text{Tails}\}$ or $\{1, 2, 3, 4, 5, 6\}$), $\mathbb{R}$, or intervals like $[0, 1]$. A random variable induces a measure on $\mathcal{X}$, which we also write $P$, given by

$$P(A) = P(\{\omega \in \Omega : X(\omega) \in A\})$$

for any measurable $A \subseteq \mathcal{X}$.
If $\mathcal{X}$ is, for example, $\mathbb{R}$, we often write stuff like

$$P(X > 3) = P(\{x \in \mathbb{R} : x > 3\}),$$
or in general, $P(\psi(X)) = P(\{x \in \mathcal{X} : \psi(x)\})$ for any (measurable) predicate $\psi : \mathcal{X} \to \{0, 1\}$. This notation can be seen as sort of like conflating a function with the output of that function—similar to how you might think of the formula “$\sin(x^2)$” as referring to the function $f$ such that $f(x) = \sin(x^2)$.
Suppose we have a random variable $X : \Omega \to \mathcal{X}$, and we want to use hypothesis testing to learn about $X$. Often, the random variable is a sequence of individual random variables $X = (X_1, X_2, X_3, \ldots, X_n)$, and so $X$ takes on values $x = (x_1, x_2, x_3, \ldots, x_n)$ in $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \mathcal{X}_3 \times \cdots \times \mathcal{X}_n$. In the common case that each $X_i$ is real-valued, we have $\mathcal{X} = \mathbb{R}^n$.
A hypothesis is a proposition about $X$. Propositions about $X$ can be given by predicates on the set of $\mathcal{X}$-valued random variables. That is, a hypothesis is a function

$$H : \{\text{$\mathcal{X}$-valued random variables}\} \to \{0, 1\}.$$
The null hypothesis is some specific hypothesis $H_0$.
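For example, "the coordinates of $X$ are i.i.d. fair coin flips" is a hypothesis: it maps a random variable $X$ to 1 exactly when the measure $X$ induces on $\{0, 1\}^n$ is the uniform one.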
A test statistic is a measurable function $T : \mathcal{X} \to Y$,[^3] where $Y$ is a measurable ordered[^4] set such as $\mathbb{R}$. We say that $x$ is a more extreme outcome than $x'$ (according to $T$) when $T(x) > T(x')$.
The p-value of a sample $x$ under a test-statistic $T$ is given by

$$p(x) = P[T(X_0) \ge T(x)],$$
where $X_0$ is a random variable compatible with $H_0$. You want to choose $H_0$ and $T$ such that which $X_0$ you use does not matter; $H_0$ should be enough information to compute a p-value.
In English, the p-value of a sample $x$ is the probability of an outcome at least as extreme as $x$, assuming that the null hypothesis holds.
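If you can sample from the null distribution, you can always estimate a p-value by brute force, even when no closed form is convenient. A minimal sketch (the function name and the coin-flip numbers here are invented for illustration):

```python
import random

def monte_carlo_p_value(x, sample_null, T, trials=100_000):
    """Estimate P[T(X0) >= T(x)] by drawing many samples from the
    null distribution and counting how often the test statistic is
    at least as extreme as the observed one."""
    t_observed = T(x)
    hits = sum(T(sample_null()) >= t_observed for _ in range(trials))
    return hits / trials

# Illustration: 100 fair-coin flips as the null, an observed sample
# with 62 heads, and "number of heads" as the test statistic.
observed = [1] * 62 + [0] * 38
sample_null = lambda: [random.random() < 0.5 for _ in range(100)]
print(monte_carlo_p_value(observed, sample_null, T=sum))  # ≈ 0.01
```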
For a given alternative hypothesis $H_A$ and significance level $\alpha$, the power of a hypothesis test is the probability, assuming $H_A$, that $p < \alpha$. That is, let $y_\alpha$ be the largest element of $Y$ such that

$$P[T(X_0) \ge y_\alpha] \ge \alpha$$
under the null hypothesis. Then, the power is given by

$$P[T(X_A) > y_\alpha],$$

where $X_A$ is a random variable compatible with $H_A$.
You generally want your hypothesis test to be as powerful as possible given a particular value of $\alpha$.
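Power is also easy to estimate by simulation when both $H_0$ and $H_A$ pin down distributions you can sample from. A sketch under the same invented conventions as above (the empirical quantile is only an approximation of $y_\alpha$):

```python
import random

def estimate_power(sample_null, sample_alt, T, alpha=0.05, trials=100_000):
    """Estimate the power of the test based on T at level alpha:
    approximate the rejection threshold y_alpha from null samples,
    then count how often an alternative sample lands beyond it."""
    null_stats = sorted(T(sample_null()) for _ in range(trials))
    # Approximate y_alpha, the largest y with P[T(X0) >= y] >= alpha,
    # by the empirical (1 - alpha) quantile of the null statistics.
    y_alpha = null_stats[int((1 - alpha) * trials)]
    rejections = sum(T(sample_alt()) > y_alpha for _ in range(trials))
    return rejections / trials

# Illustration: null = fair coin, alternative = a coin that lands
# heads 60% of the time, 100 flips, T = number of heads.
n = 100
sample_null = lambda: [random.random() < 0.5 for _ in range(n)]
sample_alt = lambda: [random.random() < 0.6 for _ in range(n)]
print(estimate_power(sample_null, sample_alt, T=sum))  # ≈ 0.6
```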
Example
Suppose that a gamer, Dream, might be cheating in a Minecraft speedrun. Specifically, you have a livestream VOD, and you want to check if Dream’s blaze rod drop rate has been altered.
The null hypothesis is that the drop rate is normal, 1/2. More specifically, it is that the blaze rod drops are some Bernoulli process $X_0$ where each trial has probability 1/2.[^5]
The sample $x = (x_1, x_2, x_3, \ldots, x_n)$ is a boolean sequence where $x_i = 1$ if the $i$th killed blaze dropped a rod, and 0 if it did not.
The test-statistic $T : \mathbf{2}^n \to \mathbb{N}$ is given by $T(x) = \sum x_i$, and measures the number of dropped blaze rods.
The p-value is given by

$$p(x) = P[T(X_0) \ge T(x)] = \sum_{k = T(x)}^{n} \binom{n}{k} 2^{-n}.$$
If $p$ is sufficiently small, you should be suspicious of the null hypothesis.
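For this particular null hypothesis the p-value has the closed form above, so no simulation is needed. A sketch, with the drop counts invented for illustration (they are not Dream's actual numbers):

```python
from math import comb

def blaze_rod_p_value(rods, blazes, q=0.5):
    """Exact one-sided p-value: the probability of at least `rods`
    successes out of `blazes` Bernoulli(q) trials."""
    return sum(comb(blazes, k) * q**k * (1 - q)**(blazes - k)
               for k in range(rods, blazes + 1))

# Made-up numbers: 42 rods dropped out of 60 blazes killed.
print(blaze_rod_p_value(42, 60))  # ≈ 0.0013: quite suspicious
```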
You can use literally whatever the hell test-statistic you want,[^6] so long as you know how to correctly compute a p-value given a null hypothesis. The specific hypothesis tests they tell you about in school are just a collection of convenient ones for commonly occurring types of random variables and null hypotheses.
[^1]: Because φ is kind of like p, and φ(Ω) is kind of like the power set of Ω, and I don't feel like assigning every event space a whole-ass calligraphic letter.

[^2]: Technically, it only needs to be a measurable space, because we don't need to assume it has a measure yet—just that it has like, a nice topology or some other structure that lets us talk about which subsets of it are measurable.

[^3]: Usually you want a function $T$ that's defined regardless of the sample size $n$, so really its domain can be a superset of $\mathcal{X}$.

[^4]: Probably a total preörder is good enough? I would find it amusing to try to do hypothesis testing with only a partial order. Slightly trollish definition that might work: a one-sided test-statistic is a measurable function $T : \mathcal{X} \to \mathbb{R}$, where $\mathbb{R}$ is given the partial order where $0 < x$ for all $x$, positive and negative numbers are incomparable to each other, and $x < x'$ if $|x| < |x'|$ when $x$ and $x'$ have the same sign. The corresponding two-sided test-statistic uses the order where $x \le x'$ whenever $|x| \le |x'|$, regardless of sign.

[^5]: We, in fact, know that the null hypothesis is false. This is why we must adjust for putative Shifty Sams and sample biases and so forth.

[^6]: Though you ought to choose your test-statistic before you observe $x$, or if that fails maybe make an effort to pick a Schelling hypothesis test—broadly, the point is to avoid giving yourself many degrees of freedom for p-hacking.