P-Values and Statistical Significance: A Plain-English Guide

If you've ever read a news headline announcing that "scientists have proven" some new drug works, or that a certain food "significantly" reduces cancer risk, there's a good chance a p-value was involved. And there's also a reasonable chance the headline was misleading you.

The p-value is one of the most widely used — and most widely misunderstood — tools in all of science. It shows up in clinical trials, psychology papers, economics research, A/B testing dashboards, and genomics studies. Entire careers have been built on getting p < 0.05. Entire careers have also been derailed when results that cleared that threshold turned out to be noise.

So let's get into what a p-value actually measures, why the 0.05 cutoff exists and whether it should, and how statistical significance gets abused in ways that produce genuinely bad science.

What a P-Value Actually Measures

Here's the definition that statistics textbooks give you: a p-value is the probability of observing results at least as extreme as the ones you got, assuming the null hypothesis is true.

That sentence is doing a lot of work, so let's unpack it carefully.

The null hypothesis is the default assumption of no effect, no difference, nothing interesting happening. If you're testing whether a new blood pressure medication works better than a placebo, the null hypothesis says: there is no difference in blood pressure outcomes between the two groups.

Now you run your experiment. You collect data. You find that the medication group has, on average, a 6 mmHg lower systolic blood pressure than the placebo group. You compute a p-value of 0.03.

What does 0.03 mean? It means: if the null hypothesis were true — if the medication genuinely had zero effect — there would be only a 3% chance of seeing a difference as large as 6 mmHg (or larger) just due to random variation in your sample.

That's all it means. It is not the probability that the null hypothesis is true. It is not the probability that your result is a false alarm. It is not a measure of effect size or practical importance.

This distinction matters enormously, and it's where most popular science reporting goes wrong. When a reporter writes "there's only a 3% chance this is a coincidence," they're describing something fundamentally different from what a p-value actually says.
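
To make the definition concrete, here is a minimal sketch of a permutation test. Shuffling the group labels forces the null hypothesis to be true, and the p-value is the share of shuffles that produce a difference at least as large as the observed one. The blood pressure readings are invented for illustration and do not come from any trial, and the function name `permutation_p_value` is hypothetical.

```python
import random

def permutation_p_value(treated, control, n_shuffles=10_000, seed=0):
    """One-sided p-value: share of label shuffles whose mean difference
    (control minus treated) is at least as large as the observed one."""
    rng = random.Random(seed)
    observed = sum(control) / len(control) - sum(treated) / len(treated)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    n_control = len(control)
    hits = 0
    for _ in range(n_shuffles):
        # Reassign labels at random: under the null, labels carry no information.
        rng.shuffle(pooled)
        diff = sum(pooled[n_treated:]) / n_control - sum(pooled[:n_treated]) / n_treated
        if diff >= observed:
            hits += 1
    return observed, hits / n_shuffles

# Hypothetical systolic readings (mmHg), invented for this sketch.
treated = [128, 131, 125, 135, 122, 130, 127, 133, 126, 129]
control = [134, 138, 129, 141, 131, 137, 133, 140, 130, 136]

observed, p = permutation_p_value(treated, control)
print(f"observed difference: {observed:.1f} mmHg, permutation p-value: {p:.4f}")
```

Note what the loop never does: it never asks how likely the null hypothesis is. It only asks how often a world with no effect produces data this extreme.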

The Machinery Behind the Number

To compute a p-value, you need a test statistic — something that summarizes the signal in your data relative to the noise. Common ones include the t-statistic for comparing means, the chi-square statistic for categorical data, and the F-statistic in regression and ANOVA.

Each of these test statistics has a known probability distribution under the null hypothesis. The p-value is just the area in the tail of that distribution beyond your observed value. You're asking: if sampling noise alone produced this distribution of outcomes, how far out in the tail does my result sit?

A simple worked example: suppose you flip a coin 100 times and get 60 heads. You want to test whether the coin is fair. Under the null hypothesis (a fair coin, with probability of heads equal to 0.5), the number of heads follows a binomial distribution. The one-sided p-value, meaning the probability of 60 or more heads from a fair coin, is roughly 0.028. That's a reasonably small probability. But it doesn't tell you the coin is rigged; it tells you that if the coin were fair, this outcome would be somewhat unusual.
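
The coin-flip figure can be checked by summing the binomial tail exactly. This is a small sketch using only the standard library. The helper name `binomial_upper_tail` is made up for this example, and the doubling step relies on the fair-coin distribution being symmetric.

```python
from math import comb

def binomial_upper_tail(n, k, prob=0.5):
    """P(X >= k) for X ~ Binomial(n, prob), summed exactly."""
    return sum(comb(n, i) * prob**i * (1 - prob) ** (n - i) for i in range(k, n + 1))

# One-sided: 60 or more heads in 100 flips of a fair coin.
one_sided = binomial_upper_tail(100, 60)

# Two-sided: 60+ heads or 40- heads. A fair coin's distribution is
# symmetric, so this is twice the one-sided value.
two_sided = 2 * one_sided

print(f"one-sided: {one_sided:.4f}, two-sided: {two_sided:.4f}")
```

The one-sided value comes out near 0.028 and the two-sided value near 0.057, so the same data can land on either side of 0.05 depending on which tail question you ask. That choice needs to be made before looking at the data.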

Where 0.05 Came From — and Why It Became a Magic Number

The 0.05 threshold traces back to Ronald Fisher's 1925 book Statistical Methods for Research Workers. Fisher suggested that a result occurring less than 1 in 20 times by chance was worth paying attention to. He was explicit that this was a rough guideline, a starting point for scientific judgment — not a hard rule.

Somehow, over the following decades, that rough guideline calcified into a binary gate. Either p < 0.05 (published, significant, real) or p ≥ 0.05 (null result, drawer file, forgotten). The journal system reinforced this. Positive results get published; null results don't. Researchers learned, consciously or not, which side of the threshold they needed to land on.

The actual choice of 0.05 is nearly arbitrary. Why not 0.04? Why not 0.10? Different fields use different standards. Drug regulators typically ask for more than a single p < 0.05 result, for example two independent trials that each clear the threshold, because the consequences of approving an ineffective treatment are serious. Particle physics requires what's called 5-sigma, roughly p < 0.0000003, before declaring a discovery. Meanwhile, some fields in social science routinely publish results at p = 0.049 as though this represents robust evidence.
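
For readers who want to see how "sigma" thresholds map onto p-values, here is a short sketch that converts a number of standard deviations into a one-sided tail area of the normal distribution. The function name `one_sided_p` is illustrative, and the conversion assumes the test statistic is normally distributed under the null.

```python
from math import erfc, sqrt

def one_sided_p(sigma):
    """Upper-tail area of a standard normal beyond `sigma` standard deviations."""
    return 0.5 * erfc(sigma / sqrt(2))

for sigma in (1.645, 2.0, 3.0, 5.0):
    print(f"{sigma:>5} sigma -> one-sided p = {one_sided_p(sigma):.2e}")
```

On this scale, 0.05 sits at about 1.6 sigma, while 5 sigma is roughly 3 in 10 million.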

The Replication Crisis and What It Revealed

Starting around 2011, something uncomfortable happened to psychology and social science: researchers started trying to replicate famous experiments, and a disturbing fraction of them failed. The Reproducibility Project in 2015 attempted to replicate 100 psychology studies and found that only about 36% produced a statistically significant result the second time around.

This wasn't because the original researchers were fraudulent (though some were). It was largely because the p-value machinery had been quietly abused in ways that inflated false positive rates.

The main culprit is called p-hacking, which exploits what methodologists call researcher degrees of freedom or the garden of forking paths. It works like this: you run an experiment, collect data, check for significance, and don't find it. Then you try analyzing the data a slightly different way: exclude a few outliers, change the comparison group, add a covariate, switch from a two-tailed to a one-tailed test, collect 20 more participants. Eventually you find a combination that yields p < 0.05, and you write the paper as though that was your analysis plan all along.

Every additional analysis you run inflates your false positive rate. If you perform 20 independent tests on pure noise, you'd expect about one of them to clear p < 0.05 on average, and the chance that at least one does is roughly 64% (1 - 0.95^20). The 0.05 threshold assumes you ran one pre-specified test. When you run many flexible analyses, the actual false positive rate can be far higher than 5%.
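
The arithmetic behind that inflation is easy to sketch. The code below computes the chance of at least one false positive across several independent tests, then checks it with a simulation. Both function names are hypothetical. The simulation assumes the tests are independent and that p-values are uniformly distributed when the null is true, which holds for well-calibrated continuous tests. Real p-hacked analyses reuse the same data, so they are correlated and the inflation will differ.

```python
import random

def chance_of_false_positive(n_tests, alpha=0.05):
    """P(at least one p < alpha) across n independent tests when every null is true."""
    return 1 - (1 - alpha) ** n_tests

def simulate_false_positive_rate(n_tests, alpha=0.05, n_studies=20_000, seed=0):
    """Fraction of simulated null 'studies' with at least one significant test."""
    rng = random.Random(seed)
    hits = sum(
        # Assumption: under a true null, each p-value is uniform on [0, 1).
        any(rng.random() < alpha for _ in range(n_tests))
        for _ in range(n_studies)
    )
    return hits / n_studies

for n in (1, 5, 20):
    exact = chance_of_false_positive(n)
    simulated = simulate_false_positive_rate(n)
    print(f"{n:>2} tests: formula {exact:.3f}, simulated {simulated:.3f}")
```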

Related to this is publication bias: the systematic tendency for significant results to get published and null results to get buried. This means the published literature is a skewed sample of all the research that was actually run. Meta-analyses built on that literature can reach confident-sounding conclusions that rest on a distorted evidence base.

Statistical Significance vs. Practical Importance

There's a second failure mode that doesn't get as much attention: the conflation of statistical significance with practical significance.

With a large enough sample, you can detect differences so small they're completely meaningless. Suppose you have data on 500,000 people and you find that people who eat breakfast at home score, on average, 0.3 IQ points higher than people who skip breakfast. With that sample size, this trivial difference might yield p < 0.0001. It's statistically significant in the technical sense: a gap that size is very unlikely to come from sampling noise alone. But a 0.3-point IQ difference is well within measurement error and has no practical consequence whatsoever.

The right tool for communicating practical importance is effect size — Cohen's d for mean differences, odds ratios in medical research, R² in regression. These tell you how large an effect is, not just whether it exists. A p-value of 0.001 with a tiny effect size should be treated very differently from a p-value of 0.04 with a large effect size.
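
The breakfast example can be put into rough numbers. This sketch assumes things the example does not state: an even split of 250,000 people per group, a standard deviation of 15 IQ points in both groups, and a normal approximation for the test statistic. The function name `two_sample_z` is illustrative.

```python
from math import erfc, sqrt

def two_sample_z(diff, sd, n_per_group):
    """z statistic, two-sided p-value, and Cohen's d for a difference in means,
    assuming equal group sizes and a common, known standard deviation."""
    se = sd * sqrt(2 / n_per_group)   # standard error of the difference
    z = diff / se
    p = erfc(abs(z) / sqrt(2))        # two-sided normal tail area
    d = diff / sd                     # Cohen's d: difference in SD units
    return z, p, d

# Assumed inputs: 0.3-point gap, SD of 15, 250,000 people per group.
z, p, d = two_sample_z(diff=0.3, sd=15, n_per_group=250_000)
print(f"z = {z:.2f}, p = {p:.1e}, Cohen's d = {d:.3f}")
```

Under these assumptions the p-value is far below 0.0001 while Cohen's d is 0.02, a tenth of the 0.2 that is conventionally labeled a "small" effect. The p-value and the effect size answer different questions.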

Good statistics reporting always shows both. Bad statistics reporting — especially in press releases and news articles — shows only the p-value (if it shows anything at all) and lets readers assume that "significant" means "large and important."

The ASA Statement and What Comes Next

In 2016, the American Statistical Association did something unusual: it issued a formal statement about p-values, warning against the mechanical application of the 0.05 threshold. In 2019, an editorial in The American Statistician went further, arguing for the abandonment of the phrase "statistical significance" altogether.

The proposals that have gained traction include pre-registration (locking in your analysis plan before collecting data, so p-hacking becomes much harder to do unnoticed), reporting confidence intervals instead of just p-values, adopting a threshold of 0.005 for "significance" and relabeling results between 0.005 and 0.05 as "suggestive," and focusing on effect sizes and their uncertainty.

None of these have fully solved the problem because they depend on researchers, journals, and funding bodies all cooperating — and the incentives in academic publishing still tend to reward novelty and clean results over careful, uncertain science.

How to Read a P-Value Without Being Fooled

When you encounter a p-value in a paper or a news story, here's a practical mental checklist:

  • Was the analysis pre-registered? If yes, the p-value means more. If no, ask whether flexibility in the analysis could have inflated it.
  • What's the effect size? A p-value without an effect size is missing half the information you need.
  • What's the sample size? Tiny samples with p < 0.05 are fragile. Huge samples with p < 0.05 may be detecting effects too small to care about.
  • Has it been replicated? A single significant result is evidence, not proof. Independent replication is the real test.
  • What's the prior plausibility? Bayes' theorem matters here. Even a p = 0.01 result means something very different when testing a well-established mechanism vs. testing a claim that violates basic physics.
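
The last item on the checklist can be sketched with Bayes' theorem. The function below estimates how likely an effect is to be real given that a test came out significant. The inputs are assumptions chosen for illustration: 80% power, a 0.05 threshold, and three made-up prior probabilities. The function name `prob_effect_is_real` is hypothetical.

```python
def prob_effect_is_real(prior, power=0.8, alpha=0.05):
    """Bayes' theorem: P(effect is real | test came out significant).

    prior: assumed probability the effect is real before seeing the data
    power: assumed probability of a significant result if the effect is real
    alpha: significance threshold (false positive rate when there is no effect)
    """
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

for prior in (0.5, 0.1, 0.01):
    posterior = prob_effect_is_real(prior)
    print(f"prior {prior:.2f} -> P(real | significant) = {posterior:.2f}")
```

With these assumed inputs, a significant result for a coin-flip hypothesis is real about 94% of the time, while the same result for a 1-in-100 long shot is real only about 14% of the time.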

The p-value is not a villain. It's a reasonable tool being used in ways that exceed what it was designed to do. Understanding what it measures — and equally, what it doesn't — is the difference between reading research critically and being manipulated by statistics you can't interrogate.

That 0.05 threshold is not a wall between truth and noise. It's a signpost that says: "something worth looking at might be here." What comes after that signpost — replication, effect size estimation, mechanistic understanding, real-world testing — is where actual scientific knowledge gets built.