“Is that statistically significant?”

This question sounds scientific and well-meaning, but is too often misguided and dangerous according to leading statisticians. The popular misconception that being *statistically significant* is the be-all and end-all of truth has many unintended consequences. The biggest problem is that often what appears to be fact is simply statistical noise, and what may well be fact is sadly ignored.

The problem is not so much with the statistical tools, but rather in how they are applied. You’ve probably seen data tables in which every column is compared to every other column, usually at a 95% confidence level. “Significant” differences are noted with a letter or number. Each table can have 100 or more tests. Researchers often scour the tables looking for “significant” differences and then try to explain the differences. This is where the trouble begins.

Most people are familiar with the idea that, at this threshold, about one in twenty tests of a difference that does not exist will yield a false positive, indicating there is a difference when there is none. But the reality is that the error rate is actually much higher, particularly when there are multiple comparisons involved and the tools are used in ways they were never intended.
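To make the multiple-comparisons point concrete, here is a minimal sketch of how quickly the chance of at least one false positive grows with the number of tests. It assumes the tests are independent and that no real differences exist, which is a simplification: column-by-column comparisons in a real data table share data and are not fully independent. The function name is ours, purely for illustration.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` tests.

    Assumes independent tests and that every null hypothesis is true
    (i.e. there are no real differences to find).
    """
    return 1.0 - (1.0 - alpha) ** num_tests


if __name__ == "__main__":
    for num_tests in (1, 20, 100):
        expected_false_positives = num_tests * 0.05
        print(
            f"{num_tests:>3} tests: "
            f"P(at least one false positive) = {familywise_error_rate(num_tests):.1%}, "
            f"expected false positives = {expected_false_positives:.1f}"
        )
```

Under these assumptions, a table with 100 tests is all but guaranteed to contain a few “significant” differences that are pure noise.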

“Most scientists would look at…(a) p-value of 0.01 and say that there was just a 1% chance of his result being a false alarm. But they would be wrong”, Regina Nuzzo wrote in Nature. “The p-value cannot say this: all it can do is summarize the data assuming a specific null hypothesis. It cannot work backward and make statements about the underlying reality. That requires another piece of information: the odds that a real effect was there in the first place.”

She goes on to say “According to one widely used calculation, a p-value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a p-value of 0.05 raises that chance to at least 29%.”
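One calculation that reproduces these figures is the lower bound on the Bayes factor, −e · p · ln(p), combined with even prior odds that a real effect exists. The sketch below uses that bound; that it is the exact calculation Nuzzo refers to is our assumption, as are the function and parameter names.

```python
import math


def min_false_alarm_probability(p_value: float, prior_true_effect: float = 0.5) -> float:
    """Lower bound on the probability that a 'significant' result is a false alarm.

    Uses the bound -e * p * ln(p) on the Bayes factor in favor of the null,
    which is only valid for p < 1/e. `prior_true_effect` is the assumed
    probability, before seeing the data, that a real effect exists.
    """
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the bound applies only for 0 < p < 1/e")
    if not 0.0 < prior_true_effect < 1.0:
        raise ValueError("prior_true_effect must be strictly between 0 and 1")

    bayes_factor_bound = -math.e * p_value * math.log(p_value)
    prior_odds_null = (1.0 - prior_true_effect) / prior_true_effect
    posterior_odds_null = bayes_factor_bound * prior_odds_null
    return posterior_odds_null / (1.0 + posterior_odds_null)


if __name__ == "__main__":
    for p in (0.01, 0.05):
        print(f"p = {p}: false-alarm probability of at least {min_false_alarm_probability(p):.0%}")
```

Lowering `prior_true_effect` (a long-shot hypothesis) pushes the false-alarm probability higher still, which is the point of the quote: the p-value alone cannot tell you this.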

## Replication crisis in academia

It has been the convention to use a 95% confidence level (p ≤ 0.05) as a marker of truth not only in the world of market research but in the scientific community in general. Journals are generally not interested in publishing studies in which the results are not “significant.” And academics need to publish to get promoted or even stay employed. The result has been a disaster for science.

There is plenty of evidence to suggest that many published papers have findings that cannot be reproduced because of a misuse of statistical testing. The journals Nature and Science are very prestigious, and academics clamor to have their papers accepted by them. But a study published in Nature entitled “Evaluating the replicability of social science experiments in Nature and Science between 2010 and 2015” reported they were able to replicate six in ten of them, and “the effect size of the replications is on average about 50% of the original effect size.”

Similarly, dismal measures of reproducibility and replicability have been reported in many other fields including psychology, economics, and medicine. One high profile casualty of a mindless focus on “significant findings” is American researcher and professor Brian Wansink.

His research focused on how people make food choices and he is the author of *Mindless Eating* and *Slim by Design*. His work popularized the ideas that plate size and color influence how much you eat and that 100 calorie packages reduce the amount overweight people eat. Many problems were later discovered with Wansink’s work, but what got people looking into his work was his uncritical use of statistical tests, or “p-hacking.” He was caught out after publicly encouraging graduate students and collaborators to troll through data sets looking for “statistically significant” findings, rather than following the scientific process of testing predetermined hypotheses.

According to Tim van der Zee, a researcher who investigated Wansink’s work, there are alleged problems with 52 of his publications, which have been cited over 4,000 times in 25 different journals and in 8 books. When evidence of his unscientific approach surfaced, his university suspended him from teaching and ultimately released him. Misuse of statistical testing can have dire effects.

As researchers we seek to inform and guide decision making. Uncritical use of stats testing can result in us misleading and misinforming instead. That’s a situation no one wants. So, what’s a researcher to do? Fortunately, the American Statistical Association (ASA) has some advice.

## On the use and abuse of significance testing

The ASA put out a formal statement on the use and misuse of p-values. They set forth “principles underlying the proper use and interpretation of the p-value.” They state, “The widespread use of ‘statistical significance’ (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process. A conclusion does not immediately become ‘true’ on one side of the divide and ‘false’ on the other.”

They suggest “Researchers should recognize that a p-value without context or other evidence provides limited information.” They counsel that researchers “should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis.” They conclude that good statistical practice emphasizes “understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.” Context is truly the antidote.

Statisticians are also warning against thinking too narrowly about what a p-value can reveal. A comment in Nature by Valentin Amrhein, Sander Greenland, Blake McShane, and more than 800 signatories points to the reduction of a p-value to a significant/not significant dichotomy as a big part of the problem. They write “we are not advocating a ban on p values, confidence intervals or other statistical measures — only that we should not treat them categorically. This includes dichotomization as statistically significant or not, as well as categorization based on other statistical measures such as Bayes factors.”

“One reason to avoid such ‘dichotomania’ is that all statistics, including p-values and confidence intervals, naturally vary from study to study, and often do so to a surprising degree. In fact, random variation alone can easily lead to large disparities in p values, far beyond falling just to either side of the 0.05 threshold. For example, even if researchers could conduct two perfect replication studies of some genuine effect, each with 80% power (chance) of achieving p < 0.05, it would not be very surprising for one to obtain p < 0.01 and the other p > 0.30.”
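For readers who want to explore how much p-values can move between replications, here is a rough sketch. It assumes each study is a two-sided z-test of the same genuine effect, designed for 80% power at the 0.05 threshold; the authors’ own calculation may rest on different assumptions, and the function name is ours.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def prob_p_below(threshold: float, power: float = 0.80, alpha: float = 0.05) -> float:
    """Probability that one study yields a two-sided p-value below `threshold`.

    Assumes a z-test whose true effect is sized so that the study has
    (approximately) the stated power at significance level `alpha`.
    """
    # True effect in standard-error units that gives the stated power.
    # This ignores the negligible chance of significance in the wrong direction.
    effect = STD_NORMAL.inv_cdf(1 - alpha / 2) + STD_NORMAL.inv_cdf(power)
    # The observed z-statistic is Normal(effect, 1); p < threshold when |z| > critical.
    critical = STD_NORMAL.inv_cdf(1 - threshold / 2)
    return 1 - STD_NORMAL.cdf(critical - effect) + STD_NORMAL.cdf(-critical - effect)


if __name__ == "__main__":
    print(f"P(p < 0.05) = {prob_p_below(0.05):.3f}")  # the design power
    print(f"P(p < 0.01) = {prob_p_below(0.01):.3f}")
    print(f"P(p > 0.30) = {1 - prob_p_below(0.30):.3f}")
```

Changing `power` shows how the spread of plausible p-values shifts with study design, even though the underlying effect is real in every case.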

They suggest “[t]he trouble is human and cognitive more than it is statistical: bucketing results into ‘statistically significant’ and ‘statistically non-significant’ makes people think that the items assigned in that way are categorically different.”

That kind of categorical thinking makes it easy to miss real differences and focus on false ones. A more nuanced approach to judging what is “significant” is clearly needed.

## How did we get into this mess?

If our reliance on over-simplified “significance” testing is a problem, how the heck did we end up here? Incredibly, the story involves two statisticians who despised each other and ended up with their ideas mashed together in an unholy alliance that neither approved of. And it all started with a cup of tea in 1920s England.

A group of academics had gathered for tea. One was Dr. Blanche Bristol who, when offered a cup of tea by a colleague, turned it down. The trouble was the man poured the tea into the cup first, then added the milk. Dr. Bristol rejected it because she preferred the milk to be poured into the cup first, and the tea afterward. The man who had poured the tea suggested she surely could not tell the difference. She insisted she could. The man, Dr. Ronald Aylmer Fisher, proposed a test, which he famously described in his book *The Design of Experiments*. He would prepare eight cups of tea: four with the tea poured first and four with the milk poured first. She had to guess which was which.

He proposed the null hypothesis that she would be unable to do that correctly. Fisher calculated that her chance of guessing all cups correctly was 1/70. He was provisionally willing to concede her ability (rejecting the null hypothesis) in this case only. She, reportedly, got them all correct. The null hypothesis was rejected. This was the beginning of significance testing.
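The 1/70 figure follows from counting: there are 70 ways to choose which four of the eight cups are “milk first,” and only one of them is entirely right. A minimal sketch of that arithmetic, assuming the taster knows there are four cups of each kind and is purely guessing (the function name is ours):

```python
from fractions import Fraction
from math import comb


def prob_milk_first_hits(hits: int, cups_per_kind: int = 4) -> Fraction:
    """Chance of correctly picking exactly `hits` milk-first cups by pure guessing.

    The taster selects `cups_per_kind` cups as "milk first" out of
    2 * cups_per_kind cups, half of which really are milk first.
    """
    total_selections = comb(2 * cups_per_kind, cups_per_kind)
    # Choose `hits` of the true milk-first cups and fill the rest of the
    # selection with tea-first cups.
    matching_selections = comb(cups_per_kind, hits) * comb(cups_per_kind, cups_per_kind - hits)
    return Fraction(matching_selections, total_selections)


if __name__ == "__main__":
    for hits in range(5):
        print(f"{hits} of 4 milk-first cups identified: {prob_milk_first_hits(hits)}")
```

Getting all four right by luck has probability 1/70, while three out of four happens by luck 16 times in 70, which is why only a perfect score would have counted as evidence.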

Meanwhile, two statisticians, Jerzy Neyman and Egon Pearson, were working on hypothesis testing: selecting among competing hypotheses based on the experimental evidence alone. Neyman suggested that hypothesis testing was an improvement on significance testing. That did not sit well with Fisher, who already disliked Neyman because Neyman had worked with Pearson’s father, Karl Pearson, with whom Fisher had a long-running disagreement. They battled over which way was better until Fisher’s death.

But in the meantime, something funny happened. Gerd Gigerenzer sums it up rather nicely in his (wryly caustic) paper *Statistical Rituals: The Replication Delusion and How We Got There*: “Early textbook writers struggled to create a supposedly objective method of statistical inference that would distinguish a cause from a chance in a mechanical way, eliminating judgment. The result was a shotgun wedding between some of Fisher’s ideas and those of his intellectual opponents, the Polish statistician Jerzy Neyman (1894–1981) and the British statistician Egon S. Pearson (1895–1980). The essence of this hybrid theory is the null ritual.”

He describes what he calls the “null ritual” this way:

“1. Set up a null hypothesis of ‘no mean difference’ or ‘zero correlation.’ Do not specify the predictions of your own research hypothesis.

2. Use 5% as a convention for rejecting the null hypothesis. If the test is significant, accept your research hypothesis. Report the test result as p < .05, p < .01, or p < .001, whichever level is met by the obtained p-value.

3. Always perform this procedure.”

“The null ritual does not exist in statistics proper”, Gigerenzer continues. “This point is not always understood; even its critics sometimes confuse it with Fisher’s theory of null-hypothesis testing and call it ‘null-hypothesis significance testing.’ In fact, the ritual is an incoherent mishmash of ideas from Fisher on the one hand and Neyman and Pearson on the other, spiked with a characteristically novel contribution: the elimination of researchers’ judgment.”

## The way forward

We find ourselves in a situation where uncritical use of a bastardized test has led to a “replication crisis” in science and the misuse and abuse of the notion of significance. We need to rethink how we use significance testing.

Firstly, we need to take a finding of “significance” with a grain of salt. The error rate is larger than is commonly assumed. Secondly, we need to stay away from data fishing, that is, trolling for “significant” differences. Thirdly, we need to understand that something which might not pass the significance test might still be meaningful. Fourthly, we need to consider context. As the ASA recommends, it is essential to take into account “understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”

So the next time someone asks, “Is that statistically significant?” we should consider whether that is the right question. A better question would be, “Are the differences meaningful?”

Asking this would steer us away from the trap of binary thinking about a pair of numbers, and guide us toward contextualizing a finding in a world rich with information.