Misusing Statistical Significance Tests Can End Your Career: A Cautionary Tale


Dr. Brian Wansink, renowned Cornell University professor and best-selling author of Mindless Eating and Slim by Design, recently resigned from Cornell in disgrace. Dr. Wansink had thirteen peer-reviewed papers retracted, including six retractions recently announced by the Journal of the American Medical Association (JAMA). What led to his downfall? Among other things, he encouraged co-authors to run statistical tests on relationships between a whole host of attributes and then comb through the results looking for statistically significant relationships.

Does that sound uncomfortably familiar? It should, because it is a common practice in market research. But it is an approach that is sure to generate misleading results.

Instead of randomly combing through tables looking for “significant differences,” researchers should decide which specific hypotheses they want to test before a study begins. Dr. Wansink, in contrast, was retrospectively generating hypotheses to fit data patterns that emerged after a study was over. This practice is known to statisticians as “p-hacking” or “data fishing.”

“Data fishing” refers to the practice of applying statistical significance tests to every row of figures in the data tables, between every possible pair of columns in the crosstab. This is often taught to researchers as a good way to spot significant differences between subgroups. Any results with a significant indicator are then written into the final report with confidence: “Yes, we tested it!”
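To make the mechanics concrete, here is a minimal sketch in Python of what that looks like. The crosstab figures, the question labels, and the two_prop_z_test helper are all invented for illustration; the point is simply that three questions and four subgroups already produce eighteen pairwise tests, any of which might get flagged.

```python
import math
from itertools import combinations

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test; returns the p-value."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value from the normal CDF

# Hypothetical crosstab: count answering "yes" to each question, by region.
crosstab = {
    "Aware of brand":      {"North": 112, "South": 126, "East": 103, "West": 118},
    "Purchased last year": {"North":  61, "South":  74, "East":  58, "West":  69},
    "Would recommend":     {"North":  88, "South":  97, "East":  82, "West":  95},
}
n = 200  # respondents per region

# "Data fishing": run a test on every row, for every possible pair of columns,
# and flag anything that clears the 0.05 threshold.
for question, counts in crosstab.items():
    for (g1, x1), (g2, x2) in combinations(counts.items(), 2):
        p = two_prop_z_test(x1, n, x2, n)
        flag = "  <-- 'significant'" if p < 0.05 else ""
        print(f"{question}: {g1} vs {g2}, p = {p:.3f}{flag}")
```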

What’s wrong with that? Just this: there’s a very high probability that one or more of those significant indicators are wrong. Statistical significance testing, when done right, is a form of hypothesis testing. The researcher formulates a theory about differences in key subgroup populations. A study is designed to collect the necessary data around the hypothesis. When the expected differences in the data are observed, a test of statistical significance is carried out.

In market research, since the data is collected through samples, a statistical significance test is used to assess whether the differences observed in the data are due to a true difference in the population or if they are simply caused by random sampling error. Statisticians, being a cagey lot, will never give you a definite answer. Instead, their answers come in the form of probabilities. They may tell you that the difference is statistically significant with a confidence level of 95%. What this implies (but is usually left unstated) is that if there were, in fact, no real difference in the population, a difference this large would still show up about 5% of the time through sampling error alone. If you go data fishing and run hundreds or thousands of these tests, roughly 5% of the ones where no true difference exists will come back “significant” purely by chance, and nothing in the tables tells you which ones those are.
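That 5% figure is easy to verify with a short simulation. The sketch below (plain Python, with arbitrary sample sizes and response rates chosen purely for illustration, reusing the same two-proportion test helper as above) repeatedly draws two subgroups from the same population and tests them against each other; at the 95% confidence level, roughly one test in twenty comes back “significant” even though no real difference exists.

```python
import math
import random

def two_prop_z_test(x1, n1, x2, n2):
    """Two-sided two-proportion z-test; returns the p-value."""
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    return math.erfc(abs(z) / math.sqrt(2))

random.seed(1)
n, true_rate, trials = 300, 0.40, 2000
false_positives = 0

for _ in range(trials):
    # Both "subgroups" are drawn from the SAME population, so any
    # "significant" difference is pure sampling noise.
    x1 = sum(random.random() < true_rate for _ in range(n))
    x2 = sum(random.random() < true_rate for _ in range(n))
    if two_prop_z_test(x1, n, x2, n) < 0.05:
        false_positives += 1

print(f"{false_positives / trials:.1%} of tests flagged a 'difference'")  # roughly 5%
```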

So, what is a person to do? The appropriate approach is to develop hypotheses first (“I think people in the South spend more time fishing than those in the North”) and then test only those theories. These hypotheses should be based on other research, observations, ideas or hunches. If you receive a stack of data tables with significance tests everywhere, ignore the ones that you didn’t set out to test, even if they have significant indicators beside them. If you don’t, you risk reporting results that seem noteworthy when in fact they are not.

Tests of statistical significance can be very helpful when used appropriately. Used indiscriminately, they will inevitably lead to some misleading results. And that, as the example of Dr. Wansink reveals, can have truly significant consequences.