Is there a Cheshire Cat in science? One might believe so, given the many published scientific discoveries that cannot be independently reproduced. The “replication crisis” in science has become a widely discussed issue among scientists and the lay media and even has its own entry in *Wikipedia*.

Case in point: In August 2015, *Science* published a paper by a group called the Open Science Collaboration describing the group’s attempts to replicate 100 studies that had been published in 2008 in three major psychology journals [1]. Of the original studies, 97% had reported statistically significant effects. Only 36% of the replications could confirm them, and most of the “replicated” effects were much smaller than in the original studies. Were the original papers wrong, or did the follow-up studies somehow fail to properly replicate the original findings?

The failure to replicate previously published findings is distressingly common in science. In his famous paper, John Ioannidis, a prominent epidemiologist now at Stanford, claimed, astonishingly, “most published research findings are false” [2]. Andrew Gelman, a statistician at Columbia University, recently commented, “top journals in psychology routinely publish ridiculous, scientifically implausible claims” [3]. But the same criticism can be (and has been) raised about other fields of science as well.

## False Discovery Through Statistics

Gelman, Ioannidis, and others with similar viewpoints are chiefly concerned about the misuse of statistics—in particular, statistical tests based on null hypothesis significance testing (NHST). Their papers are highly technical, but their conclusions are stark: NHST is a prime cause of false discovery.

In such a test—for example, the familiar t-test found in many spreadsheet programs—the input consists of data from experimental and control groups. The test then calculates a *p* value. If *p* < 0.05, the scientist will typically proclaim the result “statistically significant” or, more ambiguously, just “significant.” Since journals are reluctant to publish negative results, *p* < 0.05 is, in a sense, a threshold for publishable results. In today’s “publish or perish” environment, it then becomes a goal for scientists to aim for.

Many scientists assume that *p* < 0.05 establishes the “probable presence of an effect in a medical experiment,” as an author recently e-mailed one of us (an editor of a biomedical engineering journal) in protesting the rejection of his paper. That is not the case at all.

The *p* value was popularized in the 1920s by Sir Ronald Aylmer Fisher (1890–1962), the famous U.K. statistician, as an informal way to judge whether evidence was worth a second look. Provided that the assumptions required for the test are satisfied, the *p* value is the probability of finding differences between two samples as large as or larger than those observed, given that the null hypothesis is correct. Put more simply, the test is based on the assumption that the treatment had no effect; the *p* value measures the likelihood of a false positive result due to random sampling errors.
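Fisher’s definition can be made concrete with a short simulation. In the sketch below (illustrative numbers, standard-library Python), both groups are drawn from the same population, so the null hypothesis is true by construction; a permutation test then estimates the *p* value as the fraction of random relabelings that produce a difference at least as large as the one observed.

```python
import random
import statistics

random.seed(1)

# Two groups drawn from the SAME population (illustrative numbers), so the
# null hypothesis is true by construction.
n = 10
control = [random.gauss(100, 15) for _ in range(n)]
treated = [random.gauss(100, 15) for _ in range(n)]
observed = abs(statistics.mean(treated) - statistics.mean(control))

# The p value asks: if the treatment had no effect, how often would random
# sampling alone produce a difference at least this large?  A permutation
# test answers that directly by reshuffling the group labels.
pooled = control + treated
extreme, trials = 0, 10_000
for _ in range(trials):
    random.shuffle(pooled)
    if abs(statistics.mean(pooled[:n]) - statistics.mean(pooled[n:])) >= observed:
        extreme += 1

p_value = extreme / trials
print(f"estimated p = {p_value:.3f}")
```

Because both groups come from the same population, any difference between them here is purely a sampling accident, which is exactly what the *p* value quantifies.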

But a test based on the assumption that there is no effect cannot tell you the likelihood that the assumption is correct, or, conversely, the likelihood that an effect is present. As the editors of a major psychology journal explained recently in an editorial in their journal, “the problem is in traversing the distance from the probability of the finding, given the null hypothesis, to the probability of the null hypothesis, given the finding” [4]. The editors then banned the use of *p* values and other statistical measures based on NHST from their journal, asking authors instead to focus on descriptive statistics (those that describe the data) and avoid basing conclusions on NHST.

“Traversing the distance” from a *p* value to the “probability of a finding” is not a simple matter. It depends on the likelihood that the hypothesis being tested is correct before the experiment is conducted (the priors, in statistical jargon). David Colquhoun, of University College London, illustrated this nicely in his 2014 study that simulated 100,000 experiments varying in both effect size and the likelihood that the hypothesis being tested was, in fact, true. Some of the simulated experiments correctly reported an effect based on *p* < 0.05. Others were false positives: they reported an effect based on *p* < 0.05 even though the control and treated groups were drawn from the same population and differed only due to sampling errors. The number of true positive findings depended on the likelihood of the effect being present; the number of false positives depended on the *p* value. “If you use *p* = 0.05 to suggest that you have made a discovery,” he concluded, “you will be wrong at least 30% of the time” [5].
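Colquhoun’s argument can be reproduced in miniature. The simulation below is a sketch under assumed parameters, not his exact protocol: a 10% prior probability that a tested effect is real, a real effect of one standard deviation, and 16 subjects per group (roughly 80% power). It then asks what fraction of “discoveries” at *p* < 0.05 are false.

```python
import math
import random

random.seed(42)

# Assumed parameters (a sketch of the argument, not Colquhoun's exact setup):
# 10% of tested hypotheses are actually true, the real effect is 1 SD, and
# each experiment compares two groups of 16 (roughly 80% power at p < 0.05).
prior_real, effect_size, n = 0.10, 1.0, 16

def p_value(a, b):
    """Two-sample test with known SD = 1 (true by construction here)."""
    z = abs(sum(b) / len(b) - sum(a) / len(a)) / math.sqrt(2 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

true_pos = false_pos = 0
for _ in range(10_000):
    real = random.random() < prior_real
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(effect_size if real else 0.0, 1) for _ in range(n)]
    if p_value(control, treated) < 0.05:
        true_pos += real
        false_pos += not real

fdr = false_pos / (true_pos + false_pos)
print(f"fraction of 'discoveries' that are false: {fdr:.0%}")
```

Under these assumptions, the fraction of “significant” findings that are false comes out far higher than the naïve 5% one might expect, even though every individual test used the conventional threshold.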

This problem is analogous, Colquhoun pointed out, to the problem of medical testing. Applying even a good test to a group of patients in search of a rare condition can swamp you with false positives, and the results will be highly unreliable for individual patients.

But the larger problem is not statistical but scientific. The null hypothesis is seldom very interesting—scientists rarely do experiments expecting to find no difference at all between the control and treated groups. And rejecting the null hypothesis does not prove whatever alternative hypothesis the investigator might have in mind. NHST leads to “a flow of noisy claims which bear only a weak relation to reality,” Gelman pointed out [3].

## Project Henhouse

While all of this sounds technical and academic, a naïve reliance on *p* values and NHST can have large real-world consequences. For example, the scientific literature has hundreds (if not thousands) of reports of “statistically significant” effects of electromagnetic fields on biological systems, many at low exposure levels and many lacking any clear explanation. Virtually all of these claims are based on *p* < 0.05 and NHST. Activists campaigning against what they believe are health hazards of electromagnetic fields from commonplace sources compile lists of “effects,” even as health agencies conduct extensive reviews of the literature and do not see clear evidence of harm from such exposures. Decades of political controversy in this field have been fuelled by scientific reports of effects based on *p* < 0.05.

Only a few of these studies have been subject to careful replication attempts, and the attempts themselves have frequently been controversial. Consider a 1982 study by Spanish investigator José Delgado that reported striking effects of pulsed magnetic fields on the number of abnormalities in chick embryos [11]. This biologically implausible finding quickly became a factor in the public debate then taking place about possible health effects of video display terminals.

Delgado’s study was unblinded and, with a total of 68 eggs for nine different combinations of frequency and field levels, had very low statistical power. At considerable expense, the U.S. Office of Naval Research funded a major replication study, with six different laboratories in the United States and Europe repeating Delgado’s original study using similar exposures but with meticulous quality control, blinded assessment, and many more eggs than in the original study. Only three of the six laboratories found statistically significant (*p* < 0.05) differences between the exposed and control eggs. When the data were combined, the review concluded “the difference in incidence of abnormalities … is highly [statistically] significant” [6].

However, the results were also highly inconsistent: one lab reported a strong effect of exposure (*p* < 0.001), while the others found either no statistically significant differences or, at most, small effects. While one can argue about whether an “effect” was present or not, in any event, it would have been much smaller than the one originally reported by Delgado and colleagues, and it was nonreproducible in different labs.

Perhaps this confusion was due to statistical effects that Colquhoun and others have pointed out. More likely, it was a result of an inadequately controlled experimental factor or experimental error of unknown nature. For example, the labs used eggs from different sources in different countries, a possible source of variability in results.

Delgado’s study probably did not warrant the attention it received in the first place. Robert Brent, in a 1999 review, concluded that chick embryo studies of this sort were “not useful for predicting reproductive effects in humans” because of lack of biological relevance and other problems [7]. Now, more than three decades later, the evidence for developmental effects of such fields remains “inadequate,” in the judgment of a 2007 review sponsored by the World Health Organization.

## P-Hacking and the Texas Sharpshooter

Statistics can address the effects of random sampling errors. In the real world of science, a bigger problem may be unrecognized bias or error in a study. Scientists, like other mortals, are subject to a variety of cognitive biases and illusions whose presence they frequently have difficulty recognizing. Effects that can lead to false discovery include confirmation bias (scientists tend to find what they expected to find) and the file-drawer bias (the failure to publish findings that do not seem interesting to the scientists, e.g., those with *p* > 0.05).

Scientific research can be manipulated, perhaps unwittingly, in a number of ways to produce “statistically significant” (*p* < 0.05) results. A scientist might continue repeating an experiment until *p* < 0.05 is found but not tell the reader how many times the experiment had been repeated with nonsignificant results (an example of the “Texas Sharpshooter” effect, after an old joke about a Texas farmhand who fires a shotgun at a barn and then paints a bull’s-eye around where the pellets hit). Or the scientist might analyze the data in multiple ways and select the one that produces “statistically significant” findings with *p* < 0.05 (an example of what is technically called the *multiple comparison problem* but has been satirized as the “jellybean effect”) [8].
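The multiple-comparison problem is easy to demonstrate with a simulation (illustrative numbers below). Under a true null hypothesis, a well-calibrated *p* value is uniformly distributed, so each of 20 comparisons on pure noise has a 5% chance of a spurious *p* < 0.05; the chance that at least one comparison “succeeds” is 1 − 0.95^20, about 64%.

```python
import random

random.seed(7)

# Under a true null hypothesis, a well-calibrated p value is uniformly
# distributed on [0, 1], so random.random() can stand in for running a test.
n_trials, n_comparisons = 10_000, 20
at_least_one = 0
for _ in range(n_trials):
    # 20 comparisons on pure noise (e.g., 20 jellybean colors):
    if any(random.random() < 0.05 for _ in range(n_comparisons)):
        at_least_one += 1

frac = at_least_one / n_trials
print(f"runs with at least one spurious discovery: {frac:.0%}")  # about 64% in expectation
```

Report only the one comparison that “worked,” and a nearly two-in-three chance of a fluke masquerades as a discovery at *p* < 0.05.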

If the scientist analyzes data as they are acquired in an experiment and terminates the experiment when the magical “*p* < 0.05” is reached, the chances of a false-positive result increase dramatically. These problems, generally known as *p-hacking*, would be avoided in a carefully designed study with a pre-established protocol, prespecified hypotheses, and methods for analyzing the data. P-hacking is much more likely to be present—and difficult for a reader to spot—in a preliminary or proof-of-concept study of the sort that is often published in the biomedical engineering literature. Certainly, the arbitrary threshold of “*p* < 0.05” for publication is a strong incentive for such manipulations.
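Optional stopping is just as easy to simulate. In the sketch below (illustrative parameters, standard-library Python), both groups are drawn from the same population, so every “discovery” is false. One arm analyzes the data only once, at a fixed sample size; the other “peeks” after every few subjects and stops as soon as *p* < 0.05.

```python
import math
import random

random.seed(3)

def p_value(a, b):
    """Two-sample test with known SD = 1 (true by construction here)."""
    z = abs(sum(b) / len(b) - sum(a) / len(a)) / math.sqrt(1 / len(a) + 1 / len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))

def run_experiment(peek, start=10, batch=5, max_n=100):
    """Both groups come from the same population: any 'discovery' is false.
    With peek=True, the data are tested after every batch and the experiment
    stops as soon as p < 0.05."""
    control = [random.gauss(0, 1) for _ in range(start)]
    treated = [random.gauss(0, 1) for _ in range(start)]
    while len(control) < max_n:
        if peek and p_value(control, treated) < 0.05:
            return True  # stop early and declare a "discovery"
        control += [random.gauss(0, 1) for _ in range(batch)]
        treated += [random.gauss(0, 1) for _ in range(batch)]
    return p_value(control, treated) < 0.05

trials = 2_000
rates = {}
for peek in (False, True):
    rates[peek] = sum(run_experiment(peek) for _ in range(trials)) / trials
    print(f"peeking={peek}: false-positive rate = {rates[peek]:.1%}")
```

The fixed-design arm stays near the nominal 5% false-positive rate, while repeated peeking inflates it well beyond 5%, even though no individual test was miscalculated.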

Scientists, like other people, tend to find what they are looking for and to overestimate the reliability of their observations. For example, in 1986, Max Henrion and Baruch Fischhoff showed that physicists, when measuring fundamental constants such as the velocity of light in multiple studies over the preceding two centuries, consistently underestimated the uncertainties of their measurements: they repeatedly found values for the speed of light that agreed with those from previous studies and reported “error bars” that often did not include what is now the accepted value. If physicists, who pursue the most exact and rigorous of sciences, are prone to such biases, what about the rest of us?

Certain areas in biomedical engineering appear particularly susceptible to false discovery. Consider, for example, the many papers describing machine learning techniques to develop classifiers to diagnose disease. Typically, these involve using a set of attributes of a signal or image to train a classifier to identify patients as “healthy” or “ill.” A search on Google Scholar for “support vector machine” and “diagnosis” will uncover many papers of this sort. The counterpart of replication for such studies would be to test the performance of the classifiers using truly independent sets of data from entirely different groups of subjects in different medical centers, which is seldom done in these preliminary studies.

Machine learning is a well-established method for data analysis with many capabilities. But it is notoriously “data hungry,” with unstable performance unless the size of the training set is much larger (20 times larger or more) than the number of attributes—a condition that is seldom met in biomedical engineering studies. In many of these papers, it is difficult to know for certain that the test data were rigorously separated from the training data, or that the analysis was not adjusted during the study to obtain the best results—both prerequisites for a valid study. Perhaps the reason that nobody has written about the “replication crisis of biomedical engineering” is that nobody has tried to replicate many of these machine learning studies.
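That kind of leakage can be demonstrated without any real data. The sketch below (a hypothetical analysis, not any published study) generates pure noise with randomly assigned “healthy”/“ill” labels, selects the ten most discriminative of 200 features, and trains a simple nearest-centroid classifier. When feature selection is allowed to see the test subjects, held-out accuracy looks impressive; when selection is restricted to the training subjects, accuracy falls back to chance.

```python
import random

random.seed(0)

# A hypothetical analysis (not any published study): pure-noise "signals"
# with randomly assigned healthy/ill labels, so honest held-out accuracy
# should hover near 50%.
n_subjects, n_features, n_keep, n_train = 40, 200, 10, 30

def one_dataset():
    X = [[random.gauss(0, 1) for _ in range(n_features)] for _ in range(n_subjects)]
    y = [random.randint(0, 1) for _ in range(n_subjects)]
    train, test = range(n_train), range(n_train, n_subjects)

    def select(rows):
        """Pick the n_keep features whose class means differ most on `rows`."""
        def score(j):
            a = [X[i][j] for i in rows if y[i] == 0]
            b = [X[i][j] for i in rows if y[i] == 1]
            return abs(sum(a) / len(a) - sum(b) / len(b)) if a and b else 0.0
        return sorted(range(n_features), key=score, reverse=True)[:n_keep]

    def held_out_accuracy(feats):
        """Nearest-centroid classifier fit on the training subjects only."""
        cents = {}
        for c in (0, 1):
            rows = [i for i in train if y[i] == c] or list(train)
            cents[c] = [sum(X[i][j] for i in rows) / len(rows) for j in feats]
        hits = 0
        for i in test:
            dist = {c: sum((X[i][f] - cents[c][k]) ** 2
                           for k, f in enumerate(feats)) for c in (0, 1)}
            hits += min(dist, key=dist.get) == y[i]
        return hits / len(test)

    # Leak: feature selection sees ALL subjects, including the test set.
    leaky = held_out_accuracy(select(range(n_subjects)))
    # Clean: feature selection restricted to the training subjects.
    clean = held_out_accuracy(select(train))
    return leaky, clean

results = [one_dataset() for _ in range(100)]
leaky = sum(r[0] for r in results) / len(results)
clean = sum(r[1] for r in results) / len(results)
print(f"selection saw test data: {leaky:.0%}; training data only: {clean:.0%}")
```

The “leaky” pipeline reports well-above-chance accuracy on data that contain no signal at all, which is exactly why the train/test split must be enforced before any step of the analysis, including feature selection.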

## What Is to Be Done?

Ioannidis’ paper on “why most published research findings are false” has, according to the author, been downloaded more than a million times. In 2014, Ioannidis published a follow-up paper, “How to Make More Published Research True,” with a list of suggestions largely aimed at researchers [9]. Many of his recommendations, while important, would be difficult to implement for much biomedical engineering research, a good deal of which is exploratory or designed as proof of concept. Other recommendations, such as changing the reward structure of science, point to systemic problems with science that are far beyond most of our job descriptions.

However, at a more modest level, biomedical engineers can take some steps to increase the reliability of experimental studies:

- As biomedical engineers, perhaps the most important contribution we can make is to educate our students about elements of good study design and how to minimize bias. We teach our students not to plagiarize or falsify data. How about also teaching them how to avoid p-hacking and the Texas Sharpshooter effect?
- We should teach students how to estimate the statistical power of a study and the importance of power in influencing the reliability of its findings. As Colquhoun and other statisticians have repeatedly pointed out, a study that is too small—with low statistical power—is dramatically more prone to false discovery, and to overestimating the size of real effects, than an adequately powered study. Researchers are inevitably limited in the size of studies they can conduct, and this limitation is particularly severe in exploratory research. But at the least, authors reporting effects of some kind of treatment should also estimate the statistical power of their studies. This is rarely done, perhaps in part because the statistical power of many exploratory studies is embarrassingly low. “Your chance of making a fool of yourself increases enormously when experiments are underpowered,” Colquhoun pointed out.
- The t-test and other tests based on NHST are probably always going to be with us—they are easy to use and are the prevailing practice. However, authors should also show the original data where possible and shift the focus from NHST to descriptive statistics to characterize their results. Authors should reserve the term “statistically significant” for findings where *p* < 0.05 and avoid the simple but deeply ambiguous term “significant.”
- Replication is important—but not enough. As Project Henhouse showed, replicating a study can be a contentious, expensive, and difficult business without clear resolution. We need to shift the focus from “finding effects” (i.e., excluding the null hypothesis) to formulating and testing hypotheses to understand the effects and to assessing the generalizability of the findings beyond the facts of the original study.
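Power, unlike some of the subtler problems above, is straightforward to estimate by simulation before a study is run. The sketch below uses illustrative assumptions (a real effect of 0.5 standard deviations and a two-sample test at *p* < 0.05 with known unit variance) to estimate power for several group sizes.

```python
import math
import random

random.seed(11)

def detects(n, effect):
    """One simulated experiment: two groups of size n, SD = 1 by construction,
    the treated group shifted by `effect`; returns True if p < 0.05."""
    control = [random.gauss(0, 1) for _ in range(n)]
    treated = [random.gauss(effect, 1) for _ in range(n)]
    z = abs(sum(treated) / n - sum(control) / n) / math.sqrt(2 / n)
    p = 2 * (1 - 0.5 * (1 + math.erf(z / math.sqrt(2))))
    return p < 0.05

trials = 5_000
powers = {}
for n in (10, 30, 100):
    powers[n] = sum(detects(n, 0.5) for _ in range(trials)) / trials
    print(f"n = {n:3d} per group: power = {powers[n]:.0%}")
```

Under these assumptions, ten subjects per group detect the effect only about one time in five, while 100 per group succeed more than nine times in ten; an underpowered study mostly produces noise.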

In fact, health agencies do not demand a strict replication of studies as a prerequisite for considering them in health risk assessments. For example, the World Health Organization review of health effects of powerline fields called for “at least some evidence of replication or confirmation” as well as relevance to health, which falls short of strict replication [10].

Replication is, in fact, only the first step. Ultimately, the significance of a scientific discovery is determined by how far it leads, and what further productive research it motivates. Replication is important, but generalizability of knowledge is an even more important goal. In biomedical engineering studies, developing new diagnostic techniques using machine learning can be interesting intellectual exercises—but can these tests be used reliably with a relevant patient population?

Perhaps if we pay close attention to the proper application of statistics and to avoiding the Texas Sharpshooter effect and other biases, and perhaps if we teach our students to take a longer view toward establishing the generalizability of their results, we might in the future avoid the embarrassment of articles about a “crisis” in our profession. But since difficulty with statistical reasoning and susceptibility to bias may be inherent human characteristics, don’t count on it.

## References

1. A. A. Aarts et al., “Estimating the reproducibility of psychological science,” *Science*, vol. 349, no. 6251, 28 Aug. 2015.
2. J. P. A. Ioannidis, “Why most published research findings are false,” *PLoS Med.*, vol. 2, pp. 696–701, Aug. 2005.
3. A. Gelman, “Working through some issues,” *Significance*, vol. 12, pp. 33–35, 2015, doi: 10.1111/j.1740-9713.2015.00828.x.
4. D. Trafimow and M. Marks, “Editorial,” *Basic Appl. Soc. Psychol.*, vol. 37, no. 1, pp. 1–2, Feb. 2015.
5. D. Colquhoun, “An investigation of the false discovery rate and the misinterpretation of p-values,” *R. Soc. Open Sci.*, vol. 1, no. 3, p. 140216, Nov. 2014.
6. E. Berman, L. Chacon, D. House, B. Koch, W. Koch, J. Leal, S. Lovtrup, E. Mantiply, A. Martin, G. Martucci, K. Mild, J. Monahan, M. Sandstrom, K. Shamsaifar, R. Tell, M. Trillo, A. Ubeda, and P. Wagner, “Development of chicken embryos in a pulsed magnetic field,” *Bioelectromagnetics*, vol. 11, pp. 169–187, 1990.
7. R. Brent, W. Gordon, W. Bennett, and D. Beckman, “Reproductive and teratologic effects of electromagnetic fields,” *Reprod. Toxicol.*, vol. 7, pp. 535–580, Nov./Dec. 1993.
8. xkcd, “Significant.” [Online].
9. J. P. A. Ioannidis, “How to make more published research true,” *PLoS Med.*, vol. 11, p. e1001747, Oct. 2014.
10. E. van Deventre et al., Eds., *Environmental Health Criteria 238: Extremely Low Frequency Fields*. Geneva, Switzerland: World Health Organization, 2007.
11. J. M. R. Delgado, J. Leal, J. L. Monteagudo, and M. G. Gracia, “Embryological changes induced by weak, extremely low-frequency electromagnetic fields,” *J. Anat.*, vol. 134, pp. 533–551, 1982.