A triple whammy
First, on March 20, the British science journal Nature published an important editorial, “It’s time to talk about ditching statistical significance”, which argued against the indiscriminate use of statistical testing in health studies.
Second, the same edition contained a commentary, “Scientists rise up against statistical significance”, signed by 853 scientists worldwide, about 80 of them in the UK. It called for an end to, inter alia, “the dismissal of possibly crucial effects” in health studies through the inappropriate use of statistical testing.
Third, the Nature editorial reported that scientists at the American Statistical Association (ASA) had simultaneously published a scientific article to the same end.
The ASA decision had been in the pipeline for some time. Three years earlier, in 2016, it had released a statement in The American Statistician warning against the misuse of statistical significance and p values. (A p value is the probability that a finding is a fluke, ie due to chance alone.) The 2016 issue included many articles on the matter and attracted 150,000 online viewers.
In March 2019, a special issue in the same journal pushed this warning further and actually advised statisticians “don’t say ‘statistically significant’”. It presented more than 40 papers on ‘Statistical Inference in the 21st century: a World Beyond P < 0.05’. Another article with dozens of signatories also called on authors and journal editors to disavow those terms.
In accord with past reports
The above three simultaneous developments are a triple whammy against statistical testing in health studies, ie in clinical trials but also in epidemiology studies. However they are unsurprising, as scientists have been criticising the indiscriminate use of statistical significance in health studies for many years, in particular its use in epidemiology studies. According to Wikipedia, between 300 and 400 primary scientific articles containing criticisms of statistical testing have been published in recent decades.
In my own studies (see for example page 9 of my 2015 paper) I found four particularly trenchant and illuminating critiques of the use of statistical significance, as follows.
Axelson O. Negative and non-positive epidemiological studies. Int J Occup Med Environ Health 2004;17:115-121.
Everett DC, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol 1998;85(3):775-786.
Sterne JAC, Smith GD. Sifting the evidence – what’s wrong with significance tests? Phys Ther 2001;81(8):1464-1469.
Whitley E, Ball J. Statistics review 1: Presenting and summarising data. Crit Care 2002;6:66-71.
Some readers may ask: what is all the fuss about? Why the need for significance tests anyway?
The answer is long and complicated partly because views on statistical significance are polarised, even nowadays. (The current silence of the BMJ and the Lancet on the matter is notable.) For specialist readers, Wikipedia has some useful, albeit lengthy, entries. See here and here.
The theory and practice of significance testing came about (as long ago as the 1930s) because chance findings are ubiquitous throughout science and a means was needed to distinguish between observations that were real and those that could just be a coincidence, ie a chance finding. The ability to deal with this uncertainty is important in some mechanical applications, eg in quality control procedures in industry and elsewhere. For example, in a batch of widgets, a random sample will be taken, the widgets’ weights measured, a statistical test or tests applied, and a yes/no decision taken on whether the batch is good enough for quality control purposes.
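The widget example can be sketched in a few lines of code. This is my own illustration (the weights, target and test level here are invented for the example, not taken from the article): a crude two-sided test that accepts the batch when the sample mean lies within about two standard errors of the target weight.

```python
# Sketch of a yes/no quality-control decision on a sample of widget
# weights (all numbers are hypothetical, for illustration only).
from math import sqrt
from statistics import mean, stdev

def batch_passes(weights, target, z=1.96):
    """Accept the batch if the sample mean is within ~2 standard
    errors of the target weight (a rough two-sided 5% test)."""
    se = stdev(weights) / sqrt(len(weights))
    return abs(mean(weights) - target) <= z * se

sample = [10.1, 9.9, 10.0, 10.2, 9.8, 10.1, 9.9, 10.0]
print("accept" if batch_passes(sample, target=10.0) else "reject")
```

The point is that the output is binary – accept or reject – which is exactly the kind of mechanical decision significance tests were designed for.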
Accordingly the Nature editorial states that statistical tests will continue to be needed in some applications where a yes/no decision is needed, but crucially not in the area of health research, including epidemiology studies and clinical trials.
Why? Because their use in health studies is often insufficiently nuanced. As the Nature article says, in the past many health researchers “chucked their results into a bin marked ‘not significant’ without further thought”. Instead researchers should have considered matters such as “background evidence, study design, data quality and an understanding of underlying mechanisms, as these are often more important than P values or confidence intervals”. In particular, they should have discussed the health implications of their non-statistically significant findings.
To give a simple example, suppose you wanted to find out whether a coin had been doctored to give more heads than tails when tossed. If you flipped it 100 times and got 70 heads and 30 tails, you could carry out a simple maths test and show that you could be pretty certain the coin was doctored. Or, equivalently, that there was a very low probability the observations were due to chance.
Is this Important?
Yes, for three reasons. First, because the use of statistical significance tests often leads to the wrong result, especially in clinical trials, and the same is true in epidemiology studies in my experience. See the diagram below, published in the Nature article, which mainly deals with clinical trials.
Second, it is important because as Nature states “the rigid focus on statistical significance encourages researchers to choose data and methods that yield statistical significance for some desired (or simply publishable) result, or that yield statistical non-significance for an undesired result, such as potential side effects of drugs — thereby invalidating conclusions.”
This damning verdict also applies to the undesired result of observed increases in health effects from an epidemiology study. For decades, some scientists, including those employed at UK government agencies, have dismissed risk findings in epidemiology studies near nuclear facilities by concluding they showed “no significant” raised risks or that excess risks were “not significant”, or similar phrases. Now, in theory, they will not be able to do this as statistical significance in health studies is on its way to the dustbin of science.
A third reason is specifically mentioned in the Nature article, viz we must re-examine past studies which used lack of statistical significance to dismiss observed increases, as those conclusions are now unreliable. This verdict applies, for example, to recent studies by the UK Government’s Committee on Medical Aspects of Radiation in the Environment (COMARE), which observed leukemia increases near UK nuclear facilities but dismissed them because they were not statistically significant. See for example
COMARE (2011) Committee on Medical Aspects of Radiation in the Environment Fourteenth Report. Further Consideration of the Incidence of Childhood Leukaemia Around Nuclear Power Plants in Great Britain. HMSO: London.
COMARE (2016) Committee on Medical Aspects of Radiation in the Environment (COMARE) Seventeenth Report. Further consideration of the incidence of cancers around the nuclear installations at Sellafield and Dounreay. HMSO: London.
The Importance of Size in Epidemiology
An important factor in epidemiological studies is their size, ie the number of observed cases of ill effects in a population. This is because the probability (ie the p value) that an observed effect may be a fluke is affected by both the magnitude of the effect and the size of the study (Whitley and Ball, 2002; Sterne and Smith, 2001). If the study size is small, its findings will often be found to be not statistically significant regardless of the presence of the adverse effect (Everett et al, 1998). Sadly, the rejection of findings for significance reasons can often hide real risks (Axelson, 2004; Whitley and Ball, 2002).
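The size effect is easy to demonstrate numerically. In this sketch of mine (the case counts are hypothetical, not from any of the cited papers), the same 50% excess risk produces a “not significant” p value in a small study and a highly significant one in a large study:

```python
# The same 50% excess risk at two study sizes (hypothetical numbers).
# p value = Poisson tail probability of seeing at least the observed
# number of cases when the expected number is `expected`.
from math import exp, factorial

def poisson_tail(observed, expected):
    """P(X >= observed) for a Poisson count with mean `expected`."""
    return 1.0 - sum(exp(-expected) * expected**k / factorial(k)
                     for k in range(observed))

small = poisson_tail(15, 10)    # small study: 15 cases vs 10 expected
large = poisson_tail(150, 100)  # large study: 150 cases vs 100 expected
print(f"small study p = {small:.3f}, large study p = {large:.1e}")
```

The small study fails a 5% test (p is roughly 0.08) while the large one passes it overwhelmingly, even though the underlying excess risk is identical in both.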
So what should researchers do with a study whose positive findings do not meet their significance test? First, they should NOT reject the findings. Instead they should report the observed increase together with its 95% (nowadays often 90%) confidence interval – the range of effect sizes consistent with the data – so that readers can make up their own minds. They should also consider background evidence, study design, data quality and possible underlying mechanisms, as these are more important than P values or confidence intervals. In particular, they should discuss the health implications of their findings.
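Reporting a confidence interval rather than a bare “not significant” verdict can be sketched as follows. This is my own illustration with hypothetical case counts, using the standard log-transform approximation for a standardised incidence ratio (observed over expected cases):

```python
# Approximate 95% CI for a standardised incidence ratio (SIR),
# ie observed / expected cases, via the usual log-transform for
# Poisson counts. The counts below are hypothetical.
from math import exp, sqrt

def sir_confidence_interval(observed, expected, z=1.96):
    """Approximate (lower, upper) 95% CI for observed/expected."""
    ratio = observed / expected
    se = 1.0 / sqrt(observed)   # SE of log(ratio) for a Poisson count
    return (ratio * exp(-z * se), ratio * exp(z * se))

low, high = sir_confidence_interval(15, 10)
print(f"SIR = 1.50, 95% CI {low:.2f} to {high:.2f}")
```

Here the interval runs from about 0.9 to about 2.5: it straddles 1.0 (so a 5% test calls it “not significant”), yet it is also consistent with a more than doubled risk – exactly the kind of information a bald “not significant” conceals from readers.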
Other problems exist with significance testing. First, the statement that a finding is “not significant” often misleads lay readers into thinking that a reported increase is unimportant or irrelevant. But in statistics the word “significant” is a specialist adjective with a narrow meaning, viz that the likelihood of an observation being a fluke is less than 5% (assuming a p = 5% test were used). It does NOT mean important or relevant.
Second, such phrases are often used without explaining that the chosen significance level is arbitrary. There is no scientific justification for using a 5% or any other test level: it is merely a matter of convenience. In other words, it is quite possible for results which are “not significant” when a 5% test is applied to become “significant” when a 10% or other test level is used.
The existence of this bad practice has historical parallels. In the past, dozens of health studies financed by tobacco companies acted to sow seeds of doubt about the health effects of cigarette smoking for several decades, as described in US books. See here and here. The dubious use of statistical significance tests was a common stratagem in these studies.
Similarly, big pharmaceutical companies have been shown to run bad trials on their own drugs, which were designed to distort and/or exaggerate their benefits and to minimise their side effects. See here. Again the lack of statistical significance was often used as a ploy in these trials.
I have argued for some time that tests for statistical significance have been misused in epidemiological studies of cancers near nuclear facilities. In the past, these studies have often concluded that such effects do not occur, or have downplayed any effects which did occur.
In fact, copious evidence exists throughout the world – over 60 studies – of raised cancer levels near NPPs. This is discussed in my scientific article in 2014 on a hypothesis to explain cancers near NPPs.
Most (>75%) of these studies found cancer increases but because they were small, their findings were often dismissed as not statistically significant. In other words, they were chucked in the bin marked “not significant” without further consideration.
I would conclude by asking open-minded scientists and observers to reconsider their views about the above 60+ studies and the COMARE reports showing raised cancer levels near NPPs. Just as people were misled about tobacco smoking in previous decades, perhaps we are being misled about raised cancers near NPPs nowadays.
Study size is less of a problem in clinical trials, where researchers can more easily control the sizes of the populations they compare.