If you follow the science blogging community, you may have noticed a lot of talk about sample size in the past couple of weeks. So I did my share of mulling things over, and this is what I came up with.
1- The study in question had a small sample size but reported a significant p-value (<0.05). Such a study is NOT underpowered. An underpowered study is one whose sample size is too small to allow detection of a significant result. A significant result is by definition a p-value less than 5%, which the study in question had. So, even though small sample size studies are in general indeed underpowered, that wasn't the issue in this particular case. In general, you are not likely to see many underpowered studies published (see point 5 below).
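To make the idea of power concrete, here is a minimal simulation sketch in Python; the effect size, sample sizes, and test are my own illustrative choices, not anything from the study in question. Power is simply the fraction of hypothetical repetitions of an experiment that come out significant when the effect is actually real.

```python
# Illustrative only: estimate the power of a two-sample t-test by simulation.
# The effect size and sample sizes below are made up for this sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def estimated_power(n_per_group, effect_size=0.5, n_sims=10_000, alpha=0.05):
    """Fraction of simulated experiments that reach p < alpha
    when a true effect of the given size exists."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        _, p = stats.ttest_ind(control, treated)
        if p < alpha:
            hits += 1
    return hits / n_sims

for n in (15, 30, 100):
    print(f"n = {n:3d} per group -> power ~ {estimated_power(n):.2f}")
```

With a modest made-up effect and small groups, only a fraction of the simulated experiments reach significance; that fraction growing with sample size is all that "adequately powered" means.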
2- The issue with ANY small sample size study is that you are not capturing the full variability in the population. And if you are not capturing the full variability, chances are your error model is wrong, and a wrong error model leads to a wrong p-value. In other words, even if you do get a significant p-value, there's a question of whether or not that particular p-value is at all meaningful.
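To illustrate the point (with a skewed distribution and a sample size I picked purely for this sketch, nothing to do with the actual study): if the data are heavily skewed but the test assumes normal errors, a small sample can give p-values whose realized error rate drifts away from the nominal 5%.

```python
# Illustrative only: when the assumed error model (normal errors) does not
# match the data (heavily skewed), small-sample p-values are miscalibrated.
# The distribution and sample size are invented for this sketch.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

n, n_sims, alpha = 10, 20_000, 0.05
true_mean = np.exp(1.5**2 / 2)   # mean of a lognormal(0, 1.5) distribution

false_positives = 0
for _ in range(n_sims):
    sample = rng.lognormal(mean=0.0, sigma=1.5, size=n)
    # We test against the TRUE mean, so every rejection is a false positive.
    _, p = stats.ttest_1samp(sample, popmean=true_mean)
    if p < alpha:
        false_positives += 1

print(f"nominal error rate:  {alpha:.2f}")
print(f"observed error rate: {false_positives / n_sims:.3f}")
```

Comparing the two printed rates shows how the nominal 5% need not hold when the small sample never sees the distribution's real shape.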
3- Why publish a study with a small sample size, then? Welcome to the life of a scientist. You set off with a grand plan, write a grant to sequence say 100 individuals, get the money to sequence 50, then you clean the data and end up with 30. Okay, those are made-up numbers, but you get the idea. So now you've got your 30 sequences and you try to make the best out of them. You state all the caveats in the discussion section of your paper, advocate for further analyses, and discuss future directions. If your paper gets published you have some leverage in your next grant, as in: "Look! I saw something with 30 sequences, which is clearly not enough, so now I'm applying to get money to sequence 100." Many scientific advances have been made following exactly this route.
4- I've been talking a lot about p-values, but... What the heck is a p-value? A p-value of, say, 0.05 boils down to the following: if there were really nothing going on and you were to repeat your experiment 100 times, you would observe a result at least as extreme as yours 5% of the time just out of pure chance. Suppose for example you want to see if a particular gene allele is associated with cancer. You do your experiment and come up with a p-value of 0.03. This means that if there really was no association whatsoever between the trait you measured and cancer, you would see a difference at least as large as the one in your particular population 3% of the time out of pure chance. Now, you see why anything above 5% is not significant: to observe something 10% of the time out of pure chance means that whatever you are trying to measure is a random effect. But to see it 3% of the time makes it rare enough that we are allowed to believe that there may be something in there after all. Notice that this is pretty much how science works. Many science outsiders think that "scientific" means "certain." Not true. Scientific means we can measure the uncertainty and when it's small enough we believe the result.
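One way to make this definition concrete is a permutation test, which computes a p-value by brute force: reshuffle the group labels many times and count how often chance alone produces a difference at least as extreme as the one you observed. The numbers below are invented purely for illustration.

```python
# Illustrative only: a permutation test computes the p-value directly as
# the fraction of random label reshufflings that produce a group difference
# at least as extreme as the observed one. The data are made up.
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical measurements for carriers vs. non-carriers of some allele.
carriers     = np.array([5.1, 6.3, 5.8, 6.9, 5.5, 6.1])
non_carriers = np.array([4.2, 5.0, 4.8, 5.6, 4.4, 5.2])

observed = carriers.mean() - non_carriers.mean()
pooled = np.concatenate([carriers, non_carriers])

n_perm = 100_000
count = 0
for _ in range(n_perm):
    shuffled = rng.permutation(pooled)
    diff = shuffled[:len(carriers)].mean() - shuffled[len(carriers):].mean()
    if abs(diff) >= abs(observed):   # "at least as extreme", two-sided
        count += 1

print(f"permutation p-value: {count / n_perm:.4f}")
```

The printed fraction is exactly the "X% of the time out of pure chance" in the paragraph above.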
5- Now that we understand what p-values are we get to another issue: publication bias. Follow the logic: I just said that we start believing a result whenever the p-value is less than 5%. Basically, you can forget publishing anything that has a p-value above 5%. But you won't know your p-value unless you do the experiment, and you won't publish unless you get a low p-value. Which means you will never see all the similar studies that were carried out and yielded a high p-value. Suppose an experiment looking for an effect that isn't actually there were repeated across different labs 100 times. Then, just by chance alone, about 5% of these experiments will yield a p-value of 5% or less (a small simulation below, after the quotes, plays this out). However, what you end up seeing in print are the experiments that yielded the "good" p-value, not the ones that yielded the negative results. As Dirnagl and Lauritzen put it [1],
"Only data that are available via publications‚ and, to a certain extent, via presentations at conferences‚ can contribute to progress in the life sciences. However, it has long been known that a strong publication bias exists, in particular against the publication of data that do not reproduce previously published material or that refute the investigators‚ initial hypothesis."People address the issue with meta-analyses, in which several studies are examined and both positive and negative results are pooled together in order to estimate the "true" effects.
"In many cases effect sizes shrink dramatically, hinting at the fact that very often the literature represents the positive tip of an iceberg, whereas unpublished data loom below the surface. Such missing data would have the potential to have a significant impact on our pathophysiological understanding or treatment concepts."A new movement is rising, which advocates the publication of negative results (i.e. results that did not substantiate the alternative hypothesis), and more journals are integrating this into either a "Negative Result" section or, as BioMed Central has done, even dedicating a journal to it, the Journal of Negative Results in Biomedicine.
I welcome and embrace the change in thinking. It's the same logic I advocate for mathematical models. My new motto: "Negative results? Bring them on!" Maybe I'll have a T-shirt made -- anyone want one too?
[1] Dirnagl, U., & Lauritzen, M. (2010). Fighting publication bias: introducing the Negative Results section. Journal of Cerebral Blood Flow & Metabolism, 30(7), 1263–1264. DOI: 10.1038/jcbfm.2010.51
"Now, you see why anything above 5% is not significant: to observe something 10% of the time out of pure chance means that whatever you are trying to measure is a random effect."
This language makes it sound like if the p value is above .05, the hypothesis can be considered disproven; not so. If there's an effect size that would be clinically significant (assuming a small biomedical study) but the p value is .10, you can't say the effect is real but you also can't say it's not real; you don't know if a larger study would have shown the same difference between groups with a p value of .05. That's very different from the situation in which there is just no apparent numerical difference between the two groups. Both authors and bloggers are sometimes motivated to conflate these. (I don't know what the original paper you're referring to is, by the way.)
You're right, the 5% is a purely nominal threshold, and it is often chosen arbitrarily. In fact, I myself get annoyed when people tell me that p=0.04 is significant and p=0.06 is not. Thanks for pointing this out.
BTW, I'm a statistician, not a blogger. :-)