More is not better: when statistics turn bad (it’s just not as entertaining as when animals do)
As with everything, more is not usually better. In statistics, more actually makes you more prone to accidentally “finding” things that aren’t there (or, in statistics-lingo, spurious results). Today, I’d like to talk about a basic concept that is taught to students in their introductory statistics courses (and hence, you would think, to most researchers): the effect of multiple significance testing.
Taking off from last week’s post: In an experiment (or bet) where a coin is flipped 60 times, the chances that less than 22 heads or more than 37 heads will come up are less than 5%, or 0.05 (0.025 for less than 22 heads + 0.025 for more than 37 heads), PROVIDED the coin is a fair coin.
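If you want to check that number yourself rather than take my word for it, here’s a minimal sketch (assuming you have Python with scipy handy):

```python
# How likely is a result this extreme if the coin really is fair?
from scipy.stats import binom

n, p = 60, 0.5                                         # 60 flips of a fair coin
p_extreme = binom.cdf(21, n, p) + binom.sf(37, n, p)   # P(< 22 heads) + P(> 37 heads)

print(f"P(fewer than 22 or more than 37 heads) = {p_extreme:.3f}")   # comes out under 0.05
```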
However, as you probably very astutely noted, even if the coin is fair, it IS possible to flip less than 22 heads or more than 37 heads. It certainly isn’t impossible. It’s even possible to flip 60 heads or 0 heads with a fair coin (it’s just highly improbable). When we incorrectly conclude that the coin is NOT fair when, in fact, it is (because we happened to flip a highly improbable result), we are committing a Type I error. In statistics-speak, we reject the null hypothesis incorrectly.
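You can also watch Type I errors happen by simulating a big pile of fair coins; here’s a rough sketch (numpy, with an arbitrary number of simulated experiments):

```python
# Simulate many 60-flip experiments with a FAIR coin and count how often
# our "not fair" rule (< 22 or > 37 heads) gets tripped by chance alone.
import numpy as np

rng = np.random.default_rng(0)
n_experiments = 100_000

heads = rng.binomial(n=60, p=0.5, size=n_experiments)   # heads counts from fair coins
false_alarms = (heads < 22) | (heads > 37)               # we'd wrongly call these coins unfair

print(f"Type I error rate: {false_alarms.mean():.3f}")   # hovers just under 0.05
```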
Remember that if a statistical test yields a p-value of less than 0.05, we make the inference that there is something going on between the two groups, because the difference we observed between them would be highly unlikely if there weren’t anything going on. But it is possible that we are wrong in making that inference. And the likelihood that we are making that mistake increases as we do more tests:
If the null hypothesis (i.e. there is no difference between the groups) is true and we consider 0.05 to be the critical level at which we would infer that there is something going on between groups, there is a 0.95 probability of yielding a non-significant p-value when we do a single test (i.e. that the statistic will correctly fail to reach significance).
If we do two tests, the probability of BOTH of them being non-significant is approximately 0.90. We get this number just from multiplying the probabilities of each test being non-significant (as per the laws of probability, assuming the tests are independent):
0.95 x 0.95 = (0.95)^2 ≈ 0.90 (remember the ^ means, “to the power of”)
If we do 20 tests of significance, the probability that ALL of them will be non-significant is:
(0.95)^20 ≈ 0.36
Since the probabilities of “all non-significant” and “at least one significant” have to add up to 1, the probability of getting at least one p-value less than 0.05 when, in fact, nothing special is going on (the null hypothesis is true) is:
1 - (0.95)^20 ≈ 0.64
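Here’s that same arithmetic as a quick sketch in plain Python (assuming, as above, that the tests are independent of one another):

```python
# Chance of at least one "significant" p-value by luck alone, for various numbers of tests
for n_tests in (1, 2, 5, 10, 20):
    p_at_least_one = 1 - 0.95 ** n_tests
    print(f"{n_tests:>2} tests: P(at least one p < 0.05 under the null) = {p_at_least_one:.2f}")
```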
So, in a study (like most of the beta-alanine trials) in which 20 or more p-values are generated, there is a better than 50/50 chance (about 64%) that at least one p-value will come up below 0.05 without reflecting anything real.
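To see it another way, here’s a toy simulation (numpy and scipy; the group sizes and the 20 outcome variables are just assumptions meant to mimic a study like the one below) in which two identical groups are compared on 20 outcomes and nothing is truly going on in any of them:

```python
# Two groups drawn from the SAME distribution, compared on 20 outcome variables.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
n_studies, n_per_group, n_outcomes = 5_000, 18, 20

studies_with_a_hit = 0
for _ in range(n_studies):
    group_a = rng.normal(size=(n_per_group, n_outcomes))   # "placebo"
    group_b = rng.normal(size=(n_per_group, n_outcomes))   # "supplement" (no true effect)
    p_values = ttest_ind(group_a, group_b).pvalue           # one t-test per outcome
    if (p_values < 0.05).any():
        studies_with_a_hit += 1

print(f"Fraction of studies with at least one p < 0.05: {studies_with_a_hit / n_studies:.2f}")
# lands near 1 - 0.95^20, i.e. about 0.64, even though no effect exists anywhere
```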
There ARE ways to correct p-values for multiple comparisons, and there are ways to minimize detecting a “spurious” p-value. Unfortunately, many researchers don’t know or understand this and will often home in on the one, or handful, of significant p-values in the sea of non-significant ones to say, “Ah ha! See? See? There IS something going on!”
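To make “correcting for multiple comparisons” concrete, here’s a minimal sketch of one common method, the Bonferroni correction (the p-values below are made up purely for illustration; other approaches, such as Holm or false discovery rate procedures, exist as well):

```python
# Bonferroni correction: divide the significance threshold by the number of tests.
p_values = [0.003, 0.04, 0.21, 0.047, 0.66]   # hypothetical p-values from one study
alpha = 0.05
threshold = alpha / len(p_values)              # 0.01 with five tests

for p in p_values:
    verdict = "significant" if p < threshold else "not significant"
    print(f"p = {p:.3f} -> {verdict} after correction")
```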
A good example of this is the beta-alanine study in which 36 subjects were given either beta-alanine or a placebo and then tested for multiple variables. I stopped counting the number of tests after 20. There were several significant p-values, amongst which was lean body mass. The authors concluded that beta-alanine increased lean body mass (despite the increase being no greater than in the placebo group).
This is also why reading ONLY the abstract of a study is generally a bad idea, because an abstract will usually only contain the positive findings due to limited space (usually 200-500 words, depending on the journal). Reading the back of the book isn’t the same as reading the book itself. The same goes for reading only the abstract of a paper.
P.S. If you’re coming to my blog for the first time from the Phi Life podcast, welcome! Alas, I have nothing to sell you. And if you haven’t heard the Phi Life podcast, you’re missing a great show. Not that me being on it makes it that way…