Are you relying on data that has been skewed by a sample size that is too large or too small?

The importance of sample size is well known in medical research. Use very large samples when comparing two treatments and you will find “true” differences so small as to be unimportant. Utilize a very small sample and you’ll have the opposite problem: you’ll fail to statistically prove even important differences. This month we are going to explore the concept of sample size and discuss ways to read between the lines when analyzing study results.

Let’s start by referring back to last month’s coin analogy. If you flipped two coins, one of which had both sides heads and the other both sides tails, four times each, you’d obviously get four vs. zero heads. But no statistical test would say there was a “significant difference,” because there is a reasonable chance that with two identical coins you would get just this outcome. Therefore, with just four flips (patients, subjects, etc.) you can’t help but “fail to reject the null hypothesis.” But some people would report this result as p > 0.05 and inappropriately conclude that the coins being compared were the same!

Conversely, let’s say you took 10,000 patients with high blood pressure and treated them with two different pills. One pill lowered diastolic pressure from 95 to 84 while the other “only” lowered it to 85. A statistical test would say the difference was “significant,” with a very low p value (i.e. very unlikely to be due to chance alone when it persisted over such a large group). Some people would therefore conclude that the first drug was really “better” than the second in lowering blood pressure. It would be appropriate to conclude that the first drug “really” lowers blood pressure by a tiny bit more than the second (in patients with initial diastolics of 95) and that there is almost certainly a true difference between the drugs in this regard (only a very small chance that the two are absolutely identical). On the other hand, it would be absurd to say it was a better treatment, since the added blood pressure lowering effect is so minimal as to be meaningless.
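To see how sheer sample size drives the “P” value, here is a rough sketch of a two-sided z-test on the blood pressure example. The 84 and 85 come from the example; the split into 5,000 patients per pill and the standard deviation of 10 mm Hg are assumptions made purely for illustration, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_test(mean_a, mean_b, sd, n_per_group):
    """Two-sided z-test for a difference in means.

    Assumes both groups share the same known SD and the same size.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = (mean_a - mean_b) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Assumed for illustration: 5,000 patients per pill, SD of 10 mm Hg.
z_large, p_large = two_sample_z_test(85, 84, sd=10, n_per_group=5000)
print(f"5,000 per group: z = {z_large:.2f}, p = {p_large:.1e}")

# The same 1 mm Hg difference with only 50 patients per pill.
z_small, p_small = two_sample_z_test(85, 84, sd=10, n_per_group=50)
print(f"   50 per group: z = {z_small:.2f}, p = {p_small:.2f}")
```

Under these assumptions the identical 1 mm Hg difference is overwhelmingly “significant” in the large trial and nowhere near significant in the small one. The “P” value is tracking the size of the study, not the clinical importance of the difference.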

You will notice that everything said here relates to the chance that something would occur, or not occur, if the groups being tested were really the same. When that chance is tiny – less than 5% likelihood is often used in trials – we tend to “reject the null hypothesis” and say the difference is real. Conversely, we “fail to reject the null hypothesis” when there is more than a small chance (again, typically 5%) that these results could have occurred even if the groups were the same.

Even if the “P” value is greater than 0.05 (a greater than 5% chance that results this different would be found even if the two groups were really the same) it doesn’t mean they are the same. Results that yield a “P” of only 0.06 are still pretty unlikely to have been based on chance alone, and obviously different from results with p=0.99, where you’d virtually always find a difference this large with two identical groups. Conversely, even when a “P” equals 0.01, one in a hundred times that you did the comparison involved in your study you would find such results by chance alone, even if the two groups were the same. That is, for every hundred studies of identical groups, one should be expected to yield results that look this different.
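A small simulation can make the “one in a hundred” idea concrete. The sketch below runs many imaginary studies in which both groups are drawn from the very same population, then counts how often the “P” value still falls below 0.01 or 0.05. The number of studies, the group size, the known standard deviation of 1 and the function names are all assumptions chosen for illustration.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_identical_groups(n_studies=5000, n_per_group=30, seed=1):
    """Return the two-sided p-value of each simulated study.

    Both groups are drawn from the same normal population (mean 0, SD 1),
    so every "significant" result here is a false alarm.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    standard_error = sqrt(2 / n_per_group)  # SD is known to be 1
    p_values = []
    for _ in range(n_studies):
        mean_a = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0, 1) for _ in range(n_per_group)) / n_per_group
        z = (mean_a - mean_b) / standard_error
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def false_positive_share(p_values, cutoff):
    """Fraction of studies whose p-value falls below the cutoff."""
    return sum(p < cutoff for p in p_values) / len(p_values)


p_values = simulate_identical_groups()
for cutoff in (0.01, 0.05):
    share = false_positive_share(p_values, cutoff)
    print(f"p < {cutoff}: {share:.1%} of studies of identical groups")
```

The shares should land close to 1% and 5%, which is exactly what those cutoffs promise: the rate of false alarms when nothing is really going on.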

This reminds us that statistical testing like this was never really meant to be a “yes/no” proposition, but rather a guideline as to the likelihood that a true difference exists. Authors and journal editors, unfortunately, misapply “P” values routinely. They try to get us to believe that a measured difference is “real” if it would occur only 4% of the time, by chance alone, with two groups that are really the same, but “false” if it would occur 6% of the time in those same like groups. For this reason, it would be much preferable if “P” values were reported precisely, rather than as “p<0.05” or “p=NS.” It would be even better if we were given confidence intervals, which tell us what range of values is likely to reflect the “true” value in the entire population of interest, based on the results of the study sample. As an economist recently noted, we tend to use statistics like a drunk uses a lamppost: for support rather than illumination.
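As a rough illustration of why confidence intervals say more than a bare “P” value, the sketch below computes a 95% interval for a 1 mm Hg difference in mean blood pressure, as in the earlier two-pill example. The standard deviation of 10 mm Hg, the group sizes and the function name are assumptions for illustration only, and the interval relies on the usual normal approximation.

```python
from math import sqrt
from statistics import NormalDist


def mean_difference_ci(difference, sd, n_per_group, level=0.95):
    """Normal-approximation confidence interval for a difference in means.

    Assumes equal group sizes and the same known SD in both groups.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z_crit * sd * sqrt(2 / n_per_group)
    return difference - margin, difference + margin


# Assumed for illustration: observed difference 1 mm Hg, SD 10 mm Hg.
low_big, high_big = mean_difference_ci(1.0, sd=10, n_per_group=5000)
low_small, high_small = mean_difference_ci(1.0, sd=10, n_per_group=50)

print(f"5,000 per group: {low_big:.2f} to {high_big:.2f} mm Hg")
print(f"   50 per group: {low_small:.2f} to {high_small:.2f} mm Hg")
```

With the large groups the interval excludes zero but also shows at a glance that the true difference is tiny. With the small groups it spans zero and several mm Hg in either direction, which tells us the study simply can’t say much either way.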

So, failure to find a difference statistically doesn’t mean one doesn’t exist, just that we can’t prove it with the group we tested. Furthermore, the smaller our test group, the less faith we can put in negative results. Be cautious. Even very small “P” values don’t prove the difference we found is real (not based on chance alone), but merely that it’s likely to be so. If you flip a normal coin 100 times over and over again, at some point you’ll get a series of 100 with very unlikely results.

To illustrate this point, a recent study of dog bites found that there were two infections in 82 patients treated with antibiotics, compared to five infections in 70 other patients. It was a bit misleading to present this as a three-fold increase in infections when antibiotics were not given (7.1% vs 2.4%), since the “P” value of 0.08 means chance alone could explain it eight times out of 100, which by convention is not unlikely enough to allow us to reject the null hypothesis. On the other hand, it was equally unreasonable for the authors to then turn around in their discussion and state that the treatment groups were equal and that there is no value to antibiotics, since they were associated with a great decrease in infections in this sample group, which was likely (but not certain) to have been due to a treatment effect. The real point is that by doing a power calculation, the authors of this otherwise very well done study could have discovered that the study wasn’t worth the effort without including many more patients. They were almost certain not to have found any statistical differences unless antibiotics were overwhelmingly the most important factor in decreasing infections, and there was no reason to hypothesize that prior to the study.
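For a sense of what such a power calculation might look like, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. It takes the infection rates seen in this sample (5 of 70 and 2 of 82) as if they were the true rates, and it assumes a two-sided alpha of 0.05 and 80% power; those settings and the function name are conventional choices for illustration, not figures from the study.

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate patients needed per group to detect p1 vs p2.

    Uses the normal-approximation formula for two independent proportions
    with equal group sizes and a two-sided test.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance_sum / (p1 - p2) ** 2)


# Observed rates from the dog bite sample, treated as the true rates.
no_antibiotics = 5 / 70
antibiotics = 2 / 82

needed = patients_per_group(no_antibiotics, antibiotics)
print(f"Roughly {needed} patients per group would be needed")
```

Under these assumptions the answer comes out at several times the 70 to 82 patients per group actually enrolled, which is the sense in which the study was underpowered from the start.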

Another frequently encountered statistical problem has to do with something called data snooping or “data torturing.” In a typical (imaginary) study, investigators want to see what characteristics correlate with mortality in patients admitted for chest pain. So, they prospectively gather data on every imaginable characteristic of 10,000 such patients. They discover that the ED variables with a statistically significant correlation with increased mortality include EKG abnormalities, CHF, hypotension and blue eyes. The investigators tell us that each of these characteristics is important, and given the impressive size of the study, and the fancy use of regression analyses, we’re tempted to believe it.

However, the more tests you do, the more likely you are to find what look like differences between groups but actually popped up just by chance. When you say that P=0.05, it means that you “only” have a 1 in 20 chance of a difference this large occurring by chance alone. However, it also means that when you test many different variables in two identical groups you should find differences of this size, just by chance, in about 1 of every 20 tests. It’s possible that the genetic characteristics that produce blue eyes also increase mortality in patients admitted with chest pain. However, it’s even more likely that we only found this association because we tested so many variables that a few differences popped up by chance alone. The same is true even for initial CHF, or any of the other associations found in a study of this design, even though they seem much more plausible as factors capable of influencing mortality.
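The arithmetic behind this is simple enough to sketch. If each test has a 5% chance of a false alarm, the chance of at least one false alarm across many tests is 1 - (1 - 0.05)^k for k tests. The sketch below assumes the tests are independent of one another, which real patient characteristics rarely are, so treat the numbers as a rough guide; the function name is hypothetical.

```python
def chance_of_false_positive(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among n_tests independent tests
    when no real differences exist."""
    return 1 - (1 - alpha) ** n_tests


for n_tests in (1, 5, 20, 100):
    chance = chance_of_false_positive(n_tests)
    expected = n_tests * 0.05  # average number of chance "findings"
    print(f"{n_tests:>3} tests: {chance:.0%} chance of at least one, "
          f"about {expected:g} expected")
```

By the time twenty characteristics have been tested, a spurious “significant” finding is more likely than not.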

So, how do we interpret results like these? Well, first we recognize the problem created by this type of data gathering. Then, we look to see whether the authors did either of the two things available to lessen the problem. There are ways of revising your statistical tests to account for multiple comparisons, or, even better, you can try to validate your results by applying them to a different group.
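One of the simplest such revisions is the Bonferroni correction, which divides the usual 0.05 cutoff by the number of tests performed. It is only one of several available methods and is chosen here for its simplicity. The “P” values below and the count of 40 variables tested are entirely made up for illustration, as is the function name.

```python
def bonferroni_significant(p_values, n_tests, alpha=0.05):
    """Flag each finding as significant only if its p-value is below
    alpha divided by the total number of tests performed."""
    threshold = alpha / n_tests
    return {name: p < threshold for name, p in p_values.items()}


# Hypothetical p-values for the imaginary chest pain study.
reported = {
    "EKG abnormalities": 0.0004,
    "CHF": 0.003,
    "hypotension": 0.001,
    "blue eyes": 0.04,
}

# Assumed: 40 characteristics were tested in all, so the cutoff is 0.05 / 40.
result = bonferroni_significant(reported, n_tests=40)
for name, still_significant in result.items():
    print(f"{name}: {'significant' if still_significant else 'not significant'}")
```

Note that the divisor is the number of tests actually run, not just the number that made it into the paper. The correction is deliberately strict, so it will sometimes discard real associations along with the chance ones, which is part of why validating in a second group is the better option.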

Thus, if you look at enough variables in a group (the testing set) you’re likely to find some chance associations that aren’t reflective of the entire universe of patients with the condition in question. But it’s very unlikely that that exact same association will appear again in the next group of patients (validation set) in whom you ask the same questions. Maybe this time dark hair will seem to influence mortality, but blue eyes would be very unlikely (really less than 5% chance) to pop up again unless it in fact did have some influence on outcome.

I hope this helps clarify the application of “P” values and illustrates how they can be misleading when utilized, applied or interpreted incorrectly. Next month, we’re ready to move into a discussion on sensitivity and specificity.