Part 4 in a primer on how to read – and understand – medical literature.
Some of the major problems in studies have to do with the misuse of statistics. Inferential statistics were originally applied to medical research to account for the ways in which a study group, by chance alone, might not exactly reflect the whole universe of patients to whom the results might be applied.
Such statistics are not typically used in physics, because no one believes that a study of the magnetic properties of electrons might be invalid because the electrons tested might behave differently from electrons not included in the study. Statistics in medicine are used not to learn what happened in the group being tested (we know that from the raw results alone), but rather how well we can expect those results to apply to other patients with the same problem who were not part of the experiment.
When we evaluate every single person in a group, we can describe what we find, and there is no need for inferential testing to characterize that group. If we gathered every person in the world with a rare genetic condition, we could simply paint a picture of that condition. In the same way, if the Lakers beat the Celtics four games to three in the championship, we can say the Lakers are the champs, pure and simple. It is when we want to extrapolate from a sampled group to a larger, untested group that we run into problems.
We cannot say, for example, that the Lakers are the “better” team, because that implies that out of an infinite number of possible games they would win more than half, and we don’t know that this is true. It’s intuitively obvious that even if the teams were absolutely equal in ability, one of them would still have to win four of the first seven. And even if one team were minimally worse than the other (such that over an infinite number of games it would win only 49.999%), it might easily win four (or even five, or six, or all) of the first seven.
This is because, in the absence of absolutes (one team so much better that it would win 100% of the games), there is always the chance that an unlikely event will occur. Thus, in flipping coins you might get four heads in a row, or seven tails, even though the two sides of the coin are identical. Since in medicine we usually can’t test the entire universe of people with a given condition, or treated with a given drug (whereas a seven-game series is, by definition, the entire universe of the championship), we make observations on a sample group and try to extrapolate to all the untested people with the same condition. To do so, we need to decide whether we got the results we did because they “really” describe a characteristic common to everyone with the condition, or merely because chance handed these characteristics to the particular subset we tested, even though the whole group, had we been able to test it, might have looked different.
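This is easy to check for yourself. Here is a minimal simulation in Python, a sketch rather than anything from the literature, using the hypothetical 49.999% team from the example above:

```python
# Simulate best-of-seven series between a team that would win only 49.999% of
# an infinite number of games and its minimally better opponent. (The figure
# is the hypothetical one from the text, not a real measurement.)
import random

def worse_team_wins_series(p=0.49999):
    """Play one best-of-seven series; return True if the worse team reaches 4 wins first."""
    wins = losses = 0
    while wins < 4 and losses < 4:
        if random.random() < p:   # the worse team wins this game
            wins += 1
        else:
            losses += 1
    return wins == 4

trials = 100_000
victories = sum(worse_team_wins_series() for _ in range(trials))
print(f"The worse team took the series in {victories / trials:.1%} of {trials:,} tries")
# Typically prints a figure indistinguishable from 50% -- seven games simply
# cannot separate a 49.999% team from a 50% team.
```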
So, when we see that more patients admitted for chest pain with initial EKG changes die than those without such changes, we know that in the tested group EKG changes were associated with increased mortality. But what we really want to know is whether the same would hold true for all the patients who will someday be admitted with chest pain, and whom we didn’t get to test in this study. Will they also have a higher mortality if they have EKG changes, or did it happen in this group by chance alone, despite no “real” relationship?
To answer that, we can apply mathematical tests that tell us how likely it would be to find the difference we did if the two groups (those with EKG changes and those without) were really exactly the same with regard to outcome. That’s the famous null hypothesis: we assume there is no real difference between the groups, and the test tells us not how different the groups actually are, but merely how likely or unlikely our results would be if the groups really were the same.
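To make that concrete, here is a small sketch with invented numbers (say, 30 deaths among 100 patients with EKG changes versus 10 among 100 without; no actual study is being quoted). It shuffles the outcomes at random, which is exactly what “the groups are really the same” means, and counts how often chance alone produces a gap as big as the one observed:

```python
# A sketch of the null hypothesis in action, using invented numbers.
import random

deaths = [1] * 40 + [0] * 160       # 40 deaths among 200 patients (hypothetical)
observed_gap = 30 / 100 - 10 / 100  # a 20-percentage-point mortality difference

trials, extreme = 10_000, 0
for _ in range(trials):
    random.shuffle(deaths)          # pretend EKG status is irrelevant to outcome
    group_a, group_b = deaths[:100], deaths[100:]
    gap = abs(sum(group_a) - sum(group_b)) / 100
    extreme += gap >= observed_gap

print(f"Chance alone produced a gap this big in {extreme / trials:.2%} of shuffles")
# A very small percentage means results like ours would be rare IF the groups
# were truly identical -- which is all a test of the null hypothesis can say.
```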
If we find a huge difference in mortality, 80% vs. 20%, it is intuitively obvious (and also true) that this is not likely related to chance (although it could happen; think about heads and tails). The flip side is that it is equally obvious, and true, that for the same degree of difference, the larger the number tested, the less likely that chance is the cause. If we flip two coins and one comes up heads 4 times out of 5 while the other comes up heads 1 time out of 5, we can easily imagine that the two are really the same. However, if one were heads 800 times out of 1000 and the other only 200 out of 1000, it would be very unlikely that this occurred by chance alone.
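We can put numbers on both coin examples. The sketch below uses Fisher’s exact test, one standard way of asking how likely a 2×2 split would be if the null hypothesis were true (any similar test would make the same point):

```python
# The same 80%-vs-20% split at two very different sample sizes.
from scipy.stats import fisher_exact

small = [[4, 1], [1, 4]]          # heads/tails for each coin after 5 flips
large = [[800, 200], [200, 800]]  # identical proportions after 1000 flips

for label, table in (("5 flips each", small), ("1000 flips each", large)):
    _, p = fisher_exact(table)
    print(f"{label}: P = {p:.2g}")
# The small table gives P around 0.2 -- entirely compatible with chance.
# The large table gives an astronomically small P -- chance alone is no
# longer a believable explanation for the same 80/20 split.
```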
Next month, we’ll go a bit further and discuss “P” values and the related notion of a study’s power, applying these statistical tools to the null hypothesis. Small studies are likely to miss a difference even when a real one exists, while very large studies may declare “significant” a difference too small to matter clinically. So, can “P” values sort all of this out? Perhaps, but misapplying this tool can lead to inappropriate conclusions. We’ll discuss how.
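As a preview, here is a rough simulation of the first half of that claim, with an invented “real” difference (mortality of 30% in one group, 20% in the other) studied at two sample sizes:

```python
# How often does a study detect a real 30%-vs-20% difference at P < 0.05?
import random
from scipy.stats import fisher_exact

def study_finds_difference(n, p1=0.30, p2=0.20, alpha=0.05):
    a = sum(random.random() < p1 for _ in range(n))  # deaths, group 1
    b = sum(random.random() < p2 for _ in range(n))  # deaths, group 2
    _, p = fisher_exact([[a, n - a], [b, n - b]])
    return p < alpha

for n in (50, 1000):
    hits = sum(study_finds_difference(n) for _ in range(300))
    print(f"n = {n} per group: difference detected in {hits / 300:.0%} of simulated studies")
# With 50 patients per group, the very real difference is missed most of the
# time; with 1000 per group it is found almost every time.
```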
Dr. Hoffman is a professor at the UCLA School of Medicine and a faculty member in the UCLA Emergency Department. He is also the associate medical editor for Emergency Medical Abstracts.