Part 2 in a series
Continuing our discussion on how to understand the literature that we read, we move to a way in which internal validity is threatened: improper classification of results. This can be because of lack of an adequate gold standard, because of biased estimation of results (usually in the absence of blinding), or because of imprecise or irreproducible measurement of results.
A gold standard is a test by which study results are classified. It should represent the best available way of determining the presence or absence of a condition (or diagnosis, or outcome, etc). Thus comparing clinical findings to autopsy findings allows us to be sure that the clinical findings really do or do not represent features of a given disease, assuming the pathologic diagnosis of that disease is accurate. (Of course even here there’s a problem with external validity, since clinical findings in patients who die from a disease, and get autopsied, may be very different than findings with the same disease in patients who do not die!) The less pure “gold” the gold standard, the less we can trust the accuracy of a study which uses it to judge some other parameter. Let’s look at a few examples.
In a 1970s study of pulmonary embolism (PE), a world-famous surgeon described the clinical characteristics of patients with this disease. He did this by reviewing the records of 1000 consecutive patients with a discharge diagnosis of PE hospitalized at his institution. In this case, the gold standard for the diagnosis was the opinion of clinicians, based on their prejudices about how PE presents. Since they didn’t use angiograms (or even V/Q scans) to make the diagnosis, this “study” represents obvious circular reasoning: here’s what I think PE looks like, so when someone looks like this I call it PE; then I go back and look at the characteristics of patients I called PE, and lo and behold they’re just the same as what I thought they would be!
The authors of this study can’t be faulted for not having better tests available at the time, and it would have been OK (though not terribly useful) for them to write “here are the patient characteristics that we feel suggest PE.” But the fact that they called this a study, and claimed it defined what PE looked like, shows that they didn’t understand that a gold standard has to be defensible, and it has to be independent of the variable being judged against it. (In this instance clinical characteristics were both the gold standard for making the diagnosis and the parameter “defined” by the “study.”)
By the way, whatever is compared to the gold standard can never do better in a study than the gold standard itself, because to the extent that it performs differently, it will be assumed to be worse. Thus, if angiograms had subsequently been done on these 1000 patients, many of them would surely have proven negative. We would now of course assume that the angiogram is more accurate than the clinical guesswork, but if the accuracy of the newfangled angiogram were evaluated against the gold standard of “diagnosis of PE” (on clinical grounds), it would have looked like it had many false negatives.
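To make that concrete, here is a minimal sketch (my own illustration, not from the study; all of the numbers are assumptions) of how a nearly perfect new test ends up looking inaccurate when it is scored against a flawed gold standard:

```python
# A hypothetical simulation: a nearly perfect "angiogram" judged against a
# flawed clinical "gold standard".  All figures below are illustrative assumptions.
import random

random.seed(0)
N = 1000

# Assume only 60% of patients given a clinical label of PE truly have PE.
truth = [random.random() < 0.60 for _ in range(N)]   # true disease status
clinical_label = [True] * N                          # the old "gold standard": everyone called PE

# Assume the new test is 98% accurate against the truth.
angiogram = [t if random.random() < 0.98 else not t for t in truth]

# Scored against the flawed gold standard, the better test looks bad:
apparent_false_negatives = sum(1 for a, c in zip(angiogram, clinical_label) if c and not a)
print(f"Apparent 'false negatives' of the angiogram: {apparent_false_negatives} / {N}")
# Roughly 40% of cases: the angiogram correctly says "no PE", but the study
# charges it with a miss because the clinical gold standard said "PE".
```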
Sometimes the problem with gold standards has to do with the inexactness of the criteria used to decide whether results are positive or negative. We all know how hard it often is to decide whether or not there’s a pneumonia on the CXR of a toddler with a fever. This leads to the concept of reliability, or the likelihood that a test will be read the same way twice in a row. When we want to know if two or more people reading the same test (or evaluating the quality of care in the same patient, etc) would always, or usually, agree, we need to find some way to evaluate interobserver variability. But it’s important to know about intraobserver variability as well, since with many tests the same person might not come to the same conclusion every time.
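One common way to put a number on interobserver agreement, beyond what chance alone would produce, is Cohen’s kappa. Here is a minimal sketch (my own illustration; the two readers and their CXR reads are invented), assuming each film is read simply as “pneumonia” or “no pneumonia”:

```python
# Cohen's kappa: agreement between two raters, corrected for chance agreement.
# The readers and their reads below are hypothetical.

def cohens_kappa(reads_a, reads_b):
    """Chance-corrected agreement for two raters making binary calls (1/0)."""
    n = len(reads_a)
    observed = sum(a == b for a, b in zip(reads_a, reads_b)) / n
    # Chance agreement: probability both happen to say "positive" plus both "negative".
    p_a, p_b = sum(reads_a) / n, sum(reads_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# 1 = "pneumonia", 0 = "no pneumonia", on the same ten films
reader_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
reader_2 = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]
print(f"kappa = {cohens_kappa(reader_1, reader_2):.2f}")
# 0.40 here: only fair agreement, even though the raw agreement is 70%.
```

(The same calculation, comparing one reader against himself on a second pass through the films, is a natural way to quantify intraobserver variability.)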
Another great study about scaphoid fractures illustrates the same principle from a different angle. We’re always told that initial films may be false negative for scaphoid fracture, so we should splint patients, and repeat films at two weeks, by which time they’re supposedly definitive, in any equivocal cases. However, when senior orthopedists and radiologists were given a whole bunch of films, from day 1 and day 14, they absolutely couldn’t tell which were which. They were no more accurate on the later than the earlier films. They read them differently from each other, and they sometimes read the same film differently when given a second copy of it!! This shows that the reason we’re so much better in real life with the 2-week film is not because it’s a clearer film (as classically taught), but because by 2 weeks we know the answer clinically, so it’s easy to read the film “correctly.” When radiologists keep asking for clinical information, it’s not only so they’ll know where to look, but also so they’ll know how to read the films!!
At UCLA, a member of the faculty once did a study comparing plain x-ray findings in patients who were later proven to have either 2nd or 3rd degree lateral ankle sprains. He made all sorts of measurements, and concluded that at one area of the mortise the distance was always enlarged with 3rd degree sprains and never so with 2nd degrees. This seemed to be a very useful finding, since it would allow separation of the two on initial presentation.
But the measurements, which were in millimeters, had to be made using instructions like “measure from the densest portion of the white line at the level of the indentation in the lateral malleolus, etc…” Using such inexact criteria and looking for tiny differences might not be easy to replicate. Not surprisingly, a second person, measuring the same distances but blinded to the final diagnosis, could not reproduce the results at all. And when the principal investigator himself was given a sub-sample of the group and not told the final diagnosis or his initial measurement, he too came out with results that not only showed total overlap between 2nd and 3rd degree sprains, but also didn’t at all match his own earlier measurements.
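One simple way to ask whether such millimeter measurements are reproducible is to compare two sets of readings of the same films and look at the mean difference and its spread (Bland-Altman style limits of agreement). This sketch uses invented numbers purely to illustrate the idea; it is not the UCLA data:

```python
# Hypothetical mortise-width measurements (mm) on the same ten films,
# read twice; the values are invented for illustration.
from statistics import mean, stdev

first_read  = [4.1, 5.3, 3.8, 6.0, 4.9, 5.5, 4.2, 6.3, 5.0, 4.7]
second_read = [4.8, 4.6, 4.5, 5.1, 5.6, 4.9, 5.0, 5.4, 4.3, 5.2]

diffs = [a - b for a, b in zip(first_read, second_read)]
bias = mean(diffs)
spread = 1.96 * stdev(diffs)
print(f"mean difference {bias:+.2f} mm, "
      f"95% limits of agreement {bias - spread:+.2f} to {bias + spread:+.2f} mm")
# If the limits of agreement are as wide as (or wider than) the difference said
# to separate 2nd- from 3rd-degree sprains, the criterion cannot be
# reproducible enough to be clinically useful.
```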
Finally, let’s not forget that bias can be intentional or subliminal. When some UCLA investigators showed that journal ads for drugs are (deliberately) misleading it reflected more on the journals (which should supposedly be interested in accuracy) than on the advertisers, who can be expected to want to put their products in the best light. The same thing is common in studies!
Whenever you see a pain study without a placebo comparison group (placebo being a very powerful effect), without blinding of the investigator to treatment group, or without blinded estimation of treatment effects, you should place very little credence in its results. This is because authors, in a variety of ways, can tremendously impact the result out of their desire to prove their hypothesis.
Investigators can greatly impact outcomes, in many ways, even when they are not deliberately misjudging things. In a fascinating study of placebo effects, some physicians were told their chronic pain patients would be getting (in a double-blind fashion) either a pain-killer or placebo, while another group of physicians were told their patients would be getting either naloxone (which could be anti-analgesic) or placebo. The clinicians, who neither knew what any individual patient would get, nor that there was a separate group of patients getting different medications, then enrolled their own patients. In the analgesic group, patients who received placebo got some benefit compared to baseline, not surprisingly. What was astounding, though, was that in the naloxone group, patients who received placebo got worse (as did patients getting naloxone)!
This shows not only that the placebo effect has to do with more than merely endorphin release (since it can be anti-analgesic as well), but that it can be transmitted merely through the expectations of the physicians: when doctors thought some patients would get relief (even without knowing which ones), they managed to convey something that produced a positive effect in all the patients, while all the patients got worse when doctors thought some of them were getting a drug which could increase pain.
Since most diagnoses of otitis media (OM) probably occur in patients without actual OM, and since even when there is actual middle ear infection it’s frequently viral, and since many (of the relatively uncommon) bacterial infections clear up even without treatment, and finally since “clinical cure” frequently does not match bacteriologic cure, a “Pollyanna Phenomenon” has been described showing that almost any treatment will seem to produce clinical cure about as well as even an excellent antibiotic. Thus, one could not expect, when comparing a drug with 90% microbiologic efficacy to one with 30% efficacy, to find statistical differences in “clinical cure” rates unless 500 or more patients were enrolled, and it would take many thousands to find statistical benefits of a 90% effective drug over one that cured 75% of bacterial OM infections.
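The arithmetic behind that claim can be sketched with the standard normal-approximation sample-size formula for comparing two proportions. The clinical cure rates assumed below (92% vs 85%) are my own illustrative guess at how the dilution might look, not figures from any cited trial:

```python
# Sample size per arm for comparing two proportions (normal approximation),
# 80% power, two-sided alpha = 0.05.  All cure rates below are illustrative.
from math import sqrt

def n_per_group(p1, p2, z_alpha=1.96, z_beta=0.84):
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

# Suppose a 90%-effective and a 30%-effective antibiotic end up with clinical
# "cure" rates of only ~92% vs ~85% once viral cases and spontaneous
# resolution dilute the difference.
print(f"~{2 * n_per_group(0.92, 0.85):.0f} patients needed in total")
# Narrow the true difference further (e.g., 90% vs 75% microbiologic efficacy)
# and the required enrollment climbs into the thousands.
```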
Therefore, it is virtually impossible that the drug cited could really have produced such dramatic differences in outcome, unless there was an extraordinary statistical coincidence, or the people who decided which patients were improved were profoundly biased in their assessments. They didn’t have to deliberately alter results, but merely interpret non-specific clinical parameters in a manner that happened to be consistent with what they wished to find.
One final example shows the types of distortion that can occur when there is a combination of imprecise outcome criteria and a clear political bias on the part of investigators. Much of the impetus for trauma centers in the US came from autopsy studies in which “experts” reviewed care in various places and regularly concluded that preventable trauma deaths occurred much less frequently in facilities where specialized trauma teams were in house than in typical “non-trauma centers.” They then speculated about how many trauma patients physicians (surgeons) needed to see before competence was maximized (or preventable deaths minimized).
One characteristic common to most of these studies was that the expert reviewers typically came from trauma centers, and were vocal advocates of the need for such centers. Furthermore, they knew what type of facility delivered the care for each of the patients whose charts they reviewed. Could their belief that trauma centers provide better care have influenced their decisions as to whether deaths were preventable?? Might they have been more understanding of bad outcomes when they knew these had occurred at a major teaching facility and more critical of the same event at a community hospital??
The answers are obviously yes, and critical readers should have been greatly suspicious of the results even as they were published. But, several studies of the autopsy-review methodology itself clarify the issue beyond any doubt. That is because when teams of “experts” are asked to review the same cases, each blinded to the type of facility and to the opinions of the other reviewers, they agree with each other about whether deaths were preventable only slightly more commonly than if they had just flipped coins!! That is to say, such implicit review of cases is extremely subjective, and the stand an expert takes is in great part just an opinion, which a different but equally competent expert may often disagree with.
So, the only outcomes that should be considered even somewhat valid, if estimated by non-blinded observers, are those that are truly objective, such as how many subjects survived, or how many points the hematocrit changed. The more subjective the outcome, the more critical it is that control groups be employed and that investigators be unaware of treatment allocations.
Internal validity is also impacted by confounding variables: differences between the study populations other than the factor being studied or the intervention being tested. Next month, we’ll explore the problems associated with these confounders.
Jerome Hoffman, MD
Professor at the UCLA School of Medicine; Faculty at the UCLA ED; Associate Medical Editor for Emergency Medical Abstracts