**Part II: Small Samples Create Questionable Results**

How many patient satisfaction surveys are necessary to obtain a statistically reliable look at the performance of hospitals and health care providers? Press Ganey states that only 30 survey responses are needed to draw meaningful conclusions, although they prefer to have at least 50 responses before analyzing the data. We asked Dr. Eric Armbrecht, a statistician and Assistant Professor at St. Louis University’s Center for Outcomes Research, and Dana Oliver, a biostatistician at St. Louis University, whether they agreed.

Dr. Armbrecht suggested that analyzing only 30-50 responses would lead to unacceptably wide confidence intervals and would substantially limit the generalizability and use of the data obtained, regardless of whether 3,000 or 10,000 patients were surveyed. Dr. Armbrecht explained that low response rates could create confidence intervals as wide as 50%, which could be similar to just flipping a coin to determine whether the data is representative of an entire population’s perceptions. Breaking down those same 30-50 responses in an attempt to analyze satisfaction scores of individual physicians would create even less reliable results, because the number of responses per physician would be even smaller. Ms. Oliver also disagreed with Press Ganey’s assertion that 30 or 50 responses would result in statistically sound data, noting that those numbers could be “arbitrarily chosen” by some survey methodologists.

How many responses are necessary in order to have statistically reliable data? The answer depends upon the size of the population being sampled. Assuming a margin of error of 4% (which is double the margin of error that Press Ganey would like to use) and assuming the standard 95% confidence level, the minimum sample sizes that Dr. Armbrecht recommended for populations of 1000, 2500, 5000, and 7500 would be 375, 484, 536, and 556 respectively. He noted how the required number of responses tends to flatten out as the population grows and cautioned that these figures would only apply to “yes/no” questions (such as whether or not a doctor was “very good”). In order to measure the validity of rating scales (such as those from 1-5), the calculations become somewhat more difficult and depend upon the standard deviation in the sample population. Dr. Armbrecht gave an example: using a 1-5 scale with a standard deviation of 0.7 and a margin of error of 10% (which is five times higher than Press Ganey seeks), 188 responses would be needed in order to reliably estimate the responses from the general population. Dr. Armbrecht recommended online statistical calculators such as those available at Creative Research Systems (http://www.surveysystem.com/sscalc.htm) to help determine the statistical significance of most data.
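The article does not say which formulas Dr. Armbrecht used, but his figures are consistent with the textbook sample-size formula for a proportion with a finite population correction, and with the usual formula for estimating a mean. The following is a minimal sketch under stated assumptions: a worst-case proportion of 0.5, a z-score of 1.96 for 95% confidence, the “10%” margin on the rating scale read as 0.1 scale points, and results rounded to the nearest whole response. The function names are illustrative only.

```python
Z_95 = 1.96  # z-score for a 95% confidence level (assumed)

def sample_size_proportion(population, margin, p=0.5, z=Z_95):
    """Responses needed for a yes/no question, corrected for a finite population."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # size needed for a very large population
    return n0 / (1 + (n0 - 1) / population)      # finite population correction

def sample_size_mean(std_dev, margin, z=Z_95):
    """Responses needed to estimate a mean rating within +/- margin scale points."""
    return (z * std_dev / margin) ** 2

if __name__ == "__main__":
    # Yes/no question, 4% margin of error, populations from the article
    for population in (1000, 2500, 5000, 7500):
        print(population, round(sample_size_proportion(population, 0.04)))
    # 1-5 rating scale, standard deviation 0.7, margin of 0.1 scale points
    print(round(sample_size_mean(0.7, 0.1)))
```

Under these assumptions the script reproduces 375, 484, 536, and 556 for the yes/no case and 188 for the rating-scale case. Rounding up rather than to the nearest whole number would be the more conservative choice.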

Aside from low response rates, Dr. Armbrecht and Ms. Oliver described additional problems that can occur when using a 1-5 scale in satisfaction surveys.

If hospital administrators seek to be at or above the 90th percentile in satisfaction scores, asking patients to grade performance on a 1-5 scale essentially creates a system with one passing grade and four failing grades. If patients are not aware that a score of “4” is a failing grade, the data that they provide may be misinterpreted when being analyzed. In addition, patients may perceive a small relative difference between a grade of “4” and “5” on a survey, but may perceive a larger relative difference between a “3” and a “4” on the same survey, creating a system in which they grade “so-so” care with the same score as “just less than perfect” care. Finally, with small sample sizes, one unhappy customer can turn many “passing” grades into failing grades. Four patient scores of “perfect” fives can be brought down to “failing” fours by one extremely unhappy patient who grades a provider or hospital with scores of all zero. Our experts noted that a simple way to avoid these analytical problems was to create a dichotomous scoring system with “yes-no” questions. For example, “Did your care meet your expectations?”
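The arithmetic behind the “one unhappy patient” example can be checked in a few lines. This is only an illustration: the article’s example uses a score of zero, while a strict 1-5 scale would bottom out at 1, so both cases are shown. The helper name `average` is ours.

```python
def average(scores):
    """Plain arithmetic mean of a list of survey scores."""
    return sum(scores) / len(scores)

four_perfect = [5, 5, 5, 5]

# One extremely unhappy patient scoring zero, as in the article's example
with_zero = average(four_perfect + [0])

# The same patient on a strict 1-5 scale, where 1 is the lowest possible score
with_one = average(four_perfect + [1])

print(average(four_perfect), with_zero, with_one)
```

Four fives average 5.0. Adding a single zero pulls the mean to 4.0, and adding a single 1 pulls it to 4.2, so either way one response out of five is enough to drop the provider out of the “passing” grade.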

**Clarifying Terms**

Press Ganey’s literature contains several other statistical terms that our experts felt it was important to understand when analyzing the utility of patient satisfaction scores.

The “standard error of the mean” is the standard deviation of the sample mean, that is, a measure of how much the mean would be expected to vary from one sample to the next. Ms. Oliver noted that before performing any type of statistical testing, it is a good idea to first plot a histogram of multiple sample responses to determine whether survey data will be distributed in a normal bell curve pattern. If the survey responses are not distributed in a bell curve pattern, conclusions cannot be drawn from the data unless the variability of the data is low.

Press Ganey literature relies on the “central limit theorem” in justifying a reliance upon sample sizes as low as thirty. Ms. Oliver explained that the central limit theorem holds that the mean and median scores from very large survey samples tend to form a typical bell curve. In most cases, the central limit theorem only applies if there is a similar distribution of variables in each survey. Because patient satisfaction survey samples from specific hospitals are generally not large and because the surveys do not always have a similar distribution of variables, the central limit theorem probably would not apply to satisfaction survey data.

Analysis of survey results depends in part on the “margin of error” of the survey data. Margin of error is used to express the confidence with which survey responses can be relied upon when an entire survey population is incompletely sampled. For example, suppose that five percent of a population is surveyed and one question has a mean score of 50. If the margin of error for the question is 30, then the actual value for the response in the whole population could be anywhere between 20 and 80 (the mean score of 50 plus or minus 30). Dr. Armbrecht stated that a good estimate of a margin of error is given by the formula 1/[square root of the number of participants in the sample] (Niles, 2006). In other words, for a sample size of 100, the margin of error would be roughly 10% and for a sample size of 9, the margin of error would be roughly 33%. Achieving Press Ganey’s goal margin of error of 2% or less would require a sample size of approximately 2500.
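The rule of thumb is easy to try out. The sketch below assumes nothing beyond the 1/√n formula quoted above; the function names are illustrative. For context, the rule is close to the worst-case 95% margin for a yes/no question, 1.96 × √(0.25/n), which works out to about 0.98/√n.

```python
import math

def approx_margin_of_error(n):
    """Rule-of-thumb margin of error for a sample of n responses: 1 / sqrt(n)."""
    return 1 / math.sqrt(n)

def responses_for_margin(margin):
    """Invert the rule of thumb: responses needed for a target margin of error."""
    return round(1 / margin ** 2)

if __name__ == "__main__":
    # Sample sizes discussed in the article
    for n in (9, 30, 50, 100):
        print(n, f"{approx_margin_of_error(n):.1%}")
    # Press Ganey's stated goal of a 2% margin of error
    print(responses_for_margin(0.02))
```

By this estimate, 30 responses give a margin of error of about 18% and 50 responses about 14%, while a 2% margin requires 2500 responses.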

**Understanding Survey Limitations**

So are satisfaction surveys a useful tool for assessing the quality of medical care? Dr. Armbrecht compared analysis of survey data to sampling a pot of soup.

If you want to see how good the soup in a pot tastes, first the ingredients in the pot must be well mixed. The “mixing” of the soup is analogous to obtaining completely random data from a sample population. If you only mix the top layers of the pot, you might not get the beans and pasta on the bottom of the pot, so your sample taste will not be representative of the true flavor of the soup. Similarly, failure to completely randomize data samples by excluding certain segments in a population (such as admitted, transferred or LWOBS patients) significantly increases the likelihood that the results will be inaccurate.

If the soup is fully mixed, but you only taste a drop or two of soup, you probably won’t get a good flavor for the soup, either.

Similarly, small sample sizes from a large population are likely to provide misleading data.

Once an appropriate sample is taken, surveys can only be used to determine whether there has been a change in the sample population. Using the soup analogy, you tweak the recipe by adding or changing ingredients and take another sample to see if people like the new recipe better. Surveys can only be used to measure how the soup in a single pot is changing over time.

Sometimes survey data can be misused, though. For example, sampling the soup in two different pots can’t tell you whether one soup is better than another soup or whether one ingredient is better than the same ingredient in a different pot. Satisfaction survey statistics likewise should not be used to compare and rank different hospitals or different health care providers. Dr. Armbrecht noted that a 90% ranking at one hospital cannot be deemed better or worse than a 70% ranking at a different hospital. The demographics and variance in the patient populations being sampled don’t allow such a comparison, because variables independent of the services provided (such as patient literacy, lack of forwarding address, language barriers, payment issues, and population homogeneity) are likely to affect the data being sampled. In other words, taking the staff from a hospital with a 90% satisfaction score and placing them into a different hospital would probably not create a 90% satisfaction score in the new hospital. The only information that satisfaction surveys can provide is a determination whether a specific hospital or a specific provider at a hospital is getting better or worse over time. In order for even that determination to be made, the sample sizes must be large enough to be statistically significant.

What are the takeaway points about analysis of satisfaction survey data?

First, small sample sizes can lead to significantly unreliable data. Last month, we showed how small sample sizes resulted in a 99% change in a hospital’s percentile rank in just two months. Simply put, small response sizes lead to inaccurate results.

Second, when sample sizes are large enough, satisfaction surveys can be an important tool to gauge and improve patients’ perception of the medical care they receive. However, using survey data to compare one hospital to another or to compare one provider to another is a misuse of survey data and is likely to create misleading and unreliable results.

**Glossary of Statistical Terms**

**Mean:** The average of all the responses.

**Median:** The middle value in all the responses when those responses are arranged in numerical order. The closer that the mean and the median get to each other, the more symmetric (less skewed) the data tend to be.

**Dichotomous data:** Contains only two possible choices, such as whether the light was on or off.

**Non-dichotomous data:** Consists of multiple possible values, such as rating scales used in satisfaction surveys.

**Normal or Gaussian distribution:** Another way of describing a typical bell curve.

**Standard deviation:** The square root of the variance in a data set. Low standard deviations mean that the data points are close to the mean while high standard deviation values mean that the data points are spread out over a large range of values. When there is a normal distribution of data, about 68% of the data values will fall within one standard deviation of the mean and about 95% of the data values will fall within two standard deviations from the mean.
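The 68% and 95% figures can be checked against the standard normal distribution. The short sketch below uses `statistics.NormalDist` from Python’s standard library (Python 3.8 or later); the helper name `share_within` is ours.

```python
from statistics import NormalDist

def share_within(k, dist=NormalDist()):
    """Share of a normal distribution lying within k standard deviations of the mean."""
    return dist.cdf(dist.mean + k * dist.stdev) - dist.cdf(dist.mean - k * dist.stdev)

if __name__ == "__main__":
    for k in (1, 2):
        print(k, f"{share_within(k):.1%}")
```

The shares come out at roughly 68.3% within one standard deviation and 95.4% within two, which is where the “about 68%” and “about 95%” rules come from. They hold only when the data are actually close to normally distributed.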

**Confidence interval:** A measure of survey reliability: the range of values, calculated from the sample, within which the true value for the whole population is likely to fall. The narrower the confidence interval, the more reliable the survey results. A confidence level of 95% is the conventional standard in medical and social science research and reflects a high likelihood that the sample data reflects the population from which it was sampled.

## 13 Comments

Nearly 15 years ago, and multiple times since, I have argued this very point with administration personnel who simply did not have a clue about how to use statistics. My point is now, how can hospitals continue to use this crappy methodology to garner patient feedback? How can they justify spending precious income on such claptrap? Can someone explain?

Garbage in garbage out. Press Ganey has told me that the hospitals can do the selection of patients however they want.

Recently I became aware of a nurse manager who simply sent a PG form to all patients who complained.

Like Hoyt, I have been arguing this point for some time. I have seen hospital staff make lists of pts to lobby so they complete Press Ganey surveys when they appear and try to discourage unhappy pts from doing the same. When a patient is treated by an ED physician, two hospitalists, and three specialists, a 5 point scale asking how well your doctor treated you is useless, but expensive.

Congratulations: excellent 2-part series by Sullivan & DeLucia. About 15 years ago Deming’s associate was at the university giving a conference. Had lunch and he analyzed our Press-“Gagme” data and determined it was all “noise” and of no value to draw conclusions. Schuman & Wolfe (Pittsburgh EMS researchers of the 70s & 80s): one participated in an Am Society for Quality (ASQ) weeklong seminar along with another presenter in Chicago. Dealt with surveys – same conclusion – most were junk – 2 good systems, but P-G was not one. Poor analysis of too little data.

When I approached the hospital board with this data and the learned conclusions, they were not willing to admit that they had made a mistake investing over $100,000/year in P-G. Got a schlock statistician (adjunct part-timer) from the university to discredit ASQ and Deming. Wonder what that cost?

Common sense can tell you too few data points do not allow for drastic conclusions. When actual data gets worse but percentiles get better – duh!

Keep up the meaningful editorial coverage.

John C. Johnson, MD, FACEPe, FACPE

Past President – ACEP

EM doc for 35 years

Considering that the desired outcome is that a typical patient isn’t in the ED very often, what’s the point of a numerical scale? I’d be hesitant to put a “5” or a “4” on the only time I ever got treated in an ED; how should I know whether it was awesome or lousy? What am I going to compare it to?

I like the soup-pot analogy. To carry it a bit further: Let’s say that every spoonful you tasted had a bean in it. “confidence interval” is the likelihood that it’s bean soup, rather than only having a few beans that you were lucky enough to get one every time.

Despite this complete lack of validity, Press Ganey has a commanding market lead in the ‘Patient Satisfaction Survey’ industry, and is thus hugely profitable. This comes at the potential expense of innumerable emergency physicians whose contracts directly or indirectly provide for bonus compensation based on these scores. We should file a class-action against this organization, and further immediately demand that it cease operations until it remedies this injustice; we should further demand that until such time, it publishes full disclosure (its own ‘black-box warning’) that its methodology is statistically flawed and should not be relied upon to determine physician compensation.

Does anyone have a customer satisfaction survey that they think is worthy? Our group is trying to persuade the hospital to change the survey for the ED. If so, please email it to me. THANK YOU nmwlcyc1971@yahoo.com

P-G admits their data is useless with small sample size but claims an obligation to put it out there anyway. That’s like telling a patient they have brain cancer based on a urinalysis. It causes administrators to jump to conclusions with incomplete data. They pay a fortune for the information and feel compelled to use it no matter how flawed the information is.

Are any others having surveys returned for patients that you did not see? PG seems to be pretty cavalier regarding reliability of data, so not much effort made to make sure that the survey returned actually saw the provider indicated on the form.

I read this with my MPH/Biostats/Epidemiology hat on and these two individuals reflect all the basic principles that I learned in my biostats and epidemiology courses during my MPH. Their assessments are very basic and very sound. Too bad PG disagrees and more sad that the administrators understand less about the survey limitations than PG…. who must have done a good job at selling it. What especially rang true was using the dichotomous scale of yes-no. Another question that should always be asked when developing and later assessing any survey: “does it ask what we want to know?” I agree that the PG survey being used always sets us up for failure. My hospital will get these scores each month with 15 survey samples. They claim that at the end of the quarter when they have 45-50 responses, it is then reliable. My question then is…. “why do we get meaningless monthly results then?”

Bottom line, this survey would NEVER stand muster for a population based public health survey that could be used for any policy or resource allocation….for one simple reason….it is not statistically reliable data.

I received my first Press-Ganey survey earlier this month. I didn’t have a clue what a P-G survey was, so I asked the staff member in charge to explain the numbers. By then I had a really big clue that it was totally lacking as a measure of patient satisfaction, much less as a measure of quality of care. Later that week I was sitting at the lunch table where two top hospital honchos were “discussing” the surveys and realized, to my chagrin, that, not only were they clueless about the fundamentals of statistics, they were making decisions based on the results of the survey. Although my mouth wanted to hang open with disbelief, it had to remain shut or it would emit words I would regret (like “ignorant,” “idiotic,” “unscientific,” “total waste of time and money” and “garbage”). I’m amazed that the same species of primate that got us to the moon and back could take this survey seriously. BTW, my “n” was 6 out of 35 surveys sent. Need I say more?

I really like this course of action. They should be held accountable for this error. Furthermore I seriously doubt the president or the CEO of the hospital didn’t get their bonus based on the PG score; if they did, I’m sure the PG survey wouldn’t be used. In the long run I believe this wrong will be corrected, but only after a class-action lawsuit with follow-up government involvement and interaction in how to properly find a means to measure patient satisfaction!

I think this paper hits it right on the mark focusing on small sample sizes and unreliable conclusions. This is one of the more intuitive but definitely less appreciated laws of statistics.

There are points in the article that I don’t find to be helpful, and maybe even misleading: “However, using survey data to compare one hospital to another or to compare one provider to another is a misuse of survey data and is likely to create misleading and unreliable results.” This is such an overreach it’s ridiculous. Using survey data to compare one hospital to another might be the only way to do such comparisons. Can’t throw the baby out with the bath water.

And then when he talks about the 1-5 scale and 90th percentile (having a score of 4 or less would mean failing), he’s assuming that the distribution of scores is going to be even across the scale. You can be in the 90th percentile with any score between 1-5. Percentiles describe your sample’s score distribution, not raw score.

The part about 30-50 responses needed being ‘arbitrarily chosen by the survey methodologists’ is also misleading. That sample size is a well-known standard for surveyors wishing to have precision in their sample mean within a standard deviation. It’s based on assumptions made by the central limit theorem, so it’s not arbitrary. Certainly if the data randomly had more variance or noise then those sample sizes would need to change too.

The whole article seems to be picking apart Press Ganey based on their attempt to simplify really complex statistical methodology. We all know how difficult it is to do technical writing for a lay audience, and most doctors are lay when it comes to stats and probability. So when they talk about it they have to just skim the surface, which leaves them open to criticism of oversimplification by the author.

There also seems to be something else going on in this article: I feel like it’s using the uncertainty on which statistical methods rely against the field of surveying altogether. This is similar to how climate deniers use the fact that nothing in science can ever be proven in order to throw out all of our work to describe things in our environment (like “Defund NOAA!”). The fact that we can’t “prove” things is the reason why science works in the first place. Indeed, it’s a necessary condition of the scientific method.

It reads like someone who just learned about statistics and saw how dirty the kitchen was, and now never wants to eat at the only restaurant in town. The fact is surveying is really hard, but we have no other choice sometimes when we’re required to answer questions like “what is our patient satisfaction for a large hospital system with hundreds or thousands of physicians, nurses, and staff?” The stakes involved now with giant health systems have led to some really amazing methodology to deal with these problems (the endoscopes of the field). Which is why statistics for surveying is often taught in separate programs and always in separate classes. I would give Press Ganey the benefit of the doubt that they understand this, hired the right people for it, and chose not to get too much into the weeds in their public statements because it would take forever to explain.

But, I also understand after the Trump thing why people don’t trust this field very much right now…(even though fivethirtyeight.com did give him a 30% chance of winning the day before).