Do the results of studies generalize to new situations?
For instance, suppose a study is conducted on a technique or intervention (e.g., providing health education to parents) and the study finds it to be effective for a particular outcome (e.g., improving the health of children). When the next study is conducted on (what appears to be) the same intervention and outcome, should we expect that study to ALSO find the intervention to be effective?
There are a lot of reasons why it may NOT:
(1) sampling error – studies are almost always done on only a subset (a sample) of the population of interest, which creates random variation from study to study. For instance, it could have just been a fluke that this particular group of children happened to have their health improve more than the control group of children. So the two results might differ purely by chance. Fortunately, statistical methods allow us to analyze how likely this is as an explanation, and the larger the sample we test the technique on, the lower the chance of this happening (see the small simulation sketch just after this list).
(2) questionable research practices – the first study’s apparent positive result may have been a false positive produced largely by bad research practices, such as analyzing many different outcomes and reporting only the one that seemed to work. If the study and its analysis plan are publicly pre-registered before the study is conducted, then this problem can be mostly avoided. In that case, researchers won’t be able to fool themselves about their original intention, and others can also hold them accountable to their original plan. On the other hand, it can be very useful to explore questions about data that you had not thought to explore before you looked at the data, or to engage in open-ended analyses where you aren’t certain what your hypothesis should be. But doing so comes with a greater risk of false positives.
(3) quality – the quality of the intervention and competence of the implementation may vary. For instance, if the first educational intervention studied was well designed, but the second one taught the information poorly or taught information that wasn’t actually useful, then it would hardly be surprising if the first succeeded and the second failed. More generally, a sufficiently bad implementation of any intervention will always fail! So an intervention of a certain type failing is NOT good evidence for the type of intervention not working more generally, unless a low-quality implementation is unlikely (or future implementers are unlikely to be able to create a more competent implementation). If the intervention went through multiple rounds of iterative improvement and user feedback before the study was conducted, this makes low quality a less plausible (though not impossible) explanation for a failure.
(4) technique – the two interventions could have applied different techniques in order to produce change, despite very similar-sounding names and equal levels of quality. For instance, the first intervention might have focused more on the health benefits of purifying water, while the second might have focused more on the value of hand washing. These are both “water-related health education interventions for parents”, which sound extremely similar, but one may be much more effective than the other.
(5) dosage – the two interventions could have been administered in different amounts. For instance, the first might have been a 1-week course with 6 hours of education per day, and the second a 3-day course with 2 hours of education per day. Since the former in that case has a much higher dose, it would be little surprise if it had the potential to be much more effective (though it also might be substantially more costly to implement).
(6) format – the interventions could have used the same technique but delivered it using a different mechanism. For instance, the interventions might have been teaching the same information for the same duration, but the first might have been an in-person training in groups, whereas the second could have been an online training done individually.
(7) follow-up – the two studies could differ in how long they waited to collect the outcomes. For instance, the first might have looked at health outcomes 6 months later, whereas the second might have looked at health outcomes 3 weeks later. It’s possible that 3 weeks is too soon to find effects, or that the effects only last for a few months, so 6 months is too late to find them!
(8) measurement – different methods of measuring the outcomes could have been used. For instance, the first study might have measured health by looking at children’s medical usage records at nearby clinics, whereas the second might have asked parents to self-report the health of their children. These measurements may produce results that contradict each other even if every other detail of the studies is identical, simply because they are measuring different (though related) things.
(9) pre-existing attributes – the two populations studied might have a different distribution of relevant characteristics. For instance, in the first study, the children might have had a higher level of pre-existing illness than in the second study, making it easier to improve health outcomes. If the children are already really healthy, it’s going to be tough to produce a large effect on health. But on the flip side, if the children are already extremely ill, it may in some cases be harder to help them recover than it would be to help less sick children.
(10) culture – the two populations might differ culturally or in their personalities, language, attitudes, or beliefs. For instance, in the first study, the local culture may have made people more interested in the learning material, whereas in the second study, the local culture may have made people less trusting of the information because it contradicts traditionally held beliefs.
(11) environment – the surrounding environment that the populations exist in, or the resources already available, might have differed in the two studies. For instance, the first study group might live in an area where health resources are scarce, whereas the second group might live in one where alternative health resources are more abundant, making it harder to add value for the latter population. Or, for instance, in one area, certain easily treated childhood diseases might be common, whereas in the other region, the common childhood diseases might be harder to treat.
(12) control group – oftentimes, studies use control groups to help determine whether an intervention caused a change beyond what would have otherwise happened, but the choice of control can differ across studies. For instance, the first study might have used a wait-list control, where people get nothing for a while, whereas the second study’s control group might have received an educational intervention that is not expected to produce health improvements (i.e., an “active control”). Or it could be that the first study used statistical control after the fact (e.g., using linear regression), whereas the second study randomized some of the study participants to a control group. The former method is typically weaker in that it may not fully account for confounding variables, but in some instances, the latter method may be weaker if the size of the randomized control group is very small.
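To make reason (1) above more concrete, here is a minimal simulation sketch in Python (the true effect size, the outcome, and the sample sizes are invented purely for illustration) showing how two studies of an identical intervention, run on the same population, can still produce noticeably different estimates through sampling error alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" effect: the intervention raises the chance of a good
# health outcome from 50% (control) to 60% (treated) in the full population.
p_control, p_treated = 0.50, 0.60
n = 100  # children per arm in each study

def run_study():
    treated = rng.binomial(1, p_treated, n)
    control = rng.binomial(1, p_control, n)
    return treated.mean() - control.mean()  # estimated effect in this study

# Two studies of the *same* intervention on the *same* population can still
# disagree noticeably, just because each only sees a small random sample.
study_1, study_2 = run_study(), run_study()
print(f"Study 1 estimated effect: {study_1:+.2f}")
print(f"Study 2 estimated effect: {study_2:+.2f}")
print(f"True effect:              {p_treated - p_control:+.2f}")
```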
— Can we standardize effect sizes to make results more comparable across studies? —
How do we fix the problems above?
Well, one approach that is intended to help is to try to make outcomes more comparable across studies using statistical tricks for “standardizing” them, for instance by converting the size of the effects to “Cohen’s d” scores.
Example: calculate how much people getting the intervention changed in the outcome of interest, subtract away the change in the control group, and then divide by the pooled standard deviation (from both groups). Use this as the measure of how well each intervention worked.
So rather than having one intervention’s effect measured in one unit (like fraction of people cured) and another measured in another unit (like reduction in symptoms on a 1-7 symptom scale), making these outcomes hard to compare, all the effect sizes now are measured as “number of standard deviations that the intervention group changed relative to the change in the control group.”
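For concreteness, here is a minimal sketch of that calculation in Python (the data and variable names are made up for illustration; a real analysis would also report the uncertainty around the estimate):

```python
import numpy as np

def cohens_d(treatment_change, control_change):
    """Standardized effect size: difference in mean change between groups,
    divided by the pooled standard deviation of the two groups."""
    t = np.asarray(treatment_change, dtype=float)
    c = np.asarray(control_change, dtype=float)
    mean_diff = t.mean() - c.mean()
    # Pooled standard deviation, weighting each group's variance by (n - 1).
    pooled_var = (((len(t) - 1) * t.var(ddof=1) + (len(c) - 1) * c.var(ddof=1))
                  / (len(t) + len(c) - 2))
    return mean_diff / np.sqrt(pooled_var)

# Hypothetical pre-to-post changes on some outcome scale:
treatment = [2.1, 1.4, 0.9, 2.6, 1.8, 0.3, 1.1, 2.0]
control   = [0.4, -0.2, 0.8, 0.1, 0.6, -0.5, 0.9, 0.2]
print(f"Cohen's d ≈ {cohens_d(treatment, control):.2f}")
```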
This certainly improves the situation of comparing two interventions, but only *somewhat*. It doesn’t address most of the issues mentioned in the list above, and the issues it does address are only partially mitigated by such standardization.
Two Cohen’s d values that are the same can mean quite different things with respect to how “effective” an intervention really is. For instance, suppose that the outcome of one intervention was recorded as 0 for each participant not completely cured, and 1 for each participant completely cured, whereas the outcome for the second intervention was recorded as 0 when there was no improvement, and 1 when there was any improvement. Even if the interventions and populations are identical and the Cohen’s d for both is 0.50, this does not imply that the interventions are anywhere close to as useful as each other!
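A rough numeric illustration of this point (the cure and improvement rates below are invented, chosen only so that the two d values come out similar, and this uses one common way of computing d for 0/1 outcomes):

```python
import math

def binary_cohens_d(p_treat, p_control):
    """Cohen's d when the outcome is coded 0/1, using the pooled standard
    deviation of the two Bernoulli-coded outcomes."""
    pooled_sd = math.sqrt((p_treat * (1 - p_treat) + p_control * (1 - p_control)) / 2)
    return (p_treat - p_control) / pooled_sd

# Outcome coded as "completely cured" (1) vs. not cured (0):
d_cured = binary_cohens_d(p_treat=0.30, p_control=0.10)
# Outcome coded as "any improvement" (1) vs. no improvement (0):
d_improved = binary_cohens_d(p_treat=0.75, p_control=0.52)

print(f"d for 'completely cured': {d_cured:.2f}")    # roughly 0.5
print(f"d for 'any improvement':  {d_improved:.2f}")  # also roughly 0.5
```

The two standardized effects look nearly identical, even though “30% of children completely cured” and “75% of children improved at least a little” are very different claims about how useful an intervention is.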
That being said, standardization is useful in that it at least somewhat reduces the problem of incomparability of outcomes.
— effect size doesn’t exist —
A potentially deeper and more disturbing point (one seemingly often overlooked) is that the “effect size” (i.e., strength of effect) of an intervention doesn’t really exist.
What does exist is the effect size of an intervention that’s been implemented in a particular way for a particular duration of implementation, and that’s been calculated using a particular choice of effect size measure, when applied to a particular population that’s being sampled using a particular sampling procedure.
Horrible to state, but its consequences are more horrible still.
That means: the “effect size of parent health education on child health” isn’t really a thing.
But: the “Cohen’s d of this particular 3-month health education intervention applied to randomly sampled families in a particular city using the outcome of medical clinic usage” does exist!
If we talk as though the “effect size of parent health education on child health” exists, we are either being sloppy, or optimistically hoping that all of those other details don’t matter much in terms of the effect size, or implicitly assuming some specific set of study attributes and procedures that we simply aren’t bothering to make explicit.
Now insofar as the “effect size of parent health education on child health” doesn’t depend on the messy details of things like which health education intervention we’re talking about, and which population of children we’re talking about, and what duration of intervention we’re talking about, etc. etc. it can be sensible to treat things as though there really is an “effect size” of that intervention. But really, how often do we expect that to be true?
— the average over what? —
This issue of the effect size not existing is actually an example of a more general mathematical problem, which is that if you say “this is the average of variable X”, a mathematician’s natural response might be, “The average of variable X taken over what probability distribution?” By that, they will mean that the average depends on the sampling procedure used. In other words, an average does not exist unless a sampling procedure is stated (or implicitly assumed).
Now, if someone says “the average age of people in this city”, the implied sampling procedure is probably this: choose people uniformly at random from this city, get the age of each such person, then take the average of those ages. But if the way we attempted to collect ages was to stand on a particular street and ask people their age as they passed, this would not be a uniformly random sampling procedure.
Presumably, there is some correlation between someone’s age and the chance that they bother to talk to you on that particular street about their age. The “average age” you calculate would be “the average age of people in this city who will stop to talk to me when I’m standing in this spot over this period” rather than “the average age of people in this city.”
Now, it might be the case that the sampling procedure used turns out to be a pretty good proxy for the average age overall (if one were to sample uniformly at random), but then again, it might not be (e.g., you might dramatically under-sample newborns who rarely approach you on the street).
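As a toy illustration, here is a small simulation (the city’s age distribution and people’s willingness to stop are both invented) showing how the street-corner procedure and uniform random sampling can give quite different “averages”:

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented city: ages drawn uniformly from 0 to 90.
ages = rng.integers(0, 91, size=100_000)

# Invented stopping behavior: newborns and young children essentially never
# stop to answer, and willingness to stop rises with age.
stop_prob = np.clip(ages / 90, 0.02, 1.0) * 0.5
stopped = rng.random(ages.size) < stop_prob

print(f"Average age (uniform sample of the city):       {ages.mean():.1f}")
print(f"Average age (people who stopped on the street): {ages[stopped].mean():.1f}")
```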
— So…can we not generalize the results of studies? —
All of the above may sound very pessimistic. It may sound like doing a study on one intervention tells us almost nothing about future similar studies on similar interventions.
But what I’ll call the “five C’s of generalizability” should make us somewhat more optimistic:
(1) Constants – many psychological and biological features of humans transcend time and place (because they are hard-wired into us via genetics, or because the intervention is only planned to be used on a particular group of people that share a lot of common characteristics). Insofar as we’re working with these more basic mechanisms (or across homogeneous groups), we can expect a higher degree of generalization.
That being said, when we’re used to being embedded in one culture, it can sometimes be surprising what varies across cultures, so we have to be cautious about positing human universals.
(2) Causality – the more deeply we understand the causality underlying why an intervention works, the more we can predict when it will or won’t generalize. For instance, if we know it will only work where assumptions A, B, and C are true, then we can ask about the extent to which those assumptions hold in a particular time and place. So if we are in a situation where our model of the underlying causal factors predicts the intervention will work, we can be more confident that it really will.
That being said, causality can be very tricky to uncover.
(3) Consistency – if studies find consistent effects for an intervention across varied contexts, then we may not know quite how well it will work in a new context, but we at least have evidence that the effect is fairly robust and may generalize across new contexts as well.
That being said, this only helps if we have found a fairly consistent pattern, and we still might get unlucky and find a new context where the intervention doesn’t actually work.
(4) Capability – if an intervention is found to be highly effective, it is more likely to generalize to other contexts. Even if variations in contexts reduce the effectiveness, hopefully, some of that very large effect will still remain.
That being said, it is certainly possible for something to work extremely well in one situation and fail totally in another.
(5) Circumstances – if the circumstances in which an intervention was previously found to work are very similar to those in which it is being applied now (including the sort of people it’s being applied to, the precise intervention being applied, the way outcomes are measured, the surrounding culture, etc.), we can be more confident that it will work about as well as it did before. For instance, if you take an intervention that worked in one school district and apply that exact same intervention in the same way in a similar school district nearby, it is more likely to work as it did previously than if you apply a modified version of the intervention in a different country.
That being said, it is often too limiting to only apply interventions in similar contexts to those in which they were previously tested.
— empirical data —
So how well, empirically speaking, does an intervention study done in one context tend to generalize to another?
Eva Vivalt has done some really interesting work on this topic, with an emphasis on international and economic development (e.g., see this paper: https://bit.ly/2yWQpRG). Also see this chart from the paper (https://i.imgur.com/6SVoVnr.png), which shows the standardized effect sizes for many different intervention/outcome pairs, such as bed nets for malaria and conditional cash transfers for attendance rate. The chart can help you gain an intuitive feeling for the extent to which generalizability does or does not occur for studies that are ostensibly on the same intervention and outcome, and the paper gives a lot more detail.
— What do we do about inconsistent effects? —
As you can see from Eva’s chart, it is pretty frequently the case that one study of an intervention will find a positive effect on an outcome, while another study finds approximately no effect or even a negative effect.
To understand why this happens, we can return again to our list of reasons that studies don’t generalize. We can ask ourselves, was this due to…
(1) sampling error?
(2) questionable research practices?
Or perhaps differences in…
(3) quality of the implementation?
(4) the technique used?
(5) dosage?
(6) format?
(7) follow-up?
(8) measurement?
(9) pre-existing attributes?
(10) culture?
(11) environment?
(12) control group?
We can’t always tell what the causes of different results are, but at least we can TRY to figure it out by considering the different possibilities and attempting to home in on likely candidates.
What’s more, we can apply the “five C’s of generalizability”.
The more that our intervention relies on human *constants*, the more our model of the underlying *causality* predicts we’ll get generalizability in this case, the more we have *consistency* of past effects across varied situations, the more that past studies produced high *capability* of achieving large effects, and the more similar this situation is to the *circumstances* where the effect worked in the past, the less of a problem we’ll tend to find with generalization.
— one of my favorite solutions for generalizability —
One of my favorite solutions to this problem of poor generalizability is different from any of those I’ve discussed above.
When feasible and not prohibitively expensive or difficult, I think it is great for intervention developers to study their exact intervention using an outcome they actually care about on a population that is the same as (or much like) the real one that the intervention will be applied to.
In other words, merging the study with the real-world deployment of the intervention.
This neatly sidesteps generalizability concerns to a large degree, not by ensuring generalizability to new groups, but by testing your results on the group you most care about!
Then, as you “generalize” your intervention to new contexts or groups, if you keep collecting data of the right form as you deploy it more broadly, you may be able to detect whether its effectiveness is successfully generalizing as well, and even better, try to discover what is or isn’t causing it to work.
This piece was first written on July 19, 2018, and first appeared on my website on December 11, 2025.