When you’re predicting how much a variable changes over time using a regression, do you tend to add the baseline value of that variable as a predictor to control for it? If you do, you can end up with misleading results.
For example, if you’re trying to predict the change in anxiety from 2024 to 2025 (anxiety in 2025 – anxiety in 2024), you’ll get misleading results if you enter anxiety in 2024 as one of the predictor variables. If this sounds counterintuitive, read on. I’m interested in how many researchers might do this and how widely it’s known to be a problem.
Quite a few papers have found a negative correlation between the signed change in the variable and its baseline value (e.g., see here, here, here, and here). For the reasons we outline below, such results can be expected even if there are no actual changes in x during the study, as long as: (1) the measurements for x1 and x2 are sufficiently noisy, and (2) there isn’t some mechanism whereby higher values of x1 somehow facilitate larger increases (or smaller decreases) in x.
You can grasp the result intuitively by looking at the formula for Δx:
Δx = x2 – x1
The larger x1 happens to be due to measurement noise, the lower the signed difference (x2 – x1) will tend to be, as long as the noise affecting x2 is independent of the noise affecting x1. Conversely, the smaller x1 happens to be, the greater (x2 – x1) will tend to be. In other words, due to regression to the mean, if you take two noisy measurements and calculate the signed difference between them, you can expect x2 – x1 to be inversely correlated with x1. But in case the intuitive explanation doesn’t work for you, we explain the result in other ways below. We also explain why this result can be a problem.
The Issue:
The issue comes into play for longitudinal studies when the outcome of interest is a signed change in some quantity (e.g., “income in 2020 – income in 2019” or “post-intervention anxiety scores minus pre-intervention anxiety scores”), specifically in situations where you try to control for the baseline value of the same variable (e.g., you include “2019 income” or “pre anxiety scores” as an independent variable in the regression).
We think this problem has implications for how we interpret the results of any papers that predict changes in a variable in this way. We explain the issue in detail below.
Suppose your goal is to study the signed change in some quantity over time. To keep things simple and concrete, let’s suppose you want to know what traits were associated with people becoming more anxious from 2019 to 2020 (on a self-reported anxiety scale). Hence, you might define your primary outcome of interest to be Δanxiety, the signed change in anxiety across time, like this:
Δanxiety = 2020_reported_anxiety – 2019_reported_anxiety
Now, you run a linear regression predicting Δanxiety using various factors to see what is linked to people’s anxiety changing. However, a person’s initial level of anxiety in 2019 (i.e., 2019_reported_anxiety) could plausibly be linked to how much their anxiety changes. For instance, if someone’s 2019 anxiety is neither particularly high nor particularly low, maybe we should expect it not to change very much, whereas if their 2019 anxiety is very extreme, maybe we should expect a greater likelihood of it changing. Therefore, in your linear regression, you include 2019_reported_anxiety as a control (i.e., an additional independent variable).
—
This all sounds very sensible (it’s precisely what we did in a recent study). You may think that controlling for the baseline value is a best practice. The unfortunate thing is that this can seriously bias your results for very subtle reasons!
Before we explain why this is a problem, please consider this puzzle and see what you think the answer is. Note that only 21% of my Twitter followers got this question correct!
—
A test of your math intuition: we use a very noisy but unbiased scale (accurate to ±40 pounds) to measure the weight of 1000 18-year-old men on Monday and again on Tuesday. By noisy, we mean that each time you use the scale, you can think of it as giving the real weight plus random noise. By unbiased, we mean that on average, the scale gives the right answer (so if you weighed the same person hundreds of times and took the average result, it would be very accurate).
Let
Δweight = Tue_weight – Mon_weight
What do you predict for the value of this correlation:
r = Correlation(Mon_weight, Δweight)
Option 1: r ≈ 0.00
Option 2: 0.1 < r < 0.9
Option 3: -0.9 < r < -0.1
Option 4: r ≈ 1.00
—
Answer: once you’ve thought about the question above, here is the answer (turn your phone upside down to read it):
ɹǝʍsuɐ ʇɔǝɹɹoɔ ǝɥʇ sᴉ ǝǝɹɥʇ ɹǝqɯnu uoᴉʇdO
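If you’d rather check this than take our word for it, here is a quick simulation of the weighing setup. The true-weight distribution (mean 170 lb, SD 15 lb) and the uniform shape of the noise are our illustrative assumptions; only the ±40-pound accuracy comes from the puzzle:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# True weights (illustrative assumption: mean 170 lb, SD 15 lb).
true_weight = rng.normal(170, 15, n)

# Each weighing = true weight + independent, unbiased noise up to +-40 lb.
mon_weight = true_weight + rng.uniform(-40, 40, n)
tue_weight = true_weight + rng.uniform(-40, 40, n)

delta_weight = tue_weight - mon_weight
r = np.corrcoef(mon_weight, delta_weight)[0, 1]
print(round(r, 2))  # clearly negative (roughly -0.6 here), i.e., option 3
```

Nobody’s true weight changed overnight in this simulation, yet the correlation between the Monday measurement and the measured change is strongly negative, purely because of the measurement noise.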
—
Despite the issue we’ve outlined, it seems not uncommon for papers to predict the signed change in a variable using a set of variables that includes the baseline value of that variable. They don’t seem to discuss this problem, either. For example, see here, here, here, and here.
—
Why, if you are predicting a signed change in a variable, is it not okay to include the baseline value of that variable in linear regression?
The issue is that the signed change in a variable (e.g., Δweight = Tue_weight – Mon_weight) and the baseline value of that variable (e.g., Mon_weight) are going to be negatively correlated automatically in the absence of other factors, just as a result of how those two variables are defined (not due to any empirical fact about the world). So including both in a regression not only gives a misleading coefficient (i.e., it may show a negative relationship between them even when empirically one does not exist), but it may cause your whole regression to be misleading (e.g., p-values don’t have the interpretation you expect).
The negative correlation might lead some to conclude, for example, that the higher someone’s anxiety was prior to some intervention, the more their anxiety level dropped after the intervention – yet such a negative correlation (between baseline anxiety and the [anxiety at time point 2 – anxiety at time point 1] value) could arise even if there are no real changes in anxiety from one time point to the next!
—
The intuition here is that this is a form of regression to the mean. We usually think of regression to the mean as occurring when you purposely select some subset of a population (e.g., those who perform best on a test) and then expect mean reversion in the next time period. In this case, though, measurements that happen to be high due to chance will tend to fall (due just to regression to the mean), meaning the signed change between the two time points will tend to be negative. And measurements at time one that happened to be lower than usual due to noise will tend to rise in the next time period (again due to regression to the mean), meaning the second value minus the first value will tend to be positive. So high time-period-one values will tend to have negative changes, and low time-period-one values will tend to have positive changes, leading to a negative correlation between the two.
If that’s still not intuitive for you, consider the opposite relationship – the one between the value of a variable at time point 2 and the signed change, calculated as (value at time point 2 – value at time point 1). It is true (and hopefully also intuitive) that larger values at time point 2 will be positively correlated with larger increases from time point 1 to time point 2. (And, similarly, if the value at time point 2 happens to be lower, then (value at time point 2 – value at time point 1) will be negative.)
—
What can be done about this?
There seem to be a few ways to resolve this issue.
(i) Instead of predicting the signed change in a variable, predict the final (time point 2) values. Then it’s okay (and often advisable!) to control for (i.e., include as an independent variable) the baseline value. The issue discussed above is only a problem if you’re predicting the signed change in a variable, not if you’re predicting the value of the variable at time point two.
(ii) If it’s important to predict the signed change value (for instance, because that’s what you fundamentally care about), then it’s better not to control for the baseline value at all than to control for it and distort your results.
(iii) Several other approaches are summarized in the following paper: Chiolero et al. 2013
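To see why remedy (i) helps, here is a sketch contrasting the two regressions on the same simulated data (distributions and sample size are our illustrative assumptions; again, the true value never changes):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

true_x = rng.normal(50, 10, n)
x1 = true_x + rng.normal(0, 10, n)  # noisy baseline
x2 = true_x + rng.normal(0, 10, n)  # noisy follow-up

X = np.column_stack([np.ones(n), x1])

# Predicting the signed change: a large, spurious negative baseline coefficient.
slope_delta = np.linalg.lstsq(X, x2 - x1, rcond=None)[0][1]

# Predicting the final value instead: the baseline coefficient reflects the
# test-retest reliability of the measurement (about 0.5 here), with no
# negative bias built into the outcome's definition.
slope_final = np.linalg.lstsq(X, x2, rcond=None)[0][1]

print(round(slope_delta, 2), round(slope_final, 2))
```

Note that because x2 = (x2 – x1) + x1, the two baseline coefficients differ by exactly 1; the difference is entirely in how you interpret them, and the final-value version doesn’t invite the spurious “high baselines fall” reading.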
—
Thanks for reading – we’d love to hear your thoughts about this issue! Please comment below or contact us directly if you have any thoughts.
—
Some time after we wrote this, we came across some papers discussing this phenomenon. (For an overview and some approaches to responding to it, see Chiolero et al. 2013: https://www.frontiersin.org/journals/public-health/articles/10.3389/fpubh.2013.00029/full)
—
Acknowledgements:
I noticed this pattern in a dataset, and my colleague Clare hypothesized that this phenomenon was to blame. We then confirmed it together. I wrote most of this post, and she added examples from the literature and the Chiolero et al. paper.