Correlations provide a very useful, quick way to summarize the relationship between two things (as a single number between -1 and 1). For instance, if we find that self-reported anxiety and self-reported depression have a high correlation (which they, in fact, do), then this suggests a substantial link between the two conditions.
But something odd happens when we calculate correlations involving rare traits: the correlation usually has a small magnitude (i.e., is close to zero)!
To illustrate this, let’s imagine we have some rare trait, Y, which you either have (Y=1) or you don’t have (Y=0). It could be, for instance, whether or not someone has a rare disease (1 means they have it, 0 means they don’t).
Furthermore, let’s suppose a discovery is made that there’s an important link between that disease and a numerical trait, which we’ll call X. X could be, for instance, the percentile a person gets on a certain blood test that relates to the disease. So let’s suppose that X can take on integer values from 0 to 100. Moreover, imagine that it’s discovered that X ALWAYS equals 100 when a person has the disease, but when the person doesn’t have the disease, X can take on any value from 0 to 100 (uniformly at random). In this setup, it seems like X and Y should be strongly correlated – after all, X is always 100 for people who have the disease, and very rarely 100 for people who don’t. But do we find a substantial correlation between X and Y in this case? Well, it depends on the rarity of the disease. So let’s examine the correlation between X and Y (via simulation) as we vary the disease’s rarity from 1 in 2 (a probability of 0.50 of having the disease) all the way down to about 1 in 32,000:


As we can see in the table and chart, when Y is not rare (i.e., 1 occurs with reasonably high probability), then the correlation between X and Y is high. For instance, if the trait occurs in 1 in 2 people, then the correlation between X and Y is r=0.77. But as the trait gets rarer, this correlation drops, until we get to a rarity of about 1 in 32,000, at which point the correlation between X and Y is minuscule, at r<0.01.
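A simulation along these lines can be sketched in a few lines of Python. This is a minimal sketch rather than the code behind the table: the function names (pearson_r, simulate_r), the sample size, and the seed are all assumptions made for illustration. It stops at a rarity of 1 in 1,024, because much rarer traits need far larger samples before any Y=1 cases show up reliably.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(p, n=100_000, seed=0):
    """Y=1 with probability p; X=100 when Y=1, else a uniform integer from 0 to 100."""
    rng = random.Random(seed)  # fixed seed (an assumption) so runs are repeatable
    xs, ys = [], []
    for _ in range(n):
        y = 1 if rng.random() < p else 0
        x = 100 if y == 1 else rng.randint(0, 100)  # randint includes both endpoints
        xs.append(x)
        ys.append(y)
    return pearson_r(xs, ys)


if __name__ == "__main__":
    for k in (1, 4, 7, 10):
        p = 2.0 ** -k
        print(f"1 in {2 ** k}: r = {simulate_r(p):.3f}")
```

With a rarity of 1 in 2, the simulated value should land close to the r=0.77 quoted above, and it should fall steadily as the trait gets rarer.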
Situations like this can pose a big problem in research using correlations. If we are studying rare traits, we may use correlations to examine what they are associated with. But, as in the example above, using correlations can make rare traits look like they aren’t really meaningfully linked, even to other traits they’re highly related to!
But is the problem I described really a general problem of rare traits, or does it only apply in specific instances, such as the one we simulated above?
Well, first it’s worth noting that it’s POSSIBLE for even an extremely rare trait to have a strong correlation with a separate variable. For instance, if we modify the example above so that, as before, when Y=1 we have X=100, but unlike before, when Y=0 we have X be 0, then we always get a correlation of 1 regardless of how rare it is for Y to be 1. This case is easily explained because here, X and Y perfectly mirror each other, with X simply being 100 times Y.
Let’s consider another example. Suppose that, as before, X=100 whenever Y=1, but now let’s make it so that when Y=0, X has a 50/50 chance of being either 0 or 1. In that case, we get a correlation (r) between X and Y of almost 1 if Y has a 1 in 2 chance of being 1, and a correlation of r=0.41 when Y has only a 1 in 131,072 chance of being 1.
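These examples can also be checked without simulation, because the population correlation between a binary Y and any X depends only on p and on the mean and variance of X within each group. The sketch below uses the standard decomposition of the variance of a two-group mixture. The function name population_r and its parameter names are assumptions made for illustration, and the value 850 is the variance of a uniform integer on 0 to 100, i.e., (101^2 - 1)/12.

```python
import math


def population_r(p, mean1, var1, mean0, var0):
    """Exact correlation between binary Y (P(Y=1)=p) and X, from X's conditional moments.

    mean1/var1 describe X when Y=1; mean0/var0 describe X when Y=0.
    """
    gap = mean1 - mean0
    # Total variance of X = within-group variance + between-group variance.
    var_x = p * var1 + (1 - p) * var0 + p * (1 - p) * gap ** 2
    return math.sqrt(p * (1 - p)) * gap / math.sqrt(var_x)


if __name__ == "__main__":
    # X uniform on 0..100 when Y=0 (mean 50, variance 850), X=100 when Y=1.
    print(population_r(1 / 2, 100, 0, 50, 850))
    print(population_r(1 / 32768, 100, 0, 50, 850))
    # X = 100 * Y exactly: the correlation stays at 1 however rare Y=1 is.
    print(population_r(1 / 131072, 100, 0, 0, 0))
    # X is a 0/1 coin flip when Y=0 (mean 0.5, variance 0.25), X=100 when Y=1.
    print(population_r(1 / 2, 100, 0, 0.5, 0.25))
    print(population_r(1 / 131072, 100, 0, 0.5, 0.25))
```

Under these assumptions, the coin-flip case at a rarity of 1 in 131,072 comes out at roughly 0.48 in the population. A finite simulation can land some distance from that (the 0.41 quoted above may be such a case), since so few Y=1 cases get sampled when the trait is that rare.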
We therefore see that this is not a universal phenomenon – it’s possible for a variable to have a high correlation with a rare trait, even though rare traits have a tendency to create low correlations.
What can we say about when these low correlations with rare traits will occur?
We’re going to look at this from two perspectives.
First, let’s consider the special case where X, like Y, is also a binary variable (i.e., it only takes on the values 0 and 1). Furthermore, in this setup, let’s suppose that X is guaranteed to always give half 0’s and half 1’s. For instance, it could be that X represents sex assigned at birth (X=1 corresponds to female, and X=0 corresponds to male, with each occurring about half the time) and Y is whether a blood test came out positive for a disease that only females can get (Y=1 for a positive test result, Y=0 for a negative result).
In this setup, what can we say about the correlation between X and Y, and how does this vary with the rarity of Y=1 occurring? Well, it turns out that the size of the correlation between X and Y in this setup is strictly bounded based on how often Y takes on the value 1 (let’s refer to the probability that Y takes on the value 1 using the symbol “p”). Then, we have this upper bound on the correlation, r, that I derived:

In other words, if X is forced to take on 1 half the time and 0 half the time, the rarer it is that Y takes on 1, the tighter the restriction on their correlation. In that situation, if Y takes on 1 only rarely, then they can’t have a big correlation. Here is a plot of what this upper bound looks like as p varies, showing the maximum possible correlation for each such p:

Why does the rarity of Y=1 place an upper bound on the correlation in this situation? I think it’s because Y is almost always equal to 0 (due to 1 being rare), but X is restricted to take on the value 0 only half of the time. That means X has to equal 1 in close to half of the cases where Y=0, so it can’t match Y all that often. The less rare that Y=1 is, the less this is a problem.
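One way to see where a bound of this kind comes from is to work it out directly. The following is a sketch under the setup above, assuming P(X=1) = 1/2 and P(Y=1) = p with p no larger than 1/2. The helper symbol a = P(X=1, Y=1) is introduced here for the derivation, and the result may be written differently from the formula the author derived.

```latex
\begin{aligned}
\operatorname{Cov}(X,Y) &= a - \tfrac{p}{2}, \qquad 0 \le a \le p, \\
r &= \frac{a - p/2}{\tfrac{1}{2}\sqrt{p(1-p)}} = \frac{2a - p}{\sqrt{p(1-p)}}, \\
|r| &\le \frac{p}{\sqrt{p(1-p)}} = \sqrt{\frac{p}{1-p}} .
\end{aligned}
```

The maximum is reached when a = p, that is, when every Y=1 case also has X=1. For a trait with p = 0.01, this caps the correlation at about 0.10, no matter how the two variables are related.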
What about more general cases, though, where Y is still binary but X is a numerical variable that’s unrestricted (unlike the setup above)?
Well, for this, we have a different upper bound on the correlation that I derived. It is only guaranteed to hold when p is sufficiently small (relative to the other terms) – it may be violated if p isn’t small enough. But under those conditions, we have this upper bound on the correlation:

The variables in this equation are two conditional means: the mean value of X that occurs when Y=1, and the mean value of X that occurs when Y=0. There is also a conditional standard deviation: the standard deviation of X when Y=0 (i.e., how much X varies when Y takes on its more common result, which is 0).
If we treat the conditional means and conditional standard deviation of X as fixed, and consider what happens as p (the probability of Y being 1) shrinks, then we see that this upper bound on the correlation (which, recall, only applies for p small enough) applies a multiplicative factor of sqrt(p(1-p)). That factor sqrt(p(1-p)) is never larger than 1/2, and it approaches 0 as p approaches 0, effectively putting a cap on the size of the correlation. Here’s a chart of the multiplicative effect that sqrt(p(1-p)) has on the equation, forcing a lower correlation as p shrinks:

This bound won’t always be informative, because it’s possible for the first factor to be so big that the whole equation gives an upper bound greater than 1, which places no restriction on the correlation (a correlation can never exceed 1 anyway). But when the first factor is moderate in size, and p is small enough, this really does enforce a bound on how big the correlation between X and Y can be based on how rare it is that Y takes on the value 1. It’s also worth noting that when p is small, sqrt(p(1-p)) falls, as a function of p, approximately like the simpler formula sqrt(p).
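A short derivation shows where the sqrt(p(1-p)) factor comes from. This is a sketch rather than the author's own derivation, and the symbols are assumptions introduced here: \mu_1 and \sigma_1 are the mean and standard deviation of X when Y=1, \mu_0 and \sigma_0 are the same quantities when Y=0, and \sigma_X is the overall standard deviation of X.

```latex
\begin{aligned}
\operatorname{Cov}(X,Y) &= p(1-p)\,(\mu_1 - \mu_0), \qquad \sigma_Y = \sqrt{p(1-p)}, \\
\sigma_X^2 &= p\,\sigma_1^2 + (1-p)\,\sigma_0^2 + p(1-p)\,(\mu_1 - \mu_0)^2, \\
r &= \sqrt{p(1-p)}\;\frac{\mu_1 - \mu_0}{\sigma_X}, \\
|r| &\le \sqrt{p(1-p)}\;\frac{|\mu_1 - \mu_0|}{\sqrt{1-p}\;\sigma_0}
     = \sqrt{p}\;\frac{|\mu_1 - \mu_0|}{\sigma_0} .
\end{aligned}
```

The third line is exact. The last line follows because \sigma_X^2 is at least (1-p)\sigma_0^2, and it holds for every p, although it may differ in form from the bound described above. When p is small, \sigma_X is close to \sigma_0 in many cases, which gives the approximate form sqrt(p(1-p)) times the gap in means divided by \sigma_0.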
It’s also worth thinking for a moment about what this upper-bound formula means, on an intuitive level. If the standard deviation of X is small conditioned on Y=0 (that is, when Y=0, there is little variation in X), then the denominator will be small, so the upper bound will not be informative (any correlation is possible). On the other hand, if the standard deviation of X conditioned on Y=0 is large (relative to the gap between the mean of X when Y=1 and the mean of X when Y=0), then our bound will come into effect, and as p (the frequency of Y=1) shrinks, we’ll have a tighter and tighter maximum possible correlation. This means that a critical factor in the maximum possible correlation is how much variation there is in X when Y is fixed to take on its most common class (that is, when Y=0).
Putting this all together: while rare traits aren’t guaranteed to have low correlations to other variables, they often will be forced to have low correlations to other variables merely as a consequence of being rare traits! And the rarer the trait is, the smaller those correlations with other variables will typically be.
But what should you do when this happens? Well, one simple approach is to switch from measuring correlation to calculating Cohen’s d instead. It gives you the “effect size” of the binary variable Y on your other variable X – in other words, it calculates how much bigger X is, on average, when Y=1, compared to how big X is, on average, when Y=0, measured in units of standard deviations. This removes the effect of rarity – with Cohen’s d, it doesn’t matter how often Y takes on 1 vs. 0. Here’s the formula for it, where s is the “pooled standard deviation”:
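In its usual form, Cohen’s d is the mean of X when Y=1 minus the mean of X when Y=0, divided by s, where s^2 pools the two sample variances weighted by their degrees of freedom. A minimal Python sketch of that calculation follows; the function and argument names are assumptions made for illustration.

```python
import math


def cohens_d(x_when_y1, x_when_y0):
    """Cohen's d: difference in group means divided by the pooled standard deviation."""
    n1, n0 = len(x_when_y1), len(x_when_y0)
    mean1 = sum(x_when_y1) / n1
    mean0 = sum(x_when_y0) / n0
    # Sample variances (n - 1 in the denominator) within each group.
    var1 = sum((x - mean1) ** 2 for x in x_when_y1) / (n1 - 1)
    var0 = sum((x - mean0) ** 2 for x in x_when_y0) / (n0 - 1)
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * var1 + (n0 - 1) * var0) / (n1 + n0 - 2)
    return (mean1 - mean0) / math.sqrt(pooled_var)


if __name__ == "__main__":
    # Means 5 and 2, both sample variances 1, so the pooled SD is 1 and d is 3.
    print(cohens_d([4, 5, 6], [1, 2, 3]))
```

Because the pooled standard deviation weights each group by its size, d can still shift somewhat with rarity when the two groups have different spreads, but it does not shrink toward zero the way the correlation does.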

But could a problem similar to the one we’re describing happen for variables Y that are not binary? So far, we’ve only thought about “rarity” in terms of a binary variable Y, where the value 1 occurs rarely. But a non-binary variable can be “rare” in a similar way. For instance, suppose that Y is a numerical variable that almost always takes on the value 0 but, with small probability, takes on other values in the range 1 to 100. An example of this kind of variable might be responses to a question like “How many days during the past 30 days did you drive a motorcycle that you own?” The vast majority of respondents will respond with 0, whereas a small fraction will give responses in the tens or twenties, or even 30 if they ride every day.
In that case, could we have a similar situation where, because the variable is 0 so often, it behaves like the binary variable Y we’ve been discussing above, and usually ends up with lower correlations to other variables? The answer is: probably yes! And, unfortunately, the Cohen’s d strategy won’t work for that situation, because Cohen’s d, as it’s usually conceived, requires that one of the variables is binary.
What is one to do in that case? Well, in the case where our Y variable is bounded (i.e., it is known it can’t go below some fixed value Y_min and can’t go above some maximum value Y_max), one approach is to use a “Cohen’s d”-like formula that I derived which, unlike normal Cohen’s d, can be applied when neither variable is binary. I’ll call it the “Generalized Cohen’s d.” It’s given by:

Here, sigma_Y is the standard deviation of Y. Additionally, r is simply the regular correlation between X and Y. A neat aspect of this formula is that in the special case where Y is a binary variable, it simply becomes the formula for calculating Cohen’s d using the correlation, r. Here is the standard formula for converting a correlation, r, to a Cohen’s d for a binary variable Y that takes on the value 1 with probability 1/2 and 0 with probability 1/2:

And here’s the more powerful r-to-d conversion formula for a binary variable Y that takes on the value 1 with probability p and the value 0 with probability 1-p:
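In their commonly cited large-sample forms (with no small-sample correction), the two conversions are d = 2r / sqrt(1 - r^2) when p = 1/2, and d = r / sqrt(p(1-p)(1 - r^2)) for general p. A minimal Python sketch, with r_to_d as an illustrative name rather than anything from the article:

```python
import math


def r_to_d(r, p=0.5):
    """Convert a correlation r with a binary Y (P(Y=1) = p) into Cohen's d.

    With p = 0.5 this reduces to the familiar d = 2r / sqrt(1 - r^2).
    """
    return r / math.sqrt(p * (1 - p) * (1 - r ** 2))


if __name__ == "__main__":
    print(r_to_d(0.6))         # balanced groups
    print(r_to_d(0.6, p=0.1))  # same r, rarer trait: a larger d
```

For the same correlation, a rarer trait implies a larger d, which is the rarity effect discussed throughout this article seen from the other direction.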

What do we take away from all of this?
Well, if you’re studying rare traits and what other variables they are linked to, it can be very misleading to use correlations to do so. You may find that the correlations are low even with variables that have a strong link to the rare trait. In that case, you can switch to Cohen’s d, or when the original trait is not binary (but still mostly just takes on one value), you can try the Generalized Cohen’s d that I introduce in this article.
Please let me know if you find any errors in this article – I didn’t have as much time as I would have liked to check all the formulas. I’d also be very interested to know if you have other good ways of handling these sorts of situations.