<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>false positives &#8211; Spencer Greenberg</title>
	<atom:link href="https://www.spencergreenberg.com/tag/false-positives/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.spencergreenberg.com</link>
	<description></description>
	<lastBuildDate>Mon, 03 Apr 2023 01:47:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/www.spencergreenberg.com/wp-content/uploads/2024/05/cropped-icon.png?fit=32%2C32&#038;ssl=1</url>
	<title>false positives &#8211; Spencer Greenberg</title>
	<link>https://www.spencergreenberg.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">23753251</site>	<item>
		<title>Demystifying p-values</title>
		<link>https://www.spencergreenberg.com/2022/12/demystifying-p-values/</link>
					<comments>https://www.spencergreenberg.com/2022/12/demystifying-p-values/#comments</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Sat, 31 Dec 2022 20:40:00 +0000</pubDate>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[alpha]]></category>
		<category><![CDATA[alternative hypothesis]]></category>
		<category><![CDATA[Bayesianism]]></category>
		<category><![CDATA[false positives]]></category>
		<category><![CDATA[frequentism]]></category>
		<category><![CDATA[garden of forking paths]]></category>
		<category><![CDATA[multiple hypothesis testing]]></category>
		<category><![CDATA[null hypothesis]]></category>
		<category><![CDATA[null hypothesis significance testing]]></category>
		<category><![CDATA[p-hacking]]></category>
		<category><![CDATA[p-values]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[publication bias]]></category>
		<category><![CDATA[random chance]]></category>
		<category><![CDATA[replication crisis]]></category>
		<category><![CDATA[statistical significance]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[underpowered]]></category>
		<guid isPermaLink="false">https://www.spencergreenberg.com/?p=3382</guid>

					<description><![CDATA[There is a tremendous amount of confusion around what a p-value actually is, despite their widespread use in science. Here is my attempt to explain the concept of p-values concisely and clearly (including why they are useful and what often goes wrong with them). — What&#8217;s a p-value? — If you run a study, then [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>There is a tremendous amount of confusion around what a p-value actually is, despite their widespread use in science. Here is my attempt to explain the concept of p-values concisely and clearly (including why they are useful and what often goes wrong with them).</p>



<p><strong>— What&#8217;s a p-value? —</strong></p>



<p>If you run a study, then (all else equal, aside from rare edge cases) the lower the p-value, the lower the chance that your results are due to random chance or luck.</p>



<p>More precisely: a p-value is the probability you&#8217;d get a result at least as extreme as what you got IF there were actually no effect (or if some other pre-specified &#8220;null hypothesis&#8221; is true).</p>



<p>So it&#8217;s a probability calculated based on assuming that there is no effect (or assuming that a pre-specified &#8220;null hypothesis&#8221; is true). Here the phrase &#8220;no effect&#8221; would mean, in the case of a study on a new medicine, that the medicine doesn&#8217;t do anything.</p>



<p>To put it in terms of coin flips: suppose you&#8217;re trying to decide if a coin is fair (i.e., if it has an equal chance of landing on heads and tails &#8211; so that&#8217;s your &#8220;null hypothesis&#8221; in this context). You flip the coin 100 times and get 60 heads. You calculate the p-value (p &#8776; 0.06).</p>



<p>This p-value tells you there&#8217;s a 6% chance you&#8217;d get 60 or more heads OR 60 or more tails out of 100 flips if the coin were actually fair.</p>
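<p>To make the coin example concrete: the two-sided p-value can be computed exactly from the binomial distribution. Here is a short, self-contained Python sketch (not part of the original essay; the function name is my own) that doubles the one-sided tail probability under the fair-coin null:</p>

```python
from math import comb

def two_sided_binomial_p(n, k):
    """p-value for getting k or more heads OR k or more tails out of n flips,
    assuming the null hypothesis of a fair coin (probability 0.5 per flip)."""
    # One-sided tail: P(X >= k) under Binomial(n, 0.5)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null is symmetric, so the two-sided p-value doubles the tail
    return min(1.0, 2 * tail)

p = two_sided_binomial_p(100, 60)
print(round(p, 3))  # about 0.057, i.e. roughly the 6% quoted above
```

Note that the exact value is about 0.0569, which the essay rounds to 0.06.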



<p>What makes p-values useful is that when they are high, you usually can&#8217;t rule out your effect being due to random chance or luck. And, when they are very low, random chance is (in most cases) unlikely to be the explanation for your result.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>— What&#8217;s the problem with p-values? —</strong></p>



<p>In social science, p&lt;0.05 is often used as the cutoff for a &#8220;successful&#8221; result (i.e., they treat the effect as real and potentially publishable). This is an arbitrary cutoff; there&#8217;s nothing special about 0.05. The phrase &#8220;statistically significant&#8221; is defined simply to mean that p&lt;0.05.</p>



<p>There are many ways that p-values get commonly misused, creating lots of problems. For instance:</p>



<p>• p-values often get misinterpreted as the probability that an effect is not real (recall: p-values are actually the probability of getting a result at least this extreme if there is no effect, which is not the same thing)</p>



<p>• If you see one study where the main finding&#8217;s p-value is, say, 0.05, and another study where the main finding&#8217;s p-value is, say, 0.01, it&#8217;s tempting to conclude that the finding of the 2nd study is much less likely to be the result of chance (e.g., 1/5th as likely) than the 1st study&#8217;s finding. Unfortunately, we can&#8217;t draw this conclusion. The probability that a study&#8217;s finding is the result of chance is not the same as the p-value, and in fact, it can&#8217;t even be calculated just by knowing the p-value.</p>



<p>• Because a p-value threshold is often used for a result to be publishable (p&lt;0.05 in social science), researchers sometimes engage in fishy methods to get their p-values below the threshold. This is known as &#8220;p-hacking.&#8221;</p>



<p>• Researchers sometimes focus on a result&#8217;s p-value (or &#8220;statistical significance&#8221;) at the expense of other factors that are also important. For instance, a result may have a low p-value but be such a weak effect that it&#8217;s totally useless or uninteresting.</p>



<p>• While a low p-value helps you rule out the possibility that your effect is merely due to random chance, unfortunately, that&#8217;s all it helps you with. But researchers sometimes act as though it tells them more than that. Even an extremely low p-value doesn&#8217;t mean an effect is &#8220;real&#8221; or that the effect means what you think. Low p-values can result from a variety of causes, including mistakes in experimental design or confounds.</p>



<p>Here&#8217;s another way to think about what a p-value is and isn&#8217;t that some people find helpful: a p-value does not tell you the probability that your result is due to chance. It tells you how consistent your results are with being due to chance. (I&#8217;m paraphrasing from <a href="https://statmodeling.stat.columbia.edu/2013/03/12/misunderstanding-the-p-value/#comment-143473">here</a>.) So, the lower the p-value, the less consistent your results are with them being due to chance.</p>



<p>It&#8217;s interesting to note that, empirically, results with lower p-values are more likely to be genuine effects (i.e., not false positives). I looked at results for 325 psychology study replications, and when the original study p-value was at most 0.01, about 72% replicated. When p&gt;0.01, only 48% did.</p>



<p>Ultimately, p-values are a useful (though often abused) statistical tool.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>— BONUS APPENDIX: what&#8217;s the chance of a hypothesis being &#8220;true&#8221; if p&lt;0.05?  —</strong></p>



<p>One annoying thing about p-values is that they don&#8217;t answer the question we are usually interested in. Usually, we want to know something like &#8220;What&#8217;s the probability that my hypothesis is true?&#8221; or &#8220;What&#8217;s the probability that the effect of this drug is bigger than X?&#8221; but p-values don&#8217;t tell us those things.</p>



<p>However, we can put a different spin on p-values to get them to answer questions that are closer to what we&#8217;re really interested in. Let&#8217;s think of p-values as giving us a decision procedure (in an overly simplified world where you either &#8220;believe&#8221; in an effect or you fail to believe in it).&nbsp;</p>



<p>Suppose you test 100 totally separate, previously unexplored hypotheses about humans, and suppose that you commit to &#8220;believe&#8221; a hypothesis is true if and only if you get p&lt;0.05 (and otherwise, you don&#8217;t believe it).</p>



<p>I think it&#8217;s realistic that in a social science context, most hypotheses studied will be false since discovering novel, publishable hypotheses about humans is hard. So let&#8217;s suppose that 80% of the hypotheses you test are *not* true.&nbsp;</p>



<p>Finally, suppose that you use a large enough number of participants in your studies so that if you are testing for the presence of a real effect, there is an 80% chance you&#8217;ll be able to find it (this 80% figure is a common recommendation for &#8220;statistical power&#8221;).&nbsp;</p>



<p>Under these assumptions, if you test 100 hypotheses, then you will end up believing in 20 hypotheses, and 80% of those you believe will be true (with the other 20% being false positives). That means that of the results you believe in, 80% will be correct! Of course, this assumes no mistakes are made in the process of designing the experiment, running the statistics, and so on.</p>



<p>Here&#8217;s how the math works out if you&#8217;re curious:</p>



<p>• Out of the 100 hypotheses, 20 will be true, and of those, you&#8217;ll believe 16 = 0.80 * 20 (these are the true positives) and fail to believe 4 (these are the false negatives).</p>



<p>• Out of the 100 hypotheses, 80 will be false, and of those, you&#8217;ll believe 4 = 0.05 * 80 (these are the false positives), and you&#8217;ll reject 76 (these are the true negatives).</p>



<p>Of course, if the numbers here had been different, the conclusions would be different as well. For instance, imagine if you started with 2000 hypotheses, and this time, imagine that only 1% of them were true. If the power was still 80%, then:</p>



<p>&nbsp;• Out of the 2000 hypotheses, 20 of them would be true, and of those, you&#8217;d believe 16 (0.80 * 20) of them (these are true positives) and fail to believe 4 of them (these are false negatives).</p>



<p>• Out of the 2000 hypotheses, 1980 would be false, and of those, you&#8217;d believe 99 (0.05*1980) of them (these are false positives), and you&#8217;d reject the other 1881 of them (these are true negatives).</p>



<p>• So, altogether, you&#8217;d believe 115 (16 + 99) hypotheses, of which only 16 would&#8217;ve actually been true, so of the results you believe in, less than 14% would be correct!&nbsp;</p>
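<p>The bookkeeping in both scenarios above can be reproduced in a few lines of Python. This is a sketch of the simplified decision procedure described in this appendix (the function name and signature are my own, not from the original essay):</p>

```python
def believed_breakdown(n_hypotheses, frac_true, power=0.80, alpha=0.05):
    """Expected true/false positives under the believe-iff-p<alpha rule."""
    n_true = n_hypotheses * frac_true
    n_false = n_hypotheses - n_true
    true_pos = power * n_true     # real effects your studies detect
    false_pos = alpha * n_false   # null effects that slip under the threshold
    # Fraction of believed hypotheses that are actually true
    ppv = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, ppv

print(believed_breakdown(100, 0.20))   # 16 true positives, 4 false positives, PPV 0.80
print(believed_breakdown(2000, 0.01))  # 16 true positives, 99 false positives, PPV ~0.14
```

Changing the base rate of true hypotheses from 20% to 1% drops the fraction of correct beliefs from 80% to under 14%, exactly as worked out above.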



<p>From analyses like these, we can see that the probability that a specific hypothesis is true, given that we&#8217;ve found p&lt;0.05, depends on a variety of factors, including the sample size, the true effect size, the base rate probability that a new hypothesis tested by that researcher is true, the probability of errors being made in the experimental design or statistical analysis, and so on.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>In real life:</p>



<p>(1) Studies often don&#8217;t use large enough numbers of participants (and so are underpowered).</p>



<p>(2) Researchers sometimes engage in p-hacking to artificially lower their p-values to help their papers get published.</p>



<p>(3) Researchers often don&#8217;t carefully track how many hypotheses they&#8217;ve really tested.</p>



<p>(4) The decision procedure described above is often not adhered to so strictly (e.g., a result of p=0.08 might be treated as suggestive evidence for the hypothesis, and hence the hypothesis is not rejected).</p>



<p>(5) Real hypotheses often have auxiliary assumptions beyond what the p-value accounts for (such as an assumption that there is a lack of confounders, a lack of serious errors in the experimental setup, and so on).</p>



<p>I personally don&#8217;t like thinking in terms of this decision procedure for p-values because modeling hypotheses as simply &#8220;true&#8221; or &#8220;false&#8221; is not a good approach to thinking clearly: when trying to understand the answers to complex questions, it&#8217;s usually much better to think in terms of probabilities than a &#8220;true&#8221;/&#8220;false&#8221; dichotomy.</p>



<p>Some people have argued that we should switch to a Bayesian approach to hypothesis testing since such an approach avoids many of the issues of p-values (including avoiding the problematic &#8220;true&#8221;/&#8220;false&#8221; dichotomy). But it also introduces other challenges, such as how to come up with an appropriate &#8220;prior&#8221; (which represents one&#8217;s belief about the probability of the hypothesis having different strengths of effects prior to seeing the study results).</p>






<p><em>This piece was first written on December 31, 2022, and first appeared on this site on April 2, 2023.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><a href="https://www.guidedtrack.com/programs/4zle8q9/run?essaySpecifier=%3A+Demystifying%20p-values" target="_blank" rel="noreferrer noopener">If you read this line, please do us a favor and click here to answer one quick question.</a></p>



]]></content:encoded>
					
					<wfw:commentRss>https://www.spencergreenberg.com/2022/12/demystifying-p-values/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3382</post-id>	</item>
		<item>
		<title>Bias based on facial attractiveness</title>
		<link>https://www.spencergreenberg.com/2020/07/bias-based-on-facial-attractiveness/</link>
					<comments>https://www.spencergreenberg.com/2020/07/bias-based-on-facial-attractiveness/#respond</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Fri, 03 Jul 2020 03:15:00 +0000</pubDate>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[cultural norms]]></category>
		<category><![CDATA[cultural values]]></category>
		<category><![CDATA[discrimination based on appearance]]></category>
		<category><![CDATA[discrimination based on faces]]></category>
		<category><![CDATA[evolution]]></category>
		<category><![CDATA[false positives]]></category>
		<category><![CDATA[gender]]></category>
		<category><![CDATA[health]]></category>
		<category><![CDATA[individual variation]]></category>
		<category><![CDATA[injustice]]></category>
		<category><![CDATA[lookism]]></category>
		<category><![CDATA[prediction]]></category>
		<category><![CDATA[selection pressures]]></category>
		<category><![CDATA[sexual attraction]]></category>
		<category><![CDATA[testosterone]]></category>
		<guid isPermaLink="false">https://www.spencergreenberg.com/?p=2542</guid>

					<description><![CDATA[There&#8217;s a deeply-rooted, incredibly superficial aspect of human nature that is rarely discussed: our obsession with small variations in bone structure/skin smoothness on a person&#8217;s face. At extremes, people are desired or shunned due to tiny, otherwise almost meaningless facial details. In the attached image, there are two non-existent women (generated by a face generation [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>There&#8217;s a deeply-rooted, incredibly superficial aspect of human nature that is rarely discussed: our obsession with small variations in bone structure/skin smoothness on a person&#8217;s face. At extremes, people are desired or shunned due to tiny, otherwise almost meaningless facial details.</p>



<p>In the attached image, there are two non-existent women (generated by a face generation AI set to generate &#8220;brown hair white adult female&#8221;). If these were real people, they would likely be treated differently throughout their lives due to very minor differences in facial structure and skin smoothness.</p>



<p>Based on their faces alone, there&#8217;s no way to know with non-negligible accuracy which of these people (if they existed) would be more hard-working, more moral, wiser, or otherwise in possession of personal traits that we actually might care about. So why are humans so obsessed with faces? It seems likely to be caused by a combination of two factors:</p>



<hr class="wp-block-separator"/>



<p><strong>(1) Runaway Sexual Selection</strong></p>



<p>If peacocks find large tail plumage sexually attractive, then even if those feathers are not useful for anything else, that still creates an evolutionary selection pressure where those with larger tail plumage are more likely to pass on their genes (due to improved chances of mating). Similarly, if certain humans are found to be more attractive based on their faces, that creates an evolutionary selection pressure in favor of mating with those people because then their children have a higher probability of finding mating success themselves (and hence passing on their genes). This phenomenon reinforces faces being attractive: those attracted to &#8220;good-looking&#8221; faces mate with &#8220;good-looking&#8221; people more often, so their children are both more likely to be good-looking (and so have an easier time mating) and more likely to inherit the preference for &#8220;good-looking&#8221; faces.</p>



<p>Today, this selection pressure is likely much weaker than it once was since most people now end up having children. For instance, now the vast majority of people in the US live to be at least 50, and only about 15% of women and 25% of men in the 40-50-year age bracket are childless. In contrast, tens of thousands of years ago, far fewer would make it to the point where they would have children.</p>



<hr class="wp-block-separator"/>






<p><strong>(2) Health Correlations</strong></p>



<p>In the environment we lived in tens of thousands of years ago, some aspects of a person&#8217;s face correlated with the likelihood of the survival of their genes, in particular ones related to disease (some diseases impact the face), genetic disorders (some of them cause facial changes), and development in the womb (where abnormal development can cause facial changes). </p>



<p>The correlation between health and facial features is likely to be lower now than it used to be back then. Today, a person&#8217;s facial features might still help to predict someone&#8217;s age, their most probable gender identity, and whether they have certain health conditions &#8211; but, of course, none of these give us any legitimate justification for treating some people better and others worse based just on their face.</p>



<p>It has been found that certain facial features do correlate with hormone levels (like testosterone). While testosterone levels may play a role in aggression (they may be part of the explanation for why men are violent so much more often than women), using these small correlations to make judgments about any one person is going to be both highly inaccurate and highly unjust. Some other personality traits may also be very weakly correlated with a person&#8217;s facial features, but talking to the person for 20 minutes will, of course, give you dramatically more information about what that person is like. Yet, we are prone to read so much into the way a person looks.</p>



<hr class="wp-block-separator"/>



<p><strong>Note: </strong>there is an additional effect when it comes to faces, which is that we are sometimes taught by our culture to value certain facial attributes more than others. This can act above and beyond the previously mentioned two factors.</p>



<hr class="wp-block-separator"/>



<p>We humans act as though faces are incredibly important despite them being a substantially arbitrary mask our genes have programmed for us. And they often impact how we humans treat each other, despite this unequal treatment being both unjust and unjustified. If you ever notice yourself treating someone less well because of their face, take note and adjust your behavior.</p>



<p>I am not saying that people should, for example, date people they are not attracted to. Obviously, attraction is an important part of relationships for most people, and the face is one part of what determines attraction. (You may also care about your children one day having attractive faces, so they can more easily find life partners they like.) Rather, what I&#8217;m saying is that we should be very wary about making negative inferences about any individual person based on their face (which is something that, unfortunately, the human mind seems to do often). The face says too little about a person&#8217;s character to be useful for predicting at the level of any individual.</p>



<hr class="wp-block-separator"/>



<p><em>This essay was first written on July 2, 2020, and first appeared on this site on December 17, 2021.</em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.spencergreenberg.com/2020/07/bias-based-on-facial-attractiveness/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2542</post-id>	</item>
	</channel>
</rss>
