<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>replication crisis &#8211; Spencer Greenberg</title>
	<atom:link href="https://www.spencergreenberg.com/tag/replication-crisis/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.spencergreenberg.com</link>
	<description></description>
	<lastBuildDate>Mon, 03 Apr 2023 01:47:24 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>

<image>
	<url>https://i0.wp.com/www.spencergreenberg.com/wp-content/uploads/2024/05/cropped-icon.png?fit=32%2C32&#038;ssl=1</url>
	<title>replication crisis &#8211; Spencer Greenberg</title>
	<link>https://www.spencergreenberg.com</link>
	<width>32</width>
	<height>32</height>
</image> 
<site xmlns="com-wordpress:feed-additions:1">23753251</site>	<item>
		<title>Demystifying p-values</title>
		<link>https://www.spencergreenberg.com/2022/12/demystifying-p-values/</link>
					<comments>https://www.spencergreenberg.com/2022/12/demystifying-p-values/#comments</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Sat, 31 Dec 2022 20:40:00 +0000</pubDate>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[alpha]]></category>
		<category><![CDATA[alternative hypothesis]]></category>
		<category><![CDATA[Bayesianism]]></category>
		<category><![CDATA[false positives]]></category>
		<category><![CDATA[frequentism]]></category>
		<category><![CDATA[garden of forking paths]]></category>
		<category><![CDATA[multiple hypothesis testing]]></category>
		<category><![CDATA[null hypothesis]]></category>
		<category><![CDATA[null hypothesis significance testing]]></category>
		<category><![CDATA[p-hacking]]></category>
		<category><![CDATA[p-values]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[publication bias]]></category>
		<category><![CDATA[random chance]]></category>
		<category><![CDATA[replication crisis]]></category>
		<category><![CDATA[statistical significance]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[underpowered]]></category>
		<guid isPermaLink="false">https://www.spencergreenberg.com/?p=3382</guid>

					<description><![CDATA[There is a tremendous amount of confusion around what a p-value actually is, despite their widespread use in science. Here is my attempt to explain the concept of p-values concisely and clearly (including why they are useful and what often goes wrong with them). — What&#8217;s a p-value? — If you run a study, then [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>There is a tremendous amount of confusion around what a p-value actually is, despite their widespread use in science. Here is my attempt to explain the concept of p-values concisely and clearly (including why they are useful and what often goes wrong with them).</p>



<p><strong>— What&#8217;s a p-value? —</strong></p>



<p>If you run a study, then (all else equal, aside from rare edge cases) the lower the p-value, the lower the chance that your results are due to random chance or luck.</p>



<p>More precisely: a p-value is the probability you&#8217;d get a result at least as extreme as what you got IF there were actually no effect (or if some other pre-specified &#8220;null hypothesis&#8221; is true).</p>



<p>So it&#8217;s a probability calculated based on assuming that there is no effect (or assuming that a pre-specified &#8220;null hypothesis&#8221; is true). Here the phrase &#8220;no effect&#8221; would mean, in the case of a study on a new medicine, that the medicine doesn&#8217;t do anything.</p>



<p>To put it in terms of coin flips: suppose you&#8217;re trying to decide if a coin is fair (i.e., if it has an equal chance of landing on heads and tails &#8211; so that&#8217;s your &#8220;null hypothesis&#8221; in this context). You flip the coin 100 times and get 60 heads. You calculate the p-value (p=0.06).</p>



<p>This p-value tells you there&#8217;s a 6% chance you&#8217;d get 60 or more heads OR 60 or more tails out of 100 flips if the coin were actually fair.</p>
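

<p><em>(To make the coin example concrete, here is a minimal Python sketch, using only the standard library, that reproduces this two-sided p-value; the function name is mine, not a standard one.)</em></p>



<pre class="wp-block-code"><code># A minimal sketch of the coin-flip p-value above, using only the
# Python standard library (the function name is mine, not standard).
from math import comb

def two_sided_binomial_p(heads, flips, prob=0.5):
    """Chance of a result at least as extreme as `heads` if the null is true."""
    def pmf(k):  # probability of exactly k heads under the null
        return comb(flips, k) * prob**k * (1 - prob)**(flips - k)
    tail = abs(heads - flips * prob)  # how far the result is from expectation
    # Sum the probability of every outcome at least this extreme, both tails.
    return sum(pmf(k) for k in range(flips + 1)
               if abs(k - flips * prob) >= tail)

print(round(two_sided_binomial_p(60, 100), 3))  # 0.057, i.e. roughly 6%
</code></pre>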



<p>What makes p-values useful is that when they are high, you usually can&#8217;t rule out your effect being due to random chance or luck. And, when they are very low, random chance is (in most cases) unlikely to be the explanation for your result.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>— What&#8217;s the problem with p-values? —</strong></p>



<p>In social science, p&lt;0.05 is often used as the cutoff for a &#8220;successful&#8221; result (i.e., researchers treat the effect as real and potentially publishable). This is an arbitrary cutoff; there&#8217;s nothing special about 0.05. The phrase &#8220;statistically significant&#8221; is defined simply to mean that p&lt;0.05.</p>



<p>There are many ways that p-values get commonly misused, creating lots of problems. For instance:</p>



<p>• p-values often get misinterpreted as the probability that an effect is not real (recall: p-values are actually the probability of getting a result at least this extreme if there is no effect, which is not the same thing)</p>



<p>• If you see one study where the main finding&#8217;s p-value is, say, 0.05, and another study where the main finding&#8217;s p-value is, say, 0.01, it&#8217;s tempting to conclude that the finding of the 2nd study is much less likely to be the result of chance (e.g., 1/5th as likely) than the 1st study&#8217;s finding. Unfortunately, we can&#8217;t draw this conclusion. The probability that a study&#8217;s finding is the result of chance is not the same as the p-value, and in fact, it can&#8217;t even be calculated just by knowing the p-value.</p>



<p>• Because a p-value threshold is often used for a result to be publishable (p&lt;0.05 in social science), researchers sometimes engage in fishy methods to get their p-values below the threshold. This is known as &#8220;p-hacking.&#8221;</p>



<p>• A result&#8217;s p-value (or &#8220;statistical significance&#8221;) is sometimes focused on instead of focusing on other factors that are also important. For instance, a result may have a low p-value but be such a weak effect that it&#8217;s totally useless or uninteresting.</p>



<p>• While a low p-value helps you rule out the possibility that your effect is merely due to random chance, unfortunately, that&#8217;s all it helps you with. But researchers sometimes act as though it tells them more than that. Even an extremely low p-value doesn&#8217;t mean an effect is &#8220;real&#8221; or that the effect means what you think. Low p-values can result from a variety of causes, including mistakes in experimental design or confounds.</p>



<p>Here&#8217;s another way to think about what a p-value is and isn&#8217;t that some people find helpful: a p-value does not tell you the probability that your result is due to chance. It tells you how consistent your results are with being due to chance. (I&#8217;m paraphrasing from <a href="https://statmodeling.stat.columbia.edu/2013/03/12/misunderstanding-the-p-value/#comment-143473">here</a>.) So, the lower the p-value, the less consistent your results are with them being due to chance.</p>



<p>It&#8217;s interesting to note that, empirically, results with lower p-values are more likely to be genuine effects (i.e., not false positives). I looked at results for 325 psychology study replications, and when the original study p-value was at most 0.01, about 72% replicated. When p&gt;0.01, only 48% did.</p>



<p>Ultimately, p-values are a useful (though often abused) statistical tool.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><strong>— BONUS APPENDIX: what&#8217;s the chance of a hypothesis being &#8220;true&#8221; if p&lt;0.05? —</strong></p>



<p>One annoying thing about p-values is that they don&#8217;t answer the question we are usually interested in. Usually, we want to know something like &#8220;What&#8217;s the probability that my hypothesis is true?&#8221; or &#8220;What&#8217;s the probability that the effect of this drug is bigger than X?&#8221; but p-values don&#8217;t tell us those things.</p>



<p>However, we can put a different spin on p-values to get them to answer questions that are closer to what we&#8217;re really interested in. Let&#8217;s think of p-values as giving us a decision procedure (in an overly simplified world where you either &#8220;believe&#8221; in an effect or you fail to believe in it).&nbsp;</p>



<p>Suppose you test 100 totally separate, previously unexplored hypotheses about humans, and suppose that you commit to &#8220;believe&#8221; a hypothesis is true if and only if you get p&lt;0.05 (and otherwise, you don&#8217;t believe it).</p>



<p>I think it&#8217;s realistic that in a social science context, most hypotheses studied will be false, since discovering novel, true hypotheses about humans is hard. So let&#8217;s suppose that 80% of the hypotheses you test are *not* true.&nbsp;</p>



<p>Finally, suppose that you use a large enough number of participants in your studies so that if you are testing for the presence of a real effect, there is an 80% chance you&#8217;ll be able to find it (this 80% figure is a common recommendation for &#8220;statistical power&#8221;).&nbsp;</p>
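

<p><em>(As a rough illustration of what &#8220;large enough&#8221; means, here is the standard back-of-the-envelope sample-size calculation for a two-group comparison; the medium effect size of 0.5 standard deviations is an assumption of mine, not a figure from the essay.)</em></p>



<pre class="wp-block-code"><code># Back-of-the-envelope sample size for 80% power (a sketch; the medium
# effect size d = 0.5 is an assumed input, not a figure from the essay).
from statistics import NormalDist

alpha, power, d = 0.05, 0.80, 0.5
z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # 1.96 for a two-sided test
z_power = NormalDist().inv_cdf(power)           # 0.84
n_per_group = 2 * ((z_alpha + z_power) / d) ** 2

print(round(n_per_group))  # about 63 participants per group
</code></pre>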



<p>Under these assumptions, if you test 100 hypotheses, then you will end up believing in 20 hypotheses, and 80% of those will be true (with the other 20% being false positives). So, of the results you believe in, 80% will be correct! Of course, this assumes no mistakes are made in the process of designing the experiment, running the statistics, and so on.</p>



<p>Here&#8217;s how the math works out if you&#8217;re curious:</p>



<p>• Out of the 100 hypotheses, 20 will be true, and of those, you&#8217;ll believe 16 = 0.80 * 20 (these are the true positives) and fail to believe 4 (these are the false negatives).</p>



<p>• Out of the 100 hypotheses, 80 will be false, and of those, you&#8217;ll believe 4 = 0.05 * 80 (these are the false positives), and you&#8217;ll reject 76 (these are the true negatives).</p>



<p>Of course, if the numbers here had been different, the conclusions would be different as well. For instance, imagine if you started with 2000 hypotheses, and this time, imagine that only 1% of them were true. If the power was still 80%, then:</p>



<p>• Out of the 2000 hypotheses, 20 of them would be true, and of those, you&#8217;d believe 16 (0.80 * 20) of them (these are true positives) and fail to believe 4 of them (these are false negatives).</p>



<p>• Out of the 2000 hypotheses, 1980 would be false, and of those, you&#8217;d believe 99 (0.05*1980) of them (these are false positives), and you&#8217;d reject the other 1881 of them (these are true negatives).</p>



<p>• So, altogether, you&#8217;d believe 115 (16 + 99) hypotheses, of which only 16 would&#8217;ve actually been true, so of the results you believe in, less than 14% would be correct!&nbsp;</p>
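

<p><em>(Here is a small Python sketch of this bookkeeping, reproducing both scenarios; the function is mine, not from the essay.)</em></p>



<pre class="wp-block-code"><code># A sketch of the bookkeeping above: what fraction of the hypotheses
# you end up believing (those reaching p below 0.05) are actually true?
def fraction_of_beliefs_correct(n_hypotheses, base_rate, power=0.80, alpha=0.05):
    true_hyps = n_hypotheses * base_rate
    false_hyps = n_hypotheses - true_hyps
    true_positives = power * true_hyps     # real effects you detect
    false_positives = alpha * false_hyps   # nulls that sneak past the cutoff
    return true_positives / (true_positives + false_positives)

# Scenario 1: 100 hypotheses, 20% true -> 16 / (16 + 4) = 0.80
print(fraction_of_beliefs_correct(100, 0.20))
# Scenario 2: 2000 hypotheses, 1% true -> 16 / (16 + 99), about 0.139
print(fraction_of_beliefs_correct(2000, 0.01))
</code></pre>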



<p>From analyses like these, we can see that the probability that a specific hypothesis is true, given that we&#8217;ve found p&lt;0.05, depends on a variety of factors, including the sample size, the true effect size, the base rate probability that a new hypothesis tested by that researcher is true, the probability of errors being made in the experimental design or statistical analysis, and so on.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p>In real life:</p>



<p>(1) Studies often don&#8217;t use large enough numbers of participants (and so are underpowered).</p>



<p>(2) Researchers sometimes engage in p-hacking to artificially lower their p-values to help their papers get published.</p>



<p>(3) Researchers often don&#8217;t carefully track how many hypotheses they&#8217;ve really tested.</p>



<p>(4) The decision procedure described above is often not adhered to so strictly (e.g., a result of p=0.08 might be treated as suggestive evidence for the hypothesis, and hence the hypothesis is not rejected).</p>



<p>(5) Real hypotheses often have auxiliary assumptions beyond what the p-value accounts for (such as an assumption that there is a lack of confounders, a lack of serious errors in the experimental setup, and so on).</p>



<p>I personally don&#8217;t like thinking in terms of this decision procedure for p-values because modeling hypotheses as &#8220;true&#8221; or &#8220;false&#8221; is not a good approach to thinking clearly: it&#8217;s usually much better to think in terms of probabilities rather than a &#8220;true&#8221;/&#8220;false&#8221; dichotomy when trying to understand the answers to complex questions.</p>



<p>Some people have argued that we should switch to a Bayesian approach to hypothesis testing since such an approach avoids many of the issues of p-values (including avoiding the problematic &#8220;true&#8221;/&#8220;false&#8221; dichotomy). But it also introduces other challenges, such as how to come up with an appropriate &#8220;prior&#8221; (which represents one&#8217;s belief about the probability of the hypothesis having different strengths of effects prior to seeing the study results).</p>






<p><em>This piece was first written on December 31, 2022, and first appeared on this site on April 2, 2023.</em></p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p><a href="https://www.guidedtrack.com/programs/4zle8q9/run?essaySpecifier=%3A+Demystifying%20p-values" target="_blank" rel="noreferrer noopener">If you read this line, please do us a favor and click here to answer one quick question.</a></p>



]]></content:encoded>
					
					<wfw:commentRss>https://www.spencergreenberg.com/2022/12/demystifying-p-values/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3382</post-id>	</item>
		<item>
		<title>Importance Hacking: a major (yet rarely-discussed) problem in science</title>
		<link>https://www.spencergreenberg.com/2022/12/importance-hacking-a-major-yet-rarely-discussed-problem-in-science/</link>
					<comments>https://www.spencergreenberg.com/2022/12/importance-hacking-a-major-yet-rarely-discussed-problem-in-science/#comments</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Tue, 20 Dec 2022 01:45:00 +0000</pubDate>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[beauty hacking]]></category>
		<category><![CDATA[career incentives]]></category>
		<category><![CDATA[chance]]></category>
		<category><![CDATA[clarity]]></category>
		<category><![CDATA[culture of science]]></category>
		<category><![CDATA[fraud]]></category>
		<category><![CDATA[generalizability]]></category>
		<category><![CDATA[generalizability crisis]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[honesty]]></category>
		<category><![CDATA[importance hacking]]></category>
		<category><![CDATA[incentives]]></category>
		<category><![CDATA[integrity]]></category>
		<category><![CDATA[novelty hacking]]></category>
		<category><![CDATA[open science]]></category>
		<category><![CDATA[overclaiming]]></category>
		<category><![CDATA[p-hacking]]></category>
		<category><![CDATA[probability]]></category>
		<category><![CDATA[psychological science]]></category>
		<category><![CDATA[publish or perish]]></category>
		<category><![CDATA[reasoning processes]]></category>
		<category><![CDATA[replication crisis]]></category>
		<category><![CDATA[science]]></category>
		<category><![CDATA[social science]]></category>
		<category><![CDATA[statistics]]></category>
		<category><![CDATA[usefulness hacking]]></category>
		<category><![CDATA[veracity]]></category>
		<guid isPermaLink="false">https://www.spencergreenberg.com/?p=3057</guid>

					<description><![CDATA[I first published this post on the Clearer Thinking blog on December 19, 2022, and first cross-posted it to this site on January 21, 2023. You have probably heard the phrase &#8220;replication crisis.&#8221; It refers to the grim fact that, in a number of fields of science, when researchers attempt to replicate previously published studies, [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p><em>I first published this post on the <a href="https://www.clearerthinking.org/post/importance-hacking-a-major-yet-rarely-discussed-problem-in-science">Clearer Thinking blog</a> on December 19, 2022, and first cross-posted it to this site on January 21, 2023.</em></p>



<p id="viewer-1d12a"></p>



<p id="viewer-104ln">You have probably heard the phrase &#8220;replication crisis.&#8221; It refers to the grim fact that, in a number of fields of science, when researchers attempt to replicate previously published studies, they fairly often don&#8217;t get the same results. The magnitude of the problem depends on the field, but in psychology, it seems that something like <a rel="noreferrer noopener" href="http://datacolada.org/47" target="_blank"><u>40% of studies in top journals</u></a> don&#8217;t replicate. We&#8217;ve been tackling this crisis with our new <a rel="noreferrer noopener" href="https://replications.clearerthinking.org/" target="_blank"><u><em>Transparent Replications</em></u></a> project, and this post explains one of our key ideas.</p>



<p id="viewer-2dn5g">Replication failures are sometimes simply due to bad luck, but more often, they are caused by p-hacking &#8211; the use of fishy statistical techniques that lead to statistically significant (but misleading or erroneous) results. As big a problem as p-hacking is, there is another substantial problem in science that gets talked about much less. Although certain subtypes of this problem have been named previously, to my knowledge, the problem itself has no name, so I&#8217;m giving it one: &#8220;Importance Hacking.&#8221;</p>



<p id="viewer-3hoev">Academics want to publish in the top journals in their field. To understand Importance Hacking, let&#8217;s consider a (slightly oversimplified) list of the three most commonly-discussed ways to get a paper published in top psychology journals:</p>



<ol class="wp-block-list">
<li><strong>Conduct valuable research</strong> &#8211; make a genuinely interesting or important discovery, or add something valuable to the state of scientific knowledge. This is, of course, what just about everyone wants to do, but it&#8217;s very, very hard!</li>



<li><strong>Commit fraud</strong> &#8211; for instance, by making up your data. Thankfully, very few people are willing to do this because it&#8217;s so unethical. So this is by far the least used approach.</li>



<li><strong>p-hack</strong> &#8211; use fishy statistics, HARKing (i.e., hypothesizing after the results are known), selective reporting, hidden <a href="https://en.wikipedia.org/wiki/Researcher_degrees_of_freedom" target="_blank" rel="noreferrer noopener"><u>researcher degrees of freedom</u></a>, etc., in order to get a p&lt;0.05 result that is actually just a false positive. This is a major problem and the focus of the replication crisis. Of course, false positives can also come about without fault, due to bad luck.</li>
</ol>



<p id="viewer-5plkf">But here is a fourth way to get a paper published in a top journal: Importance Hacking.</p>



<p id="viewer-ctrs5">4. <strong>Importance Hack</strong> &#8211; get a result that is actually not interesting, not important, and not valuable, but write about it in such a way that reviewers are convinced it is interesting, important, and/or valuable, so that it gets published.</p>



<p id="viewer-f54g1">For research to be valuable to society (and, in an ideal world, publishable in top journals), it must be true AND interesting (or important, useful, etc.). Researchers sometimes p-hack their results to skirt around the &#8220;true&#8221; criterion (by generating interesting false positives). On the other hand, Importance Hacking is a method for skirting the &#8220;interesting&#8221; criterion.</p>



<p id="viewer-ft7mi">Importance Hacking is related to concepts like <em>hype</em> and <em>overselling</em>, though hype and overselling are far more general. Importance Hacking refers specifically to a phenomenon whereby research with little to no value gets published in top journals due to the use of strategies that lead reviewers to misinterpret the work. On the other hand, hype and overselling are used in many ways in many stages of research (including to make valuable research appear even more valuable).</p>



<p id="viewer-dd0l9">One way to understand importance hacking is by comparing it to p-hacking. P-hacking refers to a set of bad research practices that enable researchers to publish non-existent effects. In other words, p-hacking misleads paper reviewers into thinking that non-existent effects are real. Importance Hacking, on the other hand, encompasses a different set of bad research practices: those that lead paper reviewers to believe that real (i.e., existent) results that have little to no value actually have substantial value.</p>



<p id="viewer-2tioa">This diagram illustrates how I think Importance Hacking interferes with the pipeline of producing valuable research:</p>



<figure class="wp-block-image"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/static.wixstatic.com/media/f4e552_e1a60b1c65514edf9fef562a77c5c4ba~mv2.jpg/v1/fill/w_1480%2Ch_904%2Cal_c%2Cq_85%2Cusm_0.66_1.00_0.01%2Cenc_auto/f4e552_e1a60b1c65514edf9fef562a77c5c4ba~mv2.jpg?w=750&#038;ssl=1" alt=""/></figure>



<p id="viewer-7u47q">There are a number of subtypes of Importance Hacking based on the method used to make a result appear interesting/important/valuable when it&#8217;s not. Here is how I subdivide them:</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<h2 class="wp-block-heading" id="viewer-brv18"></h2>



<h2 class="wp-block-heading" id="viewer-fh6np">Types of Importance Hacking</h2>



<p id="viewer-a5mla"><strong>1. Hacking Conclusions:</strong> make it seem like you showed some interesting thing X but actually show something else (X′) which sounds similar to X but is much less interesting/important. In these cases, researchers do not truly find what they imply they have found. This phenomenon is also closely connected with validity issues.</p>



<ul class="wp-block-list">
<li><em>Example 1: showing X is true in a simple video game but claiming that X is true in real life.</em></li>



<li><em>Example 2: showing A and B are correlated and claiming that A causes B (when really A and B are probably both caused by some third factor C, which makes the finding much less interesting; see the toy simulation after this list).</em></li>



<li><em>Example 3: a researcher claims to be measuring “aggression” and couches all conclusions in those terms but is actually measuring the milliliters of hot sauce that a person puts in someone else&#8217;s food. Their result about aggression will be valid only insofar as hot sauce allocation is a valid measure of aggression.</em></li>



<li>Example 4: some types of hacking conclusions would fall under the terms &#8220;overclaiming&#8221; or &#8220;overgeneralizing&#8221;; Tal Yarkoni has a relevant paper called <a href="https://mzettersten.github.io/assets/pdf/ManyBabies_BBS_commentary.pdf" target="_blank" rel="noreferrer noopener"><em><u>The Generalizability Crisis</u></em></a><em>.</em></li>
</ul>
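

<p><em>(To make Example 2 concrete, here is a toy simulation, mine rather than anything from the literature cited here, in which A and B are strongly correlated even though neither causes the other.)</em></p>



<pre class="wp-block-code"><code># Toy confounding simulation (mine, not from the cited work): A and B
# correlate only because both are driven by a hidden common cause C.
import random
from statistics import correlation  # available in Python 3.10+

random.seed(0)
C = [random.gauss(0, 1) for _ in range(10_000)]   # hidden common cause
A = [c + random.gauss(0, 1) for c in C]           # A depends only on C
B = [c + random.gauss(0, 1) for c in C]           # B depends only on C

# A and B correlate at about r = 0.5 despite no causal link between them.
print(round(correlation(A, B), 2))
</code></pre>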



<p id="viewer-365fm"><strong>2. Hacking Novelty: </strong>refer to something in a way that makes it seem more novel or unintuitive than it is. Perhaps the result is already well known or is merely what just about everyone&#8217;s common sense would already tell them is true. In these cases, researchers really do find what they claim to have found, but what they found is not novel (despite them making it seem so). Hacking Novelty is also connected to the &#8220;Jingle-jangle&#8221; fallacy &#8211; where people can be led to believe two identical concepts are different because they have different names (or, more subtly, because they are operationalized somewhat differently).</p>



<ul class="wp-block-list">
<li><em>Example 1: showing something that is already well-known but giving it a new name that leads people to think it is something new. The concept of “grit” has received this criticism; some people claim it could turn out to be just another word for conscientiousness (or already known facets of conscientiousness) &#8211; though this question does not yet seem to be settled (different sides of this debate can be found in these papers: </em><a rel="noreferrer noopener" href="https://www.researchgate.net/publication/6290064_Grit_Perseverance_and_Passion_for_Long-Term_Goals" target="_blank"><em><u>1</u></em></a><em>, </em><a rel="noreferrer noopener" href="https://journals.sagepub.com/doi/pdf/10.1002/per.2171" target="_blank"><em><u>2</u></em></a><em>, </em><a rel="noreferrer noopener" href="https://drive.google.com/file/d/1NzMPCgZ_Ipbmzewgaj0dmopkfLq582NA/view" target="_blank"><em><u>3</u></em></a><em> and <u><a href="https://www.researchgate.net/publication/304032119_Much_Ado_About_Grit_A_Meta-Analytic_Synthesis_of_the_Grit_Literature">4</a></u>).</em></li>



<li><em>Example 2: showing that A and B are correlated, which seems surprising given how the constructs are named, but if you were to dig into how A and B were measured, it would be obvious they would be correlated.</em></li>



<li><em>Example 3: showing a common-sense result that almost everyone already would predict but making it seem like it&#8217;s not obvious (e.g., by giving it a fancy scientific name).</em></li>
</ul>



<p id="viewer-a209k"><strong>3. Hacking Usefulness: </strong>make a result seem useful or relevant to some important outcome when in fact, it&#8217;s useless and irrelevant. In these cases, researchers find what they claim to have found, but what they find is not useful (despite them making it sound useful).</p>



<ul class="wp-block-list">
<li><em>Example: focusing on statistical significance when the effect size is so small that the result is useless. Clinicians often distinguish between “statistical significance” and “clinical significance” to highlight the pitfalls of ignoring effect sizes when considering the importance of a finding (a toy calculation follows this list).</em></li>
</ul>
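

<p><em>(Here is the toy calculation promised above, mine rather than the essay&#8217;s, showing how a practically meaningless effect can come with a spectacularly small p-value when the sample is huge.)</em></p>



<pre class="wp-block-code"><code># Toy calculation: a trivial effect plus a huge sample yields a tiny
# p-value. Two groups of a million people differ by 0.01 standard
# deviations, an effect far too small to matter in practice.
from math import sqrt
from statistics import NormalDist

n, effect_size = 1_000_000, 0.01
z = effect_size / sqrt(2 / n)       # standard two-sample z statistic
p = 2 * (1 - NormalDist().cdf(z))   # two-sided p-value

print(f"z = {z:.1f}, p = {p:.1e}")  # z = 7.1, p = 1.6e-12
</code></pre>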



<p id="viewer-etfss"><strong>4. Hacking Beauty: </strong>make a result seem clean and beautiful when in fact, it&#8217;s messy or hard to interpret. In these cases, researchers focus on certain details or results and tell a story around those, but they could have focused on other details or results that would have made the story less pretty, less clear-cut, or harder to make sense of. This is related to Giner-Sorolla’s 2012 paper <a href="https://journals.sagepub.com/doi/pdf/10.1177/1745691612457576" target="_blank" rel="noreferrer noopener"><em><u>Science or art: How aesthetic standards grease the way through the publication bottleneck but undermine science</u></em></a><em>. </em>Hacking beauty sometimes reduces to selective reporting of some kind (i.e., selective reporting of measures, analyses, or studies) or at least of selective focus on certain findings and not others. This becomes more difficult with pre-registration; if you have to report the results of planned analyses, there’s less room to make them look pretty (you could just <em>say</em> they’re pretty, but that seems like overclaiming)</p>



<ul class="wp-block-list">
<li><em>Example: emphasizing the parts of the result that tell a clean story while not including (or burying somewhere in the paper) the parts that contradict that story.</em></li>
</ul>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p id="viewer-56mr8">Science faces multiple challenges. Over the past decade, the <a rel="noreferrer noopener" href="https://en.wikipedia.org/wiki/Replication_crisis" target="_blank"><u>replication crisis</u></a> and subsequent <a rel="noreferrer noopener" href="https://en.wikipedia.org/wiki/Open_science" target="_blank"><u>open science movement</u></a> have greatly increased awareness of p-hacking as a problem. Measures have begun to be put in place to reduce p-hacking. Importance Hacking is another substantial problem, but it has received far less attention.</p>



<figure class="wp-block-image"><img data-recalc-dims="1" decoding="async" src="https://i0.wp.com/static.wixstatic.com/media/f4e552_94289803042f43d68a85e7c490b1fa1c~mv2.jpg/v1/fill/w_1480%2Ch_1110%2Cal_c%2Cq_85%2Cusm_0.66_1.00_0.01%2Cenc_auto/f4e552_94289803042f43d68a85e7c490b1fa1c~mv2.jpg?w=750&#038;ssl=1" alt=""/><figcaption class="wp-element-caption"><em>Digital art created using the A.I. DALL</em>·<em>E</em></figcaption></figure>



<p id="viewer-at41b"></p>



<p id="viewer-aqs8s">If a pipe is leaking from two holes and its pressure is kept fixed, then repairing one hole will result in the other one leaking faster. Similarly, as best practices increasingly become commonplace as a means to reduce p-hacking, so long as the career pressures to publish in top journals don&#8217;t let up, the occurrence of Importance Hacking may increase.</p>



<p id="viewer-3rjml">It&#8217;s time to start the conversation about how Importance Hacking can be addressed.</p>



<p id="viewer-agpq6">If you&#8217;re interested in learning more about Importance Hacking, you can listen to <a rel="noreferrer noopener" href="https://clearerthinkingpodcast.com/episode/122" target="_blank"><u>psychology professor Alexa Tullett and me discussing it on the Clearer Thinking podcast</u></a> (there, I refer to it as &#8220;Importance Laundering,&#8221; but I now think &#8220;Importance Hacking&#8221; is a better name) or me talking about it on the <a rel="noreferrer noopener" href="https://www.fourbeers.com/98" target="_blank"><u>Two Psychologists Four Beers podcast</u></a>. We also discuss my new project, <a rel="noreferrer noopener" href="https://replications.clearerthinking.org/" target="_blank"><u>Transparent Replications</u></a>, which conducts rapid replications of recently published psychology papers in top journals in an effort to shift incentives and create more reliable, replicable research. If you enjoyed this article, you may be interested in checking our <a rel="noreferrer noopener" href="https://replications.clearerthinking.org/replications/" target="_blank"><u>replication reports</u></a> and learning more <a rel="noreferrer noopener" href="https://replications.clearerthinking.org/about/" target="_blank"><u>about the project</u></a>.</p>



<hr class="wp-block-separator has-alpha-channel-opacity"/>



<p id="viewer-es1me"><em>Did you like this article? If so, you may like to explore the ClearerThinking Podcast, where I have fun, in-depth conversations with brilliant people about ideas that matter. </em><a rel="noreferrer noopener" href="https://clearerthinkingpodcast.com/" target="_blank"><em><u>Click here to see a full list of episodes</u></em></a><em>.</em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.spencergreenberg.com/2022/12/importance-hacking-a-major-yet-rarely-discussed-problem-in-science/feed/</wfw:commentRss>
			<slash:comments>6</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">3057</post-id>	</item>
		<item>
		<title>Many global challenges arise due to collective action problems or incentive misalignment</title>
		<link>https://www.spencergreenberg.com/2020/11/many-global-challenges-arise-due-to-collective-action-problems-or-incentive-misalignment/</link>
					<comments>https://www.spencergreenberg.com/2020/11/many-global-challenges-arise-due-to-collective-action-problems-or-incentive-misalignment/#respond</comments>
		
		<dc:creator><![CDATA[admin]]></dc:creator>
		<pubDate>Mon, 16 Nov 2020 19:40:00 +0000</pubDate>
				<category><![CDATA[Essays]]></category>
		<category><![CDATA[AI]]></category>
		<category><![CDATA[animal welfare]]></category>
		<category><![CDATA[biorisks]]></category>
		<category><![CDATA[climate change]]></category>
		<category><![CDATA[collective action problems]]></category>
		<category><![CDATA[cooperation]]></category>
		<category><![CDATA[dilemma]]></category>
		<category><![CDATA[gain-of-function]]></category>
		<category><![CDATA[incentive misalignment]]></category>
		<category><![CDATA[incentives]]></category>
		<category><![CDATA[misinformation]]></category>
<category><![CDATA[nuclear weapons]]></category>
		<category><![CDATA[perverse incentives]]></category>
		<category><![CDATA[replication crisis]]></category>
		<category><![CDATA[short-term]]></category>
		<category><![CDATA[societal challenges]]></category>
		<category><![CDATA[societal problems]]></category>
		<guid isPermaLink="false">https://www.spencergreenberg.com/?p=2746</guid>

					<description><![CDATA[Many of the biggest challenges that we face in society are due to one or both of these types of problems: (A) Collective Action Problems,&#160;where many individuals or groups are&#160;currently&#160;better off taking action X, even though they&#8217;d be better off in the long-term if everyone agreed not to take action X. Some of the big [&#8230;]]]></description>
										<content:encoded><![CDATA[
<p>Many of the biggest challenges that we face in society are due to one or both of these types of problems:</p>



<p><strong>(A) Collective Action Problems,</strong>&nbsp;where many individuals or groups are&nbsp;<em>currently</em>&nbsp;better off taking action X, even though they&#8217;d be better off in the long-term if everyone agreed not to take action X.</p>



<p>Some of the big challenges with Collective Action Problems are (i) getting people or groups to agree to stop the behavior in the first place, and then&nbsp;(ii) creating a very strong commitment mechanism so that the parties don&#8217;t simply defect against the agreement later.</p>






<p><strong>(B) Incentive Misalignment Problems,&nbsp;</strong>where many individuals or groups are currently better off taking action X, even though such actions harm the rest of society broadly.</p>



<p>The big challenge with an Incentive Misalignment Problem is finding some way to align the incentives of individuals and groups with the incentives of society as a whole.</p>
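

<p><em>(The shared structure of (A) and (B) can be made concrete with a toy payoff model; the numbers below are entirely illustrative.)</em></p>



<pre class="wp-block-code"><code># Toy payoff model of a collective action problem (illustrative numbers).
# Each of 10 actors gains 1.0 by taking action X, but each act of X
# imposes a cost of 0.3 on every actor (including the one who acts).
ACTORS, PRIVATE_GAIN, COST_ON_EACH = 10, 1.0, 0.3

def my_payoff(i_take_x, others_taking_x):
    total_x = others_taking_x + (1 if i_take_x else 0)
    gain = PRIVATE_GAIN if i_take_x else 0.0
    return gain - COST_ON_EACH * total_x

# Whatever the others do, taking X leaves me better off by 0.7...
for others in (0, ACTORS - 1):
    print(my_payoff(True, others) - my_payoff(False, others))   # 0.7
# ...yet if everyone follows that logic, everyone ends up worse off:
print(my_payoff(True, ACTORS - 1))   # -2.0 each when all 10 take X
print(my_payoff(False, 0))           #  0.0 each when no one does
# An Incentive Misalignment Problem has the same shape, except the
# 0.3 cost falls on bystanders rather than on the actors themselves.
</code></pre>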



<hr class="wp-block-separator is-style-default"/>



<p>Unfortunately, we humans seem to be quite bad at solving these kinds of problems on a society-wide scale much of the time (we are better at solving them on smaller scales, such as within companies and within families).</p>



<p>I think that figuring out how to better solve Collective Action Problems and Incentive Misalignment Problems is extremely important for the future of humanity.</p>



<hr class="wp-block-separator is-style-default"/>



<p>Here&#8217;s my list of some of the major unsolved problems in the world right now that are of these types:</p>



<hr class="wp-block-separator is-style-default"/>



<p><strong>(A) Collective Action Problems</strong></p>



<p>1. The development of nuclear weapons (e.g., North Korea developed them pretty recently, Iran may still develop them, the U.S. and China may still develop more of them, etc.). We would all be safer if countries could somehow credibly commit to getting rid of their nuclear weapons and never trying to make them again.</p>



<p>2. Arms race dynamics in the development of advanced A.I. (e.g., where players feel rushed by competition instead of proceeding cautiously and cooperating on the numerous safety challenges).</p>



<hr class="wp-block-separator is-style-default"/>



<p><strong>(B) Incentive Misalignment Problems</strong></p>



<p>3. Publication of risky forms of &#8220;gain-of-function&#8221; research (where researchers find ways to make viruses more infectious or more deadly to humans) and other forms of bioresearch that might threaten our species.</p>



<p>4. Products being optimized to generate craving, immediate hyper-stimulation, and/or addiction at the expense of long-term consumer benefit (e.g., junk food, some social media, clickbait, some video games, etc.).</p>



<p>5. The creation of animal products involving what seems to be vast amounts of suffering (e.g., keeping an animal in a tiny cage its whole life so that a human can spend 30 mins enjoying eating it).</p>



<p>6. The incentive that many groups and individuals have to spread false, misleading, and/or politically biased information; this bad information sometimes overwhelms the spread of true information on the same topics (leading to bad decisions, misinformed voters, and polarization).</p>



<p>7. Over-focus on the short-term welfare of society at the expense of the long term (including our own lives in 10-20 years&#8217; time and future generations).</p>



<hr class="wp-block-separator is-style-default"/>



<p><strong>(C) Hybrid Collective Action Problems + Incentive Misalignment Problems</strong></p>



<p>8. Running a highly risky environmental experiment with the stupendous quantity of greenhouse gases we dump into our atmosphere. If we could cooperate on taking the most cost-effective actions that reduce greenhouse gas emissions, we could greatly reduce the risk. But this is not strictly a collective action problem, as there are some companies and people who would be individually better off continuing to pollute tremendous amounts (and the personal benefits to them would likely offset the projected negative impacts of global warming on them).</p>



<p>9. The disturbingly high levels of false results that seem to be present in various branches of science (e.g., ~40% of claims from&nbsp;<a rel="noreferrer noopener" target="_blank" href="https://www.nature.com/articles/s41562-018-0399-z?iOS=">social science papers in top journals</a>&nbsp;appear not to replicate, and an even higher percentage&nbsp;<a rel="noreferrer noopener" target="_blank" href="https://elifesciences.org/articles/71601">fail to replicate in preclinical cancer biology</a>). Even those who want to change their practices often find themselves stuck in a system that rewards shoddy practices that produce compelling-seeming results.</p>



<hr class="wp-block-separator is-style-default"/>



<p><em>This piece was first written on November 16, 2020, and first appeared on this site on May 13, 2022.</em></p>
]]></content:encoded>
					
					<wfw:commentRss>https://www.spencergreenberg.com/2020/11/many-global-challenges-arise-due-to-collective-action-problems-or-incentive-misalignment/feed/</wfw:commentRss>
			<slash:comments>0</slash:comments>
		
		
		<post-id xmlns="com-wordpress:feed-additions:1">2746</post-id>	</item>
	</channel>
</rss>
