<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	
	>
<channel>
	<title>
	Comments on: Testing Too Many Hypotheses	</title>
	<atom:link href="https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/</link>
	<description></description>
	<lastBuildDate>Sun, 03 Feb 2013 20:47:13 +0000</lastBuildDate>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.9.4</generator>
	<item>
		<title>
		By: Spencer		</title>
		<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-4253</link>

		<dc:creator><![CDATA[Spencer]]></dc:creator>
		<pubDate>Sun, 03 Feb 2013 20:47:13 +0000</pubDate>
		<guid isPermaLink="false">http://www.spencergreenberg.com/?p=249#comment-4253</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-4252&quot;&gt;Paul&lt;/a&gt;.

Great question. Treating the 20th hypothesis as a promising lead and gathering fresh data works well. Another approach that is rarely taken in most fields: after collecting data, but before doing any analysis, set some of that data aside and don&#039;t touch it. At the very end, test that really promising 20th hypothesis (and only that hypothesis) against this never-touched data, and see if it still holds up!]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-4252">Paul</a>.</p>
<p>Great question. Treating the 20th hypothesis as a promising lead and gathering fresh data works well. Another approach that is rarely taken in most fields: after collecting data, but before doing any analysis, set some of that data aside and don&#8217;t touch it. At the very end, test that really promising 20th hypothesis (and only that hypothesis) against this never-touched data, and see if it still holds up!</p>
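<p>A minimal sketch of what that could look like in practice, in Python (the simulated data, variable names, and 5% threshold are illustrative assumptions, not part of the original example):</p>
<pre><code>import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated dataset: one outcome and 20 candidate predictors, none truly related.
n = 200
outcome = rng.normal(size=n)
predictors = rng.normal(size=(n, 20))

# Step 1: before any analysis, set part of the data aside and do not touch it.
split = n // 2
explore_y, holdout_y = outcome[:split], outcome[split:]
explore_x, holdout_x = predictors[:split], predictors[split:]

# Step 2: explore freely -- test all 20 hypotheses, but only on the exploration half.
p_values = [stats.pearsonr(explore_x[:, j], explore_y)[1] for j in range(20)]
best = int(np.argmin(p_values))
print(f"Most promising hypothesis on exploration data: #{best + 1}, p = {p_values[best]:.3f}")

# Step 3: test that one hypothesis (and only that one) on the untouched hold-out data.
r, p_confirm = stats.pearsonr(holdout_x[:, best], holdout_y)
print(f"Same hypothesis on hold-out data: p = {p_confirm:.3f}")

# Because only a single test is run on the hold-out set, its p-value needs no
# multiple-testing adjustment; a spurious "discovery" will usually fail this check.
</code></pre>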
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Paul		</title>
		<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-4252</link>

		<dc:creator><![CDATA[Paul]]></dc:creator>
		<pubDate>Sun, 03 Feb 2013 17:38:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.spencergreenberg.com/?p=249#comment-4252</guid>

					<description><![CDATA[Mind blown. I knew about the risks of sampling bias at the outset of a study and selective reporting of results at the end, but it never occurred to me that you could, in a sense, bake both of those flaws into the heart of it!

It&#039;s understandable that a researcher would not want the effort of collecting a data set to &quot;go to waste&quot;, though. Is there a possibility for rehabilitating the 20th hypothesis, once found? E.g., treat it as a potentially promising lead and gather a fresh data set to see if it repeats?]]></description>
			<content:encoded><![CDATA[<p>Mind blown. I knew about the risks of sampling bias at the outset of a study and selective reporting of results at the end, but it never occurred to me that you could, in a sense, bake both of those flaws into the heart of it!</p>
<p>It&#8217;s understandable that a researcher would not want the effort of collecting a data set to &#8220;go to waste&#8221;, though. Is there a possibility for rehabilitating the 20th hypothesis, once found? E.g., treat it as a potentially promising lead and gather a fresh data set to see if it repeats?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Spencer		</title>
		<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-552</link>

		<dc:creator><![CDATA[Spencer]]></dc:creator>
		<pubDate>Sat, 12 Nov 2011 16:24:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.spencergreenberg.com/?p=249#comment-552</guid>

					<description><![CDATA[In reply to &lt;a href=&quot;https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-551&quot;&gt;Hendy&lt;/a&gt;.

Hi Hendy, thanks for the comment. It&#039;s definitely difficult to write about rationality without having some overlap with Less Wrong, for obvious reasons. Hopefully when that happens I&#039;ll at least explain the concepts in a somewhat different way or put my own spin on them.

The extent to which the problems in this post occur varies a lot from field to field and researcher to researcher. I know of cases where results were purely data-mined, with the researcher trying hypothesis after hypothesis on a fixed data set until something statistically significant was found. Unless this is done very carefully, it is very bad science.

What is probably more common is that the researcher does come up with hypotheses beforehand, but they come up with quite a number of them. Then, their paper emphasizes those few results that are statistically significant, but doesn&#039;t adjust the p-values for the many hypotheses tested.

What is perhaps more common still is when a researcher runs multiple experiments and only publishes the ones that lead to statistically significant outcomes. This is quite understandable, as journals are less likely to accept negative results for publication, and, besides, it doesn&#039;t always seem like a good investment of time to write up a boring negative result. Of course, the more unpublished negative results they have for each positive result, the more likely it is that the positive result is just a consequence of testing so many hypotheses.

Then, there is the macro scale problem of many researchers testing many different hypotheses, with a selection bias in favor of publishing the ones that work out.

It is difficult to generalize about the prevalence of all of this activity, though, because it depends so much on the field, and because it is not always obvious when you read a paper how the results were arrived at. When you&#039;re in a research area where you can easily replicate a promising result in order to get a p-value of 0.00001, these issues are obviously much less of a problem.

Here is a disturbing statistic from the &lt;a href=&quot;http://www.nytimes.com/2011/11/03/health/research/noted-dutch-psychologist-stapel-accused-of-research-fraud.html?_r=3&amp;hp=&amp;adxnnl=1&amp;adxnnlx=1320294372-alot2mj7npzgpo0ryxwlyw&quot; rel=&quot;nofollow&quot;&gt;New York Times&lt;/a&gt;:

&quot;In a survey of more than 2,000 American psychologists scheduled to be published this year, Leslie John of Harvard Business School and two colleagues found that 70 percent had acknowledged, anonymously, to cutting some corners in reporting data. About a third said they had reported an unexpected finding as predicted from the start, and about 1 percent admitted to falsifying data.&quot;]]></description>
			<content:encoded><![CDATA[<p>In reply to <a href="https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-551">Hendy</a>.</p>
<p>Hi Hendy, thanks for the comment. It&#8217;s definitely difficult to write about rationality without having some overlap with Less Wrong, for obvious reasons. Hopefully when that happens I&#8217;ll at least explain the concepts in a somewhat different way or put my own spin on them.</p>
<p>The extent to which the problems in this post occur varies a lot from field to field and researcher to researcher. I know of cases where results were purely data-mined, with the researcher trying hypothesis after hypothesis on a fixed data set until something statistically significant was found. Unless this is done very carefully, it is very bad science.</p>
<p>What is probably more common is that the researcher does come up with hypotheses beforehand, but they come up with quite a number of them. Then, their paper emphasizes those few results that are statistically significant, but doesn&#8217;t adjust the p-values for the many hypotheses tested.</p>
<p>What is perhaps more common still is when a researcher runs multiple experiments and only publishes the ones that lead to statistically significant outcomes. This is quite understandable, as journals are less likely to accept negative results for publication, and, besides, it doesn&#8217;t always seem like a good investment of time to write up a boring negative result. Of course, the more unpublished negative results they have for each positive result, the more likely it is that the positive result is just a consequence of testing so many hypotheses.</p>
<p>Then, there is the macro scale problem of many researchers testing many different hypotheses, with a selection bias in favor of publishing the ones that work out.</p>
<p>It is difficult to generalize about the prevalence of all of this activity, though, because it depends so much on the field, and because it is not always obvious when you read a paper how the results were arrived at. When you&#8217;re in a research area where you can easily replicate a promising result in order to get a p-value of 0.00001, these issues are obviously much less of a problem.</p>
<p>Here is a disturbing statistic from the <a href="http://www.nytimes.com/2011/11/03/health/research/noted-dutch-psychologist-stapel-accused-of-research-fraud.html?_r=3&#038;hp=&#038;adxnnl=1&#038;adxnnlx=1320294372-alot2mj7npzgpo0ryxwlyw" rel="nofollow">New York Times</a>:</p>
<p>&#8220;In a survey of more than 2,000 American psychologists scheduled to be published this year, Leslie John of Harvard Business School and two colleagues found that 70 percent had acknowledged, anonymously, to cutting some corners in reporting data. About a third said they had reported an unexpected finding as predicted from the start, and about 1 percent admitted to falsifying data.&#8221;</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Hendy		</title>
		<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-551</link>

		<dc:creator><![CDATA[Hendy]]></dc:creator>
		<pubDate>Sat, 12 Nov 2011 14:47:54 +0000</pubDate>
		<guid isPermaLink="false">http://www.spencergreenberg.com/?p=249#comment-551</guid>

					<description><![CDATA[First off -- nice blog. I was skeptical at first, as it seemed that you were reproducing themes that clearly existed in LessWrong posts. I retract that skepticism. I really like how you ended &quot;How Great We Are&quot; with the practical worksheet, as well as your synthesis of LW material with the work of Ariely, Wiseman, etc. I&#039;m a fan!

This post surprised me greatly! I had no idea this was occurring or was even potentially common. I&#039;ve read some papers in which it is stated, &quot;The data collected aligned well with H1, but not with H2.&quot; I, perhaps wrongly, read this as the researchers having made their predictions &lt;i&gt;well before&lt;/i&gt; the data was collected.

It seems that you are suggesting that:
1) Data is collected and &lt;i&gt;then&lt;/i&gt; potential hypotheses are examined for best fit?
2) Data is collected, and if original, pre-data-collection hypotheses fail, they a) don&#039;t state this and b) carry on with #1 like nothing happened?
3) The same as #2, except that new hypotheses are not sought; the results are simply never made public.

Is that a reasonable read of the land?]]></description>
			<content:encoded><![CDATA[<p>First off &#8212; nice blog. I was skeptical at first, as it seemed that you were reproducing themes that clearly existed in LessWrong posts. I retract that skepticism. I really like how you ended &#8220;How Great We Are&#8221; with the practical worksheet, as well as your synthesis of LW material with the work of Ariely, Wiseman, etc. I&#8217;m a fan!</p>
<p>This post surprised me greatly! I had no idea this was occurring or was even potentially common. I&#8217;ve read some papers in which it is stated, &#8220;The data collected aligned well with H1, but not with H2.&#8221; I, perhaps wrongly, read this as the researchers having made their predictions <i>well before</i> the data was collected.</p>
<p>It seems that you are suggesting that:<br />
1) Data is collected and <i>then</i> potential hypotheses are examined for best fit?<br />
2) Data is collected, and if original, pre-data-collection hypotheses fail, they a) don&#8217;t state this and b) carry on with #1 like nothing happened?<br />
3) The same as #2, except that new hypotheses are not sought; the results are simply never made public.</p>
<p>Is that a reasonable read of the land?</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: afadfa		</title>
		<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-376</link>

		<dc:creator><![CDATA[afadfa]]></dc:creator>
		<pubDate>Tue, 11 Oct 2011 02:45:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.spencergreenberg.com/?p=249#comment-376</guid>

					<description><![CDATA[Someone got me a book with lots of &quot;facts&quot; about left-handers. Many of them just didn&#039;t make sense, things along the lines of left-handers being more likely to excel at tennis and do poorly at pool. I suspected that those statistics weren&#039;t accurate, but I could imagine why. What you&#039;re describing seems very likely. People want to find differences between lefties and righties, but they don&#039;t care what those differences are.]]></description>
			<content:encoded><![CDATA[<p>Someone got me a book with lots of &#8220;facts&#8221; about left-handers. Many of them just didn&#8217;t make sense, things along the lines of left-handers being more likely to excel at tennis and do poorly at pool. I suspected that those statistics weren&#8217;t accurate, but I could imagine why. What you&#8217;re describing seems very likely. People want to find differences between lefties and righties, but they don&#8217;t care what those differences are.</p>
]]></content:encoded>
		
			</item>
		<item>
		<title>
		By: Alrenous		</title>
		<link>https://www.spencergreenberg.com/2011/10/testing-too-many-hypotheses/#comment-375</link>

		<dc:creator><![CDATA[Alrenous]]></dc:creator>
		<pubDate>Mon, 10 Oct 2011 17:57:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.spencergreenberg.com/?p=249#comment-375</guid>

					<description><![CDATA[&lt;blockquote&gt;Unfortunately, when you’re reading a paper, there is no way to tell how many hypotheses the researcher tested on his dataset unless he chooses to publish it. &lt;/blockquote&gt;

Because this idea is well-established, it &lt;i&gt;should&lt;/i&gt; have become well-established to report the number of hypotheses. Anyone who doesn&#039;t is obviously engaged in sophistry. 
Come to think of it, this is probably why most statistical science is just wrong. The p=0.05 value they calculated is mathematically wrong because they calculated it as if they had only tested one hypothesis, and didn&#039;t modify it to take into account the number of hypotheses checked. (Before just now, I knew the 0.05 was wrong; I just thought it was due to something else.)

I&#039;m going to have to double-check that I properly adjust my theories, as well. Okay, it matches the data...but on what try? It&#039;s not too important because I always test predictively as well, which means on new data. But I&#039;d still like to use it to correctly assess how likely my prediction is a priori, even if I&#039;m going to test it regardless. 

&lt;blockquote&gt;This is due to the fact that hypothesis disconfirmations (e.g. “no association was found between cabbage eating and longevity”) are generally less interesting and harder to publish than confirmations&lt;/blockquote&gt;

There is a &lt;a href=&quot;http://www.jnr-eeb.org/index.php/jnr&quot; rel=&quot;nofollow&quot;&gt;journal of negative results&lt;/a&gt;. But frankly it should be the most prestigious journal. Number one. Since we know no natural prestige accrues to negative results, to get them widely known, we&#039;re gonna have to lavishly adorn them with artificial prestige. I&#039;d rather null results be over-reported than the reverse. How about you?]]></description>
			<content:encoded><![CDATA[<blockquote><p>Unfortunately, when you’re reading a paper, there is no way to tell how many hypotheses the researcher tested on his dataset unless he chooses to publish it. </p></blockquote>
<p>Because this idea is well-established, it <i>should</i> have become well-established to report the number of hypotheses. Anyone who doesn&#8217;t is obviously engaged in sophistry.<br />
Come to think of it, this is probably why most statistical science is just wrong. The p=0.05 value they calculated is mathematically wrong because they calculated it as if they had only tested one hypothesis, and didn&#8217;t modify it to take into account the number of hypotheses checked. (Before just now, I knew the 0.05 was wrong; I just thought it was due to something else.)</p>
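<p>A quick illustration of the size of that error, in Python (the 20 hypotheses and the 0.05 level are just example numbers, and the calculation assumes independent tests):</p>
<pre><code>m = 20        # hypotheses tested on the same data set
alpha = 0.05  # nominal per-test significance level

# Chance of at least one false positive if all 20 null hypotheses are true
# and the tests are independent:
family_wise_error = 1 - (1 - alpha) ** m
print(f"P(at least one spurious 'significant' result) = {family_wise_error:.2f}")  # ~0.64

# One standard (conservative) fix, the Bonferroni correction, divides the
# significance threshold by the number of tests:
print(f"Bonferroni-adjusted per-test threshold: {alpha / m:.4f}")  # 0.0025
</code></pre>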
<p>I&#8217;m going to have to double-check that I properly adjust my theories, as well. Okay, it matches the data&#8230;but on what try? It&#8217;s not too important because I always test predictively as well, which means on new data. But I&#8217;d still like to use it to correctly assess how likely my prediction is a priori, even if I&#8217;m going to test it regardless. </p>
<blockquote><p>This is due to the fact that hypothesis disconfirmations (e.g. “no association was found between cabbage eating and longevity”) are generally less interesting and harder to publish than confirmations</p></blockquote>
<p>There is a <a href="http://www.jnr-eeb.org/index.php/jnr" rel="nofollow">journal of negative results</a>. But frankly it should be the most prestigious journal. Number one. Since we know no natural prestige accrues to negative results, to get them widely known, we&#8217;re gonna have to lavishly adorn them with artificial prestige. I&#8217;d rather null results be over-reported than the reverse. How about you?</p>
]]></content:encoded>
		
			</item>
	</channel>
</rss>
