Publication bias and p-hacking are two bits of jargon for journal and researcher behaviors that make it more likely that the results we observe in published papers arose just by chance.
First: Academic journals are more likely to publish papers that find significant results. It’s not hard to see why this might be true. It’s not very interesting to publish a result saying that M&M color doesn’t impact multiplication speed — that’s kind of what we expected. But a result that says it does matter — that’s more surprising, and more likely to spark the interest of a journal editor.
This is what we call publication bias, and it means that the results we see in print are much more likely to be statistical accidents than the full set of studies actually run. Often, many researchers are looking into the same question. It’s not just my research team who is interested in the M&M-multiplication relationship — imagine there are 99 other teams doing the same thing. Even if there is no relationship at all, on average about 5 of those 100 teams will find something significant, simply because the conventional 5% significance threshold lets through roughly 1 false positive in every 20 tests of a true null.
These 5 “successful” teams are more likely to get their results published. That’s what we all see in journals, but what we do not see is the 95 times it didn’t work. When we read these studies, we’re assuming, implicitly, that we are seeing all the studies that were run. But we’re not, and we’re more likely to see the significant-by-chance results.
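If you want to see this play out, here is a minimal simulation sketch in Python (the use of NumPy and SciPy, the group sizes, and all the variable names are my own illustrative choices, not anything from a real study). One hundred teams each test an effect that is truly zero at the 5% level, and a handful of them get a "significant" result anyway; those are the ones most likely to reach print.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_teams = 100        # independent teams studying the same (null) question
n_per_group = 50     # participants per group in each team's experiment
alpha = 0.05         # conventional significance threshold

published = []
for team in range(n_teams):
    # There is no true effect: both groups come from the same distribution.
    mm_group = rng.normal(loc=10.0, scale=2.0, size=n_per_group)
    control_group = rng.normal(loc=10.0, scale=2.0, size=n_per_group)
    result = ttest_ind(mm_group, control_group)
    if result.pvalue < alpha:
        published.append(team)  # the "successful" teams we end up reading about

print(f"{len(published)} of {n_teams} teams found a significant effect by chance")
```

Run it a few times with different seeds and the count bounces around 5, which is exactly the point: chance alone hands a few teams a publishable result.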
Publication bias would be a problem all on its own. But it gets worse when it interacts with researchers’ incentives. Researchers need to publish, and (see above) it is easier to do so when results are significant. This can lead to what people sometimes call p-hacking (the “p” refers to the p-value, the probability of seeing a result this extreme just by chance).
When researchers run a study, there are often a lot of ways to analyze the data. You can analyze the impact on different subgroups of the population. You can analyze the impact of different circumstances. You can test many different treatments. The idea of the xkcd cartoon is that you could test the impact of all the different M&M colors on some outcome.
The more of these tests you do, the more likely you are to get a significant effect by chance. If you do 100 tests, you expect about 5 of them to be significant at the 5% level even if none of the effects are real. And then, because of publication bias, you write up the results focusing only on the significant subgroups or significant M&M colors. Of course, those results are just accidents. But as consumers of research, we do not see all the other tests that happened in the background.
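To put rough numbers on that claim, here is a quick back-of-the-envelope calculation in Python (treating the 100 tests as independent, which is a simplifying assumption of mine):

```python
alpha = 0.05     # significance threshold
n_tests = 100    # subgroups, circumstances, M&M colors, ...

# Expected number of false positives when none of the effects are real:
expected_false_positives = n_tests * alpha        # 100 * 0.05 = 5.0

# Chance of at least one "significant" result purely by accident
# (assuming the tests are independent):
p_at_least_one = 1 - (1 - alpha) ** n_tests       # roughly 0.994

print(expected_false_positives, round(p_at_least_one, 3))
```

So on average 5 of the tests come up significant, and the chance of getting at least one accidental “finding” to write up is over 99%.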
For these two reasons, some of what we see published, even if it comes from a randomized experiment, is likely to be a result of statistical chance. There is a somewhat notorious paper that suggests that “most” research findings are false; I think that goes too far, but it’s a perspective.