10 September 2018

Replication crisis: Cultural norms in the lab

I recently listened to an excellent episode of the "Stuk Rood Vlees" ("Hunk of Red Meat") podcast that is hosted by the Dutch political scientist Armèn Hakhverdian (@hakhverdian on Twitter). His guest was Daniël Lakens (@lakens) and they talked at great length --- to the extent that the episode had to be split into two 1-hour sections --- about the replication crisis.

This podcast episode was recorded in Dutch, which is reasonable since that's the native language of both protagonists, but a little unfortunate for the more than 99.5% of the world's population who don't understand it. (Confession: I'm a little bit lukewarm on podcasts --- apart from ones with me as a guest, which are fantastic --- because the lack of a transcript makes them hard to search, and even harder to translate.)

This is a particular shame because Daniël is on sparkling form in this podcast. So I've taken the liberty of transcribing what I thought was the most important part, just over 15 minutes long, where Armèn and Daniël talk about publication bias and the culture that produces it. The transcription has been done rather liberally, so don't use it as a way to learn Dutch from the podcast. I've run it past both of the participants and they are happy that it doesn't misrepresent what they said.

This discussion starts at around 13:06, after some discussion of the Stapel and Bem affairs from 2010-2011, ending with surprise that when Stapel --- as Dean --- claimed to have been collecting his data himself, everybody thought this was really nice of him, and nobody seemed to find it weird. Now read on...

Daniël Lakens: Looking back, the most important lesson I've learned about this --- and I have to say, I'm glad that I had started my career back then, around 2009, back when we really weren't doing research right, so I know this from first-hand experience --- is just how important the influence of conforming to norms is. You imagine that you're this highly rational person, learning all these objective methods and applying them rigorously, and then you find yourself in this particular lab and someone says "Yeah, well, actually, the way we do it round here is X", and you just accept that. You don't think it's strange, it's just how things are. Sometimes something will happen and you think "Hmmm, that's a bit weird", but we spend our whole lives in the wider community accepting that slightly weird things happen, so why should it be different in the scientific community? Looking back, I'm thinking "Yeah, that wasn't very good", but at the time you think, "Well, maybe this isn't the optimal way to do it, but I guess everyone's OK with it".

Armèn Hakhverdian: When you arrive somewhere as a newbie and everyone says "This is how we do it here, in fact, this is the right way, the only way to do it", it's going to be pretty awkward to question that.

DL: Yes, and to some extent that's a legitimate part of the process of training scientists. The teacher tells you "Trust me, this is how you do it". And of course up to some point you kind of have to trust these people who know a lot more than you do. But it turns out that quite a lot of that trust isn't justified by the evidence.

AH: Have you ever tried to replicate your own research?

DL: The first article I was ever involved with as a co-author --- so much was wrong with that. There was a meta-analysis of the topic that came out showing that overall, across the various replications, there was no effect, and we published a comment saying that we didn't think there was any good evidence left.

AH: What was that study about?

DL: Looking back, I can see that it was another of these fun effects with little theoretical support ---

AH: Media-friendly research.

DL: Yep, there was a lot of that back then. This was a line of research where researchers tried to show that how warm or heavy something was could affect cognition. Actually, this is something that I still study, but in a smarter way. Anyway, we were looking at weight, and we thought there might be a relation between holding a heavy object and thinking that certain things were more important, more "weighty". So for example we showed that if you gave people a questionnaire to fill in and it was attached to a heavy clipboard, they would give different, more "serious" answers than if the clipboard was lighter. Looking back, we didn't analyse this very honestly --- there was one experiment that didn't give us the result we wanted, so we just ignored it, whereas today I'd say, no, you have to report that as well. Some of us wondered at the time if it was the right thing to do, but then we said, well, that's how everyone else does it.

AH: There are several levels at which things can be done wrong. Stapel making his data up is obviously horrible, but as you just described you can also just ignore a result you don't like, or you can keep analysing the data in a bunch of ways until you find something you can publish. Is there a scale of wrongdoing? We could just call it all fraud, but for example you could just have someone who is well-meaning but doesn't understand statistics --- that isn't an excuse, but it's a different type of problem from conscious fraud.

DL: I think this is also very dependent on norms. There are things that we still think are acceptable today, but which we might look back on in 20 years time and think, how could we ever have thought that was OK? Premeditated fraud is a pretty easy call, a bit like murder, but in the legal system you also have the idea of killing someone, not deliberately, but by gross negligence, and I think the problems we have now are more like that. We've known for 50 years or more that we have been letting people with insufficient training have access to data, and now we're finally starting to accept that we have to start teaching people that you can't just trawl through data and publish the patterns that you find as "results". We're seeing a shift --- whereas before you could say "Maybe they didn't know any better", now we can say, "Frankly, this is just negligent". It's not a plausible excuse to pretend that you haven't noticed what's been going on for the past 10 years.

   Then you have the question of not publishing non-significant results. This is a huge problem. You look at the published literature and more than 90% of the studies show positive results, although we know that lots of research just doesn't work out the way we hoped. As a field we still think that it's OK to not publish that kind of study because we can say, "Well, where could I possibly get it published?". But if you ask people who don't work in science, they think this is nuts. There was a nice study about this in the US [Nick: I believe that this is Pickett and Roche, 2018; PDF available here as of 2020-02-04], where they asked people, "Suppose a researcher only publishes results that support his or her hypotheses, what should happen?", and people say, "Well, clearly, that researcher should be fired". That's the view of dispassionate observers about what most scientists think is a completely normal way to work. So there's this huge gap, and I hope that in, say, 20 years time, we'll have fixed that, and nobody will think that it's OK to withhold results. That's a long time, but there's a lot that still needs to be done. I often say to students, if we can just fix this problem of publication bias during our careers, alongside the actual research we do, that's the biggest contribution to science that any of us could make.

AH: So the problem is, you've got all these studies being done all around the world, but only a small fraction gets published. And that's not a random sample of the total --- it's certain types of studies, and that gives a distorted picture of the subject matter.

DL: Right. If you read in the newspaper that there's a study showing that eating chocolate makes you lose weight, you'll probably find that there were 40 or 100 studies done, and in one of them the researchers happened to look at how much chocolate people ate and how their weight changed, and that one study gets published. And of course the newspapers love this kind of story. But it was just a random blip in that one study out of 100. And the question is, how much of the literature is this kind of random blip, and how much is reliable.
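
[Nick: A tiny simulation makes this "one blip in 100" point concrete. The Python sketch below is my own illustration, not anything from the podcast: it runs 100 pretend studies in which the true effect is exactly zero and counts how many still come out significant at p < .05. The group sizes and random seed are arbitrary.]

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2018)
    n_studies, n_per_group = 100, 30

    significant = 0
    for _ in range(n_studies):
        # Two groups drawn from the same distribution: the true effect is zero.
        chocolate_eaters = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        controls = rng.normal(loc=0.0, scale=1.0, size=n_per_group)
        if stats.ttest_ind(chocolate_eaters, controls).pvalue < 0.05:
            significant += 1

    print(f"{significant} of {n_studies} null studies were 'significant' at p < .05")
    # With alpha = .05 we expect around 5 false positives. If only those few
    # studies reach a journal (and then a newspaper), the published record
    # looks nothing like the full set of studies that were actually run.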

AH: For many years I taught statistics to first- and second-year undergraduates who needed to do small research projects, but I never talked about this kind of thing. And lots of these students would come to me after collecting their data and say, "Darn, I didn't get a significant result". It's like there's this inherent belief that you have to get statistical significance to have "good research". But whether research is good or not is all about the method, not the results. It's not a bad thing that a hypothesis goes unsupported.

DL: But it's really hypocritical to tell a first-year student to avoid publication bias, and then to say "Hey, look at my impressive list of publications", when that list is full of significant results. In the last few years I've started to note the non-significant results in the Discussion section, and sometimes we publish via a registered report, where you write up and submit in advance how you're going to do the study, and the journal says "OK, we'll accept this paper regardless of how the results turn out". But if you look at my list of publications as a whole, that first-year student is not going to think that I'm very sincere when I say that non-significant results are just as important as significant ones. Young researchers come into a world that looks very different to what you just described, and they learn very quickly that the norm is, "significance means publishable".

AH: In political science we have lots of studies with null results. We might discover that it wouldn't make much difference if you made some proposed change to the voting system, and that's interesting. Maybe it's different if you're doing an experiment, because you're changing something and you want that change to work. But even there, the fact that your manipulation doesn't work is also interesting. Policymakers want to know that.

DL: Yes, but only if the question you were asking is an interesting one. When I look back at some of my earlier studies, I think that we weren't asking very interesting questions. They were fun because they were counterintuitive, but there was no major theory or potential application. If those kinds of effects turn out not to exist, there's no point in reporting that, whereas we care about what might or might not happen if we change the voting system.

AH: So for example, the idea that if people are holding a heavier object they answer questions more seriously: if that turns out not to be true, you don't think that's interesting?

DL: Right. I mean, if we had some sort of situation in society whereby we knew that some people were holding heavy or light things while filling in important documents, then we might be thinking about whether that changes anything. But that's not really the case here, although there are lots of real problems that we could be addressing.

   Another thing I've been working on lately is teaching people how to interpret null effects. There are statistical tools for this ---

AH: It's really difficult.

DL: No, it's really easy! The tools are hardly any more difficult than what we teach in first-year statistics, but again, they are hardly ever taught, which also contributes to the problem of people not knowing what to do with null results.
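
[Nick: The tools Daniël is referring to here include equivalence tests, such as the "two one-sided tests" (TOST) procedure that he has written about elsewhere. As a rough illustration only, here is a minimal Python sketch of a TOST for two independent groups; the simulated data and the equivalence bounds of plus or minus 0.5 raw units are invented for the example.]

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    treatment = rng.normal(loc=5.05, scale=1.0, size=60)  # hypothetical data
    control = rng.normal(loc=5.00, scale=1.0, size=60)    # hypothetical data

    # Equivalence bounds: the smallest raw difference we would still care about.
    # The value of +/- 0.5 is an assumption for this example, not a recommendation.
    low, high = -0.5, 0.5

    # Two one-sided tests (TOST): is the observed difference reliably above the
    # lower bound AND reliably below the upper bound?
    p_above_low = stats.ttest_ind(treatment - low, control, alternative='greater').pvalue
    p_below_high = stats.ttest_ind(treatment - high, control, alternative='less').pvalue
    p_tost = max(p_above_low, p_below_high)

    diff = treatment.mean() - control.mean()
    print(f"observed difference = {diff:.3f}, TOST p = {p_tost:.4f}")
    # A small TOST p-value supports the conclusion that any effect is too small
    # to matter, which says much more than an ordinary non-significant t-test.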

(That's the end of this transcript, at around the 30-minute mark on the recording. If you want to understand the rest of the podcast, it turns out that Dutch is actually quite an easy language to learn.)