Stephanie Lee of BuzzFeed has just published another excellent article about the tribulations of the Cornell Food and Brand Lab. This time, her focus is on the p-hacking, HARKing, and other "questionable research practices" (QRPs) that seem to have been standard in this lab for many years, as revealed in a bunch of e-mails that she obtained via Freedom of Information (FoI) requests. In a way, this brings the story back to the beginning.
It was a bit more than a year ago when Dr. Brian Wansink wrote a blog post (since deleted, hence the archived copy) that attracted some negative attention, partly because of what some people saw as poor treatment of graduate students, but more (in terms of the weight of comments, anyway) because it described what appeared to be some fairly terrible ways of doing research (sample: 'Every day she came back with puzzling new results, and every day we would scratch our heads, ask "Why," and come up with another way to reanalyze the data with yet another set of plausible hypotheses'). It seemed pretty clear that researcher degrees of freedom were a big part of the business model of this lab. Dr. Wansink claimed not to have heard of p-hacking before the comments started appearing on his blog post; I have no trouble believing this, because news travels slowly outside the bubble of Open Science Twitter. (Some advocates of better scientific practices in psychology have recently claimed that major improvements are now underway. All I can say is, they can't be reviewing the same manuscripts that I'm reviewing.)
However, things rapidly became a lot stranger. When Tim, Jordan, and I re-analyzed some of the articles that were mentioned in the blog post, we discovered that many of the reported numbers were simply impossible, which is not a result you'd expect from the kind of "ordinary" QRPs that are common in psychology. If you decide to exclude some outliers, or create subgroups based on what you find in your data, your ANOVA still ought to give you a valid test statistic and your means ought to be compatible with the sample sizes.
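To make the point about means and sample sizes concrete: if responses are whole numbers (say, items on a 1-to-7 scale), then the sum of n responses is also a whole number, so only certain means are possible for a given n. What follows is a minimal sketch of that kind of granularity check (often called a GRIM-style check). The function name `grim_consistent`, its parameters, and the example numbers are illustrative assumptions, not taken from any of the articles discussed, and the sketch assumes integer-valued responses and a mean reported to a fixed number of decimal places.

```python
def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if reported_mean could be the rounded mean of n integers.

    Assumption (hypothetical setup): every individual response is a whole
    number, and the mean was rounded to `decimals` places before reporting.
    """
    # A mean rounded to `decimals` places can differ from the exact mean by
    # at most half a unit in the last reported place. Using a tolerance
    # (rather than round()) accepts either convention for rounding halves.
    half_unit = 0.5 * 10 ** -decimals

    # The sum of n integers is an integer, so the only candidate totals are
    # the whole numbers closest to reported_mean * n.
    nearest_total = round(reported_mean * n)
    for total in (nearest_total - 1, nearest_total, nearest_total + 1):
        if abs(total / n - reported_mean) <= half_unit + 1e-9:
            return True
    return False


if __name__ == "__main__":
    # Illustrative numbers only, not values from any published article.
    for mean, n in [(3.45, 20), (3.44, 20), (5.18, 28), (5.19, 28)]:
        print(mean, n, grim_consistent(mean, n))
```

For example, `grim_consistent(5.19, 28)` returns `False`, because no whole-number total divided by 28 rounds to 5.19, whereas `grim_consistent(5.18, 28)` returns `True` (145 / 28 is approximately 5.179). Excluding outliers or slicing the data into subgroups changes n, but it cannot make an impossible mean possible.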
Then we found recycled text and tables of results, and strangely consistent numbers of responses to multiple surveys, and results that correlated .97 across studies with different populations, and large numbers of female WW2 combat veterans, and references that went round in circles, and unlikely patterns of responses. It seemed that nobody in the lab could even remember how old their participants were. Clearly, this lab's output—going back 20 or more years, to a time before Dr. Wansink joined Cornell—was a huge mess.
Amidst all that weirdness, it was possible to lose sight of the fact that what got everything started was the attention drawn to the lab by that initial blog post from November 2016, at which point most of us thought that the worst we were dealing with was rampant p-hacking. Since then, various people have offered opinions on what might be going on in the lab; one of the most popular explanations has been, if I can paraphrase, "total cluelessness". On this account, the head of the lab is so busy (perhaps at least partly due to his schedule of media appearances, testifying before Congress, and corporate consulting*), and the management of the place so overwhelmed on a day-to-day basis, that nobody knows what is being submitted to journals, which table to include in which manuscript, or which folder on the shared drive contains the datasets. You could almost feel sorry for them.
Stephanie's latest article changes that, at least for me. The e-mail exchanges that she cites and discusses seem to show deliberate and considered discussion about what to include and what to leave out, why it's important to "tweek" [sic] results to get a p value down to .05, which sets of variables to combine in search of moderators, and which types of message will appeal to the editors (and readers) of various journals. Far from being chaotic, it all seems to be rather well planned to me; in fact, it gives just the impression Dr. Wansink presumably wanted to give in his blog post that led us down this rabbit hole in the first place. When Brian Nosek, one of the most diplomatic people in science, is prepared to say that something looks like research misconduct, it's hard to imply that you're just in an argument with over-critical data thugs.
It's been just over eight hours since the BuzzFeed article appeared, on a Sunday evening in North America. (This post was half-drafted, since I had an idea of what Stephanie was going to write about in her piece, having been interviewed for it. I was just about to go to sleep when my phone buzzed to let me know that the article had gone live. I will try to forgive my fellow data thug for scooping me to get the first blog about it online.) The initial social media response has been almost uniformly one of anger. If there is a split (and it would seem to be mostly implicit for the moment), it's between those who think that the Cornell Food and Brand Lab is somehow exceptional, and those who think that it's just a particularly egregious example of what goes on all the time in many psychology labs. If you're reading this on the first day I posted it, you might still be able to cast your vote about this. Sanjay Srivastava, who made that poll, also blogged a while back about a 2016 article by anthropologist David Peterson that described rather similar practices in three (unnamed) developmental psychology labs. The Peterson article is well worth reading; I suspected at the time, and I suspect even more strongly today, that what he describes goes on in a lot of places, although maybe the PIs in charge are smart enough not to put their p-hacking directives in e-mails (or, perhaps, all of the researchers involved work at places whose e-mails can't be demanded under FoI, which doesn't extend to private universities; as far as I know, Stephanie Lee obtained all of her information from places other than Cornell).
Maybe this anger can be turned into something good. Perhaps we will see a social media-based movement, inspired by some of the events of the past year, for people to reveal some of the bad methodological stuff their PIs expect them to do. I won't go into any details here, partly because the other causes I'm thinking about are arguably more important than social science research and I don't want to appear to be hitching a ride on their bandwagon by proposing hashtags (although I wonder how many people who thought that they would lose weight by decanting their breakfast cereal into small bags are about to receive a diagnosis of type II diabetes mellitus that could have been prevented if they had actually changed their dietary habits), and partly because as someone who doesn't work in a lab, it's a lot easier for me to talk about this stuff than it is for people with insecure employment that depends on keeping a p-hacking boss happy.
Back to Cornell: we've come full circle. But maybe we're just starting on the second lap. Because, as I noted earlier, all the p-hacking, HARKing, and other stuff that renders p values meaningless still can't explain the impossible numbers, duplicated tables, and other stuff that makes this story rather different from what, I suspect, might (apart, perhaps, from the scale at which these QRPs are being applied) be "business as usual" in a lot of places. Why go to all the trouble of combining variables until a significant moderator shows up in SPSS or Stata, and then report means and test statistics that can't possibly have been output by those programs? That part still makes no sense to me. Nor does Dr. Wansink's claim that he and all his colleagues "didn't remember" when he wrote the correction to the "Elmo" article in the summer of 2017 that the study was conducted on daycare kids, when in February of that year he referred to daycare explicitly (and there are several other clues, some of which I've documented over the past year in assorted posts). And people with better memories than me have noted that the "complete" releases of data that we've been given appear not to be as complete as they might be. We are still owed another round of explanations, and I hope that, among what will probably be a wave of demands for more improvements in research practices, we can still find time to get to the bottom of what exactly happened here, because I don't think that an explanation based entirely on "traditional" QRPs is going to cover it.
* That link is to a Google cache from 2018-02-19, because for some reason, the web page for McDonald's Global Advisory Council gives a 404 error as I'm writing this. I have no idea whether that has anything to do with current developments, or if it's just a coincidence.
"Why go to all the trouble of combining variables until a significant moderator shows up in SPSS or Stata, and then report means and test statistic that can't possibly have been output by those programs? That part still makes no sense to me."
It makes sense to me on a psychological level. I think Wansink just doesn't like data. He sees it as an obstacle to be overcome, a "stone to squeeze blood out of" in his words. Hence the p-hacking, but it also explains the lack of concern over accurately reporting about old datasets. Once you've squeezed all the blood out, who cares about the stone?