26 February 2018

The Cornell Food and Brand Lab story goes full circle, possibly scooping up much of social science research on the way, and keeps turning

Stephanie Lee of BuzzFeed has just published another excellent article about the tribulations of the Cornell Food and Brand Lab.  This time, her focus is on the p-hacking, HARKing, and other "questionable research practices" (QRPs) that seem to have been standard in this lab for many years, as revealed in a bunch of e-mails that she obtained via Freedom of Information (FoI) requests.  In a way, this brings the story back to the beginning.

It was a bit more than a year ago when Dr. Brian Wansink wrote a blog post (since deleted, hence the archived copy) that attracted some negative attention, partly because of what some people saw as poor treatment of graduate students, but more (in terms of the weight of comments, anyway) because it described what appeared to be some fairly terrible ways of doing research (sample: 'Every day she came back with puzzling new results, and every day we would scratch our heads, ask "Why," and come up with another way to reanalyze the data with yet another set of plausible hypotheses'). It seemed pretty clear that researcher degrees of freedom were a big part of the business model of this lab. Dr. Wansink claimed not to have heard of p-hacking before the comments started appearing on his blog post; I have no trouble believing this, because news travels slowly outside the bubble of Open Science Twitter.  (Some advocates of better scientific practices in psychology have recently claimed that major improvements are now underway. All I can say is, they can't be reviewing the same manuscripts that I'm reviewing.)

However, things rapidly became a lot stranger.  When Tim, Jordan, and I re-analyzed some of the articles that were mentioned in the blog post, we discovered that many of the reported numbers were simply impossible, which is not a result you'd expect from the kind of "ordinary" QRPs that are common in psychology.  If you decide to exclude some outliers, or create subgroups based on what you find in your data, your ANOVA still ought to give you a valid test statistic and your means ought to be compatible with the sample sizes.
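
(For readers who haven't come across this sort of check: here is a minimal sketch, in Python, of one kind of granularity test, the GRIM test, which asks whether a mean reported to two decimal places could have been produced by the stated number of whole-number responses. The numbers in the example are invented purely for illustration.)

    import math

    def grim_consistent(reported_mean, n, decimals=2):
        """Could a mean reported to `decimals` places come from n integer responses?"""
        target = round(reported_mean, decimals)
        for total in (math.floor(reported_mean * n), math.ceil(reported_mean * n)):
            if round(total / n, decimals) == target:
                return True
        return False

    # Invented example: a mean of 3.41 from 18 whole-number answers is impossible,
    # because the nearest achievable means are 61/18 = 3.39 and 62/18 = 3.44.
    print(grim_consistent(3.41, 18))  # False
    print(grim_consistent(3.44, 18))  # True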

Then we found recycled text and tables of results, and strangely consistent numbers of responses to multiple surveys, and results that correlated .97 across studies with different populations, and large numbers of female WW2 combat veterans, and references that went round in circles, and unlikely patterns of responses. It seemed that nobody in the lab could even remember how old their participants were.  Clearly, this lab's output, going back 20 or more years to a time before Dr. Wansink joined Cornell, was a huge mess.

Amidst all that weirdness, it was possible to lose sight of the fact that what got everything started was the attention drawn to the lab by that initial blog post from November 2016, at which point most of us thought that the worst we were dealing with was rampant p-hacking.  Since then, various people have offered opinions on what might be going on in the lab; one of the most popular explanations has been, if I can paraphrase, "total cluelessness".  On this account, the head of the lab is so busy (perhaps at least partly due to his packed schedule of media appearances, testimony before Congress, and corporate consulting*), and the day-to-day management of the place so overwhelmed, that nobody knows what is being submitted to journals, which table to include in which manuscript, or which folder on the shared drive contains the datasets.  You could almost feel sorry for them.

Stephanie's latest article changes that, at least for me.  The e-mail exchanges that she cites and discusses seem to show deliberate and considered discussion about what to include and what to leave out, why it's important to "tweek" [sic] results to get a p value down to .05, which sets of variables to combine in search of moderators, and which types of message will appeal to the editors (and readers) of various journals.  Far from being chaotic, it all seems to be rather well planned to me; in fact, it gives just the impression that Dr. Wansink presumably wanted to give in the blog post that led us down this rabbit hole in the first place. When Brian Nosek, one of the most diplomatic people in science, is prepared to say that something looks like research misconduct, it's hard to maintain that you're just in an argument with over-critical data thugs.

It's been just over eight hours since the BuzzFeed article appeared, on a Sunday evening in North America.  (This post was half-drafted, since I had an idea of what Stephanie was going to write about in her piece, having been interviewed for it.  I was just about to go to sleep when my phone buzzed to let me know that the article had gone live. I will try to forgive my fellow data thug for scooping me to get the first blog post about it online.) The initial social media response has been almost uniformly one of anger.  If there is a split (and it would seem to be mostly implicit for the moment), it's between those who think that the Cornell Food and Brand Lab is somehow exceptional, and those who think that it's just a particularly egregious example of what goes on all the time in many psychology labs. If you're reading this on the first day I posted it, you might still be able to cast your vote about this.  Sanjay Srivastava, who made that poll, also blogged a while back about a 2016 article by anthropologist David Peterson that described rather similar practices in three (unnamed) developmental psychology labs. The Peterson article is well worth reading; I suspected at the time, and I suspect even more strongly today, that what he describes goes on in a lot of places, although maybe the PIs in charge are smart enough not to put their p-hacking directives in e-mails (or, perhaps, all of the researchers involved work at places whose e-mails can't be demanded under FoI, which doesn't extend to private universities; as far as I know, Stephanie Lee obtained all of her information from places other than Cornell).

Maybe this anger can be turned into something good.  Perhaps we will see a social media-based movement, inspired by some of the events of the past year, for people to reveal some of the bad methodological stuff their PIs expect them to do. I won't go into any details here, partly because the other causes I'm thinking about are arguably more important than social science research and I don't want to appear to be hitching a ride on their bandwagon by proposing hashtags (although I wonder how many people who thought that they would lose weight by decanting their breakfast cereal into small bags are about to receive a diagnosis of type II diabetes mellitus that could have been prevented if they had actually changed their dietary habits), and partly because as someone who doesn't work in a lab, it's a lot easier for me to talk about this stuff than it is for people with insecure employment that depends on keeping a p-hacking boss happy.

Back to Cornell: we've come full circle.  But maybe we're just starting on the second lap.  Because, as I noted earlier, all the p-hacking, HARKing, and other stuff that renders p values meaningless still can't explain the impossible numbers, duplicated tables, and other anomalies that make this story rather different from what, I suspect, might (apart, perhaps, from the scale at which these QRPs are being applied) be "business as usual" in a lot of places. Why go to all the trouble of combining variables until a significant moderator shows up in SPSS or Stata, and then report means and test statistics that can't possibly have been output by those programs?  That part still makes no sense to me.  Nor does Dr. Wansink's claim that, when he wrote the correction to the "Elmo" article in the summer of 2017, he and all his colleagues "didn't remember" that the study was conducted on daycare kids, when in February of that year he had referred to daycare explicitly (and there are several other clues, some of which I've documented over the past year in assorted posts). And people with better memories than me have noted that the "complete" releases of data that we've been given appear not to be as complete as they might be.  We are still owed another round of explanations, and I hope that, among what will probably be a wave of demands for more improvements in research practices, we can still find time to get to the bottom of what exactly happened here, because I don't think that an explanation based entirely on "traditional" QRPs is going to cover it.



* That link is to a Google cache from 2018-02-19, because for some reason, the web page for McDonald's Global Advisory Council gives a 404 error as I'm writing this. I have no idea whether that has anything to do with current developments, or if it's just a coincidence.

06 February 2018

The latest Cornell Food and Brand Lab correction: Some inconsistencies and strange data patterns

[Update 2018-05-12 20:40 UTC: The study discussed below has now been retracted. ]

The Cornell Food and Brand Lab has a new correction. Tim van der Zee already tweeted a bit about it.

"Extremely odd that it isn't a retraction"? Let's take a closer look.

Here is the article that was corrected:
Wansink, B., Just, D. R., Payne, C. R., & Klinger, M. Z. (2012). Attractive names sustain increased vegetable intake in schools. Preventive Medicine, 55, 330–332. http://dx.doi.org/10.1016/j.ypmed.2012.07.012

This is the second article from this lab in which data were reported as having been collected from elementary school children aged 8–11, but it turned out that they were in fact collected from children aged 3–5 in daycares.  You can read the lab's explanation for this error at the link to the correction above (there's no paywall at present), and decide how convincing you find it.

Just as a reminder, the first article, published in JAMA Pediatrics, was initially corrected (via JAMA's "Retract and replace" mechanism) in September 2017. Then, after it emerged that the children were in fact in daycare, and that there were a number of other problems in the dataset that I blogged about, the article was definitively retracted in October 2017.

I'm going to concentrate on Study 1 of the recently-corrected article here, because the corrected errors in this study are more egregious than those in Study 2, and also because there are still some very substantial problems remaining.  If you have access to SPSS, I also encourage you to download the dataset for Study 1, along with the replication syntax and annotated output file, from here.

By the way, in what follows, you will see a lot of discussion about the amount of "carrots" eaten.  There has been some discussion about this, because the original article just discussed "carrots" with no qualification. The corrected article tells us that the carrots were "matchstick carrots", which are about 1/4 the size of a baby carrot. Presumably there is a U.S. Standard Baby Carrot kept in a science museum somewhere for calibration purposes.

So, what are the differences between the original article and the correction? Well, there are quite a few. For one thing, the numbers in Table 1 now finally make sense, in that the number of carrots considered to have been "eaten" is now equal to the number of carrots "taken" (i.e., served to the children) minus the number of carrots "uneaten" (i.e., counted when their plates came back after lunch).  In the original article, these numbers did not add up; that is, "taken" minus "uneaten" did not equal "eaten".  This is important because, when asked by Alison McCook of Retraction Watch why this was the case, Dr. Brian Wansink (the head of the Cornell Food and Brand Lab) implied that it must have been due to some carrots being lost (e.g., dropped on the floor, or thrown in food fights). But this makes no sense for two reasons. First, in the original article, the number of carrots "eaten" was larger than the difference between "taken" and "uneaten", which would imply that, rather than some carrots being dropped on the floor or thrown, extra carrots had appeared from somewhere.  Second, and more fundamentally, the definition of the number of carrots eaten is (the number taken) minus (the number left uneaten).  Whether the kids ate, threw, dropped, or made sculptures out of the carrots doesn't matter; any that didn't come back were classed as "eaten". There was no monitoring of each child's oesophagus to count the carrots slipping down.

When we look in the dataset, we can see that there are separate variables for "taken" (e.g., "@1CarTaken" for Monday, "@2CarTaken" for Tuesday, etc), "uneaten" (e.g., "@1CarEnd", where "End" presumably corresponds to "left at the end"), and "eaten" (e.g., "@1CarEaten").  In almost all cases, the formula ("eaten" equals "taken" minus "uneaten") holds, except for a few missing values and two participants (#42 and #152) whose numbers for Monday seem to have been entered in the wrong order; for both of these participants, "eaten" equals "taken" plus "uneaten". That's slightly concerning because it suggests that, instead of just entering "taken" and "uneaten" (the quantities that were capable of being measured) and letting their computer calculate "eaten", the researchers calculated "eaten" by hand and typed in all three numbers, doing so in the wrong order for these two participants in the process.
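
If you want to reproduce this check yourself, something like the following Python/pandas sketch should do it (the filename is hypothetical, and I'm assuming that the SPSS variable names survive the import unchanged):

    import pandas as pd  # reading the .sav file also requires the pyreadstat package

    df = pd.read_spss("study1.sav")  # hypothetical filename

    monday = df[["@1CarTaken", "@1CarEnd", "@1CarEaten"]].dropna()

    # Rows where "eaten" is not "taken" minus "uneaten"...
    bad = monday[monday["@1CarTaken"] - monday["@1CarEnd"] != monday["@1CarEaten"]]

    # ...and, among those, rows consistent with the values having been entered
    # in the wrong order ("eaten" equals "taken" plus "uneaten").
    swapped = bad[bad["@1CarTaken"] + bad["@1CarEnd"] == bad["@1CarEaten"]]

    print(bad)
    print(swapped)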

Another major change is that whereas the original article reported the study as being run on three days, the correction reports data from four days.  In the original, Monday was a control day, the between-subjects manipulation of the carrot labels was done on Tuesday, and Thursday was a second control day, to see if the effect persisted. In the correction, Thursday is now a second experimental day, with a different experiment that carries over to Friday: the labels were manipulated on Thursday ("X-ray Vision Carrots" versus "Food of the Day"; there was no "no label" condition), but the dependent variable was the number of carrots eaten on the next day (Friday), rather than on Thursday itself.

OK, so those are the differences between the two articles. But arguably the most interesting discoveries are in the dataset, so let's look at that next.

Randomisation #fail


As Tim van der Zee noted in the Twitter thread that I linked to at the top of this post, the number of participants in Study 1 in the corrected article has mysteriously increased since the original publication. Specifically, the number of children in the "Food of the Day" condition has gone from 38 to 48, an increase of 10, and the number of children in the "no label" condition has gone from 45 to 64, an increase of 19.  You might already be thinking that a randomisation process that leaves only 22.2% (32 of 144) of participants in the experimental condition might not be an especially felicitous one, but as we will see shortly, that is by no means the largest problem here.  (The original article does not actually discuss randomisation, and the corrected version only mentions it in the context of the choice of two labels in the part of the experiment that was conducted on the Thursday, but I think it's reasonable to assume that children were meant to be randomised to one of the carrot labelling conditions on the Tuesday.)
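
Taking that assumption at face value (simple 1-in-3 allocation to each of the three conditions), a back-of-the-envelope calculation suggests that an imbalance this extreme is unlikely to arise by chance:

    from scipy.stats import binom

    # Chance of 32 or fewer of 144 children landing in a given condition,
    # under simple 1-in-3 allocation to each of the three conditions.
    print(binom.cdf(32, 144, 1/3))  # a small tail probability, well under 1%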

The participants were split across seven daycare centres and/or school facilities (I'll just go with the authors' term "schools" from now on).  Here is the split of children per condition and per school:


Oh dear. It looks like the randomisation didn't so much fail here, as not take place at all, in almost all of the schools.

Only two schools (#1 and #4) had a non-zero number of children in each of the three conditions. Three schools had zero children in the experimental condition. Schools #3, #5, #6, and #7 only had children in one of the three conditions. The justification for the authors' model in the corrected version of the article ("a Generalized Estimated Equation model using a negative binominal distribution and log link method with the location variable as a repeated factor"), versus the simple ANOVA that they performed in the original, was to be able to take into account the possible effect of the school. But I'm not sure that any amount of correction for the effect of the school is going to help you when the data are as unbalanced as this.  It seems quite likely that the teachers or researchers in most of the schools were not following the protocol very carefully.
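
For what it's worth, here is roughly what a model of the kind described in the correction looks like when fitted in Python's statsmodels rather than SPSS; the column names are invented, so this is only a sketch of the general approach, not a reconstruction of whatever the authors actually ran:

    import pandas as pd
    import statsmodels.api as sm
    import statsmodels.formula.api as smf

    df = pd.read_spss("study1.sav")  # hypothetical filename, as in the earlier sketch

    # A GEE with a negative binomial distribution (log link is the default in
    # statsmodels) and the school as the clustering ("repeated") factor.
    # "carrots_eaten", "condition", and "school" are invented column names.
    model = smf.gee(
        "carrots_eaten ~ C(condition)",
        groups="school",
        data=df,
        family=sm.families.NegativeBinomial(),
        cov_struct=sm.cov_struct.Exchangeable(),
    )
    print(model.fit().summary())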

At school #1, thou shalt eat carrots


Something very strange must have been happening in school #1.  Here is the table of the numbers of children taking each number of carrots in schools #2-#7 combined:

I think that's pretty much what one might expect.  About a quarter of the kids took no carrots at all, most of the rest took a few, and there were a couple of major carrot fans.  Now let's look at the distribution from school #1:


Whoa, that's very different. No child in school #1 had a lunch plate with zero carrots. In fact, all of the children took a minimum of 10 carrots, which is more than 44 (41.1%) of the 107 children in the other schools took.  Even more curiously, almost all of the children in school #1 apparently took an exact multiple of 10 carrots - either 10 or 20. And if we break these numbers down by condition, it gets even stranger:

So 17 out of 21 children in the control condition ("no label", which in the case of daycare children who are not expected to be able to read labels anyway presumably means "no teacher describing the carrots") in school #1 chose exactly 10 carrots. Meanwhile, every single child (12 out of 12) in the "Food of the Day" condition selected exactly 20 carrots.

I don't think it's necessary to run any statistical tests here to see that there is no way that this happened by chance. Maybe the teachers were trying extra hard to help the researchers get the numbers they wanted by encouraging the children to take more carrots than they otherwise would (remember, from schools #2-#7, we could expect a quarter of the kids to take zero carrots). But then, did they count out these matchstick carrots individually, 1, 2, 3, up to 10 or 20? Or did they serve one or two spoonfuls and think, screw it, I can't be bothered to count them, let's call it 10 per spoon?  Participants #59 (10 carrots), #64 (10), #70 (22), and #71 (10) have the comment "pre-served" recorded in their data for this day; does this mean that for these children (and perhaps others with no comment recorded), the teachers chose how many carrots to give them, thus making a mockery of the idea that the experiment was trying to determine how the labelling would affect the kids' choices?  (I presume it's just a coincidence that the number of kids with 20 carrots in the "Food of the Day" condition, and the number with 10 carrots in the "no label" condition, are very similar to the number of extra kids in these respective conditions between the original and corrected versions of the article.)
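
If you have the dataset open and want to see this for yourself, the crosstab takes a couple of lines; I'm assuming variables called "school" and "condition" here, and that the tables above refer to the Tuesday serving variable ("@2CarTaken" in the dataset's naming scheme):

    import pandas as pd

    df = pd.read_spss("study1.sav")  # hypothetical filename, as before

    # Matchstick carrots taken on the Tuesday, tabulated by labelling condition,
    # for school #1 only; "school" and "condition" are invented variable names.
    school1 = df[df["school"] == 1]
    print(pd.crosstab(school1["condition"], school1["@2CarTaken"]))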

The tomatoes... and the USDA project report


Another interesting thing to emerge from an examination of the dataset is that not one but two foods, with and without "cool names", were tested during the study.  As well as "X-ray Vision Carrots", children were also offered tomatoes. On at least one day, these were described as "Tomato Blasts". The dataset contains variables for each day recording what appears to be the order in which each child was served with the tomatoes or carrots.  Yet, there are no variables recording how many tomatoes each child took, ate, or left uneaten on each day. This is interesting, because we know that these quantities were measured. How? Because it's described in this project report by the Cornell Food and Brand Lab on the USDA website:

"... once exposed to the x-ray vision carrots kids ate more of the carrots even when labeled food of the day. No such strong relationship was observed for tomatoes, which could mean that the label used (tomato blasts) might not be particularly meaningful for children in this age group."

This appears to mean that the authors tested two dependent variables, but only reported the one that gave a statistically significant result. Does that sound like readers of the Preventive Medicine article (either the original or the corrected version) are being provided with an accurate representation of the research record? What other variables might have been removed from the dataset?

It's also worth noting that the USDA project report that I linked to above states explicitly that both the carrots-and-tomatoes study and the "Elmo"/stickers-on-apples study (later retracted by JAMA Pediatrics) were conducted in daycare facilities, with children aged 3–5.  It appears that the Food and Brand Lab probably sent that report to the USDA in 2009. So how was it that by March 2012 (the date on this draft version of the original "carrots" article) everybody involved in writing "Attractive Names Sustain Increased Vegetable Intake in Schools" had apparently forgotten about it, and was happy to report that the participants were elementary school students?  And yet, when Dr. Wansink cited the JAMA Pediatrics article in 2013 and 2015, he referred to the participants as "daycare kids" and "daycare children", respectively; so his incorrect citation of his own work actually turns out to have been a correct statement of what had happened.  And in the original version of that same "Elmo" article, published in 2012, the authors referred to the children (who were meant to be aged 8–11) as "preliterate". So even if everyone had forgotten about the ages of the participants at a conscious level, this knowledge seems to have been floating around subliminally. This sounds like a very interesting case study for psychologists.

Another interesting thing about the March 2012 draft that I mentioned in the previous paragraph is that it describes data being collected on four days (i.e., the same number of days as in the corrected article), rather than the three days reported in the original published version, which appeared just four months after the date of the draft:


Extract from the March 2012 draft manuscript, showing the description of the data collection period, with the PDF header information (from File/Properties) superposed.

So apparently at some point between drafting the original article and submitting it, one of the days was dropped, with the second control day being moved up from Friday to Thursday. Again, some people might feel that at least one version of this article might not be an accurate representation of the research record.

Miscellaneous stuff


Some other minor peculiarities in the dataset, for completeness:

- On Tuesday (the day of the experiment, after a "control" day), participants #194, #198, and #206 were recorded as commenting about "cool carrots"; it is unclear whether this was a reference to the name that was given to the carrots on Monday or Tuesday.  But on Monday, a "control" day, the carrots should presumably have had no name, and on Tuesday they should have been described as "X-ray Vision Carrots".

- On Monday and Friday, all of the carrots should have been served with no label. But the dataset records that five participants (#199, #200, #203, #205, and #208) were in the "X-ray Vision Carrots" condition on Monday, and one participant (#12) was in the "Food of the Day" condition on Friday. Similarly, on Thursday, according to the correction, all of the carrots were labelled as "Food of the Day" or "X-ray Vision Carrots". But two of the cases (participants #6 and #70) have the value that corresponds to "no label" here.

These are, again, minor issues, but they shouldn't be happening. In fact there shouldn't even be a variable in the dataset for the labelling condition on Monday and Friday, because those were control-only days.
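
If you wanted to look for cases like these systematically, you could compare each day's recorded labelling condition against what the design says it should be. A minimal sketch follows; the variable names and numeric condition codes are entirely invented for illustration, since I'm not reproducing the dataset's actual codebook here:

    import pandas as pd

    df = pd.read_spss("study1.sav")  # hypothetical filename, as before

    # Invented variable names and codes for the recorded labelling condition.
    NO_LABEL, FOOD_OF_DAY, XRAY = 0, 1, 2

    expected = {
        "cond_monday":   {NO_LABEL},                     # control day
        "cond_tuesday":  {NO_LABEL, FOOD_OF_DAY, XRAY},  # experiment day
        "cond_thursday": {FOOD_OF_DAY, XRAY},            # second experiment day
        "cond_friday":   {NO_LABEL},                     # follow-up day
    }

    for var, allowed in expected.items():
        odd = df[df[var].notna() & ~df[var].isin(allowed)]
        if len(odd):
            print(var, "has unexpected condition codes in rows", list(odd.index))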

Conclusion


What can we take away from this story?  Well, the correction at least makes one thing clear: absolutely nothing about the report of Study 1 in the original published article makes any sense. If the correction is indeed correct, the original article got almost everything wrong: the ages and school status of the participants, the number of days on which the study was run, the number of participants, and the number of outcome measures. We have an explanation of sorts for the first of these problems, but not the others.  I find it very hard to imagine how the authors managed to get so much about Study 1 wrong the first time they wrote it up. The data for the four days and the different conditions are all clearly present in the dataset.  Getting the number of days wrong, and incorrectly describing the nature of the experiment that was run on Thursday, is not something that can be explained by a simple typo when copying the numbers from SPSS into a Word document (especially since, as I noted above, the draft version of the original article mentions four days of data collection).

In summary: I don't know what happened here, and I guess we may never know. What I am certain of is that the data in Study 1 of this article, corrected or not, cannot be the basis of any sort of scientific conclusion about whether changing the labels on vegetables makes children want to eat more of them.

I haven't addressed the corrections to Study 2 in the same article, although these would be fairly substantial on their own if they weren't overshadowed by the ongoing dumpster fire of Study 1.  It does seem, however, that the spin that is now being put on the story is that Study 1 was a nice but perhaps "slightly flawed" proof-of-concept, that there is really nothing to see there, and that we should all look at Study 2 instead.  I'm afraid that I find this very unconvincing.  If the authors have real confidence in their results, I think they should retract the article and resubmit Study 2 for review on its own. It would be sad for Matthew Z. Klinger, the then high-school student who apparently did a lot of the grunt work for Study 2, to lose a publication like this, but if he is interested in pursuing an academic career, I think it would be a lot better for him not to have his name on the corrected article in its present form.