Back in February 2017, I wrote this post about an article in JAMA Pediatrics from the Cornell Food and Brand Lab entitled "Can Branding Improve School Lunches?". That article has now been "retracted and replaced", which is a thing that JAMA does when it considers that an article is too badly flawed to simply issue a correction, but that these flaws were not the result of malpractice. This operation was covered by Retraction Watch here.
Here is the retraction notice (including the replacement article), and here is the supplementary information, which consists of a data file in Excel, some SPSS syntax to analyze it, and a PDF file with the SPSS output. It might help to get the last of these if you want to follow along; you will need to unzip it.
Minor stuff to warm up with
There are already some inconsistencies in the replacement article, notably in the table.
First, note (d) says that the condition code for Elmo-branded cookies was 2, and note (e) says that the condition code for Elmo-branded cookies was 4. Additionally, the caption on the rightmost column of the table says that "Condition 4" was a "branded cookie". (I'll assume that means an "Elmo-branded cookie"; in my previous blog post I noted that "branded" could also mean "branded with the 'unknown character' sticker" that was mentioned in the first article, but since that sticker is now simply called the "unknown sticker", "unfamiliar sticker", or "control sticker", I'll assume that the only branding is Elmo.) As far as I can tell (e.g., from the dataset and syntax), note (d) is correct, and note (e) and the column caption are incorrect.
Second, although the final sample size (complete cases) is reported as 615, the number of exclusions in the table is 420 and the overall sample was reported as 1,040, so at first sight five cases seem to be missing (1,040 - 420 = 620, versus the 615 analysed). On closer inspection, though, we can see that 10 cases were eliminated because they could not be tied to a school, so the calculation becomes 1,030 - 420 = 610, and now we seem to have five cases too many. After trawling through the dataset for an hour or so, I identified five cases that had been excluded for more than one reason, but counted twice in the exclusion tally, which accounts for the difference. (Yes, this was the authors' job, not mine. I occasionally fantasise about one day billing Cornell for all the time I've spent trying to clean up their mess, and I suspect my colleagues feel the same way.)
I'll skip over the fact that, in reporting their only statistically significant results, the authors gave percentages of children choosing an apple that were not the actual percentages observed, but the estimated marginal means from their generalized estimating equations model (which show a more impressive gap between the Elmo-branded apple and the control condition: 13.1 percentage points versus 9.0). If those model-derived numbers were ever cited in a claim about how many more children actually took an apple in the experiment, there would be a problem; for the moment, though, they are just sitting there.
Now, so far I've maybe just been nitpicking about the kind of elementary mistakes that are easy to make (and fail to spot during proofreading) entirely inadvertently when submitting an article to a journal with an impact factor of over 10, and which can be fixed with a simple correction. Isn't there something better to write about?
Well, it's funny you should ask, because yes, there is.
The method doesn't match the data
Recall that the design of the study is such that each school (numbered 1 to 7) runs each condition of the experiment (numbered 1 to 6, with 4 not used) on one day of the week (numbered 1 to 5). All schools are supposed to run condition 5 on day 1 (Monday) and condition 6 on day 5 (Friday), but apart from that, not every school runs, say, condition 2 on Tuesday. However, for each school X, condition Y (and only condition Y) is run on day Z (and only on day Z). If we have 34 choice records from school X for condition Y, these records should all have the same number for the day.
To check this, we can code up the school/condition/day combination as a single number, by multiplying the school number by 100, the condition by 10, and the day by 1, and adding them together. For school 1, this might give us numbers such as 112, 123, 134, 151, and 165. (The last two numbers definitely ought to be 151 and 165, because condition 5, the pre-test, was run on day 1 and condition 6, the post-test, was run on day 5.) For school 2, we might find 214, 222, 233, 251, and 265. There are five possible numbers per school, one per day (or, if you prefer, one per condition). There were seven schools with data included in the analysis, so we would expect to see a maximum of 35 different three-digit school/condition/day numbers in a summary of the data, although there may well be fewer because in some cases a school didn't contribute a valid case on one or more days. (I have made an Excel sheet with these data available here, along with SPSS and CSV versions of the same data, but I encourage you to check my working starting from the data supplied by the authors; detailed instructions can be found in the Appendix below.)
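(If you would rather script this check than build a pivot table by hand, here is a minimal sketch in Python/pandas. It assumes you are starting from the CSV version of my reconstructed dataset mentioned above, with columns named School, Condition, and Day; the file name and column names are placeholders of mine, not anything documented by the authors.)

```python
# A minimal sketch of the school/condition/day consistency check, assuming a
# CSV with one row per analysed case and columns School, Condition, and Day.
# The file name and column names are placeholders; adjust them as needed.
import pandas as pd

df = pd.read_csv("branding_615_cases.csv")  # hypothetical name for the CSV version

# Build the three-digit school/condition/day code described above.
df["scd"] = (pd.to_numeric(df["School"]) * 100
             + pd.to_numeric(df["Condition"]) * 10
             + pd.to_numeric(df["Day"]))

# With one condition per school per day, each school should contribute at
# most five distinct codes; count how often each one occurs.
print(df["scd"].value_counts().sort_index())
```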
Let's have a look at the resulting pivot table showing the frequencies of each school/condition/day number. The column on the left is our three-digit number that combines school, condition, and day, while the column on the right is the number of times it appears. Remember, in the left-hand column, we are expecting up to five numbers starting with 1, five starting with 2, etc.
Oh.
You can see from the Grand Total line that we have accounted for all 615 cases in here. And for schools 5 and 7, we see the pattern we expected. For example, school 7 apparently ran condition 1 on day 3, condition 2 on day 4, condition 3 on day 2, and (as expected) condition 5 on day 1 and condition 6 on day 5. School 6 only had a few cases. (I'm trying to work out if it's meaningful to add them to the model; since we only have cases from one condition, we can have no idea what the effect of the Elmo sticker was in school 6. But that's a minor point for the statisticians.)
School 3 isn't too bad: the counts opposite 324 and 325 suggest that condition 2 was run on day 4 (6 cases) and on day 5 (1 case). Maybe that 325 is just a typo.
However, schools 1, 2, and 4 --- which, between them, account for 73.3% of the cases in the dataset --- are a disaster. Let's start at the top with school 1: 19 cases for condition 1 on day 2, 12 cases for condition 1 on day 3. Conditions 2, 3, and 5 appear to have been run on three different days in that school. Cases from condition 1 appear on four different days in school 4. Here is a simple illustration from the authors' own Excel data file:
Neither the retracted article, nor the replacement, makes any mention of multiple conditions being run on any given day. Had this been an intended feature of the design, presumably the day might have been included as a predictor in the model (although it would probably have been quite strongly correlated with condition). You can check the description of the design in the replacement article, or in the retracted article in both its published and draft forms, which I have made available here. So apparently the authors have either (again) inadvertently failed to describe their method correctly, or they have inadvertently used data that are highly inconsistent with their method as reported in both the retracted article and the replacement.
Perhaps we can salvage something here. By examining the counts of each combination of school/condition/day, I was able to eliminate what appeared to be the spurious multiple-days-per-condition cases while keeping the maximum possible number of cases for each school. In most cases, I was able to make out the "dominant" condition for any given day; for example, for school 1 on day 2, there are cases for 19 participants in condition 1, five in condition 2, nine in condition 3, and two in condition 5, so we can work (speculatively, of course) on the basis that condition 1 is the correct one, and delete the others. After doing this for all the affected schools, 478 cases were left, which represents a loss of 22.3% of the data. Then I re-ran the authors' analyses on these cases (using SPSS, since the authors provided SPSS syntax and I didn't want to risk translating a GEE specification to R):
Following the instructions in the supplement, we can calculate the percentages of children who chose an apple in the control condition (5) and the "Elmo-branded apples" condition (1) by subtracting the estimated marginal means from 100%, giving us an increase from (1.00 - .757) = 24.3% to (1.00 - .694) = 30.6%, or 6.3 percentage points instead of the 13.1 reported in the replacement article. The test statistic shows that this difference is not statistically significant, Wald χ2 = 2.086, p = .149. Of course, standard disclaimers about the relevance or otherwise of p values apply here; however, since the authors themselves seemed to treat a p = .10 result as evidence of no association when they reported that "apple selection was not significantly associated ... in the posttest condition (Wald χ2 = 2.661, P = .10)", I think we can assume that they would not have considered the result here to be evidence that "Elmo-branded apples were associated with an increase in a child’s selection of an apple over a cookie".
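For anyone who wants to attempt something similar outside SPSS, here is a minimal sketch, in Python with statsmodels, of both steps: keeping only the "dominant" condition for each school and day, and then fitting a roughly analogous binomial GEE clustered by school. The file name and column names are placeholders of mine, and the model is an illustration of the general approach rather than a line-for-line translation of the authors' syntax; don't expect it to reproduce their numbers exactly.

```python
# A minimal sketch, not the authors' analysis: keep only the most frequent
# ("dominant") condition for each school/day, then fit a binomial GEE with
# school as the cluster. File and column names are placeholders.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("branding_615_cases.csv")  # hypothetical name for the CSV version

# For each school/day, find the modal condition and discard the other cases.
dominant = (df.groupby(["School", "Day"])["Condition"]
              .transform(lambda s: s.value_counts().idxmax()))
kept = df[df["Condition"] == dominant].copy()
print(len(kept), "cases retained")

# GEE for apple choice (assumed coded 1 = apple, 0 = cookie), with condition 5
# (the control) as the reference category.
model = smf.gee(
    "chose_apple ~ C(Condition, Treatment(reference=5))",
    groups="School",
    data=kept,
    family=sm.families.Binomial(),
    cov_struct=sm.cov_struct.Exchangeable(),
)
result = model.fit()
print(result.summary())

# Model-based proportions per condition, analogous to SPSS's estimated
# marginal means: average the predicted probabilities with every retained
# case counterfactually assigned to each condition in turn.
for cond in sorted(kept["Condition"].unique()):
    p = result.predict(kept.assign(Condition=cond)).mean()
    print(f"condition {cond}: predicted P(apple) = {p:.3f}")
```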
The dataset contains duplicates
There is one more strange thing in this dataset, which appears to provide more evidence that the authors ought to go back to the drawing board because something about their design, data collection methods, or coding practices is seriously messed up. I coded up participants and conditions into a single number by multiplying condition by 1000 and adding the participant ID, giving, for example, 4987 for participant 987 in condition 4. Since each participant should have been exposed to each condition exactly once, we would expect to see each of these numbers only once. Here's the pivot table. The column on the left is our four-digit number that combines condition and participant number, while the column on the right is the number of cases. Remember, we are expecting 1s, and only 1s, for the number of cases in every row:
Oh. Again. (The table is quite long, but below what's shown here, it's all ones in the Total column.)
Eleven participants were apparently exposed to the same condition twice. One participant (number 26) was assigned to condition 1 on three days (3, 4, and 5). Here are those duplicate cases in the authors' original Excel data file:
(Incidentally, it appears that this child took the cookie on the first two days in the "Elmo-branded apple" condition, and then the apple on the last day. Perhaps that's a demonstration of the longitudinal persuasive power of Elmo stickers to get kids to eat fruit.)
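(For completeness, here is an equally minimal sketch of the duplicate check; once again, the file name and column names are placeholders of mine, not the authors'.)

```python
# A minimal sketch of the duplicate check: each condition/participant pair
# should occur exactly once. File and column names are placeholders.
import pandas as pd

df = pd.read_csv("branding_615_cases.csv")  # hypothetical name for the CSV version

# Combine condition and participant ID into the four-digit code described above.
df["cond_id"] = pd.to_numeric(df["Condition"]) * 1000 + pd.to_numeric(df["ParticipantID"])

counts = df["cond_id"].value_counts()
print(counts[counts > 1])  # anything printed here is a repeated exposure
```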
Conclusion: Action required
Not to put too fine a point on it, this dataset appears to be a complete mess (other adjectives are available, but some may not be suitable for the family-friendly blog that this one aspires to be).
What should happen now? This seems like too big a problem to be fixable by a correction. Maybe the authors should retract their "retract and replace" replacement, and replace it with a re-replacement. (I wonder if an article has ever been retracted twice?) This could, of course, go on for ever, with "retract and replace" becoming "rinse and repeat". But perhaps at some point the journal will step in and put this piece of research out of its misery.
Appendix: Step-by-step instructions for reproducing the dataset
As mentioned above, the dataset file provided by the authors has two tabs ("worksheets", in Excel parlance). One contains all of the records that were coded; the other contains only the complete and valid records that were used in the analyses. Of these, only the first contains the day on which the data were collected. So the problem is to generate a dataset with records from the first tab (including "Day") that correspond to the records in the second tab.
To do this (a pandas sketch of the same procedure follows the numbered steps):
1. Open the file ApamJamaBrandingElmo_Dataset_09-21-17.xlsx (from the supplementary information here). Select the "Masterfile" tab (worksheet).
2. Sort by "Condition" in ascending order.
3. Go to row 411 and observe that Condition is equal to 4. Delete rows 411 through 415 (the records with Condition equal to 4).
4. Go to row 671 and observe that Condition is equal to "2,3". Delete all rows from here to the end of the file (including those where Condition is blank).
5. Sort by "Choice" in ascending order.
6. Go to row 2. Delete rows 2 and 3 (the records with "Choice" coded as 0 and 0.5).
7. Go to row 617 and observe that the remaining "Choice" values are 3, 5, 5, "didn't eat a snack", and then a series of blanks. Delete all rows from 617 through the end of the file.
8. You now have the same 615 records as in the second tab of the file, "SPSS-615 cases", with the addition of the "Day" field. You can verify this by sorting both on the same fields and pasting the columns of one alongside the other.
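If you prefer to do this programmatically, the sketch below filters the "Masterfile" tab by value rather than by row position, which is less fragile than counting rows after sorting. The column names come from the steps above, but the assumption that 1 and 2 are the only valid "Choice" codes is mine; verify the row count against the "SPSS-615 cases" tab before relying on the result.

```python
# A sketch of the appendix steps in pandas, filtering by value rather than by
# row position. Assumes the Masterfile tab has columns named Condition and
# Choice; the set of "valid" codes below is an assumption to be checked.
import pandas as pd

xlsx = "ApamJamaBrandingElmo_Dataset_09-21-17.xlsx"
master = pd.read_excel(xlsx, sheet_name="Masterfile")
check = pd.read_excel(xlsx, sheet_name="SPSS-615 cases")

# Steps 3-4: keep only the usable condition codes (drops 4, "2,3", and blanks).
cond = pd.to_numeric(master["Condition"], errors="coerce")
master = master[cond.isin([1, 2, 3, 5, 6])]

# Steps 6-7: drop the stray Choice codes (0, 0.5, 3, 5, text, blanks),
# assuming 1 and 2 are the codes for the two valid snack choices.
choice = pd.to_numeric(master["Choice"], errors="coerce")
master = master[choice.isin([1, 2])]

print(len(master), "rows kept;", len(check), "rows in the authors' analysis tab")
```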
Interesting...
Supposing that the spreadsheet is correct and that, at some schools, the conditions were spread across different days, this would raise the question of how students were assigned to each condition/day regime.
If the assignment were not random, problems could arise. E.g. if the arrangement was "first 10 students in the queue get condition 2 on day 2, the rest get condition 3" there would be potential for self-selection because hungrier students might be at the front of the queue... and so on.
That's just one possible example, of course.
Yes, that's presumably why whoever designed the study decided that each school should run just one condition on any given day. The description in the article (original and replacement) makes sense. So the question is whether the description or the coding is wrong, but either is a pretty big deal in terms of the validity of the results.
Incidentally, as can be seen from the raw data file (with all 208 students included), the student-day relation is perfect. Every student has a record (even if it's incomplete) for day 1, day 2, etc., through day 5. On the other hand, 26 cases (21 where the student took the cookie and 5 where they took the apple) are missing only the condition, which --- if the study really had involved just one condition per day --- ought to have been possible to impute from the day.
What we have here is a failure to communicate...
No, seriously, what we have here is a perfect illustration of why data should be shared after collection. Especially for exploratory research like this (in Wansink parlance, that would be "messy science" I assume) the data should be public right away, at least at the time of peer review.
Of course, we can now also have the whole debate about traditional frequentist statistics and whether you can apply them to exploratory research - but I don't care about that here. I think this sort of research emphasises just how important it is that the data are available for scrutiny.
I can accept that this kind of research is exploratory and that it is messy. But don't pretend that it isn't when you publish it with standard statistics.