14 January 2017

In which science actually self-corrects, for once

Amid all the stories of bad behaviour by researchers confronted with demonstrations of errors and other problems with their work --- I'm sure many readers have their own favourite examples of this --- I thought I'd start the year off with a story of somebody doing the right thing.

You may be familiar with our (that's James Heathers and me) GRIM article, in which we demonstrated a technique for detecting certain kinds of reporting errors in journal articles, and showed that there are a lot of errors out there.  The preprint was even picked up by The Economist.  GRIM has caused a very small stir in skeptical science circles (although nothing compared to Michèle Nuijten's statcheck and Chris Hartgerink's subsequent bulk deployment of it with reporting on PubPeer, a project that has been immortalised under the name of PubCrawler).  Some people have started using the GRIM technique to check on manuscripts that they are reviewing, or to look at older published articles.  Even the classic 1959 empirical demonstration of cognitive dissonance by Festinger and Carlsmith succumbed.
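For readers who haven't seen the GRIM article, the basic idea is simple: when the underlying data are integers (such as Likert-type responses or counts), a mean computed over N of them can only be a multiple of 1/N, so some reported means are arithmetically impossible for the stated sample size.  Here is a minimal sketch of the check in Python; the function name, the rounding handling, and the example numbers are mine, purely for illustration, and not taken from our article or from anyone's published data.

    def grim_consistent(reported_mean, n, decimals=2):
        # With integer responses, the underlying sum must be an integer,
        # so the true mean can only be k/n for some integer k.  Rebuild the
        # nearest candidate sum and see whether it rounds back to the
        # reported mean at the stated precision.  (This sketch ignores the
        # subtlety of "round half up" versus "round half to even".)
        candidate_sum = round(reported_mean * n)
        reconstructed_mean = candidate_sum / n
        return round(reconstructed_mean, decimals) == round(reported_mean, decimals)

    # Hypothetical example: a mean of 2.47 cannot come from 20 integer
    # responses, because 49/20 = 2.45 and 50/20 = 2.50.
    print(grim_consistent(2.47, 20))   # False
    print(grim_consistent(2.45, 20))   # True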

Round about the time that we were finalising the GRIM article for publication, I came across Mueller and Dweck's 1998 article [PDF] in JPSP, entitled "Praise for Intelligence Can Undermine Children's Motivation and Performance".  I'm quite skeptical of the whole "mindset" area, for a variety of reasons that don't matter here, but I was especially interested in this article because of the tables of results on page 38, where there are no fewer than 50 means and standard deviations, all with sample sizes small enough to permit GRIM testing.

This looked like a goldmine.  Unlike statcheck, GRIM cannot be automated (given the current state of artificial intelligence), so running one or two checks typically requires reading and understanding the Method section of an article, then extracting the sample sizes and conditions from the description ("Fifty-nine participants were recruited, but three did not complete all measures and were excluded from the analyses" is often what you get instead of "N=56"; if anyone reading this works in an AI lab, I'd be interested to know if you have software that can understand that), and then matching those numbers to the reported means in the Results section.  So the opportunity to GRIM-check 50 numbers for the price of reading one article looked like good value for my time.

So I did the GRIM checks, taking into account that some of the measures reported by Mueller and Dweck had two items, which effectively doubles the sample size, and found... 17 inconsistencies in the means, out of 50.  Wow.  I rechecked - still 17.  And a couple of the standard deviations didn't seem to be possible, either.  (I have some code to do some basic SD consistency checks, but the real expert here is Jordan Anaya, aka OmnesRes, who has taken the idea of GRIM and done some smart things with it).
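In case it's not obvious how the two-item adjustment works: a score that is the average of two integer items, taken over N children, can only be a multiple of 1/(2N), so the check simply uses a scaled sample size.  A small extension of the hypothetical sketch above, again with made-up numbers rather than values from the paper:

    def grim_consistent_items(reported_mean, n, n_items, decimals=2):
        # A score averaged over n_items integer items makes the condition
        # mean a multiple of 1/(n * n_items), so we simply scale n.
        return grim_consistent(reported_mean, n * n_items, decimals)

    # Hypothetical two-item measure with 26 children in a condition:
    print(grim_consistent_items(5.29, 26, 2))   # True: 275/52 rounds to 5.29
    print(grim_consistent_items(5.29, 26, 1))   # False if treated as a single item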

What to do?  I got James to have a look, and he found the same problems as me.  We decided to contact Dr. Carol Dweck, the senior and corresponding author on the article.  Would she want to talk to us? Would she even remember what happened back in 1998?

To our slight surprise (given some other recent experiences we have had... more to come on that, but probably not any time soon), Dr. Dweck wrote back to us within 24 hours, saying that she was going to look into the matter.  And in less than four weeks, we had an answer, in the form of a 16-page PDF document from Dr. Dweck and her co-author, Dr. Claudia Mueller, who had brought in Dr. David Yeager to help them.  They had gone through the entire article, line by line, and answered every one of our points.

For several of the inconsistencies that we had raised, there was a conclusive explanation.  In some cases this was due to a degree of unclear or omitted reporting in the article, some of which the reader (i.e., me) ought perhaps to have caught, others not.  (To our amazement, two of the study datasets were still available after all this time, as they are being used by a teacher at Columbia.)  A few other problems had no obvious explanation and were recorded as probable typos or transcription errors, which is a little unsatisfying but perhaps not unreasonable after 18 years.  And in one other case, outside the table with the 17 apparent inconsistencies, I had flagged a mean that was (rather obviously) not wrong at all; getting a long sequence of precise checks right is hard for everybody.

So for once --- actually, perhaps this happens more often than we might think, and the skeptical "literature" also suffers from publication bias? --- the process worked as advertised.  We found some apparent inconsistencies and wrote a polite note to the authors; they investigated and identified all of the problems (and were very gracious about us calling out the non-problems, too).  With Dr. Dweck's consent, I have written this story up as an example of how science can still do things right.  I'm still skeptical about mindset as a construct, but at least I feel confident that the main people researching it are dedicated to doing the most careful reporting of their science that they can.

You can find the full report here (look in Files/Output).

Here's to a collegial, collaborative, self-correcting 2017!
