18 October 2018

Just another week in real-world science: de Venter et al. (2017)

(Note: This post, as with all my blog posts, represents only my own opinions, and not those of any organizations with which I am affiliated, or anyone who works for those organizations.)

Someone sent me a link to this article and asked what I thought of it.

De Venter, M., Illegems, J., Van Royen, R., Moorkens, G., Sabbe, B. G. C., & Van Den Eede, F. (2017). Differential effects of childhood trauma subtypes on fatigue and physical
functioning in chronic fatigue syndrome. Comprehensive Psychiatry, 78, 76–82. http://dx.doi.org/10.1016/j.comppsych.2017.07.006

The article describes an investigation into possible relations between various negative childhood events (as measured by the Traumatic Experiences Checklist [TEC]) and impaired functioning (fatigue, as measured by the Checklist Individual Strength [CIS] scale, and general health and well-being, as measured by the well-known SF-36 scale). The authors' conclusions, from the abstract, were fairly unequivocal: "... sexual harassment emerged as the most important predictor of fatigue and poor physical functioning in the CFS patients assessed. These findings have to be taken into account [emphasis added] in further clinical research and in the assessment and treatment of individuals coping with chronic fatigue syndrome." In other words, as of the publication of this article, the authors believe that the assessment of past sexual harassment should be an integral part of the clinical evaluation and treatment of people with the condition widely known as Chronic Fatigue Syndrome (I will use that term for simplicity, although I appreciate that some people prefer alternatives).

The main results are in Table 3, which I have reproduced here (I hope that Elsevier's legal department will agree that this counts as fair use):

Table 3 from de Venter et al., 2017. Red lines added by me. Bold highlighting of the p values below .05 by the authors.

The article is quite short, with the Results section focusing on the two standardized (*) partial regression coefficients (henceforth, "betas") that I've highlighted in red here, which have associated p values below .05, and the Discussion section focusing on the implications of these.

There are a couple of problems here.

First, there are five regression coefficients for each dependent variable, of which just one per DV has a p value below .05 (**). It's not clear to me why, in theoretical terms, childhood experiences of sexual harassment (but not emotional neglect, emotional abuse, bodily threat, or sexual abuse) should be a good predictor of fatigue (CIS) or general physical functioning (SF-36). The authors define sexual harassment as "being submitted to sexual acts without physical contact" and sexual abuse as "having undergone sexual acts involving physical contact". I'm not remotely qualified in these matters, but it seems to me that with these definitions, "sexual abuse" would probably be expected to lead to more problems in later functioning than "sexual harassment". Indeed, I find it difficult to imagine how one could be subjected to "abuse with contact" while not also being subjected to "abuse without contact", more or less by the nature of "sexual acts". (I apologise if this whole topic makes you feel uneasy. It certainly makes me feel that way.)

It seems unlikely that the specific hypothesis that "sexual harassment (but not emotional neglect, emotional abuse, bodily threat, or sexual abuse) will be a significant predictor of fatigue and [impaired] general functioning among CFS patients" was made a priori. And indeed, the authors tell us in the last sentence of the introduction that there were no specific a priori hypotheses: "Thus, in the present study, we examine the differential impact of subtypes of self-reported early childhood trauma on fatigue and physical functioning levels in a well-described population of CFS patients" (p. 77). In other words, they set out to collect some data, run some regressions, and see what emerged. Now, this can be perfectly fine, but it's exploratory research. The conclusions you can draw are limited, and the interpretation of p values is unclear (de Groot, 1956). Any results you find need to be converted into specific hypotheses and given a severe test (à la Popper) with new data.

The second problem is a bit more subtle, but it illustrates the danger of running complex multiple regressions, and especially of reporting only the regression coefficients of the final model. For example, there is no measure of the total variance explained by the model (R^2), or of the increase in R^2 from the model with just the covariates to the model where the variable of interest is added. (Note that only 29 of the 155 participants reported any experience of childhood sexual harassment at all. You might wonder how much of the total variance in a sample can be explained by a variable for which 80% of the participants had the same score, namely zero.)  All we have is statistically significant betas, which doesn't tell us a lot, especially given the following problem.

Take a look at the betas for sexual harassment. You will see that they are both greater than 1.0 in magnitude. That is, the beta for sexual harassment in each of the regressions in de Venter et al.'s article must be considerably larger in magnitude than the zero-order correlation between sexual harassment and the outcome variable (CIS or SF-36), which of course cannot exceed 1.0 in magnitude. For SF-36, for example, even if the original correlation had been as large as -.90, the corresponding beta would still be twice that size. If you have trouble thinking about what a beta coefficient greater than 1.0 might mean, I recommend this excellent blog post by David Disabato.

(A quick poll of some colleagues revealed that quite a few researchers are not even aware that a beta coefficient above 1.0 is "a thing". Such values tend to arise only when there is substantial correlation between the predictors. For two predictors there are complete analytical solutions describing the exact circumstances under which this will happen (e.g., Deegan, 1978), but with more predictors a numerical approach is needed in all but the most trivial cases. The rules of matrix algebra, which govern how multiple regression works, are deterministic, but their effects are often difficult to predict from a simple inspection of the correlation table.)
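To make this concrete, here is a minimal numerical sketch. The correlations are hypothetical, chosen purely for illustration, and are not taken from the article: with two predictors that correlate .90 with each other, quite modest zero-order correlations with the outcome are enough to push both standardized betas past 1.0 in magnitude.

```python
import numpy as np

# Hypothetical correlations, for illustration only: two predictors that
# correlate .90 with each other, and .50 / .10 with the outcome.
Rxx = np.array([[1.0, 0.9],
                [0.9, 1.0]])   # predictor intercorrelations
rxy = np.array([0.5, 0.1])     # zero-order predictor-outcome correlations

# Standardized partial regression coefficients: beta = Rxx^{-1} rxy
beta = np.linalg.solve(Rxx, rxy)
print(beta.round(3))           # betas of about 2.158 and -1.842: both
                               # exceed 1.0 in magnitude, although neither
                               # zero-order correlation does

# The model's R-squared, about 0.895, is still within bounds despite
# the outsized betas:
r_squared = beta @ rxy
print(round(r_squared, 3))
```

Note that nothing exotic is needed here: just a strong correlation between the predictors, plus a mismatch between their correlations with the outcome.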

This doubling of the zero-order coefficient to the beta is a very large difference that is almost certainly explained entirely by substantial correlations between at least some, and possibly all, of the five predictors. If the authors wish to claim otherwise, they have some serious theoretical explaining to do. In particular, they need to show why the true relation between sexual harassment and SF-36 functioning is in fact twice as strong as the (presumably already substantial) zero-order correlation would suggest, and how the addition of the other covariates somehow reveals this otherwise hidden part of the relation. If they cannot do this, then the default explanation --- namely, that this is a statistical artefact resulting from highly correlated predictors --- is by far the most parsimonious.

My best guess is that the zero-order correlation between sexual harassment and the outcome variables is not statistically significant, which brings to mind one of the key points from Simmons, Nelson, and Simonsohn (2011): "If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate" (p. 1362). I also suspect that we would find that sexual harassment and sexual abuse are rather strongly correlated, as discussed earlier.

I wanted to try and reproduce the authors' analyses, to understand which predictors were causing the inflation in the beta for sexual harassment. The best way to do this would be if the entire data set had been made public somewhere, but this is research on people with a controversial condition, so it's not necessarily a problem that the data are not just sitting on OSF waiting for anyone to download them. All I needed were the seven variables that went into the regressions in Table 3, so there is no question of requesting any personally-identifiable information. In fact, all I really needed was the correlations between these variables, because I could work out the regression coefficients from them.

(Another aside: Many people don't know that with just the table of correlations, you can reproduce the standardized coefficients of any OLS regression analysis. With the SDs as well you can get the unstandardized coefficients, and with the means you can also derive the intercepts. For more information and R code, see this post, knowing that the correlation matrix is, in effect, the covariance matrix for standardized variables.)
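A quick sketch of that point (the simulated data, seed, and variable names here are mine, purely for illustration): generate some raw data, keep only the correlations, SDs, and means, and then reconstruct the full OLS solution from those summaries alone.

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate a small data set: two correlated predictors and an outcome.
n = 200
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
y = 1.5 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
data = np.column_stack([x1, x2, y])

# Pretend these summary statistics are all that was published:
R = np.corrcoef(data, rowvar=False)   # correlation matrix
sds = data.std(axis=0, ddof=1)
means = data.mean(axis=0)

# Standardized coefficients from the correlations alone: beta = Rxx^{-1} rxy,
# where Rxx is the predictor block and rxy the predictor-outcome column.
beta_std = np.linalg.solve(R[:2, :2], R[:2, 2])

# The SDs convert these to unstandardized slopes; the means give the intercept.
b = beta_std * sds[2] / sds[:2]
intercept = means[2] - b @ means[:2]

# Check against an ordinary least-squares fit on the raw data.
X = np.column_stack([np.ones(n), x1, x2])
b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose([intercept, *b], b_ols))  # → True
```

The agreement is exact (up to floating-point error), because the correlation matrix, SDs, and means together contain everything that OLS uses.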

So, that was my starting point when I set out to contact the authors. All I needed was seven columns of (entirely anonymous) numbers --- or even just the table of correlations, which arguably ought to have been in the article anyway. But my efforts to obtain the data didn't go very well, as you can see from the e-mail exchange below. (***)

First, I wrote to the corresponding author (Dr Maud de Venter) from my personal Gmail account:

Nick Brown <**********@gmail.com>  3 Oct, 16:50

Dear Dr. de Venter,

I have read with interest your article "Differential effects of childhood trauma subtypes on fatigue and physical functioning in chronic fatigue syndrome", published in Comprehensive Psychiatry.

My attention was caught, in particular, by the very high beta coefficients for the two statistically significant results.  Standardized regression coefficients above 1.0 tend to indicate that suppression effects or severe confounding are occurring, which can result in betas that are far larger in magnitude than the corresponding zero-order correlations. Such effects tend to require a good deal of theoretical explanation if they are not to be considered as likely statistical artefacts.

I wonder if you could supply me with a copy of the data set so that I could examine this question in more detail? I would only need the seven variables mentioned in Table 3, with no demographic information of any kind, so I would hope that there would be no major concerns about confidentiality. I can read most formats, including SPSS .SAV files and CSV. Alternatively, a simple correlation matrix of these seven variables, together with their means and SDs, would also allow me to reproduce the results.

Kind regards,
Nicholas Brown

A couple of days later I received a reply, not from Dr. de Venter, but from Dr. Filip Van Den Eede, who is the last author on the article, asking me to clarify the purpose of my request. I thought I had been fairly clear, but it didn't seem unreasonable to send my request from my university address with my supervisor in copy, even though this work is entirely independent of my studies. So I did that:

Brown, NJL ...  05 October 2018 19:32

Dear Dr. Van Den Eede,

Thank you for your reply.

As I explained in my initial mail to Dr. de Venter, I would like to understand where the very high beta coefficients --- which would appear to be the key factors driving the headline results of the article --- are coming from. Betas of this magnitude (above 1.0) are unusual in the absence of confounding or suppression effects, the presence of which could have consequences for the interpretation of the results.

If I had access either to the raw data, or even just the full table of descriptives (mean, SD, and Pearson correlations), then I believe that I would be better able to identify the source of these high coefficients. I am aware that these data may be sensitive in terms of patient confidentiality, but it seems unlikely that any participant could be identified on the basis of just the seven variables in question.

I would be happy to answer any other questions that you might have, if that would make my purpose clearer.

As you requested, I am putting my supervisors in copy of this mail.

Kind regards,
Nicholas Brown

Within less than 20 minutes, back came a reply from Dr. Van Den Eede indicating that his research team does not share data outside of specific collaborations or for a "clear scientific purpose", an example of which might be a meta-analysis. This sounded to me like a refusal to share the data with me, but that wasn't entirely clear, so I sent a further reply:

Brown, NJL ...  07 October 2018 18:56

Dear Dr. Van Den Eede,

Thank you for your prompt reply.

I would argue that establishing whether or not a published result might be based on a statistical artifact does in fact constitute a "clear scientific purpose", but perhaps we will have to differ on this.

For the avoidance of doubt, might I ask you to formally confirm that you are refusing to share with me both (a) the raw data for these seven variables and (b) their descriptive statistics (mean, SD, and table of intercorrelations)?

Kind regards,
Nick Brown

That was 11 days ago, and I haven't received a reply since then, despite Dr. Van Den Eede's commendable speed in responding up to that point. I guess I'm not going to get one.

In case you're wondering about the journal's data sharing policy, here it is. It does not impose what I would call especially draconian requirements on authors, so I don't think writing to the editor is going to help much here.

Where does this leave us? Well, it seems to me that this research team is declining to share some inherently anonymous data, or even just the table of correlations for those data, with a bona fide researcher (at least, I think I am!), who has offered reasonable preliminary evidence that there might be a problem with one of their published articles.

I'm not sure that this is an optimal way to conduct robust research into serious problems in people's lives.

[[ Begin update 2018-11-03 18:00 UTC ]]
I wrote to the editor of the journal that published the de Venter et al. article, with my concerns. He replied with commendable speed, enclosing a report from a "statistical expert" whom he had consulted. Sadly, when I asked for permission to quote that report here, the editor asked me not to do so. So I shall have to paraphrase it, and hope that no inaccuracies creep in.

Basically, the statistical expert agreed with my points, stated that it would indeed be useful to know if the regression coefficients were standardized or unstandardized, but didn't think that much could be done unless the authors wanted to write an erratum. This expert also didn't think there was a problem with the lack of Bonferroni correction, because the reader could fill that in for themselves.
[[ End update 2018-11-03 18:00 UTC ]]

[[ Begin update 2018-11-24 17:20 UTC ]]
Since this blog post was first written, I have received two further e-mails from Dr. Van Den Eede on this subject. In the latest of these, he indicated that he does not approve of the sharing of the text of his e-mails on my blog. Accordingly, I have removed the verbatim transcripts of the e-mails that I received from him before this blog post, and replaced them with what I believe to be fair summaries of their content.

The more recent e-mails do not add much in terms of progress towards my goal of obtaining the seven variables, or the full table of descriptives. However, Dr. Van Den Eede did tell me that the regression coefficients published in the De Venter et al. article were unstandardized, suggesting (given the units of the scales involved) that the effect size was very small.
[[ End update 2018-11-24 17:20 UTC ]]

[[ Begin update 2018-11-27 12:15 UTC ]]
I added a note to the start of this post to make it clear that it represents my personal opinions only.

[[ End update 2018-11-27 12:15 UTC ]]


de Groot, A. D. (1956). De betekenis van “significantie” bij verschillende typen onderzoek [The meaning of “significance” for different types of research]. Nederlands Tijdschrift voor de Psychologie, 11, 398–409. English translation by E. J. Wagenmakers et al. (2014), Acta Psychologica, 148, 188–194. http://dx.doi.org/10.1016/j.actpsy.2014.02.001

Deegan, J., Jr. (1978). On the occurrence of standardized regression coefficients greater than one. Educational and Psychological Measurement, 38, 873–888. http://dx.doi.org/10.1177/001316447803800404

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632

(*) A colleague suggested to me that these numbers might in fact be unstandardized coefficients. My assumption that they are standardized partial regression coefficients is based on the following:
1. The Abstract and the Results section of the article refer to them with the Greek letter β, which is the symbol normally used to denote standardized partial regression coefficients.
2. The scale ranges of the IVs (0–12 in four cases, 0–21 in the fifth) and DVs (maximum values over 100) are such that I would expect considerably larger unstandardized values for a statistically significant effect, if these IVs were explaining a non-trivial amount of variance in the DVs.
3. I mentioned "standardized regression coefficients" in my first e-mail to the authors, and "beta coefficients" in the second. Had these numbers in fact been referring to unstandardized coefficients, I would have hoped that the last author would have pointed this out, thus saving everybody's time, rather than entering into a discussion about their data sharing policies.

I suppose that it is just possible that these are unstandardized coefficients (in which case a correction to the article would seem to be required on that basis alone), but of course, if the authors would agree to share their data with me, I could ascertain that for myself.

(**) I hope that readers will forgive me if, for the purposes of the present discussion, I assume that identifying whether a p value is above or below .05 has some utility when one is attempting to learn something true about the universe.

(***) I'm not sure what the ethical rules are about publishing e-mail conversations, but I don't feel that I'm breaking anyone's trust here. Perhaps I should have asked Dr. Van Den Eede if he objected to me citing our correspondence, but since he has stopped replying to me about the substantive issue I'm not sure that it would be especially productive to ask about a procedural matter.

10 September 2018

Replication crisis: Cultural norms in the lab

I recently listened to an excellent episode of the "Stuk Rood Vlees" ("Hunk of Red Meat") podcast that is hosted by the Dutch political scientist Armèn Hakhverdian (@hakhverdian on Twitter). His guest was Daniël Lakens (@lakens) and they talked at great length --- to the extent that the episode had to be split into two 1-hour sections --- about the replication crisis.

This podcast episode was recorded in Dutch, which is reasonable since that's the native language of both protagonists, but a little unfortunate for the more than 99.5% of the world's population who don't understand it. (Confession: I'm a little bit lukewarm on podcasts --- apart from ones with me as a guest, which are fantastic --- because the lack of a transcript makes them hard to search, and even harder to translate.)

This is a particular shame because Daniël is on sparkling form in this podcast. So I've taken the liberty of transcribing what I thought was the most important part, just over 15 minutes long, where Armèn and Daniël talk about publication bias and the culture that produces it. The transcription has been done rather liberally, so don't use it as a way to learn Dutch from the podcast. I've run it past both of the participants and they are happy that it doesn't misrepresent what they said.

This discussion starts at around 13:06, after some discussion of the Stapel and Bem affairs from 2010-2011, ending with surprise that when Stapel --- as Dean --- claimed to have been collecting his data himself, everybody thought this was really nice of him, and nobody seemed to find it weird. Now read on...

Daniël Lakens: Looking back, the most important lesson I've learned about this --- and I have to say, I'm glad that I had started my career back then, around 2009, back when we really weren't doing research right, so I know this from first-hand experience --- is just how important the influence of conforming to norms is. You imagine that you're this highly rational person, learning all these objective methods and applying them rigorously, and then you find yourself in this particular lab and someone says "Yeah, well, actually, the way we do it round here is X", and you just accept that. You don't think it's strange, it's just how things are. Sometimes something will happen and you think "Hmmm, that's a bit weird", but we spend our whole lives in the wider community accepting that slightly weird things happen, so why should it be different in the scientific community? Looking back, I'm thinking "Yeah, that wasn't very good", but at the time you think, "Well, maybe this isn't the optimal way to do it, but I guess everyone's OK with it".

Armèn Hakhverdian: When you arrive somewhere as a newbie and everyone says "This is how we do it here, in fact, this is the right way, the only way to do it", it's going to be pretty awkward to question that.

DL: Yes, and to some extent that's a legitimate part of the process of training scientists. The teacher tells you "Trust me, this is how you do it". And of course up to some point you kind of have to trust these people who know a lot more than you do. But it turns out that quite a lot of that trust isn't justified by the evidence.

AH: Have you ever tried to replicate your own research?

DL: The first article I was ever involved with as a co-author --- so much was wrong with that. There was a meta-analysis of the topic that came out showing that overall, across the various replications, there was no effect, and we published a comment saying that we didn't think there was any good evidence left.

AH: What was that study about?

DL: Looking back, I can see that it was another of these fun effects with little theoretical support ---

AH: Media-friendly research.

DL: Yep, there was a lot of that back then. This was a line of research where researchers tried to show that how warm or heavy something was could affect cognition. Actually, this is something that I still study, but in a smarter way. Anyway, we were looking at weight, and we thought there might be a relation between holding a heavy object and thinking that certain things were more important, more "weighty". So for example we showed that if you gave people a questionnaire to fill in and it was attached to a heavy clipboard, they would give different, more "serious" answers than if the clipboard was lighter. Looking back, we didn't analyse this very honestly --- there was one experiment that didn't give us the result we wanted, so we just ignored it, whereas today I'd say, no, you have to report that as well. Some of us wondered at the time if it was the right thing to do, but then we said, well, that's how everyone else does it.

AH: There are several levels at which things can be done wrong. Stapel making his data up is obviously horrible, but as you just described you can also just ignore a result you don't like, or you can keep analysing the data in a bunch of ways until you find something you can publish. Is there a scale of wrongdoing? We could just call it all fraud, but for example you could just have someone who is well-meaning but doesn't understand statistics --- that isn't an excuse, but it's a different type of problem from conscious fraud.

DL: I think this is also very dependent on norms. There are things that we still think are acceptable today, but which we might look back on in 20 years' time and think, how could we ever have thought that was OK? Premeditated fraud is a pretty easy call, a bit like murder, but in the legal system you also have the idea of killing someone, not deliberately, but by gross negligence, and I think the problems we have now are more like that. We've known for 50 years or more that we have been letting people with insufficient training have access to data, and now we're finally starting to accept that we have to start teaching people that you can't just trawl through data and publish the patterns that you find as "results". We're seeing a shift --- whereas before you could say "Maybe they didn't know any better", now we can say, "Frankly, this is just negligent". It's not a plausible excuse to pretend that you haven't noticed what's been going on for the past 10 years.

   Then you have the question of not publishing non-significant results. This is a huge problem. You look at the published literature and more than 90% of the studies show positive results, although we know that lots of research just doesn't work out the way we hoped. As a field we still think that it's OK to not publish that kind of study because we can say, "Well, where could I possibly get it published?". But if you ask people who don't work in science, they think this is nuts. There was a nice study about this in the US, where they asked people, "Suppose a researcher only publishes results that support his or her hypotheses, what should happen?", and people say, "Well, clearly, that researcher should be fired". That's the view of dispassionate observers about what most scientists think is a completely normal way to work. So there's this huge gap, and I hope that in, say, 20 years' time, we'll have fixed that, and nobody will think that it's OK to withhold results. That's a long time, but there's a lot that still needs to be done. I often say to students, if we can just fix this problem of publication bias during our careers, alongside the actual research we do, that's the biggest contribution to science that any of us could make.

AH: So the problem is, you've got all these studies being done all around the world, but only a small fraction gets published. And that's not a random sample of the total --- it's certain types of studies, and that gives a distorted picture of the subject matter.

DL: Right. If you read in the newspaper that there's a study showing that eating chocolate makes you lose weight, you'll probably find that there were 40 or 100 studies done, and in one of them the researchers happened to look at how much chocolate people ate and how their weight changed, and that one study gets published. And of course the newspapers love this kind of story. But it was just a random blip in that one study out of 100. And the question is, how much of the literature is this kind of random blip, and how much is reliable.

AH: For many years I taught statistics to first- and second-year undergraduates who needed to do small research projects, but I never talked about this kind of thing. And lots of these students would come to me after collecting their data and say, "Darn, I didn't get a significant result". It's like there's this inherent belief that you have to get statistical significance to have "good research". But whether research is good or not is all about the method, not the results. It's not a bad thing that a hypothesis goes unsupported.

DL: But it's really hypocritical to tell a first-year student to avoid publication bias, and then to say "Hey, look at my impressive list of publications", when that list is full of significant results. In the last few years I've started to note the non-significant results in the Discussion section, and sometimes we publish via a registered report, where you write up and submit in advance how you're going to do the study, and the journal says "OK, we'll accept this paper regardless of how the results turn out". But if you look at my list of publications as a whole, that first-year student is not going to think that I'm very sincere when I say that non-significant results are just as important as significant ones. Young researchers come into a world that looks very different to what you just described, and they learn very quickly that the norm is, "significance means publishable".

AH: In political science we have lots of studies with null results. We might discover that it wouldn't make much difference if you made some proposed change to the voting system, and that's interesting. Maybe it's different if you're doing an experiment, because you're changing something and you want that change to work. But even there, the fact that your manipulation doesn't work is also interesting. Policymakers want to know that.

DL: Yes, but only if the question you were asking is an interesting one. When I look back to some of my earlier studies, I think that we weren't asking very interesting questions. They were fun because they were counterintuitive, but there was no major theory or potential application. If those kinds of effects turn out not to exist, there's no point in reporting that, whereas we care about what might or might not happen if we change the voting system.

AH: So for example, the idea that if people are holding a heavier object they answer questions more seriously: if that turns out not to be true, you don't think that's interesting?

DL: Right. I mean, if we had some sort of situation in society whereby we knew that some people were holding heavy or light things while filling in important documents, then we might be thinking about whether that changes anything. But that's not really the case here, although there are lots of real problems that we could be addressing.

   Another thing I've been working on lately is teaching people how to interpret null effects. There are statistical tools for this ---

AH: It's really difficult.

DL: No, it's really easy! The tools are hardly any more difficult than what we teach in first-year statistics, but again, they are hardly ever taught, which also contributes to the problem of people not knowing what to do with null results.

(That's the end of this transcript, at around the 30-minute mark on the recording. If you want to understand the rest of the podcast, it turns out that Dutch is actually quite an easy language to learn.)

01 July 2018

This researcher compared two identical numbers. The effect size he obtained will shock you!

Here's an extract from an article that makes some remarkable claims about the health benefits of drinking green tea. The article itself seems to merit scrutiny for a number of reasons, but here I just want to look at a point that illustrates why (as previously noted by James Heathers in his inimitable style) rounding to one decimal place (or significant figure) when reporting your statistics is not a good idea.

The above image is taken from Table 3 of the article, on page 596. This table shows the baseline and post-treatment values of a large number of variables that were measured in the study.  The highlighted row shows the participants' waist–hip ratio in each of two groups and at each of two time points. As you can see, all of the (rounded) means are equal, as are all of the (rounded) SDs.

Does this mean that there was absolutely no difference between the participants? Not quite. You can see that the p value is different for the two conditions. This p value corresponds to the paired t test that will have been performed for the 39 participants in the treatment group across the period of the study, or for the 38 participants in the control group. The p values (corresponding to the respective t statistics) could well differ even if the means and SDs were identical to many decimal places, because the paired t test is computed from the 39 (or 38) individual within-person differences between baseline and the end of the study, not just from the group-level summary statistics.
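To illustrate with made-up numbers (not the article's data): the paired t statistic depends entirely on the individual differences, so two data sets with identical group means and SDs at both time points can still give very different t (and hence p) values.

```python
import math

def paired_t(diffs):
    """Paired-samples t statistic from a list of within-person differences."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Two hypothetical pairings of the same scores: before = [1.0, 2.0, 3.0]
# and after = {1.0, 2.1, 3.2} in some order, so the group means and SDs
# at each time point are identical either way.
print(round(paired_t([0.0, 0.1, 0.2]), 3))    # consistent small gains: t ≈ 1.732
print(round(paired_t([2.2, -1.0, -0.9]), 3))  # same marginals, noisy pairing: t ≈ 0.095
```

Same summary statistics at both time points, a twenty-fold difference in t.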

However, what I'm interested in here is the difference in mean waist–hip ratios between the groups at baseline (i.e., the first and fourth columns of numbers). The participants have been randomized to conditions, so presumably the authors decided not to worry about baseline differences [PDF], but it's interesting to see what those differences could have been (not least because these same numbers could also have been, say, the results obtained by the two groups on a psychological test after they had been assigned randomly to conditions without a baseline measurement).

We can calculate the possible range of differences(*) by noting that the rounded mean of 0.9 could have corresponded to an actual value anywhere between 0.85001 and 0.94999 (let's leave the question of how to round values of exactly 0.85 or 0.95 for now; it's complicated). Meanwhile, each of the rounded SDs of 0.1 could have been as low as 0.05001. (The lower the SD, the larger the effect size.)  Let's put those numbers into this online effect size calculator (M1=0.94999, M2=0.85001, SD1=SD2=0.05001) and click "Compute" (**).

Yes, you are reading that right: an effect size of d = (almost) 2 is possible for the baseline difference between the groups, even though the reported means are identical. (For what it's worth, the p value here, with 75 degrees of freedom, is .0000000000004.) Again, James has you covered if you want to know what an effect size of 2 means in the real world.
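
The rounding-bounds arithmetic can be reproduced in a few lines of Python. This just applies the standard Cohen's d formula with a common SD, which is what the online calculator reduces to when the group sizes are (nearly) equal:

```python
def cohens_d(m1, m2, sd):
    """Cohen's d for two means with a common SD (equal-n simplification)."""
    return abs(m1 - m2) / sd

# Extreme values still consistent with reported means of 0.9 and SDs of 0.1
print(cohens_d(0.94999, 0.85001, 0.05001))  # ~1.999
print(cohens_d(0.94999, 0.85001, 0.1))      # ~0.9998
```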

Now, you might think that this is a bit pathological, and you're probably right. So play around with the means and SDs until they look reasonable to you. For example, if you keep the extreme means but use the rounded SDs as if they were exactly correct, you get d = 0.9998.  That's still a whopping effect size for the difference between numbers that are reported as being equal. And even if you bring the means in from the edge of the cliff, the effect size can still be pretty large. Means of 0.93 and 0.87 with SDs of 0.1 will give you d = 0.6 and p = .01, which is good enough for publication in most journals.

Conclusion: Always report not just two decimal places, but also at least two significant figures (it's very frustrating to see standardized regression coefficients, in particular, reported as 0.02 with a standard error of 0.01). In fact, since most people read papers on their screens and black pixels use less energy to display than white ones, save the planet and your battery life and report three or four decimal places. After all, you aren't afraid of GRIM, are you?

(*) I did this calculation by hand. My f_range() function, described here, doesn't work in this case because the underlying code (from a module that I didn't write, and have no intention of fixing) chokes when trying to calculate the midpoint test statistic when the means and SDs are identical.

(**) This calculator seems to be making the simplifying assumption that the group sizes are identical, which is close enough to make no difference in this case. You can also do the calculation of d by hand: just divide the difference between the means by the standard deviation, assuming you're using the same SD for both means, or see here.

[Update 2018-07-02 12:13 UTC: Removed link to a Twitter discussion of a different article, following feedback from an alert reader.]

31 May 2018

How SPRITE works: a step-by-step introduction

Our preprint about SPRITE went live a few hours ago. I encourage you to read it, but not everyone will have the time, so here is a simple (I hope) explanation of what we're trying to do.

Before we start, I suggest that you open this Google spreadsheet and either make a copy or download an Excel version (both of these options are in the File menu) so you can follow along.

Imagine that you have read in an article that N=20 people responded to a 1–5 Likert-type item with a mean of 2.35 and an SD of 1.39. Here's how you could test whether that's possible:

1. Make a column of 20 random numbers in the range 1–5 and have your spreadsheet software display their mean and SD. Now we'll try and get the mean and SD to match the target values.

2. If the mean is less than the target mean (2.35), add 1 to one of the numbers that isn't a 5 (the maximum on the scale). If the mean is greater than the target mean, subtract 1 from one of the numbers that isn't a 1. Repeat this step until the mean matches the target mean.

3. If the SD doesn't match the target SD, select a pair of numbers from the list. Call the smaller number A and the larger one B (if they are identical, either can be A or B). If the SD is currently smaller than the target SD, subtract 1 from A and add 1 to B. If the SD is currently larger than the target SD, add 1 to A and subtract 1 from B. Repeat this step until the SD matches the target SD. (Not all pairs of numbers are good choices, as you will see if you play around a bit with the spreadsheet, but we can ignore that for the moment.)
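
Steps 1 to 3 above can be automated. Here is a toy Python sketch of the idea (a random-walk version I wrote for illustration, not the actual SPRITE/rSPRITE code, and with none of the subtleties about which pairs are good choices):

```python
import random
import statistics

def sprite_search(n=20, lo=1, hi=5, target_mean=2.35, target_sd=1.39,
                  tol=0.005, max_iter=100_000, seed=1):
    """Toy SPRITE: nudge a list of scale values until its mean and SD
    match the targets (within rounding), or give up after max_iter."""
    rng = random.Random(seed)
    data = [rng.randint(lo, hi) for _ in range(n)]          # step 1
    for _ in range(max_iter):
        mean = statistics.mean(data)
        if mean < target_mean - tol:                        # step 2: raise mean
            data[rng.choice([i for i, v in enumerate(data) if v < hi])] += 1
        elif mean > target_mean + tol:                      # step 2: lower mean
            data[rng.choice([i for i, v in enumerate(data) if v > lo])] -= 1
        else:
            sd = statistics.stdev(data)
            if abs(sd - target_sd) <= tol:
                return sorted(data)                         # a SPRITE solution
            if sd < target_sd:                              # step 3: push a pair apart
                i = rng.choice([i for i, v in enumerate(data) if v > lo])
                j = rng.choice([j for j, v in enumerate(data) if v < hi])
                if i != j and data[i] <= data[j]:
                    data[i] -= 1
                    data[j] += 1
            else:                                           # step 3: pull a pair together
                pairs = [(i, j) for i in range(n) for j in range(n)
                         if data[j] - data[i] >= 2]
                if pairs:
                    i, j = rng.choice(pairs)
                    data[i] += 1
                    data[j] -= 1
    return None  # no solution found in max_iter steps

solution = sprite_search()
if solution is not None:
    print(solution, statistics.mean(solution), statistics.stdev(solution))
```

Note that the pair moves in step 3 leave the sum (and hence the mean) unchanged, which is why the mean can be fixed first and then left alone.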

Let's go through this in the spreadsheet; I hope you'll see that it's quite simple.

Here's the spreadsheet. Cells B2 and B3 contain the target mean and SD. Cells D2 and D3 contain the current mean and SD of the test data, which are the 20 numbers in cells D5 through D24. Cells C2 and C3 contain the difference between the current and target mean and SD, respectively. When that difference drops to 0.005 or less (which means that the numbers are equal, within the limits of rounding), these two cells will change colour. (For some reason, they turn green in Google Sheets but blue in my copy of Excel.)

In this spreadsheet, most of the work has already been done. The mean is 2.30 and the target is 2.35, so if you increase one value by 1 (say, D11, from 1 to 2), the mean will go to 2.35 and cell C2 will change colour. That's step 2 completed.

For the SD, observe that after you fixed the mean by changing D11, the SD became 1.31, which is smaller than the target. So you want to increase the SD, which means pushing two values further apart. For example, change D12 from 2 to 1 and D13 from 2 to 3. The mean will be unchanged, but now the SD is 1.35; changing two 2s to a 1 and a 3 increased the SD by 0.04, which is the amount that the SD is still short of the target. So let's do the same operation again. Change D14 from 2 to 1 and D15 from 2 to 3. You should now have an SD of 1.39, equal to the target value, and cell C3 should have changed colour. Step 3 is now completed.

Congratulations, you just found a SPRITE solution! That is, the list of values you now have (after sorting) has a mean of 2.35 and an SD of 1.39, and could have been the combination that gave the result that you were trying to reproduce from the article.

Not every swap of two values gives the same result, however. Let's back up a little by changing D15 from 3 back to 2 and D14 from 1 back to 2 (so the SD should now be back to 1.35). Now change cell D20 from 3 to 2 and cell D21 from 4 to 5. The mean is still OK, but the SD has overshot the target value of 1.39 and is now 1.42. So this set of values is not a valid solution.

There are eight unique solutions (I checked with CORVIDS); rSPRITE will usually find all eight, although it doesn't always get 100% of possible solutions in more complex cases. If playing with numbers like this is your idea of fun, you could try and find more solutions by hand. Here's the spoiler picture, with the solution we found earlier right in the middle:

Basically, that's all there is to it. SPRITE is just software that does this adding and swapping, with a few extra subtleties, very fast. It's the computer version of some checks that James Heathers and I first started doing in late 2015 when we were looking at some dodgy-looking articles. But we certainly aren't the first people to have had the idea of checking whether mean/SD combinations are possible; it really isn't rocket science.

02 May 2018

A footnote on self-citation and duplicate publication

Since discussion of the topics of self-citation and duplicate publication seems to be "hot" at the moment, and I have probably had something to do with at least the second of those, I feel that I ought to mention my own record in this area, in the interest of full transparency.

I've never really thought about how bad "excessive" self-citation is as a misdemeanour on the academic "just not done" scale, nor indeed what "excessive" might mean, but I think there are a few rather severe problems with self-plagiarism (aka duplicate publication):

  1. Copyright (whether we like it or not, most of the time, we sign over copyright to all of the text, give or take "fair use", to the publisher of an article or chapter);
  2. It's a low-effort way to pad your CV;
  3. Possible deception of the editors of books or journals to which the duplicates were submitted;
  4. Possible deception of the readers.
James Heathers has more thoughts on this here.

First, self-citation: According to Google Scholar, my published work (most, but not all, of it peer-reviewed) has 376 citations as of today. I have gone through all of the manuscripts with which I have been involved and counted 14 self-citations, plus two citations of chapters by other people in a book that I co-edited (which count towards my Google Scholar h-index). For what it's worth, I am not suggesting that anyone should feel the need to calculate and disclose their self-citation rate, as it would be a very tedious exercise for people with a lot of publications.

Second, duplicate publication: I am the third author (of four; I estimate that I contributed about 10% of the words) of this article in the Journal of Social and Political Psychology (JSPP), which I also blogged about here. In order to bring the ideas in that article to the attention of a wider public, we reworked it into a piece in Skeptical Inquirer, which included the following disclosure:

The JSPP article was published under a CC-BY 3.0 license, which means that there were no issues with copyright when re-using some parts of its text verbatim:

Both articles are mentioned in my CV [PDF], with the Skeptical Inquirer piece being filed under "Other publications", including a note that it was derived from the earlier peer-reviewed article in JSPP.

That's all I have to disclose on the questions of self-citation and duplicate publication. If you find something else, please feel free to call me out on it.

25 April 2018

Some instances of apparent duplicate publication by Dr. Robert J. Sternberg

Dr. Robert J. Sternberg is a past president of the American Psychological Association, currently at Cornell University, with a CV that is over 100 pages long [PDF] and, according to Google Scholar, almost 150,000 citations.

Recently, some people have been complaining that too many of those are self-citations, leading to a formal petition to the APS Publication Committee. But sometimes, it seems, Dr. Sternberg prefers to make productive use of his previous work in a more direct manner.  I was recently contacted by Brendan O'Connor, a graduate student at the University of Leicester, who had noticed that some of the text in Dr. Sternberg's many articles and chapters appeared to be almost identical. It seems that he may be on to something.

Exhibit 1

Brendan—who clearly has a promising career ahead of him as a data thug, should he choose that line of work—noticed that this 2010 article by Dr. Sternberg was basically a mashup of this article of his from the same year and this book chapter of his from 2002. One of the very few meaningful differences in the chunks that were recycled between the two 2010 articles is that the term "school psychology" is used in the mashup article to replace "cognitive education" from the other; this may perhaps not be unrelated to the fact that the former was published in School Psychology International (SPI) and the latter in the Journal of Cognitive Education and Psychology (JCEP). If you want to see just how much of the SPI article was recycled from the other two sources, have a look at this. Yellow highlighted text is copied verbatim from the 2002 chapter, green from the JCEP article. You can see that about 95% of the text is in one or the other colour:
Curiously, despite Dr. Sternberg's considerable appetite for self-citation (there are 26 citations of his own chapters or articles, plus 1 of a chapter in a book that he edited, in the JCEP article; 25 plus 5 in the SPI article), neither of the 2010 articles cites the other, even as "in press" or "manuscript under review"; nor does either of them cite the 2002 book chapter. If previously published work is so good that you want to copy big chunks from it, why would you not also cite it?

Exhibit 2

Inspired by Brendan's discovery, I decided to see if I could find any more examples. I downloaded Dr. Sternberg's CV and selected a couple of articles at random, then spent a few minutes googling some sentences that looked like the kind of generic observations that an author in search of making "efficient" use of his time might want to re-use.  On about the third attempt, after less than ten minutes of looking, I found a pair of articles, from 2003 and 2004, by Dr. Sternberg and Dr. Elena Grigorenko, with considerable overlaps in their text. About 60% of the text in the later article (which is about the general school student population) has been recycled from the earlier one (which is about gifted children), as you can see here (2003 on the left, 2004 on the right). The little blue paragraph in the 2004 article has also come from another source; see Exhibit 4.
Neither of these articles cites the other, even as "in press" or "manuscript in preparation".

Exhibit 3

I wondered whether some of the text that was shared between the above pair of articles might have been used in other publications as well. It didn't take long(*) to find Dr. Sternberg's contribution (chapter 6) to this 2012 book, in which the vast majority of the text (around 85%, I estimate) has been assembled almost entirely from previous publications: chapter 11 of this 1990 book by Dr. Sternberg (blue), this 1998 chapter by Dr. Janet Davidson and Dr. Sternberg (green), the above-mentioned 2003 article by Dr. Sternberg and Dr. Grigorenko (yellow), and chapter 10 of this 2010 book by Dr. Sternberg, Dr. Linda Jarvin, and Dr. Grigorenko (pink).

Once again, despite the fact that this chapter cites 59 of Dr. Sternberg's own publications and another 10 chapters by other people in books that he (co-)edited, none of those citations are to the four works that were the source of all the highlighted text in the above illustration.

Now, sometimes one finds book chapters that are based on previous work. In such cases, it is the usual practice to include a note to that effect. And indeed, two chapters (numbered 26 and 27) in that 2012 book edited by Dr. Dawn Flanagan and Dr. Patti Harrison contain an acknowledgement along the lines of "This chapter is adapted from <reference>.  Copyright 20xx by <publisher>.  Adapted by permission". But there is no such disclosure in chapter 6.

Exhibit 4

It appears that Dr. Sternberg has assembled a chapter almost entirely from previous work on more than one occasion. Here's a recent example of a chapter made principally from his earlier publications. About 80% of the words have been recycled from chapter 9 of this 2011 book by Dr. Sternberg, Dr. Jarvin, and Dr. Grigorenko (yellow), chapter 2 of this 2003 book by Dr. Sternberg (blue; this is also the source of the blue paragraph in Exhibit 2), chapter 1 of this 2002 book by Drs Sternberg and Grigorenko (green), the 2012 chapter(**) mentioned in Exhibit 3 above (pink), and a wafer-thin slice from chapter 2 (contributed by Dr. Sternberg) of this 2008 book (purple).

This chapter cites 50 of Dr. Sternberg's own publications and another 7 chapters by others in books that he (co-)edited. This time, one of the citations was for one of the five books that were the basis of the highlighted text in the above illustration, namely the 2003 book Wisdom, Intelligence, and Creativity Synthesized that was the source of the blue text. However, none of the citations of that book indicate that any of the text taken from it is being correctly quoted, with quote marks (or appropriate indentation) and a page number. The four other books from which the highlighted text was taken were not cited. No disclosure that this chapter has been adapted from previously published material appears in the chapter, or anywhere else in the 2017 book (or, indeed, in the first edition of the book from 2005, where a similar chapter by Dr. Sternberg was published).

Why this might be a problem (other than for the obvious reasons)

There are a lot of reasons why this sort of thing is not great for science, and I suspect that there will be quite a lot of discussion about the meta-scientific, moral, and perhaps even legal aspects (I seem to recall that when I publish something, I generally have to sign my copyright over to someone, which means I can't go round distributing it as I like, and I certainly can't sign the copyright of the same text over to a second publisher). But I also want to make a point about how, even if the copying process itself does no apparent direct harm, this practice can damage the process of scientific inquiry.

During a number of the copy-and-paste operations that were apparently performed, a few words were sometimes changed. In some cases this was merely cosmetic (e.g., "participants" being changed to "students"), or a reflection of changing norms over time. But in other cases it seemed that the paragraphs being copied were merely being repurposed to describe a different construct that, while perhaps being in some ways analogous to the previous one, was not the same.  For example, the 2017 chapter that is the subject of Exhibit 4 above contains this sentence:

"In each case, important kinds of developing competencies for life were not adequately reflected by the kinds of competencies measured by the conventional ability tests" (p. 12).

But if we go to yet another chapter by Dr. Sternberg, this time from 2002, that contains mostly the same text (tracing all of the places in which a particular set of paragraphs have been recycled turns out to be computationally intensive for the human brain), we find:

"In each case, important kinds of developing expertise for life were not adequately reflected by the kinds of expertise measured by the conventional ability tests" (p. 21).

Are we sure that "competencies" are the same thing as "expertise"? How about "school psychology" and "cognitive education", as in the titles of the articles in Exhibit 1? Are these concepts really so similar that one can recycle, verbatim, hundreds of words at a time about one of them and be sure that all of those words, and the empirical observations that they sometimes describe, are equally applicable to both? And if so, why bother to have the two concepts at all?

Relatedly, the single biggest source of words for the chapter in Exhibit 3 (published in 2012) was a chapter published in 1990. Can it really be the case that so little has been discovered in 22 years of research into the nature of intelligence that this material doesn't even merit rewriting from a retrospective viewpoint?

What next?

I'm not sure, frankly. But James Heathers has some thoughts here.

(*) Brendan and I are looking for other examples similar to the ones described in this post. Given how easy it was to find these ones, we suspect that there may be more to be uncovered.

(**) While searching, I lost track of the number of times that the descriptions of the Rainbow and Kaleidoscope projects have been recycled across multiple publications. Citing the copy from the 2012 article seemed like an appropriate way to convey the continuity of the problem. For some reason, though, in this version from 2005, the number of students included in the sample was 777, instead of the 793 reported everywhere else.

13 March 2018

Announcing a crowdsourced reanalysis project

(Update 2018-03-14 10:18 UTC: I have received lots of offers to help with this, and I now have enough people helping.  So please don't send me an e-mail about this.)

Back in the spring of 2016, for reasons that don’t matter here, I found myself needing to understand a little bit about the NHANES (National Health and Nutrition Examination Survey) family of datasets.  NHANES is an ongoing programme that has been running in the United States since the 1970s, looking at how nutrition and health interact.

Most of the datasets produced by the various waves of NHANES are available to anyone who wants to download them. Before I got started on my project (which, in the end, was abandoned, again for reasons that don’t matter here), I thought that it was a good idea to check that I understood the structure of the data by reproducing the results of an article based on them. This seemed especially important because the NHANES files—at least, the ones I was interested in—are supplied in a format that requires SAS to read, and I needed to convert them to CSV before analyzing them in R.  So I thought the best way to check this would be to take a well-cited article and reproduce its table of results, which would allow me to be reasonably confident that I had done the conversion right, understood the variable names, etc.

Since I was using the NHANES-III data (from the third wave of the NHANES programme, conducted in the mid-1990s) I chose an article at random by looking for references to NHANES-III in Google Scholar (I don’t remember the exact search string) and picking the first article that had several hundred citations.  I won't mention its title here (read on for more details), but it addresses what is clearly an important topic and seemed like a very nice paper—exactly what I was looking for to test whether or not I was converting, importing, and interpreting the NHANES data correctly.

The NHANES-III datasets are distributed in good old-fashioned mainframe magnetic tape format, with every field having a fixed width. There is some accompanying code to interpret these files (splitting up the tape records and adding variable names) in SAS.  Since I was going to do my analyses in R, I needed to run this code and export the data in CSV format. I didn't have access to SAS (in fact, I had never previously used it), and it seemed like a big favour to ask someone to convert over 280 megabytes of data (which is the size of the three files that I downloaded) for me, especially because I thought (correctly) that it might take a couple of iterations to get the right set of files.  Fortunately, I discovered the remarkable SAS University Edition, which is a free-to-use version of SAS that seems to have most of the features one might want from a statistics package. This, too, is a big download (around 2GB, plus another 100MB for the Oracle Virtual Machine Manager that you also need—SAS are not going to allow you to run their software on just any operating system; it has to be Red Hat Linux, and even if you already have Red Hat Linux, you have to run their virtualised version on top!), but amazingly, it all worked first time.  As long as you have a recent computer (64-bit processor, 4GB of RAM, a few GB free on the disk) this should work on Windows, Mac, or Linux.

Having identified and downloaded the NHANES files that I needed, opening those files using SAS University Edition and exporting them to CSV format turned out to require just a couple of lines of code using PROC EXPORT, for which I was able to find the syntax on the web quite easily.  Once I had those CSV files, I could write my code to read them in, extract the appropriate variables, and repeat most of the analyses in the article that I had chosen.
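
For anyone who wants to skip SAS entirely, splitting fixed-width records is also easy in a general-purpose language. Here is a stdlib-Python sketch; the column layout shown is invented for illustration (the real positions come from the SAS input code distributed with each NHANES-III file):

```python
import csv

# Hypothetical (name, start, end) column positions; the real layout is
# defined in the SAS code that accompanies each NHANES-III data file.
LAYOUT = [("SEQN", 0, 5), ("AGE", 5, 8), ("BMI", 8, 13)]

def fwf_to_rows(lines, layout):
    """Cut each fixed-width record into named, whitespace-stripped fields."""
    for line in lines:
        yield {name: line[start:end].strip() for name, start, end in layout}

def fwf_to_csv(in_path, out_path, layout):
    """Convert a whole fixed-width file to CSV with a header row."""
    with open(in_path) as src, open(out_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=[n for n, _, _ in layout])
        writer.writeheader()
        writer.writerows(fwf_to_rows(src, layout))

print(list(fwf_to_rows(["00001 42 23.4"], LAYOUT)))
```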

Regular readers of this blog may be able to guess what happened next: I didn’t get the same results as the authors.  I won’t disclose too many details here because I don’t want to bias the reanalysis exercise that I’m proposing to conduct, but I will say that the differences did not seem to me to be trivial.  If my numbers are correct then a fairly substantial correction to the tables of results will be required.  At least one (I don't want to give more away) of the statistically significant results is no longer statistically significant, and many of the significant odds ratios are considerably smaller.  (There are also a couple of reporting errors in plain sight in the article itself.)

When I discovered these apparent issues back in 2016, I wrote to the lead author, who told me that s/he was rather busy and invited me to get in touch again after the summer. I did so, but s/he then didn't reply further. Oh well. People are indeed often very busy, and I can see how, just because one person who maybe doesn't understand everything that you did in your study writes to you, that perhaps isn't a reason to drop everything and start going through some calculations you ran more than a decade ago.  I let the matter drop at the time because I had other stuff to do, but a few weeks ago it stuck its nose up through the pile of assorted back burner projects (we all have one) and came to my attention again.

So, here's the project.  I want to recruit a few (ideally around three) people to independently reanalyse this article using the NHANES-III datasets and see if they come up with the same results as the original authors, or the same as me, or some different set of results altogether.  My idea is that, if several people working completely independently (within reason) come up with numbers that are (a) the same as each other and (b) different from the ones in the article, we will be well placed to submit a commentary article for publication in the journal (which has an impact factor over 5), suggesting that a correction might be in order. On the other hand, if it turns out that my analyses were wrong, and the article is correct, then I can send the lead author a note to apologise for the (brief) waste of time that my 2016 correspondence represented. Whatever the outcome, I hope that we will all learn something.

For the moment I'm not going to name the article here, because I don't want to have too many people running around reanalysing it outside of this "crowdsourced" project.  Of course, if you sign up to take part, I will tell you what the article is, and then I can't stop you shouting its DOI from the rooftops, but I'd prefer to keep this low-key for now.

If you would like to take part, please read the conditions below.

1. If the line below says "Still accepting offers", proceed. If it says "I have enough people who have offered to help", stop here, and thanks for reading this far.

========== I have enough people who have offered to help ==========

2. You will need a computer that either already has SAS on it, or on which you can install SAS (e.g., University Edition).  This is so that you can download and convert the NHANES data files yourself.  I'm not going to supply these, for several reasons: (a) I don't have the right to redistribute them, (b) I might conceivably have messed something up when converting them to CSV format, and (c) I might not even have the right files (although my sample sizes match the ones in the article pretty closely).  If you are thinking of volunteering, and you don't have SAS on your computer, please download SAS University Edition and make sure that you can get it to work.  (An alternative, if you are an adventurous programmer, is to download the data and SAS files, and use the latter as a recipe for splitting the data and adding variable names.)

3. You need to be reasonably competent at performing logistic regressions in SAS, or in a software package that can read SAS or CSV files.  I used R; the original authors used proprietary software (not SAS).  It would be great if all of the people who volunteered used different packages, but I'm not going to turn down anyone just because someone else wants to use the same analysis software. However, I'm also not going to give you a tutorial on how to run a logistic regression (not least because I am not remotely an expert on this myself).

4. Volunteers will be anonymous until I have all the results (to avoid, as far as possible, people collaborating with each other).  However, by participating, you accept that once the results are in, your name and your principal results may be published in a follow-up blog post. You also agree, in principle, to be a co-author on any letter to the editor that might result from this exercise.  (This point isn't a commitment to be signed in blood at this stage, but I don't want anyone to be surprised or offended when I ask if I can publish their results or use them to support a letter.)

5. If you want to work in a team on this with some colleagues, please feel free to do so, but I will only put one person's name forward per reanalysis on the hypothetical letter to the editor; others who helped may get an acknowledgement, if the journal allows.  Basically, ensure that you can say "Yes, I did most of the work on this reanalysis, I meet the criteria for co-authorship".

6. The basic idea is for you to work on your own and solve your own problems, including understanding what the original authors did.  The article is reasonably transparent about this, but it's not perfect and there are some ambiguities. I would have liked to have the lead author explain some of this, but as mentioned above, s/he appears to be too busy. If you hit problems then I can give you a minimum amount of help based on my insights, but of course the more I do that, the more we risk not being independent of each other. (That said, I could do with some help in understanding what the authors did at one particular point...)

7. You need to be able to get your reanalysis done by June 30, 2018.  This deadline may be moved (by me) if I have trouble recruiting people, but I don't want to repeat a recent experience where a couple of the people who had offered to help me on a project stopped responding to their e-mails for several months, leaving me to decide whether or not to drop them.  I expect that the reanalysis will take between 10 and 30 hours of your time, depending on your level of comfort with computers and regression analyses.

Are you still here? Then I would be very happy if you would decide whether you think this reanalysis is within your capabilities, and then make a small personal commitment to follow through with it.  If you can do that, please send me an e-mail (nicholasjlbrown, gmail) and I will give you the information you need to get started.

26 February 2018

The Cornell Food and Brand Lab story goes full circle, possibly scooping up much of social science research on the way, and keeps turning

Stephanie Lee of BuzzFeed has just published another excellent article about the tribulations of the Cornell Food and Brand Lab.  This time, her focus is on the p-hacking, HARKing, and other "questionable research practices" (QRPs) that seem to have been standard in this lab for many years, as revealed in a bunch of e-mails that she obtained via Freedom of Information (FoI) requests.  In a way, this brings the story back to the beginning.

It was a bit more than a year ago when Dr. Brian Wansink wrote a blog post (since deleted, hence the archived copy) that attracted some negative attention, partly because of what some people saw as poor treatment of graduate students, but more (in terms of the weight of comments, anyway) because it described what appeared to be some fairly terrible ways of doing research (sample: 'Every day she came back with puzzling new results, and every day we would scratch our heads, ask "Why," and come up with another way to reanalyze the data with yet another set of plausible hypotheses'). It seemed pretty clear that researcher degrees of freedom were a big part of the business model of this lab. Dr. Wansink claimed not to have heard of p-hacking before the comments started appearing on his blog post; I have no trouble believing this, because news travels slowly outside the bubble of Open Science Twitter.  (Some advocates of better scientific practices in psychology have recently claimed that major improvements are now underway. All I can say is, they can't be reviewing the same manuscripts that I'm reviewing.)

However, things rapidly became a lot stranger.  When Tim, Jordan, and I re-analyzed some of the articles that were mentioned in the blog post, we discovered that many of the reported numbers were simply impossible, which is not a result you'd expect from the kind of "ordinary" QRPs that are common in psychology.  If you decide to exclude some outliers, or create subgroups based on what you find in your data, your ANOVA still ought to give you a valid test statistic, and your means ought to be compatible with your sample sizes.
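
That compatibility check between a reported mean and a sample size is the GRIM test, and a minimal version of it fits in a few lines of Python (a sketch of the core idea, not the full published procedure):

```python
def grim_consistent(mean, n, decimals=2):
    """GRIM test: can a mean of n integer responses, reported to
    `decimals` places, actually occur? Try the nearby integer totals."""
    nearest = round(mean * n)
    return any(round(t / n, decimals) == round(mean, decimals)
               for t in (nearest - 1, nearest, nearest + 1))

print(grim_consistent(2.35, 20))  # True: 47/20 = 2.35 exactly
print(grim_consistent(5.19, 28))  # False: no integer total gives 5.19
```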

Then we found recycled text and tables of results, and strangely consistent numbers of responses to multiple surveys, and results that correlated .97 across studies with different populations, and large numbers of female WW2 combat veterans, and references that went round in circles, and unlikely patterns of responses. It seemed that nobody in the lab could even remember how old their participants were.  Clearly, this lab's output, going back 20 or more years to a time before Dr. Wansink joined Cornell, was a huge mess.

Amidst all that weirdness, it was possible to lose sight of the fact that what got everything started was the attention drawn to the lab by that initial blog post from November 2016, at which point most of us thought that the worst we were dealing with was rampant p-hacking.  Since then, various people have offered opinions on what might be going on in the lab; one of the most popular explanations has been, if I can paraphrase, "total cluelessness".  On this account, the head of the lab is so busy (perhaps at least partly due to his busy schedule of media appearances, testifying before Congress, and corporate consulting*), the management of the place so overwhelmed on a day-to-day basis, that nobody knows what is being submitted to journals, which table to include in which manuscript, which folder on the shared drive contains the datasets.  You could almost feel sorry for them.

Stephanie's latest article changes that, at least for me.  The e-mail exchanges that she cites and discusses seem to show deliberate and considered discussion about what to include and what to leave out, why it's important to "tweek" [sic] results to get a p value down to .05, which sets of variables to combine in search of moderators, and which types of message will appeal to the editors (and readers) of various journals.  Far from being chaotic, it all seems to be rather well planned to me; in fact, it gives just the impression Dr. Wansink presumably wanted to give in his blog post that led us down this rabbit hole in the first place. When Brian Nosek, one of the most diplomatic people in science, is prepared to say that something looks like research misconduct, it's hard to maintain that you're just in an argument with over-critical data thugs.

It's been just over eight hours since the BuzzFeed article appeared, on a Sunday evening in North America.  (This post was half-drafted, since I had an idea of what Stephanie was going to write about in her piece, having been interviewed for it.  I was just about to go to sleep when my phone buzzed to let me know that the article had gone live. I will try to forgive my fellow data thug for scooping me to get the first blog post about it online.) The initial social media response has been almost uniformly one of anger.  If there is a split (and it would seem to be mostly implicit for the moment), it's between those who think that the Cornell Food and Brand Lab is somehow exceptional, and those who think that it's just a particularly egregious example of what goes on all the time in many psychology labs. If you're reading this on the first day I posted it, you might still be able to cast your vote about this.  Sanjay Srivastava, who made that poll, also blogged a while back about a 2016 article by anthropologist David Peterson that described rather similar practices in three (unnamed) developmental psychology labs. The Peterson article is well worth reading; I suspected at the time, and I suspect even more strongly today, that what he describes goes on in a lot of places, although maybe the PIs in charge are smart enough not to put their p-hacking directives in e-mails (or, perhaps, all of the researchers involved work at places whose e-mails can't be demanded under FoI, which doesn't extend to private universities; as far as I know, Stephanie Lee obtained all of her information from places other than Cornell).

Maybe this anger can be turned into something good.  Perhaps we will see a social media-based movement, inspired by some of the events of the past year, for people to reveal some of the bad methodological stuff their PIs expect them to do. I won't go into any details here, partly because the other causes I'm thinking about are arguably more important than social science research and I don't want to appear to be hitching a ride on their bandwagon by proposing hashtags (although I wonder how many people who thought that they would lose weight by decanting their breakfast cereal into small bags are about to receive a diagnosis of type II diabetes mellitus that could have been prevented if they had actually changed their dietary habits), and partly because as someone who doesn't work in a lab, it's a lot easier for me to talk about this stuff than it is for people with insecure employment that depends on keeping a p-hacking boss happy.

Back to Cornell: we've come full circle.  But maybe we're just starting on the second lap.  Because, as I noted earlier, all the p-hacking, HARKing, and other stuff that renders p values meaningless still can't explain the impossible numbers, duplicated tables, and other stuff that makes this story rather different from what, I suspect, might (apart, perhaps, from the scale at which these QRPs are being applied) be "business as usual" in a lot of places. Why go to all the trouble of combining variables until a significant moderator shows up in SPSS or Stata, and then report means and test statistics that can't possibly have been output by those programs?  That part still makes no sense to me.  Nor does Dr. Wansink's claim that he and all his colleagues "didn't remember", when he wrote the correction to the "Elmo" article in the summer of 2017, that the study was conducted on daycare kids, when in February of that year he referred to daycare explicitly (and there are several other clues, some of which I've documented over the past year in assorted posts). And people with better memories than mine have noted that the "complete" releases of data that we've been given appear not to be as complete as they might be.  We are still owed another round of explanations, and I hope that, among what will probably be a wave of demands for more improvements in research practices, we can still find time to get to the bottom of what exactly happened here, because I don't think that an explanation based entirely on "traditional" QRPs is going to cover it.

* That link is to a Google cache from 2018-02-19, because for some reason, the web page for McDonald's Global Advisory Council gives a 404 error as I'm writing this. I have no idea whether that has anything to do with current developments, or if it's just a coincidence.