09 May 2019

An update on our examination of the research of Dr. Nicolas Guéguen

(Joint post by Nick Brown and James Heathers)

It's now well over a year since we published our previous blog post about the work of Dr. Nicolas Guéguen. Things have moved on since then, so here is an update.

*** Note: We have received a reply from the Scientific Integrity Officer at the University of Rennes-2, Alexandre Serres. See the update of 2019-05-22 at the bottom of this post ***

We have seen two documents from the Scientific Integrity Officer at the University of Rennes-2, which appears to have been the institution charged with investigating the apparent problems in Dr. Guéguen's work. The first of these dates from June 2018 and is entitled (our translation from French), "Preliminary Investigation Report Regarding the Allegations of Fraud against Nicolas Guéguen".

Unfortunately, we have been told that we are not entitled to disseminate this document further; it is considerably more trenchant in its criticism of Dr. Guéguen's work than its successor, described in the next paragraph of this blog post. We would also like to stress that the title of this document is seriously misleading. We have not made, and do not make, any specific allegations of fraud, nor are any implied. The initial document that we released is entitled “A commentary on some articles by Dr. Nicolas Guéguen” and details a long series of inconsistencies in research methods, procedures, and data. The words “fraud” and “misconduct” do not appear in that document, nor in any of our communications with the people who helped with the investigation. We restrict ourselves to pointing out that results are “implausible” (p. 2) or that scenarios are “unlikely [to] be enacted in practice” (p. 31).

The origin of inconsistencies (be it typographical errors, inappropriate statistical methods, analytical mistakes, inappropriate data handling, misconduct, or something else) is also irrelevant to the outcome of any assessment of research. Any research object with a strong and obvious series of inconsistencies may be deemed too inaccurate to trust, irrespective of their source. In other words, the description of inconsistency makes no presumption about the source of that inconsistency.

The second document, entitled "Memorandum of Understanding Regarding the Allegations of Lack of Scientific Integrity Concerning Nicolas Guéguen", is dated October 2018, and became effective on 10 December 2018. It describes the outcome of a meeting held on 10 September 2018 between (1) Dr. Guéguen, (2) the above-mentioned Scientific Integrity Officer, (3) a representative from the University of Rennes-2 legal department, and (4) an external expert who was, according to the report, "contacted by [Brown and Heathers] at the start of their inquiry". (We are not quite certain who this last person is, although the list of candidates is quite short.)

The Memorandum of Understanding is, frankly, not very hard-hitting. Dr. Guéguen admits to some errors in his general approach to research, notably using the results of undergraduate fieldwork projects as the basis of his articles, and he agrees that within three months of the date of effect of the report, he will retract two articles: "High heels increase women's attractiveness" in Archives of Sexual Behavior (J1) and "Color and women hitchhikers’ attractiveness: Gentlemen drivers prefer red" in Color Research and Application (J2). Recall that our original report into problems with Dr. Guéguen's research listed severe deficiencies in 10 articles; the other eight are barely mentioned.

On the question of Dr. Guéguen's use of undergraduate fieldwork: We were contacted in November 2018 by a former student from Dr. Guéguen's class, who gave us some interesting information. Here are a few highlights of what this person told us (our translation from French):
I was a student on an undergraduate course in <a social science field>. ... The university where Dr. Guéguen teaches has no psychology department. ... As part of an introductory class entitled "Methodology of the social sciences", we had to carry out a field study. ... This class was poorly integrated with the rest of the course, which had nothing to do with psychology. As a result, most of the students were not very interested in this class. Plus, we were fresh out of high school, and most of us knew nothing about statistics. Because we worked without any supervision, yet the class was graded, many students simply invented their data. I can state formally that I personally fabricated an entire experiment, and I know that many others did so too. ... At no point did Dr. Guéguen suggest to us that our results might be published.
Our correspondent also sent us an example of a report of one of these undergraduate field studies. This report had been distributed to the class by Dr. Guéguen himself as an example of good work by past students, and has obvious similarities to his 2015 article "Women’s hairstyle and men’s behavior: A field experiment". It was written by a student workgroup from such an undergraduate class, who claimed to have conducted similar tests on passers-by; the most impressive of the three sets of results (on page 7 of the report) was the one that appeared in the published article. The published version also contains some embellishments to the experimental procedure; for example, the article states that the confederate walked "in the same direction as the participant about three meters away" (p. 638), a detail that is not present in the original report by the students. A close reading of the report, combined with our correspondent's comments about the extent of admitted fabrication of data by the students, leads us to question whether the field experiments were carried out as described (for example, it is claimed that the three students tested 270 participants between them in a single afternoon, which is extraordinarily fast progress for this type of fieldwork).

(As we mentioned in our December 2017 blog post, at one point in our investigation Dr. Guéguen sent us, via the French Psychological Society, a collection of 25 reports of field work carried out by his students. None of these corresponded to any of the articles that we critiqued. Presumably he could have sent us the report that appears to have become the article "Women’s hairstyle and men’s behavior: A field experiment", but apparently he chose not to do so. Note also that the Memorandum of Understanding does not list this article as one that Dr. Guéguen is required to retract.)

We have made a number of documents available at https://osf.io/98nzj/, as follows:
  • "20190509 Annotated Guéguen report and response.pdf" will probably be of most relevance to non French-speaking readers. It contains the most relevant paragraphs of the Memorandum of Understanding, in French and (our translation) English, accompanied by our responses in English, which then became the basis of our formal response.
  • "Protocole d'accord_NG_2018-11-29.pdf" is the original "Memorandum of Understanding" document, in French.
  • "20181211 Réponse Brown-Heathers au protocole d'accord.pdf" is our formal response, in French, to the "Summary" document.
  • "20190425 NB-JH analysis of Gueguen articles.pdf" is the latest version of our original report into the problems we found in 10 articles by Dr. Guéguen.
  • "Hairstyle report.pdf" is the student report of the fieldwork (in French) with a strong similarity to the article "Women’s hairstyle and men’s behavior: A field experiment", redacted to remove the names of the authors.
Alert readers will have noted that almost five months have elapsed since we wrote our response to the "Memorandum of Understanding" document. We have not commented publicly since then, because we were planning to publish this blog post in response to the first retraction of one of Dr. Guéguen's articles, which could either have been one that he was required to retract by the agreement, or one from another journal. (We are aware that at least two other journals, J3 and J4, are actively investigating multiple articles by Dr. Guéguen that they published.)

However, our patience has now run out. The two articles that Dr. Guéguen was required to retract are still untouched on the respective journals' websites, and our e-mails to the editors of those journals asking if they have received a request to retract the articles have gone unanswered (i.e., we haven't even been told to mind our own business) after several weeks and a reminder. No other journal has yet taken any action in the form of a retraction, correction, or expression of concern.

All of this leaves us dissatisfied. The Memorandum of Understanding notes on page 5 that Dr. Guéguen has 336 articles on ResearchGate published between 1999 and 2017. We have read approximately 40 of these articles, and we have concerns about the plausibility of the methods and results in a very large proportion of those. Were this affair to be considered closed after the retraction of just two articles (neither of which is the one that appears to have been based, without attribution, on the work of the author’s own students), a substantial number of serious inconsistencies would remain unresolved.

Accordingly, we feel it would be prudent for the relevant editors of journals in psychology, marketing, consumer behaviour, and related disciplines to take action. In light of what we now know about the methods deployed to collect the student project data, we do not think it would be excessive for every article by Dr. Guéguen to be critically re-examined by one or more external reviewers.

[ Update 2019-05-09 15:03 UTC: An updated version of our comments on the Memorandum of Understanding was uploaded to fix some minor errors, and the filename listed here was changed to reflect that. ]

[ Update 2019-05-09 18:51 UTC: Fixed a couple of typos. Thanks to Jordan Anaya. ]

[ Update 2019-05-10 16:33 UTC: Fixed a couple of typos and stylistic errors. ]

[ Update 2019-05-22 15:51 UTC:
We have received a reply to our post from Alexandre Serres, who is the Scientific Integrity Officer at the University of Rennes-2. This took the form of a 3-page document (in both French and English versions) that did not fit into the comments box of a Blogger.com post, so we have made these two versions available at our OSF page. The filenames are "Réponse_billet de Brown et Heathers_2019-05-20.pdf" (in French) and "Réponse_billet de Brown et Heathers_2019-05-20 EN" (in English).

We have also added a document that was created by the university before the inquiry took place (filename "Procédure_traitement des allégations de fraude_Univ Rennes2_2018-01-31.pdf"), which established the ground rules and procedural framework for the inquiry into Dr. Guéguen's research.

We thank Alexandre Serres for these clarifications, and would only add that, while we are disappointed in the outcome of the process in terms of the very limited impact that it seems to have had on the problems that we identified in the public literature, we do not have any specific criticisms of the way in which the procedure was carried out.
]


01 May 2019

The results of my crowdsourced reanalysis project

Just over a year ago, in this post, I asked for volunteers to help me reanalyze an article that I had read entirely by chance, and which seemed to have a few statistical issues. About ten people offered to help, and three of them (Jan van Rongen, Jakob van de Velde, and Matt Williams) stayed the course. Today we have released our preprint on PsyArXiv detailing what we found.

The article in question is "Is Obesity Associated with Major Depression? Results from the Third National Health and Nutrition Examination Survey" (2003) by Onyike, Crum, Lee, Lyketsos, and Eaton. This has 951 citations according to Google Scholar, making it quite an important paper in the literature on obesity and mental health. As I mentioned in my earlier blog post, I contacted the lead author, Dr. Chiadi Onyike, when I first had questions about the paper, but our correspondence petered out before anything substantial was discussed.

It turns out that most of the original problems that I thought I had found were due to me misunderstanding the method; I had overlooked that the authors had used a weighted survey design. However, even within this design, we found a number of issues with the reported results. The power calculations seem to be post hoc and may not have been carried out appropriately; this makes us wonder whether the main conclusion of the article (i.e., that severe obesity is strongly associated with major depressive disorder) is well supported. There are a couple of simple transcription errors in the tables, which as a minimum seem to merit a correction. There are also inconsistencies in the sample sizes.
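For readers who have not worked with complex survey data before, the sketch below shows the general shape of a design-based analysis in R using the survey package. The variable names (psu, stratum, examweight, and so on) and the model are illustrative placeholders only; they are not the actual NHANES III variable names, and this is not the exact model from our preprint.

library(survey)   # design-based analysis of complex survey data

# Illustrative placeholders only: not the real NHANES III variable names.
nhanes <- read.csv("nhanes3_subset.csv")

des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~examweight,
                 nest = TRUE, data = nhanes)

# Design-adjusted logistic regression of major depression on BMI category.
# Ignoring the weights and the clustering (as I originally did when reading
# the article) changes the standard errors and can change the estimates too.
fit <- svyglm(depression ~ bmi_class + sex + age,
              design = des, family = quasibinomial())
summary(fit)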

I should make it clear that there is absolutely no suggestion of any sort of misconduct here. Standards of reproducibility have advanced considerably since Onyike et al.'s article was published, as has our understanding of statistical power; and the remaining errors are of the type that anyone who has tried to assemble results from computer output into a manuscript will recognise.

I think that all four of us found the exercise interesting; I know I did. Everyone downloaded the publicly available dataset separately and performed their analyses independently, until we pooled the results starting in October of last year. We all did our analyses in R, although I had hoped for more diversity (especially if someone had used Stata, which is what the original authors used); however, this had the advantage that I was able to combine everybody's contribution into a single script file. You can find the summary of our analyses in an OSF repository (the URL for which is in the preprint).

We intend to submit the preprint for publication, initially to the American Journal of Epidemiology (where the original article first appeared). I'll post here if there are any interesting developments.

If you have something to say about the preprint, or any questions or remarks that you might have about this way of doing reanalyses, please feel free to comment!

19 February 2019

Just another week in real-world science: Butler, Pentoney, and Bong (2017).

This is a joint post by Nick Brown and Stuart Ritchie. All royalty cheques arising from the post will be split between us, as will all the legal bills.

Today's topic is this article:
Butler, H. A., Pentoney, C., & Bong, M. P. (2017). Predicting real-world outcomes: Critical thinking ability is a better predictor of life decisions than intelligence. Thinking Skills and Creativity, 25, 38–46. http://dx.doi.org/10.1016/j.tsc.2017.06.005

We are not aware of any official publicly available copies of this article, but readers with institutional access to Elsevier journals should have no trouble in finding it, and otherwise we believe there may exist other ways to get hold of a copy using the DOI.

Butler et al.'s article received some favourable coverage when it appeared, including in Forbes, Psychology Today, the BPS Digest, and an article by the lead author in Scientific American that was picked up by the blog of the noted skeptic (especially of homeopathy) Edzard Ernst. Its premise is that the ability to think critically (measured by an instrument called the Halpern Critical Thinking Assessment, HCTA) is a better predictor than IQ (measured with a set of tests called the Intelligence Structure Battery, or INSBAT) of making life decisions that lead to negative outcomes, measured by the Real-World Outcomes (RWO) Inventory, which was described by its creator in a previous article (Butler, 2012).

In theory, we would expect both critical thinking and IQ to protect against negative life experiences. The correlations between both predictors and the outcome in this study would thus be expected to be negative, and indeed they were. For critical thinking the correlation was −.330 and for IQ it was −.264. But is this a "significant" difference?

To test this, Butler et al. conducted a hierarchical regression, entering IQ (INSBAT) and then critical thinking (HCTA) as predictors. They concluded that, since the difference in R² when the second predictor (HCTA) was added was statistically significant, this indicated that the difference between the correlations of each predictor with the outcome (the correlation for HCTA being the larger) was also significant. But this is a mistake. On its own, the fact that the addition of a second predictor variable to a model causes a substantial increase in R² might tell us that both variables add incrementally to the prediction of the outcome, but it tells us nothing about the relative strength of the correlations between the two predictors and the outcome. This is because the change in R² is also dependent on the correlation between the two predictors (here, .380). The usual way to compare the strength of two correlations, taking into account the third variable, is to use Steiger’s z, as shown by the following R code:


> library(cocor)
> cocor.dep.groups.overlap(-.264, -.330, .380, 244, "steiger1980", alt="t")
<some lines of output omitted for brevity>
 z = 0.9789, p-value = 0.3276

So the Steiger’s z test tells us that there’s no statistically significant difference between the sizes of these two (dependent) correlations in this sample, p = .328.

We noted a second problem, namely that the reported bivariate correlations are not compatible with the results of the regression reported in Table 2. In a multiple regression model, the standardized regression coefficients are determined (only) by the pattern of correlations between the variables, and in the case of the two-predictor regression, these coefficients can be determined by a simple formula. Using that formula, we calculated that the coefficients for INSBAT and HCTA in model 2 should be −.162 and −.268, respectively, whereas Butler et al.’s Table 2 reports them as −.158 and −.323. When we wrote to Dr. Butler in July 2017 to point out these issues, she was unable to provide us with the data set, but she did send us an SPSS output file in which neither the correlations nor the regression coefficients exactly matched the values reported in the article.
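For anyone who wants to check the arithmetic behind the −.162 and −.268 figures, here is the calculation in R, using the correlations reported in the article. With two predictors, the standardized coefficient for the first predictor is (r_y1 − r_y2 × r_12) / (1 − r_12²), and symmetrically for the second.

# Correlations reported by Butler et al.
r_y1 <- -0.264   # INSBAT (IQ) with RWO
r_y2 <- -0.330   # HCTA (critical thinking) with RWO
r_12 <-  0.380   # INSBAT with HCTA

beta_insbat <- (r_y1 - r_y2 * r_12) / (1 - r_12^2)
beta_hcta   <- (r_y2 - r_y1 * r_12) / (1 - r_12^2)
round(c(INSBAT = beta_insbat, HCTA = beta_hcta), 3)   # -0.162 and -0.268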

There was a very minor third problem: The coefficient of .264 in the first cell of Table 2 is missing its minus sign. (Dr. Butler also noticed that there was an issue with the significance stars in this table.)

We wrote to the two joint editors-in-chief of Thinking Skills and Creativity in November 2017. They immediately indicated that they would handle the points that we had raised with the "journal management team" (i.e., Elsevier). We found this rather surprising, as we had only raised scientific issues that we imagined would be entirely an editorial matter. Over the following year we occasionally sent out messages asking if any progress had been made. In November 2018, we were told by the Elsevier representative that following a review of the Butler et al. article by two independent reviewers who are "senior statistical experts in this field", the journal had decided to issue a correction for... the missing minus sign in Table 2. And nothing else.

We were, to say the least, somewhat disappointed by this. We wrote to ask for a copy of the report by these senior statistical experts, but received no reply (and, after more than three months, we guess we aren't going to get one). Perhaps the experts disagree with us about the relevance of Steiger's z, but the inconsistencies between the correlations and the regression coefficients are a matter of simple mathematics and the evidence of numerical discrepancies between the authors' own SPSS output and the published article is indisputable.

So apparently Butler et al.'s result will stand, and another minor urban legend with no empirical support will be added to the folklore of "forget IQ, you just have to work hard (and I can show you how for only $499)" coaches. Of course, both of us are in favour of critical thinking. We just wish that people involved in publishing research about it were as well.

We had been planning to wait for the correction to be issued before we wrote this post, but as far as we can tell it still hasn't appeared (well over a year since we originally contacted the editors, and 19 months since we first contacted the authors). Some recent events make us believe that now would be an appropriate moment to bring this matter to public attention. Most important among these are the two new papers from Ben Goldacre and his team, showing what (a) editors and (b) researchers did when problems were pointed out in medical trial study protocols (spoiler: very often, not much). Then the inimitable James Heathers tweeted this thread expressing some of the frustrations that he (sometimes abetted by Nick) has had when trying to get editors to fix problems. And last week we also saw the case of a publisher taking a ridiculous amount of time to retract an article that had been published in one of their journals after it was stolen, accompanied by an editorial note of the "move along, nothing to see here" variety.

There seems to be a real problem with academic editors, especially those at the journals of certain publishers, being reluctant, unwilling, or unable to take action on even the simplest problems without the approval of the publisher, whose evaluation of the situation may be based as much on the need to save face as to correct the scientific record.

A final anecdote: One of us (Nick) has been told of a case where the editor would like to retract at least two fraudulent articles but is waiting for the publisher (not Elsevier, in that case) to determine whether the damage to their reputation caused by retracting would be greater than that caused by not retracting. Is this really the kind of consideration to which we want the scientific literature held hostage?



References

Butler, H. A. (2012). Halpern critical thinking assessment predicts real-world outcomes of critical thinking. Applied Cognitive Psychology, 26, 721–729. http://dx.doi.org/10.1002/acp.2851

17 December 2018

Have scientists found an explanation for the onset of ME/CFS?

In this post I'm going to discuss this article (the link leads to a page from which you can download the PDF; I'm not sure if this will last), which appeared today (17 December 2018):

Russell, A., Hepgula, N., Nikkheslat, N., Borsini, A., Zajkowska, Z., Moll, N., . . . Pariante, C. M. (2018). Persistent fatigue induced by interferon-alpha: A novel, inflammation-based, proxy model of chronic fatigue syndrome. Psychoneuroendocrinology. Advance online publication. http://dx.doi.org/10.1016/j.psyneuen.2018.11.032

(Notes for nerds: (1) The article date will become 2019 when it appears in print; (2) There are 20 named authors, so I'm glad that APA referencing style only requires me to list the first six and the last one. I will be calling it "the article" or "the study" or "Russell et al." henceforth.)

Before I start, a small disclosure. In 2015, a colleague and I had a manuscript desk-rejected by Psychoneuroendocrinology for what we considered inadequate reasons. This led to a complaint to the Committee on Publication Ethics and a change in the journal's editorial policies, but unfortunately did not result in our article being sent out for review; it was subsequently published elsewhere. My interest in the Russell et al. article arose for entirely unrelated reasons, and I only discovered the identity of the journal after deciding to look at it. So, to the extent that one's reasoning can ever be free of motivation, I don't believe that my criticisms of the article that follow here are related to the journal in which it appeared. But it seems like a good idea to mention this, in case the editor-in-chief of the journal is reading this post and recognises my name.

Media coverage


This article is getting a fair amount of coverage in the UK media today, for example at the BBC, the Mail Online, the Independent, and the Guardian (plus some others that are behind a paywall). The simplified story that these outlets are telling is that "chronic fatigue syndrome is real and is caused by [the] immune system" (Mail Online) and that the study "challenges stigma that chronic fatigue is 'all in the mind'" (Independent). Those are hopeful-sounding messages for ME/CFS patients, but I'm not sure that such conclusions are justified.

I was made aware of this article by a journalist friend, who had received an invitation to attend a press briefing for the article at the Science Media Centre in London on Friday 14 December. By a complete coincidence I was in London that morning and decided to go along. I was allowed in without a press pass after identifying myself as a researcher, but when I tried to get clarification of a point that had been made during the presentation I was told that only reporters (i.e., not researchers or other members of the public) were allowed to ask questions. This was a little annoying at the time, but on reflection it seems fair enough since time is limited and the event was organised for journalists, not for curious researchers with a little time on their hands. There were about 10 journalists present, from most of the major UK outlets.

You can get a summary of the study from the media pieces linked above (the Guardian's coverage by Nicola Davis is particularly good). If you haven't seen the media articles, go and read them now, and then come back to this post. There was also a press release. I suggest that you also read the Russell et al. article itself, although it does get pretty technical.

What did the study claim to show?


Here's my summary of the study: The participants were 55 people with hepatitis C who were about to undergo interferon-alpha (IFN-α) treatment. The treatment lasted 24, 36, or 48 weeks. At five time points (at the start of treatment, then after 4 weeks, 8 weeks, 12 weeks, and at the end of treatment, whenever that might have been), patients were asked about their levels of fatigue and also had their cytokine levels (a measure of activity in the immune system) tested. These tests were then repeated six months after the end of treatment. Patients were also assessed for depression, stressful life events, and childhood trauma.

Interferon-alpha occurs naturally in the body as part of the immune system, but it can also be injected to fight diseases in doses that are much greater than what your body can produce. It's sometimes used as an adjunct to chemotherapy for cancer. IFN-α treatment often has substantial fatigue as a side effect, although this fatigue typically resolves itself gradually after treatment ends. But six months after they finished their treatment, 18 of the 55 patients in this study had higher levels of fatigue than when they started treatment. These patients are referred to as the PF ("persistent fatigue") group, compared to the 37 whose fatigue more or less went away, who are the RF ("resolved fatigue") group.

The authors' logic appears to run like this:
1. Some people still have a lot of fatigue six months after the end of a 24/36/48-week long course of treatment with IFN-α for hepatitis C.
2. Maybe we can identify what it is about those people (one-third of the total) that makes them slower to recover from their fatigue than the others.
3. ME/CFS patients are people who have fatigue long after the illness that typically preceded the onset of their condition. (It seems to be widely accepted by all sides in the ME/CFS debate that a great many cases occur following an infection of some kind.)
4. Perhaps what is causing the onset of fatigue after their infectious episode in ME/CFS patients is the same thing causing the onset of fatigue after IFN-α treatment in the hepatitis C patients.

Russell et al.'s claim is that patients who went on to have persistent fatigue (versus resolved fatigue) at a point six months after the end of their treatment, had also had greater fatigue and cytokine levels when they were four weeks into their treatment (i.e., between 46 and 70 weeks before their persistent fatigue was measured, depending on how long the treatment lasted). On this account, something that happened at an early stage of the procedure determined how well or badly people would recover from the fatigue induced by the treatment, once the treatment was over.

Just to be clear, here are some things that Russell et al. are not claiming. I mention these partly to show the limited scope of their article (which is not necessarily a negative point; all scientific studies have a perimeter), but also to make things clearer in case a quick read of the media coverage has led to confusion in anyone's mind.
- Russell et al. are not claiming to have identified the cause of ME/CFS.
- Russell et al. are not claiming to have identified anything that might cure ME/CFS.
- Russell et al. are not claiming to have demonstrated any relation between hepatitis C and ME/CFS.
- Russell et al. are not claiming to have demonstrated any relation between interferon-alpha --- whether this is injected during medical treatment or naturally produced in the body by the immune system --- and ME/CFS. They do not suggest that any particular level of IFN-α predicts, causes, cures, or is in any other way associated with ME/CFS.
- Russell et al. are not claiming to have demonstrated any relation between a person's current cytokine levels and their levels of persistent fatigue subsequent to interferon-alpha treatment for hepatitis C. (As they note on p. 7 near the bottom of the left-hand column, "we ... find that cytokines levels do not distinguish [persistent fatigue] from [resolved fatigue] patients at the 6-month followup".)
- Russell et al. are not claiming to have demonstrated any relation between a person's current cytokine levels and their ME/CFS status. (As Table 2 shows, cytokine levels are comparable between ME/CFS patients and the healthy general population.)

Some apparent issues


Here are some of the issues I see with this article in terms of its ability to tell us anything about ME/CFS.

1. This was not a study of ME/CFS patients

It cannot be emphasised enough that none of the patients in this study had a diagnosis of ME/CFS, either at the start or the end of the study, and this greatly limits the generalisability of the results. (To be fair, the authors go into some aspects of this issue in their Discussion section on the left-hand side of p. 8, but the limitations of any scientific article rarely make it into the media coverage.) We don't know how long ago these hepatitis C patients were treated with interferon-alpha and subsequently tested at follow-up, or if any of them still had fatigue another six or 12 months later, or if they ever went on to receive a diagnosis of ME/CFS. One of the criteria for such a diagnosis is that unresolved fatigue lasts longer than six months (so it would have been really useful to have had a further follow-up). But in any case, the fatigue that Russell et al. studied was, by definition, not of sufficient duration to count as "chronic fatigue syndrome" (and, of course, there are several other criteria that need to be met for a diagnosis of ME/CFS; chronic fatigue by itself is a lot more common than full-on ME/CFS). I feel that it is therefore rather questionable to refer to "the presence of the CFS phenotype... for ... IFN-α-induced persistent fatigue" (last sentence on p. 5). Maybe this is just an oversight, but even the description of persistent fatigue as "the CFS-like phenotype", used at several other points in the article, is also potentially somewhat loaded.

Furthermore, the patients in this study were people whom we would have expected to be fatigued, at least throughout their treatment. IFN-α treatment knocks you about quite a bit. Additionally, fatigue is also a common symptom of hepatitis C infection, which makes me wonder whether some of the patients with "persistent fatigue" maybe just had a slightly higher degree of fatigue from their underlying condition rather than the IFN-α treatment --- the definition of persistent fatigue was any score on the Chalder Fatigue Scale that was higher than baseline, presumably even by one point (and, theoretically, even if the score was 0 at baseline and 1 six months after treatment ended). So Russell et al. are comparing people who are recovering faster or slower from fatigue that is entirely expected both from the condition that they have and the treatment that they underwent, with ME/CFS patients in whom the onset of fatigue is arguably the thing that needs to be explained.

There are many possible causes of fatigue, and I don't think that the authors have given us any good reason to believe that the fatigue reported by their hepatitis C patients six months after finishing an exhausting medical procedure, which itself lasted for half a year or more, was caused by the same mechanism (whatever that might be) as the multi-year ongoing fatigue in ME/CFS patients. This is especially so because, for all we know, some or all of the 18 cases of persistent fatigue might have been only marginal (i.e., a small amount worse than baseline) or might have resolved themselves within a few months.

2. Is post-treatment fatigue really unrelated to cytokine levels?

It can be seen from Table 2 of the article that the people with "persistent fatigue" (i.e., the hepatitis C patients who were still fatigued six months after finishing treatment) still had elevated cytokine levels at that point, compared to samples of both healthy people and ME/CFS patients. Indeed, these cytokine levels were similarly high in patients whose fatigue had not persisted. The authors ascribe these higher levels of cytokines to the IFN-α treatment; their argument then becomes that, since both the "resolved fatigue" and "persistent fatigue" groups had similar cytokine levels, albeit much higher than in healthy people, that can't be what was causing the difference in fatigue in this case. But I'm not sure they have done enough to exclude the possibility of those high cytokine levels interacting with something else in the PF group. (I must apologise to my psychologist friends here for invoking the idea of a hidden moderator.) Their argument appears to be based on the assumption that ME/CFS-type fatigue and post-IFN-α-treatment fatigue have a common cause, which remains unexplained; however, in the absence of any evidence of what that mechanism might be, this assumption seems to be based mainly on speculation.

3. Statistical limitations

The claim that the difference in fatigue at six-month follow-up was related to a difference in cytokine levels four weeks into the treatment does not appear to be statistically robust. The headline claim --- that fatigue was greater after four weeks in patients who went on to have persistent fatigue --- has a p value of .046, and throughout the article, many of the other focal p values are either just below .05 or even slightly higher, with values in the latter category being described as, for example, "a statistical trend towards higher fatigue" (p. 4). But in the presence of true effects, we would expect a preponderance of much smaller p values.

Russell et al. also sometimes seem to take a creative approach to what counts as a meaningful result. For example, at the end of section 3.1, the authors consider a p value of .09 from a test to represent "trend-statistical significance" (p. 4) and at the start of section 3.2 they invoke another p value of .094 as showing that "IL-6 values in [persistent fatigue] subjects ... remained higher at [the end of treatment]" (p. 5), but in the sentence immediately preceding the latter example, they treat a p value of .12 as indicating that there was "no significant interaction" (p. 5).

These borderline p values should also be considered in the light of the many other analyses that the authors could have performed. For example, they apparently had all the necessary data to perform the comparisons after eight weeks of treatment, after 12 weeks of treatment, and at the end of treatment, as well as the four-week results that they mainly reported. None of the eight-week or 12-week results appear in the article, and the two from the end of treatment are extremely unconvincingly argued (see previous paragraph). It is possible that the authors simply did not perform any tests on these results, but I am inclined to believe that they did run these tests and found that they did not provide support for their hypotheses.

There is also a question of whether we should be using .05 as our criterion for statistical significance with these results. (I won't get into the separate discussion of whether we should be using statistical significance as a way of determining scientific truth at all; that ship has sailed, and until it voluntarily returns to port, we are where we are.) Towards the bottom of the left-hand column of p. 8, we read:
Finally, due to the sample size there was no correction for multiple comparisons; however, we aimed to limit the number of statistical comparisons by pre-selecting the cytokines to measure at the different stages of the study.
It's nice that the authors pre-selected their predictors, but that is not sufficient. If (as seems reasonable to assume) they also tested the differences between the groups at eight or 12 weeks into the treatment, and found that the results were not significantly different, they should have adjusted their threshold for statistical significance accordingly. The fact that they did not have a very large sample size is not a valid reason not to do this, so I am slightly perplexed by the term "due to" in the sentence quoted above. (The sample size was, indeed, very small. Not only were there only 55 people in total; there were only 18 people in the condition of principal interest, displaying the "CFS-like phenotype". Under these conditions, any effect would have to be very large to be detected reliably.)
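To put a rough number on that last remark (this is my own back-of-the-envelope calculation, not anything from the article), the pwr package in R will tell us the smallest effect that a comparison of 18 versus 37 patients can detect with conventional 80% power:

library(pwr)

# Smallest standardized difference (Cohen's d) detectable with 80% power
# at alpha = .05 (two-sided) when comparing groups of 18 and 37 people.
pwr.t2n.test(n1 = 18, n2 = 37, sig.level = 0.05, power = 0.80)
# d comes out at roughly 0.8, conventionally considered a large effect.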

Conclusion


I don't find Russell et al.'s study to be very convincing. My guess is that different cytokine levels do not predict fatigue in either hepatitis C/IFN-α patients or ME/CFS patients, and that the purported relation between cytokine levels at four weeks into the IFN-α treatment and subsequent fatigue may well just be noise. In terms of explaining how ME/CFS begins, let alone how we might prevent or cure it, this study may not get us any closer to the truth.

18 October 2018

Just another week in real-world science: de Venter et al. (2017)

(Note: This post, as with all my blog posts, represents only my own opinions, and not those of any organizations with which I am affiliated, or anyone who works for those organizations.)

Someone sent me a link to this article and asked what I thought of it.

De Venter, M., Illegems, J., Van Royen, R., Moorkens, G., Sabbe, B. G. C., & Van Den Eede, F. (2017). Differential effects of childhood trauma subtypes on fatigue and physical functioning in chronic fatigue syndrome. Comprehensive Psychiatry, 78, 76–82. http://dx.doi.org/10.1016/j.comppsych.2017.07.006

The article describes an investigation into possible relations between various negative childhood events (as measured by the Traumatic Experiences Checklist [TEC]) and impaired functioning (fatigue, as measured by the Checklist Individual Strength [CIS] scale, and general health and well-being, as measured by the well-known SF-36 scale). The authors' conclusions, from the abstract, were fairly unequivocal: "... sexual harassment emerged as the most important predictor of fatigue and poor physical functioning in the CFS patients assessed. These findings have to be taken into account [emphasis added] in further clinical research and in the assessment and treatment of individuals coping with chronic fatigue syndrome." In other words, as of the publication of this article, the authors believe that the assessment of past sexual harassment should be an integral part of the assessment and treatment of people with the condition widely known as Chronic Fatigue Syndrome (I will use that term for simplicity, although I appreciate that some people prefer alternatives).

The main results are in Table 3, which I have reproduced here (I hope that Elsevier's legal department will agree that this counts as fair use):

Table 3 from de Venter et al., 2017. Red lines added by me. Bold highlighting of the p values below .05 by the authors.

The article is quite short, with the Results section focusing on the two standardized (*) partial regression coefficients (henceforth, "betas") that I've highlighted in red here, which have associated p values below .05, and the Discussion section focusing on the implications of these.

There are a couple of problems here.

First, there are five regression coefficients for each dependent variable, of which just one per DV has a p value below .05 (**). It's not clear to me why, in theoretical terms, childhood experiences of sexual harassment (but not emotional neglect, emotional abuse, bodily threat, or sexual abuse) should be a good predictor of fatigue (CIS) or general physical functioning (SF-36). The authors define sexual harassment as "being submitted to sexual acts without physical contact" and sexual abuse as "having undergone sexual acts involving physical contact". I'm not remotely qualified in these matters, but it seems to me that with these definitions, "sexual abuse" would probably be expected to lead to more problems in later functioning than "sexual harassment". Indeed, I find it difficult to imagine how one could be subjected to "abuse with contact" while not also being subjected to "abuse without contact", more or less by the nature of "sexual acts". (I apologise if this whole topic makes you feel uneasy. It certainly makes me feel that way.)

It seems unlikely that the specific hypothesis that "sexual harassment (but not emotional neglect, emotional abuse, bodily threat, or sexual abuse) will be a significant predictor of fatigue and [impaired] general functioning among CFS patients" was made a priori. And indeed, the authors tell us in the last sentence of the introduction that there were no specific a priori hypotheses: "Thus, in the present study, we examine the differential impact of subtypes of self-reported early childhood trauma on fatigue and physical functioning levels in a well-described population of CFS patients" (p. 77). In other words, they set out to collect some data, run some regressions, and see what emerged. Now, this can be perfectly fine, but it's exploratory research. The conclusions you can draw are limited, and the interpretation of p values is unclear (de Groot, 1956). Any results you find need to be converted into specific hypotheses and given a severe test (à la Popper) with new data.

The second problem is a bit more subtle, but it illustrates the danger of running complex multiple regressions, and especially of reporting only the regression coefficients of the final model. For example, there is no measure of the total variance explained by the model (R²), or of the increase in R² from the model with just the covariates to the model where the variable of interest is added. (Note that only 29 of the 155 participants reported any experience of childhood sexual harassment at all. You might wonder how much of the total variance in a sample can be explained by a variable for which 80% of the participants had the same score, namely zero.) All we have is statistically significant betas, which doesn't tell us a lot, especially given the following problem.

Take a look at the betas for sexual harassment. You will see that they are both greater (in magnitude) than 1.0. That is, the beta for sexual harassment in each of the regressions in de Venter et al.'s article must be considerably larger in magnitude than the zero-order correlation between sexual harassment and the outcome variable (CIS or SF-36), which of course cannot exceed 1 in magnitude. For SF-36, for example, even if the original correlation was -.90, the corresponding beta is twice as large. If you have trouble thinking about what a beta coefficient greater than 1.0 might mean, I recommend this excellent blog post by David Disabato.

(A quick poll of some colleagues revealed that quite a few researchers are not even aware that a beta coefficient above 1.0 is even "a thing". Such values tend to arise only when there is substantial correlation between the predictors. For two predictors there are analytically complete solutions to predict the exact circumstances under which this will happen --- e.g., Deegan, 1978 --- but beyond that you need a numerical solution for all but the most trivial cases. The rules of matrix algebra, which govern how multiple regression works, are deterministic, but their effects are often difficult to predict from a simple inspection of the correlation table.)
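To make this concrete, here is a small numerical illustration in R. The correlations below are invented purely for the demonstration and have nothing to do with de Venter et al.'s data; the point is simply that when two predictors are strongly correlated with each other, one standardized coefficient can exceed 1.0 while the other flips sign.

# Invented correlations, for illustration only (not from de Venter et al.).
R_xx <- matrix(c(1.0, 0.9,
                 0.9, 1.0), nrow = 2)   # the two predictors correlate at .9
r_xy <- c(0.6, 0.3)                     # each predictor's correlation with y

# Standardized OLS coefficients: solve R_xx %*% beta = r_xy
betas <- solve(R_xx, r_xy)
round(betas, 3)     # approximately 1.737 and -1.263: one beta above 1, one sign flip
sum(betas * r_xy)   # implied R-squared of about .66, so these correlations are coherent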

This doubling of the zero-order coefficient to the beta is a very large difference that is almost certainly explained entirely by substantial correlations between at least some, and possibly all, of the five predictors. If the authors wish to claim otherwise, they have some serious theoretical explaining to do. In particular, they need to show why the true relation between sexual harassment and SF-36 functioning is in fact twice as strong as the (presumably already substantial) zero-order correlation would suggest, and how the addition of the other covariates somehow reveals this otherwise hidden part of the relation. If they cannot do this, then the default explanation --- namely, that this is a statistical artefact as a result of highly correlated predictors --- is by far the most parsimonious.

My best guess is that the zero-order correlation between sexual harassment and the outcome variables is not statistically significant, which brings to mind one of the key points from Simmons, Nelson, and Simonsohn (2011): "If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate" (p. 1362). I also suspect that we would find that sexual harassment and sexual abuse are rather strongly correlated, as discussed earlier.

I wanted to try and reproduce the authors' analyses, to understand which predictors were causing the inflation in the beta for sexual harassment. The best way to do this would be if the entire data set had been made public somewhere, but this is research on people with a controversial condition, so it's not necessarily a problem that the data are not just sitting on OSF waiting for anyone to download them. All I needed were the seven variables that went into the regressions in Table 3, so there is no question of requesting any personally-identifiable information. In fact, all I really needed was the correlations between these variables, because I could work out the regression coefficients from them.

(Another aside: Many people don't know that with just the table of correlations, you can reproduce the standardized coefficients of any OLS regression analysis. With the SDs as well you can get the unstandardized coefficients, and with the means you can also derive the intercepts. For more information and R code, see this post, knowing that the correlation matrix is, in effect, the covariance matrix for standardized variables.)
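Here is a sketch of that calculation, with invented numbers since I do not have the real descriptives: given the correlation matrix, the standardized coefficients come from a single matrix solve; adding the SDs gives the unstandardized coefficients, and adding the means gives the intercept.

# Invented summary statistics, for illustration only.
R_all <- matrix(c(1.0, 0.5, 0.4,
                  0.5, 1.0, 0.3,
                  0.4, 0.3, 1.0),
                nrow = 3,
                dimnames = list(c("x1", "x2", "y"), c("x1", "x2", "y")))
sds   <- c(x1 = 2.0, x2 = 1.5, y = 10.0)
means <- c(x1 = 3.0, x2 = 4.0, y = 50.0)

beta_std  <- solve(R_all[1:2, 1:2], R_all[1:2, "y"])    # standardized coefficients
b_unstd   <- beta_std * sds["y"] / sds[c("x1", "x2")]   # unstandardized coefficients
intercept <- means["y"] - sum(b_unstd * means[c("x1", "x2")])
list(standardized = beta_std, unstandardized = b_unstd, intercept = intercept)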

So, that was my starting point when I set out to contact the authors. All I needed was seven columns of (entirely anonymous) numbers --- or even just the table of correlations, which arguably ought to have been in the article anyway. But my efforts to obtain the data didn't go very well, as you can see from the e-mail exchange below. (***)

First, I wrote to the corresponding author (Dr Maud de Venter) from my personal Gmail account:

Nick Brown <**********@gmail.com>  3 Oct, 16:50

Dear Dr. de Venter,

I have read with interest your article "Differential effects of childhood trauma subtypes on fatigue and physical functioning in chronic fatigue syndrome", published in Comprehensive Psychiatry.

My attention was caught, in particular, by the very high beta coefficients for the two statistically significant results.  Standardized regression coefficients above 1.0 tend to indicate that suppression effects or severe confounding are occurring, which can result in betas that are far larger in magnitude than the corresponding zero-order correlations. Such effects tend to require a good deal of theoretical explanation if they are not to be considered as likely statistical artefacts.

I wonder if you could supply me with a copy of the data set so that I could examine this question in more detail? I would only need the seven variables mentioned in Table 3, with no demographic information of any kind, so I would hope that there would be no major concerns about confidentiality. I can read most formats, including SPSS .SAV files and CSV. Alternatively, a simple correlation matrix of these seven variables, together with their means and SDs, would also allow me to reproduce the results.

Kind regards,
Nicholas Brown

A couple of days later I received a reply, not from Dr. de Venter, but from Dr. Filip Van Den Eede, who is the last author on the article, asking me to clarify the purpose of my request. I thought I had been fairly clear, but it didn't seem unreasonable to send my request from my university address with my supervisor in copy, even though this work is entirely independent of my studies. So I did that:

Brown, NJL ...  05 October 2018 19:32

Dear Dr. Van Den Eede,

Thank you for your reply.

As I explained in my initial mail to Dr. de Venter, I would like to understand where the very high beta coefficients --- which would appear to be the key factors driving the headline results of the article --- are coming from. Betas of this magnitude (above 1.0) are unusual in the absence of confounding or suppression effects, the presence of which could have consequences for the interpretation of the results.

If I had access either to the raw data, or even just the full table of descriptives (mean, SD, and Pearson correlations), then I believe that I would be better able to identify the source of these high coefficients. I am aware that these data may be sensitive in terms of patient confidentiality, but it seems unlikely that any participant could be identified on the basis of just the seven variables in question.

I would be happy to answer any other questions that you might have, if that would make my purpose clearer.

As you requested, I am putting my supervisors in copy of this mail.

Kind regards,
Nicholas Brown

Within less than 20 minutes, back came a reply from Dr. Van Den Eede indicating that his research team does not share data outside of specific collaborations or for a "clear scientific purpose", an example of which might be a meta-analysis. This sounded to me like a refusal to share the data with me, but that wasn't entirely clear, so I sent a further reply:

Brown, NJL ...  07 October 2018 18:56

Dear Dr. Van Den Eede,


Thank you for your prompt reply.

I would argue that establishing whether or not a published result might be based on a statistical artifact does in fact constitute a "clear scientific purpose", but perhaps we will have to differ on this.

For the avoidance of doubt, might I ask you to formally confirm that you are refusing to share with me both (a) the raw data for these seven variables and (b) their descriptive statistics (mean, SD, and table of intercorrelations)?

Kind regards,
Nick Brown

That was 11 days ago, and I haven't received a reply since then, despite Dr. Van Den Eede's commendable speed in responding up to that point. I guess I'm not going to get one.

In case you're wondering about the journal's data sharing policy, here it is. It does not impose what I would call especially draconian requirements on authors, so I don't think writing to the editor is going to help much here.


Where does this leave us? Well, it seems to me that this research team is declining to share some inherently anonymous data, or even just the table of correlations for those data, with a bona fide researcher (at least, I think I am!), who has offered reasonable preliminary evidence that there might be a problem with one of their published articles.

I'm not sure that this is an optimal way to conduct robust research into serious problems in people's lives.


[[ Begin update 2018-11-03 18:00 UTC ]]
I wrote to the editor of the journal that published the de Venter et al. article, with my concerns. He replied with commendable speed, enclosing a report from a "statistical expert" whom he had consulted. Sadly, when I asked for permission to quote that report here, the editor requested me not to do so. So I shall have to paraphrase it, and hope that no inaccuracies creep in.

Basically, the statistical expert agreed with my points, stated that it would indeed be useful to know if the regression coefficients were standardized or unstandardized, but didn't think that much could be done unless the authors wanted to write an erratum. This expert also didn't think there was a problem with the lack of Bonferroni correction, because the reader could fill that in for themselves.
[[ End update 2018-11-03 18:00 UTC ]]

[[ Begin update 2018-11-24 17:20 UTC ]]
Since this blog post was first written, I have received two further e-mails from Dr. Van Den Eede on this subject. In the latest of these, he indicated that he does not approve of the sharing of the text of his e-mails on my blog. Accordingly, I have removed the verbatim transcripts of the e-mails that I received from him before this blog post, and replaced them with what I believe to be fair summaries of their content.

The more recent e-mails do not add much in terms of progress towards my goal of obtaining the seven variables, or the full table of descriptives. However, Dr. Van Den Eede did tell me that the regression coefficients published in the De Venter et al. article were unstandardized, suggesting (given the units of the scales involved) that the effect size was very small.
[[ End update 2018-11-24 17:20 UTC ]]

[[ Begin update 2018-11-27 12:15 UTC ]]
I added a note to the start of this post to make it clear that it represents my personal opinions only.
[[ End update 2018-11-27 12:15 UTC ]]

[[ Begin update 2019-02-01 19:14 UTC ]]
Fixed link to the original De Venter et al. article.
[[ End update 2019-02-01 19:14 UTC ]]

References

de Groot, A. D. (1956). De betekenis van “significantie” bij verschillende typen onderzoek [The meaning of “significance” for different types of research]. Nederlands Tijdschrift voor de Psychologie, 11, 398–409. English translation by E. J. Wagenmakers et al. (2014), Acta Psychologica, 148, 188–194. http://dx.doi.org/10.1016/j.actpsy.2014.02.001

Deegan, J., Jr. (1978). On the occurrence of standardized regression coefficients greater than one. Educational and Psychological Measurement, 38, 873–888. http://dx.doi.org/10.1177/001316447803800404

Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22, 1359–1366. http://dx.doi.org/10.1177/0956797611417632

(*) A colleague suggested to me that these numbers might in fact be unstandardized coefficients. My assumption that they are standardized partial regression coefficients is based on the following:
1. The Abstract and the Results section of the article refer to them with the Greek letter β, which is the symbol normally used to denote standardized partial regression coefficients.
2. The scale ranges of the IVs (0–12 in four cases, 0–21 in the fifth) and DVs (maximum values over 100) are such that I would expect considerably larger unstandardized values for a statistically significant effect, if these IVs were explaining a non-trivial amount of variance in the DVs.
3. I mentioned "standardized regression coefficients" in my first e-mail to the authors, and "beta coefficients" in the second. Had these numbers in fact been referring to unstandardized coefficients, I would have hoped that the last author would have pointed this out, thus saving everybody's time, rather than entering into a discussion about their data sharing policies.

I suppose that it is just possible that these are unstandardized coefficients (in which case a correction to the article would seem to be required on that basis alone), but of course, if the authors would agree to share their data with me, I could ascertain that for myself.

(**) I hope that readers will forgive me if, for the purposes of the present discussion, I assume that identifying whether a p value is above or below .05 has some utility when one is attempting to learn something true about the universe.

(***) I'm not sure what the ethical rules are about publishing e-mail conversations, but I don't feel that I'm breaking anyone's trust here. Perhaps I should have asked Dr. Van Den Eede if he objected to me citing our correspondence, but since he has stopped replying to me about the substantive issue I'm not sure that it would be especially productive to ask about a procedural matter.

10 September 2018

Replication crisis: Cultural norms in the lab

I recently listened to an excellent episode of the "Stuk Rood Vlees" ("Hunk of Red Meat") podcast that is hosted by the Dutch political scientist Armèn Hakhverdian (@hakhverdian on Twitter). His guest was Daniël Lakens (@lakens) and they talked at great length --- to the extent that the episode had to be split into two 1-hour sections --- about the replication crisis.

This podcast episode was recorded in Dutch, which is reasonable since that's the native language of both protagonists, but a little unfortunate for the more than 99.5% of the world's population who don't understand it. (Confession: I'm a little bit lukewarm on podcasts --- apart from ones with me as a guest, which are fantastic --- because the lack of a transcript makes them hard to search, and even harder to translate.)

This is a particular shame because Daniël is on sparkling form in this podcast. So I've taken the liberty of transcribing what I thought was the most important part, just over 15 minutes long, where Armèn and Daniël talk about publication bias and the culture that produces it. The transcription has been done rather liberally, so don't use it as a way to learn Dutch from the podcast. I've run it past both of the participants and they are happy that it doesn't misrepresent what they said.

This discussion starts at around 13:06, after some discussion of the Stapel and Bem affairs from 2010-2011, ending with surprise that when Stapel --- as Dean --- claimed to have been collecting his data himself, everybody thought this was really nice of him, and nobody seemed to find it weird. Now read on...

Daniël Lakens: Looking back, the most important lesson I've learned about this --- and I have to say, I'm glad that I had started my career back then, around 2009, back when we really weren't doing research right, so I know this from first-hand experience --- is just how important the influence of conforming to norms is. You imagine that you're this highly rational person, learning all these objective methods and applying them rigorously, and then you find yourself in this particular lab and someone says "Yeah, well, actually, the way we do it round here is X", and you just accept that. You don't think it's strange, it's just how things are. Sometimes something will happen and you think "Hmmm, that's a bit weird", but we spend our whole lives in the wider community accepting that slightly weird things happen, so why should it be different in the scientific community? Looking back, I'm thinking "Yeah, that wasn't very good", but at the time you think, "Well, maybe this isn't the optimal way to do it, but I guess everyone's OK with it".

Armèn Hakhverdian: When you arrive somewhere as a newbie and everyone says "This is how we do it here, in fact, this is the right way, the only way to do it", it's going to be pretty awkward to question that.

DL: Yes, and to some extent that's a legitimate part of the process of training scientists. The teacher tells you "Trust me, this is how you do it". And of course up to some point you kind of have to trust these people who know a lot more than you do. But it turns out that quite a lot of that trust isn't justified by the evidence.

AH: Have you ever tried to replicate your own research?

DL: The first article I was ever involved with as a co-author --- so much was wrong with that. There was a meta-analysis of the topic that came out showing that overall, across the various replications, there was no effect, and we published a comment saying that we didn't think there was any good evidence left.

AH: What was that study about?

DL: Looking back, I can see that it was another of these fun effects with little theoretical support ---

AH: Media-friendly research.

DL: Yep, there was a lot of that back then. This was a line of research where researchers tried to show that how warm or heavy something was could affect cognition. Actually, this is something that I still study, but in a smarter way. Anyway, we were looking at weight, and we thought there might be a relation between holding a heavy object and thinking that certain things were more important, more "weighty". So for example we showed that if you gave people a questionnaire to fill in and it was attached to a heavy clipboard, they would give different, more "serious" answers than if the clipboard was lighter. Looking back, we didn't analyse this very honestly --- there was one experiment that didn't give us the result we wanted, so we just ignored it, whereas today I'd say, no, you have to report that as well. Some of us wondered at the time if it was the right thing to do, but then we said, well, that's how everyone else does it.

AH: There are several levels at which things can be done wrong. Stapel making his data up is obviously horrible, but as you just described you can also just ignore a result you don't like, or you can keep analysing the data in a bunch of ways until you find something you can publish. Is there a scale of wrongdoing? We could just call it all fraud, but for example you could just have someone who is well-meaning but doesn't understand statistics --- that isn't an excuse, but it's a different type of problem from conscious fraud.

DL: I think this is also very dependent on norms. There are things that we still think are acceptable today, but which we might look back on in 20 years time and think, how could we ever have thought that was OK? Premeditated fraud is a pretty easy call, a bit like murder, but in the legal system you also have the idea of killing someone, not deliberately, but by gross negligence, and I think the problems we have now are more like that. We've known for 50 years or more that we have been letting people with insufficient training have access to data, and now we're finally starting to accept that we have to start teaching people that you can't just trawl through data and publish the patterns that you find as "results". We're seeing a shift --- whereas before you could say "Maybe they didn't know any better", now we can say, "Frankly, this is just negligent". It's not a plausible excuse to pretend that you haven't noticed what's been going on for the past 10 years.

   Then you have the question of not publishing non-significant results. This is a huge problem. You look at the published literature and more than 90% of the studies show positive results, although we know that lots of research just doesn't work out the way we hoped. As a field we still think that it's OK to not publish that kind of study because we can say, "Well, where could I possibly get it published?". But if you ask people who don't work in science, they think this is nuts. There was a nice study about this in the US, where they asked people, "Suppose a researcher only publishes results that support his or her hypotheses, what should happen?", and people say, "Well, clearly, that researcher should be fired". That's the view of dispassionate observers about what most scientists think is a completely normal way to work. So there's this huge gap, and I hope that in, say, 20 years time, we'll have fixed that, and nobody will think that it's OK to withhold results. That's a long time, but there's a lot that still needs to be done. I often say to students, if we can just fix this problem of publication bias during our careers, alongside the actual research we do, that's the biggest contribution to science that any of us could make.

AH: So the problem is, you've got all these studies being done all around the world, but only a small fraction gets published. And that's not a random sample of the total --- it's certain types of studies, and that gives a distorted picture of the subject matter.

DL: Right. If you read in the newspaper that there's a study showing that eating chocolate makes you lose weight, you'll probably find that there were 40 or 100 studies done, and in one of them the researchers happened to look at how much chocolate people ate and how their weight changed, and that one study gets published. And of course the newspapers love this kind of story. But it was just a random blip in that one study out of 100. And the question is, how much of the literature is this kind of random blip, and how much is reliable.

AH: For many years I taught statistics to first- and second-year undergraduates who needed to do small research projects, but I never talked about this kind of thing. And lots of these students would come to me after collecting their data and say, "Darn, I didn't get a significant result". It's like there's this inherent belief that you have to get statistical significance to have "good research". But whether research is good or not is all about the method, not the results. It's not a bad thing that a hypothesis goes unsupported.

DL: But it's really hypocritical to tell a first-year student to avoid publication bias, and then to say "Hey, look at my impressive list of publications", when that list is full of significant results. In the last few years I've started to note the non-significant results in the Discussion section, and sometimes we publish via a registered report, where you write up and submit in advance how you're going to do the study, and the journal says "OK, we'll accept this paper regardless of how the results turn out". But if you look at my list of publications as a whole, that first-year student is not going to think that I'm very sincere when I say that non-significant results are just as important as significant ones. Young researchers come into a world that looks very different to what you just described, and they learn very quickly that the norm is, "significance means publishable".

AH: In political science we have lots of studies with null results. We might discover that it wouldn't make much difference if you made some proposed change to the voting system, and that's interesting. Maybe it's different if you're doing an experiment, because you're changing something and you want that change to work. But even there, the fact that your manipulation doesn't work is also interesting. Policymakers want to know that.

DL: Yes, but only if the question you were asking is an interesting one. When I look back to some of my earlier studies, I think that we weren't asking very interesting questions. They were fun because they were counterintuitive, but there was no major theory or potential application. If those kinds of effects turn out not to exist, there's no point in reporting that, whereas we care about what might or might not happen if we change the voting system.

AH: So for example, the idea that if people are holding a heavier object they answer questions more seriously: if that turns out not to be true, you don't think that's interesting?

DL: Right. I mean, if we had some sort of situation in society whereby we knew that some people were holding heavy or light things while filling in important documents, then we might be thinking about whether that changes anything. But that's not really the case here, although there are lots of real problems that we could be addressing.

   Another thing I've been working on lately is teaching people how to interpret null effects. There are statistical tools for this ---

AH: It's really difficult.

DL: No, it's really easy! The tools are hardly any more difficult than what we teach in first-year statistics, but again, they are hardly ever taught, which also contributes to the problem of people not knowing what to do with null results.

(That's the end of this transcript, at around the 30-minute mark on the recording. If you want to understand the rest of the podcast, it turns out that Dutch is actually quite an easy language to learn.)
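For readers wondering what the "statistical tools" for interpreting null effects that Daniël mentions might look like, here is a minimal sketch of one widely used approach, the two one-sided tests (TOST) equivalence procedure. This is only an illustration, not necessarily the exact tooling he has in mind, and the equivalence bounds of d = ±0.4 and the simulated data are arbitrary choices of mine.

```python
# Sketch of a two one-sided tests (TOST) equivalence check for two groups.
# If both one-sided p values are below alpha, the observed difference is
# statistically smaller than the chosen equivalence bound.
import numpy as np
from scipy import stats

def tost_two_sample(x, y, d_bound=0.4):
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1))
                        / (nx + ny - 2))
    se = pooled_sd * np.sqrt(1 / nx + 1 / ny)
    df = nx + ny - 2
    raw_bound = d_bound * pooled_sd          # bound on the raw mean difference
    diff = x.mean() - y.mean()
    p_lower = stats.t.sf((diff + raw_bound) / se, df)   # reliably above -bound?
    p_upper = stats.t.cdf((diff - raw_bound) / se, df)  # reliably below +bound?
    return max(p_lower, p_upper)

rng = np.random.default_rng(42)
x, y = rng.normal(0, 1, 200), rng.normal(0, 1, 200)
print(f"TOST p = {tost_two_sample(x, y):.4f}")  # small p: effect within bounds
```

The appeal of this kind of procedure is exactly the point made above: it turns "not significant" into a positive, interpretable statement that the effect, if any, is smaller than some bound you are prepared to call negligible.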

01 July 2018

This researcher compared two identical numbers. The effect size he obtained will shock you!

Here's an extract from an article that makes some remarkable claims about the health benefits of drinking green tea. The article itself seems to merit scrutiny for a number of reasons, but here I just want to look at a point that illustrates why (as previously noted by James Heathers in his inimitable style) rounding to one decimal place (or significant figure) when reporting your statistics is not a good idea.


The above image is taken from Table 3 of the article, on page 596. This table shows the baseline and post-treatment values of a large number of variables that were measured in the study.  The highlighted row shows the participants' waist–hip ratio in each of two groups and at each of two time points. As you can see, all of the (rounded) means are equal, as are all of the (rounded) SDs.

Does this mean that there was absolutely no difference between the participants? Not quite. You can see that the p value is different for the two conditions. This p value corresponds to the paired t test that will have been performed for the 39 participants in the treatment group across the period of the study, or for the 38 participants in the control group. The p values (corresponding to the respective t statistics) could well differ even if the means and SDs were identical to many decimal places, because the paired t test is based on the 39 (or 38) within-participant differences between baseline and the end of the study, and the spread of those differences is not determined by the group-level means and SDs alone.
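To see this with a toy example, here is a minimal sketch in Python using invented data (not the study's): two pre/post samples whose reported means and SDs are essentially identical can still produce very different paired-test p values, depending on how consistent the individual changes are.

```python
# Invented data: same summary statistics, very different paired-test results,
# because the paired t test is driven by the within-participant differences.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 39
baseline = rng.normal(0.90, 0.10, n)

# Case 1: each participant shifts by a small, consistent amount
follow_consistent = baseline - 0.005 + rng.normal(0, 0.005, n)

# Case 2: follow-up values are essentially re-drawn around the same mean
follow_noisy = rng.normal(baseline.mean() - 0.005, 0.10, n)

for label, follow in [("consistent", follow_consistent), ("noisy", follow_noisy)]:
    t, p = stats.ttest_rel(baseline, follow)
    print(f"{label:10s}: means {baseline.mean():.2f} vs {follow.mean():.2f}, p = {p:.3g}")
```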

However, what I'm interested in here is the difference in mean waist–hip ratios between the groups at baseline (i.e., the first and fourth columns of numbers). The participants have been randomized to conditions, so presumably the authors decided not to worry about baseline differences [PDF], but it's interesting to see what those differences could have been (not least because these same numbers could also have been, say, the results obtained by the two groups on a psychological test after they had been assigned randomly to conditions without a baseline measurement).

We can calculate the possible range of differences(*) by noting that the rounded mean of 0.9 could have corresponded to an actual value anywhere between 0.85001 and 0.94999 (let's leave the question of how to round values of exactly 0.85 or 0.95 for now; it's complicated). Meanwhile, each of the rounded SDs of 0.1 could have been as low as 0.05001. (The lower the SD, the higher the effect.)  Let's put those numbers into this online effect size calculator (M1=0.94999, M2=0.85001, SD1=SD2=0.05001) and click "Compute" (**).
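If you prefer not to rely on the online calculator, the same calculation takes a couple of lines of Python (the boundary values below are the ones just described):

```python
# Cohen's d from the most extreme values still consistent with the
# reported (rounded) means of 0.9 and SDs of 0.1
m1, m2 = 0.94999, 0.85001
sd = 0.05001               # smallest SD compatible with a rounded 0.1

d = (m1 - m2) / sd         # with equal SDs, the pooled SD is just sd
print(round(d, 3))         # 1.999
```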

Yes, you are reading that right: An effect size of d = (almost) 2 is possible for the baseline difference between the groups even though the reported means are identical. (For what it's worth, the p value here, with 75 degrees of freedom, is .0000000000004.) Again, James has you covered if you want to know what an effect size of 2 means in the real world.

Now, you might think that this is a bit pathological, and you're probably right. So play around with the means and SDs until they look reasonable to you. For example, if you keep the extreme means but use the rounded SDs as if they were exactly correct, you get d = 0.9998. That's still a whopping effect size for the difference between numbers that are reported as being equal. And even if you bring the means in from the edge of the cliff, the effect size can still be pretty large. Means of 0.93 and 0.87 with SDs of 0.1 will give you d = 0.6 and p = .01, which is good enough for publication in most journals.
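If you want to try other combinations yourself, here is a small helper function, a sketch in Python that assumes two independent groups of 39 and 38 with equal variances, returning d and the corresponding p value directly from the summary statistics:

```python
from scipy import stats

def d_and_p(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d and two-sample t-test p value from summary statistics."""
    pooled = (((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)) ** 0.5
    d = (m1 - m2) / pooled
    t, p = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=True)
    return round(d, 3), p

print(d_and_p(0.94999, 0.1, 39, 0.85001, 0.1, 38))  # d ~ 1.0, p ~ .00004
print(d_and_p(0.93,    0.1, 39, 0.87,    0.1, 38))  # d = 0.6, p ~ .01
```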

Conclusion: Always report not just two decimal places, but also at least two significant figures (it's very frustrating to see standardized regression coefficients, in particular, reported as 0.02 with a standard error of 0.01). In fact, since most people read papers on their screens and black pixels use less energy to display than white ones, save the planet and your battery life and report three or four decimals. After all, you aren't afraid of GRIM, are you?



(*) I did this calculation by hand. My f_range() function, described here, doesn't work in this case because the underlying code (from a module that I didn't write, and have no intention of fixing) chokes when trying to calculate the midpoint test statistic when the means and SDs are identical.

(**) This calculator seems to make the simplifying assumption that the group sizes are identical, which is close enough to make no difference in this case. You can also do the calculation of d by hand: just divide the difference between the means by the standard deviation, assuming you're using the same SD for both means, or see here.

[Update 2018-07-02 12:13 UTC: Removed link to a Twitter discussion of a different article, following feedback from an alert reader.]