## 14 October 2019

### A curious edge case issue with PET-PEESE

(This post has been written in the spirit of "If you want to understand how something works, dig around inside it for a while and write up what you found". It might already be known somewhere, and the chances of it causing a problem in the real world might be small, but I wanted to get my analyses down in writing for my future self, and I thought it might be useful for somebody. Thanks to Daniël Lakens for his very helpful comments on an earlier draft of this post.)

PET-PEESE is a method for detecting and estimating the effects of publication bias in a meta-analysis. I don't do meta-analyses for a living (or indeed as a hobby), so up to now my interest in this topic has been fairly minimal, and doubtless it will go back to being so after this post.  I won't give any introduction to what PET-PEESE does here; you can find that in the blog posts that I link to below.

While trying to understand more about meta-analyses in general and PET-PEESE in particular, I read this blog post from 2015 by Will Gervais. He discusses a number of limitations of PET-PEESE, including possible bias when using it with Cohen's d as the effect size. The problem, in a nutshell, is that PET-PEESE involves regressing the effect size on its standard error, and since the calculation of the standard error for Cohen's d includes the effect size itself, there is by definition going to be at least some correlation there. Here's Will's version of the formula for the SE:
(Note that this page has a slightly different formula that is missing the final term in parentheses, but that doesn't make a lot of difference here.)

In a reply to Will's post (scroll down to near the end of the comments), Jake Westfall wrote, "I don't think that dependence [on d] ends up mattering much. You can see in the SE formula that the term involving d quickly vanishes to 0 as the sample size grows. In fact, for typical effect sizes (d = .2 to .6), the term involving d is effectively nil once we have reached just 10 participants per group."

That seems reasonable. After all, d is probably less (or not much bigger) than 1, and twice the combined sample size is going to be a fairly large number in comparison. So it seems as if the effect of the term with d-squared is not going to be very great.

I wrote some R code to investigate this, which you can find here. (I've reproduced it below as an image, so you can read along, but the definitive version is in that gist.) It builds the SE from the variables t1a and t1b (the two terms in the left pair of parentheses of the SE formula above), which are added together and multiplied by t2, the term in the right pair of parentheses. I used the ratio of t1b to t1a as a measure of the influence of the d-squared term, and indeed it's quite small. As the code stands, this term is only about 0.02 times the left-hand term, and the correlation between d and its SE is about .03, which is only a small bias on the PET test.
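
For readers without the gist to hand, here is a minimal Python sketch (not the R code from the gist) of how the three pieces combine. The formula is reconstructed from the description in this post; the right-hand factor `(n1 + n2) / (n1 + n2 - 2)` in particular is an assumption, as are the function name `se_terms` and the example values d = 0.4 and n1 = n2 = 50.

```python
import math


def se_terms(d, n1, n2):
    """Return (t1a, t1b, t2, se) for Cohen's d with group sizes n1 and n2."""
    t1a = (n1 + n2) / (n1 * n2)      # term that depends only on sample sizes
    t1b = d ** 2 / (2 * (n1 + n2))   # term that involves d-squared
    t2 = (n1 + n2) / (n1 + n2 - 2)   # ASSUMED right-hand factor
    se = math.sqrt((t1a + t1b) * t2)
    return t1a, t1b, t2, se


t1a, t1b, t2, se = se_terms(d=0.4, n1=50, n2=50)
ratio = t1b / t1a                    # influence of the d-squared term
print(f"t1a={t1a:.4f} t1b={t1b:.4f} ratio={ratio:.3f} se={se:.4f}")
```

With equal group sizes the ratio `t1b / t1a` simplifies to d²/8, so under this formula it does not depend on the sample size at all; for d around 0.4 that gives roughly 0.02.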

The code generates sample sizes n1 and n2 as random numbers. When generating n1 I had to add a minimum value, to avoid having a sample size of 0 (which would cause things to break) or 1 (which would be a bit silly). As it stands below, the code tries to make a mean n1 of 50 with a minimum of 20. However, while playing with the code, at one point I reduced the mean to 10, without changing the minimum from 20. This meant that n1 was accidentally forced to be 20 for every sample (because the maximum value that can be generated at line 14 is twice the target mean). Suddenly, although the ratio between the two terms inside the left-hand set of parentheses in the SE formula remained at 0.02, the correlation between d and the SE went up to .30.  You can try it yourself; just change 50 to 10 in line 12.

Things get even wilder if you have a bigger range of effect sizes. In line 19, put a # character before the *, so that the line is just

d = runif(iter)

and hence the range of d is from 0 to 1. (Aside: I can't get Blogger.com's editor to stop eating left angle brackets, so please don't @ me about my use of = for assignment here.) Now the correlation between d and SE is about .65. Want more mayhem? Uncomment line 16, so that n2 is now exactly the same as n1 (instead of being a bit more or less), making all the pooled sample sizes the same. The correlation between d and SE now goes up to about .97 (!).
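
The contrast between varied and identical sample sizes can be sketched in a few lines of Python. This is not the gist's R code: the way sample sizes are drawn here (independent uniform integers), the seed, and the helper names `cohen_d_se`, `pearson`, and `simulate` are all assumptions, and so is the right-hand factor in the SE formula. The exact correlations will therefore differ from the ones quoted above; only the qualitative pattern is the point.

```python
import math
import random


def cohen_d_se(d, n1, n2):
    """SE of Cohen's d; the final factor is an assumed small-sample term."""
    left = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    right = (n1 + n2) / (n1 + n2 - 2)
    return math.sqrt(left * right)


def pearson(xs, ys):
    """Plain Pearson correlation, to keep the sketch stdlib-only."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate(iters, d_low, d_high, n_low, n_high, seed=1):
    """Correlation between d and its SE across simulated studies."""
    rng = random.Random(seed)
    ds, ses = [], []
    for _ in range(iters):
        d = rng.uniform(d_low, d_high)
        n1 = rng.randint(n_low, n_high)   # inclusive at both ends
        n2 = rng.randint(n_low, n_high)
        ds.append(d)
        ses.append(cohen_d_se(d, n1, n2))
    return pearson(ds, ses)


r_varied = simulate(10_000, 0.2, 0.6, 20, 100)  # heterogeneous sample sizes
r_fixed = simulate(10_000, 0.0, 1.0, 20, 20)    # every study has n1 = n2 = 20
print(f"varied n: r = {r_varied:.2f}; fixed n: r = {r_fixed:.2f}")
```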

The effect also makes a big difference to the intercept, which is intended to be a measure of the true effect size. For example, after making the various changes mentioned above, try examining the wls regression object; the intercept can go below -6. A plot() of this object is interesting; indeed, even a simple plot of d against se is quite spectacular:

Plot of se against d with n1=10 but no other edits to the supplied code (range of d: 0.2 to 0.6; n2 != n1)
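
To get a feel for what happens to the intercept, here is a minimal Python sketch of a PET-style fit: a weighted least-squares regression of d on its SE with weights 1/SE². It is not the gist's code. The scenario (1,000 studies, n1 = n2 = 20, d uniform on 0 to 1), the seed, the helper names, and the right-hand factor in the SE formula are all assumptions made for illustration.

```python
import math
import random


def cohen_d_se(d, n1, n2):
    """SE of Cohen's d; the final factor is an assumed small-sample term."""
    left = (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))
    right = (n1 + n2) / (n1 + n2 - 2)
    return math.sqrt(left * right)


def wls_line(x, y, w):
    """Weighted least-squares fit of y = intercept + slope * x."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return my - slope * mx, slope


rng = random.Random(1)
d = [rng.uniform(0.0, 1.0) for _ in range(1000)]   # no publication bias at all
se = [cohen_d_se(di, 20, 20) for di in d]          # identical sample sizes
weights = [1 / s ** 2 for s in se]                 # inverse-variance weights

intercept, slope = wls_line(se, d, weights)
print(f"PET intercept = {intercept:.1f}, slope = {slope:.1f}")
```

In this setup the mean simulated effect is about 0.5, yet the fitted intercept comes out strongly negative, because the only variation in SE is the part driven by d itself.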

In a 2017 article [PDF], Tom Stanley (the originator of PET-PEESE) noted that there can be some bias when all of the sample sizes are small. However, the issue identified here goes beyond that. If you change the minimum sample size to 1000 (line 11) you will see that the problem remains almost exactly the same.

In a normal meta-analysis, the biasing effect will of course be much less drastic than this, but with studies of sufficiently similar size, this problem has the potential to introduce some bias into the unsuspecting user's interpretation of the PET-PEESE regression line. (In the article just mentioned, Stanley recommends including a minimum of 20 effects in a PET-PEESE analysis, for other reasons.) Interested readers might try playing with the variable iter to see how various numbers of studies affect the result.

What's going on? In the formula for the SE, the term on the right that includes d-squared is of negligible magnitude compared to the one on its left, and yet it is driving the entire relationship. The answer appears (I think) to be our good friend, granularity. With homogeneous sample sizes, the numbers in the first term of the formula, (n1 + n2) / (n1 * n2), are always the same, or at least, quite similar. Hence, the variance provided by the term containing d turns out to make a significant contribution. At least, that's what I think is happening; please feel free to play with the simulated data from my code and disagree with me (because I'm totally just winging it here).
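
One way to see why a numerically tiny term can dominate is to work through the extreme case by hand. This is only a sketch, under the assumptions that both groups have the same fixed size n, that the right-hand factor of the SE formula is (n1 + n2) / (n1 + n2 - 2), and that d is uniform on [0, 1]:

$$
\mathrm{SE}(d) = \sqrt{\left(\frac{2}{n} + \frac{d^2}{4n}\right)\frac{n}{n-1}}
= \sqrt{\frac{2}{n-1}}\,\sqrt{1 + \frac{d^2}{8}}
\approx \sqrt{\frac{2}{n-1}}\left(1 + \frac{d^2}{16}\right)
$$

$$
\operatorname{corr}(d, \mathrm{SE}) \approx \operatorname{corr}(d, d^2)
= \frac{E[d^3] - E[d]\,E[d^2]}{\sqrt{\operatorname{Var}(d)\,\operatorname{Var}(d^2)}}
= \frac{1/12}{\sqrt{(1/12)(4/45)}}
= \frac{\sqrt{15}}{4} \approx 0.97
$$

Correlation ignores scale, so once the sample-size term is constant, the size of the d-squared term relative to it no longer matters. In this approximation n drops out of the correlation entirely, which would fit the observation that raising the minimum sample size to 1000 changes almost nothing.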

Some time after Will's post, Uri Simonsohn blogged about PET-PEESE at Data Colada. Uri noted: "A more surprisingly consequential assumption involves the symmetry of sample sizes across studies. Whether there are more small than large n studies, or vice versa, PET PEESE’s performance suffers quite a bit."  I wrote the code shown here before I read Uri's post, but when I did read it, it made sense (I presume that the effect that Uri is describing could be the same as the one I observed in my simulated data).

In summary, it seems that when sample sizes are "too" homogeneous, PET-PEESE will be biased, in favour of suggesting that there is excessive publication bias, with this bias being an inverse function (which I am not smart enough to work out) of the variability of the sample sizes (n1 and n2).

How much of a problem is this in practice, for the typical meta-analysis? Probably not very much at all. I just find it curious that (assuming the above analysis is correct) a meta-analysis method could potentially fail if it was used in a field where sample sizes are highly homogeneous, which I suppose could happen if there was a "natural" sample size; say, the number of matches played in a football league with 20 teams over successive seasons. Of course, all analysis methods have limitations on the conditions where they can be used, but typically these arise when the input data are extremely variable. In ANOVA, we like it when all of the groups have similar variance; we don't have to worry about something suddenly heading off towards infinity if this similarity is less than 0.03 or whatever. In the title of this post I've referred to the problem as an "edge case", but it feels more like a sneaky hole lurking in the middle of the playing field.

## 06 August 2019

### Some instances of apparent duplicate publication by Dr. Mark D. Griffiths

According to his Twitter bio, Dr. Mark D. Griffiths is a Chartered Psychologist and Distinguished Professor of Behavioural Addiction at the Nottingham Trent University. He is also a remarkably prolific researcher, with a Google Scholar h-index of 125. In this recent tweet he reports having just published the 879th paper of his career, which according to his faculty page makes about 30 publications per year since he obtained his PhD. In fact, that number may be a low estimate, as I counted 1,166 lines in the "Journal Articles" section of his publications list; half of these were published since 2010, which would represent more than one article published per week, every week of the year, for the last ten years.

Now, even 30 publications a year represents a lot of writing. I'm sure that many of us would like to be able to produce even a third as many pieces as that. And helpfully, Dr. Griffiths has written a piece at Psychology Today in which he gives us "some general tips on how to make your writing more productive". But there is one tip that he didn't include, which is that copying and pasting extensively from one's previous manuscripts is a lot faster than typing new material.

#### Exhibit 1a

Here is a marked-up snapshot of a book chapter by Dr. Griffiths:
Griffiths, M. (2005). Internet abuse and addiction in the workplace. In M. Khosrow-Pour (Ed.), Encyclopedia of information science and technology (pp. 1623–1626). Hershey, PA: IGI Global.

The highlighted text appears to have been copied, verbatim and without attribution, from an earlier book chapter:
Griffiths, M. (2004). Internet abuse and addiction in the workplace: Issues and concerns for employers. In M. Anandarajan & C. A. Simmers (Eds.), Personal web usage in the workplace: A guide to effective human resources management (pp. 230–245). Hershey, PA: IGI Global.

#### Exhibit 1b

This is a snapshot of a journal article:
Griffiths, M. (2010). Internet abuse and internet addiction in the workplace. Journal of Workplace Learning, 22, 463–472. http://dx.doi.org/10.1108/13665621011071127

Over half of this article (the highlighted part in the above image) consists of text that appears to have been copied, verbatim and without attribution, from the same 2004 book chapter that was the source for the 2005 chapter depicted in Exhibit 1a, despite being published six years later.

One might, perhaps, have expected an article on the important and fast-moving topic of workplace Internet abuse to consist entirely of new material after such a long time, but apparently Dr. Griffiths considered his work from 2004 to still be highly relevant (and his assignment of copyright to the publisher of the 2004 and 2005 books to be of minor importance), to the extent that the 2010 article does not contain the terms "social media", "Twitter", "Facebook", "YouTube", or even "MySpace", although like the earlier chapter it does mention, perhaps somewhat anachronistically, the existence of web sites that host "internet versions of widely available pornographic magazines".

#### Exhibit 2a

Next up is another journal article:
Griffiths, M. (2009). Internet help and therapy for addictive behavior. Journal of CyberTherapy and Rehabilitation, 2, 43–52. (No DOI.)
The highlighted portions of this article appear to have been copied, verbatim and without attribution, from the following sources:
Yellow: Griffiths, M. (2005). Online therapy for addictive behaviors. CyberPsychology & Behavior, 8, 555–561. http://dx.doi.org/10.1089/cpb.2005.8.555
Green: Wood, R. T. A., & Griffiths, M. D. (2007). Online guidance, advice, and support for problem gamblers and concerned relatives and friends: An evaluation of the GamAid pilot service. British Journal of Guidance & Counselling, 35, 373–389. http://dx.doi.org/10.1080/03069880701593540
Blue: Griffiths, M. D., & Cooper, G. (2003). Online therapy: Implications for problem gamblers and clinicians. British Journal of Guidance & Counselling, 31, 113–135. http://dx.doi.org/10.1080/0306988031000086206

#### Exhibit 2b

Closely related to the previous exhibit is this book chapter:
Griffiths, M. (2010). Online advice, guidance and counseling for problem gamblers. In M. M. Cruz-Cunha, A. J. Tavares, & R. Simoes (Eds.), Handbook of research on developments in e-health and telemedicine: Technological and social perspectives (pp. 1116–1133). Hershey, PA: IGI Global.

The highlighted portions of this chapter appear to have been copied, verbatim and without attribution, from the following sources:
Yellow: Griffiths, M. D., & Cooper, G. (2003). Online therapy: Implications for problem gamblers and clinicians. British Journal of Guidance & Counselling, 31, 113–135. http://dx.doi.org/10.1080/0306988031000086206
Green: Griffiths, M. (2005). Online therapy for addictive behaviors. CyberPsychology & Behavior, 8, 555–561. http://dx.doi.org/10.1089/cpb.2005.8.555
Blue: Wood, R. T. A., & Griffiths, M. D. (2007). Online guidance, advice, and support for problem gamblers and concerned relatives and friends: An evaluation of the GamAid pilot service. British Journal of Guidance & Counselling, 35, 373–389. http://dx.doi.org/10.1080/03069880701593540
Apart from a change of coding colour, those are the same three sources that went into the article in Exhibit 2a. That is, these three source articles were apparently recycled into an article and a book chapter.

#### Exhibit 3

This one is a bit more complicated: an article of which about 80% consists of pieces that have been copied, verbatim and without attribution, from no fewer than seven other articles and book chapters.
Griffiths, M. D. (2015). Adolescent gambling and gambling-type games on social networking sites: Issues, concerns, and recommendations. Aloma, 33(2), 31–37. (No DOI.)

The source documents are:
Mauve: Anthony, K., & Griffiths, M. D. (2014). Online social gaming - why should we be worried? Therapeutic Innovations in Light of Technology, 5(1), 24–31. (No DOI.)
Pink: Carran, M., & Griffiths, M. (2015). Gambling and social gambling: An exploratory study of young people’s perceptions and behaviour. Aloma, 33(1), 101–113. (No DOI.)
Orange: Griffiths, M. D. (2014). Child and adolescent social gaming: What are the issues of concern? Education and Health, 32, 19–22. (No DOI.)
Indigo: Griffiths, M. D. (2014). Adolescent gambling via social networking sites: A brief overview. Education and Health, 31, 84–87. (No DOI.)
Light blue: Griffiths, M. D. (2013). Social gambling via Facebook: Further observations and concerns. Gaming Law Review & Economics, 17, 104–106. http://dx.doi.org/10.1089/glre.2013.1726
Yellow: Griffiths, M. (2011). Adolescent gambling. In B. B. Brown & M. Prinstein (Eds.), Encyclopedia of adolescence (Vol. 3, pp. 11–20). New York, NY: Academic Press.
Green: Griffiths, M. D. (2013). Social networking addiction: Emerging themes and issues. Journal of Addiction Research & Therapy, 4, e118. http://dx.doi.org/10.4172/2155-6105.1000e118
It is possible that I have used more source documents than strictly necessary here, because some sections of the text are further duplicated across the various source articles and book chapters. However, in the absence (to my knowledge) of any definitions of best practices when looking for this type of duplication, I hope that readers will forgive any superfluous complexity.

#### Conclusion

In his Psychology Today piece, Dr. Griffiths describes a number of "false beliefs that many of us have about writing", including this: "Myth 2 - Good writing must be original: Little, if any, of what we write is truly original". I don't think I can improve on that.

#### Housekeeping

All of the annotated documents that went into making the images in this post are available here. I hope that this counts as fair use, but I will remove any document at once if anyone feels that their copyright has been infringed (by me, anyway).

## 30 July 2019

### Applying some error detection techniques to controversial past research: Rushton (1992)

A few days ago, James Heathers and I were cc'd in on a Twitter thread.

Neither of us had ever heard of J. Philippe Rushton before. At one point James tweeted this extract (which I later worked out was from this critical article), which included citations of authors pointing out "various technical errors in Rushton's procedures and theory".
I thought it might be interesting to look at an example of these "technical errors", so I picked one of those citations more or less at random (Cain & Vanderwolf, 1990, since it seemed like it would be easy to find on Google with only the authors' names), and downloaded both that article and Rushton's response to it. The latter was interesting because, although not an empirical article, it cited a number of other articles by Rushton. So I chose the one with the most empirical looking title, which was marked as being "In preparation", but ended up as this:
Rushton, J. P. (1992). Life-history comparisons between orientals and whites at a Canadian university. Personality and Individual Differences, 13, 439–442. http://dx.doi.org/10.1016/0191-8869(92)90072-W
I found a PDF copy of this article at a "memorial site" dedicated to Rushton's work.

Now I don't know much about this area of research ("race differences"), or the kinds of questions that Rushton was asking in his survey, but it seems to me that there are a few strange things about this article. There were 73 "Oriental" and 211 "Non-Oriental" undergraduate participants (the latter apparently also being non-Black, non-Native American, etc., judging from the title of the article), who took first a two-hour and then a three-hour battery of tests in return for course credit. Some of these were regular psychological questionnaires, but then it all got a bit... biological (pp. 439–440):
In the first session, lasting 2 hr, Ss completed a full-length intelligence test, the Multidimensional Aptitude Battery (Jackson, 1984). In the second session, lasting 3 hr, Ss completed the Eysenck Personality Questionnaire (Eysenck & Eysenck, 1975); the Sexual Opinion Survey (Fisher, Byrne, White & Kelley, 1988), the Self-Report Delinquency Scale (Rushton & Chrisjohn, 1981), and the Seriousness of Illness Rating Scale (Wyler, Masuda & Holmes, 1968), as well as self-report items assessing aspects of health, speed of maturation, sexual behaviour, and other life-history variables, many of which were similar to those used by Bogaert and Rushton (1989). Sex-combined  composites were formed from many of these items: Family Health included health ratings of various family members; Family Longevity included longevity ratings for various family members: Speed of Physical Maturation included age of puberty, age of pubic hair growth, age of menarche (for females), and age of first shaving (for males); Speed of Sexual Maturation included age of first masturbation, age of first petting, and age of first sexual intercourse; Reproductive Effort-Structures included size of genitalia, menstrual cycle length (for females), and amount of ejaculate (for males); Reproductive Effort-Behavioural included maximum number of orgasms in one 24 hr period, average number of orgasms per week, and maximum number of sexual partners in one month; and Family Altruism included parental marital status and self-ratings of altruism to family. Each S also rank ordered Blacks, Orientals, and Whites on several dimensions.
Whoa. Back up a bit there... (emphasis added)
Reproductive Effort-Structures included size of genitalia, menstrual cycle length (for females), and amount of ejaculate (for males)
The second and third of those variables are specified as being sex-specific, but the first, "size of genitalia", is not, suggesting that it was reported by men and women. Now, while most men have probably placed a ruler along their erect penis at some point, and might be prepared to report the result with varying degrees of desirability bias, I'm wondering how one measures "size of genitalia" in human females, not just in general, but also in the specific context of a bunch of people sitting in a room completing questionnaires. Similarly, I very much doubt if many of the men who had just put down their penis-measuring device had also then proceeded to ejaculate into a calibrated test tube and commit the resulting number of millilitres to memory in the same way as the result that they obtained from the ruler; yet, it would again appear to be challenging to accurately record this number (which, I suspect, is probably quite variable within subjects) in a lecture hall or other large space at a university where this type of study might take place.

I also have some doubts about some of the reported numbers. For example (p. 441):
At the item level ... the reported percentage frequency of reaching orgasm in each act of intercourse was 77% for Oriental males, 88% for White males, 40% for Oriental females, and 57% for White females.
Again, I'm not a sex researcher, but my N=1, first-hand experience of having been a healthy male undergraduate (full disclosure: this was mostly during the Carter administration) is that a typical frequency of reaching orgasm during intercourse is quite a lot higher than 88%. I checked with a sex researcher (who asked for their identity not to be used), and they told me that these appear to be exceptionally low rates for sexually functional young men in Canada, unless the question had been asked in an ambiguous way, such as "Did you finish?". (They also confirmed that measures of the dimensions of women's genitalia are nearly non-existent.)

Rushton also stated (p. 441) that "small positive correlations were found between head size and general intelligence in both the Oriental (r = 0.14) and White samples (r = 0.21)"; indeed, he added in the Discussion section (p. 441) that "It is worth drawing attention to our replication of the head size-IQ relationship within an Oriental sample". However, with a sample size of 73, the 95% confidence interval around an r value of .14 is (-.09, .37), which includes zero and which many researchers might not regard as indicative of any sort of replication.
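
The interval can be checked with the usual Fisher z transformation. This is a minimal sketch; the choice of method is an assumption (the post does not say how the interval was obtained), and the function name `r_confidence_interval` is hypothetical.

```python
import math


def r_confidence_interval(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson r via the Fisher z transformation."""
    z = math.atanh(r)                # Fisher z of the observed correlation
    se = 1 / math.sqrt(n - 3)        # standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


lower, upper = r_confidence_interval(0.14, 73)
print(f"95% CI for r = .14, N = 73: ({lower:.2f}, {upper:.2f})")
```

By this method the interval runs from slightly below zero to the mid-.30s; the upper bound may differ in the last digit from figures obtained with other methods.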

There are some other numerical questions to be asked about this study. Look at Table 2, which shows the mean ratings given by the students of different "races" (Rushton's term) to three "races" (Black, White, and Oriental) on various attributes. That is, on, say, "Anxiety", each student ranked Blacks, Whites, and Orientals from 1 to 3, in some order, and the means of those integers were reported.

Did I just say "means of integers"? Maybe we can use GRIM here! With only two decimal places, we can't do anything with the 211 "Non-Oriental" participants, but we can check the means of the 73 "Orientals". And when we do(*), we find that seven of the 30 means are GRIM-inconsistent; that is, they are not the result of correctly rounding an integer total score that has been divided by 73. Those means are highlighted in this copy of Table 2.
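
For anyone who wants to try the check, here is a minimal Python sketch of a GRIM test. It is not the Excel sheet or R code linked from this post; the function name `grim_consistent` is hypothetical, and the three example values are the Oriental "Intelligence" means implied by the row-total discussion in this post, used only as inputs (this is not a statement about which cells are highlighted in the table).

```python
def grim_consistent(mean, n, decimals=2):
    """True if `mean` could arise from rounding (integer total) / n."""
    scale = 10 ** decimals
    target = round(mean * scale)     # reported mean as an integer, e.g. 286
    base = (target * n) // scale     # floor of the implied total score
    for total in (base, base + 1):
        # round-half-up of scale * total / n, done in integer arithmetic
        if (2 * scale * total + n) // (2 * n) == target:
            return True
    return False


for reported in (2.86, 1.98, 1.08):
    print(reported, grim_consistent(reported, 73))

# Share of two-decimal fractional parts that are reachable when N = 73
achievable = sum(grim_consistent(h / 100, 73) for h in range(100))
print(f"{achievable} of 100 two-decimal endings are GRIM-consistent")
```

Integer arithmetic is used for the rounding step to avoid floating-point surprises; with N = 73 exact ties cannot occur, so the choice of rounding rule does not affect the result here.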

It's important to note here that seven out of 30 (23%) inconsistent means is not a lot when N=73, because if you just generate two-digit decimal values randomly with this sample size, 73% of them will be GRIM-consistent (i.e., 27% will be inconsistent) just by chance. A minute with an online binomial calculator shows that the chance of getting more than seven inconsistent means from 30 random values is about 58.5%; in other words, it's close to a toss-up.
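
The binomial figure can be checked with an exact tail sum. This is a minimal sketch, assuming that each of the 30 means is independently inconsistent with probability 0.27; `binom_tail` is a hypothetical helper. It reports the tail both for seven or more and for more than seven inconsistent means, since the two differ noticeably.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


p_inconsistent = 1 - 73 / 100        # 27 of 100 endings are unreachable
at_least_7 = binom_tail(7, 30, p_inconsistent)
more_than_7 = binom_tail(8, 30, p_inconsistent)
print(f"P(X >= 7) = {at_least_7:.3f}, P(X > 7) = {more_than_7:.3f}")
```

Under these assumptions the probability is roughly 0.74 for seven or more and roughly 0.585 for more than seven; either way, the observed count is unremarkable for random values.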

A further issue is that the totals of the three mean rankings in each row for each participant "race" do not always add up to 6.0. For example, the reported rounded Oriental rankings of Intelligence sum to 5.92, and even if these numbers had been rounded down from a mean that was 0.005 larger than the reported values in the table (i.e., 2.865, 1.985, and 1.085), the rounded row total would have been only 5.94. A similar problem affects the majority of the rows for Oriental rankings.

Of course, it is possible that either of these issues (i.e., the GRIM inconsistencies and the existence of total rankings below 6.00) could have been caused by missing values, although (a) Rushton did not report anything about completion rates and (b) some of the rankings could perhaps have been imputed very accurately (e.g., if 1 and 2 were filled in, the remaining value would be 3). It is, however, more difficult to explain how the total mean rankings by White participants for "Anxiety" and "Rule-following" came to be 6.03 and 6.02, respectively. Even if we assume that the component numbers in each case had been rounded up from a mean that was 0.005 smaller than the reported values in the table (e.g., for "Anxiety", these would be 2.145, 1.995, and 1.875), the rounded values for the row totals would be 6.02 for "Anxiety" and 6.01 for "Rule-following".
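
The rounding argument for the row totals is simple interval arithmetic, sketched below in Python. The reported values are the ones implied by the text above (each is 0.005 away from the bounds quoted there); the function name `total_bounds` is hypothetical, and `Decimal` is used so that the half-unit bounds are exact.

```python
from decimal import Decimal


def total_bounds(reported, half_unit=Decimal("0.005")):
    """(lowest, reported, highest) row total given two-decimal rounding."""
    total = sum(reported)
    slack = half_unit * len(reported)   # each mean can be off by up to 0.005
    return total - slack, total, total + slack


intelligence = [Decimal("2.86"), Decimal("1.98"), Decimal("1.08")]  # Oriental raters
anxiety = [Decimal("2.15"), Decimal("2.00"), Decimal("1.88")]       # White raters

low_i, sum_i, high_i = total_bounds(intelligence)
low_a, sum_a, high_a = total_bounds(anxiety)
print(f"Intelligence: reported {sum_i}, at most {high_i}")
print(f"Anxiety:      reported {sum_a}, at least {low_a}")
```

With complete rankings of 1, 2, and 3 the true total must be exactly 6, so a highest possible total below 6 (or a lowest possible total above 6) cannot be explained by rounding alone.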

Another thing, pointed out by James Heathers, is that Rushton claimed (p. 441) that "No differences were found on the Speed of Physical Maturation or the Speed of Sexual Maturation composites"; and indeed there are no significance stars next to these two variables in Table 1. But these groups are in fact substantially different; the reported means and SDs imply t statistics of 5.1 (p < .001, 3 significance stars) and 2.8 (p < .01, 2 significance stars), respectively.

Finally, let's take a look at the final-digit distribution of Rushton's numbers. I took the last digits of all of the means and SDs in Table 1, and all of the means in Table 2, and obtained this histogram:

Does this tell us anything? Well, we might expect the last digits of random variables to be uniformly distributed, although there could also be a small "Benford's Law" effect, particularly with the means and SDs that only have two significant figures, causing the distribution to be a little more right-skewed (i.e., with more smaller digits). We certainly don't have any reason to expect a substantially larger number of 4s, 5s, and 8s. The chi-square test here has a value of 15.907 on 9 degrees of freedom, for a p value of .069 (which might be a little lower if our expected distribution was a little heavier on the smaller numbers). Not the sort of thing you can take to the ORI on its own and demand action for, perhaps, but those peaks do look a little worrying.
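
The p value can be recovered from the test statistic without a statistics package. This is a minimal sketch that uses the closed-form upper tail of the chi-square distribution for odd degrees of freedom; the function name `chi2_sf_odd_df` is hypothetical, and the digit counts themselves are not reproduced here, only the statistic quoted above.

```python
import math


def chi2_sf_odd_df(x, df):
    """Upper-tail probability P(X > x) for chi-square with odd df."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd df")
    root = math.sqrt(x)
    tail = math.erfc(root / math.sqrt(2))   # two-sided normal tail at sqrt(x)
    term, series = root, 0.0
    for k in range(1, (df - 1) // 2 + 1):
        series += term                      # x^(k - 1/2) / (1 * 3 * ... * (2k - 1))
        term *= x / (2 * k + 1)
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * series


p_value = chi2_sf_odd_df(15.907, 9)
print(f"chi-square = 15.907, df = 9, p = {p_value:.3f}")
```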

The bottom line here is that if Dr. Rushton were alive today, I would probably be writing to him to ask for a close look at his data set.

(*) The Excel sheet and (ultra-minimal) R code for this post can be found here.

[Update 2019-07-31 08:23 UTC: Added discussion of the missing significance stars in Table 1.]

## 10 July 2019

### An open letter to Dr. Jerker Rönnberg

**** Begin update 2019-07-10 15:15 UTC ****
Dr. Rönnberg has written to me to say that he has been made aware of this post (thanks to whoever alerted him), and he has now read my e-mail.
**** End update 2019-07-10 15:15 UTC ****

At the bottom of this post is the text of an e-mail that I have now sent three times (on May 9, June 11, and June 25 of this year) to Dr. Jerker Rönnberg, who --- according to the website of the Scandinavian Journal of Psychology --- is the editor-in-chief of that journal. I have received no reply to any of these three attempts to contact Dr. Rönnberg. Nor did I receive any sort of non-delivery notification or out-of-office reply. Hence, I am making my request public here. I hope that this will not be seen as presumptuous, unprofessional, or unreasonable.

I sent the mail to two different e-mail addresses that I found listed for Dr. Rönnberg, namely sjoped@ibv.liu.se (on the journal website) and jerker.ronnberg@liu.se (on the Linköping University website). Of course, it is possible that those two addresses lead to the same mailbox.

A possibility that cannot be entirely discounted is that each of my e-mails was treated as spam, and either deleted silently by Linköping University's system on arrival, or re-routed to Dr. Rönnberg's "junk" folder. I find this rather unlikely because, even after acknowledging my bias in this respect, I do not see anything in the text of the e-mail that would trigger a typical spam filter. Additionally, when spam is deleted on arrival it is customary for the system to respond with "550 spam detected"; I would also hope that after 20 or more years of using e-mail as a daily communication tool, most people would check at least the subject lines of the messages in their "junk" folder every so often before emptying that folder. Another possibility is that Dr. Rönnberg is away on sabbatical and has omitted to put in place an out-of-office reply. Whatever the explanation, however, the situation appears to be that the editor of the Scandinavian Journal of Psychology is, de facto, unreachable by e-mail.

My frustration here is with the complete absence of any form of acknowledgement that my e-mail has even been read. If, as I presume may be the case, my e-mails were indeed delivered to Dr. Rönnberg's inbox, I would have imagined that it would not have been a particularly onerous task to reply with a message such as "I will look into this." Indeed, even a reply such as "I will not look into this, please stop wasting my time" would have been less frustrating than the current situation. It is going to be difficult for people who want to correct the scientific literature to do so if editors, who are surely the first point of contact in the publishing system, are not available to communicate with them.

I will leave it up to readers of this blog to judge whether the request that I made to Dr. Rönnberg in my e-mails is sufficiently relevant to be worthy of at least a minimal reply, and also whether it is reasonable for me to "escalate" it here in the form of an open letter. In the meantime, if any members of the Editorial Board of the Scandinavian Journal of Psychology, or any other colleagues of Dr. Rönnberg, know of a way to bring this message to his attention, I would be most grateful.

From: Nick Brown
Date: Thu, 9 May 2019 at 23:32
Subject: Concerns with an article in Scandinavian Journal of Psychology
To: <jerker.ronnberg@liu.se>
Cc: James Heathers <jamesheathers@gmail.com>

Dear Dr. Rönnberg,

I am writing to you to express some serious concerns about the article "Women’s hairstyle and men’s behavior: A field experiment" by Dr. Nicolas Guéguen, published in Scandinavian Journal of Psychology in November 2015 (doi: 10.1111/sjop.12253). My colleague James Heathers (in CC) and I have described our concerns about this article, as well as a number of other problems in Dr. Guéguen's body of published work, in a document that I have attached to this e-mail, which we made public via a blog post (https://steamtraen.blogspot.com/2017/12/a-review-of-research-of-dr-nicolas.html) in December 2017.

More recently, we have been made aware of evidence suggesting that the research described in the article was in fact entirely designed and carried out by three undergraduate students. You will find a description of this issue in our most recent blog post (https://steamtraen.blogspot.com/2019/05/an-update-on-our-examination-of.html). For your convenience, I have also attached the report that these students wrote as part of their assignment, with their names redacted. (The original version with their names visible is available, but it is spread across several files; please let me know if you need it, and I will stitch those together.)

I have two principal concerns here. First, there would seem to be a severe ethical problem when a full professor writes an empirical article describing research that was apparently designed and carried out entirely by his students, without offering them authorship or indeed giving them any form of acknowledgement in the article. Second, we believe that the results are extremely implausible (e.g., an effect size corresponding to a Cohen's d of 2.44, and a data set that contains some unlikely patterns of regularity), which in turn leads us to believe that the students may have fabricated their work, as is apparently not uncommon in Dr. Guéguen's class (cf. the comments from the former student who contacted us).

Kind regards,
Nick Brown
PhD candidate, University of Groningen

## 09 May 2019

### An update on our examination of the research of Dr. Nicolas Guéguen

(Joint post by Nick Brown and James Heathers)

It's now well over a year since we published our previous blog post about the work of Dr. Nicolas Guéguen. Things have moved on since then, so here is an update.

*** Note: We have received a reply from the Scientific Integrity Officer at the University of Rennes-2, Alexandre Serres. See the update of 2019-05-22 at the bottom of this post ***

We have seen two documents from the Scientific Integrity Officer at the University of Rennes-2, which appears to have been the institution charged with investigating the apparent problems in Dr. Guéguen's work. The first of these dates from June 2018 and is entitled (our translation from French), "Preliminary Investigation Report Regarding the Allegations of Fraud against Nicolas Guéguen".

It is unfortunate that we have been told that we are not entitled to disseminate this document further, as it is considerably more trenchant in its criticism of Dr. Guéguen's work than its successor, described in the next paragraph of this blog post. We would also like to stress that the title of this document is extremely inexact. We have not made, and do not make, any specific allegations of fraud, nor are any implied. The initial document that we released is entitled “A commentary on some articles by Dr. Nicolas Guéguen” and details a long series of inconsistencies in research methods, procedures, and data. The words “fraud” and “misconduct” do not appear in this document, nor in any of our communications with the people who helped with the investigation. We restrict ourselves to pointing out that results are “implausible” (p. 2) or that scenarios are “unlikely [to] be enacted in practice” (p. 31).

The origin of inconsistencies (be it typographical errors, inappropriate statistical methods, analytical mistakes, inappropriate data handling, misconduct, or something else) is also irrelevant to the outcome of any assessment of research. Any research object with a strong and obvious series of inconsistencies may be deemed too inaccurate to trust, irrespective of their source. In other words, the description of inconsistency makes no presumption about the source of that inconsistency.

The second document, entitled "Memorandum of Understanding Regarding the Allegations of Lack of Scientific Integrity Concerning Nicolas Guéguen", is dated October 2018, and became effective on 10 December 2018. It describes the outcome of a meeting held on 10 September 2018 between (1) Dr. Guéguen, (2) the above-mentioned Scientific Integrity Officer, (3) a representative from the University of Rennes-2 legal department, and (4) an external expert who was, according to the report, "contacted by [Brown and Heathers] at the start of their inquiry". (We are not quite certain who this last person is, although the list of candidates is quite short.)

The Memorandum of Understanding is, frankly, not very hard-hitting. Dr. Guéguen admits to some errors in his general approach to research, notably using the results of undergraduate fieldwork projects as the basis of his articles, and he agrees that within three months of the date of effect of the report, he will retract two articles: "High heels increase women's attractiveness" in Archives of Sexual Behavior (J1) and "Color and women hitchhikers’ attractiveness: Gentlemen drivers prefer red" in Color Research and Application (J2). Recall that our original report into problems with Dr. Guéguen's research listed severe deficiencies in 10 articles; the other eight are barely mentioned.

On the question of Dr. Guéguen's use of undergraduate fieldwork: We were contacted in November 2018 by a former student from Dr. Guéguen's class, who gave us some interesting information. Here are a few highlights of what this person told us (our translation from French):
I was a student on an undergraduate course in <a social science field>. ... The university where Dr. Guéguen teaches has no psychology department. ... As part of an introductory class entitled "Methodology of the social sciences", we had to carry out a field study. ... This class was poorly integrated with the rest of the course, which had nothing to do with psychology. As a result, most of the students were not very interested in this class. Plus, we were fresh out of high school, and most of us knew nothing about statistics. Because we worked without any supervision, yet the class was graded, many students simply invented their data. I can state formally that I personally fabricated an entire experiment, and I know that many others did so too. ... At no point did Dr. Guéguen suggest to us that our results might be published.
Our correspondent also sent us an example of a report of one of these undergraduate field studies. This report had been distributed to the class by Dr. Guéguen himself as an example of good work by past students, and has obvious similarities to his 2015 article "Women’s hairstyle and men’s behavior: A field experiment". It was written by a student workgroup from such an undergraduate class, who claimed to have conducted similar tests on passers-by; the most impressive of the three sets of results (on page 7 of the report) was what appeared in the published article. The published version also contains some embellishments to the experimental procedure; for example, the article states that the confederate walked "in the same direction as the participant about three meters away" (p. 638), a detail that is not present in the original report by the students. A close reading of the report, combined with our correspondent's comments about the extent of admitted fabrication of data by the students, leads us to question whether the field experiments were carried out as described (for example, it is claimed that the three students tested 270 participants between them in a single afternoon, which is extraordinarily fast progress for this type of fieldwork).

(As we mentioned in our December 2017 blog post, at one point in our investigation Dr. Guéguen sent us, via the French Psychological Society, a collection of 25 reports of field work carried out by his students. None of these corresponded to any of the articles that we critiqued. Presumably he could have sent us the report that appears to have become the article "Women’s hairstyle and men’s behavior: A field experiment", but apparently he chose not to do so. Note also that the Memorandum of Understanding does not list this article as one that Dr. Guéguen is required to retract.)

We have made a number of documents available at https://osf.io/98nzj/, as follows:
• "20190509 Annotated Guéguen report and response.pdf" will probably be of most relevance to non-French-speaking readers. It contains the most relevant paragraphs of the Memorandum of Understanding, in French and (our translation) English, accompanied by our responses in English, which then became the basis of our formal response.
• "Protocole d'accord_NG_2018-11-29.pdf" is the original "Memorandum of Understanding" document, in French.
• "20181211 Réponse Brown-Heathers au protocole d'accord.pdf" is our formal response, in French, to the "Memorandum of Understanding" document.
• "20190425 NB-JH analysis of Gueguen articles.pdf" is the latest version of our original report into the problems we found in 10 articles by Dr. Guéguen.
• "Hairstyle report.pdf" is the student report of the fieldwork (in French) with a strong similarity to the article "Women’s hairstyle and men’s behavior: A field experiment", redacted to remove the names of the authors.
Alert readers will have noted that almost five months have elapsed since we wrote our response to the "Memorandum of Understanding" document. We have not commented publicly since then, because we were planning to publish this blog post in response to the first retraction of one of Dr. Guéguen's articles, which could either have been one that he was required to retract by the agreement, or one from another journal. (We are aware that at least two other journals, J3 and J4, are actively investigating multiple articles by Dr. Guéguen that they published.)

However, our patience has now run out. The two articles that Dr. Guéguen was required to retract are still untouched on the respective journals' websites, and our e-mails to the editors of those journals asking if they have received a request to retract the articles have gone unanswered (i.e., we haven't even been told to mind our own business) after several weeks and a reminder. No other journal has yet taken any action in the form of a retraction, correction, or expression of concern.

All of this leaves us dissatisfied. The Memorandum of Understanding notes on page 5 that Dr. Guéguen has 336 articles on ResearchGate published between 1999 and 2017. We have read approximately 40 of these articles, and we have concerns about the plausibility of the methods and results in a very large proportion of those. Were this affair to be considered closed after the retraction of just two articles (not including one that seems to have been published without attribution from the work of the author’s own students), it seems to us that this would leave a substantial number of serious inconsistencies unresolved.

Accordingly, we feel it would be prudent for the relevant editors of journals in psychology, marketing, consumer behaviour, and related disciplines to take action. In light of what we now know about the methods deployed to collect the student project data, we do not think it would be excessive for every article by Dr. Guéguen to be critically re-examined by one or more external reviewers.

[ Update 2019-05-09 15:03 UTC: An updated version of our comments on the Memorandum of Understanding was uploaded to fix some minor errors, and the filename listed here was changed to reflect that. ]

[ Update 2019-05-09 18:51 UTC: Fixed a couple of typos. Thanks to Jordan Anaya. ]

[ Update 2019-05-10 16:33 UTC: Fixed a couple of typos and stylistic errors. ]

[ Update 2019-05-22 15:51 UTC:
We have received a reply to our post from Alexandre Serres, who is the Scientific Integrity Officer at the University of Rennes-2. This took the form of a 3-page document (in both French and English versions) that did not fit into the comments box of a Blogger.com post, so we have made these two versions available at our OSF page. The filenames are "Réponse_billet de Brown et Heathers_2019-05-20.pdf" (in French) and "Réponse_billet de Brown et Heathers_2019-05-20 EN" (in English).

We have also added a document that was created by the university before the inquiry took place (filename "Procédure_traitement des allégations de fraude_Univ Rennes2_2018-01-31.pdf"), which established the ground rules and procedural framework for the inquiry into Dr. Guéguen's research.

We thank Alexandre Serres for these clarifications, and would only add that, while we are disappointed in the outcome of the process in terms of the very limited impact that it seems to have had on the problems that we identified in the public literature, we do not have any specific criticisms of the way in which the procedure was carried out.
]

## 01 May 2019

### The results of my crowdsourced reanalysis project

Just over a year ago, in this post, I asked for volunteers to help me reanalyze an article that I had read entirely by chance, and which seemed to have a few statistical issues. About ten people offered to help, and three of them (Jan van Rongen, Jakob van de Velde, and Matt Williams) stayed the course. Today we have released our preprint on PsyArXiv detailing what we found.

The article in question is "Is Obesity Associated with Major Depression? Results from the Third National Health and Nutrition Examination Survey" (2003) by Onyike, Crum, Lee, Lyketsos, and Eaton. This has 951 citations according to Google Scholar, making it quite an important paper in the literature on obesity and mental health. As I mentioned in my earlier blog post, I contacted the lead author, Dr. Chiadi Onyike, when I first had questions about the paper, but our correspondence petered out before anything substantial was discussed.

It turns out that most of the original problems that I thought I had found were due to me misunderstanding the method; I had overlooked that the authors had a weighted survey design. However, even within this design, we found a number of issues with the reported results. The power calculations seem to be post hoc and may not have been carried out appropriately; this makes us wonder whether the main conclusion of the article (i.e., that severe obesity is strongly associated with major depressive disorder) is well supported. There are a couple of simple transcription errors in the tables, which as a minimum seem to merit a correction. There are also inconsistencies in the sample sizes.

I should make it clear that there is absolutely no suggestion of any sort of misconduct here. Standards of reproducibility have advanced considerably since Onyike et al.'s article was published, as has our understanding of statistical power; and the remaining errors are of the type that anyone who has tried to assemble results from computer output into a manuscript will recognise.

I think that all four of us found the exercise interesting; I know I did. Everyone downloaded the publicly available dataset separately and performed their analyses independently, until we pooled the results starting in October of last year. We all did our analyses in R, whereas I had hoped for more diversity (especially if someone had used Stata, which is what the original authors used); however, this had the advantage that I was able to combine everybody's contribution into a single script file. You can find the summary of our analyses in an OSF repository (the URL for which is in the preprint).

We intend to submit the preprint for publication, initially to the American Journal of Epidemiology (where the original article first appeared). I'll post here if there are any interesting developments.

If you have something to say about the preprint, or any questions or remarks about this way of doing reanalyses, please feel free to comment!

## 19 February 2019

### Just another week in real-world science: Butler, Pentoney, and Bong (2017).

This is a joint post by Nick Brown and Stuart Ritchie. All royalty cheques arising from the post will be split between us, as will all the legal bills.

Butler, H. A., Pentoney, C., & Bong, M. P. (2017). Predicting real-world outcomes: Critical thinking ability is a better predictor of life decisions than intelligence. Thinking Skills and Creativity, 25, 38–46. http://dx.doi.org/10.1016/j.tsc.2017.06.005

We are not aware of any official publicly available copies of this article, but readers with institutional access to Elsevier journals should have no trouble in finding it, and otherwise we believe there may exist other ways to get hold of a copy using the DOI.

Butler et al.'s article received some favourable coverage when it appeared, including in Forbes, Psychology Today, the BPS Digest, and an article by the lead author in Scientific American that was picked up by the blog of the noted skeptic (especially of homeopathy) Edzard Ernst. Its premise is that the ability to think critically (measured by an instrument called the Halpern Critical Thinking Assessment, HCTA) is a better predictor than IQ (measured with a set of tests called the Intelligence Structure Battery, or INSBAT) of making life decisions that lead to negative outcomes, measured by the Real-World Outcomes (RWO) Inventory, which was described by its creator in a previous article (Butler, 2012).

In theory, we’d expect both critical thinking and IQ to act favourably to reduce negative experiences. The correlations between both predictors and the outcome in this study would thus be expected to be negative, and indeed they were. For critical thinking the correlation was −.330 and for IQ it was −.264. But is this a "significant" difference?

To test this, Butler et al. conducted a hierarchical regression, entering IQ (INSBAT) and then critical thinking (HCTA) as predictors. They concluded that, since the difference in R² when the second predictor (HCTA) was added was statistically significant, this indicated that the difference between the correlations of each predictor with the outcome (the correlation for HCTA being the larger) was also significant. But this is a mistake. On its own, the fact that the addition of a second predictor variable to a model causes a substantial increase in R² might tell us that both variables add incrementally to the prediction of the outcome, but it tells us nothing about the relative strength of the correlations between the two predictors and the outcome. This is because the change in R² is also dependent on the correlation between the two predictors (here, .380). The usual way to compare the strength of two correlations, taking into account the third variable, is to use Steiger’s z, as shown by the following R code:

> library(cocor)
> cocor.dep.groups.overlap(-.264, -.330, .380, 244, "steiger1980", alt="t")
<some lines of output omitted for brevity>
z = 0.9789, p-value = 0.3276

So the Steiger’s z test tells us that there’s no statistically significant difference between the sizes of these two (dependent) correlations in this sample, p = .328.
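
As a cross-check that does not depend on the cocor package, the same statistic can be computed by hand. What follows is a minimal Python sketch, assuming the pooled-correlation form of Steiger's (1980) z for two dependent correlations that share one variable; the function name `steiger_z` and its argument names are ours and purely illustrative.

```python
from math import atanh, sqrt
from statistics import NormalDist


def steiger_z(r_jk, r_jh, r_kh, n):
    """Compare two dependent correlations r_jk and r_jh that share variable j.

    r_kh is the correlation between the two non-shared variables,
    and n is the sample size. Returns (z, two-sided p-value).
    """
    # Pool the two correlations being compared
    r_bar = (r_jk + r_jh) / 2
    # Estimated covariance between the two Fisher-transformed correlations
    cov = (
        r_kh * (1 - 2 * r_bar**2)
        - 0.5 * r_bar**2 * (1 - 2 * r_bar**2 - r_kh**2)
    ) / (1 - r_bar**2) ** 2
    # Difference of Fisher z values, scaled by its standard error
    z = sqrt(n - 3) * (atanh(r_jk) - atanh(r_jh)) / sqrt(2 - 2 * cov)
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# Outcome-IQ, outcome-critical thinking, IQ-critical thinking, sample size
z, p = steiger_z(-0.264, -0.330, 0.380, 244)
print(f"z = {z:.4f}, p = {p:.4f}")
```

With the correlations and sample size used above, this should agree with the cocor output to the precision shown there.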

We noted a second problem, namely that the reported bivariate correlations are not compatible with the results of the regression reported in Table 2. In a multiple regression model, the standardized regression coefficients are determined (only) by the pattern of correlations between the variables, and in the case of the two-predictor regression, these coefficients can be determined by a simple formula. Using that formula, we calculated that the coefficients for INSBAT and HCTA in model 2 should be −.162 and −.268, respectively, whereas Butler et al.’s Table 2 reports them as −.158 and −.323. When we wrote to Dr. Butler in July 2017 to point out these issues, she was unable to provide us with the data set, but she did send us an SPSS output file in which neither the correlations nor the regression coefficients exactly matched the values reported in the article.
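
For reference, the textbook identity for the standardized coefficients in a two-predictor regression can be written out with the reported correlations plugged in. The subscripts are our shorthand here: $y$ for the RWO outcome, 1 for INSBAT, and 2 for HCTA.

$$
\begin{aligned}
\beta_1 &= \frac{r_{y1} - r_{y2}\,r_{12}}{1 - r_{12}^2}
         = \frac{-.264 - (-.330)(.380)}{1 - .380^2}
         = \frac{-.1386}{.8556} \approx -.162 \\
\beta_2 &= \frac{r_{y2} - r_{y1}\,r_{12}}{1 - r_{12}^2}
         = \frac{-.330 - (-.264)(.380)}{1 - .380^2}
         = \frac{-.2297}{.8556} \approx -.268
\end{aligned}
$$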

There was a very minor third problem: The coefficient of .264 in the first cell of Table 2 is missing its minus sign. (Dr. Butler also noticed that there was an issue with the significance stars in this table.)

We wrote to the two joint editors-in-chief of Thinking Skills and Creativity in November 2017. They immediately indicated that they would handle the points that we had raised with the "journal management team" (i.e., Elsevier). We found this rather surprising, as we had only raised scientific issues that we imagined would be entirely an editorial matter. Over the following year we occasionally sent out messages asking if any progress had been made. In November 2018, we were told by the Elsevier representative that following a review of the Butler et al. article by two independent reviewers who are "senior statistical experts in this field", the journal had decided to issue a correction for... the missing minus sign in Table 2. And nothing else.

We were, to say the least, somewhat disappointed by this. We wrote to ask for a copy of the report by these senior statistical experts, but received no reply (and, after more than three months, we guess we aren't going to get one). Perhaps the experts disagree with us about the relevance of Steiger's z, but the inconsistencies between the correlations and the regression coefficients are a matter of simple mathematics and the evidence of numerical discrepancies between the authors' own SPSS output and the published article is indisputable.

So apparently Butler et al.'s result will stand, and another minor urban legend with no empirical support will be added to the folklore of "forget IQ, you just have to work hard (and I can show you how for only \$499)" coaches. Of course, both of us are in favour of critical thinking. We just wish that people involved in publishing research about it were as well.

We had been planning to wait for the correction to be issued before we wrote this post, but as far as we can tell it still hasn't appeared (well over a year since we originally contacted the editors, and 19 months since we first contacted the authors). Some recent events make us believe that now would be an appropriate moment to bring this matter to public attention. Most important among these are the two new papers from Ben Goldacre and his team, showing what (a) editors and (b) researchers did when problems were pointed out in medical trial study protocols (spoiler: very often, not much). Then the inimitable James Heathers tweeted this thread expressing some of the frustrations that he (sometimes abetted by Nick) has had when trying to get editors to fix problems. And last week we also saw the case of a publisher taking a ridiculous amount of time to retract an article that was published in one of their journals after it had been stolen, accompanied by an editorial note of the "move along, nothing to see here" variety.

There seems to be a real problem with academic editors, especially those at the journals of certain publishers, being reluctant, unwilling, or unable to take action on even the simplest problems without the approval of the publisher, whose evaluation of the situation may be based as much on the need to save face as to correct the scientific record.

A final anecdote: One of us (Nick) has been told of a case where the editor would like to retract at least two fraudulent articles but is waiting for the publisher (not Elsevier, in that case) to determine whether the damage to their reputation caused by retracting would be greater than that caused by not retracting. Is this really the kind of consideration to which we want the scientific literature held hostage?

References

Butler, H. A. (2012). Halpern critical thinking assessment predicts real-world outcomes of critical thinking. Applied Cognitive Psychology, 26, 721–729. http://dx.doi.org/10.1002/acp.2851