30 July 2019

Applying some error detection techniques to controversial past research: Rushton (1992)

A few days ago, James Heathers and I were cc'd in on a Twitter thread about the psychologist J. Philippe Rushton.

Neither of us had ever heard of Rushton before. At one point James tweeted this extract (which I later worked out was from this critical article), which included citations of authors pointing out "various technical errors in Rushton's procedures and theory".
I thought it might be interesting to look at an example of these "technical errors", so I picked one of those citations more or less at random (Cain & Vanderwolf, 1990, since it seemed likely to be easy to find on Google with only the authors' names), and downloaded both that article and Rushton's response to it. The latter was interesting because, although it was not an empirical article, it cited a number of other articles by Rushton. So I chose the one with the most empirical-looking title, which was marked as being "In preparation", but ended up as this:
Rushton, J. P. (1992). Life-history comparisons between orientals and whites at a Canadian university. Personality and Individual Differences, 13, 439–442. http://dx.doi.org/10.1016/0191-8869(92)90072-W
I found a PDF copy of this article at a "memorial site" dedicated to Rushton's work.

Now I don't know much about this area of research ("race differences"), or the kinds of questions that Rushton was asking in his survey, but it seems to me that there are a few strange things about this article. There were 73 "Oriental" and 211 "Non-Oriental" undergraduate participants (the latter apparently also being non-Black, non-Native American, etc., judging from the title of the article), who took first a two-hour and then a three-hour battery of tests in return for course credit. Some of these were regular psychological questionnaires, but then it all got a bit... biological (pp. 439–440):
In the first session, lasting 2 hr, Ss completed a full-length intelligence test, the Multidimensional Aptitude Battery (Jackson, 1984). In the second session, lasting 3 hr, Ss completed the Eysenck Personality Questionnaire (Eysenck & Eysenck, 1975); the Sexual Opinion Survey (Fisher, Byrne, White & Kelley, 1988), the Self-Report Delinquency Scale (Rushton & Chrisjohn, 1981), and the Seriousness of Illness Rating Scale (Wyler, Masuda & Holmes, 1968), as well as self-report items assessing aspects of health, speed of maturation, sexual behaviour, and other life-history variables, many of which were similar to those used by Bogaert and Rushton (1989). Sex-combined composites were formed from many of these items: Family Health included health ratings of various family members; Family Longevity included longevity ratings for various family members; Speed of Physical Maturation included age of puberty, age of pubic hair growth, age of menarche (for females), and age of first shaving (for males); Speed of Sexual Maturation included age of first masturbation, age of first petting, and age of first sexual intercourse; Reproductive Effort-Structures included size of genitalia, menstrual cycle length (for females), and amount of ejaculate (for males); Reproductive Effort-Behavioural included maximum number of orgasms in one 24 hr period, average number of orgasms per week, and maximum number of sexual partners in one month; and Family Altruism included parental marital status and self-ratings of altruism to family. Each S also rank ordered Blacks, Orientals, and Whites on several dimensions.
Whoa. Back up a bit there... (emphasis added)
Reproductive Effort-Structures included size of genitalia, menstrual cycle length (for females), and amount of ejaculate (for males)
The second and third of those variables are specified as being sex-specific, but the first, "size of genitalia", is not, suggesting that it was reported by both men and women. Now, while most men have probably placed a ruler along their erect penis at some point, and might be prepared to report the result with varying degrees of desirability bias, I wonder how one measures "size of genitalia" in human females, not just in general, but specifically in the context of a group of people sitting in a room completing questionnaires. Similarly, I very much doubt that many of the men who had just put down their penis-measuring device then proceeded to ejaculate into a calibrated test tube and commit the resulting number of millilitres to memory alongside the result from the ruler; in any case, it would seem challenging to record this number (which, I suspect, is quite variable within subjects) accurately in a lecture hall or other large space at a university where this type of study might take place.

I also have some doubts about some of the reported numbers. For example (p. 441):
At the item level ... the reported percentage frequency of reaching orgasm in each act of intercourse was 77% for Oriental males, 88% for White males, 40% for Oriental females, and 57% for White females.
Again, I'm not a sex researcher, but my N=1, first-hand experience of having been a healthy male undergraduate (full disclosure: this was mostly during the Carter administration) is that a typical frequency of reaching orgasm during intercourse is quite a lot higher than 88%. I checked with a sex researcher (who asked for their identity not to be used), and they told me that these appear to be exceptionally low rates for sexually functional young men in Canada, unless the question had been asked in an ambiguous way, such as "Did you finish?". (They also confirmed that measures of the dimensions of women's genitalia are nearly non-existent.)

Rushton also stated (p. 441) that "small positive correlations were found between head size and
general intelligence in both the Oriental (r = 0.14) and White samples (r = 0.21)"; indeed, he added in the Discussion section (p. 441) that "It is worth drawing attention to our replication of the head size-IQ relationship within an Oriental sample". However, with a sample size of 73, the 95% confidence interval around an r value of .14 is (-.09, .37), which many researchers might not regard as indicative of any sort of replication.
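For anyone who wants to check that interval, here is a minimal Python sketch of the standard Fisher z approximation (other methods can differ slightly in the second decimal of the bounds, but any reasonable method puts the lower bound below zero, which is the point that matters for the "replication" claim):

```python
import math

def r_ci(r, n, z_crit=1.96):
    """Approximate 95% CI for a correlation via Fisher's z transformation."""
    z = math.atanh(r)               # transform r to an approximately normal scale
    se = 1 / math.sqrt(n - 3)       # standard error of z
    lo, hi = z - z_crit * se, z + z_crit * se
    return math.tanh(lo), math.tanh(hi)   # back-transform to the r scale

lo, hi = r_ci(0.14, 73)   # the interval straddles zero
```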

There are some other numerical questions to be asked about this study. Look at Table 2, which shows the mean ratings given by the students of different "races" (Rushton's term) to three "races" (Black, White, and Oriental) on various attributes. That is, on, say, "Anxiety", each student ranked Blacks, Whites, and Orientals from 1 to 3, in some order, and the means of those integers were reported.

Did I just say "means of integers"? Maybe we can use GRIM here! With only two decimal places, we can't do anything with the 211 "Non-Oriental" participants, but we can check the means of the 73 "Orientals". And when we do(*), we find that seven of the 30 means are GRIM-inconsistent; that is, they are not the result of correctly rounding any integer total score divided by 73. Those means are highlighted in this copy of Table 2.
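For those who would like to try this at home, the GRIM check itself is tiny. Here is a minimal Python sketch (my actual code for the post is an ultra-minimal R script); the example values 2.86 and 1.98 are the reported Oriental mean rankings from the Intelligence row of Table 2:

```python
import math

def grim_consistent(mean, n, decimals=2):
    """Could `mean`, reported to `decimals` places, be a mean of n integers?"""
    half_ulp = 0.5 / 10 ** decimals           # maximum legitimate rounding error
    # Only the two nearest achievable means (floor and ceiling totals) need checking.
    for total in (math.floor(mean * n), math.ceil(mean * n)):
        if abs(total / n - mean) <= half_ulp + 1e-9:
            return True
    return False

# 2.86 is achievable (209/73 = 2.8630...), but 1.98 falls between
# 144/73 = 1.9726 and 145/73 = 1.9863: no integer total rounds to it.
```

Incidentally, with n = 73 and two decimal places, exactly 73 of the 100 possible two-digit endings are achievable, which is where the 73%-consistent-by-chance figure comes from.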

It's important to put these seven inconsistent means out of 30 (23%) in context. Any mean correctly computed from 73 integer responses must pass the GRIM test, so in principle even one inconsistency is a problem. On the other hand, if you just generate two-digit decimal values randomly with this sample size, 73% of them will be GRIM-consistent (i.e., 27% will be inconsistent) just by chance, and a minute with an online binomial calculator shows that the chance of getting seven or more inconsistent means from 30 such random values is about 74%. In other words, the pattern of inconsistencies in Table 2 is about what randomly invented numbers would be expected to produce.
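The binomial figure is easy to verify directly. Note that "seven or more" (P(X ≥ 7)) and "more than seven" (P(X ≥ 8)) are noticeably different tail probabilities here, and it is easy to compute the wrong one in an online calculator; a minimal sketch that computes both:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

seven_or_more = binom_tail(7, 30, 0.27)    # about 0.74
more_than_seven = binom_tail(8, 30, 0.27)  # about 0.59
```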

A further issue is that the totals of the three mean rankings in each row, for each participant "race", do not always add up to 6.00. For example, the reported rounded Oriental rankings of Intelligence sum to 5.92, and even if each of these numbers had been rounded down from a value 0.005 larger than the one reported in the table (i.e., 2.865, 1.985, and 1.085), the rounded row total would still have been only 5.94. A similar problem affects the majority of the rows for Oriental rankings.

Of course, it is possible that either of these issues (i.e., the GRIM inconsistencies and the existence of total rankings below 6.00) could have been caused by missing values, although (a) Rushton did not report anything about completion rates and (b) some of the missing rankings could perhaps have been imputed very accurately (e.g., if 1 and 2 were filled in, the remaining value would have to be 3). It is, however, more difficult to explain how the mean rankings by White participants for "Anxiety" and "Rule-following" came to sum to 6.03 and 6.02, respectively. Even if we assume that the component numbers in each case had been rounded up from values 0.005 smaller than those reported in the table (e.g., for "Anxiety", these would be 2.145, 1.995, and 1.875), the rounded row totals would be only 6.02 for "Anxiety" and 6.01 for "Rule-following".
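Both rounding arguments can be checked mechanically: three means each reported to two decimal places pin down their true total to within ±0.015. A minimal Python sketch, using the component values implied by the discussion above (2.86 + 1.98 + 1.08 = 5.92 for the Oriental Intelligence row; 2.15 + 2.00 + 1.88 = 6.03 for the White Anxiety row):

```python
def total_bounds(rounded, decimals=2):
    """Interval containing the true total of values reported after rounding."""
    half = 0.5 / 10 ** decimals        # each component is off by at most this much
    s = sum(rounded)
    return s - len(rounded) * half, s + len(rounded) * half

lo_int, hi_int = total_bounds([2.86, 1.98, 1.08])  # (5.905, 5.935)
lo_anx, hi_anx = total_bounds([2.15, 2.00, 1.88])  # (6.015, 6.045)
```

Neither interval contains 6.00, so no amount of within-rounding wiggle room reconciles these rows with rankings that sum to 6 for every participant.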

Another thing, pointed out by James Heathers, is that Rushton claimed (p. 441) that "No differences were found on the Speed of Physical Maturation or the Speed of Sexual Maturation composites"; and indeed there are no significance stars next to these two variables in Table 1. But these groups are in fact substantially different; the reported means and SDs imply t statistics of 5.1 (p < .001, 3 significance stars) and 2.8 (p < .01, 2 significance stars), respectively.
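Recomputing a t statistic like this needs no raw data, only the table's summary statistics. Here is a minimal sketch of the pooled-variance formula; the values in the example call are placeholders for illustration, not the actual numbers from Table 1:

```python
import math

def t_pooled(m1, sd1, n1, m2, sd2, n2):
    """Two-sample t statistic with pooled variance, from summary stats only."""
    sp2 = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

# Placeholder means/SDs with the study's group sizes (not Rushton's values):
t_example = t_pooled(3.2, 0.9, 73, 2.8, 0.9, 211)
```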

Finally, let's take a look at the final-digit distribution of Rushton's numbers. I took the last digits of all of the means and SDs in Table 1, and all of the means in Table 2, and obtained this histogram:

Does this tell us anything? Well, we might expect the last digits of random variables to be uniformly distributed, although there could also be a small "Benford's Law" effect, particularly with the means and SDs that only have two significant figures, causing the distribution to be a little more right-skewed (i.e., with more smaller digits). We certainly don't have any reason to expect a substantially larger number of 4s, 5s, and 8s. The chi-square test here has a value of 15.907 on 9 degrees of freedom, for a p value of .069 (which might be a little lower if our expected distribution was a little heavier on the smaller numbers). Not the sort of thing you can take to the ORI on its own and demand action for, perhaps, but those peaks do look a little worrying.
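Anyone wanting to repeat this exercise on the tables can compute the statistic in a few lines. A minimal sketch; the digit counts in the example are hypothetical, not the counts from my histogram:

```python
def chi2_uniform(counts):
    """Chi-square goodness-of-fit statistic against a uniform distribution."""
    expected = sum(counts) / len(counts)
    return sum((o - expected) ** 2 / expected for o in counts)

# Hypothetical last-digit counts for digits 0-9, for illustration only:
stat = chi2_uniform([6, 4, 5, 9, 10, 3, 4, 9, 5, 5])
# With 10 digit categories there are 9 degrees of freedom; the 5% critical
# value is 16.92, which is why a statistic of 15.907 gives p = .069.
```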

The bottom line here is that if Dr. Rushton were alive today, I would probably be writing to him to ask for a close look at his data set.

(*) The Excel sheet and (ultra-minimal) R code for this post can be found here.

[Update 2019-07-31 08:23 UTC: Added discussion of the missing significance stars in Table 1.]

10 July 2019

An open letter to Dr. Jerker Rönnberg

**** Begin update 2019-07-10 15:15 UTC ****
Dr. Rönnberg has written to me to say that he has been made aware of this post (thanks to whoever alerted him), and he has now read my e-mail.
**** End update 2019-07-10 15:15 UTC ****

At the bottom of this post is the text of an e-mail that I have now sent three times (on May 9, June 11, and June 25 of this year) to Dr. Jerker Rönnberg, who --- according to the website of the Scandinavian Journal of Psychology --- is the editor-in-chief of that journal. I have received no reply to any of these three attempts to contact Dr. Rönnberg. Nor did I receive any sort of non-delivery notification or out-of-office reply. Hence, I am making my request public here. I hope that this will not be seen as presumptuous, unprofessional, or unreasonable.

I sent the mail to two different e-mail addresses that I found listed for Dr. Rönnberg, namely sjoped@ibv.liu.se (on the journal website) and jerker.ronnberg@liu.se (on the Linköping University website). Of course, it is possible that those two addresses lead to the same mailbox.

A possibility that cannot be entirely discounted is that each of my e-mails was treated as spam, and either deleted silently by Linköping University's system on arrival, or re-routed to Dr. Rönnberg's "junk" folder. I find this rather unlikely because, even after acknowledging my bias in this respect, I do not see anything in the text of the e-mail that would trigger a typical spam filter. Additionally, when spam is deleted on arrival it is customary for the system to respond with "550 spam detected"; I would also hope that after 20 or more years of using e-mail as a daily communication tool, most people would check at least the subject lines of the messages in their "junk" folder every so often before emptying that folder. Another possibility is that Dr. Rönnberg is away on sabbatical and has omitted to put in place an out-of-office reply. Whatever the explanation, however, the situation appears to be that the editor of the Scandinavian Journal of Psychology is, de facto, unreachable by e-mail.

My frustration here is with the complete absence of any form of acknowledgement that my e-mail has even been read. If, as I presume may be the case, my e-mails were indeed delivered to Dr. Rönnberg's inbox, I would have imagined that it would not have been a particularly onerous task to reply with a message such as "I will look into this." Indeed, even a reply such as "I will not look into this, please stop wasting my time" would have been less frustrating than the current situation. It is going to be difficult for people who want to correct the scientific literature to do so if editors, who are surely the first point of contact in the publishing system, are not available to communicate with them.

I will leave it up to readers of this blog to judge whether the request that I made to Dr. Rönnberg in my e-mails is sufficiently relevant to be worthy of at least a minimal reply, and also whether it is reasonable for me to "escalate" it here in the form of an open letter. In the meantime, if any members of the Editorial Board of the Scandinavian Journal of Psychology, or any other colleagues of Dr. Rönnberg, know of a way to bring this message to his attention, I would be most grateful.

From: Nick Brown <nicholasjlbrown@gmail.com>
Date: Thu, 9 May 2019 at 23:32
Subject: Concerns with an article in Scandinavian Journal of Psychology
To: <jerker.ronnberg@liu.se>
Cc: James Heathers <jamesheathers@gmail.com>

Dear Dr. Rönnberg,

I am writing to you to express some serious concerns about the article "Women's hairstyle and men's behavior: A field experiment" by Dr. Nicolas Guéguen, published in Scandinavian Journal of Psychology in November 2015 (doi: 10.1111/sjop.12253). My colleague James Heathers (in CC) and I have described our concerns about this article, as well as a number of other problems in Dr. Guéguen's body of published work, in a document that I have attached to this e-mail, which we made public via a blog post (https://steamtraen.blogspot.com/2017/12/a-review-of-research-of-dr-nicolas.html) in December 2017.

More recently, we have been made aware of evidence suggesting that the research described in the article was in fact entirely designed and carried out by three undergraduate students. You will find a description of this issue in our most recent blog post (https://steamtraen.blogspot.com/2019/05/an-update-on-our-examination-of.html). For your convenience, I have also attached the report that these students wrote as part of their assignment, with their names redacted. (The original version with their names visible is available, but it is spread across several files; please let me know if you need it, and I will stitch those together.)

I have two principal concerns here. First, there would seem to be a severe ethical problem when a full professor writes an empirical article describing research that was apparently designed and carried out entirely by his students, without offering them authorship or indeed giving them any form of acknowledgement in the article. Second, we believe that the results are extremely implausible (e.g., an effect size corresponding to a Cohen's d of 2.44, and a data set that contains some unlikely patterns of regularity), which in turn leads us to believe that the students may have fabricated their work, as is apparently not uncommon in Dr. Guéguen's class (cf. the comments from the former student who contacted us).

The decision about what, if anything, should be done about this situation is of course entirely in your hands. Please do not hesitate to ask if you require any further information.

Kind regards,
Nick Brown
PhD candidate, University of Groningen