30 July 2019

Applying some error detection techniques to controversial past research: Rushton (1992)

A few days ago, James Heathers and I were cc'd in on a Twitter thread.

Neither of us had ever heard of J. Philippe Rushton before. At one point James tweeted this extract (which I later worked out was from this critical article), which included citations of authors pointing out "various technical errors in Rushton's procedures and theory".
I thought it might be interesting to look at an example of these "technical errors", so I picked one of those citations more or less at random (Cain & Vanderwolf, 1990, since it seemed like it would be easy to find on Google with only the authors' names), and downloaded both that article and Rushton's response to it. The latter was interesting because, although not an empirical article, it cited a number of other articles by Rushton. So I chose the one with the most empirical-looking title, which was marked as being "In preparation", but ended up as this:
Rushton, J. P. (1992). Life-history comparisons between orientals and whites at a Canadian university. Personality and Individual Differences, 13, 439–442. http://dx.doi.org/10.1016/0191-8869(92)90072-W
I found a PDF copy of this article at a "memorial site" dedicated to Rushton's work.

Now I don't know much about this area of research ("race differences"), or the kinds of questions that Rushton was asking in his survey, but it seems to me that there are a few strange things about this article. There were 73 "Oriental" and 211 "Non-Oriental" undergraduate participants (the latter apparently also being non-Black, non-Native American, etc., judging from the title of the article), who took first a two-hour and then a three-hour battery of tests in return for course credit. Some of these were regular psychological questionnaires, but then it all got a bit... biological (pp. 439–440):
In the first session, lasting 2 hr, Ss completed a full-length intelligence test, the Multidimensional Aptitude Battery (Jackson, 1984). In the second session, lasting 3 hr, Ss completed the Eysenck Personality Questionnaire (Eysenck & Eysenck, 1975); the Sexual Opinion Survey (Fisher, Byrne, White & Kelley, 1988), the Self-Report Delinquency Scale (Rushton & Chrisjohn, 1981), and the Seriousness of Illness Rating Scale (Wyler, Masuda & Holmes, 1968), as well as self-report items assessing aspects of health, speed of maturation, sexual behaviour, and other life-history variables, many of which were similar to those used by Bogaert and Rushton (1989). Sex-combined composites were formed from many of these items: Family Health included health ratings of various family members; Family Longevity included longevity ratings for various family members: Speed of Physical Maturation included age of puberty, age of pubic hair growth, age of menarche (for females), and age of first shaving (for males); Speed of Sexual Maturation included age of first masturbation, age of first petting, and age of first sexual intercourse; Reproductive Effort-Structures included size of genitalia, menstrual cycle length (for females), and amount of ejaculate (for males); Reproductive Effort-Behavioural included maximum number of orgasms in one 24 hr period, average number of orgasms per week, and maximum number of sexual partners in one month; and Family Altruism included parental marital status and self-ratings of altruism to family. Each S also rank ordered Blacks, Orientals, and Whites on several dimensions.
Whoa. Back up a bit there... (emphasis added)
Reproductive Effort-Structures included size of genitalia, menstrual cycle length (for females), and amount of ejaculate (for males)
The second and third of those variables are specified as being sex-specific, but the first, "size of genitalia", is not, suggesting that it was reported by men and women. Now, while most men have probably placed a ruler along their erect penis at some point, and might be prepared to report the result with varying degrees of desirability bias, I'm wondering how one measures "size of genitalia" in human females, not just in general, but also in the specific context of a bunch of people sitting in a room completing questionnaires. Similarly, I very much doubt if many of the men who had just put down their penis-measuring device had also then proceeded to ejaculate into a calibrated test tube and commit the resulting number of millilitres to memory in the same way as the result that they obtained from the ruler; yet, it would again appear to be challenging to accurately record this number (which, I suspect, is probably quite variable within subjects) in a lecture hall or other large space at a university where this type of study might take place.

I also have some doubts about some of the reported numbers. For example (p. 441):
At the item level ... the reported percentage frequency of reaching orgasm in each act of intercourse was 77% for Oriental males, 88% for White males, 40% for Oriental females, and 57% for White females.
Again, I'm not a sex researcher, but my N=1, first-hand experience of having been a healthy male undergraduate (full disclosure: this was mostly during the Carter administration) is that a typical frequency of reaching orgasm during intercourse is quite a lot higher than 88%. I checked with a sex researcher (who asked for their identity not to be used), and they told me that these appear to be exceptionally low rates for sexually functional young men in Canada, unless the question had been asked in an ambiguous way, such as "Did you finish?". (They also confirmed that measures of the dimensions of women's genitalia are nearly non-existent.)

Rushton also stated (p. 441) that "small positive correlations were found between head size and general intelligence in both the Oriental (r = 0.14) and White samples (r = 0.21)"; indeed, he added in the Discussion section (p. 441) that "It is worth drawing attention to our replication of the head size-IQ relationship within an Oriental sample". However, with a sample size of 73, the 95% confidence interval around an r value of .14 is (−.09, .37). Because that interval includes zero, many researchers might not regard it as indicative of any sort of replication.
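For anyone who wants to check that interval, here is a minimal Python sketch using the Fisher z transformation. That is one common method and not necessarily the one used for the figures in this post; it gives an upper bound of about .36 rather than .37, but the lower bound is below zero either way. The function name `pearson_r_ci` is just for illustration.

```python
import math

def pearson_r_ci(r, n, z_crit=1.959964):
    """Approximate 95% CI for a Pearson correlation via Fisher's z.

    Assumes bivariate normality and n > 3; z_crit is the two-sided
    95% critical value of the standard normal distribution.
    """
    z = math.atanh(r)              # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)    # standard error on the z scale
    lower = math.tanh(z - z_crit * se)
    upper = math.tanh(z + z_crit * se)
    return lower, upper

lower, upper = pearson_r_ci(0.14, 73)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

The same function applied to r = 0.21 would need the size of the White subsample with head-size data, which may differ from 211.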

There are some other numerical questions to be asked about this study. Look at Table 2, which shows the mean ratings given by the students of different "races" (Rushton's term) to three "races" (Black, White, and Oriental) on various attributes. That is, on, say, "Anxiety", each student ranked Blacks, Whites, and Orientals from 1 to 3, in some order, and the means of those integers were reported.

Did I just say "means of integers"? Maybe we can use GRIM here! With only two decimal places, we can't do anything with the 211 "Non-Oriental" participants, because with more than 100 participants every two-decimal mean is attainable. But we can check the means of the 73 "Orientals". And when we do(*), we find that seven of the 30 means are GRIM-inconsistent; that is, they are not the result of correctly rounding an integer total score that has been divided by 73. Those means are highlighted in this copy of Table 2.
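The GRIM check itself is simple enough to sketch in a few lines of Python. This is an illustrative version, not the R code linked at the end of the post. It assumes means reported to exactly two decimals, integer item scores, and no missing data. The three example values are the Intelligence rankings implied by the row-sum example further down (2.86, 1.98, 1.08), on the assumption that they come from the 73 Oriental raters.

```python
from decimal import Decimal

def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if a two-decimal mean can arise from n integer scores."""
    hundredths = int(Decimal(reported_mean) * 100)   # e.g. "2.86" -> 286
    base = hundredths * n // 100                     # rough integer total
    for total in range(base - 1, base + 3):          # nearby candidate totals
        # Round total / n to two decimals (half up) using integer arithmetic.
        # With n = 73 an exact tie is impossible, so the rounding rule
        # does not matter here.
        if (200 * total + n) // (2 * n) == hundredths:
            return True
    return False

for mean in ("2.86", "1.98", "1.08"):
    print(mean, grim_consistent(mean, 73))
```

Under those assumptions, 2.86 and 1.08 pass and 1.98 fails: the nearest attainable means are 144/73 = 1.97 and 145/73 = 1.99.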

It's important to note here that seven out of 30 (23%) inconsistent means is not a lot when N=73, because if you just generate two-digit decimal values randomly with this sample size, 73% of them will be GRIM-consistent (i.e., 27% will be inconsistent) just by chance. A minute with an online binomial calculator shows that the chance of getting more than seven inconsistent means from 30 random values is about 58.5% (and about 74% for seven or more); in other words, it's close to a toss-up.
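Those binomial figures can be reproduced without an online calculator. The sketch below assumes that each of the 30 means is independently inconsistent with probability 0.27; `binom_tail` is an illustrative helper, not part of any library.

```python
from math import comb

def binom_tail(k_min: int, n: int, p: float) -> float:
    """P(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(k_min, n + 1))

N_MEANS = 30           # number of means checked in Table 2
P_INCONSISTENT = 0.27  # share of two-decimal values unattainable when N = 73

print(f"P(7 or more)  = {binom_tail(7, N_MEANS, P_INCONSISTENT):.3f}")
print(f"P(more than 7) = {binom_tail(8, N_MEANS, P_INCONSISTENT):.3f}")
```

With these inputs the tail probability is roughly 0.74 for seven or more and roughly 0.585 for more than seven, so a count of seven is about what random two-decimal values would produce.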

A further issue is that the totals of the three mean rankings in each row for each participant "race" do not always add up to 6.0, as they must if every participant assigned the ranks 1, 2, and 3. For example, the reported rounded Oriental rankings of Intelligence sum to 5.92, and even if these numbers had been rounded down from a mean that was 0.005 larger than the reported values in the table (i.e., 2.865, 1.985, and 1.085), the rounded row total would have been only 5.94. A similar problem affects the majority of the rows for Oriental rankings.

Of course, it is possible that either of these issues (i.e., the GRIM inconsistencies and the existence of total rankings below 6.00) could have been caused by missing values, although (a) Rushton did not report anything about completion rates and (b) some of the rankings could perhaps have been imputed very accurately (e.g., if 1 and 2 were filled in, the remaining value would be 3). It is, however, more difficult to explain how the total mean rankings by White participants for "Anxiety" and "Rule-following" came to have means of 6.03 and 6.02, respectively. Even if we assume that the component numbers in each case had been rounded up from a mean that was 0.005 smaller than the reported values in the table (e.g., for "Anxiety", these would be 2.145, 1.995, and 1.875), the rounded values for the row totals would be 6.02 for "Anxiety" and 6.01 for "Rule-following".
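The rounding argument in the last two paragraphs can be written as a small check: each reported value can be off by at most 0.005 from the true mean, so three values leave at most 0.015 of slack in the row total. The sketch below uses the reported values implied by the two worked examples above; the helper names are illustrative.

```python
from decimal import Decimal

HALF_UNIT = Decimal("0.005")  # maximum rounding error of a two-decimal value

def sum_bounds(reported):
    """Lowest and highest possible sum of the unrounded means."""
    values = [Decimal(v) for v in reported]
    total = sum(values)
    slack = HALF_UNIT * len(values)
    return total - slack, total + slack

def could_sum_to(reported, target="6"):
    """True if rounding alone could reconcile the row with the target sum."""
    low, high = sum_bounds(reported)
    return low <= Decimal(target) <= high

rows = {
    "Intelligence (Oriental raters)": ("2.86", "1.98", "1.08"),
    "Anxiety (White raters)": ("2.15", "2.00", "1.88"),
}
for label, row in rows.items():
    low, high = sum_bounds(row)
    print(f"{label}: sum in [{low}, {high}], "
          f"compatible with 6: {could_sum_to(row)}")
```

Neither row can be reconciled with a total of 6 by rounding alone, which is the point made in the text; missing responses could still explain totals below 6, but not totals above it.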

Another thing, pointed out by James Heathers, is that Rushton claimed (p. 441) that "No differences were found on the Speed of Physical Maturation or the Speed of Sexual Maturation composites"; and indeed there are no significance stars next to these two variables in Table 1. But these groups are in fact substantially different; the reported means and SDs imply t statistics of 5.1 (p < .001, 3 significance stars) and 2.8 (p < .01, 2 significance stars), respectively.
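A t statistic can be recomputed from published means and SDs with a few lines of Python. The sketch below uses the pooled-variance (Student) formula and assumes that all 73 and 211 participants contributed to each composite, which the article does not confirm. The means and SDs in the usage line are placeholders chosen only to show the call, not values from Table 1.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Independent-samples t statistic from summary statistics (equal variances)."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se, df

# Placeholder summary statistics (hypothetical, NOT Rushton's numbers).
t, df = pooled_t(mean1=10.0, sd1=2.0, n1=73, mean2=9.0, sd2=2.0, n2=211)
print(f"t({df}) = {t:.2f}")
```

With group sizes of 73 and 211, any |t| above roughly 2 is significant at the .05 level, so statistics of 5.1 and 2.8 would normally earn stars.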

Finally, let's take a look at the final-digit distribution of Rushton's numbers. I took the last digits of all of the means and SDs in Table 1, and all of the means in Table 2, and obtained this histogram:

Does this tell us anything? Well, we might expect the last digits of random variables to be uniformly distributed, although there could also be a small "Benford's Law" effect, particularly with the means and SDs that only have two significant figures, causing the distribution to be a little more right-skewed (i.e., with more smaller digits). We certainly don't have any reason to expect a substantially larger number of 4s, 5s, and 8s. The chi-square test here has a value of 15.907 on 9 degrees of freedom, for a p value of .069 (which might be a little lower if our expected distribution was a little heavier on the smaller numbers). Not the sort of thing you can take to the ORI on its own and demand action for, perhaps, but those peaks do look a little worrying.
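The p value quoted above can be checked with the Python standard library alone. The sketch below computes a chi-square statistic against a uniform expectation and then the upper-tail probability using the closed-form expression that exists for odd degrees of freedom. The digit counts from the histogram are not reproduced here, so only the reported statistic is plugged in; both function names are illustrative.

```python
import math

def chi_square_uniform(counts):
    """Chi-square statistic for observed counts against a uniform expectation."""
    expected = sum(counts) / len(counts)
    return sum((c - expected) ** 2 / expected for c in counts)

def chi2_sf_odd_df(x, df):
    """Upper-tail probability of a chi-square variable, odd df only."""
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    chi = math.sqrt(x)
    tail = math.erfc(chi / math.sqrt(2))   # the df = 1 case
    series, term = 0.0, chi                # term = chi^(2r-1) / (1*3*...*(2r-1))
    for r in range(1, (df - 1) // 2 + 1):
        series += term
        term *= x / (2 * r + 1)
    return tail + math.sqrt(2 / math.pi) * math.exp(-x / 2) * series

# Reported statistic for the ten final digits (9 degrees of freedom).
print(f"p = {chi2_sf_odd_df(15.907, 9):.3f}")
```

To test against a slightly digit-heavy expectation instead of a uniform one, `chi_square_uniform` would need to take a vector of expected counts; that variation is left out here because the post does not specify one.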

The bottom line here is that if Dr. Rushton were alive today, I would probably be writing to him to ask for a close look at his data set.

(*) The Excel sheet and (ultra-minimal) R code for this post can be found here.

[Update 2019-07-31 08:23 UTC: Added discussion of the missing significance stars in Table 1.]

1 comment:

  1. The long Methods segment you quote has other weird entries too. Maximum orgasms per 24 hour period? Average orgasms per week? Age of first petting? Age of pubic hair growth? Does anyone think asking a bunch of undergraduates these questions would produce signal-containing data? And no, combining a bunch of garbage variables into a composite garbage scale probably won't help!