04 July 2016

Old stereotypes

For an assortment of reasons, I found myself reading this article one day: This Old Stereotype: The Pervasiveness and Persistence of the Elderly Stereotype by Amy J.C. Cuddy, Michael I. Norton, and Susan T. Fiske (Journal of Social Issues, 2005).

The premise was (roughly) that elderly people are stereotyped as "warmer" to the extent that they are also perceived as incompetent (as in "Grandma's adorable, but she is a bit doddery").  The authors wrote:

We might expect a competent elderly person to be seen as less warm than a reassuringly incompetent elderly person. The open question is whether this predicted loss of warmth is offset by increases in perceived competence, or whether efforts to gain competence may backfire, decreasing rated warmth without corresponding benefits in competence(*).

The experimental scenario was fairly simple.  There were 55 participants in three conditions.  In the Control condition, participants read a neutral story about an elderly man, named George.  In the High Incompetence (hereafter, just High) condition, the story had extra information suggesting George was rather forgetful.  In the Low Incompetence (hereafter, just Low) condition, by contrast, the story had extra information suggesting George had a pretty good memory for his age.   The dependent variable was a rating of how warmly participants felt towards George: whether they thought he was warm, friendly, and good-natured.  Each of those was measured on a 1-9 scale.

Here is the results section:

[image of the article's results section]

Let's see.  The three warmth ratings were averaged, and then a one-way ANOVA was performed.  This was statistically significant, but of course that doesn't tell us exactly where the differences are coming from.  You might expect to see this investigated with standard ANOVA post-hoc tests (such as Tukey's HSD), but in this case, the authors apparently chose to report simple t tests --- "Paired comparisons" (**) --- comparing the groups.  Between High and Low, the t value was reported as 5.03, and between High and Control, it was 11.14.  These values are always going to be statistically significant; for 5.03 with 35 dfs this is a p of around .00001 and for 11.14 with 34 dfs, the p value is bordering on the homeopathic, certainly far below .00000001.
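
If you want to check those p values yourself, here is a quick sketch in Python (assuming you have scipy installed):

  # Two-tailed p values for the reported t statistics and dfs.
  from scipy import stats

  for t_value, df in [(5.03, 35), (11.14, 34)]:
      p = 2 * stats.t.sf(t_value, df)   # upper-tail probability, doubled
      print("t(%d) = %.2f: p = %.2e" % (df, t_value, p))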

Hold on a minute.  The overall one-way ANOVA across the three conditions was just about significant at p < .03, but two of the three possible t tests were slam-dunk certainties?  That doesn't feel right.

Let's plug those means and SDs into a t test calculator.  There are several available online (e.g., this one), or you can build your own in a few seconds with Excel: put the means in A1 and B1, the Ns in C1 and D1, the SDs in E1 and F1, and then put this formula in G1:
  =(A1-B1)/SQRT((E1*E1/C1)+(F1*F1/D1))
(Strictly speaking, that formula gives the unpooled-variance t statistic, the one used by Welch's test, rather than Student's pooled version; the two coincide when the group sizes are equal, and they are very close here.  Computing p values is left as an exercise for the reader, as is the Welch-Satterthwaite adjustment to the dfs.)
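
If Excel isn't your thing, scipy can do the same job directly from summary statistics.  Here's a sketch; the SDs below are placeholders (I'm not reproducing the article's values here), so substitute the reported ones:

  # Two-sample t test from summary statistics alone.
  from scipy import stats

  m1, sd1, n1 = 7.47, 1.00, 18   # High condition (SD is a placeholder)
  m2, sd2, n2 = 6.85, 1.00, 19   # Low condition (SD is a placeholder)

  # equal_var=True gives the pooled-variance (Student's) test;
  # equal_var=False gives Welch's test.
  t, p = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=True)
  print(t, p)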

Before we can run our t test, though, we need the sizes of each sample.  We know that nHigh + nLow + nControl equals 55.  Also, the t test for High/Low had 35 dfs, meaning nHigh + nLow equals 37, and the t test for High/Control had 34 dfs, meaning nHigh + nControl equals 36.  Adding those two pairwise totals gives 2 x nHigh + nLow + nControl = 73, and subtracting the overall total of 55 leaves nHigh = 18; hence nLow = 19 and nControl = 18.
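
The same arithmetic, as a quick Python check:

  # Each pairwise t test has df = n1 + n2 - 2, so each pairwise total is df + 2.
  total, n_hl, n_hc = 55, 35 + 2, 34 + 2
  n_high = n_hl + n_hc - total   # nHigh is counted in both pairs: 18
  n_low = n_hl - n_high          # 19
  n_control = n_hc - n_high      # 18
  print(n_high, n_low, n_control)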

OK, now we can do our calculations.  Here's what we get:
High/Low: t(35) = 1.7961, p = .0811
High/Control: t(34) = 3.2874, p = .0024
Low/Control: t(35) = 0.7185, p = .4772 (just for completeness)

So there is no statistically significant difference between the High and Low conditions.  And, while the High/Control comparison is significant, its t statistic (3.29, versus the reported 11.14) is far smaller than claimed.  If you ran this experiment, you might conclude that the intervention was maybe doing something, but it's not clear what.  Certainly, the authors' conclusions seem to need substantial revision.

But wait... there's more.  (Alert readers will recognise some of the ideas in what follows from our GRIM preprint).

Remember our sample sizes: nHigh = 18, nLow = 19, nControl = 18.  And the measure of warmth was the mean of three items on a 1-9 scale, so each participant's score is a multiple of one-third.  The possible total warmth scores, summed across the participants, were therefore (18.000, 18.333, 18.666, ..., 161.666, 162.000) for High and Control, and (19.000, 19.333, 19.666, ..., 170.666, 171.000) for Low.
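
In code, the grid of possible totals for a group looks like this (an illustration using exact fractions, so that no floating-point fuzz creeps in):

  from fractions import Fraction

  n = 18   # use 19 for the Low group
  # Each participant contributes k/3 for some integer k between 3 and 27,
  # so the achievable group totals are all multiples of 1/3 from n to 9n.
  totals = [Fraction(k, 3) for k in range(3 * n, 27 * n + 1)]
  print(float(totals[0]), float(totals[1]), float(totals[-1]))   # 18.0 18.33... 162.0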

Now, the mean of the High scores was reported as 7.47.  Multiply that by 18 and you get 134.46.  Of course, 7.47 was probably rounded, so we need to look at what it could have been rounded from.  The candidate total scores either side of 134.46 are 134.333 and 134.666.  But when you divide 134.333 (recurring) by 18, you get 7.46296, which rounds (and truncates) to 7.46, not 7.47.  And when you divide 134.666 (recurring) by 18, you get 7.48148, which rounds (and truncates) to 7.48, not 7.47.

Let's look at the Low scores.  The mean was reported as 6.85.  Multiply that by 19 and you get 130.15.  Candidate total scores in that range are 130.000 and 130.333.  But when you divide 130.000 by 19, you get 6.84211, which rounds (and truncates) to 6.84, not 6.85.  And when you divide 130.333 (recurring) by 19, you get 6.85965, which rounds to 6.86.  (It could be truncated to 6.85 if you really weren't paying attention, I suppose.)

For completeness, the Control mean of 6.59 is possible: 6.59 times 18 is 118.62, and 118.666 (recurring) divided by 18 is 6.59259, which rounds and truncates to 6.59.
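
The whole check can be automated.  Here is a minimal sketch (the function name, and the assumption that the reported means were rounded conventionally to two decimals, are my own choices):

  import math
  from fractions import Fraction

  def consistent(reported_mean, n, step=Fraction(1, 3), decimals=2):
      """Is there a group total on a grid of `step` whose mean over n
      participants rounds to `reported_mean` at `decimals` places?"""
      half = Fraction(1, 2 * 10 ** decimals)
      target = Fraction(str(reported_mean))
      lo = (target - half) * n   # smallest total that could round to the mean
      hi = (target + half) * n   # largest such total
      return math.ceil(lo / step) * step <= hi

  for label, mean, n in [("High", 7.47, 18), ("Low", 6.85, 19), ("Control", 6.59, 18)]:
      print(label, consistent(mean, n))   # False, False, True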

So, given the dfs as they are reported in Cuddy et al.'s article, the two means corresponding to the experimentally manipulated conditions are necessarily incorrect.

A possible solution that allows the means to work is if the dfs of the second t test were misreported.  If you change t(34) to t(35) for the High/Control comparison, that implies nHigh = 19, nLow = 18, nControl = 18, and now all three means can be computed correctly.  But one way or another, there's yet more uncertainty here.
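
Re-running the sketch from above with those alternative sample sizes bears this out:

  for label, mean, n in [("High", 7.47, 19), ("Low", 6.85, 18), ("Control", 6.59, 18)]:
      print(label, consistent(mean, n))   # now: True, True, True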

To summarise, either:
/a/ Both of the t statistics, both of the p values, and one of the dfs in the sentence about paired comparisons are wrong;
or
/b/ "only" the t statistics and p values in that sentence are wrong, and in that case the means on which they are based must also be wrong.

And yet, the sentence about paired comparisons is pretty much the only evidence for the authors' purported effect.  Try removing that sentence from the Results section and see if you're impressed by their findings, especially if you know that the means that went into the first ANOVA are possibly wrong too.

As of today, Cuddy et al.'s article has 523 citations, according to Google Scholar; yet, presumably, none of the people citing it, nor indeed the reviewers, can have actually read it very carefully.  So I guess some of the old stereotypes are true, at least when it comes to what people say about social psychology.

(*) Note that the study design arguably did not really test any efforts by the elderly person to gain competence; it tested how participants reacted to descriptions of the person's competence by a third party, which is not quite the same thing.

(**) I presume that the term "paired comparisons" refers to the fact that the comparison was between a pair of groups in each case, e.g., High/Low or High/Control.  The authors can't have performed a paired samples t test, since the samples were independent.

[Update 2016-07-04 13:32 UTC: Thanks to Simon Columbus for his comment, pointing out the PubPeer thread on this article.  Apparently a correction has been drafted (or maybe published already?) that fixes the t values, and then claims, utterly bizarrely, that this does not change the conclusion of the paper.  But even if we accept that for a nanosecond, it does not address the question of why the means were not correctly reported.  It looks like a second correction may be in order.  I wonder what Lady Bracknell would say?]

[Update 2016-07-09 22:17 UTC: Fixed an error; see comment by John Bullock.]