13 February 2025

From the GRIM archive: Eskine et al. (2011)

Readers who have followed the development of GRIM might recall that we (that's me and the Wild Man of Boston) asked 21 researchers for their data. We got nine datasets, and the responses from the other 12 varied between silence and borderline hostility; in two cases, from articles that shared no co-authors, the borderline hostility was identically worded.

However, one of the most surprising responses was not a response at all; our e-mail to the lead and corresponding author bounced. Undeliverable. Address not found.

Not so unusual, you might think; after all, people move to new institutions, and their faculty e-mail addresses often evaporate shortly after. But in this case, it was the author's Gmail address that had stopped working.

Spooky

Do I have your attention? Then let's begin...

The article

This is the article that attracted our attention:

Eskine, K. J., Kacinik, N. A., & Prinz, J. J. (2011). A bad taste in the mouth: Gustatory disgust influences moral judgment. Psychological Science, 22(3), 295–299. https://doi.org/10.1177/0956797611398497

On 2025-02-13 I was able to download the PDF file from here. By that point it had acquired 675 citations on Google Scholar and 267 on Web of Science, which seems like quite a lot.

Briefly, the article describes a study that investigated whether giving people a small amount of something pleasant or unpleasant to drink also made them more or less harsh in their judgment of moral transgressions that they read about in a series of vignettes (e.g., second cousins having sex, or someone eating their already-dead dog). There were three conditions, depending on which beverage the participants consumed: Bitter, Sweet, or Water (with the latter condition sometimes being referred to as "control" in the article). For added fun, and because this was back before the tsunami hit social psychology, when it was absolutely standard for any article to have a gratuitous pop at (U.S.) conservatives, there was a test to see whether conservatives were more susceptible to the effects than liberals.

I'll let you guess the results, but yeah, you're right. People who were drinking the bitter beverage became more, well, bitter in their judgments of the various transgressions, and the conservatives even more so. Or as the authors put it, "Taken together, these results suggest that physical disgust helps instantiate moral disgust, and that these effects are more salient in individuals with politically conservative views than in individuals with politically liberal views."

We flagged up this article because it has several GRIM inconsistencies in Table 1.


The article reports (p. 296) that "An overall moral-judgment score was obtained for each of the ... participants (bitter condition: n = 15; sweet condition: n = 18; control condition: n = 21)". But the highlighted numbers are not consistent with those sample sizes. For example, for "Bitter taste"/"Sweet drink", 1.76 * 18 = 31.68, so the nearest possible integer—corresponding to the sum of the participants' ratings—is 32. But 32 / 18 = 1.7777 which should be reported as 1.78. And so on.
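(If you want to check this kind of thing yourself, here is a quick-and-dirty version of the GRIM test in R. The function is my own sketch, not part of any package, and it only checks the nearest possible integer total rather than every candidate.)

# Is a reported mean (to 'digits' decimals) consistent with a sum of integer ratings from n people?
grim_consistent <- function(reported_mean, n, digits = 2) {
  total <- round(reported_mean * n)                  # nearest possible integer sum
  round(total / n, digits) == round(reported_mean, digits)
}

grim_consistent(1.76, 18)    # FALSE: round(31.68) = 32, and 32 / 18 = 1.78, not 1.76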

The combination of seven GRIM problems out of 12 means, plus the effect sizes, made us want to look at the dataset. But as already mentioned, the lead author had deleted his Gmail and apparently gone off the grid. I contacted one of the co-authors of the article but she did not know where Kendall Eskine or the data might be found, and his most recent employer (Loyola University) told me that he had quit without leaving a forwarding address.

However, a few months later, James and I were discussing this particular article with a colleague, who — as it turned out — had been looking at Eskine's work for some time, and had managed to persuade him to share some datasets, including the one for this article. It was in SPSS format, but it was easy enough for me to convert it to CSV and then analyse it in R.
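(The conversion was nothing clever; something along these lines with the haven package does it, with placeholder file names standing in for the real ones.)

library(haven)                               # read_sav() reads SPSS .sav files

d <- read_sav("eskine_2011.sav")             # placeholder file name
write.csv(d, "eskine_2011.csv", row.names = FALSE)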

I was able to reproduce all of the results perfectly, apart from the two "contrasts" mentioned in the last paragraph of the Results section. For those, the means and SDs are all fine, but if I run a simple t test the t value is slightly off and the DFs are one more than reported, and I don't see how to get the reported numbers with ANOVA contrasts either. If anyone can help with this, I'd be most grateful.
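(What I ran, for the record, was an ordinary pooled-variance t test between the two groups being compared; something like the sketch below, where the column names are placeholders and I have picked Bitter versus Control purely for illustration.)

# Student's (pooled-variance) t test between two conditions; its df is n1 + n2 - 2,
# which comes out one higher than the df reported in the article for these contrasts
bitter  <- d$moral_overall[d$condition == "Bitter"]     # placeholder column names
control <- d$moral_overall[d$condition == "Control"]
t.test(bitter, control, var.equal = TRUE)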

The GRIM issue turns out to be a fairly innocuous mix-up in the reporting of the sample sizes, with the Sweet condition actually having 21 participants and the Water (control) condition having 18, rather than the other way round as reported in the article. Going back to the first number in the Sweet column: 1.76 * 21 = 36.96; the nearest integer is 37; 37 / 21 = 1.7619 which rounds to 1.76; and so on.
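(You can check this with a calculator, or with two lines of base R:)

round(1.76 * 21)          # 37
round(37 / 21, 2)         # 1.76 -- consistent with the reported mean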

So, the dataset matches the results and there are no GRIM problems once you fix a typo. Can we go now? (Of course not. Spoiler: this is a blog post, and it has a point to make.)

Looking at the dataset, it seems that participants were asked two items about each of the six vignettes, rather than just one.

The naming of the variables is a little haphazard at points, but we can see that for every vignette ("Incest", "Eatingdog", etc.) there are two variables, one measuring how (im)moral the participant considered the behaviour to be, and the other apparently measuring how "appalling" they found it. However, there is absolutely no mention of the second set of measurements anywhere in the article (try it: you won't find the string "appal" anywhere in the PDF).

It is tempting to imagine that the authors omitted to report the results (or even the existence) of the "appalling" measure because these were not statistically significant. For example, the main effect of beverage that underpins the results, which for the "moral" items was F(2,51)=7.368, p=.002, ηp²=.224, only gives F(2,51)=2.511, p=.091, ηp²=.090 for the "appalling" items, and the other results are similarly unimpressive. The manuscript was submitted in 2010, and back then you weren't going to get into Psychological Science with p=.091.
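(For concreteness, this is the kind of computation I mean: a one-way ANOVA of the overall score on beverage condition, with partial eta squared obtained from the sums of squares. The column names below are placeholders, not the ones in the dataset.)

# One-way ANOVA of the overall "appalling" score on beverage condition
fit <- aov(appalling_overall ~ condition, data = d)     # placeholder column names
tab <- summary(fit)[[1]]
ss  <- tab[["Sum Sq"]]
c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1], eta_p2 = ss[1] / (ss[1] + ss[2]))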

Another noteworthy thing is that the effect sizes are huge. By consuming just two teaspoons of a bitter, sweet, or neutral-tasting beverage, participants' responses came to differ by over one standard deviation. Again, this was all considered entirely plausible in 2010. Furthermore, the sample sizes are small, and for the conservative–liberal contrasts they are tiny, with just 6 conservatives and 7 liberals in the Bitter condition.

Now, there is rarely any mystery about where effects like these are coming from in the data. It's either a very large difference in means, or very small standard deviations (leading to small standard errors and hence a larger t ratio)—unless you have a huge sample size, in which case that on its own will give you a small SE. Very small SDs are often a sign of problems, especially if they are associated with means in the middle of the range of possible values, as this suggests that almost everyone was answering in the same "non-extreme" way, which doesn't often happen. Looking at the numbers here, however, the SDs appear reasonable, given the authors' claim of a large effect. Put another way, if there really was a true large effect—which is always possible—these are something like the SDs that we would expect to see, including the fact that they are smaller for the larger means as the latter approach the limits of the 0–100 rating scale. The effect is thus being driven principally by the difference in the means across the two groups.
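(To make that concrete, here is the arithmetic of a two-sample t ratio with invented numbers, nothing taken from the article: the same t can arise from a large difference in means with ordinary SDs, or from a modest difference with small SDs.)

# Welch t statistic from summary statistics (illustrative numbers only)
welch_t <- function(m1, m2, sd1, sd2, n1, n2) {
  (m1 - m2) / sqrt(sd1^2 / n1 + sd2^2 / n2)
}
welch_t(60, 45, 15, 15, 18, 18)   # large mean difference: t = 3
welch_t(52, 45,  7,  7, 18, 18)   # same t = 3 from a smaller difference with smaller SDs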

So what else can we look at? Well, let's think about what the authors were measuring. They asked participants about two attitudes ("Is it (im)moral?" and "Is it appalling?") to six transgressions. While they didn't claim to be setting out to make a six-item "Transgression Judgement Scale", and they did not average the response scores across the six vignettes, we might reasonably expect that people who found any one of them to be immoral/appalling might also have a similar opinion about the others. We can measure the extent to which the authors created an internally consistent measure of people's attitudes to these transgressions by examining the Cronbach's alpha values of each measure in each condition. 
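(Computationally this is only a few lines of R with the psych package; the column-naming pattern below is a placeholder, since as I said the real variable names are a bit haphazard.)

library(psych)                                            # alpha() for Cronbach's alpha

moral_cols <- grep("_moral$", names(d), value = TRUE)     # placeholder naming pattern for the six items
lapply(split(d[moral_cols], d$condition),
       function(x) psych::alpha(x)$total$raw_alpha)       # one alpha per beverage condition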

Here are the alphas for the two attitudes and the three conditions:


As you can see, for the Bitter condition the reliability of the measure is considerably lower than in the other two conditions, even though the manipulation is meant to be making people more attuned to moral issues, or more readily appalled, than those sampling the more usual beverages.

Let's see what happens when we compute the alphas after breaking the sample down further by political orientation:


Did you even know that Cronbach's alpha can be negative? Well, it can, and it's a useful test of a scale that has reverse-scored items — if you forget to do the reversing, you can end up with a negative alpha. But here, those bitter ol' conservatives are answering the same questions as everyone else. They've just been driven so gosh-darn mad by this appalling immorality that they have no idea what they're doing any more. Or... perhaps something else happened to the data, especially in the bitter condition, and especially for conservatives. <miss_piggy_flutters_eyelashes.gif>
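(If you have never met a negative alpha in the wild, it is easy to cook one up: take two items that correlate negatively, as would happen if you forgot to reverse-score one of them, and alpha drops below zero. Toy data, obviously, not anything from this study:)

library(psych)

set.seed(1)
x   <- rnorm(50)
toy <- data.frame(item1 =  x + rnorm(50, sd = 0.5),
                  item2 = -x + rnorm(50, sd = 0.5))     # the "forgot to reverse-score" item
psych::alpha(toy)$total$raw_alpha                       # comes out well below zero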

Either way, these results tell us very little about what the experiment was supposed to measure. If anyone has an e-mail address for the lead author that still works, perhaps they could send him a pointer to this post and see if he has an explanation for all this.

Supporting information

I have made the R code and dataset for this post, as well as my annotated copy of the article PDF file, available here.


Footnotes

 Unless, of course, the reviewers or editors told the authors to "simplify" the story. Simmons et al.'s classic "False-Positive Psychology" article did not land on the editor-in-chief's desk at Psychological Science for another four months after Eskine et al.'s article was published.

 I try, not always successfully, to avoid pointing and laughing at studies purely for having large effect sizes. If we dismiss the possibility of large effects a priori, while also dismissing very small effects as being probably due to noise or unmeasured confounding (and in any case probably not of practical interest), we risk declaring that science is only about "Goldilocks"-sized effects. I don't know what the effect size of penicillin on staphylococcus was back in the late 1920s, but I suspect that it might cause eyebrows to be raised if Alexander Fleming were to write it up for publication today. That said, I remain skeptical of all effect sizes in social psychology above about d=0.4 that were not written about by Plato or Shakespeare.