This post is about some issues in Study 3 of the following article:
Lee, J. J., Hardin, A. E., Parmar, B., & Gino, F. (2019). The
interpersonal costs of dishonesty: How dishonest behavior reduces
individuals' ability to read others' emotions. Journal of Experimental
Psychology: General, 148(9), 1557–1574. https://doi.org/10.1037/xge0000639
On 2023-11-06 I was able to download the article from here.
This article is paper #29 in the Many Co-Authors project, where researchers who have co-authored papers with Professor Francesca Gino are reporting the provenance of the data in those papers, following the discovery of problems with the data in four articles co-authored by Professor Gino.
In the tables on the Many Co-Authors page for paper #29, two of the three co-authors of this article have so far (2023-11-10) provided information about the provenance of the data for this article, with both indicating that Professor Gino was involved in the data collection for Study 3. This note from Julia Lee Cunningham, the lead author, provides further confirmation:
For Study 3, Gino’s research assistant ran the laboratory study at Harvard Business School Research Lab for the partial data on Gino’s Qualtrics account. The co-authors have access to the raw data and were able to reproduce the key published results for Study 3.
In this study, pairs of participants interacted by telling each other stories. In one condition ("dishonest"), one member of the pair (A) told a fake story and the other (B) told a true story. In the other condition ("honest"), both members of the pair told true stories. Then, B evaluated their emotions during the exercise, and A evaluated their perceptions of B's emotions. The dependent variable ("emotional accuracy") was the ability of A to accurately evaluate how B had been feeling during the exercise. The results showed that when A had been dishonest (by telling a fake story), they were less accurate in their evaluation of B's emotional state.
The dataset for Study 3 is available as part of the OSF repository for the whole article here. It consists of an SPSS data file (.SAV) and a "syntax" (code) file (.SPS). I do not currently have an SPSS licence, so I was unable to run the code, but it seems to be fairly straightforward, running the focal t test from the study followed by the ANCOVAs to test whether gender moderated the relationship between condition and emotional accuracy.
I converted the dataset file to .CSV format in R and was then able to replicate the focal result of the study ("participants in the dishonest condition (M = 1.58, SD = 0.63) were significantly less
accurate at detecting others’ mental and affective states than those in the honest condition (M=1.39, SD = 0.54), t(209) = 2.37, p = .019", p. 1564, emphasis in original). My R code gave me this result:
> t.test(df.repl.dis$EmoAcc, df.repl.hon$EmoAcc, var.equal=TRUE)
Two Sample t-test
data: df.repl.dis$EmoAcc and df.repl.hon$EmoAcc
t = 2.369, df = 209, p-value = 0.01875
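For readers who want to try this themselves, here is a minimal sketch of how the per-condition data frames might be built. It is not the code used for this post. It runs on a synthetic stand-in for the data, and the column name `Condition` and the coding of `Exclude_LabNotes` (1 = excluded) are assumptions; only `EmoAcc` and `Exclude_LabNotes` are variable names mentioned in the post.

```r
# Synthetic stand-in for the converted dataset. With the real .SAV file one
# could instead use foreign::read.spss(<path to file>, to.data.frame = TRUE).
set.seed(1)
n <- 250
df <- data.frame(
  Condition = rep(c("dishonest", "honest"), each = n / 2),  # assumed column name
  EmoAcc = round(runif(n, 0.5, 3), 2),                      # made-up scores
  Exclude_LabNotes = rbinom(n, 1, 0.15)                     # assumed: 1 = excluded
)

# Retained pairs only (the sample analysed in the article), split by condition
df.repl <- subset(df, Exclude_LabNotes == 0)
df.repl.dis <- subset(df.repl, Condition == "dishonest")
df.repl.hon <- subset(df.repl, Condition == "honest")

# Student's t test with equal variances assumed, as in the post
res <- t.test(df.repl.dis$EmoAcc, df.repl.hon$EmoAcc, var.equal = TRUE)
print(res)
```

Because the scores here are random, the output will not match the published values; the sketch only shows the subsetting and the form of the test.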
However, this is not the whole story. Although the dataset contains records from 250 pairs of participants, the article states (p. 1564):
As determined by research assistants monitoring each session, pairs were excluded for the following reasons: the wrong partner told their story first; they asked so many questions during the session that it became apparent they were not actually reading their survey instructions or questions (e.g., “What story am I supposed to be telling?”); or they were actively on their phone during the storytelling portion of the session. Exclusions were due to the actions of either individual in the pair; thus, of the 500 individuals, 39 did not follow instructions. This resulted in 106 pairs in the dishonest condition and 105 pairs in the honest condition.
The final total of 211 pairs is confirmed by the 209 degrees of freedom of the above t test.
Conveniently, the results for the 39 excluded pairs are available in the dataset. They are excluded from analysis based on a variable named "Exclude_LabNotes" (although sadly, despite this name, the OSF data repository does not contain any lab notes that might explain the basis on which each exclusion was made). It is thus possible to run the analyses on the full dataset of 250 pairs, with no exclusions. When I did that, I obtained this result:
> t.test(df.full.dis$EmoAcc, df.full.hon$EmoAcc, var.equal=TRUE)
Two Sample t-test
data: df.full.dis$EmoAcc and df.full.hon$EmoAcc
t = 0.20148, df = 246, p-value = 0.8405
As you can see, there is quite a difference from the previous t test (p = 0.8405 versus p = 0.01875). Note that this test has 246 degrees of freedom because only 248 of the 250 pairs have complete records, so 37 of the 39 excluded pairs contribute to it. Had these 37 participant pairs not been excluded, there would be no difference between the conditions; put another way, the exclusions drive the entire effect. I ran the same t test on (only) these excluded participants:
> t.test(df.exconly.dis$EmoAcc, df.exconly.hon$EmoAcc, var.equal=TRUE)
Two Sample t-test
data: df.exconly.dis$EmoAcc and df.exconly.hon$EmoAcc
t = -4.1645, df = 35, p-value = 0.0001935
Cohen's d for this test is 1.412, which is a very large effect indeed among people who are not paying attention.
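As a quick check, Cohen's d for a Student's t test with a pooled standard deviation can be recovered from the t statistic and the two group sizes as d = |t| × sqrt(1/n1 + 1/n2). The sketch below uses the t value reported above and the sizes of the excluded groups given later in this post (14 "dishonest" and 23 "honest" pairs); the variable names are purely illustrative.

```r
# Values reported in the post for the excluded-only t test
t_stat <- -4.1645
n_dis <- 14   # excluded pairs, "dishonest" condition
n_hon <- 23   # excluded pairs, "honest" condition

# For a pooled-variance t test: d = |t| * sqrt(1/n1 + 1/n2)
d <- abs(t_stat) * sqrt(1 / n_dis + 1 / n_hon)

# Degrees of freedom implied by the group sizes
df_check <- n_dis + n_hon - 2

cat(sprintf("d = %.3f, df = %d\n", d, df_check))
```

This agrees with the reported d of 1.412 and the 35 degrees of freedom in the output above.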
I think it is worth illustrating these results graphically. First, a summary of the three t tests:
Second, an illustration of where each observation was dropped from its respective per-condition sample:
[[Edit 2023-11-11 19:05 UTC: I updated the second figure above. The previous version reported "t(34) = 4.56", reflecting a t test with equal variances not assumed in which the calculated degrees of freedom were 34.466. This is actually the more correct way to calculate the t statistic, but I have been using "equal variances assumed" in all of the other analyses in this post for compatibility with the original article, which used analyses from SPSS in which the assumption of equal variances is the default. See also this article. ]]
This is quite remarkable. One might imagine that participants who were not paying attention to the instructions, or goofing off on their phones, would, overall, give responses that would show no effect, because their individual responses would have been noisy and/or because the set of excluded participants was approximately balanced across conditions (and there is no difference between the conditions for the full sample). Indeed, a legitimate reason to exclude these participants would be that their results are likely to be uninformative noise and so, if they were numerous enough, their inclusion might lead to a Type II error. But instead, it seems that these excluded participants showed a very strong effect in the opposite direction to the hypothesis (as shown by the negative t statistic). That is, if these results are to be believed, something about the fact that either A or B was not following the study instructions made A much better (or less bad) at determining B's emotions when telling a fake (versus true) story. There were 14 excluded participant pairs in the "dishonest" condition, with a mean emotional accuracy score (lower = more accurate) for A of 1.143, and 23 in the "honest" condition, with a mean emotional accuracy score of 2.076; for comparison, the mean score for the full sample across both conditions is 1.523.
I hope the reader will forgive me for saying that this explanation does not seem very likely — and if it were true, it would presumably be the basis of intense interest among psychologists. Rather, there seem to be two other plausible explanations (but feel free to point out any others that you can think of in the comments). One is that the extreme results of the excluded participants arose by chance — and, hence, the apparent effect in favour of the authors' hypothesis caused by their exclusion was also the result of chance. The other, painful though it is to contemplate, is that the research assistants may have excluded participants in order to give a result in line with the hypotheses.
I simulated how likely it would be for the removal of 37 random participant pairs from the sample of 248 complete records to give a statistically significant result. I ran 1,000,000 simulations and obtained only 12 p values less than 0.05 for the t test on the resulting sample of 211 pairs. The smallest p value that I obtained was 0.03387, which is higher than the one reported in the article. To put it another way, out of a million attempts I was unable to obtain even one result as extreme as the published one by chance.
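A simulation of this kind can be sketched as follows. This is not the code used for this post. It runs on synthetic scores with no true difference between conditions, uses far fewer iterations, and the names `sim_df` and `p_after_random_drop` are hypothetical. The group sizes (120 and 128) are the retained plus excluded counts given above.

```r
set.seed(42)

# Synthetic stand-in for the 248 complete records (no true condition effect)
n_drop <- 37
sim_df <- data.frame(
  Condition = rep(c("dishonest", "honest"), times = c(120, 128)),
  EmoAcc = rnorm(248, mean = 1.5, sd = 0.6)  # made-up scores
)

# Drop n_drop randomly chosen pairs, then return the p value of the
# equal-variances t test on the pairs that remain
p_after_random_drop <- function(df, n_drop) {
  kept <- df[-sample(nrow(df), n_drop), ]
  t.test(kept$EmoAcc[kept$Condition == "dishonest"],
         kept$EmoAcc[kept$Condition == "honest"],
         var.equal = TRUE)$p.value
}

n_sims <- 2000  # the post used 1,000,000
p_values <- replicate(n_sims, p_after_random_drop(sim_df, n_drop))

cat("Proportion with p < 0.05:", mean(p_values < 0.05), "\n")
cat("Smallest p value:", min(p_values), "\n")
```

With the real data in place of `sim_df`, the proportion of p values below 0.05 and the smallest p value obtained are the two quantities reported in the paragraph above.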
So it seems to me that, by elimination, the most plausible remaining explanation is that the research assistants selected which participants to exclude based on their scores, in such a way as to produce results that favoured the authors' hypothesis. Exactly how they were able to do that, given that those scores were only available in Qualtrics, when their job was (presumably) to help participants sitting in the laboratory to understand the process and to check who was spending time on their phone, is unclear to me, but doubtless there is a coherent explanation. Indeed, Professor Gino has already suggested (see point 274 here) that research assistants may have been responsible for perceived anomalies in other studies on which she was an author, although so far no details on how exactly this might have happened have been made public. I hope that she will be able to track down the RAs in this case and establish the truth of the matter with them.