10 November 2023

Attack of the 50 foot research assistants: Lee et al. (2019), Study 3

This post is about some issues in Study 3 of the following article:

 Lee, J. J., Hardin, A. E., Parmar, B., & Gino, F. (2019). The interpersonal costs of dishonesty: How dishonest behavior reduces individuals' ability to read others' emotions. Journal of Experimental Psychology: General, 148(9), 1557–1574. https://doi.org/10.1037/xge0000639

On 2023-11-06 I was able to download the article from here.

This article is paper #29 in the Many Co-Authors project, where researchers who have co-authored papers with Professor Francesca Gino are reporting the provenance of the data in those papers, following the discovery of problems with the data in four articles co-authored by Professor Gino.
In the tables on the Many Co-Authors page for paper #29, two of the three co-authors of this article have so far (2023-11-10) provided information about the provenance of the data for this article, with both indicating that Professor Gino was involved in the data collection for Study 3. This note from Julia Lee Cunningham, the lead author, provides further confirmation:

For Study 3, Gino’s research assistant ran the laboratory study at Harvard Business School Research Lab for the partial data on Gino’s Qualtrics account. The co-authors have access to the raw data and were able to reproduce the key published results for Study 3.

In this study, pairs of participants interacted by telling each other stories. In one condition ("dishonest"), one member of the pair (A) told a fake story and the other (B) told a true story. In the other condition ("honest"), both members of the pair told true stories. Then, B evaluated their emotions during the exercise, and A evaluated their perceptions of B's emotions. The dependent variable ("emotional accuracy") was the ability of A to accurately evaluate how B had been feeling during the exercise. The results showed that when A had been dishonest (by telling a fake story), they were less accurate in their evaluation of B's emotional state.
The dataset for Study 3 is available as part of the OSF repository for the whole article here. It consists of an SPSS data file (.SAV) and a "syntax" (code) file (.SPS). I do not currently have an SPSS licence, so I was unable to run the code, but it seems to be fairly straightforward, running the focal t test from the study followed by the ANCOVAs to test whether gender moderated the relationship between condition and emotional accuracy.
I converted the dataset file to .CSV format in R and was then able to replicate the focal result of the study ("participants in the dishonest condition (M = 1.58, SD = 0.63) were significantly less accurate at detecting others’ mental and affective states than those in the honest condition (M=1.39, SD = 0.54), t(209) = 2.37, p = .019", p. 1564, emphasis in original). My R code gave me this result:

> t.test(df.repl.dis$EmoAcc, df.repl.hon$EmoAcc, var.equal=TRUE)

Two Sample t-test

data: df.repl.dis$EmoAcc and df.repl.hon$EmoAcc
t = 2.369, df = 209, p-value = 0.01875

However, this is not the whole story. Although the dataset contains records from 250 pairs of participants, the article states (p. 1564):
As determined by research assistants monitoring each session, pairs were excluded for the following reasons: the wrong partner told their story first; they asked so many questions during the session that it became apparent they were not actually reading their survey instructions or questions (e.g., “What story am I supposed to be telling?”); or they were actively on their phone during the storytelling portion of the session. Exclusions were due to the actions of either individual in the pair; thus, of the 500 individuals, 39 did not follow instructions. This resulted in 106 pairs in the dishonest condition and 105 pairs in the honest condition.

The final total of 211 pairs is confirmed by the 209 degrees of freedom of the above t test.
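
As a rough cross-check, the same equal-variances t test can be recomputed from the summary statistics quoted above. This is only a sketch: it uses the rounded means and SDs from the article rather than the raw data, so it should land close to, but not exactly on, the value that R reports from the full dataset.

```r
# Reported summary statistics for the 211 retained pairs (p. 1564)
m_dis <- 1.58; sd_dis <- 0.63; n_dis <- 106
m_hon <- 1.39; sd_hon <- 0.54; n_hon <- 105

df <- n_dis + n_hon - 2
# Pooled variance, as used by the equal-variances (Student) t test
var_pooled <- ((n_dis - 1) * sd_dis^2 + (n_hon - 1) * sd_hon^2) / df
se_diff <- sqrt(var_pooled * (1 / n_dis + 1 / n_hon))

t_stat <- (m_dis - m_hon) / se_diff
p_value <- 2 * pt(-abs(t_stat), df)   # two-sided p value
c(t = t_stat, df = df, p = p_value)
```

The small gap between this t value and the reported 2.369 is what one would expect from working with means and SDs rounded to two decimal places.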

Conveniently, the results for the 39 excluded pairs are available in the dataset. They are excluded from analysis based on a variable named "Exclude_LabNotes" (although sadly, despite this name, the OSF data repository does not contain any lab notes that might explain the basis on which each exclusion was made). It is thus possible to run the analyses on the full dataset of 250 pairs, with no exclusions. When I did that, I obtained this result:

> t.test(df.full.dis$EmoAcc, df.full.hon$EmoAcc, var.equal=TRUE)

Two Sample t-test

data: df.full.dis$EmoAcc and df.full.hon$EmoAcc
t = 0.20148, df = 246, p-value = 0.8405

(Alert readers may notice that the degrees of freedom for this independent t test are only 246 rather than the expected 248. On inspection of the dataset, it appears that one record has NA for experimental condition and another has NA for emotional accuracy. Both of these records were also manually excluded by the research assistants, but they could not have been used in any of the t tests anyway. Hence, it seems fairer to say that 37 out of 248 participant pairs, rather than 39 out of 250, were excluded based on notes made by the RAs.)
As you can see, there is quite a difference from the previous t test (p = 0.8405 versus p = 0.01875). Had these 37 participant pairs not been excluded, there would be no difference between the conditions; put another way, the exclusions drive the entire effect. I ran the same t test on (only) these excluded participants:

> t.test(df.exconly.dis$EmoAcc, df.exconly.hon$EmoAcc, var.equal=TRUE)

Two Sample t-test

data: df.exconly.dis$EmoAcc and df.exconly.hon$EmoAcc
t = -4.1645, df = 35, p-value = 0.0001935

Cohen's d for this test is 1.412, which is a very large effect indeed among people who are not paying attention.
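
For readers who want to check the effect size, Cohen's d for two independent groups can be recovered from the t statistic and the group sizes. The sketch below assumes the group sizes of 14 (dishonest) and 23 (honest) excluded pairs that are reported elsewhere in this post.

```r
t_stat <- -4.1645   # t statistic for the excluded pairs only
n_dis  <- 14        # excluded pairs, "dishonest" condition
n_hon  <- 23        # excluded pairs, "honest" condition

# For an equal-variances t test, d = t * sqrt(1/n1 + 1/n2)
d <- t_stat * sqrt(1 / n_dis + 1 / n_hon)
round(abs(d), 3)    # 1.412
```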
I think it is worth illustrating these results graphically. First, a summary of the three t tests:
Second, an illustration of where each observation was dropped from its respective per-condition sample:
[[Edit 2023-11-11 19:05 UTC: I updated the second figure above. The previous version reported "t(34) = 4.56", reflecting a t test with equal variances not assumed in which the calculated degrees of freedom were 34.466. This is actually the more correct way to calculate the t statistic, but I have been using "equal variances assumed" in all of the other analyses in this post for compatibility with the original article, which used analyses from SPSS in which the assumption of equal variances is the default. See also this article. ]]
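
The difference between the two versions of the test can be illustrated in R, where t.test() uses the Welch (unequal variances) version unless var.equal=TRUE is given. The scores below are randomly generated for illustration only and are not the study data; only the group sizes (14 and 23) are taken from this post.

```r
set.seed(42)
# Illustrative scores only; group sizes match the 14 and 23 excluded pairs
x <- rnorm(14, mean = 1.1, sd = 0.4)
y <- rnorm(23, mean = 2.1, sd = 0.8)

student <- t.test(x, y, var.equal = TRUE)  # pooled variance, df = n1 + n2 - 2
welch   <- t.test(x, y)                    # default: Welch, fractional df

unname(student$parameter)   # 35
unname(welch$parameter)     # never more than 35
```

The Welch degrees of freedom depend on the two sample variances, which is why the earlier version of the figure showed a non-integer value.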
This is quite remarkable. One might imagine that participants who were not paying attention to the instructions, or goofing off on their phones, would, overall, give responses that would show no effect, because their individual responses would have been noisy and/or because the set of excluded participants was approximately balanced across conditions (and there is no difference between the conditions for the full sample). Indeed, a legitimate reason to exclude these participants would be that their results are likely to be uninformative noise and so, if they were numerous enough, their inclusion might lead to a Type II error. But instead, it seems that these excluded participants showed a very strong effect in the opposite direction to the hypothesis (as shown by the negative t statistic). That is, if these results are to be believed, something about the fact that either A or B was not following the study instructions made A much better (or less bad) at determining B's emotions when telling a fake (versus true) story. There were 14 excluded participant pairs in the "dishonest" condition, with a mean emotional accuracy score (lower = more accurate) for A of 1.143, and 23 in the "honest" condition, with a mean emotional accuracy score of 2.076; for comparison, the mean score for the full sample across both conditions is 1.523.

I hope the reader will forgive me for saying that this explanation does not seem very likely — and if it were true, it would presumably be the basis of intense interest among psychologists. Rather, there seem to be two other plausible explanations (but feel free to point out any others that you can think of in the comments). One is that the extreme results of the excluded participants arose by chance — and, hence, the apparent effect in favour of the authors' hypothesis caused by their exclusion was also the result of chance. The other, painful though it is to contemplate, is that the research assistants may have excluded participants in order to give a result in line with the hypotheses.
I simulated how likely it would be for the removal of 37 random participant pairs from the sample of 248 complete records to give a statistically significant result. I ran 1,000,000 simulations and obtained only 12 p values less than 0.05 for the t test on the resulting sample of 211 pairs. The smallest p value that I obtained was 0.03387, which is higher than the one reported in the article. To put it another way, out of a million attempts I was unable to obtain even one result as extreme as the published one by chance.
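
The logic of this simulation can be sketched as follows. This is not the analysis code linked in this post: the data frame is a randomly generated stand-in with no true difference between conditions, the column name Condition and the helper function are hypothetical, the group sizes of 120 and 128 are derived from the retained and excluded counts given above, and the number of runs is far smaller than 1,000,000 so that it finishes quickly.

```r
# Synthetic stand-in for the 248 complete records (NOT the real data):
# two conditions with no true difference in emotional accuracy.
set.seed(1)
n_pairs <- 248
sim <- data.frame(
  Condition = rep(c("dishonest", "honest"), times = c(120, 128)),
  EmoAcc    = rnorm(n_pairs, mean = 1.5, sd = 0.6)
)

# Drop n_drop randomly chosen pairs and return the p value of the
# equal-variances t test on the remaining pairs.
p_after_random_exclusion <- function(d, n_drop = 37) {
  kept <- d[-sample(nrow(d), n_drop), ]
  t.test(EmoAcc ~ Condition, data = kept, var.equal = TRUE)$p.value
}

n_sims <- 2000
p_values <- replicate(n_sims, p_after_random_exclusion(sim))

# Proportion of random exclusion sets that yield p < .05
mean(p_values < 0.05)
min(p_values)
```

With the real data in place of the synthetic data frame, the last two lines correspond to the count of significant results and the smallest p value described above.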

Now, this process can surely be subjected to some degree of criticism from a statistical inference point of view. After all, if the report in the article is correct, the excluded participants were not selected truly at random, and they might differ in other ways from the rest of the sample, with those differences perhaps interacting with the experimental condition. There might be other, more formally correct ways to test the idea that the exclusions of participant pairs were independent of their scores. However, as mentioned earlier, I do not think that it can be seriously argued that there was some extremely powerful psychological process, contrary to the study hypothesis, taking place specifically with the excluded participants.

So it seems to me that, by elimination, the most plausible remaining explanation is that the research assistants selected which participants to exclude based on their scores, in such a way as to produce results that favoured the authors' hypothesis. Exactly how they were able to do that, given that those scores were only available in Qualtrics, when their job was (presumably) to help participants sitting in the laboratory to understand the process and to check who was spending time on their phone, is unclear to me, but doubtless there is a coherent explanation. Indeed, Professor Gino has already suggested (see point 274 here) that research assistants may have been responsible for perceived anomalies in other studies on which she was an author, although so far no details on how exactly this might have happened have been made public. I hope that she will be able to track down the RAs in this case and establish the truth of the matter with them.

Supporting files

I have made my analysis code available here. I encourage you to use the authors' original SPSS data file from the OSF link given above, and convert it to CSV format using the commented-out line at the top of the script. However, as a convenience, I have made a copy of that CSV file available along with my code.


I thank Daniël Lakens and two anonymous readers of earlier drafts of this post for their comments. One of those people also kindly provided the two charts in the post.


I wrote to Julia Lee Cunningham to give her a heads up about this post. With her permission, I will quote any reply that she might make here.

09 July 2023

Data errors in Mo et al.'s (2023) analysis of wind speed and voting patterns

This post is about some issues in the following article, and most notably its dataset:

Mo, C. H., Jachimowicz, J. M., Menges, J. I., & Galinsky, A. D. (2023). The impact of incidental environmental factors on vote choice: Wind speed is related to more prevention‑focused voting. Political Behavior. Advance online publication. https://doi.org/10.1007/s11109-023-09865-y

You can download the article from here, the Supplementary Information from here [.docx], and the dataset from here. Credit is due to the authors for making their data available so that others can check their work.


The premise of this article, which was brought to my attention in a direct message by a Twitter user, is that the wind speed observed on the day of an "election" (although in fact, all the cases studied by the authors were referendums) affects the behaviour of voters, but only if the question on the ballot represents a choice between prevention- and promotion-focused options, in the sense of regulatory focus theory. The authors stated in their abstract that "we find that individuals exposed to higher wind speeds become more prevention-focused and more likely to support prevention-focused electoral options".

This article (specifically the part that focused on the UK's referendum on leaving the European Union ("Brexit")) has already been critiqued by Erik Gahner here.

I should state from the outset that I was skeptical about this article when I read the abstract, and things did not get better when I found a couple of basic factual errors in the descriptions of the Brexit referendum:
  1. On p. 9 the authors claim that "The referendum for UK to leave the European Union (EU) was advanced by the Conservative Party, one of the three largest parties in the UK", and again, on p. 12, they state "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU". However, this is completely incorrect. The Conservative Party was split over how to vote, but the majority of its members of parliament, including David Cameron, the party leader and Prime Minister, campaigned for a Remain vote (source).
  2. At several points, the authors claim that the question posed in the Brexit referendum required a "Yes"/"No" answer. On p. 7 we read "For Brexit, the “No” option advanced by the Stronger In campaign was seen as clearly prevention-oriented ... whereas the “Yes” option put forward by the Vote Leave campaign was viewed as promotion-focused". The reports of result coding on p. 8, and the note to Table 1 on p. 10, repeat this claim. But this is again entirely incorrect. The options given to voters were to "Remain" (in the EU) or "Leave" (the EU). As the authors themselves note, the official campaign against EU membership was named "Vote Leave" (and there was also an unofficial campaign named "Leave.EU"). Indeed, this choice was adopted, rather than "Yes" or "No" responses to the question "Should the United Kingdom remain a member of the European Union?", precisely to avoid any perception of "positivity bias" in favour of a "Yes" vote (source). Note also here that, had this change not been made, the pro-EU vote would have been "Yes", and not the (prevention-focused) "No" claimed by the authors. (*)
Nevertheless, the article's claims are substantial, with remarkable implications for politics if they were to be confirmed. So I downloaded the data and code and tried to reproduce the results. Most of the analysis was done in Stata, which I don't have access to, but I saw that there was an R script to generate Figure 2 of the study that analysed the Swiss referendum results, so I ran that.

My reproduction of the original Figure 2 from the article. The regression coefficient for the line in the "Regulatory Focus Difference" condition is B=0.545 (p=0.000006), suggesting that every 1km/h increase in wind speed produces an increase of more than half a percentage point in the vote for the prevention-oriented campaign.

Catastrophic data problems

I had no problem in reproducing Figure 2 from the article. However, when I looked a little closer at the dataset (**) I noticed a big problem in the numbers. Take a look at the "DewPoint" and "Humidity" variables for "Election 50", which corresponds to Referendum 24 (***) in the Supplementary Information, and see if you can spot the problem.

Neither of those variables can possibly be correct for "Election 50" (note that the same issues affect the records for every "State", i.e., Swiss canton):
  • DewPoint, which would normally be a Fahrenheit temperature a few degrees below the actual air temperature, contains numbers between 0.401 and 0.626. The air temperature ranges from 45.3 to 66.7 degrees. For the dew point temperatures to be correct would require the relative humidity to be around 10% (calculator), which seems unlikely in Switzerland on a mild day in May. Perhaps these DewPoint values in fact correspond to the relative humidity?
  • Humidity (i.e., relative atmospheric humidity), which by definition should be a fraction between 0 and 1, is instead a number in the range from 1008.2 to 1015.7. I am not quite sure what might have caused this. These numbers look like they could represent some measure of atmospheric pressure, but they only correlate at 0.538 with the "Pressure" variable for "Election 50".
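
The relative humidity implied by the DewPoint values can be approximated in a few lines. The sketch below uses the Magnus approximation with one commonly used pair of constants (17.625 and 243.04 °C); this is an assumption on my part and may not be the formula used by the online calculator mentioned above. The air temperature of 55 degrees is simply an illustrative value from the middle of the reported range.

```r
f_to_c <- function(f) (f - 32) / 1.8

# Magnus approximation of relative humidity (%) from air temperature and
# dew point, both in degrees Celsius
rel_humidity <- function(temp_c, dew_c, a = 17.625, b = 243.04) {
  100 * exp(a * dew_c / (b + dew_c)) / exp(a * temp_c / (b + temp_c))
}

air_f <- c(45.3, 55, 66.7)  # coolest, an illustrative middle value, warmest
dew_f <- 0.5                # roughly the middle of the 0.401 to 0.626 range
rh <- rel_humidity(f_to_c(air_f), f_to_c(dew_f))
round(rh, 1)
```

Under these assumptions the implied humidity is in the region of 10% for a mild day and stays well below 20% across the whole range of reported air temperatures.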
To evaluate the impact of these strange numbers on the authors' model, I modified their R script, Swiss_Analysis.R, to remove the records for "Election 50" and obtained this result from the remaining 23 referendums:
Figure 2 with "Election 50" (aka Referendum 24) removed from the model.

The angle of the regression line on the right is considerably less jaunty in this version of Figure 2. The coefficient has gone from B=0.545 (SE=0.120, p=0.000006) to B=0.266 (SE=0.114, p=0.02), simply by removing the damaged data that were apparently causing havoc with the model.

How robust is the model now?

A p value of 0.02 does not seem like an especially strong result. To test this, after removing the damaged data for "Election 50", I iterated over the dataset removing a further different single "Election" each time. In seven cases (removing "Election" 33, 36, 39, 40, 42, 46, or 47) the coefficient for the interaction in the resulting model had a p value above the conventional significance level of 0.05. In the most extreme case, removing "Election 40" (i.e., Referendum 14, "Mindestumwandlungsgesetz") caused the coefficient for the interaction to drop to 0.153 (SE=0.215, p=0.478), as shown in the next figure. It seems to me that if the statistical significance of an effect disappears with the omission of just one of the 23 (****) valid data points in 30% of the possible cases, this could indicate a lack of robustness in the effect.
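
The leave-one-out procedure described above can be sketched as follows. The data here are randomly generated and the model is a bare interaction with no control variables, so this is an illustration of the loop only and not the authors' model or my modified script; the column names WindSpeed, RegFocus, and Vote are hypothetical.

```r
set.seed(123)
# Synthetic panel (NOT the authors' data): 23 "elections" x 26 "states"
elections <- 27:49
d <- expand.grid(State = 51:76, Election = elections)
# Hypothetical election-level indicator of a regulatory focus difference
rf_by_election <- rep(c(0, 1), length.out = length(elections))
d$RegFocus  <- rf_by_election[match(d$Election, elections)]
d$WindSpeed <- runif(nrow(d), 0, 30)
d$Vote      <- 50 + 0.2 * d$WindSpeed * d$RegFocus + rnorm(nrow(d), sd = 10)

# p value of the WindSpeed x RegFocus interaction in a simple linear model
interaction_p <- function(data) {
  fit <- lm(Vote ~ WindSpeed * RegFocus, data = data)
  summary(fit)$coefficients["WindSpeed:RegFocus", "Pr(>|t|)"]
}

# Refit the model once per election, each time leaving that election out
loo_p <- sapply(elections, function(e) interaction_p(d[d$Election != e, ]))
names(loo_p) <- elections

sum(loo_p > 0.05)   # number of single omissions that lose significance
round(7 / 23, 3)    # the proportion reported above: 0.304
```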
Figure 2 with "Election 50" (aka Referendum 24) and "Election 40" (aka Referendum 14) removed from the model.

Other issues

Temperature precision
The ambient temperatures on the days of the referendums (variable "Temp") are reported with eight decimal places. It is not clear where this (apparently spurious) precision could have come from. Judging from their range the temperatures would appear to be in degrees Fahrenheit, whereas one would expect the original Swiss meteorological data to be expressed in degrees Celsius. However, the conversion between the two scales is simple (F = C * 1.8 + 32) and cannot introduce more than one extra decimal place. The authors state that "Weather data were collected from www.forecast.io/raw/", but unfortunately that link redirects to a page that suggests that this source is no longer available.
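
The point about decimal places is easy to check numerically. The sketch below assumes, purely for illustration, that the source readings were recorded in Celsius with one decimal place; the converted Fahrenheit values then never need more than two.

```r
celsius <- seq(-20, 40, by = 0.1)      # readings with one decimal place
fahrenheit <- celsius * 1.8 + 32

# Largest gap between each converted value and its two-decimal rounding;
# anything beyond floating-point noise would indicate extra decimals
max(abs(fahrenheit - round(fahrenheit, 2)))
```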

Cloud cover
The "CloudCover" variable takes only eight distinct values across the entire dataset, namely 2, 3, 5, 6, 8, 24, 34, and 38. It is not clear what these values represent, but it seems unlikely that they (all) correspond to a percentage or fraction of the sky covered by clouds. Yet, this variable is included in the regression models as a linear predictor. If the values represent some kind of ordinal or even nominal coding scheme, rather than being a parameter of some meteorological process, then including this variable could have arbitrary consequences for the regression (after all, 24, 34, and 38 might equally well have been coded ordinally as 9, 10, and 11, or perhaps nominally as -99, -45, and 756). If the intention is for these numbers to represent obscured eighths of the sky ("oktas"), then there is clearly a problem with the values above 8, which constitute 218 of the 624 records in the dataset (34.9%).

It would also be interesting to know the source of the "Income" data for each Swiss canton, and what this variable represents (e.g., median salary, household income, gross regional product, etc). After extracting the income data and canton numbers, and converting the latter into names, I consulted several Swiss or Swiss-based colleagues, who expressed skepticism that the cantons of Schwyz, Glarus, and Jura would have the #1, #3, and #4 incomes by any measure. I am slightly concerned that there may have been an issue with the sorting of the cantons when the Income variable was populated. The Supplementary Information says "Voting and socioeconomic information was obtained from the Swiss Federal Office of Statistics (Bundesamt für Statistik 2015)", and that reference points to a web page entitled “Detaillierte Ergebnisse Der Eidgenössischen Volksabstimmungen” with URL http://www.bfs.admin.ch/bfs/portal/de/index/themen/17/03/blank/data/01.html, but that link is dead (and in any case, the title means "Detailed results of Federal referendums"; such a page would generally not be expected to contain socioeconomic data).

Swiss cantons (using the "constitution order" mapping from numbers to names) and their associated "Income", presumably an annual figure in Swiss francs. Columns "Income(Mo)" and the corresponding rank order "IncRank" are from Mo et al.'s dataset; "Statista" and "StatRank" are from statista.com.

I obtained some fairly recent Swiss canton-level household income data from here and compared it with the data from the article. The results are shown in the figure above. The Pearson correlation between the two sets of numbers was 0.311, with the rank-order correlation being 0.093. I think something may have gone quite badly wrong here.

The value of the "Turnout" variable is the same for all cantons. This suggests that the authors may have used some national measure of turnout here. I am not sure how much value such a variable can add. The authors note (footnote 12, p. 17) that "We found that, except for one instance, no other weather indicator was correlated with the number of prevention-focused votes without simultaneously also affecting turnout rates. Temperature was an exception, as increased temperature was weakly correlated with a decrease in prevention-focused vote and not correlated with turnout". It is not clear to me what the meaning would be of calculating a correlation between canton-level temperature and national-level turnout.

Voting results do not always sum to 1
Another minor point about whatever cleaning has been performed on the dataset is that in 68 out of 624 cases (10.9%), the sum of "VotingResult1" and "VotingResult2" — representing the "Yes" and "No" votes — is 1.01 and not 1.00. Perhaps this is the result of the second number being generated by the first being subtracted from 1.00 when the first number was expressed as a percentage with one decimal place, with both numbers subsequently being rounded and something ambiguous happening with the last digit 5. In any case, it would seem important for these two numbers to sum to 1.00. This might not make an enormous amount of difference to the results, but it does suggest that the preparation of the data file may not have been done with excessive care.
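
The rounding explanation suggested above can be illustrated with integer arithmetic, which avoids any floating-point surprises. The sketch assumes that the first share was recorded in tenths of a percentage point, that the second was computed as its complement, and that both were then rounded half-up to whole percentage points. These are assumptions about the data preparation, not something documented by the authors.

```r
# First vote share in tenths of a percentage point (0 = 0.0%, 1000 = 100.0%)
share1 <- 0:1000
share2 <- 1000L - share1           # complement, so the pair sums to 100.0%

# Round half-up from tenths of a percent to whole percentage points
half_up <- function(x) (x + 5L) %/% 10L

total <- half_up(share1) + half_up(share2)
table(total)                       # totals of 100 and 101 only
mean(total == 101)                 # share of pairs summing to 1.01
unique(share1[total == 101] %% 10) # last digit of the affected shares: 5
```

Under these assumptions, exactly the shares ending in 5 produce a total of 1.01, which is about one case in ten and so in the same region as the 10.9% observed in the dataset.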

Mean-centred variables
Two of the control variables, "Pressure" and "CloudCover", appear in the dataset in two versions, raw and mean-centred. There doesn't seem to be any reason to mean-centre these variables, but it is something that is commonly done when calculating interaction terms. I wonder whether at some point in the analyses the authors tested atmospheric pressure and cloud cover, rather than wind speed, as possible drivers of an effect on voting. Certainly there seems to be quite a lot of scope for the authors to have wandered around Andrew Gelman's "Garden of forking paths" in these analyses, which do not appear to have been pre-registered.

No measure of population
Finally, a huge (to me, anyway) limitation of this study is that there is no measure of, or attempt to weight the results by, the population of the cantons. The most populous Swiss canton (Zürich) has a population about 90 times that of the least populous (Appenzell Innerrhoden), yet the cantons all have equal weight in the models. The authors barely mention this as a limitation; they only mention the word "population" once, in the context of determining the average wind speed in Study 1. Of course, the ecological fallacy [.pdf] is always lurking whenever authors try to draw conclusions about the behaviour of individuals, whether or not the population density is taken into account, although this did not stop the authors from claiming in their abstract that "we find that individuals [emphasis added] exposed to higher wind speeds become more prevention-focused and more likely to support prevention-focused electoral options", or (on p. 4) stating that "We ... tested whether higher wind speed increased individual’s [punctuation sic; emphasis added] prevention focus".


I wrote this post principally to draw attention to the obviously damaging errors in the records for "Election 50" in the Swiss data file. I have also written to the authors to report those issues, because these are clearly in need of urgent correction. Until that has happened, and perhaps until someone else (with access to Stata) has conducted a re-analysis of the results for both the "Swiss" and "Brexit/Scotland" studies, I think that caution should be exercised before citing this paper. The other issues that I have raised in this post are, of course, open to critique regarding their importance or relevance. For the avoidance of doubt, given the nature of some of the other posts that I have made on this blog, I am not suggesting that anything untoward has taken place here, other than perhaps a degree of carelessness.

Supporting files

I have made my modified version of Mo et al.'s code to reproduce Figure 2 available here, in the file "(Nick) Swiss_Analysis.R". If you decide to run it, I encourage you to use the authors' original data file ("Swiss.dta") from the ZIP file that can be downloaded from the link at the top of this post. However, as a convenience, I have made a copy of this file available along with my code. In the same place you will also find a small Excel table ("Cantons.xls") containing data for my analysis of the canton-level income question.


Thanks to Jean-Claude Fox for doing some further digging on the Swiss income numbers after this post was first published.


(*) Interestingly, the title of Table 1 and, even more explicitly, the footnote on p. 10 ("Remain" with an uppercase initial letter) suggest that the authors may have been aware that the actual voting choices were "Remain" and "Leave". Perhaps these were simplified to "No" and "Yes", respectively, for consistency with the reports of the Scottish independence referendum; but if so, this should have been reported.
(**) I exported the dataset from Stata's .dta format to .csv format using rio::convert(). I also confirmed that the errors that I report in this post were present in the Stata file by inspecting the data structure after the Stata file had been read in to R.
(***) The authors coded the Swiss referendums, which are listed with numbers 1–24 in the Supplementary Information, as 27–50, by adding 26. They also coded the 26 cantons of Switzerland as 51–76, apparently by adding 50 to the constitutional order number (1 = Zürich, 26 = Jura; see here), perhaps to ensure that no small integer that might creep into the data would be seen as either a valid referendum or canton (a good practice in general). I was able to check that the numerical order of the "State" variable is indeed the same as the constitutional order by examining the provided latitude and longitude for each canton on Google Maps (e.g., "State" 67, corresponding to the canton of St Gallen with constitutional order 17, has reported coordinates of 47.424482, 9.376717, which are in the centre of the town of St Gallen).
(****) I am not sure whether a single referendum in 26 cantons represents 1 or 26 data points. The results from one canton to the next are clearly not independent. I suppose I could have written "4.3% of the data" here.
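
The coding scheme described in footnote (***) amounts to two fixed offsets, which can be written as a pair of small helper functions. The function names are my own and do not appear in the authors' code.

```r
# "Election" codes 27-50 correspond to referendums 1-24
election_to_referendum <- function(code) code - 26
# "State" codes 51-76 correspond to constitutional order 1-26
state_to_canton_order <- function(code) code - 50

election_to_referendum(c(40, 50))   # 14 24
state_to_canton_order(67)           # 17 (St Gallen)
```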

02 July 2023

Strange numbers in the dataset of Zhang, Gino, & Norton (2016)

In this post I'm going to be discussing this article:

Zhang, T., Gino, F., & Norton, M. I. (2016). The surprising effectiveness of hostile mediators. Management Science, 63(6), 1972–1992. https://doi.org/10.1287/mnsc.2016.2431
You can download the article from here and the dataset from here.

[[ Begin update 2023-07-03 16:12 UTC ]]
Following feedback from several sources, I now see how it is in fact possible that these data could have been the result of using a slider to report the amount of money being requested. I still think that this would be a terrible way to design a study (see my previous update, below), as it causes a loss of precision for no obvious good reason compared to having participants type a maximum of 6 digits, and indeed the input method is not reported in the article. However, if a slider was used, then with multiple platforms the observed variety of data could have arisen.
In the interests of transparency I will leave this post up, but with the caveat that readers should apply caution in interpreting it until we learn the truth from the various inquiries and resolution exercises that are ongoing in the Gino case.
[[ End update 2023-07-03 16:12 UTC ]]

The focus of my interest here is Study 5. Participants (MTurk workers) were asked to imagine that they were a carpenter who had been given a contract to furnish a number of houses, but decided to use better materials than had been specified and so had overspent the budget by $300,000. The contractor did not want to reimburse them for this. The participants were presented with interactions that represented a mediation process (in the social sense of the word "mediation", not the statistical one) between the carpenter and the contractor. The mediator's interactions were portrayed as "Nice", "Bilateral hostile" (nasty to both parties), or "Unilateral hostile" (nasty to the carpenter only). After this exercise, the participants were asked to say how much of the $300,000 they would ask from the contractor. This was the dependent variable to show how effective the different forms of mediation were.

The authors reported (p. 1986):

We conducted a between-subjects ANOVA using participants’ demands from their counterpart as the dependent variable. This analysis revealed a significant effect for mediator’s level and directedness of hostility, F(2, 134) = 6.86, p < 0.001, partial eta² = 0.09. Post hoc tests using LSD corrections indicated that participants in the bilateral hostile mediator condition demanded less from their counterpart (M = $149,457, SD = 65,642) compared with participants in the unilateral hostile mediator condition (M = $208,807, SD = 74,379, p < 0.001) and the nice mediator condition (M = $183,567, SD = 85,616, p = 0.04). The difference between the latter two conditions was not significant (p = 0.11).

Now, imagine that you are a participant in this study. You are being paid $1.50 to pretend to be someone who feels that they are owed $300,000. How much are you going to ask for? I'm guessing you might ask for all $300,000; or perhaps you are prepared to compromise and ask for $200,000; or you might split the difference and ask for $150,000; or you might be in a hurry and think that 0 is the quickest number to type.

Let's look at what the participants actually entered. In this table, each cell is one participant; I have arranged them in columns, with each column being a condition, and sorted the values in ascending order.


This makes absolutely no sense. Not only did 95 out of 139 participants choose a number that wasn't a multiple of $1,000, but also, they chose remarkably similar non-round numbers. Twelve participants chose to ask for exactly $150,323 (and four others asked for $10,323, $170,323, or $250,323). Sixteen participants asked for exactly $150,324. Ten asked for $149,676, which interestingly is equal to $300,000 minus the aforementioned $150,324. There are several other six-digit, non-round numbers that occur multiple times in the data. Remember, every number in this table represents the response of an independent MTurk worker, taking the survey in different locations across the United States.
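
The arithmetic in the previous paragraph can be verified directly, using only the amounts quoted there.

```r
# Amounts quoted above, in dollars
requests <- c(150323, 150324, 149676, 10323, 170323, 250323)

requests %% 1000 == 0   # FALSE for all: none is a multiple of $1,000
300000 - 150324         # 149676
```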

To coin a phrase, it is not clear how these numbers could have arisen as a result of a natural process. If the authors can explain it, that would be great.

[[ Begin update 2023-07-02 12:38 UTC ]]

Several people have asked, on Twitter and in the comments here, whether these numbers could be explained by a slider having been used, instead of a numerical input field. I don't think so, for several reasons:
  1. It makes no sense to use a slider to ask people to indicate a dollar amount. It's a number. The authors report the mean amount to the nearest dollar. They are, ostensibly at least, interested in capturing the precise dollar amount.
  2. Had the authors used a slider, they would presumably have done so for a very specific reason, which one would imagine they would have reported.
  3. Some of the values reported are 150,323, 150,324, 150,647, 150,972, 151,592, and 151,620. The differences between these values are 1, 323, 325, 620, and 28. In another sequence, we see 180,130, 180,388, and 180,778, separated by 258 and 390; and in another, 200,216, 200,431, 200,864, and 201,290, separated by 215, 433, and 426. Even if we assume that the difference of 1 is a rounding error, in order for a slider to have the granularity to be able to indicate all of those numbers while also covering the range from 0 to 300,000, it would have to be many thousands of pixels wide. Real-world on-screen sliders typically run from 0 to 100 or 0 to 1000, with each of 400 or 500 pixels representing perhaps 0.25% or 0.20% of the available range.
Of course, all of this could be checked if someone had access to the original Qualtrics account. Perhaps Harvard will investigate this paper too...
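To put a rough number on the granularity argument in point 3, here is a small Python sketch. It assumes a hypothetical slider that covers 0 to 300,000 in equal steps and treats the difference of 1 as a rounding artefact; the first two values in the first run are the $150,323 and $150,324 amounts discussed earlier. The result is a lower bound, not a claim about how the survey was actually built.

```python
import math

RANGE = 300_000  # dollars covered by the hypothetical slider

# The three runs of neighbouring values discussed in point 3.
runs = [
    [150_323, 150_324, 150_647, 150_972, 151_592, 151_620],
    [180_130, 180_388, 180_778],
    [200_216, 200_431, 200_864, 201_290],
]

def gaps(values):
    """Differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]

all_gaps = [g for run in runs for g in gaps(run)]

# Ignore the gap of 1 (assumed rounding) and take the next smallest gap.
smallest = min(g for g in all_gaps if g > 1)

# An evenly stepped slider needs a step no bigger than that gap,
# so it needs at least this many distinct positions.
min_positions = math.ceil(RANGE / smallest)

# For comparison: dollars per pixel on a 500-pixel-wide slider.
dollars_per_pixel = RANGE / 500
```

With a smallest gap of 28, the slider would need more than 10,000 positions, whereas a 500-pixel slider moves $600 per pixel.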

[[ End update 2023-07-02 12:38 UTC ]]



Thanks to James Heathers for useful discussions, and to Economist 1268 at econjobrumors.com for suggesting that this article might be worth looking at.

28 June 2023

A coda to the Wansink story

The investigation of scientific misconduct by Ivy League universities is once again in the news at the moment, which prompts me to write up something that I should have written up quite a while ago. (The time I spend thinking about, and trying to help people understand, the Russian invasion of Ukraine has made as big a dent in my productivity as Covid-19.)

On October 31, 2018, I sent an open letter, signed by me and 50 colleagues, to Cornell. In it, I asked that they release the full text of the report of their inquiry into the misconduct of Professor Brian Wansink. On November 5, 2018, I received a reply from Michael Kotlikoff, the Provost of Cornell. He explained why the full text of the report was not being released (an explanation that did not impress Ivan Oransky at Retraction Watch), and added the following:

Cornell is now conducting a Phase II investigation to determine the degree to which any acts of research misconduct may have affected federally (NIH and USDA) funded research projects. ... As part of Phase II of the university’s investigation, Cornell has required Professor Wansink to collect and submit research data and records for all of his publications since 2005, when he came to the university, so that those records may be examined. We will provide a summary of this Phase II investigation at its conclusion. [emphasis added]

The Wansink story faded into the background after that, but a few months ago a small lightbulb fizzled into life above my head and I decided to find out what happened to that Phase II report. So I wrote to Provost Kotlikoff. He has kindly given me permission to quote his response verbatim:

Following my November 5 letter we indeed conducted a comprehensive Phase II analysis, but this was restricted to those scientific papers from Professor Wansink’s group that identified, or could be linked to, support from federal funds. This analysis, which was conducted on a subset of papers and followed federal guidelines, was reported to the NIH and to the USDA (the relevant funding organizations), and accepted by them. I should point out that this Phase II analysis did not include many of the papers identified by you and others as failing to meet scientific norms, as those were not associated with federal support, and therefore was not a comprehensive summary of the scientific issues surrounding Professor Wansink’s work.

I am sorry to say that Cornell does not release scientific misconduct reports provided to the NIH and the USDA. However, I believe that Cornell has appropriately addressed the scientific concerns that were identified by you and others (for which I thank you), and considers this matter closed.

So that seems to be it. We are apparently not going to see a summary of the Phase II investigation. Perhaps it was Cornell's initial intention to release this, but they were unable to do so for legal reasons. In any case, it's a little disappointing.

10 March 2023

Some interesting discoveries in a shared dataset: Néma et al. (2022).

In this post I'm going to be discussing this article, but mostly its dataset:

Néma, J., Zdara, J., Lašák, P., Bavlovič, J., Bureš, M., Pejchal, J., & Schvach, H. (2023). Impact of cold exposure on life satisfaction and physical composition of soldiers. BMJ Military Health. Advance online publication. https://doi.org/10.1136/military-2022-002237

The article itself doesn't need much commentary from me, since it has already been covered by Stuart Ritchie on Twitter here and in his iNews column here, as well as by Gideon Meyerowitz-Katz on Twitter here. So I will just cite or paraphrase some sentences from the Abstract:

[T]he aim of this study was to examine the effect of regular cold exposure on the psychological status and physical composition of healthy young soldiers in the Czech Army. A total of 49 (male and female) soldiers aged 19–30 years were randomly assigned to one of the two groups (intervention and control). The participants regularly underwent cold exposure for 8 weeks, in outdoor and indoor environments. Questionnaires were used to evaluate life satisfaction and anxiety, and an "InBody 770" device was used to measure body composition. Among other  results, systematic exposure to cold significantly lowered perceived anxiety (p=0.032). Cold water exposure can be recommended as an addition to routine military training regimens and is likely to reduce anxiety among soldiers.

The article PDF file contains a link to a repository in which the authors originally placed an Excel file of their main dataset named "Dataset_ColdExposure_sorted_InBody.xlsx" (behind the link entitled "InBody - Body Composition"). I downloaded this file and explored it, and found some interesting things that complement the investigations of the article itself; these discoveries form the main part of this blog post.

Recently, however—probably in reaction to the authors being warned by Gideon or someone else that their dataset contained personally identifying information (PII)—this file has been replaced with one named "Datasets_InBody+WC_ColdExposure.csv". I will discuss the new file near the end of this post, but for now, the good news is that the file containing PII is no longer publicly available.

[[ Update 2023-03-11 23:23 UTC: Added new information here about the LSQ dataset, and—further along in this post—a paragraph about the analysis of these data. ]]

The repository also contains a data file called "Dataset_ColdExposure_LSQ.csv", which represents the participants' responses at two timepoints to the Life Satisfaction Questionnaire. I downloaded this file and attempted to match the participant data across the two datasets.

The structure of the main dataset file

The Excel file that I downloaded contains six worksheets. Four of these contain the data for the two conditions that were reported in the article (Cold and Control), one each at baseline and at the end of the treatment period. Within those worksheets, participants are split into male and female, and within each gender a sequence number starting at 1 identifies each participant. A fifth worksheet named "InBody Začátek" contains the baseline data for each participant, and a sixth, named "InBody1", appears to contain every data record for each participant, as well as some columns which, while mostly empty, appear designed to hold contact information for each person.

Every participant's name and date of birth is in the file (!)

The first and most important problem in the file as it was uploaded, and was in place until a couple of days ago, is that a lot of PII was left in there. Specifically, the file contained the first and last names and date of birth of every participant. This study was carried out in the Czech Republic, and I am not familiar with the details of research ethics in that country, but it seems to me to be pretty clear that it is not acceptable to conduct before-and-after physiological measurements on people and then publish those numbers along with information that in most cases probably identifies them uniquely among the population of their country.

I have modified the dataset file to remove this PII before I share it. Specifically, I did this:

  1. Replaced the names of participants with random fake names assembled from lists of popular English-language first and last names. I use these names below where I need to identify a particular participant's data.
  2. Replaced the date of birth with a fake date consisting of the same year, but a random month and day. As a result of this, the "Age" column, which appears to have been each participant's age at their last birthday before they gave their baseline data, may no longer match the reported (fake) date of birth.

What actually happened in the study?

The Abstract states (see above) that 49 participants were in two conditions: exposure to Cold (Chlad, in Czech) and a no-treatment Control group (Kontrolní). But in the dataset there are 99 baseline measurement records, and the participants are recorded as being in four conditions. As well as Cold and Control, there is a condition called Mindfulness (the English word is used), and another called Spánek, which means Sleep in Czech.

This is concerning because these additional participants and conditions are not mentioned in the article. The Method section states that "A total of 49 soldiers (15 women and 34 men) participated in the study and were randomly divided into two groups (control and intervention) before the start of the experiment". If the extra participants and conditions were part of the same study, this should have been reported; the above sentence, as written, seems to be stretching the idea of innocent omission quite a bit. Omitting conditions and participants is a powerful "researcher degree of freedom" in the sense of Simmons et al.'s classic paper entitled "False-Positive Psychology". If these participants and conditions were not part of the same study then something very strange is happening, as it would imply that there were at least two studies being conducted with the same participant ID sequence number assignment and reported in the same data file.

Data seem to have been collected in two principal waves, January (leden) and March (březen) 2022. It is not clear why this was done. A few tests seem to have been performed in late 2021 or in February, April, or May 2022, but whatever the date, all participants were assigned to one of the two month groups. For reasons that are not clear, one participant whose baseline data were collected in December 2021 ("Martin Byrne") was assigned to the March group, although his data did not end up in the final group worksheets that formed the basis of the published article. Meanwhile, six participants ("George Fletcher", "Harold Gregory", "Harvey Barton", "Nicole Armstrong", "Christopher Bishop", and "Graham Foster") were assigned to the January group even though their data were collected in March 2022 or later; five of these (all except for "Nicole Armstrong") did end up in the final group worksheets. It seems that the grouping into "January" and "March" did not affect the final analyses, but it does make me wonder what the authors had in mind in creating these groups and then assigning people to them without apparently respecting the exact dates on which the data were collected. Again, it seems that plenty of researcher degrees of freedom were available.

How were participants filtered out?

There are 99 records in the baseline worksheet. These are in conditions as follows: Chlad (Cold), 28 (18 “leden/January”, 10 “březen/March”); Kontrolní (Control), 41 (33 “leden/January”, 8 “březen/March”); Mindfulness, 11 (all “leden/January”); Spánek (Sleep), 19 (11 “leden/January”, 8 “březen/March”).

Of the 28 participants in the Cold condition, three do not appear in the final Cold group worksheets that were used for the final analyses. Of the 41 in the Control condition, 17 do not appear in the final Control group worksheets. It is not clear what criteria were used to exclude these 3 (of 28) or 17 (of 41) people. The three in the Cold condition were all aged over 30, which corresponds to the reported cutoff age from the article, but does rather suggest that this cutoff might have been decided post hoc. Of the 17 people in the Control condition who did not make it to the final analyses, 11 were aged over 30, but six were not, so it is even more unclear why they were excluded.
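The arithmetic that connects the 99 baseline records to the 49 people in the final sample can be laid out as a short Python tally. The counts are the ones given above; the dictionary layout and variable names are just my own way of arranging them.

```python
# Baseline records per condition and wave, as counted above.
baseline = {
    "Cold":        {"January": 18, "March": 10},
    "Control":     {"January": 33, "March": 8},
    "Mindfulness": {"January": 11, "March": 0},
    "Sleep":       {"January": 11, "March": 8},
}

# Baseline records that do not appear in the final group worksheets.
excluded = {"Cold": 3, "Control": 17}

# Total baseline records per condition.
totals = {cond: sum(waves.values()) for cond, waves in baseline.items()}

# Size of each final group after the exclusions.
final = {cond: totals[cond] - excluded[cond] for cond in excluded}

print(totals, sum(totals.values()))
print(final, sum(final.values()))
```

The two final groups come to 25 and 24, which matches the 49 participants reported in the article.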

Despite the claim by the authors that participants were aged 19–30, four people in the final Cold condition worksheets ("Anthony Day", "Hunter Dunn", "Eric Collins", and "Harvey Barton") are aged between 31 and 35.

Were the authors participants themselves?

At least two, and likely five, of the participants appear to be authors of the article. I base this observation on the fact that in five cases, a participant has the same last name and initial as an author. In two of those cases, an e-mail address is reported that appears to correspond to the institution of that author. For the other three, I contacted a Czech friend, who used this website to look up the frequencies of the names in question; he told me that the last names (with any initials) only correspond to 55, 10, and 3 people in the entire Czech Republic, out of a population of 10.5 million.

Now, perhaps all of these people—one of whom ended up in the final Control group—are also active-duty military personnel, but it still does not seem appropriate for a participant in a psychological study that involves self-reported measures of one's attitudes before and after an intervention to also be an author on the associated article and hence at least implicitly involved with the design of the study. This also calls into question the randomisation and allocation process, as it is unlikely a randomised trial could have been conducted appropriately if investigators were also participants. (The article itself gives no detail about the randomisation process.)

Some of the participants in the final sample are duplicates (!)

The authors claimed that their sample (which I will refer to as the "final sample", given the uncertainty over the number of people who actually participated in the study) consisted of 49 people, which the reader might reasonably assume means 49 unique individuals. Yet, there are some obvious duplicates in the worksheets that describe the Cold and Control groups:
  1. The participant to whom I have assigned the fake name "Stella Arnold" appears both in the Cold group with record ID #7 and in the Control group with record ID #6, both with Gender=F (there are separate sequences of ID numbers for male and female participants within each worksheet, with both sequences starting at 1, so the gender is needed to distinguish between them). The corresponding baseline measurements are to be found in rows 9 and 96 of the "InBody Začátek" (baseline measurements) worksheet.
  2. The participant to whom I have assigned the fake name "Harold Gregory" appears both in the Cold group with record ID #15 and in the Control group with record ID #12, both with Gender=M. The corresponding baseline measurements are to be found in rows 38 and 41 of the "InBody Začátek" worksheet.
  3. The participant to whom I have assigned the fake name "Stephanie Bird" appears twice in the Control groups with record IDs #4 and #7, both with Gender=F. The corresponding baseline measurements are to be found in rows 73 and 99 of the "InBody Začátek" worksheet.
  4. The participant to whom I have assigned the fake name "Joe Gill" appears twice in the Control groups with record IDs #4 and #7, both with Gender=M. The corresponding baseline measurements are to be found in rows 57 and 62 of the "InBody Začátek" worksheet.
In view of this, it seems difficult to be certain about the actual sample size of the final (two-condition) study, as reported in the article.

Many other participants were assigned to more than one of the four conditions

36 of the 99 records in the baseline worksheet have duplicated names. Put another way, 18 people appear to have been enrolled in the overall (four-condition) study in two different conditions. Of these, five were in the Control condition in both time periods ("leden/January" and "březen/March"); four were in the Control condition once and a non-Control condition once; and nine were recorded as being in two non-Control conditions. In 17 cases the two conditions were labelled with different time periods, but in one case ("Harold Gregory"), both conditions (Cold/Control) were labelled "leden/January". This participant was one of the two who appeared in both final conditions (see previous paragraph); he is also one of the six participants assigned to a "January" group with data that were actually collected in March 2022.
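For anyone who wants to repeat this kind of check, the following Python sketch shows the idea: group the baseline records by name and look for names that occur more than once. The records here are invented toy data ("Person A" and so on), not rows from the dataset, and the function names are hypothetical.

```python
from collections import defaultdict

# Invented toy records (name, condition, wave); NOT taken from the dataset.
records = [
    ("Person A", "Cold", "January"),
    ("Person A", "Control", "January"),
    ("Person B", "Control", "January"),
    ("Person B", "Control", "March"),
    ("Person C", "Sleep", "January"),
]

def enrolments_by_name(rows):
    """Map each name to the list of (condition, wave) pairs recorded for it."""
    grouped = defaultdict(list)
    for name, condition, wave in rows:
        grouped[name].append((condition, wave))
    return grouped

def multiply_enrolled(rows):
    """Names that appear in more than one baseline record."""
    return {n: e for n, e in enrolments_by_name(rows).items() if len(e) > 1}

dupes = multiply_enrolled(records)

# Names whose duplicate records all carry the same wave label.
same_wave = [n for n, e in dupes.items() if len({w for _, w in e}) == 1]
```

In the toy data, "Person A" plays the role of the one participant whose two conditions both carry the same wave label.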

Continuing on this point, two records in the worksheet ("InBody1") that contains the record of all tests, both baseline and subsequent, appear to refer to the same person, as the dates of birth are the same (although the participant ID numbers are different) and the original Czech names differ only in the addition/omission of one character; for example, if the names were English, this might be "John Davis" and "John Davies". The fake names of these two records in the dataset that I am sharing are "Anthony Day" and "Arthur Burton", with "Anthony Day" appearing in the final worksheets and being 31 years old, as mentioned above. The height and other physiological data for these two records, dated two months apart, are similar but not identical.

Inconsistencies across the datasets

The LSQ data contains records for 49 people, with 25 in the Cold group and 24 in the Control group, which matches the main dataset. However, there are some serious inconsistencies between the two datasets.

First, the gender split of the Control group is not the same between the datasets. In the main dataset, and in the article, there were 17 men and 7 women in this group. However, in the LSQ dataset, there are 14 men and 10 women in the Control group.

Second, the ages that are reported for the participants in the LSQ data do not match the ages in the main dataset. There is not enough information in the LSQ dataset—which has its own participant ID numbering scheme—to reliably connect individual participants across the two, but both datasets report the age of the participants and so as a minimum it should be possible to find correspondences at that level. However, this is not the case. Leaving aside the three participants who differ on gender (which I chose to do by assuming that the main database was correct, since its gender split matches the article), there are 11 other entries in the LSQ dataset where I was unable to find a corresponding match on age in the main dataset. Of those 11, three differ by just one year, which could perhaps just be explained by the participant having a birthday between two data collection timepoints, but for the other eight, the difference is at least 3 years, no matter how one arranges the records.
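The age comparison can be sketched as follows in Python. This is one way of doing it under my own assumptions, not a record of the exact procedure I followed: exact matches are paired off first, and the leftovers are then paired in sorted order, which gives the smallest possible total age gap for values on a single scale. The ages in the example are invented.

```python
from collections import Counter

def unmatched_ages(lsq_ages, main_ages):
    """Ages left over in each list after pairing off exact matches."""
    left_lsq = Counter(lsq_ages) - Counter(main_ages)
    left_main = Counter(main_ages) - Counter(lsq_ages)
    return sorted(left_lsq.elements()), sorted(left_main.elements())

def leftover_gaps(lsq_left, main_left):
    """Pair the sorted leftovers in order and return the absolute gaps."""
    return [abs(a - b) for a, b in zip(lsq_left, main_left)]

# Invented example ages for one group; NOT taken from either dataset.
lsq = [21, 22, 25, 30]
main = [21, 23, 25, 26]

lsq_left, main_left = unmatched_ages(lsq, main)
gaps = leftover_gaps(lsq_left, main_left)
```

A gap of 1 could be a birthday between timepoints; larger gaps are the ones that cannot be explained that way.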

In summary, the LSQ dataset is inconsistent with the main dataset on 14 out of 49 (29%) of its records.

Other curiosities

Several participants, including two in the final Control condition, have decimal commas instead of decimal points for their non-integer values. There are also several instances in the original datafile where cells in analysed columns have numbers recorded as text. It is not clear how these mixed formats could have been either generated or (conveniently) analysed by software.
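A minimal Python sketch of how such cells could be flagged is shown below. It assumes the cell values have already been read from the spreadsheet into Python objects; `classify_cell` is a hypothetical helper of my own, not part of any spreadsheet library.

```python
def classify_cell(cell):
    """Label a cell as a real number, a number stored as text,
    a decimal-comma value, or something that is not numeric."""
    if isinstance(cell, (int, float)):
        return "number"
    text = str(cell).strip()
    try:
        float(text)
        return "number stored as text"
    except ValueError:
        pass
    try:
        # Treat a single comma as a decimal separator.
        float(text.replace(",", ".", 1))
        return "decimal comma"
    except ValueError:
        return "not numeric"

labels = [classify_cell(c) for c in [72.5, "72.5", "72,5", "abc"]]
```

Counting the labels per column would show which analysed columns mix formats.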

Participants have a “Date of Registration”. It is not clear what this means. In 57 out of 99 cases, this date is the same as the date of the tests, which might suggest that this is the date when the participant joined the study, but some of the dates go back as far as 2009.

Data were sometimes collected on more than two occasions per participant (or on more than four occasions for some participants who were, somehow, assigned to two conditions). For example, data were collected four times from "Orlando Goodwin" between 2022-03-07 and 2022-05-21, while he was in the Cold condition (to add to the two times when data were collected from him between 2022-01-13 and 2022-02-15, when he was in the Sleep condition). However, the observed values in the record for this participant in the "Cold_Group_After" worksheet suggest that the measurement used as his final one was the third of these four, taken on 2022-05-05. The purpose of the second and fourth measurements of this person in the Cold condition is thus unclear, but again it seems that this practice could lead to abundant researcher degrees of freedom.

There are three different formats for the participant ID field in the worksheet that contains all the measurement records. In the majority of the cases, the ID seems to be the date of registration in YYMMDD format, followed by a one- or two-digit sequence number, for example "220115-4" for the fourth participant registered on January 15, 2022. In some cases the ID is the letters "lb" followed by what appears to be a timestamp in YYMMDDHHMMSS format, such as "lb151210070505". Finally, one participant (fake name "Keith Gordon") has the ID "d14". This degree of inconsistency does not convey an atmosphere of rigour.
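The three formats are easy to tell apart with regular expressions. The Python sketch below uses patterns that I inferred from the examples above, so they are an assumption about the ID scheme rather than a documented specification.

```python
import re

# Patterns inferred from the example IDs in this post (an assumption).
PATTERNS = {
    "date-sequence": re.compile(r"\d{6}-\d{1,2}"),  # YYMMDD plus sequence number
    "lb-timestamp": re.compile(r"lb\d{12}"),        # "lb" plus YYMMDDHHMMSS
}

def classify_id(participant_id):
    """Return the name of the first pattern that matches the whole ID."""
    for label, pattern in PATTERNS.items():
        if pattern.fullmatch(participant_id):
            return label
    return "other"

examples = ["220115-4", "lb151210070505", "d14"]
labels = {pid: classify_id(pid) for pid in examples}
```

Anything that falls into "other", such as "d14", is worth looking at by hand.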

The new data file

I downloaded the original (XLSX format) data file on 2023-03-05 (March 5) at 20:52 UTC. That file (or at least, the link to it) was still there on 2023-03-07 (March 7) at 11:59 UTC. When I checked on 2023-03-08 (March 8) at 17:26 UTC the link was dead, implying that the file with PII had been removed at some intermediate point in the previous 30 hours. At some point after that a new dataset file was uploaded to the same location, which I downloaded on 2023-03-09 (March 9) at 14:27 UTC. This new file, in CSV format, is greatly simplified compared to the original. Specifically:

  1. The data for the two conditions (Cold/Control) and the two timepoints (baseline/end) have been combined into one sheet in place of four.
  2. The worksheets with the PII and other experimental conditions have been removed.
  3. Most of the data fields have been removed; I assume that the remaining fields are sufficient to reproduce the analyses from the paper, but I haven't checked that as it isn't my purpose here.
  4. Two data fields have been added. One of these, named "SMM(%)", appears to be calculated as the fraction of the participant's weight that is accounted for by their skeletal muscle mass, both of which were present in the initial dataset. However, the other, named "WC (waist circumference)", appears to be new, as I cannot find it anywhere in the initial dataset. This might make one wonder what other variables were collected but not reported.
Apart from these changes, however, the data concerning the final conditions (49 participants, Cold and Control) are identical to the first dataset file. That is, the four duplicate participants described above are still in there; it's just harder to spot them now without the baseline record worksheet to tie the conditions together.

Data availability

I have made two censored versions of the dataset available here. One of these ("Simply_anonymized_dataset.xlsx") has been made very quickly from the original dataset by simply deleting the participants' names, dates of birth, and (where present) e-mail addresses. The other ("Public_analysis_dataset.xls") has been cleaned up from the original in several ways, and includes the fake names and dates of birth discussed above. This file is probably easier to follow if you want to reproduce my analyses. I believe that in both cases I have taken sufficient steps to make it impractical to identify any of the participants from the remaining information.
At the same location I have also placed another file ("Compare_datasets.xls") in which I compare the data from the initial and new dataset files, and demonstrate that where the same fields are present, their values are identical.

If anyone wants to check my work against the original, untouched dataset file, which includes the PII of the participants, then please contact me and we can discuss it. There's no obvious reason why I should be entitled to see this PII and another suitably qualified researcher should not, but of course it would not be a good idea to share it for all to see.


My thanks go to:

  • Gideon Meyerowitz-Katz (@GidMK) for interesting discussions and contributing a couple of the points in this post, including making the all-important discovery of the PII (and writing to the authors to get them to take it down).
  • @matejcik for looking up the frequencies of Czech names.
  • Stuart Ritchie for tweeting skeptically about the hyping of the results of the study.