09 July 2023

Data errors in Mo et al.'s (2023) analysis of wind speed and voting patterns

This post is about some issues in the following article, and most notably its dataset:

Mo, C.H., Jachimowicz, J. M., Menges, J. I., & Galinsky, A. D. (2023). The impact of incidental environmental factors on vote choice: Wind speed is related to more prevention‑focused voting. Political Behavior. Advance online publication. https://doi.org/10.1007/s11109-023-09865-y

You can download the article from here, the Supplementary Information from here [.docx], and the dataset from here. Credit is due to the authors for making their data available so that others can check their work.

Introduction

The premise of this article, which was brought to my attention in a direct message by a Twitter user, is that the wind speed observed on the day of an "election" (although in fact, all the cases studied by the authors were referendums) affects the behaviour of voters, but only if the question on the ballot represents a choice between prevention- and promotion-focused options, in the sense of regulatory focus theory. The authors stated in their abstract that "we find that individuals exposed to higher wind speeds become more prevention-focused and more likely to support prevention-focused electoral options".

This article (specifically the part that focused on the UK's referendum on leaving the European Union, "Brexit") has already been critiqued by Erik Gahner here.

I should state from the outset that I was skeptical about this article when I read the abstract, and things did not get better when I found a couple of basic factual errors in the descriptions of the Brexit referendum:
  1. On p. 9 the authors claim that "The referendum for UK to leave the European Union (EU) was advanced by the Conservative Party, one of the three largest parties in the UK", and again, on p. 12, they state "In the case of the Brexit vote, the Conservative Party advanced the campaign for the UK to leave the EU". However, this is completely incorrect. The Conservative Party was split over how to vote, but the majority of its members of parliament, including David Cameron, the party leader and Prime Minister, campaigned for a Remain vote (source).
  2. At several points, the authors claim that the question posed in the Brexit referendum required a "Yes"/"No" answer. On p. 7 we read "For Brexit, the “No” option advanced by the Stronger In campaign was seen as clearly prevention-oriented ... whereas the “Yes” option put forward by the Vote Leave campaign was viewed as promotion-focused". The reports of result coding on p. 8, and the note to Table 1 on p. 10, repeat this claim. But this is again entirely incorrect. The options given to voters were to "Remain" (in the EU) or "Leave" (the EU). As the authors themselves note, the official campaign against EU membership was named "Vote Leave" (and there was also an unofficial campaign named "Leave.EU"). Indeed, this choice was adopted, rather than "Yes" or "No" responses to the question "Should the United Kingdom remain a member of the European Union?", precisely to avoid any perception of "positivity bias" in favour of a "Yes" vote (source). Note also here that, had this change not been made, the pro-EU vote would have been "Yes", and not the (prevention-focused) "No" claimed by the authors. (*)
Nevertheless, the article's claims are substantial, with remarkable implications for politics if they were to be confirmed. So I downloaded the data and code and tried to reproduce the results. Most of the analysis was done in Stata, which I don't have access to, but I saw that there was an R script to generate Figure 2 of the study that analysed the Swiss referendum results, so I ran that.

My reproduction of the original Figure 2 from the article. The regression coefficient for the line in the "Regulatory Focus Difference" condition is B=0.545 (p=0.000006), suggesting that every 1 km/h increase in wind speed produces an increase of more than half a percentage point in the vote for the prevention-oriented campaign.

Catastrophic data problems

I had no problem in reproducing Figure 2 from the article. However, when I looked a little closer at the dataset (**) I noticed a big problem in the numbers. Take a look at the "DewPoint" and "Humidity" variables for "Election 50", which corresponds to Referendum 24 (***) in the Supplementary Information, and see if you can spot the problem.


Neither of those variables can possibly be correct for "Election 50" (note that the same issues affect the records for every "State", i.e., Swiss canton):
  • DewPoint, which would normally be a Fahrenheit temperature a few degrees below the actual air temperature, contains numbers between 0.401 and 0.626. The air temperature ranges from 45.3 to 66.7 degrees. For the dew point temperatures to be correct would require the relative humidity to be around 10% (calculator; see also the sketch after this list), which seems unlikely in Switzerland on a mild day in May. Perhaps these DewPoint values in fact correspond to the relative humidity?
  • Humidity (i.e., relative atmospheric humidity), which by definition should be a fraction between 0 and 1, is instead a number in the range from 1008.2 to 1015.7. I am not quite sure what might have caused this. These numbers look like they could represent some measure of atmospheric pressure, but they only correlate at 0.538 with the "Pressure" variable for "Election 50".
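As a quick check on the first of these points, here is a minimal R sketch (my own calculation, not part of the authors' code) that estimates the relative humidity implied by treating the "DewPoint" values as Fahrenheit temperatures, using the standard Magnus approximation for saturation vapour pressure; the temperatures are the extremes quoted above.

    # Minimal sketch (my own check, not the authors' code). The Magnus/Bolton
    # approximation gives saturation vapour pressure (hPa) for a temperature in C.
    f_to_c <- function(f) (f - 32) / 1.8
    sat_vp <- function(t_c) 6.112 * exp(17.67 * t_c / (t_c + 243.5))

    # Relative humidity implied by an air temperature and a dew point, both in F
    implied_rh <- function(temp_f, dewpoint_f) {
      100 * sat_vp(f_to_c(dewpoint_f)) / sat_vp(f_to_c(temp_f))
    }

    implied_rh(temp_f = 45.3, dewpoint_f = 0.401)   # about 15%
    implied_rh(temp_f = 66.7, dewpoint_f = 0.626)   # about 7%

These values are consistent with the "around 10%" figure from the online calculator mentioned above, and equally implausible for a mild Swiss day.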
To evaluate the impact of these strange numbers on the authors' model, I modified their R script, Swiss_Analysis.R, to remove the records for "Election 50" and obtained this result from the remaining 23 referendums:
Figure 2 with "Election 50" (aka Referendum 24) removed from the model.

The angle of the regression line on the right is considerably less jaunty in this version of Figure 2. The coefficient has gone from B=0.545 (SE=0.120, p=0.000006) to B=0.266 (SE=0.114, p=0.02), simply by removing the damaged data that were apparently causing havoc with the model.

How robust is the model now?

A p value of 0.02 does not seem like an especially strong result. To probe its robustness, after removing the damaged data for "Election 50", I iterated over the dataset, removing one further "Election" each time and refitting the model (a sketch of this loop appears after the figure below). In seven cases (removing "Election" 33, 36, 39, 40, 42, 46, or 47) the coefficient for the interaction in the resulting model had a p value above the conventional significance level of 0.05. In the most extreme case, removing "Election 40" (i.e., Referendum 14, "Mindestumwandlungsgesetz") caused the coefficient for the interaction to drop to 0.153 (SE=0.215, p=0.478), as shown in the next figure. It seems to me that if the statistical significance of an effect disappears with the omission of just one of the 23 (****) valid data points in 30% of the possible cases, this could indicate a lack of robustness in the effect.
Figure 2 with "Election 50" (aka Referendum 24) and "Election 40" (aka Referendum 14) removed from the model.

Other issues

Temperature precision
The ambient temperatures on the days of the referendums (variable "Temp") are reported with eight decimal places. It is not clear where this (apparently spurious) precision could have come from. Judging from their range, the temperatures appear to be in degrees Fahrenheit, whereas one would expect the original Swiss meteorological data to be expressed in degrees Celsius. However, the conversion between the two scales is simple (F = C * 1.8 + 32) and cannot introduce more than one extra decimal place. The authors state that "Weather data were collected from www.forecast.io/raw/", but unfortunately that link now redirects to a page suggesting that this source is no longer available.

Cloud cover
The "CloudCover" variable takes only eight distinct values across the entire dataset, namely 2, 3, 5, 6, 8, 24, 34, and 38. It is not clear what these values represent, but it seems unlikely that they (all) correspond to a percentage or fraction of the sky covered by clouds. Yet, this variable is included in the regression models as a linear predictor. If the values represent some kind of ordinal or even nominal coding scheme, rather than being a parameter of some meteorological process, then including this variable could have arbitrary consequences for the regression (after all, 24, 34, and 38 might equally well have been coded ordinally as 9, 10, and 11, or perhaps nominally as -99, -45, and 756). If the intention is for these numbers to represent obscured eighths of the sky ("oktas"), then there is clearly a problem with the values above 8, which constitute 218 of the 624 records in the dataset (34.9%).

Income
It would also be interesting to know the source of the "Income" data for each Swiss canton, and what this variable represents (e.g., median salary, household income, gross regional product, etc). After extracting the income data and canton numbers, and converting the latter into names, I consulted several Swiss or Swiss-based colleagues, who expressed skepticism that the cantons of Schwyz, Glarus, and Jura would have the #1, #3, and #4 incomes by any measure. I am slightly concerned that there may have been an issue with the sorting of the cantons when the Income variable was populated. The Supplementary Information says "Voting and socioeconomic information was obtained from the Swiss Federal Office of Statistics (Bundesamt für Statistik 2015)", and that reference points to a web page entitled “Detaillierte Ergebnisse Der Eidgenössischen Volksabstimmungen” with URL http://www.bfs.admin.ch/bfs/portal/de/index/themen/17/03/blank/data/01.html, but that link is dead (and in any case, the title means "Detailed results of Federal referendums"; such a page would generally not be expected to contain socioeconomic data).


Swiss cantons (using the "constitution order" mapping from numbers to names) and their associated "Income", presumably an annual figure in Swiss francs. Columns "Income(Mo)" and the corresponding rank order "IncRank" are from Mo et al.'s dataset; "Statista" and "StatRank" are from statista.com.

I obtained some fairly recent Swiss canton-level household income data from here and compared it with the data from the article. The results are shown in the figure above. The Pearson correlation between the two sets of numbers was 0.311, with the rank-order correlation being 0.093. I think something may have gone quite badly wrong here.
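These correlations can be computed from the small Excel table ("Cantons.xls") mentioned under "Supporting files" below; in this sketch the column names (Income_Mo, Statista) are my own labels for the two income series, not anything from the original dataset.

    # Correlate the article's canton-level "Income" values with the Statista
    # household-income figures; the column names are my own labels.
    library(readxl)
    cantons <- read_excel("Cantons.xls")
    cor(cantons$Income_Mo, cantons$Statista)                        # ~0.31
    cor(cantons$Income_Mo, cantons$Statista, method = "spearman")   # ~0.09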

Turnout
The value of the "Turnout" variable is the same for all cantons. This suggests that the authors may have used some national measure of turnout here. I am not sure how much value such a variable can add. The authors note (footnote 12, p. 17) that "We found that, except for one instance, no other weather indicator was correlated with the number of prevention-focused votes without simultaneously also affecting turnout rates. Temperature was an exception, as increased temperature was weakly correlated with a decrease in prevention-focused vote and not correlated with turnout". It is not clear to me what the meaning would be of calculating a correlation between canton-level temperature and national-level turnout.

Voting results do not always sum to 1
Another minor point about whatever cleaning has been performed on the dataset is that in 68 out of 624 cases (10.9%), the sum of "VotingResult1" and "VotingResult2" (representing the "Yes" and "No" vote shares) is 1.01 rather than 1.00. Perhaps the second number was derived by subtracting the first from 1.00 at a point when the first was still a percentage with one decimal place (e.g., 62.5%), with both numbers then being rounded to two decimal places and the trailing 5 being rounded up in each case. In any case, one would expect these two numbers to sum to 1.00. This might not make an enormous amount of difference to the results, but it does suggest that the preparation of the data file may not have been done with excessive care.
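The following minimal illustration shows how that could happen; it is only a conjecture about how the file might have been prepared, not something documented by the authors.

    # Illustration of the rounding conjecture. Example: a reported "Yes" share of 62.5%.
    yes <- 0.625
    no  <- 1 - yes                                  # 0.375
    round_half_up <- function(x, d) floor(x * 10^d + 0.5) / 10^d
    round_half_up(yes, 2) + round_half_up(no, 2)    # 1.01
    round(yes, 2) + round(no, 2)                    # 1.00 (R rounds exact halves to even)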

Mean-centred variables
Two of the control variables, "Pressure" and "CloudCover", appear in the dataset in two versions, raw and mean-centred. There doesn't seem to be any reason to mean-centre these variables, but it is something that is commonly done when calculating interaction terms. I wonder whether at some point in the analyses the authors tested atmospheric pressure and cloud cover, rather than wind speed, as possible drivers of an effect on voting. Certainly there seems to be quite a lot of scope for the authors to have wandered around Andrew Gelman's "Garden of forking paths" in these analyses, which do not appear to have been pre-registered.

No measure of population
Finally, a huge (to me, anyway) limitation of this study is that there is no measure of, or attempt to weight the results by, the population of the cantons. The most populous Swiss canton (Zürich) has a population about 90 times that of the least populous (Appenzell Innerrhoden), yet the cantons all have equal weight in the models. The authors barely mention this as a limitation; they only mention the word "population" once, in the context of determining the average wind speed in Study 1. Of course, the ecological fallacy [.pdf] is always lurking whenever authors try to draw conclusions about the behaviour of individuals, whether or not the population density is taken into account, although this did not stop the authors from claiming in their abstract that "we find that individuals [emphasis added] exposed to higher wind speeds become more prevention-focused and more likely to support prevention-focused electoral options", or (on p. 4) stating that "We ... tested whether higher wind speed increased individual’s [punctuation sic; emphasis added] prevention focus".

Conclusion

I wrote this post principally to draw attention to the obviously damaging errors in the records for "Election 50" in the Swiss data file. I have also written to the authors to report those issues, because these are clearly in need of urgent correction. Until that has happened, and perhaps until someone else (with access to Stata) has conducted a re-analysis of the results for both the "Swiss" and "Brexit/Scotland" studies, I think that caution should be exercised before citing this paper. The other issues that I have raised in this post are, of course, open to critique regarding their importance or relevance. For the avoidance of doubt, given the nature of some of the other posts that I have made on this blog, I am not suggesting that anything untoward has taken place here, other than perhaps a degree of carelessness.

Supporting files

I have made my modified version of Mo et al.'s code to reproduce Figure 2 available here, in the file "(Nick) Swiss_Analysis.R". If you decide to run it, I encourage you to use the authors' original data file ("Swiss.dta") from the ZIP file that can be downloaded from the link at the top of this post. However, as a convenience, I have made a copy of this file available along with my code. In the same place you will also find a small Excel table ("Cantons.xls") containing data for my analysis of the canton-level income question.

Acknowledgements

Thanks to Jean-Claude Fox for doing some further digging on the Swiss income numbers after this post was first published.

Footnotes

(*) Interestingly, the title of Table 1 and, even more explicitly, the footnote on p. 10 ("Remain" with an uppercase initial letter) suggest that the authors may have been aware that the actual voting choices were "Remain" and "Leave". Perhaps these were simplified to "No" and "Yes", respectively, for consistency with the reports of the Scottish independence referendum; but if so, this should have been reported.
(**) I exported the dataset from Stata's .dta format to .csv format using rio::convert(). I also confirmed that the errors that I report in this post were present in the Stata file itself, by inspecting the data structure after reading the .dta file directly into R; a sketch of this check follows.
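The conversion and cross-check amounted to something like this (file names as in the authors' ZIP archive):

    # Convert the Stata file to CSV, then read the original .dta directly to
    # confirm that the "Election 50" values are the same by both routes.
    rio::convert("Swiss.dta", "Swiss.csv")
    swiss_check <- haven::read_dta("Swiss.dta")
    range(subset(swiss_check, Election == 50)$DewPoint)   # 0.401 to 0.626
    range(subset(swiss_check, Election == 50)$Humidity)   # 1008.2 to 1015.7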
(***) The authors coded the Swiss referendums, which are listed with numbers 1–24 in the Supplementary Information, as 27–50, by adding 26. They also coded the 26 cantons of Switzerland as 51–76, apparently by adding 50 to the constitutional order number (1 = Zürich, 26 = Jura; see here), perhaps to ensure that no small integer that might creep into the data would be seen as either a valid referendum or canton (a good practice in general). I was able to check that the numerical order of the "State" variable is indeed the same as the constitutional order by examining the provided latitude and longitude for each canton on Google Maps (e.g., "State" 67, corresponding to the canton of St Gallen with constitutional order 17, has reported coordinates of 47.424482, 9.376717, which are in the centre of the town of St Gallen).
(****) I am not sure whether a single referendum in 26 cantons represents 1 or 26 data points. The results from one canton to the next are clearly not independent. I suppose I could have written "4.3% of the data" here.


02 July 2023

Strange numbers in the dataset of Zhang, Gino, & Norton (2016)

In this post I'm going to be discussing this article:

Zhang, T., Gino, F., & Norton, M. I. (2016). The surprising effectiveness of hostile mediators. Management Science, 63(6), 1972–1992. https://doi.org/10.1287/mnsc.2016.2431
 
You can download the article from here and the dataset from here.

[[ Begin update 2023-07-03 16:12 UTC ]]
Following feedback from several sources, I now see how it is in fact possible that these data could have been the result of using a slider to report the amount of money being requested. I still think that this would be a terrible way to design a study (see my previous update, below), as it causes a loss of precision for no obvious good reason compared to having participants type a maximum of 6 digits, and indeed the input method is not reported in the article. However, if a slider was used, then with multiple platforms the observed variety of data could have arisen.
 
In the interests of transparency I will leave this post up, but with the caveat that readers should apply caution in interpreting it until we learn the truth from the various inquiries and resolution exercises that are ongoing in the Gino case.
[[ End update 2023-07-03 16:12 UTC ]]

The focus of my interest here is Study 5. Participants (MTurk workers) were asked to imagine that they were a carpenter who had been given a contract to furnish a number of houses, but decided to use better materials than had been specified and so had overspent the budget by $300,000. The contractor did not want to reimburse them for this. The participants were presented with interactions that represented a mediation process (in the social sense of the word "mediation", not the statistical one) between the carpenter and the contractor. The mediator's interactions were portrayed as "Nice", "Bilateral hostile" (nasty to both parties), or "Unilateral hostile" (nasty to the carpenter only). After this exercise, the participants were asked to say how much of the $300,000 they would ask from the contractor. This was the dependent variable to show how effective the different forms of mediation were.

The authors reported (p. 1986):

We conducted a between-subjects ANOVA using participants’ demands from their counterpart as the dependent variable. This analysis revealed a significant effect for mediator’s level and directedness of hostility, F(2, 134) = 6.86, p < 0.001, partial eta² = 0.09. Post hoc tests using LSD corrections indicated that participants in the bilateral hostile mediator condition demanded less from their counterpart (M = $149,457, SD = 65,642) compared with participants in the unilateral hostile mediator condition (M = $208,807, SD = 74,379, p < 0.001) and the nice mediator condition (M = $183,567, SD = 85,616, p = 0.04). The difference between the latter two conditions was not significant (p = 0.11).

Now, imagine that you are a participant in this study. You are being paid $1.50 to pretend to be someone who feels that they are owed $300,000. How much are you going to ask for? I'm guessing you might ask for all $300,000; or perhaps you are prepared to compromise and ask for $200,000; or you might split the difference and ask for $150,000; or you might be in a hurry and think that 0 is the quickest number to type.

Let's look at what the participants actually entered. In this table, each cell is one participant; I have arranged them in columns, with each column being a condition, and sorted the values in ascending order.

 

This makes absolutely no sense. Not only did 95 out of 139 participants choose a number that wasn't a multiple of $1,000, but also, they chose remarkably similar non-round numbers. Twelve participants chose to ask for exactly $150,323 (and four others asked for $10,323, $170,323, or $250,323). Sixteen participants asked for exactly $150,324. Ten asked for $149,676, which interestingly is equal to $300,000 minus the aforementioned $150,324. There are several other six-digit, non-round numbers that occur multiple times in the data. Remember, every number in this table represents the response of an independent MTurk worker, taking the survey in different locations across the United States.


To coin a phrase, it is not clear how these numbers could have arisen as a result of a natural process. If the authors can explain it, that would be great.


[[ Begin update 2023-07-02 12:38 UTC ]]

Several people have asked, on Twitter and in the comments here, whether these numbers could be explained by a slider having been used, instead of a numerical input field. I don't think so, for several reasons:
  1. It makes no sense to use a slider to ask people to indicate a dollar amount. It's a number. The authors report the mean amount to the nearest dollar. They are, ostensibly at least, interested in capturing the precise dollar amount.
  2. Had the authors used a slider, they would presumably have done so for a very specific reason, which one would imagine they would have reported.
  3. Some of the values reported are 150,323, 150,324, 150,647, 150,972, 151,592, and 151,620. The differences between these values are 1, 323, 325, 620, and 28. In another sequence, we see 180,130, 180,388, and 180,778, separated by 258 and 390; and in another, 200,216, 200,431, 200,864, and 201,290, separated by 215, 433, and 426. Even if we assume that the difference of 1 is a rounding error, for a slider to have the granularity to indicate all of those numbers while also covering the range from 0 to 300,000, it would have to be many thousands of pixels wide (see the quick check after this list). Real-world on-screen sliders typically run from 0 to 100 or from 0 to 1000, with each of their 400 or 500 pixels representing perhaps 0.20% or 0.25% of the available range.
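Here is the back-of-the-envelope arithmetic behind point 3 (my own check, in R; none of this comes from the paper):

    # Differences between some of the reported responses, and the resolution a
    # 0-300,000 slider would need in order to produce them.
    values <- c(150323, 150324, 150647, 150972, 151592, 151620)
    diff(values)        # 1 323 325 620 28
    300000 / 1 + 1      # positions needed to resolve a step of 1: 300,001
    300000 / 500        # dollars per pixel on a typical ~500-pixel slider: 600

Even ignoring the step of 1, the smallest remaining gap of 28 would still require more than 10,000 distinguishable slider positions.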
Of course, all of this could be checked if someone had access to the original Qualtrics account. Perhaps Harvard will investigate this paper too...

[[ End update 2023-07-02 12:38 UTC ]]

 

Acknowledgements

Thanks to James Heathers for useful discussions, and to Economist 1268 at econjobrumors.com for suggesting that this article might be worth looking at.



28 June 2023

A coda to the Wansink story

The investigation of scientific misconduct by Ivy League universities is once again in the news at the moment, which prompts me to write up something that I should have written up quite a while ago. (The time I spend thinking about, and trying to help people understand, the Russian invasion of Ukraine has made as big a dent in my productivity as Covid-19.)

On October 31, 2018, I sent an open letter, signed by me and 50 colleagues, to Cornell. In it, I asked that they release the full text of the report of their inquiry into the misconduct of Professor Brian Wansink. On November 5, 2018, I received a reply from Michael Kotlikoff, the Provost of Cornell. He explained why the full text of the report was not being released (an explanation that did not impress Ivan Oransky at Retraction Watch), and added the following:

Cornell is now conducting a Phase II investigation to determine the degree to which any acts of research misconduct may have affected federally (NIH and USDA) funded research projects. ... As part of Phase II of the university’s investigation, Cornell has required Professor Wansink to collect and submit research data and records for all of his publications since 2005, when he came to the university, so that those records may be examined. We will provide a summary of this Phase II investigation at its conclusion. [emphasis added]

The Wansink story faded into the background after that, but a few months ago a small lightbulb fizzled into life above my head and I decided to find out what happened to that Phase II report. So I wrote to Provost Kotlikoff. He has kindly given me permission to quote his response verbatim:

Following my November 5 letter we indeed conducted a comprehensive Phase II analysis, but this was restricted to those scientific papers from Professor Wansink’s group that identified, or could be linked to, support from federal funds. This analysis, which was conducted on a subset of papers and followed federal guidelines, was reported to the NIH and to the USDA (the relevant funding organizations), and accepted by them. I should point out that this Phase II analysis did not include many of the papers identified by you and others as failing to meet scientific norms, as those were not associated with federal support, and therefore was not a comprehensive summary of the scientific issues surrounding Professor Wansink’s work.

I am sorry to say that Cornell does not release scientific misconduct reports provided to the NIH and the USDA. However, I believe that Cornell has appropriately addressed the scientific concerns that were identified by you and others (for which I thank you), and considers this matter closed.

So that seems to be it. We are apparently not going to see a summary of the Phase II investigation. Perhaps it was Cornell's initial intention to release this, but they were unable to do so for legal reasons. In any case, it's a little disappointing.


10 March 2023

Some interesting discoveries in a shared dataset: Néma et al. (2023).

In this post I'm going to be discussing this article, but mostly its dataset:

Néma, J., Zdara, J.,  Lašák, P., Bavlovič, J., Bureš, M., Pejchal, J., & Schvach, H. (2023). Impact of cold exposure on life satisfaction and physical composition of soldiers. BMJ Military Health. Advance online publication. https://doi.org/10.1136/military-2022-002237

The article itself doesn't need much commentary from me, since it has already been covered by Stuart Ritchie on Twitter here and in his iNews column here, as well as by Gideon Meyerowitz-Katz on Twitter here. So I will just cite or paraphrase some sentences from the Abstract:

[T]he aim of this study was to examine the effect of regular cold exposure on the psychological status and physical composition of healthy young soldiers in the Czech Army. A total of 49 (male and female) soldiers aged 19–30 years were randomly assigned to one of the two groups (intervention and control). The participants regularly underwent cold exposure for 8 weeks, in outdoor and indoor environments. Questionnaires were used to evaluate life satisfaction and anxiety, and an "InBody 770" device was used to measure body composition. Among other  results, systematic exposure to cold significantly lowered perceived anxiety (p=0.032). Cold water exposure can be recommended as an addition to routine military training regimens and is likely to reduce anxiety among soldiers.

The article PDF file contains a link to a repository in which the authors originally placed an Excel file  of their main dataset named "Dataset_ColdExposure_sorted_InBody.xlsx" (behind the link entitled "InBody - Body Composition"). I downloaded this file and explored it, and found some interesting things that complement the investigations of the article itself; these discoveries form the main part of this blog post.

Recently, however—probably in reaction to the authors being warned by Gideon or someone else that their dataset contained personally identifying information (PII)—this file has been replaced with one named "Datasets_InBody+WC_ColdExposure.csv". I will discuss the new file near the end of this post, but for now, the good news is that the file containing PII is no longer publicly available.

[[ Update 2023-03-11 23:23 UTC: Added new information here about the LSQ dataset, and—further along in this post—a paragraph about the analysis of these data. ]]

The repository also contains a data file called "Dataset_ColdExposure_LSQ.csv", which represents the participants' responses at two timepoints to the Life Satisfaction Questionnaire. I downloaded this file and attempted to match the participant data across the two datasets.

The structure of the main dataset file

The Excel file that I downloaded contains six worksheets. Four of these contain the data for the two conditions that were reported in the article (Cold and Control), one each at baseline and at the end of the treatment period. Within those worksheets, participants are split into male and female, and within each gender a sequence number starting at 1 identifies each participant. A fifth worksheet named "InBody Začátek" contains the baseline data for each participant, and a sixth, named "InBody1", appears to contain every data record for each participant, as well as some columns which, while mostly empty, appear designed to hold contact information for each person.

Every participant's name and date of birth is in the file (!)

The first and most important problem with the file as it was uploaded (and as it remained available until a couple of days ago) is that a lot of PII was left in it. Specifically, the file contained the first and last names and date of birth of every participant. This study was carried out in the Czech Republic, and I am not familiar with the details of research ethics in that country, but it seems to me pretty clear that it is not acceptable to conduct before-and-after physiological measurements on people and then publish those numbers along with information that in most cases probably identifies them uniquely among the population of their country.

I have modified the dataset file to remove this PII before I share it. Specifically, I did this:

  1. Replaced the names of participants with random fake names assembled from lists of popular English-language first and last names. I use these names below where I need to identify a particular participant's data.
  2. Replaced the date of birth with a fake date consisting of the same year, but a random month and day. As a result of this, the "Age" column, which appears to have been each participant's age at their last birthday before they gave their baseline data, may no longer match the reported (fake) date of birth.

What actually happened in the study?

The Abstract states (see above) that 49 participants were in two conditions: exposure to Cold (Chlad, in Czech) and a no-treatment Control group (Kontrolní). But in the dataset there are 99 baseline measurement records, and the participants are recorded as being in four conditions. As well as Cold and Control, there is a condition called Mindfulness (the English word is used), and another called Spánek, which means Sleep in Czech.

This is concerning because these additional participants and conditions are not mentioned in the article. The Method section states that "A total of 49 soldiers (15 women and 34 men) participated in the study and were randomly divided into two groups (control and intervention) before the start of the experiment". If the extra participants and conditions were part of the same study, this should have been reported; the above sentence, as written, seems to be stretching the idea of innocent omission quite a bit. Omitting conditions and participants is a powerful "researcher degree of freedom" in the sense of Simmons et al.'s classic paper entitled "False-Positive Psychology". If these participants and conditions were not part of the same study, then something very strange is happening, as it would imply that at least two studies were being conducted with the same participant ID sequence number assignment and reported in the same data file.

Data seem to have been collected in two principal waves, January (leden) and March (březen) 2022. It is not clear why this was done. A few tests seem to have been performed in late 2021 or in February, April, or May 2022, but whatever the date, all participants were assigned to one of the two month groups. For reasons that are not clear, one participant whose baseline data were collected in December 2021 ("Martin Byrne") was assigned to the March group, although his data did not end up in the final group worksheets that formed the basis of the published article. Meanwhile, six participants ("George Fletcher", "Harold Gregory", "Harvey Barton", "Nicole Armstrong", "Christopher Bishop", and "Graham Foster") were assigned to the January group even though their data were collected in March 2022 or later; five of these (all except for "Nicole Armstrong") did end up in the final group worksheets. It seems that the grouping into "January" and "March" did not affect the final analyses, but it does make me wonder what the authors had in mind in creating these groups and then assigning people to them without apparently respecting the exact dates on which the data were collected. Again, it seems that plenty of researcher degrees of freedom were available.

How were participants filtered out?

There are 99 records in the baseline worksheet. These are in conditions as follows: Chlad (Cold), 28 (18 “leden/January”, 10 “březen/March”); Kontrolní (Control), 41 (33 “leden/January”, 8 “březen/March”); Mindfulness, 11 (all “leden/January”); Spánek (Sleep), 19 (11 “leden/January”, 8 “březen/March”).

Of the 28 participants in the Cold condition, three do not appear in the final Cold group worksheets that were used for the final analyses. Of the 41 in the Control condition, 17 do not appear in the final Control group worksheets. It is not clear what criteria were used to exclude these 3 (of 28) or 17 (of 41) people. The three in the Cold condition were all aged over 30, which corresponds to the cutoff age reported in the article, but this does rather suggest that the cutoff might have been decided post hoc. Of the 17 people in the Control condition who did not make it into the final analyses, 11 were aged over 30, but six were not, so it is even more unclear why they were excluded.

Despite the claim by the authors that participants were aged 19–30, four people in the final Cold condition worksheets ("Anthony Day", "Hunter Dunn", "Eric Collins", and "Harvey Barton") are aged between 31 and 35.

Were the authors participants themselves?

At least two, and likely five, of the participants appear to be authors of the article. I base this observation on the fact that in five cases, a participant has the same last name and initial as an author. In two of those cases, an e-mail address is reported that appears to correspond to the institution of that author. For the other three, I contacted a Czech friend, who used this website to look up the frequencies of the names in question; he told me that the last names (with any initials) only correspond to 55, 10, and 3 people in the entire Czech Republic, out of a population of 10.5 million.

Now, perhaps all of these people—one of whom ended up in the final Control group—are also active-duty military personnel, but it still does not seem appropriate for a participant in a psychological study that involves self-reported measures of one's attitudes before and after an intervention to also be an author on the associated article and hence at least implicitly involved with the design of the study. This also calls into question the randomisation and allocation process, as it is unlikely a randomised trial could have been conducted appropriately if investigators were also participants. (The article itself gives no detail about the randomisation process.)

Some of the participants in the final sample are duplicates (!)


The authors claimed that their sample (which I will refer to as the "final sample", given the uncertainty over the number of people who actually participated in the study) consisted of 49 people, which the reader might reasonably assume means 49 unique individuals. Yet, there are some obvious duplicates in the worksheets that describe the Cold and Control groups:
  1. The participant to whom I have assigned the fake name "Stella Arnold" appears both in the Cold group with record ID #7 and in the Control group with record ID #6, both with Gender=F (there are separate sequences of ID numbers for male and female participants within each worksheet, with both sequences starting at 1, so the gender is needed to distinguish between them). The corresponding baseline measurements are to be found in rows 9 and 96 of the "InBody Začátek" (baseline measurements) worksheet.
  2. The participant to whom I have assigned the fake name "Harold Gregory" appears both in the Cold group with record ID #15 and in the Control group with record ID #12, both with Gender=M. The corresponding baseline measurements are to be found in rows 38 and 41 of the "InBody Začátek" worksheet.
  3. The participant to whom I have assigned the fake name "Stephanie Bird" appears twice in the Control groups with record IDs #4 and #7, both with Gender=F. The corresponding baseline measurements are to be found in rows 73 and 99 of the "InBody Začátek" worksheet.
  4. The participant to whom I have assigned the fake name "Joe Gill" appears twice in the Control groups with record IDs #4 and #7, both with Gender=M. The corresponding baseline measurements are to be found in rows 57 and 62 of the "InBody Začátek" worksheet.
In view of this, it seems difficult to be certain about the actual sample size of the final (two-condition) study, as reported in the article.

Many other participants were assigned to more than one of the four conditions

36 of the 99 records in the baseline worksheet have duplicated names. Put another way, 18 people appear to have been enrolled in the overall (four-condition) study in two different conditions. Of these, five were in the Control condition in both time periods ("leden/January" and "březen/March"); four were in the Control condition once and a non-Control condition once; and nine were recorded as being in two non-Control conditions. In 17 cases the two conditions were labelled with different time periods, but in one case ("Harold Gregory"), both conditions (Cold/Control) were labelled "leden/January". This participant was one of the two who appeared in both final conditions (see previous paragraph); he is also one of the six participants assigned to a "January" group with data that were actually collected in March 2022.
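For anyone working from the censored files described below, a check along the following lines will surface these duplicated names. The sheet name is the one used in the original file, but the column names (Name, Group, Wave) are my own guesses at sensible labels rather than the exact headers.

    # Sketch of the duplicate-name check on the baseline worksheet.
    library(readxl)
    baseline <- read_excel("Public_analysis_dataset.xls", sheet = "InBody Začátek")
    dup_names <- unique(baseline$Name[duplicated(baseline$Name)])
    subset(baseline, Name %in% dup_names)[, c("Name", "Group", "Wave")]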

Continuing on this point, two records in the worksheet ("InBody1") that contains the record of all tests, both baseline and subsequent, appear to refer to the same person, as the dates of birth are the same (although the participant ID numbers are different) and the original Czech names differ only in the addition or omission of one character; if the names were English, this might be something like "John Davis" and "John Davies". The fake names of these two records in the dataset that I am sharing are "Anthony Day" and "Arthur Burton", with "Anthony Day" appearing in the final worksheets and being 31 years old, as mentioned above. The height and other physiological data for these two records, dated two months apart, are similar but not identical.

Inconsistencies across the datasets

The LSQ data contains records for 49 people, with 25 in the Cold group and 24 in the Control group, which matches the main dataset. However, there are some serious inconsistencies between the two datasets.

First, the gender split of the Control group is not the same between the datasets. In the main dataset, and in the article, there were 17 men and 7 women in this group. However, in the LSQ dataset, there are 14 men and 10 women in the Control group.

Second, the ages that are reported for the participants in the LSQ data do not match the ages in the main dataset. There is not enough information in the LSQ dataset—which has its own participant ID numbering scheme—to reliably connect individual participants across the two, but both datasets report the age of the participants and so as a minimum it should be possible to find correspondences at that level. However, this is not the case. Leaving aside the three participants who differ on gender (which I chose to do by assuming that the main database was correct, since its gender split matches the article), there are 11 other entries in the LSQ dataset where I was unable to find a corresponding match on age in the main dataset. Of those 11, three differ by just one year, which could perhaps just be explained by the participant having a birthday between two data collection timepoints, but for the other eight, the difference is at least 3 years, no matter how one arranges the records.
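A minimal sketch of the kind of cross-tabulation I used is below; the file names are those of the repository file and my shared analysis file, and the column names (Group, Age) are assumed labels rather than the exact headers.

    # Compare the age distributions per group across the two datasets.
    library(readxl)
    main <- read_excel("Public_analysis_dataset.xls")
    lsq  <- read.csv("Dataset_ColdExposure_LSQ.csv")
    table(main$Group, main$Age)   # ages by group in the main (InBody) dataset
    table(lsq$Group,  lsq$Age)    # ages by group in the LSQ dataset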

In summary, the LSQ dataset is inconsistent with the main dataset on 14 out of 49 (28%) of its records.

Other curiosities


Several participants, including two in the final Control condition, have decimal commas instead of decimal points for their non-integer values. There are also several instances in the original datafile where cells in analysed columns have numbers recorded as text. It is not clear how these mixed formats could have been either generated or (conveniently) analysed by software.

Participants have a “Date of Registration”. It is not clear what this means. In 57 out of 99 cases, this date is the same as the date of the tests, which might suggest that this is the date when the participant joined the study, but some of the dates go back as far as 2009.

Data were sometimes collected on more than two occasions per participant (or on more than four occasions for some participants who were, somehow, assigned to two conditions). For example, data were collected four times from "Orlando Goodwin" between 2022-03-07 and 2022-05-21, while he was in the Cold condition (to add to the two times when data were collected from him between 2022-01-13 and 2022-02-15, when he was in the Sleep condition). However, the observed values in the record for this participant in the "Cold_Group_After" worksheet suggest that the last of these four measurements to be used was the third, on 2022-05-05. The purpose of the second and fourth measurements of this person in the Cold condition is thus unclear, but again it seems that this practice could lead to abundant researcher degrees of freedom.

There are three different formats for the participant ID field in the worksheet that contains all the measurement records. In the majority of the cases, the ID seems to be the date of registration in YYMMDD format, followed by a one- or two-digit sequence number, for example "220115-4" for the fourth participant registered on January 15, 2022. In some cases the ID is the letters "lb" followed by what appears to be a timestamp in YYMMDDHHMMSS format, such as "lb151210070505". Finally, one participant (fake name "Keith Gordon") has the ID "d14". This degree of inconsistency does not convey an atmosphere of rigour.

The new data file

I downloaded the original (XLSX format) data file on 2023-03-05 (March 5) at 20:52 UTC. That file (or at least, the link to it) was still there on 2023-03-07 (March 7) at 11:59 UTC. When I checked on 2023-03-08 (March 8) at 17:26 UTC the link was dead, implying that the file with PII had been removed at some point in the intervening 30 hours. At some point after that, a new dataset file was uploaded to the same location, which I downloaded on 2023-03-09 (March 9) at 14:27 UTC. This new file, in CSV format, is greatly simplified compared with the original. Specifically:

  1. The data for the two conditions (Cold/Control) and the two timepoints (baseline/end) have been combined into one sheet in place of four.
  2. The worksheets with the PII and other experimental conditions have been removed.
  3. Most of the data fields have been removed; I assume that the remaining fields are sufficient to reproduce the analyses from the paper, but I haven't checked that as it isn't my purpose here.
  4. Two data fields have been added. One of these, named "SMM(%)", appears to be calculated as the fraction of the participant's weight that is accounted for by their skeletal muscle mass, both of which were present in the initial dataset. However, the other, named "WC (waist circumference)", appears to be new, as I cannot find it anywhere in the initial dataset. This might make one wonder what other variables were collected but not reported.
Apart from these changes, however, the data concerning the final conditions (49 participants, Cold and Control) are identical to the first dataset file. That is, the four duplicate participants described above are still in there; it's just harder to spot them now without the baseline record worksheet to tie the conditions together.

Data availability


I have made two censored versions of the dataset available here. One of these ("Simply_anonymized_dataset.xlsx") has been made very quickly from the original dataset by simply deleting the participants' names, dates of birth, and (where present) e-mail addresses. The other ("Public_analysis_dataset.xls") has been cleaned up from the original in several ways, and includes the fake names and dates of birth discussed above. This file is probably easier to follow if you want to reproduce my analyses. I believe that in both cases I have taken sufficient steps to make it impractical to identify any of the participants from the remaining information.
At the same location I have also placed another file ("Compare_datasets.xls") in which I compare the data from the initial and new dataset files, and demonstrate that where the same fields are present, their values are identical.

If anyone wants to check my work against the original, untouched dataset file, which includes the PII of the participants, then please contact me and we can discuss it. There's no obvious reason why I should be entitled to see this PII and another suitably qualified researcher should not, but of course it would not be a good idea to share it for all to see.

Acknowledgements

My thanks go to:

  • Gideon Meyerowitz-Katz (@GidMK) for interesting discussions and contributing a couple of the points in this post, including making the all-important discovery of the PII (and writing to the authors to get them to take it down).
  • @matejcik for looking up the frequencies of Czech names.
  • Stuart Ritchie for tweeting skeptically about the hyping of the results of the study.



22 September 2022

Further apparent (self-)plagiarism in the work of Dr Paul McCrory

In an earlier post I reported on a number of apparently plagiarised or self-plagiarised articles by Dr Paul McCrory. Since then there have been a number of developments in this story, which has attracted attention from media worldwide, especially in Australia, led by Melissa Davey of the Guardian, who has written about the present blog post here.

In this post I present 10 more examples of apparent text recycling by the same author. These are admittedly quite similar in style and content to the first set, but I felt that having done the work to identify these supplementary issues it was worthwhile documenting them. I feel that they demonstrate the scale of the problem: Dr McCrory has been churning out very similar stories (mostly about concussion in various sports) for 20 years, while—as far as I have been able to establish—performing very little original empirical or other research in that time.

Note that Exhibits 8, 9, and 10 include the apparent recycling of text from work that does not have Dr McCrory listed as an author (i.e., apparent plagiarism-proper rather than self-plagiarism). I omitted the case of a chapter in a book (B) that contained many paragraphs of recycled text from a chapter by Dr McCrory in an earlier book (A), because in book B he is only listed as a contributor at the start of the book (i.e., his name does not appear directly on the chapter in question).

There are probably still a few more cases to be uncovered, but it can be laborious work, especially when the only source of a document is the Google Books preview.

Finally, as a "bonus", and to show just how big a problem plagiarism is in science and academia more generally, I've included a case where Dr McCrory's work was apparently plagiarised by other authors.


Exhibit 1

McCrory, P. (2001). Does second impact syndrome exist? Clinical Journal of Sport Medicine, 11(3), 144–149. https://doi.org/10.1097/00042752-200107000-00004

About 60% of this article appears to have been copied, verbatim and without appropriate attribution, from:

  • McCrory, P. R., & Berkovic, S. F. (1998). Second impact syndrome. Neurology, 50(3), 677–683. https://doi.org/10.1212/WNL.50.3.677

The copied text is highlighted in blue here:

Exhibit 2

McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Rehabilitation of acute head and facial injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 95–100). Wiley. 

About 70% of this chapter appears to have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: McCrory, P., & Rise, I. R. (2004). Head and face. In R. Bahr & S. Maehlum (Eds.), Clinical guide to sports injuries (pp. 55–90). Human Kinetics. https://books.google.com/books?id=mmRnr0x0p4QC
  • Blue: McCrory, P. (2007). Who should retire after repeated concussions? In D. MacAuley & T. M. Best (Eds.), Evidence-based sports medicine (2nd ed., pp. 93–107). Blackwell.
  • Green: McCrory, P., Meeuwisse, W., Johnston, K., Dvorak, J., Aubry, M., Molloy, M., & Cantu, R. (2009). Consensus statement on concussion in sport: The 3rd international conference on concussion in sport held in Zurich, November 2008. Journal of Athletic Training, 44(4), 434–448. https://doi.org/10.4085/1062-6050-44.4.434

Exhibit 3

McCrory, P. (2007). Who should retire after repeated concussions? In D. MacAuley & T. M. Best (Eds.), Evidence-based sports medicine (2nd ed., pp. 93–107). Blackwell. 

About 35% of this chapter (which appeared as a source text in the previous exhibit) appears to have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: McCrory, P. (2002). Treatment of recurrent concussion. Current Sports Medicine Reports, 1(1), 28–32. https://doi.org/10.1249/00149619-200202000-00006
  • Blue: McCrory, P. (2001). When to retire after concussion? British Journal of Sports Medicine, 35(6), 380–382.
  • Pink: McCrory, P. (2002). Boxing and the brain. British Journal of Sports Medicine, 36(1), 2.
  • Green: McCrory, P., Johnston, K., Meeuwisse, W., Aubry, M., Cantu, R., Dvorak, J., Graf-Baumann, T., Kelly, J., Lovell, M., & Schamasch, P. (2005). Summary and agreement statement of the 2nd International Conference on Concussion in Sport, Prague 2004. British Journal of Sports Medicine, 39(4), 196–204. https://doi.org/10.1136/bjsm.2005.018614

Exhibit 4

McCrory, P. (2002). Treatment of recurrent concussion. Current Sports Medicine Reports, 1(1), 28–32. https://doi.org/10.1249/00149619-200202000-00006

About 40% of this article (which appeared as a source text in the previous exhibit) appears to have been copied, verbatim and without appropriate attribution, from the following sources (which also appeared as source texts in the previous exhibit):

  • Yellow: McCrory, P. (2001). When to retire after concussion? British Journal of Sports Medicine, 35(6), 380–382.
  • Blue: McCrory, P. (2002). Boxing and the brain. British Journal of Sports Medicine, 36(1), 2.
  • Pink: McCrory, P., Johnston, K. M., Mohtadi, N. G., & Meeuwisse, W. (2001). Evidence-based review of sport-related concussion: Basic science. Clinical Journal of Sport Medicine, 11(3), 160–165. https://doi.org/10.1097/00042752-200107000-00006

Exhibit 5

McCrory, P. (2001). The “piriformis syndrome”—myth or reality? British Journal of Sports Medicine, 35(4), 209–211. https://doi.org/10.1136/bjsm.35.4.209-a

About 90% of this editorial appears to have been copied, verbatim and without appropriate attribution, from:

  • McCrory, P., & Bell, S. (1999). Nerve entrapment syndromes as a cause of pain in the hip, groin and buttock. Sports Medicine, 27(4), 261–274. https://doi.org/10.2165/00007256-199927040-00005

Exhibit 6

McCrory, P. (2002). What advice should we give to athletes postconcussion? British Journal of Sports Medicine, 36(5), 316–318. https://doi.org/10.1136/bjsm.36.5.316

About 50% of this article appears to have been copied, verbatim and without appropriate attribution, from:

  • McCrory, P. (1997). Were you knocked out? A team physician's approach to initial concussion management. Medicine & Science in Sports & Exercise, 29(7 suppl.), S207–212. https://doi.org/10.1097/00005768-199707001-00002

Exhibit 7

McCrory, P. (2005). Head injuries in sport. In G. Whyte, M. Harries, & C. Williams (Eds.), ABC of sports and exercise medicine (3rd ed., pp. 8–15). Blackwell.

About 30% of the main text of this chapter (which is not the same as the 2015 chapter "Head injuries in sports", which was discussed in my earlier blog post) appears to have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: McCrory, P. (1997). Were you knocked out? A team physician's approach to initial concussion management. Medicine & Science in Sports & Exercise, 29(7 suppl.), S207–212. https://doi.org/10.1097/00005768-199707001-00002
  • Pink: Aubry, M., Cantu, R., Dvorak, J., Graf-Baumann, T., Johnston, K., Kelly, J., Lovell, M., McCrory, P., Meeuwisse, W., & Schamasch, P. (2002). Summary and agreement statement of the first International Conference on Concussion in Sport, Vienna 2001. British Journal of Sports Medicine, 36(1), 6–10. https://doi.org/10.1136/bjsm.36.1.6

Exhibit 8

McCrory, P., Davis, G., & Makdissi, M. (2012). Second impact syndrome or cerebral swelling after sporting head injury. Current Sports Medicine Reports, 11(1), 21–23. https://doi.org/10.1249/JSR.0b013e3182423bfd

About 35% of this article appears to have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: McCrory, P. (2001). Does second impact syndrome exist? Clinical Journal of Sport Medicine, 11(3), 144–149. https://doi.org/10.1097/00042752-200107000-00004
  • Blue: Randolph, C., & Kirkwood, M. W. (2009). What are the real risks of sport-related concussion, and are they modifiable? Journal of the International Neuropsychological Society, 15(4), 512–520. https://doi.org/10.1017/S135561770909064X
  • Green: Davis, G. A. (2012). Neurological outcomes. In M. W. Kirkwood & K. O. Yeates (Eds.), Mild traumatic brain injury in children and adolescents: From basic science to clinical management (pp. 99–122). Guilford Press.

Exhibit 9

McCrory, P. (2018). Concussion revisited: A historical perspective. In I. Gagnon & A. Ptito (Eds.), Sports concussions: A complete guide to recovery and management (pp. 9–24). CRC Press.

About 15% of this chapter appears to have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: McCrory, P., Feddermann-Demont, N., Dvořák, J., Cassidy, J. D., McIntosh, A., Vos, P. E., Echemendia, R. J., Meeuwisse, W., & Tarnutzer, A. A. (2017). What is the definition of sports-related concussion: A systematic review. British Journal of Sports Medicine, 51(11), 877–887. https://doi.org/10.1136/bjsports-2016-097393
  • Blue: Zezima, K. (2014, May 29). How Teddy Roosevelt helped save football. The Washington Post. https://www.washingtonpost.com/news/the-fix/wp/2014/05/29/teddy-roosevelt-helped-save-football-with-a-white-house-meeting-in-1905/
  • Pink: Johnston, K. M., McCrory, P., Mohtadi, N. G., & Meeuwisse, W. (2001). Evidence-based review of sport-related concussion: Clinical science. Clinical Journal of Sport Medicine, 11(3), 150–159. https://doi.org/10.1097/00042752-200107000-00005

Exhibit 10

McCrory, P., Bell, S., & Bradshaw, C. (2002). Nerve entrapments of the lower leg, ankle and foot in sport. Sports Medicine, 32(6), 371–391. https://doi.org/10.2165/00007256-200232060-00003

About 20% of the text of this article appears to have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: McCrory, P. (2000). Exercise-related leg pain: Neurological perspective. Medicine & Science in Sports & Exercise, 32(3 suppl.), S11–14. https://doi.org/10.1097/00005768-200003001-00003
  • Green: Pecina, M. M., Markiewitz, A. D., & Krmpotic-Nemanic, J. (2001). Tunnel syndromes (3rd ed.). See Chapter 44, p. 229.

Bonus Exhibit: The biter bit?

Espinosa, N., Jr. & Klammer, G. (2018). Peripheral nerve entrapment around the foot and ankle. In M. D. Miller & S. R. Thompson (Eds.), DeLee & Drez's orthopaedic sports medicine (5th ed., pp. 1402–1420). Elsevier Health Sciences.

The highlighted sentences of this chapter appear to have been copied, verbatim and without appropriate attribution, from Exhibit 10:

  • McCrory, P., Bell, S., & Bradshaw, C. (2002). Nerve entrapments of the lower leg, ankle and foot in sport. Sports Medicine, 32(6), 371–391. https://doi.org/10.2165/00007256-200232060-00003

(This book was first published in 2003 and, as far as I have been able to establish, the chapter by Espinosa and Klammer was first added in the 4th edition in 2014. If by some chance I have got the order of recycling wrong, then I humbly apologise to Drs. Espinosa and Klammer and will issue a correction.)

Data availability

All of the supporting files for this post can be found here. I imagine that this involves quite a few copyright violations of my own, in that many of the source documents are not open access. I hope that the publishers will forgive me for this, but if I receive a legal request to take down any specific file I will, of course, comply with that.

(The preceding paragraph has been copied verbatim from my first blog post on the McCrory matter. Ironic, I know.)

Acknowledgements

Big thanks to Sean Rife and James Heathers for letting me use their TAPAS tool to compare documents.


09 August 2022

An interesting lack of randomness in a published dataset: Scott and Dixson (2016)

Martin Enserink has just published the third instalment in an ongoing story of strange results and possible → likely → confirmed misconduct in the field of marine biology, and more specifically the purported effects of climate change on the behaviour of fish. The first two instalments are here (2020) and here (2021).

After Martin's 2021 article, I wrote this blog post describing a few analyses that I had contributed to this investigation. Today I want to look at a recently-corrected article from the same lab, mentioned by Martin in his latest piece (see the section entitled "A corrected paper"), and in particular at the data file that was released as part of the correction, as I think that it illustrates an interesting point about the forensic investigation of data.

Here is the article:

Scott, A., & Dixson, D. L. (2016). Reef fishes can recognize bleached habitat during settlement: Sea anemone bleaching alters anemonefish host selection. Proceedings of the Royal Society B, 283, 20152694. https://doi.org/10.1098/rspb.2015.2694

A correction notice was issued for this article on July 8, 2022, and that correction was accompanied by a data file, which can be downloaded from here.

I suggest that you read Martin's articles to get an idea of the types of experiments being conducted here, as the Scott and Dixson article is typical of many coming from the same lab. Basically, 20 (usually) fish were each tested 24 times to see if they "preferred" (i.e., chose to swim in) one of two streams of water (known as "flumes"), A and B, and then this set of 24 trials was repeated a second time, so each A/B pair of flumes was tested by 20 × 24 × 2 = 960 trials. In some cases the fish would be expected to have no preference between the flumes, and in others they should have shown a preference for water type A over B (for example, if B contained the odour of a predator, or some other chemical suggesting an unfavourable environment). The fact that in many cases the fish preferred water B, either when they were expected to have no preference or (even worse) when they were expected to prefer water A, was taken by the authors as an indication that something had gone wrong in the fish's ability to make adaptive choices in their environment.

Here are the two main issues that I see with this (claimed) dataset.


This isn't what a dataset looks like

As I noted in my earlier post, this just isn't what a dataset looks like. You don't collect data in multiple 2-D panels and lay those out in a further 2-D arrangement of 18 x 5 panels, because that's just making life difficult for yourself when you come to do the analyses. If for some reason you actually collected your data like that, you would need to write some quite sophisticated software—probably a couple of hundred lines of Python or R code—to reliably read the data in a form that is ready to generate the figures and/or tables that you would need for your article. That code would have to be able to cope with the inevitable errors that sneak into files when you are collecting data, such as occasional offsets, or a different number of fish in each chunk (the chunks on lines 78 through 94 only have 17 fish rather than 20; incidentally, the article says that each experiment was run on 18 to 20 fish), or an impossible value such as we see at cell DU46. (The code that I wrote to read the dataset from the Dixson et al. article that was the subject of my earlier blog post is around 300 lines long, including reasonable spacing.) 
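
To give a concrete idea of what "reading" such a layout involves, here is a bare-bones sketch in R. The panel dimensions, the gaps between panels, and the overall sheet layout in this sketch are hypothetical placeholders rather than the actual structure of the Scott and Dixson file, and it deliberately skips all of the error handling just described:

```r
# Sketch only: read a sheet that is laid out as a grid of rectangular panels.
# Every dimension and offset here is an assumption made purely for illustration.
library(readxl)

read_panels <- function(path, sheet = 1,
                        panel_rows = 20,          # fish per panel (assumed)
                        panel_cols = 24,          # data columns per panel (assumed)
                        row_gap = 2, col_gap = 1, # blank rows/columns between panels (assumed)
                        grid_rows = 18, grid_cols = 5) {
  raw <- as.data.frame(read_excel(path, sheet = sheet, col_names = FALSE))
  panels <- list()
  for (i in seq_len(grid_rows)) {
    for (j in seq_len(grid_cols)) {
      top  <- (i - 1) * (panel_rows + row_gap) + 1
      left <- (j - 1) * (panel_cols + col_gap) + 1
      block <- raw[top:(top + panel_rows - 1), left:(left + panel_cols - 1)]
      # Coerce to numeric; anything that isn't a number (labels, stray text) becomes NA.
      block <- suppressWarnings(as.data.frame(lapply(block, as.numeric)))
      panels[[length(panels) + 1]] <- block
    }
  }
  panels  # a list of 18 × 5 = 90 data frames, one per panel
}
```

Even this toy version assumes that every panel sits exactly where the arithmetic says it should; coping with 17-fish chunks, offsets, and impossible values would add considerably to it.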

So there would seem to be two possibilities. Either the authors have some code that reads this file and reliably extracts the data in a form suitable for running the analyses; or, they have another data file which is more suited to reading directly into SPSS or R without having to strip away all of the formatting that makes the Excel sheet relatively visually appealing. Either way, they can surely provide one or other of those to us so that we can see how they dealt with the problems that I listed above. (I will leave it up to the reader to decide if there are any other possibilities.)


There is too little variation... in the unremarkable results

In my earlier blog post on this topic I analysed another dataset from the same lab (Dixson et al., 2014) in which there were numerous duplications, whereby the sequence of the numbers of choices of one or other flume for the 20 fish in one experiment was often very similar to the sequence in another experiment, when there was no reason for that to be the case.

In the current dataset there are a few sets of repeated numbers of this kind (see image), but I don't think that they are necessarily a problem by themselves, for a couple of reasons.


Were these lines (in green) copied, or are the similarities caused by the limited range of the data? My hunch is that it's the latter, but it isn't really important.


First, these lines only represent sequences of a few identical numbers at a time, whereas in the 2014 dataset there were often entire duplicated groups of 20 fish.

Second, for most of these duplications, the range of the numbers is severely restricted because they (at least ostensibly) correspond to a large experimental effect. The Scott and Dixson article reports that in many cases, the fish chose flume A over flume B almost all of the time. This means that the numbers of observations of each fish in flume A, out of 24 opportunities, must almost always be 22 or 23 or 24, in order for the means to correspond to the figures in the article. There are only so many ways that such a small number of different predicted values can be distributed, and given that the person examining the dataset is free to look for matches across 180 20-fish (or 17-fish) columns of data, a number of duplicates of a certain length will very likely arise by chance.
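
To get a feel for how easily such partial matches can arise, here is a toy simulation in R. The 95% preference rate and the run length of four identical values are my own illustrative choices, not figures taken from the paper:

```r
# Toy simulation: how often do two columns of near-ceiling counts happen to
# share a run of several identical values, purely by chance?
set.seed(1)
n_cols  <- 180  # number of columns available for comparison
n_fish  <- 20
run_len <- 4    # consecutive identical values needed to count as a "match"

# Each simulated fish chooses flume A on ~95% of its 24 trials, so the counts
# are squeezed into a handful of values (mostly 22, 23 and 24).
cols <- replicate(n_cols, rbinom(n_fish, 24, 0.95))

# Do columns x and y agree on at least k consecutive fish?
shares_run <- function(x, y, k) {
  r <- rle(x == y)
  any(r$values & r$lengths >= k)
}

pairs <- combn(n_cols, 2)   # all 16,110 pairs of columns
sum(apply(pairs, 2, function(p) shares_run(cols[, p[1]], cols[, p[2]], run_len)))
```

With settings like these, a substantial fraction of the column pairs share a run of four identical values by chance alone, which is consistent with my point above that these particular repeats are not necessarily a problem by themselves.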

However, the dataset also contains a number of cases where the fish appeared to have no preference between the two. The mean number of times out of 24 trials that they were recorded as having chosen flume A (or flume B) in these cases was close to 12. And it turns out that in almost all of these cases, there is a different lack of variation, not in the sequence of the observations (i.e., the numbers observed from top to bottom of the 20 fish across experiments), but in the variability of the numbers within each group of fish.

If the fish genuinely don't have a preference between the two flumes, then each observation of a fish is basically a Bernoulli trial with a probability of success equal to 0.5 – which is a fancy way of saying "a coin toss" – and so the 24 trials for each fish represent 24 coin tosses. Now, when you toss a coin 24 times, the most likely result is 12 heads and 12 tails, corresponding to the fish being in flume A and B 12 times each. However, although this result is the most likely, it's not especially likely; in fact, the exact 12–12 split will only occur about 16% of the time, as you can see at this site (put 24 into "Number of Bernoulli trials", click Calculate, and the probability of each result will be in the table under the figure with the curve). If you repeat those 24 trials 100 times, you would expect to get exactly 8 As and 16 Bs about 4 or 5 times (and, likewise, exactly 8 Bs and 16 As about 4 or 5 times).
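
For readers who would rather check these numbers in R than on that website, the same binomial calculation looks like this:

```r
# Probability of each possible number of "flume A" choices out of 24 coin-toss
# trials with no preference (p = 0.5). p[1] corresponds to 0 choices, so
# p[k + 1] is the probability of exactly k choices.
p <- dbinom(0:24, size = 24, prob = 0.5)

p[12 + 1]  # exactly 12 out of 24: ~0.161, i.e. about 16% of the time
p[8 + 1]   # exactly  8 out of 24: ~0.044, i.e. about 4 or 5 times per 100 repeats
```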

Now let's look at the dataset. I identified 32 columns of data with 20 (or, in a few cases, 17) fish and a mean of around 12. I also included 3 other columns which had one or more values of 12; as I hope will become clear, this inclusion works in the authors' favour. I then calculated the standard deviation (SD) of the 20 (or 17) per-fish scores (each the number of choices of one flume across 24 trials) for each of these 35 columns of data.

Next, I generated one million random samples of 24 trials for 20 simulated fish and calculated the SD of each sample. For each of the 35 SDs taken from the dataset, I calculated the fraction of those million simulated SDs that were smaller than the dataset value. In other words, I calculated how likely it was that one would observe an SD as small as the one that appears in the dataset if the values in the dataset were indeed taken from 24 trials of 20 fish that had no preference between the flumes. Statistically-minded readers may recognise this as the p value for the null hypothesis that these data arose as the result of a natural process, as described by the authors of the Scott and Dixson paper.
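
A minimal sketch of this kind of simulation, not necessarily identical to the code behind the numbers reported below, might look like this in R:

```r
set.seed(2022)   # arbitrary seed, for reproducibility of the sketch only

n_fish   <- 20
n_trials <- 24
n_sims   <- 1e6

# SD of the 20 per-fish counts in each of one million simulated "no preference"
# samples; each count is a Binomial(24, 0.5) draw. (This takes a little while.)
sim_sds <- replicate(n_sims, sd(rbinom(n_fish, n_trials, 0.5)))

# The p value for an observed SD is the fraction of simulated SDs that are
# smaller. For example, for the SD of 0.7863 highlighted in the image below:
mean(sim_sds < 0.7863)
```

For comparison, the theoretical SD of a Binomial(24, 0.5) count is √6 ≈ 2.45, so a group of 20 fish whose counts have an SD below 0.8 is a very long way indeed from what the coin-toss model predicts.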

The results are not very good for the authors. For only nine of the samples, including the three that contain a small number of scores of 12 but otherwise have a substantially different mean, are the p values greater than 0.05. Seven of the p values are zero, meaning that an SD as low as the one corresponding to the data reported by the authors did not occur at all in one million simulated samples (see image below for an example). A further six p values are less than 0.0001 and four are less than 0.001. The overall chances of obtaining these results from a natural process are hard to calculate accurately (for example, one would need to make a small adjustment for the fact that the results come in pairs of 20-fish samples, as each fish took part in 2 sets of 24 trials and those two sets are not independent), but in any case I think they can safely be described as homeopathic, if only because of the seven cases of zero matches out of one million.


Remarkably consistent results. SD in yellow (0.7863), proportion of simulated data values that have a lower SD in green (0.000000).


Conclusion

Lack of expected variability is a recurring theme in the investigation of bad science. Uri Simonsohn was one of the pioneers of this in his paper "Just Post It", and more recently Kyle Sheldrick came up with a novel method of checking whether the sequence of values in a dataset is "too regular". I hope that my explanation of the issues that I see in the Scott and Dixson dataset is clear.

Martin Enserink's latest piece mentions that the University of Delaware is seeking the retraction of three papers with Danielle Dixson as an author. Apparently the Scott and Dixson (2016) article—which, remember, has already been corrected once—is among those three papers. If nobody identifies a catastrophic error in my analyses then I plan to write to the editors of the journal to bring this issue to their attention.


Data availability

I have made an annotated copy of the article PDF file available here, which I think constitutes fair use. As mentioned earlier, the dataset is available here.




07 March 2022

Some examples of apparent plagiarism and text recycling in the work of Dr Paul McCrory

Dr Paul McCrory of the Florey Institute of Neuroscience and Mental Health has been in the news in the past few days. The story started with the retraction of a single, apparently plagiarised editorial piece in the British Journal of Sports Medicine from 2005, but after I started digging and more problems came to light, he has now resigned as chair of the influential Concussion in Sport Group (CISG), as reported by The Guardian and The Athletic, among other outlets.

Since much of this story has only been covered in a series of separate threads on Twitter up to now, I thought I would take some time to document in one place the full extent of what I have found about Dr McCrory's extensive recycling of his own and others' writing.

The first five exhibits are already in the public domain, but I will include them here for completeness. If you have been following the story on Twitter up to now, you can skip straight to Exhibit 6.


Exhibit 1

McCrory, P. (2005). The time lords. British Journal of Sports Medicine, 39(11), 785–786.

About 50% of this article has been copied, verbatim and without appropriate attribution, from this 2000 article in Physics Today by Steve Haake, who first discovered Dr McCrory's plagiarism and brought it to the attention of the current editor-in-chief of the British Journal of Sports Medicine. The copied text is highlighted in pink here:


The editorial has now been retracted. This was reported by Retraction Watch on February 28, 2022. At that point I started looking into other articles by the same author.


Exhibit 2

McCrory, P. (2005). Definitions for the purist. British Journal of Sports Medicine, 39(11), 786.

About 70% of this article has been copied, verbatim and without appropriate attribution, from this website. A copy of that page, archived on May 22, 2003 (that is, two years before Dr McCrory's article was published) can be found here. The copied text is highlighted in yellow here:

I tweeted about this article on March 1, 2022. Retraction Watch picked up on that and later reported that the author had asked for the article to be retracted, giving an explanation that I found less than impressive.


Exhibit 3

McCrory, P. (2006). Take nothing but pictures, leave nothing but footprints…? British Journal of Sports Medicine, 40(7), 565. https://doi.org/10.1136/bjsm.2006.029231

Nearly 80% of the words in this article have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: This website. A copy of that page, archived on December 7, 2003 (that is, more than two years before Dr McCrory's article was published) can be found here.
  • Pink: This article from New Scientist, dated April 16, 2005.
  • Blue: This website, dated March 2006 (several months before Dr McCrory's article was published). An archived copy from May 2, 2006 can be found here.
  • Green: This website, dated November 2005.
  • Grey: This website. An archived copy from September 6, 2003 can be found here.


As with Exhibit 2, I tweeted about this on March 1, 2022. The author came up with a quite remarkable story for Retraction Watch about why this article only merited a correction. I found that even less impressive than his excuses in the previous case.


Exhibit 4

McCrory, P. (2002). Commotio cordis. British Journal of Sports Medicine, 36(4), 236–237.

About 90% of the words in this article have been copied, verbatim and without appropriate attribution, from the following sources:

  • Yellow: Curfman, G. D. (1998). Fatal impact — Concussion of the heart. New England Journal of Medicine, 338(25), 1841–1843. https://doi.org/10.1056/NEJM199806183382511
  • Blue: Nesbitt, A. D., Cooper, P. J., & Kohl, P. (2001). Rediscovering commotio cordis. The Lancet, 357(9263), 1195–1197. https://doi.org/10.1016/S0140-6736(00)04338-5

James Heathers discovered a couple of these overlaps on March 3, 2022 and I tweeted the full picture on March 4, 2022.

Exhibit 5

McCrory, P. (2005). A cause for concern? British Journal of Sports Medicine, 39(5), 249.

Almost half of the words in this article have been copied, verbatim and without appropriate attribution, from the following source:

  • Piazza, O., Sirén, A.-L., & Ehrenreich, H. (2004). Soccer, neurotrauma and amyotrophic lateral sclerosis: Is there a connection? Current Medical Research and Opinion, 20(4), 505–508. https://doi.org/10.1185/030079904125003296

 The copied text is highlighted in pink here:

I tweeted about this on March 4, 2022.

Exhibit 6

McCrory, P. (2002). Should we treat concussion pharmacologically? British Journal of Sports Medicine, 36(1), 3–5.

Almost 100% of the text has been copied, verbatim and without appropriate attribution, from:

  • McCrory, P. (2001). New treatments for concussion: The next millennium beckons. Clinical Journal of Sport Medicine, 11(3), 190–193.

That copied text is highlighted blue (light or dark) in the image below. The text in dark blue also overlaps with this MedLink article. Thus, either Dr McCrory plagiarised three paragraphs from MedLink in two separate articles, or MedLink plagiarised him. The MedLink article was initially published in 1997, but it has been updated since, so the direction of copying cannot be established with certainty unless I can find an archived copy from 2001. It may, however, be interesting that the "phase II safety and efficacity trial" mentioned (Dr McCrory's reference 22) has a date of 1997.


James Heathers discovered one of the overlaps in this text on March 3, 2022, but it took another couple of hours' work at my end to uncover the full extent of the text recycling and possible plagiarism in this article.


Exhibit 7

McCrory, P. (2006). How should we teach sports medicine? British Journal of Sports Medicine, 40(5), 377.

About 60% of the words in this article have been copied, verbatim and without appropriate attribution, from the following sources:

  • Pink: Fallon, K. E., & Trevitt, A. C. (2006). Optimising a curriculum for clinical haematology and biochemistry in sports medicine: A Delphi approach. British Journal of Sports Medicine, 40(2), 139–144. https://doi.org/10.1136/bjsm.2005.020602
  • Blue: Long, G., & Gibbon, W. W. (2000). Postgraduate medical education: Methodology. British Journal of Sports Medicine, 34(4), 235–245.
Note that the Fallon & Trevitt article was published in the same journal just three months before it was plagiarised.


Exhibit 8

McCrory, P. (2008). Neurologic problems in sport. In M. Schwellnus (Ed.), Olympic textbook of medicine in sport (pp. 412–428). Wiley.

About 25% of the words in this book chapter have been copied, verbatim and without appropriate attribution, from other sources. Of that 25%, about two-thirds is recycled from other publications by the same author, and the remainder is plagiarised from other authors, as follows:
  • Orange: McCrory, P. (2000). Headaches and exercise. Sports Medicine, 30(3), 221–229. https://doi.org/10.2165/00007256-200030030-00006
  • Green: McCrory, P. (2001). Headache in sport. British Journal of Sports Medicine, 35(5), 286–287.
  • Blue: McCrory, P. (2005). A cause for concern? British Journal of Sports Medicine, 39(5), 249. (See also Exhibit 5.)
  • Yellow: Showalter, W., Esekogwu, V., Newton, K. I., & Henderson, S. O. (1997). Vertebral artery dissection. Academic Emergency Medicine, 4(10), 991–995. https://doi.org/10.1111/j.1553-2712.1997.tb03666.x
  • Pink: This MedLink article, which was initially published in 1996, but has been updated since, so the direction of copying cannot be established with certainty unless I can find an archived copy from 2008. It may, however, be interesting that the citations in the pink text (Kaku & Lowenstein 1990; Brust & Richter 1977) both (a) predate the MedLink article and (b) are not — or no longer — referenced at the equivalent points in the MedLink text. It would seem unlikely that MedLink would (a) plagiarise Dr McCrory's article from 2008 at some point after that date and (b) remove these rather old citations (without replacing them with new ones).
(Don't bother squinting too hard at the page; the annotated PDF is available for you to inspect. See link at the end of this post.)

Exhibit 9

McCrory, P., & Turner, M. (2015). Concussion – Onfield and sideline evaluation. In D. McDonagh & D. Zideman (Eds.), The IOC manual of emergency sports medicine (pp. 93–105). Wiley.

About 50% of the words in this book chapter have been copied, verbatim and without appropriate attribution, from other sources, as follows:

  • Blue: McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Head injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 58–94). Wiley.
  • Yellow: McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Rehabilitation of acute head and facial injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 95–100). Wiley.
  • Green: Aubry, M., Cantu, R., Dvorak, J., Graf-Baumann, T., Johnston, K., Kelly, J., Lovell, M., McCrory, P., Meeuwisse, W., & Schamasch, P. (2002). Summary and agreement statement of the first International Conference on Concussion in Sport, Vienna 2001. British Journal of Sports Medicine, 36(1), 6–10. https://doi.org/10.1136/bjsm.36.1.6
  • Pink: McCrory, P. (2015). Head injuries in sports. In M. N. Doral & J. Karlsson (Eds.), Sports injuries (pp. 2935–2951). Springer.

The pink text also appears in Exhibit 10, which was published in the same year, so it's not clear which is the original and which is the copy. I tweeted about some of the similarities between Exhibits 9 and 10 here, although I hadn't found everything at that point.

The green text in the final paragraph on page 105 appears to have been copied and pasted twice (it appears in two paragraphs on page 104), which might cause the reader to wonder exactly how much care and attention went into this copy-and-paste job.

Readers who are interested in the activities of the CISG might be interested to note that the 2001 Vienna conference (the "green" text reference above) was where the name of this group was first coined.

(Note that five pages, corresponding to the photographic reproduction of the Sport Concussion Assessment Tool and the Pocket Concussion Recognition Tool, have been omitted from this image.)

Exhibit 10

McCrory, P. (2015). Head injuries in sports. In M. N. Doral & J. Karlsson (Eds.), Sports injuries (pp. 2935–2951). Springer.

About 90% of the words in this book chapter have been copied, verbatim and without appropriate attribution, from other sources, as follows:

  • Blue (light and dark): McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Head injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 58–94). Wiley.
  • Yellow: McCrory, P., le Roux, P. D., Turner, M., Kirkeby, I. R., & Johnston, K. M. (2012). Rehabilitation of acute head and facial injuries. In R. Bahr (Ed.), The IOC manual of sports injuries (pp. 95–100). Wiley.
  • Pink: McCrory, P., & Turner, M. (2015). Concussion – Onfield and sideline evaluation. In D. McDonagh & D. Zideman (Eds.), The IOC manual of emergency sports medicine (pp. 93–105). Wiley.

The pink text also appears in Exhibit 9, which was published in the same year, so it's not clear which is the original and which is the copy.

The text in dark blue has been copied twice from the same source; again, it seems as if this chapter was not assembled with any great amount of care.

(Note that six pages, corresponding to the photographic reproduction of the Sport Concussion Assessment Tool and the Pocket Concussion Recognition Tool, have been omitted from this image.)


Conclusion

The exhibits above present evidence of extensive plagiarism and self-plagiarism in seven editorial pieces in the British Journal of Sports Medicine from 2002 through 2006, and three book chapters from 2008 through 2015. As well as the violations of publication ethics and other elementary academic norms, most of these cases would also seem to raise questions about copyright violations.

This is not an exhaustive collection; I have evidence of these transgressions on a smaller scale in a number of other articles and book chapters from the same author, but a combination of time, weariness (of me as investigator and, presumably, of the reader too), and lack of access to source materials (for example, I was only able to find one extensively recycled book chapter on Google Books, which is not very practical for marking up) has led me to stop at 10 exhibits here.

I have no background or experience in the field of head trauma or sports medicine, and I had never heard of Dr McCrory or the CISG until last week. Hence, I am unable to comment about what all of this might mean for the CISG or its influence on the rules and practices of sport. However, although I try not to editorialise too much in this blog, I must say that, based on what I have found here, Dr McCrory does not strike me as an especially outstanding example of scientific integrity, and it does make me wonder what other aspects of his life as a scientist and influencer of public policy might not stand up to close scrutiny.


Data availability

All of the supporting files for this post can be found here. I imagine that this involves quite a few copyright violations of my own, in that many of the source documents are not open access. I hope that the publishers will forgive me for this, but if I receive a legal request to take down any specific file I will, of course, comply with that.