06 May 2021

My minor involvement in the investigation of some strange articles from marine ecology

Today's topic is this report in Science by Martin Enserink about possible scientific misconduct in a series of studies that investigate the relation between increasing CO2 levels (causing a decrease in the pH of the world's oceans) and the behaviour of fish. Martin's report gives most of the background that you will need to follow this post. While he was preparing it, he asked me to look at the dataset for a couple of articles from the research group whose work he was investigating. In particular, I found a lot of interesting things in this article:

Dixson, D. L., Abrego, D., & Hay, M. E. (2014). Chemically mediated behavior of recruiting corals and fishes: A tipping point that may limit reef recovery. Science, 345(6192), 892–897. https://doi.org/10.1126/science.1255057
(You can find the PDF of the article on a ResearchGate page here; I'm not sure if this direct link to the file will work.)

Most of my analyses of the article and its associated dataset are written up in a report that you can find here [PDF]. In this short post I just want to mention one other point that isn't in that report, which is the whole question of why the dataset is in the form of an Excel file (which you can find here [XLSX]) in the first place.

As I note in the report, just for the observations of the behaviours of 15 different species of fish (the first 15 of the 19 worksheets in the Excel file) the researchers must have made 864,000 separate notes. That is, the real "Raw Data" (this phrase appears in the title of the Excel file, but those are not the "raw data") consist of 864,000 entries corresponding to the position, in one of two possible channels, of 20 examples of 15 species of fish captured from 6 locations being recorded in 10 samples of water over 2 sets of trials of 2 minutes each with 12 observations per minute (20 x 15 x 6 x 10 x 2 x 2 x 12 = 864,000). That's almost a million ones and zeroes, each labelled with a species, fish number, capture location, water type, trial number, and sequence number of the observation.
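The multiplication is easy enough to check by hand, but for completeness, here it is in a few lines of Python (the factor names are just my labels for the counts described above):

```python
from math import prod

# Counts described in the text: 20 fish per species, 15 species,
# 6 capture locations, 10 water samples, 2 trials of 2 minutes each,
# 12 observations per minute.
factors = {
    "fish_per_species": 20,
    "species": 15,
    "capture_locations": 6,
    "water_samples": 10,
    "trials": 2,
    "minutes_per_trial": 2,
    "observations_per_minute": 12,
}

total = prod(factors.values())
print(total)  # 864000
```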

Somewhere there must exist, or at the very least have existed, a CSV file (or, perhaps, a file in some proprietary format, such as SPSS or SAS or Stata; but as far as I know, all of those packages can export to CSV format) containing those raw numbers. Even if the 864,000 observations were initially made on a very very very large stack of paper, at some point they would have been entered into a computer in a format from which the analyses reported in the article could have been run. Importantly, the analyses could almost certainly not have been run directly from this Excel file, because of the inconsistencies that it contains. Indeed, when I wrote some code (available here) to try to extract some summary statistics from the dataset, I had to explicitly work around the errors in the data, such as the cases where there are 21 rather than 20 fish in a set of tests, or where data elements are in different positions from one sheet to the next. Had the original analyses been based on these Excel sheets, the authors would surely have noticed that these misalignments were causing strange results or even crashes, and fixed the dataset. (And if by some chance they didn't notice these problems, there would be some inconsistencies in the published results, whereas—as the group of investigators led by Timothy Clark has pointed out—the results in this paper, and indeed across multiple studies from this laboratory, are remarkably uniform.)
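To give a flavour of what "working around the errors" means in practice, here is a minimal Python sketch. The sheet layouts below are invented for illustration (the real data are in an XLSX file, which one would load with a library such as openpyxl); what the sketch shows is the kind of defensive code you are forced to write when the header row moves around between sheets and a "set of 20 fish" sometimes contains 21:

```python
# Hypothetical sketch: each "sheet" is a list of rows (lists of cells).
# Real code would read these rows from the Excel file; the two synthetic
# sheets below reproduce the inconsistencies described in the text.

EXPECTED_FISH = 20

def find_header_row(sheet, label="Fish"):
    """Return the index of the row whose first cell is `label`.
    Needed because the header is not in the same position on every sheet."""
    for i, row in enumerate(sheet):
        if row and row[0] == label:
            return i
    raise ValueError("no header row found")

def count_fish(sheet):
    """Count non-empty fish IDs below the header row."""
    start = find_header_row(sheet) + 1
    return sum(1 for row in sheet[start:] if row and row[0] not in (None, ""))

sheet_a = [["Species X"], ["Fish"]] + [[n] for n in range(1, 21)]             # 20 fish
sheet_b = [["Species Y"], ["notes"], ["Fish"]] + [[n] for n in range(1, 22)]  # 21 fish

for name, sheet in (("A", sheet_a), ("B", sheet_b)):
    n = count_fish(sheet)
    if n != EXPECTED_FISH:
        print(f"sheet {name}: expected {EXPECTED_FISH} fish, found {n}")
```

Any analysis pipeline run directly over sheets like these would either have to contain checks of this kind, or would produce visibly odd output the first time it hit a misaligned sheet.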

As Martin's article mentions, several other datasets (all Excel files) from the same laboratory seem to have similar problems. There seems to be a consistent pattern of the researchers deciding that in order to share their data, rather than just making their CSV file available with a few notes to explain the purpose and/or labels of each variable, they needed to laboriously re-enter their data into an Excel sheet, with lots of needless formatting whose only effects are to (a) increase the chance of errors and (b) make it harder for anyone to replicate their analyses in software. Meanwhile, the original raw files from which the statistics and charts in the published articles were made remain mysteriously absent. It is difficult to understand why anybody would work this way, when simply sharing the actual raw data would represent less effort and be much more reliable.

[2021-05-10 20:58 UTC: Updated link to my analysis report with a new version. ]