Today's topic is this report in Science by Martin Enserink about possible scientific misconduct in a series of studies that investigate the relation between increasing CO2 levels (causing a decrease in the pH of the world's oceans) and the behaviour of fish. Martin's report gives most of the background that you will need to follow this post. While he was preparing it, he asked me to look at the dataset for a couple of articles from the research group whose work he was investigating. In particular, I found a lot of interesting things in this article:
Most of my analyses of the article and its associated dataset are written up in a report that you can find here [PDF]. In this short post I just want to mention one other point that isn't in that report, which is the whole question of why the dataset is in the form of an Excel file (which you can find here [XLSX]) in the first place.
As I note in the report, just for the observations of the behaviours of 15 different species of fish (the first 15 of the 19 worksheets in the Excel file) the researchers must have made 864,000 separate notes. That is, the real "Raw Data" (this phrase appears in the title of the Excel file, but those are not the "raw data") consist of 864,000 entries corresponding to the position, in one of two possible channels, of 20 examples of 15 species of fish captured from 6 locations being recorded in 10 samples of water over 2 sets of trials of 2 minutes each with 12 observations per minute (20 x 15 x 6 x 10 x 2 x 2 x 12 = 864,000). That's almost a million ones and zeroes, each labelled with a species, fish number, capture location, water type, trial number, and sequence number of the observation.
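For readers who want to check the arithmetic, the count can be reproduced in a few lines of Python. This is only a sketch: the dictionary keys are labels invented for this illustration, and the values are the design factors listed in the paragraph above.

```python
from math import prod

# Design factors as described in the post (key names are illustrative only).
design = {
    "fish_per_group": 20,
    "species": 15,
    "capture_locations": 6,
    "water_samples": 10,
    "trial_sets": 2,
    "minutes_per_trial": 2,
    "observations_per_minute": 12,
}

# Every combination of factors yields one recorded position (a one or a zero).
total_observations = prod(design.values())

# Time spent on the timed observations alone, at 12 observations per minute.
observation_minutes = total_observations // design["observations_per_minute"]
observation_hours = observation_minutes / 60

print(f"{total_observations:,} observations")        # 864,000 observations
print(f"{observation_minutes:,} minutes = {observation_hours:,.0f} hours")  # 72,000 minutes = 1,200 hours
```

The time figure covers only the timed observations themselves, so it is a lower bound on the effort involved; anything else the protocol required between observations is not counted.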
Somewhere there must exist, or at the very least have existed, a CSV file (or, perhaps, a file in a proprietary format, such as SPSS or SAS or Stata; but as far as I know, all of those packages can export to CSV format) containing those raw numbers. Even if the 864,000 observations were initially made on a very very very large stack of paper, at some point they would have been entered into a computer in a format from which the analyses reported in the article could have been run. Importantly, the analyses could almost certainly not have been run directly from this Excel file, because of the inconsistencies that it contains. Indeed, when I wrote some code (available here) to try to extract some summary statistics from the dataset, I had to explicitly work around the errors in the data, such as the cases where there are 21 rather than 20 fish in a set of tests, or where data elements are in different positions from one sheet to the next. Had the original analyses been based on these Excel sheets, the authors would surely have noticed that these misalignments were causing strange results or even crashes, and fixed the dataset. (And if by some chance they didn't notice these problems, there would be some inconsistencies in the published results, whereas, as the group of investigators led by Timothy Clark has pointed out, the results in this paper, and indeed across multiple studies from this laboratory, are remarkably uniform.)
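To make the point concrete, here is a minimal sketch of how trivially such a problem surfaces when the data are in a labelled, one-row-per-observation CSV file. This is not the code linked above, and the column names (`species`, `fish_id`, `location`, `water`, `trial`, `obs`, `channel`) and the demo data are hypothetical; the only figure taken from the dataset description is the expected 20 fish per set of tests.

```python
import csv
import io
from collections import defaultdict

EXPECTED_FISH = 20  # fish per set of tests, as described in the post


def fish_counts(rows):
    """Count distinct fish IDs in each (species, location, water) group."""
    fish = defaultdict(set)
    for row in rows:
        key = (row["species"], row["location"], row["water"])
        fish[key].add(row["fish_id"])
    return {key: len(ids) for key, ids in fish.items()}


def unexpected_groups(rows, expected=EXPECTED_FISH):
    """Return only the groups whose fish count differs from the expected one."""
    return {k: n for k, n in fish_counts(rows).items() if n != expected}


def make_demo_csv():
    """Build a tiny made-up CSV: one group with 20 fish, one with 21."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["species", "fish_id", "location", "water", "trial", "obs", "channel"])
    for water, n_fish in (("A", 20), ("B", 21)):
        for fish_id in range(1, n_fish + 1):
            for obs in range(1, 4):  # only 3 observations each, to keep the demo small
                writer.writerow(["sp1", fish_id, "loc1", water, 1, obs, obs % 2])
    return out.getvalue()


rows = list(csv.DictReader(io.StringIO(make_demo_csv())))
problems = unexpected_groups(rows)
print(problems)  # {('sp1', 'loc1', 'B'): 21}
```

With the data in this shape, a group containing 21 fish is flagged by a dozen lines of code, which is part of why it is hard to believe that analyses were run on a file containing such errors without anyone noticing.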
As Martin's article mentions, several other datasets (all Excel files) from the same laboratory seem to have similar problems. There seems to be a consistent pattern of the researchers deciding that in order to share their data, rather than just making their CSV file available with a few notes to explain the purpose and/or labels of each variable, they needed to laboriously re-enter their data into an Excel sheet, with lots of needless formatting whose only effects are to (a) increase the chance of errors and (b) make it harder for anyone to replicate their analyses in software. Meanwhile, the original raw files from which the statistics and charts in the published articles were made remain mysteriously absent. It is difficult to understand why anybody would work this way, when simply sharing the actual raw data would represent less effort and be much more reliable.
[2021-05-10 20:58 UTC: Updated link to my analysis report with a new version. ]
"It is difficult to understand why anybody would work this way, when simply sharing the actual raw data would represent less effort and be much more reliable."
I can think of a reason or two.
The other thing is that it is quite easy to turn .csv files into Excel columns, if you wanted to do so. There is no reason to re-enter the data manually.
The authors state in the methods:
"A two-minute acclimation period was followed by a three-minute testing period where the position of the planula, on either the right or left side of the chamber, was reco[r?]ded at five-second intervals."
"The test was then repeated, including the acclimation period, to ensure planula were displaying a preference for the chemical cues rather than one side of the chamber."
The methods do not tell readers how this information (left / right) was recorded. Was it written down on a piece of paper or in a notebook (by two people, an observer and a writer, or by one person observing and taking notes at the same time)? Was it entered directly into a computer, laptop, etc. (by one or two people)? Or was audio equipment used (in which case only one observer would be needed)? In my opinion, these records are the raw research data (and not an Excel file that was created much later). Where is this information?
The authors also state in the methods:
"Using underwater paper a grid was made on the bottom of each aquaria displaying a 20mm boundary around each tile. Planulae were counted as settling on the tile if they were on it or within this 20 mm boundary near it."
How was this information (in / out) recorded? On a piece of paper? Or directly entered into a database? Or by using audio equipment? Where is this information (once again raw research data, at least in my opinion)?
Etc.
Any idea if Fredrik Jutfelt & co have contacted the authors about the whereabouts of these data?
All excellent questions. I don't know if the main investigators have asked about the specific raw data for this study. The Excel file is dated September 2016, suggesting either that the "truly" raw data were still around then, or that there is another dataset in a non-Excel format containing the same information as the Excel sheet.
Thanks.
Nick wrote: "At 10 hours a day and six days a week, this would take 40 weeks; in practice it would probably take well over a person-year (and have been soul-crushingly repetitive). Only one person (“V. Bonito”) is named in the Acknowledgements, and it is not even clear if they were a research assistant."
Re-reading the Methods section of the supplementary file shows, in my opinion, that a huge amount of time was needed for all the activities, experiments, etc. that are not included in these "40 weeks". See below for some examples:
"Using randomized selection half of the plots were cleared of macroalgae by hand (...) Macroalgal removal plots were maintained bi-weekly initially; this was reduced to monthly once algal re-growth was determined to be slow."
The authors do not indicate how much time (half an hour, one hour, several hours?) is needed to clear one plot (in total 15 plots needed to be cleared).
"Offshore seawater was collected from 2 km offshore and used as a control to test for side bias when tested against itself (blank trial; n=20 individual fish per species per location; n=1800 when pooled)."
I was unable to sort out how often a trip (with a boat?) was made to collect this seawater. How much time would be needed for one such trip?
"Directly adjacent to 2 sets of cleared and natural plots in each MPA and non-MPA per village tile arrays mounted on PVC poles were cemented into the benthos. Each tile array stood 50 cm off of the seafloor and contained 8-15x15 cm unglazed tiles (=16 unglazed surfaces) as potential coral settlement sites."
Do there exist pictures of these structures? Are they composed from parts which can be bought in a shop? Or do they contain parts which have been prepared by for example a (local) technician (or by the authors)? How many hours are needed to install these structures?
"All planula were tested only once."
"Colonies were transported from the reef to experimental holding pools each night for spawning and gamete collection. Different species were maintained in separate stagnate 1,000L pools. Each night before corals were brought into the laboratory, holding pools were filled with 5 μm filtered ocean water and complete water changes occurred each day to ensure highest water quality."
So how many planulae were tested, and how long (in months?) did this period of testing last?
I also have difficulty working out the exact period (years) when this research was conducted in Fiji. The authors state in the supplementary file:
"Plots were initiated in March 2012 and maintained through the coral recruitment season, ending on April 2013."
"Four individual colonies of each of species containing mature pigmented oocytes were carefully dislodged from the reef using a hammer and chisel four days following the October full moon of 2012."
"Transects were conducted during the annual recruitment pulse (December-January 2012-2013) with densities assessed by slowly swimming a 30 m line and carefully searching 1 m on each side of the line."
Does this imply that all field work and all lab experiments were conducted in the period March 2012 - April 2013? I was unable to find this information in the main text. I fail to understand why the authors have not listed their study period in the main text of their article. Does "21 April 2014" mean that the manuscript was submitted on 21 April 2014 to Science?
I am not a marine ecologist. I am therefore offering my apologies for misunderstandings and/or for asking very stupid questions. Maybe lots and lots of students were involved in this project?