15 July 2021

Some problems in the dataset of a large study of Ivermectin for the treatment of Covid-19

This post appears at the same time as this piece at grftr.news by Jack Lawrence. Jack contacted me to ask if I could help him look at a number of issues with a prominent study of Ivermectin for the treatment of Covid-19. My speciality is forensic numerical data analysis (that is, finding errors in numbers), so I concentrated on that and suggested some other names to Jack to help him look at things like the study design, methods, and reporting.

Hence, this post is almost entirely about the problems with the data from this study; Jack's piece covers other topics, such as the plagiarised text, the clinical trial "pre-registration" that was made after the first version of the results of the study was published, and many other problems. Gideon Meyerowitz-Katz has a piece over at Medium that discusses the implications of this study for the whole Ivermectin-for-Covid literature, and Melissa Davey is covering the story in the Guardian today.

Here is the article reference. In fact, it's not in a peer-reviewed journal. It's just a preprint—in spite of which it has already acquired 30 citations in just over six months, according to Google Scholar, and—as reported by Jack—has also become a major component of the weight of evidence for the efficacy of Ivermectin in several meta-analyses (see Gideon's Medium piece, linked above, for more on this).

Elgazzar, A., Eltaweel, A., Youssef, S. A., Hany, B., Hafez, M., & Moussa, H. (2020). Efficacy and safety of Ivermectin for treatment and prophylaxis of COVID-19 pandemic. Research Square, 100956. https://doi.org/10.21203/rs.3.rs-100956/v3

The preprint is currently in its third revision(*). You can download it, plus the two previous revisions if you want to compare those, from the preprint hosting service Research Square. (I use draftable.com to compare PDFs.)

The authors have, well, "sort of" made their data available. To quote from the preprint (p. 6): "The study data master sheet are [sic] available on reasonable request from the corresponding auther [sic] from the following link. https://filetransfer.io/data-package/qGiU0mw6#link". It is tempting to imagine that one might be able to download the data file directly from that link; however, when you attempt to do that, the site says that you have to create a premium account ($9 per month), and after you have done that and downloaded the file, it turns out to be password-protected. This suggests that the authors did not want anyone to be able to read it without their approval, which is not quite in the spirit of open science. (It is, however, not incompatible with Research Square's rather feeble data sharing policy.)

Fortunately, Jack Lawrence did a lot of work here. Not only did he pay for a premium account at filetransfer.io, but he also guessed the password of the file, which turned out to be 1234. I have never met Jack Lawrence in person, though, so as part of my due diligence for this blog post, I also paid $9 plus VAT for a one-month subscription to filetransfer.io, and downloaded the file for myself. To save you, dear reader, from having to go through that process, I have made an unlocked copy of the file available here. It is perhaps interesting to note that, judging by the filename, the authors were apparently still editing the "study data master sheet" on 12 December 2020, when they had already posted essentially all of their results in two earlier versions of their preprint by November 16.

Formatting problems

The data file is in Microsoft Excel (.XLSX) format, although the authors reported performing their analyses in SPSS 21. In the Excel metadata (File/Properties/Statistics), the creation date of the file is "16 September 2006 02:00:00", which suggests that the authors started with an older file and cleared all the cells before entering their data. Clearing the cells in this way does not remove cell formatting, which might explain one or two of the stranger cell formats that one sees when opening the file in Excel (e.g., in cell K5 the number 6.3 is formatted as a day and appears as "06-Jan", while B222 and F225:Z225 are in a different font); however, the formatting problems go a lot further than that.

Numbers containing non-numeric characters

Several cells that represent numbers in the Excel file appear to have been entered by someone more used to a manual typewriter than a computer. Specifically, cells K17, L318, L354, L366, L380, M38, M101, M396:M402, S272, S278, S280, S396, and S398 contain one or more occurrences of the lowercase letter "o" instead of the digit "0". As a result, these cells are text strings, rather than numbers, and any numerical calculations based on them will fail.

Because these cells contain strings, their values are left-aligned (the default for strings in Excel), whereas the numbers in the same column are right-aligned. In many cases it seems that the creator of the data file has attempted to remedy this visual infelicity by left-padding the non-numeric string with spaces. For example, although the value "1.o" [sic] in cell L318 has no padding, the same value in cells L354, L366, and L380 has been padded on the left with 33, 34, and 32 space characters, respectively.

Relatedly, the percentages in cells M89, M94, M128, S232, S243, S245, S250, S261, S262, S274, and S279 contain a comma as a decimal separator, instead of a dot, and so again are treated as text strings rather than numbers. These cells with commas are padded on the left with between 12 and 16 leading space characters in column S, although there is no padding in column M.
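As a minimal sketch of the clean-up that such cells need before any calculation is possible, here is a small Python helper (Python rather than the R used for the analysis in this post). The function name and the sample strings are illustrative assumptions; the sketch only handles the three defects described above, namely space padding, the letter "o" typed for the digit "0", and a comma used as the decimal separator.

```python
def clean_numeric(cell: str) -> float:
    """Convert a mistyped numeric cell to a float.

    Assumes the only defects are space padding, the letter "o" typed
    for the digit "0", and a comma used as the decimal separator.
    """
    text = cell.strip()
    text = text.replace("o", "0").replace("O", "0")
    text = text.replace(",", ".")
    return float(text)


# Illustrative strings modelled on the defects described above
# (hypothetical values, not copied from the data file).
samples = ["1.o", " " * 33 + "1.o", " " * 12 + "12,5"]
print([clean_numeric(s) for s in samples])
```

Any cell with a defect outside those three cases will still raise a ValueError, which is arguably the desired behaviour for data that has to be inspected by hand anyway.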

Confusion around date formats

Columns W and X of the Excel file contain dates with the captions "symptoms date&+ ve PCR" and "recovery date & -ve PCR". It seems that these dates are performing multiple duties, since there is no obvious reason why a patient's date of first showing symptoms of Covid-19 should be identical to the date on which they first tested positive, or why the date of their (first?) negative PCR test should correspond with their doctor's (?) certification that they have recovered. These dates seem to have also been used to calculate the length of time during which people were hospitalised (column Y), although again, one would generally expect the dates of a participant's hospital admission to be somewhat decoupled from their dates of first symptoms and/or PCR tests. I find it very surprising that there are not more dates recorded for each patient, to account for the various milestones that are of importance in the progression and treatment of Covid-19.

However, it is not only the meaning of these dates that is confusing. Their format is, too. The only dates that are actually formatted as Excel dates (i.e., with an underlying number representing the count of days since December 30, 1899) are those where both the day and month are less than 13. My working hypothesis is that the creator of the file either typed in the dates by hand, or pasted them from a text file, in dd/mm/yyyy format, but that Excel was in "US date mode" at the time. Thus, the only dates that were converted to the underlying numeric date format were those that were interpretable as mm/dd/yyyy (i.e., those with a dd/mm/yyyy "day" less than 13; I assume that there were no errors with a dd/mm/yyyy "month" greater than 31).
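The working hypothesis above can be illustrated with a short Python sketch (again, not the R code used for this post). It checks whether a string that was typed as dd/mm/yyyy happens to be a valid date when read as mm/dd/yyyy; the function name and the two sample entries are assumptions chosen for illustration.

```python
from datetime import datetime


def parses_as_us_date(text: str) -> bool:
    """Return True if text is a valid date when read as mm/dd/yyyy."""
    try:
        datetime.strptime(text, "%m/%d/%Y")
        return True
    except ValueError:
        return False


# Hypothetical dd/mm/yyyy entries: 5 June 2020 and 18 August 2020.
# Only the first can be (mis)read as a US-style date (6 May 2020).
for entry in ["05/06/2020", "18/08/2020"]:
    print(entry, parses_as_us_date(entry))
```

Under this reading, an entry whose "day" is 13 or more cannot be converted and would be left behind as a text string, which matches the pattern seen in columns W and X.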

As with the numbers that contain non-numeric characters (see previous section), it seems that padding with spaces has been added manually in an attempt to align the "string" dates with the "correct" (numeric) dates. In column W, there are 176 numeric dates, 115 "string" dates with no padding, and 109 "string" dates with left padding of between 19 and 27 spaces; three of these also have padding on the right, of 3, 6, and 33 spaces. In column X, after removing the text "died in ICU" from 21 dates and "died bin ICU" [sic] from another, there are 94 numeric dates, 305 "string" dates with left padding of between 1 and 28 spaces (three of which also have padding on the right, of 1 and 8 spaces), and one "string" date with left padding of a backquote character (`) and 22 spaces (cell X259). In 6 cases (cells X8, X15, X23, X62, X75, and X402), the discharge date in column X also includes the number of days spent in hospital, which should be in column Y; it appears that whoever was inputting the data may have thought that using spaces to move the cursor over to the right of the cell boundary between the two columns was equivalent to using the Tab key to move to the next cell. 

Several of the "string" dates are incorrectly formatted internally, e.g., "1l6l2020" (cell W110, with lowercase "L" as the separator), "06/82020" (cell W208), "31/7//2020" (cell X223), and "6/8/20/20" (cell X230). The date in cell X155 ("31/06/2020") is ostensibly formatted correctly, but implies that the patient was discharged on the non-existent date of 31 June 2020.

In summary, it is impossible for the dates in the file to have been used to calculate any sort of elapsed time in SPSS. Indeed, it seems that this calculation was done by hand, with the results being reported in column Y (with the addition of the text " days"), and with different "fencepost" rules typically being applied for each group. For example, in groups I and III the number of days in column Y is usually one more than the difference between the dates in columns W and X (i.e., both the start and end date are counted), whereas for groups II and (especially) IV the number of days in column Y is typically equal to the difference between the dates in columns W and X. (See also "Table 6", below, for a brief discussion of the apparent confusion between columns W and X on the one hand, and column Y on the other.)
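The two "fencepost" conventions described above differ by exactly one day. A minimal Python sketch, with a hypothetical patient and a hypothetical function name, makes the distinction concrete:

```python
from datetime import date


def stay_lengths(start: date, end: date) -> tuple:
    """Return (exclusive, inclusive) day counts between two dates."""
    exclusive = (end - start).days  # plain difference between the dates
    inclusive = exclusive + 1       # counts both the first and the last day
    return exclusive, inclusive


# Hypothetical patient: positive PCR on 1 July 2020, negative on 6 July 2020.
print(stay_lengths(date(2020, 7, 1), date(2020, 7, 6)))
```

For this invented patient the plain difference is 5 days and the inclusive count is 6; the first convention is the one that seems to apply to groups II and IV, the second to groups I and III.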

Repeated sequences

At several points in the Excel file, there are instances where the values of an ostensibly random variable are identical in two or more sequences of 10 or more participants, suggesting that ranges of cells or even entire rows of data have been copied and pasted.
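One way to search a column for this kind of duplication is sketched below in Python (not the R code used for this post). The function name and the demonstration data are assumptions. The sketch reports every maximal run of at least min_len positions where the column matches a copy of itself shifted down by some offset. It is only meaningful for variables that vary from patient to patient, since a constant or near-constant column would match itself trivially.

```python
def find_repeated_runs(values, min_len=10):
    """Return (start_a, start_b, length) for each maximal run where
    values[start_a + k] == values[start_b + k] for k in range(length).

    Indices are 0-based positions in the list, not Excel row numbers.
    """
    n = len(values)
    runs = []
    for offset in range(1, n):
        run_start, run_len = None, 0
        # One extra iteration (i == n - offset) flushes a run that
        # reaches the end of the column.
        for i in range(n - offset + 1):
            same = i < n - offset and values[i] == values[i + offset]
            if same:
                if run_len == 0:
                    run_start = i
                run_len += 1
            else:
                if run_len >= min_len:
                    runs.append((run_start, run_start + offset, run_len))
                run_len = 0
    return runs


# Hypothetical column of 100 distinct values, with positions 10..21
# copied and pasted over positions 40..51.
column = list(range(1000, 1100))
column[40:52] = column[10:22]
print(find_repeated_runs(column))
```

This simple version compares every pair of positions, which is fine for a few hundred rows; it also only finds exact copies, so a run with one altered cell shows up as two shorter runs.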

Approximately 19 cloned patients in group II

In cells B150:B168 and B184:B202, the patient's initials are either identical at each corresponding point (e.g., cells B150/B184) or, in almost all the remaining cases, differ in only one letter.

Cells C150:C168 are identical to cells C184:C202.
Cells D150:D168 are identical—with one exception out of 19 cells—to cells D184:D202.
Cells I150:I167 are identical to cells I184:I201.
Cells S150:S165 are identical (with one exception out of 16 cells) to cells S184:S199.
Cells U150:U168 are identical to cells U184:U202.
Cells V150:V168 are identical to cells V184:V202.
Cells W150:W168 are identical (with three exceptions out of 19 cells) to cells W184:W202.
Cells AA150:AA168 are identical to cells AA184:AA202.

Approximately 60 cloned patients in group IV

In cells B303:B320, B321:B338, and B339:B356, the patient's initials are either identical at each corresponding point (e.g., cells B303/B321/B339) or, in almost all the remaining cases, differ in only one letter.

Cells I303:I320 are identical to cells I321:I338 and I339:I356, including the typo "coguh" for "cough".
Cells I358:I371 are identical to cells I372:I385, including the typo "coguh" for "cough".
Cells I340:I349 are identical—with one exception out of 10 cells—to cells I386:I395.

Cells J303:J320 are identical to cells J321:J338 and J339:J356.
Cells J358:J371 are identical to cells J372:J385.
Cells J340:J349 are identical to cells J386:J395.

Cells K303:K320 are identical to cells K321:K338 and K339:K356.
Cells K358:K371 are identical to cells K372:K385.
Cells K340:K349 are identical to cells K386:K395.

Cells L303:L320 are identical—with two exceptions out of 18 cells—to cells L321:L338 and L339:L356.
Cells L358:L371 are identical—with one exception out of 14 cells—to cells L372:L385.
Cells L340:L349 are identical—with two exceptions out of 10 cells—to cells L386:L395.

Cells M303:M320 are identical to cells M321:M338 and M339:M356.
Cells M358:M371 are identical to cells M372:M385.
Cells M340:M349 are identical to cells M386:M395.

Cells S303:S320 are identical to cells S321:S338 and S339:S356.
Cells S358:S371 are identical to cells S372:S385.
Cells S340:S349 are identical to cells S386:S395.

Cells U303:U320 are identical to cells U321:U338 and U339:U356.
Cells U358:U371 are identical to cells U372:U385.
Cells U340:U349 are identical to cells U386:U395.

Cells W303:W320 are identical to cells W321:W338 and W339:W356.
Cells W358:W371 are identical to cells W372:W385.
Cells W340:W349 are identical to cells W386:W395.

Cells Y303:Y320 are identical (apart from spacing differences) to cells Y321:Y338 and (with one exception out of 18 cells) Y339:Y356.
Cells Y358:Y371 are identical—with three exceptions out of 14 cells—to cells Y372:Y385.
Cells Y340:Y349 are identical to cells Y386:Y395.

Cells Z303:Z320 are identical to cells Z321:Z338 and Z339:Z356.
Cells Z358:Z371 are identical—with three exceptions out of 14 cells—to cells Z372:Z385.
Cells Z340:Z349 are identical to cells Z386:Z395.

Duplicated cells in groups II (top) and IV (bottom). In each column, groups of 10 or more cells with the same background colour and surrounded by a solid black border are identical. These images are screenshots from my annotated version of the Excel data file (see "Resources", below).

These patterns are not consistent with groups II and IV each containing the results of 100 different, real patients. The chances of any one of these duplications occurring by chance, let alone all of them, are astronomical. These patterns are, however, highly consistent with the idea that the Excel file has been fabricated with extensive use of copy/paste operations, followed perhaps by occasional attempts to obscure this "cloning" process by changing some numbers manually. Indeed, the slight imperfections in some of the copies would seem to exclude the possibility that these patterns are the result of an unfortunate slip of the mouse.

It seems indisputable that the patients in group II (mild/moderate disease, control condition) whose records are found at lines 184 through 202 of the Excel file (a total of 19 people) are crude "clones" of the data of other patients (who, themselves, may or may not have actually existed). Similarly, it is hard to think of any explanation for the duplications in lines 321 through 356 and 372 through 395, other than that the records of around 60 patients in group IV (severe disease, control condition) have been "cloned", some of them multiple times. The question then naturally arises of which other records in the file may not reflect the reality of the patients in the study.

Apparent failures of randomisation

The patients in groups I and II (mild/moderate disease, treatment and control) ought to have been similar to each other; likewise the patients in groups III and IV (severe disease, treatment and control). Indeed, the authors state (p. 3) that "A block randomization method was used to randomize the study participants into two groups that result in equal sample size. This method was used to ensure a balance in sample size across groups over the time and keep the number of participants in each group similar at all times". (Aside: I would be grateful if someone could explain to me what the second sentence there implies for the execution of the study.)

However, the randomisation does not appear to have been a complete success. For example:
  • In group I, the number of patients with anosmia as an additional symptom was 25. In group II, this number was 4.
  • In group I, the number of patients with loss of taste as an additional symptom was 25. In group II, this number was 0.
  • In group III, the number of patients with vomiting as an additional symptom was 1. In group IV, this number was 12.
  • In group III, the number of patients with bronchial asthma as a comorbidity was 14. In group IV, this number was 0.
  • In group III, the number of patients with cholecystitis, chronic kidney disease, hepatitis B, hepatitis C, and open heart surgery as comorbidities was 0 in all five cases. In group IV, these numbers were 6, 5, 5, 6, and 6, respectively.
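To give a rough idea of how unlikely such imbalances are under genuine randomisation, here is a minimal Python sketch of a two-sided Fisher's exact test using only integer arithmetic. It is an illustration rather than part of the analysis code for this post: the function name is hypothetical, and the sketch assumes that each group contains 100 patients.

```python
from math import comb


def fisher_exact_two_sided(a, b, n1, n2):
    """Two-sided Fisher's exact p value for a 2x2 table.

    a of the n1 patients in the first group, and b of the n2 patients
    in the second group, have the characteristic of interest.
    """
    total_with = a + b

    def weight(k):
        # Unnormalised hypergeometric probability of k cases in group 1.
        return comb(n1, k) * comb(n2, total_with - k)

    lowest = max(0, total_with - n2)
    highest = min(n1, total_with)
    observed = weight(a)
    # Sum over all tables that are no more probable than the observed one.
    extreme = sum(weight(k) for k in range(lowest, highest + 1)
                  if weight(k) <= observed)
    return extreme / comb(n1 + n2, total_with)


# Anosmia: 25 in group I versus 4 in group II (assumed n = 100 per group).
print(fisher_exact_two_sided(25, 4, 100, 100))
# Bronchial asthma: 14 in group III versus 0 in group IV (same assumption).
print(fisher_exact_two_sided(14, 0, 100, 100))
```

Because the weights are compared as exact integers, there is no floating-point tolerance to tune; only the final division produces a float.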

Descriptive statistics that do not match the preprint

The first three paragraphs of the Results section of Elgazzar et al.'s preprint contain descriptions of the characteristics of their sample. Here, I reproduce the text of each of those paragraphs. Where the numbers that I calculated from the data set differ from those reported, I have included my calculated values in red and inside brackets. It should be apparent that while a few of the numbers calculated from the Excel sheet match those in the preprint, the great majority do not.

First paragraph:
The mean age in Group I was 56.7 [47.5] ±18.4 [15.1]; included 72 % males and 28 % females. The mean age in Group II was 53.8 [43.2] ±21.3 [16.1]; included 67 [66] % males and 33 [34] % females. The mean age in Group III was 58.2 [55.0] ±20.9 [14.0]; included 68 [74] % males and 32 [26] % females. The mean age in Group IV was 59.6 [54.2] ±18.2 [13.7]; included 74 [73] % males and 26 [27]% females. The mean age in Group V was 57.6 [48.8] ±18.4 [9.2]; included 75 % males and 25% females. The mean age in Group VI was 56.8 [54.4]±18.2 [8.8]; included 72 % males and 28% females. There was no statistical significance variation between groups regarding mean age or sex distribution (p-value >0.05).

The sex of one of the participants in group V was coded in the Excel sheet as "A" (cell C449), rather than "M" or "F". The preprint made no mention of any patients identifying as anything other than Male or Female. In order for the numbers of patients of each sex in group V to match the numbers reported in the preprint, I counted "A" as "M".

Second paragraph:
Co morbid conditions distributed between different studied groups showed that DM was present in 15 [4]% of Group I patients, 14 [16]% of Group II patients, 18% of Group III patients, 21 [26]% of Group IV patients 15% of group V and 19 % of group VI. HTN presented in 11 [6]% of Group I patients, 12 [13] % of Group II patients, 14% of Group III patients, 18 [32]% of Group IV patients ,15 [14] % of group V patients and 14 [13]%of group VI patients . 2 [1]% of Group I patients had IHD versus 6 [7]% in Group II, 5% in group III; 12 [5]% in group IV;1% in group V and 3 [4] % in group VI respectively with statistically significant prevalence of ischemic heart disease as severity increase (p-value < 0.03).. Bronchial asthma presented in 5 [3]% of Group I patients, 6 % of Group II patients, and 14% of Group III patients, in 12 [0]% of Group IV patients; 5% of group V and 4% of group VI patients.

I assume that the authors' calculation of "prevalence of ischemic heart disease as severity increase" involved grouping the patients into three pairs of groups by severity (groups I and II, groups III and IV, and groups V and VI). Here are the results of that operation using, first, their IHD prevalence numbers, and second, my calculated numbers. The authors reported a p value of "< 0.03".

> chisq.test(c(8, 15, 4))
X-squared = 6.8889, df = 2, p-value = 0.03192

> chisq.test(c(8, 10, 5))
X-squared = 1.6522, df = 2, p-value = 0.4378
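For readers without R, the two tests above can be checked with a few lines of Python. This is a sketch with a hypothetical function name: it performs the same goodness-of-fit test against equal expected counts, and relies on the fact that with two degrees of freedom the upper-tail probability of the chi-square distribution is exactly exp(-x/2).

```python
from math import exp


def chisq_uniform_3(counts):
    """Goodness-of-fit test of three counts against equal expected counts.

    With 3 categories there are 2 degrees of freedom, for which the
    chi-square upper-tail probability is exactly exp(-x / 2).
    """
    if len(counts) != 3:
        raise ValueError("this shortcut for the p value needs exactly 3 counts")
    expected = sum(counts) / 3
    stat = sum((c - expected) ** 2 / expected for c in counts)
    return stat, exp(-stat / 2)


for counts in ([8, 15, 4], [8, 10, 5]):
    stat, p = chisq_uniform_3(counts)
    print(f"X-squared = {stat:.4f}, p-value = {p:.4f}")
```

The shortcut for the p value does not generalise to other numbers of categories, which is why the function refuses anything other than three counts.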

Third paragraph:
Clinically there was a highly statistically significant difference between groups of diseased patients regarding fatigue, dyspnea, and respiratory failure (p-value <0.001), as most of group III & IV, showed fatigue and dyspnea (86 [86, 85]% and 88 [85, 84]%, respectively), compared to (36 [28]%, 38 [47]% ; 54 [34]% and 52 [49]%, respectively), in group I & II. Respiratory failure had been detected in 38% and 40% in group III& IV respectively while no patients in group I& II developed respiratory failure. No skin manifestation had been detected in any group.

The authors' reporting of fatigue and dyspnea is unclear here, as for groups III and IV they report only two percentages. I have assumed that their claim was that these were identical for each of the two groups (i.e., fatigue, 86 for both groups; dyspnea, 88 for both groups), whereas I found three out of four numbers to be different. I was unable to calculate the percentage of respiratory failure as this was not apparently reported in the data file, although "sore throat" was. Nor could I find anything in the data file corresponding to "skin manifestation". Regarding the "highly statistically significant difference between groups of diseased patients", the p values are less than 0.001 (indeed, less than 1E-9) with the authors' numbers or mine.

Table results that do not match the preprint

I wrote some R code that attempts to reproduce the authors' tables, as far as possible. This is available here so that readers can judge (a) whether the small number of decisions that I needed to make in order to adapt the Excel sheet for analysis are reasonable, and (b) whether I have programmed the subsequent calculations correctly.

Here are the results that I obtained. (Full-resolution images are available at the same link as the code.) Readers are invited to compare these results with the tables in Elgazzar et al.'s preprint. I think it is fair to say that there is a substantial degree of divergence.

Table 1

The D-dimer results from Table 1 in the preprint could not be reproduced because no data for that measure exist in the Excel file.

Tables 2 and 3

Only three of the elements from the preprint's Tables 2 and 3 correspond to values in the Excel file after one week of treatment. Longitudinal data for HGB, TLC, and Lymphocyte % are all missing, and the Excel file contains no data for D-dimer at any time point.

The time difference between the first and last RT-PCR tests can be calculated in two ways: either using the authors' provided field (column Y) or by subtracting the date of the first PCR test (column W) from the date of the final PCR test (column X) and then (as was apparently done by the authors, at least for group I) adding one.

Table 4

The first half of Table 4 ("Prognosis") cannot be reproduced, as this variable only exists in the Excel file for groups V and VI. As for Tables 2 and 3, there are two ways to calculate the time of stay in the hospital. The per-group ranges associated with the version labelled "RecordedStay", corresponding to column Y in the Excel sheet, are not too far from the ones reported in the preprint, with 6 out of 8 numbers (minimum or maximum) being identical; this would seem to suggest that the reproduction is on the right track.

Also noteworthy here are the extremely small standard deviations of the stays in group I (both as recorded in column Y, and as calculated from columns W and X) and group III (as calculated from columns W and X; in this last case I find myself wondering why there is such a difference between the SDs of the recorded and calculated stays).

Relatedly, in the preprint, the standard deviations for the hospital stay are remarkably different between the groups. The large SD for group IV (8, with a mean of 18 and a range of 9–25) implies that about 40% of the patients stayed 9 days and 60% stayed for 25, with almost no room for any other lengths of stay, as shown by SPRITE (Heathers et al., 2018, https://peerj.com/preprints/26968/; https://shiny.ieis.tue.nl/sprite/).

SPRITE analysis of the possible distribution of the recovery times claimed by Elgazzar et al. for patients in group IV.
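The reason why an SD of 8 leaves so little room can be seen without running SPRITE. The following Python sketch (not SPRITE itself, and with invented samples) compares the most spread-out arrangement that is possible within the reported range of 9 to 25 days with an arrangement in which the stays are spread evenly over that range. The sample size of 100 is an assumption.

```python
from statistics import mean, stdev

# Hypothetical most-spread-out sample of 100 stays within the reported range:
# every patient is at one of the two limits, 9 or 25 days.
extreme = [9] * 44 + [25] * 56
print(round(mean(extreme), 2), round(stdev(extreme), 2))

# For contrast, stays spread evenly over 9..25 days (also hypothetical).
even = [9 + (i % 17) for i in range(100)]
print(round(mean(even), 2), round(stdev(even), 2))
```

The all-or-nothing sample has a mean of 17.96 and an SD of about 7.98, so an SD that is reported as 8 is only attainable when almost every patient sits at one of the two limits; the evenly spread sample has a much smaller SD.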

[[Begin update 2021-07-16 12:46 UTC]]
Alert reader Anatoly Lubarsky pointed out on Twitter that there are combinations of stay lengths that do not involve quite as many values at the limits, 9 and 25, as the chart above. He is correct. By specifying one decimal place when generating the above chart I had, in effect, told SPRITE to look for SD values in the range 7.95–8.05, whereas the authors reported only integers and so their SD could have been anywhere from 7.50001 through 8.49999. It's a bit ironic that I missed that since I previously wrote this post on pretty much exactly this topic.

rSPRITE doesn't currently work with zero decimal places, but Anatoly also provided an example that he had constructed to show what seems to me to be the most favourable (i.e., the least extreme) result from the point of view of the authors. Here is the resulting chart from that example. I do not think that this greatly alters the idea that this pattern of days spent in hospital is unlikely to be a reflection of real-world data.

Chart showing 100 values with minimum=9, maximum=25, mean=17.86 (which rounds to 18 if no decimal places are included), and SD=7.505 (rounds to 8), cf. Elgazzar et al.'s Table 4, bottom row, group IV.
See the file "SD-simulation.xls" in "Resources", below.

[[End update 2021-07-16 12:35 UTC]]

It is unclear why the authors claim to have performed a chi-squared test (χ2=87.6, p<0.001) on the value of "Recovery time &Hospital stay", as it is clearly not a categorical variable. It is tempting to imagine that this result was copied and pasted from the first half of Table 4 (with the test statistic being altered by subtracting 1 from each digit) by someone who did not understand what they were doing and did not realise that a chi-squared test is meaningless here.

Table 5
The data that would be needed to reproduce Table 5 do not seem to be available in the Excel file.

Table 6
As with Table 4, the first part of this table cannot be reproduced, as the "Prognosis" variable is not available in the Excel file.

I have not reproduced the second part of Table 6 as it appears to be redundant or to use unavailable data. The RT-PCR results correspond to the last lines of Tables 2 and 3. Interestingly, the "Hospital Stay" variable appears to be different from "RT-PCR" here, although as I hope to have demonstrated earlier, the variable marked "Hospital Stay" in column Y of the data file has a very close relationship with the difference between columns W ("symptoms date&+ ve PCR") and X ("recovery date & -ve PCR"). It seems that the authors are unsure whether the difference in days between the first positive and last negative PCR test (columns W and X) corresponds to the hospital stay or not, with or without an adjustment of one day for the "fencepost" issue mentioned earlier.

Other issues

The age distribution

The distribution of patient ages is very strange. There are 34 patients aged 48 and 31 aged 58, but only 3 aged 50 and 4 aged 53. Furthermore, of the 600 patients, 410 have an age that is an even number of years while only 190 have an age that is an odd number of years.

It is difficult to see how any of this could have arisen by chance. (The R function pbinom() reports that the binomial probability of 399 out of 600 ages being even is 1.11E-16; the chance of there being 400 or more even ages out of 600 is too small for R to calculate.)
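Floating-point underflow can be sidestepped by computing the tail probability with exact integer arithmetic. The Python sketch below (a hypothetical helper, not part of the R code for this post) assumes that odd and even ages are equally likely and computes the exact probability of seeing at least 410 even ages out of 600.

```python
from fractions import Fraction
from math import comb


def upper_tail(n: int, k: int) -> Fraction:
    """Exact P(X >= k) for X ~ Binomial(n, 1/2)."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(favourable, 2 ** n)


# Probability of 410 or more even ages among 600 patients,
# assuming that each age is independently even with probability 1/2.
p_410 = upper_tail(600, 410)
print(float(p_410))
```

The result is an exact fraction, which is only converted to a float for display.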

Trailing digits of numerical variables

Kyle Sheldrick discovered that of the 400 values of the variable "serum ferritin before treatment", only three end in the digit 3. I looked at the other numerical results in the preprint and found that in almost all cases the distribution of the trailing digits is extremely unusual, and in contrast with what one would expect from Benford's Law, which—although it is perhaps best known for its predictions about the first digits of data corresponding to natural phenomena anchored at zero—shows that for the digits of a random variable apart from the first, the expected distribution is approximately uniform. (In November 2020 I used the predictions of the distributions of trailing digits from Benford's Law to demonstrate that the official Covid-19 statistics from the Turkish Ministry of Health were probably fabricated.)

In cases where the dominant trailing digit is zero we might allow for the possibility that different people collected data to different degrees of precision, thus leading to numbers being rounded and, consequently, a trailing digit of zero more often than might be expected by chance. But this cannot explain why, for example, 82% of the numbers for HGB end in the digits 2–5, or why 17.5% of the numbers for TLC end in 8 whereas none end in 2. The large chi-square statistics and their associated homeopathic p values in the tests of the trailing digits from Elgazzar et al.'s data file suggest that none of these patterns are the result of a natural process. They are, however, highly compatible with the idea that the numbers in the Excel table have either been copied and pasted in bulk, or invented out of whole cloth by someone who was trying (and failing) to simulate random numbers—an activity that humans are not very good at.

Counts of the trailing digits (0–9) of various numeric variables in Elgazzar et al.'s data file, and the chi-square statistics for the test against the null hypothesis that their distribution is uniform.
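A test of this kind can be sketched in a few lines of Python (not the R code used for this post). The function name is hypothetical, and the values are assumed to be supplied as strings exactly as they were typed, so that a trailing zero is not lost. With ten possible digits there are 9 degrees of freedom, for which the critical value of the chi-square distribution at the 0.05 level is 16.919.

```python
from collections import Counter

CRITICAL_05_DF9 = 16.919  # chi-square critical value, alpha = 0.05, 9 df


def trailing_digit_chisq(values):
    """Chi-square statistic for last digits against a uniform distribution.

    values: iterable of strings holding the numbers as recorded.
    """
    digits = [str(v).strip()[-1] for v in values]
    counts = Counter(digits)
    expected = len(digits) / 10
    return sum((counts.get(str(d), 0) - expected) ** 2 / expected
               for d in range(10))


# Hypothetical examples: perfectly uniform trailing digits, and a column
# in which every value ends in 5.
uniform = [str(n) for n in range(100)]
all_fives = [str(10 * n + 5) for n in range(100)]
for name, sample in (("uniform", uniform), ("all fives", all_fives)):
    stat = trailing_digit_chisq(sample)
    print(name, stat, stat > CRITICAL_05_DF9)
```

A statistic above the critical value indicates that the trailing digits are not plausibly uniform; the sketch stops short of computing an exact p value.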

Study entry and exit dates

The preprint states (p. 3) that "The study was carried out from 8th June to 15th September 2020". This seems to conflict with the study's registration on ClinicalTrials.gov, which states that the "Actual Study Completion Date"—defined as "The date [of] the last participant's last visit"—was 30 October 2020. We cannot, perhaps, infer much from the fact that the last recorded entry (positive PCR) and exit (negative PCR) dates in the Excel file are 18 August 2020 and 21 August 2020, respectively, as we do not have date information for the outpatients (groups V and VI). However, we can see that there are 120 patients (71 in group II, 3 in group III, and 47 in group IV) with an entry date prior to 8 June 2020, with the earliest being 12 May 2020. Similarly, there are 49 patients (31 in group II, 1 in group III, and 17 in group IV) with an exit date prior to 8 June 2020, with the earliest being 23 May 2020.


Another strange feature of this story is that, although the authors claim to have performed their analyses using SPSS (p. 4 of the preprint), they did not share the SPSS data file (in .SAV or .CSV format), although this would have been a much better way to allow readers to reproduce their analyses. Instead they shared what they called the "study data master sheet". As I have shown here, these data (a) contain numerous signs of manipulation and (b) once cleaned up and analysed with the same statistical tests that the authors used, mostly (but, perhaps significantly, not entirely) fail to produce the results reported by the authors in their Results section text and tables.

There is another curious sentence in the preprint that makes me wonder whether the authors actually used SPSS at all, or indeed have ever done so. On p. 5 they wrote "After the calculation of each of the test statistics, the corresponding distribution tables were counseled to get the 'P' (probability value)". Assuming that "counseled" here is a typo for "consulted", it appears that the authors' claim is that they read the test statistics from the SPSS output and then looked up the corresponding p values in a table, such as the one on this page. I wonder why anyone would do this, given that the SPSS output for all of the tests that the authors reported having run contains the p value right next to the test statistic. Looking up test statistics in a table to get the p value has been out of fashion since we stopped computing t statistics using pencil and paper, circa 1995 ("Ah, now I know why my desk calculator has a square root key").


In view of the problems described in the preceding sections, most notably the repeated sequences of identical numbers corresponding to apparently "cloned" patients, it is difficult to avoid the conclusion that the Excel file provided by the authors does not faithfully represent the results of the study, and indeed has probably been extensively manipulated by hand.

In some cases where forensic researchers have discovered discrepancies between images or datasets and the results reported in a paper, the authors have attempted to claim that they "accidentally" provided a "training" version that had been made to calibrate their software. It does not seem possible that such a defence could be used in this case, however, since the Excel sheet provided by Elgazzar et al. cannot possibly have been used for this purpose, in view of the extensive amount of manual cleaning that would be required to make it useable for any purpose.

I urge the authors to make their SPSS data file publicly available without delay, in order that we can see the exact numbers on which their analyses were based—because, as demonstrated above, those numbers cannot be those in the Excel file. If the authors cannot provide their SPSS data file then I believe that either they or Research Square should consider retracting their preprint as a matter of urgency.


Resources

I have made the following files available here:
  • The unlocked original data file, named "Copy of covid_19 final master sheet12-12-2020 (1).xlsx". That is exactly the name that it would have if you were to download it (albeit still locked with a password at that point).
  • The password-protected version of the original data file, to the name of which I have prepended (Locked), so the full name is now "(Locked) Copy of covid_19 final master sheet12-12-2020 (1).xlsx". The password for read-only access is 1234. If someone has the right tools to extract the second-level password that is needed to modify the contents of the file, I'd be curious to know what it is. [[ Update 2021-07-17 15:58 UTC: Graham Sutherland has discovered that there are many passwords that unlock read-write access, the simplest of which appears to be 0001. ]]
  • An annotated version of that file, named "(Nick) Elgazzar data.xls". I have highlighted the anomalous individual cells in yellow and the runs of duplicated cells in a variety of other colours.
  • Slightly higher resolution versions of the images from this post, named "Table1.png" (etc), "SPRITE-Table4.png", and "Trailing-digits.png".
  • My analysis code, named "Elgazzar.R".


Thanks to Jack Lawrence (@JackMLawrence) for bringing the paper to my attention and cracking the password of the data file; Gideon Meyerowitz-Katz (@GidMK) for pointing out the Table 4 standard deviation problem, the issue with the patient dates relative to the reported study dates, and the claim that the p values were looked up in a table; and Kyle Sheldrick (@K_Sheldrick) for making the initial discovery of the lack of trailing 3s in the serum ferritin numbers.

(*) Things move fast in Covid world. Less than 24 hours before this blog post was due to be published, Research Square posted "V4" of the preprint, which is simply a placeholder that says "Research Square has withdrawn this preprint due to ethical concerns". To ensure that the preprint does not disappear, I have posted the PDFs of the first three versions in the same location as the other supporting files for this post (see "Resources", above).

2021-07-14: Research Square withdraws the preprint.