02 October 2021

Applying John Carlisle's "Table 1" analysis to some claims about the treatment of Covid-19 using anti-androgens

Back in 2017, Dr. John Carlisle published this article in which he described a novel method for examining the table of participants' baseline characteristics (usually Table 1) in reports of randomised trials. Table 1 is typically used to show that the randomisation "worked", which in general means that the groups do not differ on important variables more often than we would expect by chance (with the p value of the final comparison expected to, in effect, "mop up" any differences that do occur).

Carlisle's insight was that it is possible for the baseline characteristics to be too similar across groups. That is, in some cases, we do not see the random variation that we would expect if the assignment to groups is truly random. For example, if you have 100 participants and 20 of them have some condition (say, diabetes), and each participant is independently assigned to one of two groups, then while a 10–10 split of those 20 across the groups is the most likely individual outcome (about 17.6% of the time), you would expect a 7–13 (or more extreme) split about 26.3% of the time. If Table 1 contains a large number of even or near-even splits, that can be a sign that the randomisation was not done as reported, because there is just not enough genuine randomness in the data.
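These figures are easy to verify in R, under the assumption that each of the 20 affected participants is independently assigned to either group with probability 0.5:

    dbinom(10, size = 20, prob = 0.5)                             # P(exact 10-10 split) = 0.176
    pbinom(7, 20, 0.5) + pbinom(12, 20, 0.5, lower.tail = FALSE)  # P(7-13 or more extreme) = 0.263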

We can quantify the degree of randomness that is present by looking at the p values for the different between-group baseline tests in Table 1. If all of the variables are truly randomised, we would expect these p values to be uniformly distributed with a mean of 0.50. With a sufficiently large number of variables, we would expect 10% of the p values to be between 0 and 0.1, a further 10% to be between 0.1 and 0.2, and so on. If we see mostly p values above 0.8 or 0.9, this suggests that the baseline similarities between the groups could be too good to be true. A statistical test attributed to Stouffer and Fisher can be used to determine how likely the set of p values that we observe would be if the group assignment were truly random.

Carlisle's method has limitations, some of which I blogged about at the time (see also the comments under that post, including a nice reply from John Carlisle himself) and some of which were mentioned by proper methodologists and statisticians, for example here and here. Those limitations (principally, the possibility of non-independence of the observations being combined from Table 1, and their sometimes limited number) should be borne in mind when reading what follows, but (spoiler alert) I do not think they are sufficient to explain the issues that I report here.

In this post, I'm going to apply Carlisle's method, and the associated Stouffer-Fisher test, to two articles that have made claims of remarkably large positive effects of anti-androgenic drugs for the treatment of Covid-19:

Cadegiani, F. A., McCoy, J., Wambier, C. G., Vaño-Galván, S., Shapiro, J., Tosti, A., Zimerman, R. A., & Goren, A. (2021). Proxalutamide significantly accelerates viral clearance and reduces time to clinical remission in patients with mild to moderate COVID-19: Results from a randomized, double-blinded, placebo-controlled trial. Cureus, 13(2), e13492. https://doi.org/10.7759/cureus.13492

Cadegiani, F. A., McCoy, J., Wambier, C. G., & Goren, A. (2021). Early antiandrogen therapy with dutasteride reduces viral shedding, inflammatory responses, and time-to-remission in males with COVID-19: A randomized, double-blind, placebo-controlled interventional trial (EAT-DUTA AndroCoV Trial – Biochemical). Cureus, 13(2), e13047. https://doi.org/10.7759/cureus.13047

I will refer to these as Article 1 and Article 2, respectively. Article 2 also seems to be closely related to this preprint, which reports results from what appear to be a superset of its participants; I will refer to this as Article 3. Below, I also report the results of the analyses using Carlisle's method on this preprint; however, because of the similarity between the two samples I don't think that it would be fair to the authors to claim that the issues that I report here have been found in three, rather than two, articles.

Cadegiani, F. A., McCoy, J., Wambier, C. G., & Goren, A. (2020). 5-alpha-reductase inhibitors reduce remission time of COVID-19: Results from a randomized double blind placebo controlled interventional trial in 130 SARS-CoV-2 positive men. medRxiv. https://doi.org/10.1101/2020.11.16.20232512

Method

For each article, I extracted the contents of Table 1 to a text file and, using global commands in the "vim" text editor as far as possible, converted each line into a call to a custom-written function that calculated the p value for that variable.

For variables where the p value is calculated from an independent-samples t test, my custom function determined the maximum possible t statistic (and, hence, the minimum possible p value), by adding the maximum possible rounding error to the larger mean, subtracting the same maximum possible rounding error from the smaller mean, and subtracting the maximum possible rounding error from the standard deviation of each group (cf. Brown & Heathers, 2019, "Rounded Input Variables, Exact Test Statistics (RIVETS)", https://psyarxiv.com/ctu9z/). I believe that doing this works in the authors' favour, as the majority of the p values in these analyses come from contingency tables and are rather large; that is, getting the smallest possible p value from the t tests tends to increase the overall Stouffer-Fisher test p value.
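A minimal sketch of that calculation (the function name is mine; I assume here that the means and SDs were reported to two decimal places, and I use a Welch t test, so my actual code, linked under "Materials" below, differs in its details):

    rivets.min.p <- function(m1, m2, sd1, sd2, n1, n2, grain = 0.01) {
      err <- grain / 2                        # maximum possible rounding error
      diff <- abs(m1 - m2) + 2 * err          # push the two means as far apart as possible
      s1 <- sd1 - err                         # shrink both SDs as much as possible
      s2 <- sd2 - err
      se <- sqrt(s1^2 / n1 + s2^2 / n2)       # Welch standard error
      df <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
      2 * pt(diff / se, df, lower.tail = FALSE)   # two-tailed minimum p value
    }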

For the majority of the variables, where the p value is derived from an analysis of 2x2 contingency tables, my custom function applied the following rules:
  • If any cell of the table contains 0, return NULL (the variable is not considered to have returned a p value).
  • If any cell of the table contains a number less than 5, apply Fisher's exact test. This cutoff is of course arbitrary (but it doesn't make too much difference anyway).
  • Otherwise, apply a chi-square test with Yates' continuity correction for 2x2 tables.
Other analyses are possible, but this is what I decided to do a priori. I call this "Analysis 1a" below. In Analysis 1b I also included tables where one or more cells are zero (again using Fisher's exact test). In Analysis 1c I excluded all variables where one or more cells had a value less than 3, so that any variable where 2 or fewer people in either condition had (or did not have) the attribute in question was excluded. In Analyses 2a, 2b, and 2c I applied the same rules for inclusion as in 1a, 1b, and 1c, respectively, but used the chi-square test throughout. (This means that Analyses 1c and 2c are identical, as there are no variables in 1c for which my rules would select Fisher's exact test.) In Analyses 3a, 3b, and 3c I used Fisher's exact test throughout.
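In outline, the decision rules for Analysis 1a look like this (a sketch only; my actual function, available under "Materials" below, also handles the extraction of the counts from Table 1):

    p.from.2x2 <- function(tab) {                          # tab: a 2x2 matrix of counts
      if (any(tab == 0)) return(NULL)                      # drop the variable entirely
      if (any(tab < 5)) return(fisher.test(tab)$p.value)   # small cells: Fisher's exact test
      chisq.test(tab, correct = TRUE)$p.value              # otherwise: Yates-corrected chi-square
    }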

After calculating the p values, I replaced any that were greater than 0.98 with exactly 0.98, which avoids problems in the calculation of the Stouffer-Fisher formula in R with values of exactly 1.0 (for which qnorm() returns infinity); such values will occur, for example, if the number of cases in a contingency table is identical across conditions, or differs only by 1. Again, I believe that this choice works in the authors' favour. Then I calculated the overall Stouffer-Fisher p value using the method that I described in my blog post about Carlisle's article:
  1. Convert each p value into a z score.
  2. Sum the z scores.
  3. If there are k scores, divide the sum of the z scores from step 2 by the square root of k.
  4. Calculate the one-tailed p value associated with the overall z score from step 3.

For example, for Article 1 with Analysis 2c (see the table below), 26 p values are retained from 55 variables: (0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682, 0.817, 0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424, 0.918, 0.884, 0.959, 0.353, 0.980, 0.511, 0.980, 0.512, 0.980, 0.980). These correspond to the z scores (-1.662, -0.175, 0.533, 0.779, 1.924, -0.531, 2.054, 0.473, 0.904, 1.265, 0.939, -0.446, 2.054, 2.054, -0.749, -0.193, 1.393, 1.198, 1.739, -0.376, 2.054, 0.028, 2.054, 0.030, 2.054, 2.054), which sum to 21.448. We divide this by the square root of 26 (i.e., 5.099) to get an overall z score of 4.206, which in turn gives a p value of 0.000013 (1.30E-05).
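That example can be reproduced in a few lines of R:

    p.values <- c(0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682,
                  0.817, 0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424,
                  0.918, 0.884, 0.959, 0.353, 0.980, 0.511, 0.980, 0.512,
                  0.980, 0.980)
    z.scores <- qnorm(p.values)                          # step 1
    z.overall <- sum(z.scores) / sqrt(length(z.scores))  # steps 2 and 3
    pnorm(z.overall, lower.tail = FALSE)                 # step 4: 1.30e-05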

Note that we do not take the absolute value of each z score, because the sign is important. A p value below/above 0.5 corresponds to a negative/positive z score. If the positive and negative z scores cancel each other out, the overall Stouffer-Fisher p value will be 0.5, which is what we expect to see on average with perfect randomisation.

Results

Here are the p values that I obtained from each article using each of the analysis methods described above. Note that a value of zero corresponds to a p value below 2.2E-16, the smallest p value that R will report (on my computer, anyway). The 10 columns to the right of the overall p value show the deciles of the distribution of p values of the individual tests that make up the overall score.


It can readily be seen that in Articles 1 and 2, which are the main focus of this post, the p value is very small whatever combination of analyses and exclusions is performed, and the distribution of p values from the individual comparisons is very heavily skewed towards values above .7, and especially above .9. Even when I excluded a large number of comparisons because there were fewer than 3 cases or non-cases in one or more conditions (Analyses 1c, 2c, and 3c), a decision that is heavily in favour of the authors, the largest p value I obtained for Article 1 was 0.0000337, and for Article 2 the largest p value was 0.00000000819. (As mentioned above, the results for Article 3 should not be considered to be independent of those of Article 2, but they do tend to confirm that there are potentially severe problems with the randomisation of participants in both reported versions of that study.)

Conclusion

The use of novel statistical methods should always be approached with an abundance of caution; indeed, as noted earlier, I have had my own criticisms of John Carlisle's approach in the past. (Now that I have adopted Carlisle's method in this post, I hope that someone else will take the part of the "Red Team" and perhaps show me where my application of it might be invalid!)

However, in this case, I believe that my comments from 2017 about the non-independence (and low number) of baseline measures do not apply to anything like the same extent. In addition, I have run multiple analyses of each of the articles in question, and tried to make whatever choice favoured the authors whenever the opportunity presented itself. (Of course, I could have made even more conservative choices; for example, I could have excluded all of the contingency tables where any of the cells were less than 6, or 10. But at some point, one has to accept at face value the authors' claims that their Table 1 results (a) are worth reporting and (b) demonstrate the success of their randomisation.)

Despite that, I obtained what seems to be substantial evidence that the randomisation of participants to conditions in the two studies that are the respective subjects of these articles may not have taken place exactly as reported. This adds to a growing volume of public (and, in some cases, official) critique of the Covid-19 research coming from this laboratory (e.g., in English, here, here, here, here, and here, or in Portuguese here, here, here, here, here, here, and here).

Materials

My R code is available here. The three referenced articles are published either under open access conditions or as a preprint, so I hope they will be easy for readers to download when checking my working.

[[ Update 2021-10-03 12:56 UTC: Added the columns of p value deciles to the table of results. ]]

14 September 2021

Some problems in a study of Nitazoxanide as a treatment for Covid-19

Nitazoxanide (NTZ) is an antiparasitic medication that has been proposed as a possible treatment for Covid-19. (The currently most widely-touted repurposed drug for Covid-19, Ivermectin, is also an antiparasitic.) The effects of NTZ were recently examined in this paper:

Blum, V. F., Cimerman, S., Hunter, J. R., Tierno, P., Lacerda, A., Soeiro, A., Cardoso, F., Bellei, N. C., Maricato, J., Mantovani, N., Vassao, M., Dias, D., Galinskas, J., Janini, L. M. R., Santos-Oliveira, J. R., Da-Cruz, A. M., & Diaz, R. S. (2021). Nitazoxanide superiority to placebo to treat moderate COVID-19: A pilot prove [sic] of concept randomized double-blind clinical trial. EClinicalMedicine, 37, 100981. https://doi.org/10.1016/j.eclinm.2021.100981

The authors reported (p. 9) that "The study protocol and all data files from the study results and programs to process the files are deposited in a publicly available repository on the GitLab system for this clinical trial", with a link. To me, this implies that the data and code should have been available on the date when the article was published, i.e., June 27, 2021. However, it appears that no files were actually uploaded to that repository until August 29, 2021, which, perhaps not coincidentally, was shortly after my colleague Kyle Sheldrick had asked the authors to post their data.

The GitLab repository contains the data divided into two sets, one with the participant characteristics ("Demographic_NTZ_Trial_Data") and one with the observations from each day of the trial ("Clinical_Laboratory_NTZ_Trial_Data"). Each of these is ostensibly provided in two formats, as both an Excel sheet (with the extension ".xlsx") and as the binary dump of an R data structure (".rds"). For "Demographic_NTZ_Trial_Data" the Excel and R versions appear to correctly contain the same data. Unfortunately, the two R binary files are byte-for-byte identical to each other, suggesting that the wrong file has been uploaded for “Clinical_Laboratory_NTZ_Trial_Data"; hence, I relied on the Excel version of both files.
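The identity of the two .rds files is easy to confirm by hashing them (the file names here are as they appear in the repository, with the .rds extension):

    tools::md5sum(c("Demographic_NTZ_Trial_Data.rds",
                    "Clinical_Laboratory_NTZ_Trial_Data.rds"))   # both hashes come out the same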

Also perhaps of note is that the metadata (“File/Properties/Summary” and “File/Properties/Statistics”) of the Excel files reveal that they were created by “Apache POI” on August 29, 2021 at 19:13:34 UTC (“Demographic_NTZ_Trial_Data”) and 20:27:18 UTC (“Clinical_Laboratory_NTZ_Trial_Data”). The file creator name “Apache POI” suggests that these files were created by the R function write.xlsx(), which leaves this “signature” from the Java library that it uses to create Excel files. This in turn suggests that there ought to be a more fundamental version of the data files somewhere, perhaps in CSV format. It would be nice if the authors could share this “rawer” form of the data with us, rather than the output of an R program after some unknown amount of preprocessing.
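For what it's worth, the POI signature is easy to reproduce with the xlsx package (a sketch; note that the identically-named function in the openxlsx package does not use Apache POI):

    library(xlsx)                                    # writes .xlsx files via Apache POI (Java)
    write.xlsx(data.frame(x = 1:3), "example.xlsx")  # "Apache POI" appears in the file's metadata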

The GitLab repository contains only the above-mentioned data files, plus a "data dictionary" containing a one-sentence description of each variable and a brief "Readme" file. The authors have therefore apparently still not fulfilled their promise to provide the study protocol or their analysis code.

All of the following analyses are based on the two Excel files, in which several of the variables are labeled and/or coded in Portuguese; I believe that I have appropriately interpreted these labels, so that for example the variable “sexo” with the value “Masculino” corresponds to a male patient (reported by the authors in English as "Masculine").

The table of results

The basis of the majority of the authors' claims regarding the superiority of NTZ is the main table of results (Table 1, p. 5). Unfortunately, this table seems to contain a large number of errors and inconsistencies with the supplied data files. Some of these are more serious than others; I will describe them here from the top to the bottom of the table, rather than attempting to order them by some subjective measure of severity.

  1. Table 1 reports that there were 7 men ("Gender: Masculine") and 18 women ("Gender: Feminine") in the NTZ group, versus 8 men and 17 women in the placebo group. However, the data show these numbers the other way around, not by group but by gender: in the NTZ group, 18 patient records have the value "Masculino" and 7 have "Feminino" for the variable named "sexo", whereas in the placebo group, 17 records have "Masculino" and 8 have "Feminino". That is, the data say that two-thirds of the patients were men, while the table of results says that two-thirds were women. The article text does not discuss the sex or gender of the patients as a group, but it does not seem unreasonable to expect that at least one of the article's 17 authors might have noticed that the proportions of men and women were reported entirely incorrectly.
  2. The variables "Race" and "Age group" list counts of the patients in each of these categories, and express these as percentages. However, these percentages are not of the number of people in each experimental condition, but of the total number of patients. For example, the first value for Race is White: 21 (42%), but 21 is 84% of the 25 people in the NTZ condition.
  3. The calculation of "RT-PCR Difference Day 1 - 21" does not seem to be correct. In the NTZ group, there are no records with measurements of PCR viral load at day 21, so there is no way to calculate the mean or SD of the difference between the two timepoints. In the placebo group there are four such records; I calculate the mean difference between the day-1 and day-21 viral load for these cases to be 0.79 with an SD of 3.45.
  4. The calculation of mean and SD values for "Removed from Supplemental O2" does not seem to be possible, as there is no variable corresponding to this in the data file. (The authors provided a data dictionary in their GitLab repository, which does not apparently mention any variable corresponding to supplemental oxygen.) Perhaps this variable is based on the clinical condition score changing from 3 to a lower value on a subsequent day, but this only occurs in three cases in the data file (patient 101008, day 7, change from “3” (supplemental O₂) to “2” (hospitalised); patient 103035, day 7, change from “3” to “2”; patient 104001, day 14, change from “3” to “2”), and all three of these participants are in the placebo group. It does not seem to be possible to calculate a meaningful mean or standard deviation for these values even for the placebo group, let alone for the NTZ group where there are zero cases.
  5. The p values for "Viral Load at Day 1" and "O2 Saturation at Day 1" in the table are both reported as 0.984. Using R's wilcox.test() function, I calculated these values as 0.168 and 0.381 respectively. The authors also mention (p. 4) that they used the Kruskal-Wallis test in some cases; with R's kruskal.test() function, I obtained p values of 0.167 and 0.380 respectively. (See the code sketch after this list.)
  6. Many of the numbers for the patient clinical condition scores in the lines "Patients Hospitalized" (6/8), "Oxygen supplementation" (8/8), and "Death" (3/6) are different between the data file and the article. Interestingly, all eight numbers for "Invasive mechanical ventilation" seem to be correct. See my annotated version of the table for more information.
  7. It is not clear how the p values for the patient clinical conditions were calculated. This could have been a simple two-count chi-square test (e.g., in R, chisq.test(c(2, 1)) gives a p value of 0.564, which appears a couple of times in the table), or it could perhaps have been a 2 x 2 contingency table, with the denominator being the number of patients in the other conditions. In any case, these p values do not appear to match either the calculated or reported frequencies.
  8. The mean and SD values for Lymphocytes at day 1 in the placebo condition and at day 7 in the NTZ condition have been exchanged. That is, according to the data file, the mean lymphocyte value on day 1 in the placebo condition was 1162.36, and the mean value on day 7 in the NTZ condition was 944.88 (a decrease on the day 1 value, rather than the increase that was reported in Table 1).
  9. For the measures "D-Dimer", "US- C Reactive Protein", "TNF-α", "IL-6", and "IL-8", there are several problems. First, the table reports values at day 10, but there are no records of any kind with this day in the data files. Second, for "TNF-α", "IL-6", and "IL-8", the table reports mean values on day 1, but none of the patient records for day 1 have any values other than "#N/A" for any of these three variables. Third, none of the remaining mean or SD values come close to the values that I calculated from the data file.
  10. A column in Table 1 shows the ratio between the means or counts in the two groups and describes this as the "Rate ratio". However, there are some rather strange numbers in this column. In many cases the decimal point (or comma) is missing, so that for example the ratio of 7 "Mixed" (race) patients in the placebo group to 3 in the NTZ group is reported as 2333. In other cases, even with a decimal separator assumed, the numbers make little sense. For example, the "ratios" between the numbers of patients on invasive mechanical ventilation at days 4 (3 placebo, 1 NTZ) and 7 (4 placebo, 1 NTZ) are reported as 3333 and 4444. One wonders what kind of analysis code generated these numbers.
  11. The "Rate ratio" values under "Day 21 Difference" for the measures "D-Dimer" and "US- C Reactive Protein" have been exchanged in the table, relative to what is in the data file.

The regression lines

Another set of problems is apparent in Blum et al.'s Figure 2, which the authors claim (p. 6) “shows that the viral load for the NTZ arm of the study dropped slightly faster than the Placebo arm over the 21 days of the study (slope of 1.19 for NTZ against 1.08 for the placebo)”.

First, the variable corresponding to the PCR tests in the data files is the PCR Ct (cycle threshold, cf. the data file “ntz_data_dict_2808.html” in the authors’ GitLab repository), and a higher cycle threshold corresponds to a lower viral load. That is, both of these slopes ostensibly show the viral load increasing over the course of the study, with (on the authors’ account) the NTZ group getting worse to a greater degree than the placebo group. This apparent failure to understand the meaning of their own data might be considered quite concerning.

Second, it can readily be seen that the great majority of the visible data points here are above the two lines, with a considerable number being both above and far off to the right. A moment's thought suggests an explanation, namely that a very large number of data points with a viral load value of zero have been drawn on top of each other. However, in the data file, the lowest PCR value is 17. What seems to have happened is that, as well as 115 actual PCR values, the authors have coded 136 NA values from days 4 through 21 (corresponding either to missing tests or to patient dropout) as a value of zero. While some of these NA values might correspond to negative PCR tests, others must be missing data, as they occur later than the date at which the patient's condition code changed to “5” (death). When I omitted these NA values, I obtained the following plot:

On its own terms (but see my next point), this plot shows—rather more plausibly than the published Figure 2—that the PCR cycle threshold increased (and, hence, the viral load decreased) throughout the study, with a bigger decrease in the NTZ group.
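In outline, the re-analysis behind this plot looks like the following (the column names 'ct', 'day', and 'arm' are stand-ins for the actual variables in the Excel file):

    clin.obs <- subset(clin, !is.na(ct))   # drop the missing Ct values instead of recoding them as 0
    coef(lm(ct ~ day, data = subset(clin.obs, arm == "NTZ")))      # slope for the NTZ arm
    coef(lm(ct ~ day, data = subset(clin.obs, arm == "Placebo")))  # slope for the placebo arm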

The third problem with Figure 2 is that attempting to interpret the slope of a linear regression line (one of the assumptions of which is that each data point is independent of the others) in this way for time series data with multiple repeated measures on the same patients is completely invalid. No conclusions can be drawn on the basis of this model. The authors ought to have used some kind of generalized linear model, accounting for correlation between observations on the same patients and also the censoring effects of patient death or discharge.
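One possibility, sketched here with the same stand-in variable names as above (and leaving aside the censoring issue, which would need careful thought), would be a mixed model with a random intercept for each patient:

    library(lme4)
    fit <- lmer(ct ~ day * arm + (1 | patient_id), data = clin.obs)  # per-arm slopes, patient-level clustering
    summary(fit)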

Repeated-measures "ANOVA"

Figure 3 of Blum et al.'s article shows the results of analyses that they describe as "Two-way-ANOVA". It is not clear exactly what analyses were performed here. On p. 4 the authors say "Statistical analyses included one-way analyses of variances using the Kruskal-Wallis non-parametric ranks test. Additional statistical questions were addressed with the Wilcoxon Rank Sum Test for numerical variables and Chi-Squared tests for categorical variables." Since the Wilcoxon rank-sum test is sufficient to conduct all of the analyses in Table 2, one might imagine that "Two-way-ANOVA" could include the Kruskal-Wallis test, although that is not appropriate for repeated measures.

In any case, some of the numbers reported in Figure 3 are rather strange. For example, in panels B and D, a significance star is assigned to p values less than 0.5, which does not seem to be a very high bar to meet. Furthermore, the caption for Figure 3 includes the sentence "The MFI of CD38+CD4+ T cells (3A), HLA-DR.+CD4+ T cells (3B), ... are expressed as the median ± standard deviation" (emphasis added), which is a rather unusual combination.

I was unable to reproduce Figure 3 for two reasons. First, there are no data in the files for day 10, nor with any label that corresponds to the Y axis of panels E and F. Second, the variables that do exist in the data file with names that resemble the labels of the Y axis of panels A through D (i.e., "cd38_cd4", "cd38_cd8", "hla_dr_cd4", and "hla_dr_cd8") have mean values in the range 0.5–2.0, whereas the Y axes in Figure 3 are two or more orders of magnitude larger; it is not clear to me whether I have misunderstood what these variables mean, or whether the plotted variables are simply missing from the data files.

Analyses of individual patients

The authors provided detailed information in the text of their article about the patients who died in the study. However, in some cases it is difficult to identify these individuals in the data, as there are a considerable number of discrepancies between the numbers reported in the text and those in the data file.

  1. Male deaths in the NTZ arm

“Two patients died in the NTZ arm of the study. One of these, patient P3 was a 67-year-old male with systemic arterial hypertension (SAH) and a BMI (body mass index) of 25. … Another patient in the NTZ group, P6, was a 63-year-old male with type 2 diabetes mellitus and dyslipidemia, with a BMI of 31.”

The data contain two records of patients dying in the NTZ arm. One (numerical ID 101004) was aged 77 with a BMI of 24.9, while the other (numerical ID 106016) was aged 63 with a BMI of 30.74. We might tentatively assume that these individuals correspond to the two mentioned in the text. However, 101004, with “systemic arterial hypertension”, had a baseline systolic blood pressure of 120, whereas 106016, with no mention of hypertension, had a systolic BP of 150. The reported C-reactive protein for the latter also does not match the data. Furthermore, for both of these patients (and indeed for all of the patients who died), IL-6 values are reported in the text, although the IL-6 numbers in the data file are NAs in all cases (except for one reading for patient 103004; see the next paragraph).

  2. Female deaths in the placebo arm

“P8 was an 88-year-old female with SAH, BMI of 24 … and IL-6 of 4872 MFI”. There is one 88-year-old female patient in the dataset (103004); her BMI is recorded as 28.8. For this patient, unlike all the others who died, there is one IL-6 reading, of 225.5 on day 21. That is the date on which she is reported as having died in the text of the article, although the data file “Demographic_NTZ_Trial_Data” states that she left the trial early (due to death) on day 7.

“Patient P14 was a 65-year-old female with previously not well-defined cardiomyopathy, BMI of 25. … and IL-6 of 4658 MFI. She died on D4...” There are no 65-year-old females in the data, whatever their outcome. By elimination (see the next paragraph), it appears that P14 was in fact the patient with numerical ID 106010, aged 76, who according to the data file “Demographic_NTZ_Trial_Data” died on day 6. Again, there are no IL-6 data for this patient in the data file.

“The last patient in the placebo group to die was P43, a 73-year-old female with obesity (BMI of 32). Baseline laboratory evaluation revealed a total lymphocyte count of 930 (total leucocytes 13,480), D-dimer of 0.82 mcg/mL, US-RCP of 187.59 mcg/L, and IL-6 of 417 MFI. She died on D12...” This appears to be patient 106038, who according to the data file was aged 72 (rather than 73). Again, there are no IL-6 data for this patient in the data file.

  3. Male deaths in the placebo arm

Three deaths among male participants were reported in the placebo arm. Apart from small discrepancies in their ages in two cases (P24, 106012, reported age 55, data file age 55; P27, 106014, reported age 71, data file age 70; P17, 106028, reported age 78, data file age 79), the only obvious discrepancy for these three cases is the reporting of IL-6 data in the text when these values are not present in the data.

Conclusion

Blum et al.'s article contains a large number of both major and minor statistical and reporting errors, some of which are apparent even without the data set. Whether or not these issues are so great as to make the article unsalvageable is a matter for the authors and the journal’s editorial team.

I have made my analysis code, together with full-size versions of the images, available at https://osf.io/uz875/

Thanks to Kyle Sheldrick and Gideon Meyerowitz-Katz for bringing the paper to my attention, and for being the first to notice the problem with the NA values being interpreted as zeroes in Figure 2.