Back in 2017, Dr. John Carlisle published this article in which he described a novel method for examining the table of participants' baseline characteristics (usually Table 1) in reports of randomised trials. Table 1 is typically used to show that the randomisation "worked", which in general means that the groups do not differ on important variables more often than we would expect by chance (with the p value of the final comparison expected to, in effect, "mop up" any differences that do occur).
[[ Update 2023-06-04 15:40 UTC: Please read my additional comments just before the "Materials" section of this post. ]]
Carlisle's insight was that it is possible for the baseline characteristics to be too similar across groups. That is, in some cases, we do not see the random variation that we would expect if the assignment to groups is truly random. For example, if you have 100 participants and 20 of them have some condition (say, diabetes), then although a 10–10 split across two groups is the most likely single outcome (occurring about 17.6% of the time), you would expect a 7–13 (or more extreme) split about 26.5% of the time. If Table 1 contains a large number of even or near-even splits, that can be a sign that the randomisation was not done as reported, because there is just not enough genuine randomness in the data.
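As an illustration of those figures, here is a quick check in R (my own sketch, under the simplifying assumption that each of the 20 affected participants is independently equally likely to end up in either of two equal-sized groups; a different assignment model would give slightly different numbers):

    # Probability that the 20 affected participants split exactly 10-10 across the two groups
    dbinom(10, size = 20, prob = 0.5)        # about 0.18

    # Probability of a split at least as uneven as 7-13 (i.e., 7 or fewer in one group)
    2 * pbinom(7, size = 20, prob = 0.5)     # about 0.26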
We can quantify the degree of randomness that is present by looking at the p values for the between-group baseline tests in Table 1. If the participants are truly randomised, we would expect these p values to be uniformly distributed between 0 and 1, with a mean of 0.50. With a sufficiently large number of variables, we would expect 10% of the p values to be between 0 and 0.1, a further 10% to be between 0.1 and 0.2, and so on. If we see mostly p values above 0.8 or 0.9, this suggests that the baseline similarities between the groups could be too good to be true. A statistical test attributed to Stouffer and Fisher can be used to determine the probability of observing a set of p values at least as extreme as the one in the table if the assignment to groups was genuinely random.
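To see what that uniform distribution looks like in practice, here is a small simulation (my own illustration, using a continuous baseline variable compared with a t test; the choice of two groups of 50 is arbitrary):

    set.seed(1)
    # 10,000 simulated baseline comparisons between two genuinely randomised groups of 50
    p_sim <- replicate(10000, t.test(rnorm(50), rnorm(50))$p.value)
    mean(p_sim)                                  # close to 0.50
    table(cut(p_sim, breaks = seq(0, 1, 0.1)))   # roughly 1,000 p values in each decile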
Carlisle's method has limitations, some of which I blogged about at the time (see also the comments under that post, including a nice reply from John Carlisle himself) and some of which were mentioned by proper methodologists and statisticians, for example here and here. Those limitations (principally, the possibility of non-independence of the observations being combined from Table 1, and their sometimes limited number) should be borne in mind when reading what follows, but (spoiler alert) I do not think they are sufficient to explain the issues that I report here.
In this post, I'm going to apply Carlisle's method, and the associated Stouffer-Fisher test, to two articles that have made claims of remarkably large positive effects of anti-androgenic drugs for the treatment of Covid-19:
Cadegiani, F. A., McCoy, J., Wambier, C. G., Vaño-Galván, S., Shapiro, J., Tosti, A., Zimerman, R. A., & Goren, A. (2021). Proxalutamide significantly accelerates viral clearance and reduces time to clinical remission in patients with mild to moderate COVID-19: Results from a randomized, double-blinded, placebo-controlled trial. Cureus, 13(2), e13492. https://doi.org/10.7759/cureus.13492
Cadegiani, F. A., McCoy, J., Wambier, C. G., & Goren, A. (2021). Early antiandrogen therapy with dutasteride reduces viral shedding, inflammatory responses, and time-to-remission in males with COVID-19: A randomized, double-blind, placebo-controlled interventional trial (EAT-DUTA AndroCoV Trial – Biochemical). Cureus, 13(2), e13047. https://doi.org/10.7759/cureus.13047
I will refer to these as Article 1 and Article 2, respectively. Article 2 also seems to be closely related to this preprint, which reports results from what appear to be a superset of its participants; I will refer to this as Article 3. Below, I also report the results of the analyses using Carlisle's method on this preprint; however, because of the similarity between the two samples I don't think that it would be fair to the authors to claim that the issues that I report here have been found in three, rather than two, articles.
Cadegiani, F. A., McCoy, J., Wambier, C. G., & Goren, A. (2020). 5-alpha-reductase inhibitors reduce remission time of COVID-19: Results from a randomized double blind placebo controlled interventional trial in 130 SARS-CoV-2 positive men. medRxiv. https://doi.org/10.1101/2020.11.16.20232512
Method
For each article, I extracted the contents of Table 1 to a text file and, using global commands in the "vim" text editor as far as possible, converted each line into a call to a custom-written function that calculated the p value for that variable.
For variables where the p value is calculated from an independent-samples t test, my custom function determined the maximum possible t statistic (and, hence, the minimum possible p value), by adding the maximum possible rounding error to the larger mean, subtracting the same maximum possible rounding error from the smaller mean, and subtracting the maximum possible rounding error from the standard deviation of each group (cf. Brown & Heathers, 2019, "Rounded Input Variables, Exact Test Statistics (RIVETS)", https://psyarxiv.com/ctu9z/). I believe that doing this works in the authors' favour, as the majority of the p values in these analyses come from contingency tables and are rather large; that is, getting the smallest possible p value from the t tests tends to increase the overall Stouffer-Fisher test p value.
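To make that adjustment concrete, here is a minimal sketch of the idea (my own reconstruction, not the author's actual function; I have assumed a Welch-type t test and that the means and SDs are reported to the same number of decimal places, given by dp):

    # Smallest p value compatible with the reported (rounded) means and SDs
    min_p_from_t <- function(m1, sd1, n1, m2, sd2, n2, dp = 1) {
      err <- 0.5 * 10^(-dp)                  # maximum possible rounding error
      hi  <- max(m1, m2) + err               # push the larger mean up...
      lo  <- min(m1, m2) - err               # ...and the smaller mean down
      s1  <- sd1 - err                       # smallest SDs consistent with the rounding
      s2  <- sd2 - err
      se  <- sqrt(s1^2 / n1 + s2^2 / n2)     # Welch standard error
      t   <- (hi - lo) / se                  # largest possible t statistic
      df  <- se^4 / ((s1^2 / n1)^2 / (n1 - 1) + (s2^2 / n2)^2 / (n2 - 1))
      2 * pt(-abs(t), df)                    # smallest possible two-tailed p value
    }

    # Hypothetical example: reported means 37.2 vs 36.9, SDs 1.4 and 1.3, n = 87 and 88
    min_p_from_t(37.2, 1.4, 87, 36.9, 1.3, 88, dp = 1)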
For the majority of the variables, where the p value is derived from an analysis of 2x2 contingency tables, my custom function applied the following rules:
- If any cell of the table contains 0, return NULL (the variable will not be considered to have returned a p value).
- If any cell of the table contains a number less than 5, apply Fisher's exact test. This is of course an arbitrary distinction (but it doesn't make too much difference anyway).
- Otherwise, apply a chi-square test with Yates' continuity correction for 2x2 tables.
Other analyses are possible, but this was what I decided to do a priori. I call this "Analysis 1a" below. In Analysis 1b I also included tables where one or more cells are zero (using Fisher's exact test for those tables). In Analysis 1c I excluded all variables where one or more cells had a value less than 3, so that any variable for which 2 or fewer people in either condition had, or did not have, the attribute in question was excluded. In Analyses 2a, 2b, and 2c I applied the same rules for inclusion as in 1a, 1b, and 1c, respectively, but I used the chi-square test throughout. (This means that Analyses 1c and 2c are identical, as there are no variables in 1c for which my rules would mean that Fisher's exact test would be used.) In Analyses 3a, 3b, and 3c I used Fisher's exact test throughout.
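Here is a minimal sketch of the Analysis 1a rule for a single variable (again my own reconstruction, not the author's code; a and b are the numbers of participants with the attribute in each group, and na and nb are the group sizes):

    p_2x2_analysis_1a <- function(a, na, b, nb) {
      tab <- matrix(c(a, na - a, b, nb - b), nrow = 2)
      if (any(tab == 0)) {
        return(NULL)                              # empty cell: drop the variable
      }
      if (any(tab < 5)) {
        fisher.test(tab)$p.value                  # any cell below 5: Fisher's exact test
      } else {
        chisq.test(tab, correct = TRUE)$p.value   # otherwise: Yates-corrected chi-square
      }
    }

    # Hypothetical example: 12 of 50 versus 20 of 50 participants with some baseline condition
    p_2x2_analysis_1a(12, 50, 20, 50)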
After calculating the p values, I replaced any that were greater than 0.98 with exactly 0.98, which avoids problems with the calculation of the Stouffer-Fisher formula in R when a p value is exactly 1.0 (such a value converts to an infinite z score, and will occur, for example, if the number of cases in a contingency table is identical across conditions, or differs only by 1). Again, I believe that this choice works in the authors' favour. Then I calculated the overall Stouffer-Fisher p value using the method that I described in my blog post about Carlisle's article (sketched in code after the list):
- Convert each p value into a z score.
- Sum the z scores.
- If there are k scores, divide the sum of the z scores from step 2 by the square root of k.
- Calculate the one-tailed p value associated with the overall z score from step 3.
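Here is a compact sketch of those four steps (my own code, written to match the description above; the script linked in the Materials section is the authoritative version):

    stouffer_fisher_p <- function(p_values) {
      p_values <- pmin(p_values, 0.98)          # cap at 0.98, as described above
      z <- qnorm(p_values)                      # step 1: p values below/above 0.5 give negative/positive z scores
      z_overall <- sum(z) / sqrt(length(z))     # steps 2 and 3: sum, then divide by sqrt(k)
      pnorm(z_overall, lower.tail = FALSE)      # step 4: one-tailed p value
    }

Applying this function to the 26 p values in the worked example below reproduces, to a close approximation, the overall p value reported there.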
For example, for Article 1 with Analysis 2c (see the table below), 26 p values are retained from 55 variables: (0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682, 0.817, 0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424, 0.918, 0.884, 0.959, 0.353, 0.980, 0.511, 0.980, 0.512, 0.980, 0.980). These correspond to the z scores (-1.662, -0.175, 0.533, 0.779, 1.924, -0.531, 2.054, 0.473, 0.904, 1.265, 0.939, -0.446, 2.054, 2.054, -0.749, -0.193, 1.393, 1.198, 1.739, -0.376, 2.054, 0.028, 2.054, 0.030, 2.054, 2.054), which sum to 21.448. We divide this by the square root of 26 (i.e., 5.099) to get an overall z score of 4.206, which in turn gives a p value of 0.000013 (1.30E-05).
Note that we do not take the absolute value of each z score, because the sign is important. A p value below/above 0.5 corresponds to a negative/positive z score. If the positive and negative z scores cancel each other out, the overall Stouffer-Fisher p value will be 0.5, which is what we expect to see on average with perfect randomisation.
Results
Here are the p values that I obtained from each article using each of the analysis methods described above. Note that a value of zero corresponds to a p value below 2.2E-16, the smallest value that R can calculate (on my computer anyway). The 10 columns to the right of the overall p value show the deciles of the distribution of p values of the individual tests that make up the overall score.
It can be readily seen that in Articles 1 and 2, which are the main focus of this post, the overall p value is very small, whatever combination of analyses and exclusions was performed, and the distribution of p values from the individual comparisons is very heavily skewed towards values above .7 and especially .9. Even when I excluded a large number of comparisons because there were fewer than 3 cases or non-cases in one or more conditions (Analyses 1c, 2c, and 3c), a decision that is heavily in favour of the authors, the largest p value I obtained for Article 1 was 0.0000337, and for Article 2 the largest p value was 0.00000000819. (As mentioned above, the results for Article 3 should not be considered to be independent from those of Article 2, but they do tend to confirm that there are potentially severe problems with the randomisation of participants in both reported versions of that study.)
Conclusion
The use of novel statistical methods should always be approached with an abundance of caution; indeed, as noted earlier, I have had my own criticisms of John Carlisle's approach in the past. (Now that I have adopted Carlisle's method in this post, I hope that someone else will take the part of the "Red Team" and perhaps show me where my application of it might be invalid!)
However, in this case, I believe that my comments from 2017 about the non-independence (and low number) of baseline measures do not apply here to anything like the same extent. In addition, I have run multiple analyses of each of the articles in question here, and tried to make whatever choice favoured the authors when the opportunity presented itself. (Of course, I could have made even more conservative choices; for example, I could have excluded all of the contingency tables where any of the cells were less than 6, or 10. But at some point, one has to accept at face value the authors' claims that their Table 1 results (a) are worth reporting and (b) demonstrate the success of their randomisation.)
Despite that, I obtained what seems to be substantial evidence that the randomisation of participants to conditions in the two studies that are the respective subjects of these articles may not have taken place exactly as reported. This adds to a growing volume of public (and, in some cases, official) critique of the Covid-19 research coming from this laboratory (e.g., in English, here, here, here, here, and here, or in Portuguese here, here, here, here, here, here, and here).
[[ Update 2023-06-04 15:40 UTC: Two alert commenters mentioned the preprint by Daniel Tausk, which points out a limitation of the use of Carlisle's method with dichotomous variables; my comment that "The use of novel statistical methods should always be approached with an abundance of caution" was apparently prescient. Tausk's analysis suggests that the p values that I calculated above may have been rather too small; his figures are 0.017 for the first article and 0.0024 for the second. These might not, on their own, constitute strong enough evidence to launch an inquiry into these studies, but it seems to me that they are still of interest in view of the other critiques of the work from this laboratory. I will let readers make up their own minds. ]]
Materials
My R code is available here. The three referenced articles are published either under open access conditions or as a preprint, so I hope they will be easy for readers to download when checking my working.
[[ Update 2021-10-03 12:56 UTC: Added the columns of p value deciles to the table of results. ]]
Comments

Excellent work Nick.

I appreciate how hard you worked to be fair to the authors, adjustment by RIVETS, etc.

This is an unimpeachable analysis, well done.

A disturbing result! Thank you Nick for your work and that of your colleagues to expose sloppy science — so dangerous to society.

Hi Nick. Thanks a lot for the clear description of your analysis. Did you happen to evaluate their other proxalutamide papers as well (https://doi.org/10.3389/fmed.2021.668698 and https://doi.org/10.1101/2021.06.22.21259318)? These are by far the most striking claims of efficacy of the drug and have been the most controversial studies by the group (in Brazil, at least), due to various concerns about ethical infringements. Thus, I would be curious to know whether the same patterns hold in those cases.

For the first of those (Frontiers), there aren't enough items in Table 1 to do this sort of analysis. For the second (medRxiv), the overall p value ranges from .001 up to .20, and I would not take that to the bank, even if the latter value was obtained under the most favourable possible circumstances for the authors.

This analysis is incorrect, as explained in detail in my new preprint: https://arxiv.org/abs/2209.00131

This blog post is really weird and skewed, and now Professor Tausk has done a really impeccable analysis on his article, with the math right. It is natural to say that we all expect a retraction from you, Dr. Brown.