Back in 2017, Dr. John Carlisle published this article in which he described a novel method for examining the table of participants' baseline characteristics (usually Table 1) in reports of randomised trials. Table 1 is typically used to show that the randomisation "worked", which in general means that the groups do not differ on important variables more often than we would expect by chance (with the *p* value of the final comparison expected to, in effect, "mop up" any differences that do occur).

Carlisle's insight was that it is possible for the baseline characteristics to be too similar across groups. That is, in some cases, we do not see the random variation that would be expected if the assignment to groups is truly random. For example, if you have 100 participants and 20 of them have some condition (say, diabetes), while a 10–10 split across two groups is the most likely individual outcome (about 17.6% of the time), you would expect a 7–13 (or more extreme) split about 26.3% of the time. If Table 1 contains a large number of even or near-even splits, that can be a sign that the randomisation was not done as reported, because there is just not enough genuine randomness in the data.
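The split probabilities in this example can be checked with a few lines of Python. This is a minimal sketch that assumes a simple binomial model, in which each of the 20 participants with the condition lands in either group independently with probability 0.5; the function names are illustrative and do not come from the original analysis.

```python
from math import comb


def prob_exact_split(n_cases: int, k_in_first_group: int) -> float:
    """Probability that exactly k of n_cases land in the first group (binomial, p = 0.5)."""
    return comb(n_cases, k_in_first_group) / 2 ** n_cases


def prob_split_at_least_as_uneven(n_cases: int, smaller_side: int) -> float:
    """Probability that the smaller side of the split has at most `smaller_side` cases."""
    return sum(
        prob_exact_split(n_cases, k)
        for k in range(n_cases + 1)
        if min(k, n_cases - k) <= smaller_side
    )


# 20 participants with the condition, split across two groups
print(f"10-10 split:            {prob_exact_split(20, 10):.3f}")
print(f"7-13 or more extreme:   {prob_split_at_least_as_uneven(20, 7):.3f}")
```

A hypergeometric model (fixed group sizes of 50 each) would give somewhat different numbers; the binomial version is used here only because it is the simplest way to illustrate the point.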

We can quantify the degree of randomness that is present by looking at the *p* values for the different between-group baseline tests in Table 1. If all of the variables are truly randomised, we would expect these *p* values to be uniformly distributed with a mean of 0.50. With a sufficiently large number of variables, we would expect 10% of the *p* values to be between 0 and 0.1, a further 10% to be between 0.1 and 0.2, and so on. If we see mostly *p* values above 0.8 or 0.9, this suggests that the baseline similarities between the groups could be too good to be true. A statistical test attributed to Stouffer and Fisher can be used to determine the probability of the set of *p* values that we observe being due to chance.

Carlisle's method has limitations, some of which I blogged about at the time (see also the comments under that post, including a nice reply from John Carlisle himself) and some of which were mentioned by proper methodologists and statisticians, for example here and here. Those limitations (principally, the possibility of non-independence of the observations being combined from Table 1, and their sometimes limited number) should be borne in mind when reading what follows, but (spoiler alert) I do not think they are sufficient to explain the issues that I report here.

In this post, I'm going to apply Carlisle's method, and the associated Stouffer-Fisher test, to three articles that have made claims of remarkably large positive effects of anti-androgenic drugs for the treatment of Covid-19:

*Cureus*, *13*(2), e13492. https://doi.org/10.7759/cureus.13492

*Cureus*, *13*(2), e13047. https://doi.org/10.7759/cureus.13047

*medRxiv*. https://doi.org/10.1101/2020.11.16.20232512

### Method

*p* value for that variable.

Where the *p* value is calculated from an independent-samples *t* test, my custom function determined the maximum possible *t* statistic (and, hence, the minimum possible *p* value), by adding the maximum possible rounding error to the larger mean, subtracting the same maximum possible rounding error from the smaller mean, and subtracting the maximum possible rounding error from the standard deviation of each group (cf. Brown & Heathers, 2019, "Rounded Input Variables, Exact Test Statistics (RIVETS)", https://psyarxiv.com/ctu9z/). I believe that doing this works in the authors' favour, as the majority of the *p* values in these analyses come from contingency tables and are rather large; that is, getting the smallest possible *p* value from the *t* tests tends to increase the overall Stouffer-Fisher test *p* value.
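The rounding adjustment described above can be sketched as follows. This is an illustration, not the function used in the analysis: it assumes a pooled-variance (Student's) *t* test and assumes that means and standard deviations were reported to the same number of decimal places; `max_t_statistic` and its parameters are hypothetical names, and the input numbers are made up.

```python
from math import sqrt


def max_t_statistic(m1, sd1, n1, m2, sd2, n2, decimals=1):
    """Largest pooled-variance t statistic compatible with rounded summary statistics.

    Assumes means and SDs were rounded to `decimals` places, so each reported
    value may be off by up to half a unit in the last reported digit.
    """
    half_unit = 0.5 * 10 ** (-decimals)  # maximum possible rounding error
    larger, smaller = max(m1, m2), min(m1, m2)
    # Push the means apart and shrink both SDs as far as rounding allows
    diff = (larger + half_unit) - (smaller - half_unit)
    s1 = max(sd1 - half_unit, 0.0)
    s2 = max(sd2 - half_unit, 0.0)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    se = sqrt(pooled_var * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("standard error is zero; t statistic is undefined")
    return diff / se, df


# Made-up example: reported means 50.0 and 52.0, both SDs 10.0, n = 50 per group
t_max, df = max_t_statistic(50.0, 10.0, 50, 52.0, 10.0, 50, decimals=1)
print(f"maximum t = {t_max:.3f} on {df} degrees of freedom")
```

Turning the maximum *t* statistic into the minimum two-tailed *p* value needs the *t* distribution, which is not in the Python standard library; a statistics package would be needed for that last step.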

Where the *p* value is derived from an analysis of 2x2 contingency tables, my custom function applied the following rules:

- If any cell of the table contains 0, return NULL; the variable will not be considered to have returned a *p* value.
- If any cell of the table contains a number less than 5, apply Fisher's exact test. This is of course an arbitrary distinction (but it doesn't make too much difference anyway).
- Otherwise, apply a chi-square test with Yates' continuity correction for 2x2 tables.
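The three rules above can be sketched in standard-library Python. This is not the R function used for the analysis; the function names are hypothetical, the Fisher test is the usual two-sided version (summing the probabilities of all tables no more likely than the observed one), and the chi-square *p* value uses the identity that, with 1 degree of freedom, the upper tail equals `erfc(sqrt(x / 2))`.

```python
from math import comb, erfc, sqrt


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d

    def prob(x):  # hypergeometric probability that the top-left cell equals x
        return comb(col1, x) * comb(n - col1, row1 - x) / comb(n, row1)

    lowest, highest = max(0, row1 + col1 - n), min(row1, col1)
    p_observed = prob(a)
    total = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # small tolerance for float ties
    )
    return min(1.0, total)


def chi_square_yates(a, b, c, d):
    """Chi-square test p value with Yates' continuity correction (1 df)."""
    n = a + b + c + d
    corrected = max(0.0, abs(a * d - b * c) - n / 2)
    statistic = n * corrected ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return erfc(sqrt(statistic / 2))


def baseline_p_value(a, b, c, d):
    """Apply the three rules: None for a zero cell, Fisher below 5, else Yates."""
    cells = (a, b, c, d)
    if any(x == 0 for x in cells):
        return None  # variable is treated as not having returned a p value
    if any(x < 5 for x in cells):
        return fisher_exact_two_sided(a, b, c, d)
    return chi_square_yates(a, b, c, d)


print(baseline_p_value(0, 5, 5, 5))      # zero cell: no p value
print(baseline_p_value(3, 1, 1, 3))      # small cells: Fisher's exact test
print(baseline_p_value(10, 10, 10, 10))  # identical counts: chi-square with Yates
```

A table with identical counts in both conditions yields a *p* value of exactly 1.0 under the continuity-corrected test, which is why such values need special handling before they are converted to *z* scores.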

Among the resulting *p* values, I replaced any that were greater than 0.98 with exactly 0.98, which avoids problems with the calculation of the Stouffer-Fisher formula in R with values of exactly 1.0 (which will occur, for example, if the number of cases in a contingency table is identical across conditions, or differs only by 1). Again, I believe that this choice works in the authors' favour. Then I calculated the overall Stouffer-Fisher test *p* value using the method that I described in my blog post about Carlisle's article:

1. Convert each *p* value into a *z* score.
2. Sum the *z* scores.
3. If there are *k* scores, divide the sum of the *z* scores from step 2 by the square root of *k*.
4. Calculate the one-tailed *p* value associated with the overall *z* score from step 3.

*p* values are retained from 55 variables: (0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682, 0.817, 0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424, 0.918, 0.884, 0.959, 0.353, 0.980, 0.511, 0.980, 0.512, 0.980, 0.980). These correspond to the *z* scores (-1.662, -0.175, 0.533, 0.779, 1.924, -0.531, 2.054, 0.473, 0.904, 1.265, 0.939, -0.446, 2.054, 2.054, -0.749, -0.193, 1.393, 1.198, 1.739, -0.376, 2.054, 0.028, 2.054, 0.030, 2.054, 2.054), which sum to 21.448. We divide this by the square root of 26 (i.e., 5.099) to get an overall *z* score of 4.206, which in turn gives a *p* value of 0.000013 (1.30E-05).

*z* score, because the sign is important. A *p* value below/above 0.5 corresponds to a negative/positive *z* score. If the positive and negative *z* scores cancel each other out, the overall Stouffer-Fisher *p* value will be 0.5, which is what we expect to see on average with perfect randomisation.
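The four steps, together with the cap at 0.98, can be sketched in Python as follows. This is an illustration rather than the R code used for the analysis; `stouffer_combined` is a hypothetical name, and the input is the list of 26 rounded *p* values quoted above, so the result should only approximately match the figures in the text.

```python
from math import sqrt
from statistics import NormalDist

# The 26 example p values quoted in the text (already capped at 0.98)
EXAMPLE_P_VALUES = [
    0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682, 0.817,
    0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424, 0.918, 0.884,
    0.959, 0.353, 0.980, 0.511, 0.980, 0.512, 0.980, 0.980,
]


def stouffer_combined(p_values, cap=0.98):
    """Return (overall z, one-tailed p) for a set of baseline p values.

    Each p value must be strictly greater than 0; values above `cap`
    are replaced with `cap` so that p = 1.0 does not give an infinite z.
    """
    std_normal = NormalDist()
    capped = [min(p, cap) for p in p_values]
    # Step 1: signed z scores (p < 0.5 gives negative z, p > 0.5 positive z)
    z_scores = [std_normal.inv_cdf(p) for p in capped]
    # Steps 2 and 3: sum the z scores and divide by sqrt(k)
    overall_z = sum(z_scores) / sqrt(len(z_scores))
    # Step 4: one-tailed p value (upper tail, computed via symmetry)
    overall_p = std_normal.cdf(-overall_z)
    return overall_z, overall_p


z, p = stouffer_combined(EXAMPLE_P_VALUES)
print(f"overall z = {z:.3f}, one-tailed p = {p:.2e}")
```

Because the *z* scores keep their sign, a set of *p* values scattered evenly around 0.5 gives an overall *z* near 0 and an overall *p* value near 0.5, whereas a set dominated by large *p* values gives a large positive *z* and a very small overall *p* value.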

### Results

*p* values that I obtained from each article using each of the analysis methods described above. Note that a value of zero corresponds to a *p* value below 2.2E-16, the smallest value that R can calculate (on my computer anyway). The 10 columns to the right of the overall *p* value show the deciles of the distribution of *p* values of the individual tests that make up the overall score.
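As an illustration of what such decile columns contain, here is a small Python sketch that counts how many *p* values fall into each tenth of the interval from 0 to 1. It is applied to the 26 example *p* values quoted in the Method section; whether the table reports counts or percentages is an assumption on my part, and `decile_counts` is a hypothetical name.

```python
# The 26 example p values quoted in the Method section
EXAMPLE_P_VALUES = [
    0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682, 0.817,
    0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424, 0.918, 0.884,
    0.959, 0.353, 0.980, 0.511, 0.980, 0.512, 0.980, 0.980,
]


def decile_counts(p_values):
    """Count p values in [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]."""
    counts = [0] * 10
    for p in p_values:
        # A p value of exactly 1.0 is placed in the last bin
        counts[min(int(p * 10), 9)] += 1
    return counts


counts = decile_counts(EXAMPLE_P_VALUES)
for i, n in enumerate(counts):
    print(f"{i / 10:.1f}-{(i + 1) / 10:.1f}: {n}")
```

With uniformly distributed *p* values, each of the ten bins would be expected to hold roughly the same number of values; a pile-up in the top one or two bins is the pattern that the overall test picks up.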

*p* value is very small, whatever the combination of analyses and exclusions that were performed, and the distribution of *p* values from the individual comparisons is very heavily skewed towards values above .7 and especially .9. Even when I excluded a large number of comparisons because there were fewer than 3 cases or non-cases in one or more conditions (Analyses 1c, 2c, and 3c), a decision that is heavily in favour of the authors, the largest *p* value I obtained for Article 1 was 0.0000337, and for Article 2 the largest *p* value was 0.00000000819. (As mentioned above, the results for Article 3 should not be considered to be independent from those of Article 2, but they do tend to confirm that there are potentially severe problems with the randomisation of participants in both reported versions of that study.)

### Conclusion

The use of novel statistical methods should always be approached with an abundance of caution; indeed, as noted earlier, I have had my own criticisms of John Carlisle's approach in the past. (Now that I have adopted Carlisle's method in this post, I hope that someone else will take the part of the "Red Team" and perhaps show me where my application of it might be invalid!)

However, I believe that my comments from 2017 about the non-independence (and low number) of baseline measures do not apply here to anything like the same extent. In addition, I have run multiple analyses of each of the articles in question, and tried to make whatever choice favoured the authors when the opportunity presented itself. (Of course, I could have made even more conservative choices; for example, I could have excluded all of the contingency tables where any of the cells were less than 6, or 10. But at some point, one has to accept at face value the authors' claims that their Table 1 results (a) are worth reporting and (b) demonstrate the success of their randomisation.)

Despite that, I obtained what seems to be substantial evidence that the randomisation of participants to conditions in the two studies that are the respective subjects of these articles may not have taken place exactly as reported. This adds to a growing volume of public (and, in some cases, official) critique of the Covid-19 research coming from this laboratory (e.g., in English, here, here, here, here, and here, or in Portuguese here, here, here, here, here, here, and here).

### Materials

My R code is available here. The three referenced articles are published either under open access conditions or as a preprint, so I hope they will be easy for readers to download when checking my working.

[[ Update 2021-10-03 12:56 UTC: Added the columns of *p* value deciles to the table of results. ]]

Excellent work Nick.

I appreciate how hard you worked to be fair to the authors, adjustment by RIVETS, etc.

This is an unimpeachable analysis, well done.

A disturbing result! Thank you Nick for your work and that of your colleagues to expose sloppy science — so dangerous to society.

Hi Nick. Thanks a lot for the clear description of your analysis. Did you happen to evaluate their other proxalutamide papers as well (https://doi.org/10.3389/fmed.2021.668698 and https://doi.org/10.1101/2021.06.22.21259318)? These are by far the most striking claims of efficacy of the drug and have been the most controversial studies by the group (in Brazil at least) due to various concerns about ethical infringements. Thus, I would be curious to know whether the same patterns hold in those cases.

For the first of those (Frontiers), there aren't enough items in Table 1 to do this sort of analysis. For the second (medRxiv) the overall p value ranges from .001 up to .20, and I would not take that to the bank, even if the latter was obtained under the absolute most favourable circumstances for the authors.

This analysis is incorrect, as explained in detail in my new preprint:

https://arxiv.org/abs/2209.00131