31 October 2021

A bug and a dilemma

A few months ago, I discovered that the SAS statistical software package, which is used worldwide by universities and other large organisations to analyse their data, contained—until quite recently—a bug that could result in information that the user thought they had successfully deleted (and was no longer visible from within the application itself) still being present in the saved data file. This could lead to personal identifiable information (PII) about study participants being revealed, alongside whatever other data might have been collected from these participants, which—depending on the study—could potentially be extremely sensitive. I found this entirely by chance when looking at an SAS data file to try and work out why some numbers weren't coming out as expected, for which it would have been useful to know if numbers are stored in ASCII or binary. (It turned out that they are stored in binary.)

Here's how this bug works: Suppose that as a researcher you have run a study on 80 named participants, and you now have a dataset containing their names, study ID numbers (for example, if the study code within your organisation is XYZ this code might be XYZ100, XYZ101, etc, up to XYZ179), and other relevant variables from the study. One day you decide to make a version of the dataset that can be shared without the participants being identifiable, either because you have to deposit this in an archive when you submit the study to a journal, or because somebody has read the article and asked for your data. You could share this in .CSV file format, and indeed that would normally be considered best practice for interoperability; but there may be good reasons to share it in SAS's native binary data file format with a .sas7bdat extension, which can in any case be opened in R (using a package named "sas7bdat", among others) or in SPSS

So you open your file called participants-final.sas7bdat in the SAS data editor and delete the column with the participants' names (and any other PII, such as IP addresses, or perhaps dates of birth if those are not needed to establish the participants' ages, etc), then save it as deidentified-participants-final.sas7bdat, and share the latter file. But what you don't know is that, because of this bug, in some unknown percentage of cases the text of most of the names can sometimes still be sitting in the sas7bdat binary data file, close to the alphanumeric participant IDs. That is, if the bug has struck, someone who opens the "deidentified" file in a plain text editor (which could be as simple as Notepad on Windows) might see the names and IDs among the binary gloop, as shown in this image.

I am pretty sure these two people did not take part in this study.

This screenshot shows an actual extract from a data file that I found, with only the names and the study ID codes replaced with those of others selected from the phone book. The full names of about two-thirds of the participants in this study were readable. Of course, you can't read the binary data and it would take a lot of work to do so, but given the participant IDs (PRZ045 for Trump, PRZ046 for Biden) you can simply open the "anonymised" data file in SAS and find out all you want about those two people from within the application.

Even worse, though, is the fact that unless the participant's name is extremely common, when combined with knowledge of approximately where and when the study was conducted it might very well let someone identify them with a high degree of confidence for relatively little effort. And by opening the file in SAS—for example, with the free service SAS OnDemand for Academics, or in SPSS or R as previously mentioned—and looking at the data that was intended to be shared, we will be able to see that our newly-identified participant is 1.73 metres tall, or takes warfarin, or is HIV-positive.

(A number of Microsoft products, including Word and Excel, used to have a bug like this, many versions ago. When you chose "Save" rather than "Save As", it typically would not physically overwrite on the disk any text that you had deleted, perhaps because the code had originally been written to minimise writing operations with diskettes, which are slow.)

I have been told by SAS support (see screenshot below) that this bug was fixed in version 9.4M4 of the software, which was released on 16 November 2016. The support agent told me that the problem was known to be present in version 9.4M3, which was released on 14 July 2015; however, I do not know whether the problem also existed in previous versions. I think it would be prudent to assume that any file in .sas7bdat format created by a version of SAS prior to 9.4M4 may have this issue. Neither the existence of the problem, nor the fact that it had been fixed, were documented by SAS in the release notes for version 9.4M4; equally, however, the support representative did not tell me that the problem is regarded as top secret or subject to any sort of embargo.

(The identity of the organisation that shared the files in which I found the bug has been redacted here.)

SAS is a complex software package and it will generally take a while for large organisations to migrate to a new version. Probably by now most versions have been upgraded to 9.4M4 or later, but quite a few sites might have been using the previous version containing this bug until quite recently, and as I already mentioned, it's not clear how old the bug is (i.e., at what point it was introduced to the software). So it could have been around for many years prior to being discovered, and it could well have still been around for two or three years after that date at many sites.

Now, this discovery caused me a dilemma. I worried that, if I were to go public with this bug, this might start a race between people who have already shared their datasets that were made with a version prior to 9.4M4 trying to replace or recall their files, and Bad People™ trying to find material online to exploit. That is, to reveal the existence of the problem might increase the risk of data leaking out. On the other hand, it's also possible that the bad people are already aware of the problem and are actively looking for that material, in which case every day that passes without the problem becoming public knowledge increases the risk, and going public would be the start of the solution.

Note that this is different from the typical "white hat"/"bug bounty" scenario, in which the Good People™ who find a vulnerability tell the software company about the bug and get paid to remain silent until a reasonable amount of time has passed to patch the systems, after which they are free to reveal the existence of the problem. In those cases, patching the software fixes the problem immediately, because the extent of the vulnerability is limited to the software itself. But here, the vulnerability is in the data files that were not anonymised as intended. There is no way to patch anything to stop those files from being read, because that only needs a text editor. The only remedy is for the files to be deleted from, or replaced in, repositories as their authors or guardians become aware of the issue.

In the original case where I discovered this issue, I reported it to the owner of the dataset and he arranged for the offending file to be recalled from the repository where he had placed it, namely the Open Science Framework. (I also gave a heads-up to the Executive Director of the Center for Open Science, Brian Nosek, at that time.) The dataset owner also reported the problem to their management, as they thought (and I completely agree) that dealing with this sort of issue is beyond the pay grade of any individual principal investigator. I do not know what has happened since, nor do I think it's really my business.  I would argue that SAS ought to have done something more about this than just sneaking out a fix without telling anybody; but perhaps they, too, looked at the trade-off described above and decided to keep quiet on that basis, rather than merely avoiding embarrassment.

I have spent several months wondering what to do about this knowledge. In the end, I decided that (a) there probably aren't too many corrupt files out there, and (b) there probably aren't too many Bad People™ who are likely to go hunting for sensitive data this way, because it just doesn't seem like a very productive way of being a Bad Person. So I am going public today, in the hope that the practical consequences of revealing the existence of this problem are unlikely to be major, and that giving people the chance to correct any SAS data files that they might have made public will be, on balance, a net win for the Good People. (For what it's worth, I asked two professors of ethics about this, one of them a specialist in data-related issues, and they both said "Ouch. Tough call. I don't know. Do what you think is best".)

Now, what does this discovery mean? Well, if you use SAS and have made your data available using the .sas7bdat file format, you might want to have a look in the data files with a text editor and check that there is nothing in there that you wouldn't expect. But even if you don't use SAS, there may still be a couple of lessons for you from this incident, because (a) the fact that this particular software bug is fixed doesn't mean there aren't others, and (b) everyone makes mistakes.

First, consider always using .CSV files to share your data, if there is no compelling reason not to do so. The other day I had to download a two-year-old .RData file from OSF and it contained data structures that were already partly obsolete when read by newer versions of the package that had create them; I had to hunt around online for the solution, and that might not work at all at some future point. When I  had sorted that out I saved the resulting data in a .CSV file, which turned out to be nearly 20% smaller than the .RData file anyway.

Second, try to keep all PII out of the dataset altogether. Build a separate file or files that connects each participant's study ID number to their name and any other information that is not going to be an analysed variable. If your study requires you to generate a personalised report for the participants that includes their name then this might represent a little extra effort, but generally this approach will greatly reduce the chances of a leak of PII. (I suspect that for every participant whose PII is revealed by bugs, several more are the victims of either data theft or simply failure on the part of the researchers to delete the PII before sharing their data.)

(Thanks to Marcus Munafò and Brian Nosek for valuable discussions about an earlier draft of this post.)

28 October 2021

A catastrophic failure of peer review in obstetrics and gynaecology

In this post I will discuss a set of 46 articles from the same institution that appear to show severe problems in many journals in the field of obstetrics and gynaecology. These are not entirely new discoveries; worrying overlaps among 35 of these articles have already been investigated in a commentary article from 2020 by Esmée Bordewijk and colleagues that critiqued 24 articles on which Dr Ahmed Badawy was lead author (19) or a co-author (5), plus 11 articles lead-authored by Dr Hatem Abu Hashim, who is Dr Badawy's colleague in the Department of Obstetrics and Gynaecology at Mansoura University in Egypt.

Bordewijk et al. reported that they had detected a large number of apparent duplications in the summary statistics across those articles, which mostly describe randomized controlled trials carried out in the Mansoura ObGyn department. Nine of these articles appear as chapters in Dr Badawy's PhD thesis , which he defended in December 2008 at the University of Utrecht in the Netherlands.

I think it is fair to say that Dr Badawy was not especially impressed by the arguments in Bordewijk et al.'s commentary; indeed, he wrote a reply with the uncompromising title of "Data integrity of randomized controlled trials: A hate speech or scientific work?" in which he questioned, among other things, the simulation techniques that Bordewijk et al. had used to demonstrate how unlikely it was that the patterns that they had observed across the 35 articles that they examined had arisen by chance.

The senior author on the Bordewijk et al. commentary was Dr Ben Mol of Monash University in Melbourne, Australia. Since their commentary was published, Dr Mol and his colleagues have been attempting to get the journal editors who published the 35 articles in question to take some form of action on them. To date, five articles have been retracted and another 10 have expressions of concern. The story, including its potential legal fallout, has been covered in considerable detail at Retraction Watch in December 2020 and again in August 2021.

However, to some extent, Dr Badawy may have a point: The evidence presented in the commentary is circumstantial and depends on a number of probabilistic assumptions, which editors may not be inclined to completely trust (although personally I find Bordewijk et al.'s analysis thoroughly convincing). And, as an editor, even if you believe that two or more articles are based on recycled data or summary statistics, how are you to know that the one in your journal is not the original ("good") one?

Fortunately (at least from the point of view of the error correction process) there is a much simpler approach to the problem at hand. It can be shown that almost all of the articles that were analysed by Bordewijk et al.—plus a few more that did not make it into their commentary—have very substantial statistical flaws at the level of each individual article. In my opinion, in most cases these errors would justify a rapid retraction based solely on the evidence that is to be found in each article's PDF file. There is no need for simulations or probability calculations; in the majority of cases, the numbers sitting there in the tables of results are demonstrably incorrect.

General description of the articles

As mentioned above, for this blog post I examined 46 articles from the Department of Obstetrics and Gynaecology at Mansoura University. Of these, 35 had already been analysed by Bordewijk et al., and the rest were included either at the suggestion of Ben Mol or after I searched for any other empirical studies that I could find in the Google Scholar profiles of Dr Badawy and Dr Abu Hashim. Seven of the 46 articles had neither Dr Badawy nor Dr Abu Hashim as co-authors, but for all seven of those Dr Tarek Shokeir was listed as a co-author (or, in one case, sole author).

These articles mostly describe RCTs of various interventions for conditions such as infertility, heavy menstrual bleeding, polycystic ovary syndrome, preterm labour, or endometriosis. Several of them have more than 100 citations according to Google Scholar. The studies seem to be well-powered, many with more than 100 participants in each group (e.g., for this one the authors claimed to have recruited 996 infertile women), and it is not hard to imagine that their findings may be affecting clinical practice around the world.

The typical article is relatively short, and contains a baseline table comparing the groups of patients (usually two), followed by one or sometimes more tables comparing the outcomes across those groups. These are usually expressed as simple unpaired comparisons of parameters (e.g., height, with mean and standard deviation reported for each group), or as tests of proportions (e.g., in the treatment group X% of N1 participants became pregnant, versus Y% of N2 participants in the control group). The statistics are therefore for the most part very simple; for example, there are no logistic regressions with covariates. This means that we can readily check most of the statistics from the tables themselves.

The t statistics

First up, I note that in about half of these articles, no t statistics at all are reported for the comparisons of continuous variables across groups. Sometimes we get just a p value. In other cases we are only told that individual comparisons (or all of the comparisons in a table, via a note at the end) were statistically significant or not; typically we are left to infer that that means p < 0.05. (In a few articles the authors reported using the Mann-Whitney U test when data were not normally distributed, but they do not generally indicate which variables are concerned by this in each case.)

In quite a few cases the errors in the implicit t statistics are visible from space, as in this example from 10.1111/j.1447-0756.2010.01383.x:

(This table has been truncated on the right in order to fit on the page.)

Have a look at the "Fasting glucose" numbers (fourth line from the bottom). The difference between the means is 5.4 (which means a minimum of 5.3 even after allowing for rounding) and just by approximating a weighted mean you can see that the pooled SD is going to be about 1.6, so this is a Cohen's d of around 3.3, which is never going to be non-significant at the 0.05 level. You don't have to carry the formula for the pooled standard error around in your head to know that with df = 136 the t statistic here is going to be huge, and indeed it is: the minimum t value is 17.31, the midpoint is 18.17, the maximum is 19.09, p = homeopathic. (Aside: p
hysiologists might wonder about the homogeneity of the testosterone levels in the ovulation group, with an SD of just 0.01.)

When we do get to see t statistics, the majority are incorrect given the mean, SD, and sample size, and even after allowing for rounding (but see also our RIVETS preprint for what happens if test statistics are derived with too much allowance for rounding). See this table from 10.1016/j.fertnstert.2008.06.013:

Several of the effect sizes here are huge; for example, it is close to d = 4 on the first line, for which the correct t statistic is not 4.4 but instead somewhere between 38.71 and 40.26. If you are more familiar with ANOVA, you can square those numbers to get the corresponding F statistic. (Spoiler: F = 1,500 is a lot.)

The overall results for the set of 46 articles are disastrous. In most of the articles the majority of the p values are incorrect, sometimes by a wide margin. Even in those articles where the authors did not explicitly report which test they used, leaving open the possibility that they might have used the Mann-Whitney U test, the discrepancies between the reported p values and those that I obtained from a t test were very often such that even the discovery that the Mann-Whitney test had been used would not be sufficient to explain them.

[Begin update 2021-10-28 19:57 UTC]
Brenton Wiernik asked whether it was possible that the authors had accidentally reported the variability of their means as the standard error of the mean (SEM) rather than the standard deviation. I must confess that when doing the analyses I did not consider this possibility, not least because all of the SDs for things like age (around 3) or height (around 5–6cm) seemed quite reasonable, although perhaps a bit on the low side of what one might expect. 

However, I did identify seven articles (10.3109/01443615.2010.49787310.1016/j.fertnstert.2008.04.06510.1016/j.fertnstert.2008.06.013, 10.1016/j.fertnstert.2007.08.034, 10.1016/s1472-6483(10)60148-4, 10.1016/j.fertnstert.2007.05.010, and 10.1016/j.fertnstert.2007.02.062) in which the authors claimed that their measure of variability (typically written as ± and a number after the mean) was indeed the SEM. But the sample sizes in these papers are such that the implied SDs would need to be huge. Across these seven papers, the implied SD for age would range from 29.50 to 62.39 years, and for height from 54.79 to 124.77 cm. I therefore stand by my interpretation that these values were intended by the authors to be standard deviations, although that then gives them another question to answer, namely why they claimed them to be SEM when those numbers are patently absurd.
[End update 2021-10-28 19:57 UTC]

The Χ² (etc) statistics

In examining whether the tests of proportions had been reported correctly, I included only those articles (30 out of the total set of 46) that contained at least one exact numerical (i.e., not "NS" or "<0.001") p value from a Pearson chi-square test or Fisher's exact test of a 2x2 contingency table. If the authors also reported Χ² statistics and/or odds ratios, I also included those numbers. I then examined the extent to which these statistics matched the values that I calculated from the underlying data. When the subsample sizes were very small, I allowed the authors some more leeway, as the Pearson chi-square test does not always perform well in these cases.

As with the t test results (see previous section), the overall results revealed a large number of incorrect p values in almost every article for which I recalculated the tests of proportions.

Perhaps the most indisputable source of errors is situations in which what is effectively the same test is reported twice, with different chi-square statistics (if those are reported) and different p values, even though those values are necessarily identical. I counted 8 examples of this across 7 different articles. For example, consider this table from 10.3109/01443615.2010.508850:

You're either pregnant or you aren't. I don't make the rules.

After 6 months in the study, every participant either had, or had not, become pregnant. So the contingency table for the first outcome ("No pregnancy") is ((141, 114), (150, 107)) and for the second outcome ("Clinical pregnancy") it is 
((114, 141), (107, 150)). Those will of course give exactly the same result, meaning that at least one of the Χ²/p-value pairs must be wrong. In fact the correct numbers are Χ²(1) = 0.49, p = 0.48, which means that neither of the Χ² test statistics, nor the p values, in the first two lines of the table match are even remotely valid. For that matter, neither (incorrect) p value in the table even matches its corresponding (incorrect) Χ² statistic; I will let you check this for yourself as an exercise.

The p values

There are a couple of basic things to keep in mind when reading tables of statistics:

  • A t statistic of 1.96 with 100 or more degrees of freedom gives a rounded two-tailed p value of 0.05 (although if you want it to be strictly less than 0.05000, you need a t statistic of 1.984 with 100 dfs).
  • For any given number of degrees of freedom, a larger t or Χ² statistic gives a smaller p value.
With that in mind, let's look at this table (from 10.1016/j.fertnstert.2007.05.010), which I believe to be entirely typical of the articles under discussion here:

The good news is that the percentages and the Χ² statistics check out OK!

Nether of the above bits of very basic knowledge is respected here. First we have t(228) = 2.66, p = 0.1 (when the correct p value for that t is 0.008—although in any case the correct t statistic for the given means and SDs would be between 5.05 and 5.83). Second, between FSH (follicle-stimulating hormone) and LH (luteinizing hormone), the t statistic goes up and so does the p value (which is also clearly incorrect in both cases).

A number of the articles contain p values that are literally impossible (i.e., greater than 1.0; don't @ me to tell me about that time you did a Bonferroni correction by multiplying the p value instead of dividing the alpha). See 10.3109/01443615.2010.497873 (Table 1, "Parity", p = 1.13; see also "Other inconsistencies", below), 10.1016/j.fertnstert.2008.04.065 (Table 1, "Height", = 1.01),  and 10.1007/s00404-013-2866-0, which contains no less than four examples across its Tables 1 and 2:

1.12, 1.22, 1.32, 1.42: The impossible p values form a nice pattern.

The confidence intervals

A few of the articles have confidence intervals in the tables, perhaps added at the insistence of a reviewer or editor. But in most cases the point estimate falls outside the confidence interval. Sometimes this can become quite absurd, as in the following example (from 10.1016/j.fertnstert.2007.08.034). Those CI limits are ± 1.96 standard errors either side of... what exactly?

(Once you look beyond the CIs, there is a "bonus" waiting in the p value column here.)

Other inconsistencies

Within these 46 articles it is hard not to notice a considerable number of other inconsistencies, which make the reader wonder how much care and attention went into both the writing and review processes. These tables from 10.3109/01443615.2010.497873 provide a particularly egregious example, with the appearance of 41 and 42 extra patients in the respective groups between baseline and outcome. (As a bonus, we also have a p value of 1.13.)

(Some white space has been removed from this table.)

These results from 10.1016/j.ejogrb.2012.09.014 make no sense. The first comparison was apparently done using Fisher's exact test, the second with Pearson's Χ² test, and the third, well, your guess is as good as mine. But there is no reason to use the two different types of test here, and even less reason to use Fisher's test for the larger case numbers and Pearson's for the smaller ones. (The p values are all incorrect, and would be even if the other test were to have been used for every variable.)

(Some white space and some results that are not relevant to the point being illustrated have been removed from this table.)

Finally, fans of GRIM might also be interested to learn that two articles show signs of possible inconsistencies in their reported means:
10.1016/j.rbmo.2014.03.011, "Age" (both groups) and "Infertility" (control group)


The results of these analyses seem to indicate that something has gone very badly wrong in the writing, reviewing, and publication of these articles. Even though I tried to give the published numbers the benefit of the doubt as far as possible, I estimate that across these 46 articles, 346 (64%) of the 542 parametric tests (unpaired t tests, or, occasionally, ANOVA) and 151 (61%) of the 247 contingency table test (Pearson's Χ² or Fisher's exact test) that I was able to check were incorrectly reported. I don't think that anybody should be relying on the conclusions of these articles as a guide to practice, and I suspect that the only solution for most of them will be retraction. (As already mentioned, five have already been retracted following the publication of Bordewijk et al.'s commentary.)

I have a few aims in writing this post.

First, I want to do whatever I can to help get these misleading (at best) papers retracted from the medical literature, where they would seem to have considerable potential to do serious harm to the health of women, especially those who are pregnant or trying to overcome infertility.

Second, I aim to show some of the techniques that can be used to detect obvious errors in published articles (or in manuscripts that you might be reviewing).

Third, and the important reason for doing all this work (it took a lot of hours to do these analyses, as you will see if you download the Excel file!), is to draw attention to the utter failure of peer review that was required in order for most of these articles to get published. They appeared in 13 different journals, none of which would appear to correspond to most people's idea of a "predatory" outlet. It is very tempting to imagine that nobody—editors, reviewers, Dr Badawy's thesis committee at the University of Utrecht, or readers of the journals (until Ben Mol and Esmée Bordewijk came along)—even so much glanced at the tables of results in these articles, given that they almost all contain multiple impossible numbers.

It is true that the majority of these articles are more than 10 years old, but I wonder how much has changed in the publication processes of medical journals since then. The reality of scientific peer review seems to be that, to a first approximation, nobody ever checks any of the numbers. I find that deeply worrying.

Supporting documents

The majority of the analyses underlying this post have been done with Microsoft Excel 2003. I have some R code that can do the same thing, but it seemed to me to make more sense to use Excel as the process of copying and pasting numbers from the tables in the articles was a lot more reliable, requiring only a text editor to replace the column separators with tab characters. I used my R code to compute the test statistics in a couple of cases where there were more than two groups and so I had to use the rpsychi package to calculate the results of a one-way ANOVA.

In my Excel file, each unpaired t test is performed on a separate line. The user enters the mean, standard deviation, and sample size for each of the two conditions, plus an indication of the rounding precision (i.e, the number of decimal places) for the means and SDs separately. The spreadsheet then calculates (using formulas that you can find in columns whose width I have in most cases reduced to zero) the minimum and maximum possible (i.e., before rounding) means and SDs, and from that it determines the minimum, notional (i.e., assuming that the rounded input values are exact), and maximum t statistics and the corresponding p values. It then highlights those t statistics (if available) and p values (or "significant/not-significant" claims) from the article that are not compatible with any point in the possible ranges of values. That is, at all times, I give the maximum benefit of the doubt to the authors. (Similar considerations apply, mutatis mutandis, to the table of chi-square tests in the same Excel file.)

The documents (an Excel file for the main analyses and some R code for the bits that I couldn't work out how to do easily in Excel) are available here. The article PDFs are all copyrighted and I cannot share them, but if you do not have institutional access then there is always the site whose name rhymes with Dry Club.

Appendix: List of examined articles

Articles that have not been retracted and have no expression of concern

Badawy et al. (2009). Gonadotropin-releasing hormone agonists for prevention of chemotherapy-induced ovarian damage: Prospective randomized study. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.12.044

Badawy et al. (2007). Induction of ovulation in idiopathic premature ovarian failure: A randomized double-blind trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60711-0

Badawy et al. (2010). Clomiphene citrate or aromatase inhibitors combined with gonadotropins for superovulation in women undergoing intrauterine insemination: A prospective randomised trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.497873

Badawy et al. (2009). Ultrasound-guided transvaginal ovarian needle drilling (UTND) for treatment of polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.044

Badawy et al. (2008). Low-molecular weight heparin in patients with recurrent early miscarriages of unknown aetiology. Journal of Obstetrics and Gynaecology. https://doi.org/10.1080/01443610802042688

Badawy et al. (2010). Laparoscopy--or not--for management of unexplained infertility. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.508850

Badawy et al. (2007). Plasma homocysteine and polycystic ovary syndrome: The missed link. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2006.10.015

Badawy et al. (2008). Extending clomiphene treatment in clomiphene-resistant women with PCOS: A randomized controlled trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60148-4

Badawy et al. (2006). Clomiphene citrate plus N-acetyl cysteine versus clomiphene citrate for augmenting ovulation in the management of unexplained infertility: A randomized double-blind controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2006.02.097

Badawy et al. (2007). Randomized controlled trial of three doses of letrozole for ovulation induction in patients with unexplained infertility. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)61046-2

Fawzy et al. (2007). Treatment options and pregnancy outcome in women with idiopathic recurrent miscarriage: A randomized placebo-controlled study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-007-0527-x

Gibreel et al. (2012). Endometrial scratching to improve pregnancy rate in couples with unexplained subfertility: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2012.02016.x

Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus highly purified FSH for ovulation induction in clomiphene-resistant PCOS women: A randomised controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2010.488771

Abu Hashim et al. (2010). Letrozole versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-010-1566-2

Abu Hashim et al. (2011). Laparoscopic ovarian diathermy after clomiphene failure in polycystic ovary syndrome: is it worthwhile? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1983-x

Abu Hashim et al. (2012). Contraceptive vaginal ring treatment of heavy menstrual bleeding: A randomized controlled trial with norethisterone. Contraception. https://doi.org/10.1016/j.contraception.2011.07.012

Abu Hashim et al. (2010). N-acetyl cysteine plus clomiphene citrate versus metformin and clomiphene citrate in treatment of clomiphene-resistant polycystic ovary syndrome: A randomized controlled trial. Journal of Women's Health. https://doi.org/10.1089/jwh.2009.1920

Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2010.01383.x

Abu Hashim et al. (2011). Minimal stimulation or clomiphene citrate as first-line therapy in women with polycystic ovary syndrome: A randomized controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2011.589924

Abu Hashim et al. (2011). Does laparoscopic ovarian diathermy change clomiphene-resistant PCOS into clomiphene-sensitive? Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1931-9

Abu Hashim et al. (2013). LNG-IUS treatment of non-atypical endometrial hyperplasia in perimenopausal women: A randomized controlled trial. Journal of Gynecologic Oncology. https://doi.org/10.3802/jgo.2013.24.2.128

Marzouk et al. (2014). Lavender-thymol as a new topical aromatherapy preparation for episiotomy: A randomised clinical trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2014.970522

Ragab et al. (2013). Does immediate postpartum curettage of the endometrium accelerate recovery from preeclampsia-eclampsia? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-013-2866-0

El Refaeey et al. (2014). Combined coenzyme Q10 and clomiphene citrate for ovulation induction in clomiphene-citrate-resistant polycystic ovary syndrome. Reproductive Biomedicine Online. https://doi.org/10.1016/j.rbmo.2014.03.011

Seleem et al. (2014). Superoxide dismutase in polycystic ovary syndrome patients undergoing intracytoplasmic sperm injection. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s10815-014-0190-7

Shokeir (2006). Tamoxifen citrate for women with unexplained infertility. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-006-0181-8 *

Shokeir et al. (2016). Hysteroscopic-guided local endometrial injury does not improve natural cycle pregnancy rate in women with unexplained infertility: Randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/jog.13077

Shokeir et al. (2009). The efficacy of Implanon for the treatment of chronic pelvic pain associated with pelvic congestion: 1-year randomized controlled pilot study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-009-0951-1 *

Shokeir & Mousa (2015). A randomized, placebo-controlled, double-blind study of hysteroscopic-guided pertubal diluted bupivacaine infusion for endometriosis-associated chronic pelvic pain. International Journal of Gynecology & Obstetrics. https://doi.org/10.1016/j.ijgo.2015.03.043

Articles that are subject to an editorial Expression of Concern

Badawy et al. (2012). Aromatase inhibitors or gonadotropin-releasing hormone agonists for the management of uterine adenomyosis: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2012.01350.x

Badawy et al. (2009). Extended letrozole therapy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.04.065

Badawy et al. (2008). Luteal phase clomiphene citrate for ovulation induction in women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.016

Badawy et al. (2007). N-Acetyl cysteine and clomiphene citrate for induction of ovulation in polycystic ovary syndrome: A cross-over trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340601090337

Badawy et al. (2009). Clomiphene citrate or anastrozole for ovulation induction in women with polycystic ovary syndrome? A prospective controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.08.034

Badawy et al. (2009). Pregnancy outcome after ovulation induction with aromatase inhibitors or clomiphene citrate in unexplained infertility. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340802638199

Abu Hashim et al. (2011). Intrauterine insemination versus timed intercourse with clomiphene citrate in polycystic ovary syndrome: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2010.01063.x

Abu Hashim et al. (2012). Randomized comparison of superovulation with letrozole versus clomiphene citrate in an IUI program for women with recently surgically treated minimal to mild endometriosis. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2011.01346.x

Shokeir et al. (2011). An RCT: use of oxytocin drip during hysteroscopic endometrial resection and its effect on operative blood loss and glycine deficit. Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2011.03.015

Shokeir et al. (2013). Does adjuvant long-acting gestagen therapy improve the outcome of hysteroscopic endometrial resection in women of low-resource settings with heavy menstrual bleeding? Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2012.11.006

Badawy & Gibreal (2011). Clomiphene citrate versus tamoxifen for ovulation induction in women with PCOS: A prospective randomized trial. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2011.07.015

Shokeir et al. (2013). Reducing blood loss at abdominal myomectomy with preoperative use of dinoprostone intravaginal suppository: A randomized placebo-controlled pilot study. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2012.09.014

Articles that have been retracted

Badawy et al. (2009). Clomiphene citrate or aromatase inhibitors for superovulation in women with unexplained infertility undergoing intrauterine insemination: A prospective randomized trial. Fertility and Sterilityhttps://doi.org/10.1016/j.fertnstert.2008.06.013

Badawy et al. (2009). Clomiphene citrate or letrozole for ovulation induction in women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterilityhttps://doi.org/10.1016/j.fertnstert.2007.02.062

Badawy et al. (2008). Anastrozole or letrozole for ovulation induction in clomiphene-resistant women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterilityhttps://doi.org/10.1016/j.fertnstert.2007.05.010

Abu Hashim et al. (2010). Letrozole versus combined metformin and clomiphene citrate for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterilityhttps://doi.org/10.1016/j.fertnstert.2009.07.985

El-Refaie et al. (2015). Vaginal progesterone for prevention of preterm labor in asymptomatic twin pregnancies with sonographic short cervix: A randomized clinical trial of efficacy and safety. Archives of Gynecology and Obstetricshttps://doi.org/10.1007/s00404-015-3767-1


The two articles marked with a * above are the only ones in which I did not identify any problems; in each of these articles all of the statistical tests are marked as either "S" (significant) or "NS" (not significant) and none of the calculations that I performed resulted in the opposite verdict for any test.

02 October 2021

Applying John Carlisle's "Table 1" analysis to some claims about the treatment of Covid-19 using anti-androgens

Back in 2017, Dr. John Carlisle published this article in which he described a novel method for examining the table of participants' baseline characteristics (usually Table 1) in reports of randomised trials. Table 1 is typically used to show that the randomisation "worked", which in general means that the groups do not differ on important variables more often than we would expect by chance (with the p value of the final comparison expected to, in effect, "mop up" any differences that do occur).

[[ Update 2023-06-04 15:40 UTC: Please read my additional comments just before the "Materials" section of this post. ]]

Carlisle's insight was that it is possible for the baseline characteristics to be too similar across groups. That is, in some cases, we do not see the random variation that would be expect if the assignment to groups is truly random. For example, if you have 100 participants and 20 of them have some condition (say, diabetes), while a 10–10 split across two groups is the most likely individual outcome (about 17.6% of the time), you would expect a 7–13 (or more extreme) split about 26.5% of the time. If Table 1 contains a large number of even or near-even splits, that can be a sign that the randomisation was not done as reported, because there is just not enough genuine randomness in the data.

We can quantify the degree of randomness that is present by looking at the p values for the different between-group baseline tests in Table 1. If all of the variables are truly randomised, we would expect these p values to be uniformly distributed with a mean of 0.50. With a sufficiently large number of variables, we would expect 10% of the p values to be between 0 and 0.1, a further 10% to be between 0.1 and 0.2, and so on. If we see mostly p values above 0.8 or 0.9, this suggests that the baseline similarities between the groups could be too good to be true. A statistical test attributed to Stouffer and Fisher can be used to determine the probability of the set of p values that we observe being due to chance.

Carlisle's method has limitations, some of which I blogged about at the time (see also the comments under that post, including a nice reply from John Carlisle himself) and some of which were mentioned by proper methodologists and statisticians, for example here and here. Those limitations (principally, the possibility of non-independence of the observations being combined from Table 1, and their sometimes limited number) should be borne in mind when reading what follows, but (spoiler alert) I do not think they are sufficient to explain the issues that I report here.

In this post, I'm going to apply Carlisle's method, and the associated Stouffer-Fisher test, to two articles that have made claims of remarkably large positive effects of anti-androgenic drugs for the treatment of Covid-19:

Cadegiani, F. A., McCoy, J., Wambier, C. G., Vaño-Galván, S., Shapiro, J., Tosti, A., Zimerman, R. A., & Goren, A. (2021). Proxalutamide significantly accelerates viral clearance and reduces time to clinical remission in patients with mild to moderate COVID-19: Results from a randomized, double-blinded, placebo-controlled trialCureus13(2), e13492. https://doi.org/10.7759/cureus.13492

Cadegiani, F. A., McCoy, J., Wambier, C. G., & Goren, A. (2021). Early antiandrogen therapy with dutasteride reduces viral shedding, inflammatory responses, and time-to-remission in males with COVID-19: A randomized, double-blind, placebo-controlled interventional trial (EAT-DUTA AndroCoV Trial – Biochemical). Cureus, 13(2), e13047. https://doi.org/10.7759/cureus.13047

I will refer to these as Article 1 and Article 2, respectively. Article 2 also seems to be closely related to this preprint, which reports results from what appear to be a superset of its participants; I will refer to this as Article 3. Below, I also report the results of the analyses using Carlisle's method on this preprint; however, because of the similarity between the two samples I don't think that it would be fair to the authors to claim that the issues that I report here have been found in three, rather than two, articles.

Cadegiani, F. A., McCoy, J., Wambier, C. G., & Goren, A. (2020). 5-alpha-reductase inhibitors reduce remission time of COVID-19: Results from a randomized double blind placebo controlled interventional trial in 130 SARS-CoV-2 positive men. medRxiv. https://doi.org/10.1101/2020.11.16.20232512


For each article, I extracted the contents of Table 1 to a text file and, using global commands in the "vim" text editor as far as possible, converted each line into a call to a custom-written function that calculated the p value for that variable.

For variables where the p value is calculated from an independent-samples t test, my custom function determined the maximum possible t statistic (and, hence, the minimum possible p value), by adding the maximum possible rounding error to the larger mean, subtracting the same maximum possible rounding error from the smaller mean, and subtracting the maximum possible rounding error from the standard deviation of each group (cf. Brown & Heathers, 2019, "Rounded Input Variables, Exact Test Statistics (RIVETS)", https://psyarxiv.com/ctu9z/). I believe that doing this works in the authors' favour, as the majority of the p values in these analyses come from contingency tables and are rather large; that is, getting the smallest possible p value from the t tests tends to increase the overall Stouffer-Fisher test p value.

For the majority of the variables, where the p value is derived from an analysis of 2x2 contingency tables, my custom function applied the following rules:
  • If any cell of the table contains 0, return NULL; the variable will not be considered to have returned a p value).
  • If any cell of the table contains a number less than 5, apply Fisher's exact test. This is of course an arbitrary distinction (but it doesn't make too much difference anyway).
  • Otherwise, apply a chi-square test with Yates' continuity correction for 2x2 tables.
Other analyses are possible, but this was what I decided to do a priori. I call this "Analysis 1a" below. In Analysis 1b I included tables where one or more cells are zero (also using Fisher's exact test). In Analysis 1c I excluded all variables where one or more cells had a value less then 3, so that any variable where 2 or fewer people in either condition had or did not have the attribute in question we excluded. In Analyses 2a, 2b, and 2c I  applied the same rules for inclusion as in 1a, 1b, and 1c, respectively, but I used the chi-square test throughout. (This means that Analyses 1c and 2c are identical, as there are no variable in 1c for which my rules would mean that Fisher's exact test would be used.) In Analyses 3a, 3b, and 3c I used Fisher's exact test throughout.

After calculating the p values, I replaced any that were greater than 0.98 with exactly 0.98, which avoids problems with the calculation of the Stouffer-Fisher formula in R with values of exactly 1.0 (which will occur, for example, if the number of cases in a contingency table is identical across conditions, or differs only by 1). Again, I believe that this choice works in the authors' favour. Then I calculated the overall Stouffer-Fisher test p value formula using the method that I described in my blog post about Carlisle's article:
  1. Convert each p value into a z score.
  2. Sum the z scores.
  3. If there are k scores, divide the sum of the z scores from step 2 by the square root of k.
  4. Calculate the one-tailed p value associated with the overall z score from step 3.

For example, for Article 1 with Analysis 2c (see the table below), 26 p values are retained from 55 variables: (0.048, 0.431, 0.703, 0.782, 0.973, 0.298, 0.980, 0.682, 0.817, 0.897, 0.826, 0.328, 0.980, 0.980, 0.227, 0.424, 0.918, 0.884, 0.959, 0.353, 0.980, 0.511, 0.980, 0.512, 0.980, 0.980). These correspond to the z scores (-1.662, -0.175, 0.533, 0.779, 1.924, -0.531, 2.054, 0.473, 0.904, 1.265, 0.939, -0.446, 2.054, 2.054, -0.749, -0.193, 1.393, 1.198, 1.739, -0.376, 2.054, 0.028, 2.054, 0.030, 2.054, 2.054), which sum to 21.448. We divide this by the square root of 26 (i.e., 5.099) to get an overall z score of 4.206, which in turn gives a p value of 0.000013 (1.30E-05).

Note that we do not take the absolute value of each z score, because the sign is important. A p value below/above 0.5 corresponds to a negative/positive z score. If  the positive and negative z scores cancel each other out, the overall Stouffer-Fisher p value will be 0.5, which is what we expect to see on average with perfect randomisation.


Here are the p values that I obtained from each article using each of the analysis methods described above. Note that a value of zero corresponds to a p value below 2.2E-16, the smallest value that R can calculate (on my computer anyway). The 10 columns to the right of the overall p value show the deciles of the distribution of p values of the individual tests that make up the overall score.

It can be readily seen that in Articles 1 and 2, which are the main focus of this post, the p value is very small, whatever the combination of analyses and exclusions that were performed, and the distribution of p values from the individual comparisons is very heavily skewed towards values above .7 and especially .9. Even when I excluded a large number of comparisons because there were fewer than 3 cases or non-cases in one or more conditions (Analyses 1c, 2c, and 3c), a decision that is heavily in favour of the authors, the largest p value I obtained for Article 1 was 0.0000337, and for Article 2 the largest p value was 0.00000000819. (As mentioned above, the results for Article 3 should not be considered to be independent from those of Article 2, but they do tend to confirm that there are potentially severe problems with the randomisation of participants in both reported versions of that study.)


The use of novel statistical methods should always be approached with an abundance of caution; indeed, as noted earlier, I have had my own criticisms of John Carlisle's approach in the past. (Now that I have adopted Carlisle's method in this post, I hope that someone else will take the part of the "Red Team" and perhaps show me where my application of it might be invalid!)

However, in this case, I believe that my comments from 2017 about the non-independence (and low number) of baseline measures do not apply here to anything like the same extent. In addition, I have run multiple analyses of each of the articles in question here, and tried to make whatever choice favoured the authors when the opportunity presented itself. (Of course, I could have made even more conservative choices; for example, I could have excluded all of the contingency tables where any of the cells were less than 6, or 10. But at some point, one has to accept at face value the authors' claims that their Table 1 results (a) are worth reporting and (b) demonstrate the success of their randomisation.)

Despite that, I obtained what seems to be substantial evidence that the randomisation of participants to conditions in the two studies that are the respective subjects of these articles may not have taken place exactly as reported. This adds to a growing volume of public (and, in some cases, official) critique of the Covid-19 research coming from this laboratory (e.g., in English, here, herehere, here, and here, or in Portuguese here, here, here, here, here, here, and here).

[[ Update 2023-06-04 15:40 UTC: Two alert comments mentioned the preprint by Daniel Tausk which points out a limitation of the use of Carlisle's method with dichotomous variables; my comment that "The use of novel statistical methods should always be approached with an abundance of caution" was apparently prescient. Tausk's analysis suggests that the p values that I calculated above may have been rather too small; he suggests 0.017 for the first and 0.0024 for the second. These might not, on their own, constitute strong enough evidence to launch an inquiry into these studies, but it seems to me that they are still of interest in view of the other critiques of the work from this laboratory. I will let readers make up their own minds. ]]


My R code is available here. The three referenced articles are published either under open access conditions or as a preprint, so I hope they will be easy for readers to download when checking my working.

[[ Update 2021-10-03 12:56 UTC: Added the columns of p value deciles to the table of results. ]]