28 October 2021

A catastrophic failure of peer review in obstetrics and gynaecology

In this post I will discuss a set of 46 articles from the same institution that appear to show severe problems in many journals in the field of obstetrics and gynaecology. These are not entirely new discoveries; worrying overlaps among 35 of these articles have already been investigated in a commentary article from 2020 by Esmée Bordewijk and colleagues that critiqued 24 articles on which Dr Ahmed Badawy was lead author (19) or a co-author (5), plus 11 articles lead-authored by Dr Hatem Abu Hashim, who is Dr Badawy's colleague in the Department of Obstetrics and Gynaecology at Mansoura University in Egypt.

Bordewijk et al. reported that they had detected a large number of apparent duplications in the summary statistics across those articles, which mostly describe randomized controlled trials carried out in the Mansoura ObGyn department. Nine of these articles appear as chapters in Dr Badawy's PhD thesis, which he defended in December 2008 at the University of Utrecht in the Netherlands.

I think it is fair to say that Dr Badawy was not especially impressed by the arguments in Bordewijk et al.'s commentary; indeed, he wrote a reply with the uncompromising title of "Data integrity of randomized controlled trials: A hate speech or scientific work?" in which he questioned, among other things, the simulation techniques that Bordewijk et al. had used to demonstrate how unlikely it was that the patterns that they had observed across the 35 articles that they examined had arisen by chance.

The senior author on the Bordewijk et al. commentary was Dr Ben Mol of Monash University in Melbourne, Australia. Since their commentary was published, Dr Mol and his colleagues have been attempting to get the journal editors who published the 35 articles in question to take some form of action on them. To date, five articles have been retracted and another 10 have expressions of concern. The story, including its potential legal fallout, has been covered in considerable detail at Retraction Watch in December 2020 and again in August 2021.

However, to some extent, Dr Badawy may have a point: The evidence presented in the commentary is circumstantial and depends on a number of probabilistic assumptions, which editors may not be inclined to completely trust (although personally I find Bordewijk et al.'s analysis thoroughly convincing). And, as an editor, even if you believe that two or more articles are based on recycled data or summary statistics, how are you to know that the one in your journal is not the original ("good") one?

Fortunately (at least from the point of view of the error correction process) there is a much simpler approach to the problem at hand. It can be shown that almost all of the articles that were analysed by Bordewijk et al.—plus a few more that did not make it into their commentary—have very substantial statistical flaws at the level of each individual article. In my opinion, in most cases these errors would justify a rapid retraction based solely on the evidence that is to be found in each article's PDF file. There is no need for simulations or probability calculations; in the majority of cases, the numbers sitting there in the tables of results are demonstrably incorrect.

General description of the articles

As mentioned above, for this blog post I examined 46 articles from the Department of Obstetrics and Gynaecology at Mansoura University. Of these, 35 had already been analysed by Bordewijk et al., and the rest were included either at the suggestion of Ben Mol or after I searched for any other empirical studies that I could find in the Google Scholar profiles of Dr Badawy and Dr Abu Hashim. Seven of the 46 articles had neither Dr Badawy nor Dr Abu Hashim as co-authors, but for all seven of those Dr Tarek Shokeir was listed as a co-author (or, in one case, sole author).

These articles mostly describe RCTs of various interventions for conditions such as infertility, heavy menstrual bleeding, polycystic ovary syndrome, preterm labour, or endometriosis. Several of them have more than 100 citations according to Google Scholar. The studies seem to be well-powered, many with more than 100 participants in each group (e.g., for this one the authors claimed to have recruited 996 infertile women), and it is not hard to imagine that their findings may be affecting clinical practice around the world.

The typical article is relatively short, and contains a baseline table comparing the groups of patients (usually two), followed by one or sometimes more tables comparing the outcomes across those groups. These are usually expressed as simple unpaired comparisons of parameters (e.g., height, with mean and standard deviation reported for each group), or as tests of proportions (e.g., in the treatment group X% of N1 participants became pregnant, versus Y% of N2 participants in the control group). The statistics are therefore for the most part very simple; for example, there are no logistic regressions with covariates. This means that we can readily check most of the statistics from the tables themselves.

The t statistics

First up, I note that in about half of these articles, no t statistics at all are reported for the comparisons of continuous variables across groups. Sometimes we get just a p value. In other cases we are only told that individual comparisons (or all of the comparisons in a table, via a note at the end) were statistically significant or not; typically we are left to infer that that means p < 0.05. (In a few articles the authors reported using the Mann-Whitney U test when data were not normally distributed, but they do not generally indicate which variables are concerned by this in each case.)

In quite a few cases the errors in the implicit t statistics are visible from space, as in this example from 10.1111/j.1447-0756.2010.01383.x:

(This table has been truncated on the right in order to fit on the page.)

Have a look at the "Fasting glucose" numbers (fourth line from the bottom). The difference between the means is 5.4 (which means a minimum of 5.3 even after allowing for rounding) and just by approximating a weighted mean you can see that the pooled SD is going to be about 1.6, so this is a Cohen's d of around 3.3, which is never going to be non-significant at the 0.05 level. You don't have to carry the formula for the pooled standard error around in your head to know that with df = 136 the t statistic here is going to be huge, and indeed it is: the minimum t value is 17.31, the midpoint is 18.17, the maximum is 19.09, p = homeopathic. (Aside: physiologists might wonder about the homogeneity of the testosterone levels in the ovulation group, with an SD of just 0.01.)
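
For readers who would like the formula in front of them anyway, these are the standard textbook definitions of the pooled SD, Cohen's d, and the unpaired t statistic, with df = n_1 + n_2 - 2. They are quoted here as a reminder, not taken from any of the articles:

```latex
s_p = \sqrt{\frac{(n_1 - 1)\,s_1^2 + (n_2 - 1)\,s_2^2}{n_1 + n_2 - 2}}, \qquad
d = \frac{\bar{x}_1 - \bar{x}_2}{s_p}, \qquad
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}
  = d \sqrt{\frac{n_1 n_2}{n_1 + n_2}}
```

The last form shows why a large d cannot hide: with 138 participants in total, the multiplier is close to 5.9 when the two groups are of equal size, and it shrinks only slowly as the split becomes unbalanced.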

When we do get to see t statistics, the majority are incorrect given the mean, SD, and sample size, and even after allowing for rounding (but see also our RIVETS preprint for what happens if test statistics are derived with too much allowance for rounding). See this table from 10.1016/j.fertnstert.2008.06.013:

Several of the effect sizes here are huge; for example, it is close to d = 4 on the first line, for which the correct t statistic is not 4.4 but instead somewhere between 38.71 and 40.26. If you are more familiar with ANOVA, you can square those numbers to get the corresponding F statistic. (Spoiler: F = 1,500 is a lot.)

The overall results for the set of 46 articles are disastrous. In most of the articles the majority of the p values are incorrect, sometimes by a wide margin. Even in those articles where the authors did not explicitly report which test they used, leaving open the possibility that they might have used the Mann-Whitney U test, the discrepancies between the reported p values and those that I obtained from a t test were very often such that even the discovery that the Mann-Whitney test had been used would not be sufficient to explain them.

[Begin update 2021-10-28 19:57 UTC]
Brenton Wiernik asked whether it was possible that the authors had accidentally reported the variability of their means as the standard error of the mean (SEM) rather than the standard deviation. I must confess that when doing the analyses I did not consider this possibility, not least because all of the SDs for things like age (around 3) or height (around 5–6cm) seemed quite reasonable, although perhaps a bit on the low side of what one might expect. 

However, I did identify seven articles (10.3109/01443615.2010.497873, 10.1016/j.fertnstert.2008.04.065, 10.1016/j.fertnstert.2008.06.013, 10.1016/j.fertnstert.2007.08.034, 10.1016/s1472-6483(10)60148-4, 10.1016/j.fertnstert.2007.05.010, and 10.1016/j.fertnstert.2007.02.062) in which the authors claimed that their measure of variability (typically written as ± and a number after the mean) was indeed the SEM. But the sample sizes in these papers are such that the implied SDs would need to be huge. Across these seven papers, the implied SD for age would range from 29.50 to 62.39 years, and for height from 54.79 to 124.77 cm. I therefore stand by my interpretation that these values were intended by the authors to be standard deviations, although that then gives them another question to answer, namely why they claimed them to be SEM when those numbers are patently absurd.
[End update 2021-10-28 19:57 UTC]
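
The SEM check in the update above rests on one line of arithmetic: SD = SEM × √n. Here is a minimal Python sketch of it. The function name and the example numbers are hypothetical and chosen only to illustrate the scale of the problem; they are not taken from any of the seven articles.

```python
import math


def implied_sd(sem: float, n: int) -> float:
    """Standard deviation implied by a reported SEM and a group size."""
    return sem * math.sqrt(n)


# Hypothetical example: "age 27.1 ± 3.0" in a group of 100 women.
# If 3.0 really were the SEM, the SD of age would have to be 30 years.
print(implied_sd(3.0, 100))
```

An SD of around 3 years for age is plausible for a sample of women of reproductive age; an SD of 30 years is not.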

The Χ² (etc) statistics

In examining whether the tests of proportions had been reported correctly, I included only those articles (30 out of the total set of 46) that contained at least one exact numerical (i.e., not "NS" or "<0.001") p value from a Pearson chi-square test or Fisher's exact test of a 2x2 contingency table. If the authors also reported Χ² statistics and/or odds ratios, I also included those numbers. I then examined the extent to which these statistics matched the values that I calculated from the underlying data. When the subsample sizes were very small, I allowed the authors some more leeway, as the Pearson chi-square test does not always perform well in these cases.

As with the t test results (see previous section), the overall results revealed a large number of incorrect p values in almost every article for which I recalculated the tests of proportions.

Perhaps the most indisputable source of errors is situations in which what is effectively the same test is reported twice, with different chi-square statistics (if those are reported) and different p values, even though those values are necessarily identical. I counted 8 examples of this across 7 different articles. For example, consider this table from 10.3109/01443615.2010.508850:

You're either pregnant or you aren't. I don't make the rules.

After 6 months in the study, every participant either had, or had not, become pregnant. So the contingency table for the first outcome ("No pregnancy") is ((141, 114), (150, 107)) and for the second outcome ("Clinical pregnancy") it is ((114, 141), (107, 150)). Those will of course give exactly the same result, meaning that at least one of the Χ²/p-value pairs must be wrong. In fact the correct numbers are Χ²(1) = 0.49, p = 0.48, which means that neither the Χ² test statistics nor the p values in the first two lines of the table are even remotely valid. For that matter, neither (incorrect) p value in the table even matches its corresponding (incorrect) Χ² statistic; I will let you check this for yourself as an exercise.
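
If you want to check these numbers without opening a spreadsheet, the following Python sketch computes the Pearson chi-square statistic (without continuity correction) for a 2x2 table using only the standard library. The function name is my own, and the p value uses the fact that, with 1 degree of freedom, the upper tail of the chi-square distribution is erfc(√(Χ²/2)).

```python
import math


def pearson_chi2_2x2(table):
    """Pearson chi-square (no continuity correction) and p value for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p


# The two contingency tables quoted in the text above.
no_pregnancy = ((141, 114), (150, 107))
clinical_pregnancy = ((114, 141), (107, 150))

for name, table in [("No pregnancy", no_pregnancy),
                    ("Clinical pregnancy", clinical_pregnancy)]:
    chi2, p = pearson_chi2_2x2(table)
    print(f"{name}: chi2(1) = {chi2:.2f}, p = {p:.2f}")
```

Swapping the two columns changes the sign of ad − bc but not its square, which is why both tables must give the same statistic.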

The p values

There are a couple of basic things to keep in mind when reading tables of statistics:

  • A t statistic of 1.96 with 100 or more degrees of freedom gives a rounded two-tailed p value of 0.05 (although if you want it to be strictly less than 0.05000, you need a t statistic of 1.984 with 100 dfs).
  • For any given number of degrees of freedom, a larger t or Χ² statistic gives a smaller p value.

With that in mind, let's look at this table (from 10.1016/j.fertnstert.2007.05.010), which I believe to be entirely typical of the articles under discussion here:

The good news is that the percentages and the Χ² statistics check out OK!

Neither of the above bits of very basic knowledge is respected here. First we have t(228) = 2.66, p = 0.1 (when the correct p value for that t is 0.008, although in any case the correct t statistic for the given means and SDs would be between 5.05 and 5.83). Second, between FSH (follicle-stimulating hormone) and LH (luteinizing hormone), the t statistic goes up and so does the p value (which is also clearly incorrect in both cases).
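
Both rules of thumb can be checked with a few lines of code. The sketch below is one way to get a two-tailed p value from a t statistic using only the Python standard library; it relies on the textbook identity p = I_x(df/2, 1/2) with x = df/(df + t²), where I is the regularized incomplete beta function, evaluated with a standard continued fraction. The function names are my own, and this is an illustration, not the method used in my Excel file.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-13):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300

    def guard(v):
        # Avoid division by (near) zero inside the recurrence.
        return tiny if abs(v) < tiny else v

    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 / guard(1.0 - qab * x / qap)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the continued fraction.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        h *= d * c
        # Odd step of the continued fraction.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 / guard(1.0 + aa * d)
        c = guard(1.0 + aa / c)
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def t_two_tailed_p(t, df):
    """Two-tailed p value for a Student's t statistic with df degrees of freedom."""
    return reg_inc_beta(df / 2.0, 0.5, df / (df + t * t))


print(t_two_tailed_p(1.96, 100))   # rounds to 0.05
print(t_two_tailed_p(1.984, 100))  # the "strictly less than 0.05" case
print(t_two_tailed_p(2.66, 228))   # the text above gives 0.008 for this one
```

Whatever value of t you feed in, a larger |t| at the same df always produces a smaller p, so a table in which both go up together cannot be right.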

A number of the articles contain p values that are literally impossible (i.e., greater than 1.0; don't @ me to tell me about that time you did a Bonferroni correction by multiplying the p value instead of dividing the alpha). See 10.3109/01443615.2010.497873 (Table 1, "Parity", p = 1.13; see also "Other inconsistencies", below), 10.1016/j.fertnstert.2008.04.065 (Table 1, "Height", p = 1.01), and 10.1007/s00404-013-2866-0, which contains no fewer than four examples across its Tables 1 and 2:

1.12, 1.22, 1.32, 1.42: The impossible p values form a nice pattern.

The confidence intervals

A few of the articles have confidence intervals in the tables, perhaps added at the insistence of a reviewer or editor. But in most cases the point estimate falls outside the confidence interval. Sometimes this can become quite absurd, as in the following example (from 10.1016/j.fertnstert.2007.08.034). Those CI limits are ± 1.96 standard errors either side of... what exactly?

(Once you look beyond the CIs, there is a "bonus" waiting in the p value column here.)

Other inconsistencies

Within these 46 articles it is hard not to notice a considerable number of other inconsistencies, which make the reader wonder how much care and attention went into both the writing and review processes. These tables from 10.3109/01443615.2010.497873 provide a particularly egregious example, with the appearance of 41 and 42 extra patients in the respective groups between baseline and outcome. (As a bonus, we also have a p value of 1.13.)

(Some white space has been removed from this table.)

These results from 10.1016/j.ejogrb.2012.09.014 make no sense. The first comparison was apparently done using Fisher's exact test, the second with Pearson's Χ² test, and the third, well, your guess is as good as mine. But there is no reason to use the two different types of test here, and even less reason to use Fisher's test for the larger case numbers and Pearson's for the smaller ones. (The p values are all incorrect, and would be even if the other test were to have been used for every variable.)

(Some white space and some results that are not relevant to the point being illustrated have been removed from this table.)

Finally, fans of GRIM might also be interested to learn that two articles show signs of possible inconsistencies in their reported means:
10.1016/j.rbmo.2014.03.011, "Age" (both groups) and "Infertility" (control group)
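
For anyone who has not met GRIM before: it asks whether a reported mean is arithmetically possible, given that the underlying data are whole numbers and the sample size is known. The Python sketch below is a minimal version of the idea. The function name and the example values are hypothetical, and it assumes that the variable in question (for example, age in completed years) really was recorded as an integer.

```python
def grim_consistent(mean: float, n: int, decimals: int = 2) -> bool:
    """Return True if some integer total divided by n rounds to the reported mean."""
    half_unit = 0.5 * 10 ** (-decimals)
    centre = round(mean * n)
    # Try the integer totals nearest to mean * n. Either rounding direction is
    # accepted at the boundary, plus a tiny tolerance for floating-point error.
    return any(
        abs(total / n - mean) <= half_unit + 1e-9
        for total in (centre - 1, centre, centre + 1)
    )


# Hypothetical example: with n = 28, the possible means near 5.2 are
# 145/28 = 5.18 and 146/28 = 5.21, so a reported mean of 5.19 is impossible.
print(grim_consistent(5.19, 28))
print(grim_consistent(5.18, 28))
```

The test only has teeth when n is small relative to the reporting precision (roughly, n below 100 for means reported to two decimal places), which is why it flags so few of the articles here.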


The results of these analyses seem to indicate that something has gone very badly wrong in the writing, reviewing, and publication of these articles. Even though I tried to give the published numbers the benefit of the doubt as far as possible, I estimate that across these 46 articles, 346 (64%) of the 542 parametric tests (unpaired t tests, or, occasionally, ANOVA) and 151 (61%) of the 247 contingency table tests (Pearson's Χ² or Fisher's exact test) that I was able to check were incorrectly reported. I don't think that anybody should be relying on the conclusions of these articles as a guide to practice, and I suspect that the only solution for most of them will be retraction. (As already mentioned, five have already been retracted following the publication of Bordewijk et al.'s commentary.)

I have a few aims in writing this post.

First, I want to do whatever I can to help get these misleading (at best) papers retracted from the medical literature, where they would seem to have considerable potential to do serious harm to the health of women, especially those who are pregnant or trying to overcome infertility.

Second, I aim to show some of the techniques that can be used to detect obvious errors in published articles (or in manuscripts that you might be reviewing).

Third, and the most important reason for doing all this work (it took a lot of hours to do these analyses, as you will see if you download the Excel file!), is to draw attention to the utter failure of peer review that was required in order for most of these articles to get published. They appeared in 13 different journals, none of which would appear to correspond to most people's idea of a "predatory" outlet. It is very tempting to imagine that nobody, whether editors, reviewers, Dr Badawy's thesis committee at the University of Utrecht, or readers of the journals (until Ben Mol and Esmée Bordewijk came along), even so much as glanced at the tables of results in these articles, given that they almost all contain multiple impossible numbers.

It is true that the majority of these articles are more than 10 years old, but I wonder how much has changed in the publication processes of medical journals since then. The reality of scientific peer review seems to be that, to a first approximation, nobody ever checks any of the numbers. I find that deeply worrying.

Supporting documents

The majority of the analyses underlying this post have been done with Microsoft Excel 2003. I have some R code that can do the same thing, but it seemed to me to make more sense to use Excel as the process of copying and pasting numbers from the tables in the articles was a lot more reliable, requiring only a text editor to replace the column separators with tab characters. I used my R code to compute the test statistics in a couple of cases where there were more than two groups and so I had to use the rpsychi package to calculate the results of a one-way ANOVA.

In my Excel file, each unpaired t test is performed on a separate line. The user enters the mean, standard deviation, and sample size for each of the two conditions, plus an indication of the rounding precision (i.e., the number of decimal places) for the means and SDs separately. The spreadsheet then calculates (using formulas that you can find in columns whose width I have in most cases reduced to zero) the minimum and maximum possible (i.e., before rounding) means and SDs, and from that it determines the minimum, notional (i.e., assuming that the rounded input values are exact), and maximum t statistics and the corresponding p values. It then highlights those t statistics (if available) and p values (or "significant/not-significant" claims) from the article that are not compatible with any point in the possible ranges of values. That is, at all times, I give the maximum benefit of the doubt to the authors. (Similar considerations apply, mutatis mutandis, to the table of chi-square tests in the same Excel file.)
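
For readers who prefer code to spreadsheets, the logic described above can be sketched in Python roughly as follows. This is an illustration of the idea, not a transcription of my Excel formulas: the function names and the example numbers are hypothetical, it assumes a Student's (pooled-variance) t test, and it assumes that every reported SD is larger than half a unit in its last decimal place.

```python
import math


def pooled_t(diff, sd1, n1, sd2, n2):
    """Unpaired (pooled-variance) t statistic for a given difference in means."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))


def t_range(m1, sd1, n1, m2, sd2, n2, mean_dp, sd_dp):
    """Minimum, notional, and maximum |t| compatible with the rounded inputs."""
    half_m = 0.5 * 10 ** (-mean_dp)  # largest rounding error in each mean
    half_s = 0.5 * 10 ** (-sd_dp)    # largest rounding error in each SD
    diff = abs(m1 - m2)
    # Smallest |t|: means as close together as rounding allows, SDs as large.
    t_min = pooled_t(max(diff - 2 * half_m, 0.0), sd1 + half_s, n1, sd2 + half_s, n2)
    # Notional |t|: take the reported values at face value.
    t_mid = pooled_t(diff, sd1, n1, sd2, n2)
    # Largest |t|: means as far apart as rounding allows, SDs as small.
    t_max = pooled_t(diff + 2 * half_m, sd1 - half_s, n1, sd2 - half_s, n2)
    return t_min, t_mid, t_max


def compatible(reported_t, t_min, t_max):
    """A reported t is acceptable if it falls anywhere in the possible range."""
    return t_min <= abs(reported_t) <= t_max


# Hypothetical example: 25.4 ± 3.1 (n = 50) versus 24.1 ± 2.9 (n = 50),
# with means and SDs both reported to one decimal place.
t_min, t_mid, t_max = t_range(25.4, 3.1, 50, 24.1, 2.9, 50, mean_dp=1, sd_dp=1)
print(f"t between {t_min:.2f} and {t_max:.2f} (notional {t_mid:.2f})")
print(compatible(4.4, t_min, t_max))
```

Only a reported value that falls outside the whole interval from `t_min` to `t_max` gets flagged, which is what giving the authors the maximum benefit of the doubt means in practice.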

The documents (an Excel file for the main analyses and some R code for the bits that I couldn't work out how to do easily in Excel) are available here. The article PDFs are all copyrighted and I cannot share them, but if you do not have institutional access then there is always the site whose name rhymes with Dry Club.

Appendix: List of examined articles

Articles that have not been retracted and have no expression of concern

Badawy et al. (2009). Gonadotropin-releasing hormone agonists for prevention of chemotherapy-induced ovarian damage: Prospective randomized study. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.12.044

Badawy et al. (2007). Induction of ovulation in idiopathic premature ovarian failure: A randomized double-blind trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60711-0

Badawy et al. (2010). Clomiphene citrate or aromatase inhibitors combined with gonadotropins for superovulation in women undergoing intrauterine insemination: A prospective randomised trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.497873

Badawy et al. (2009). Ultrasound-guided transvaginal ovarian needle drilling (UTND) for treatment of polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.044

Badawy et al. (2008). Low-molecular weight heparin in patients with recurrent early miscarriages of unknown aetiology. Journal of Obstetrics and Gynaecology. https://doi.org/10.1080/01443610802042688

Badawy et al. (2010). Laparoscopy--or not--for management of unexplained infertility. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.508850

Badawy et al. (2007). Plasma homocysteine and polycystic ovary syndrome: The missed link. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2006.10.015

Badawy et al. (2008). Extending clomiphene treatment in clomiphene-resistant women with PCOS: A randomized controlled trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60148-4

Badawy et al. (2006). Clomiphene citrate plus N-acetyl cysteine versus clomiphene citrate for augmenting ovulation in the management of unexplained infertility: A randomized double-blind controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2006.02.097

Badawy et al. (2007). Randomized controlled trial of three doses of letrozole for ovulation induction in patients with unexplained infertility. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)61046-2

Fawzy et al. (2007). Treatment options and pregnancy outcome in women with idiopathic recurrent miscarriage: A randomized placebo-controlled study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-007-0527-x

Gibreel et al. (2012). Endometrial scratching to improve pregnancy rate in couples with unexplained subfertility: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2012.02016.x

Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus highly purified FSH for ovulation induction in clomiphene-resistant PCOS women: A randomised controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2010.488771

Abu Hashim et al. (2010). Letrozole versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-010-1566-2

Abu Hashim et al. (2011). Laparoscopic ovarian diathermy after clomiphene failure in polycystic ovary syndrome: is it worthwhile? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1983-x

Abu Hashim et al. (2012). Contraceptive vaginal ring treatment of heavy menstrual bleeding: A randomized controlled trial with norethisterone. Contraception. https://doi.org/10.1016/j.contraception.2011.07.012

Abu Hashim et al. (2010). N-acetyl cysteine plus clomiphene citrate versus metformin and clomiphene citrate in treatment of clomiphene-resistant polycystic ovary syndrome: A randomized controlled trial. Journal of Women's Health. https://doi.org/10.1089/jwh.2009.1920

Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2010.01383.x

Abu Hashim et al. (2011). Minimal stimulation or clomiphene citrate as first-line therapy in women with polycystic ovary syndrome: A randomized controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2011.589924

Abu Hashim et al. (2011). Does laparoscopic ovarian diathermy change clomiphene-resistant PCOS into clomiphene-sensitive? Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1931-9

Abu Hashim et al. (2013). LNG-IUS treatment of non-atypical endometrial hyperplasia in perimenopausal women: A randomized controlled trial. Journal of Gynecologic Oncology. https://doi.org/10.3802/jgo.2013.24.2.128

Marzouk et al. (2014). Lavender-thymol as a new topical aromatherapy preparation for episiotomy: A randomised clinical trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2014.970522

Ragab et al. (2013). Does immediate postpartum curettage of the endometrium accelerate recovery from preeclampsia-eclampsia? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-013-2866-0

El Refaeey et al. (2014). Combined coenzyme Q10 and clomiphene citrate for ovulation induction in clomiphene-citrate-resistant polycystic ovary syndrome. Reproductive Biomedicine Online. https://doi.org/10.1016/j.rbmo.2014.03.011

Seleem et al. (2014). Superoxide dismutase in polycystic ovary syndrome patients undergoing intracytoplasmic sperm injection. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s10815-014-0190-7

Shokeir (2006). Tamoxifen citrate for women with unexplained infertility. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-006-0181-8 *

Shokeir et al. (2016). Hysteroscopic-guided local endometrial injury does not improve natural cycle pregnancy rate in women with unexplained infertility: Randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/jog.13077

Shokeir et al. (2009). The efficacy of Implanon for the treatment of chronic pelvic pain associated with pelvic congestion: 1-year randomized controlled pilot study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-009-0951-1 *

Shokeir & Mousa (2015). A randomized, placebo-controlled, double-blind study of hysteroscopic-guided pertubal diluted bupivacaine infusion for endometriosis-associated chronic pelvic pain. International Journal of Gynecology & Obstetrics. https://doi.org/10.1016/j.ijgo.2015.03.043

Articles that are subject to an editorial Expression of Concern

Badawy et al. (2012). Aromatase inhibitors or gonadotropin-releasing hormone agonists for the management of uterine adenomyosis: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2012.01350.x

Badawy et al. (2009). Extended letrozole therapy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.04.065

Badawy et al. (2008). Luteal phase clomiphene citrate for ovulation induction in women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.016

Badawy et al. (2007). N-Acetyl cysteine and clomiphene citrate for induction of ovulation in polycystic ovary syndrome: A cross-over trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340601090337

Badawy et al. (2009). Clomiphene citrate or anastrozole for ovulation induction in women with polycystic ovary syndrome? A prospective controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.08.034

Badawy et al. (2009). Pregnancy outcome after ovulation induction with aromatase inhibitors or clomiphene citrate in unexplained infertility. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340802638199

Abu Hashim et al. (2011). Intrauterine insemination versus timed intercourse with clomiphene citrate in polycystic ovary syndrome: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2010.01063.x

Abu Hashim et al. (2012). Randomized comparison of superovulation with letrozole versus clomiphene citrate in an IUI program for women with recently surgically treated minimal to mild endometriosis. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2011.01346.x

Shokeir et al. (2011). An RCT: use of oxytocin drip during hysteroscopic endometrial resection and its effect on operative blood loss and glycine deficit. Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2011.03.015

Shokeir et al. (2013). Does adjuvant long-acting gestagen therapy improve the outcome of hysteroscopic endometrial resection in women of low-resource settings with heavy menstrual bleeding? Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2012.11.006

Badawy & Gibreal (2011). Clomiphene citrate versus tamoxifen for ovulation induction in women with PCOS: A prospective randomized trial. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2011.07.015

Shokeir et al. (2013). Reducing blood loss at abdominal myomectomy with preoperative use of dinoprostone intravaginal suppository: A randomized placebo-controlled pilot study. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2012.09.014

Articles that have been retracted

Badawy et al. (2009). Clomiphene citrate or aromatase inhibitors for superovulation in women with unexplained infertility undergoing intrauterine insemination: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.06.013

Badawy et al. (2009). Clomiphene citrate or letrozole for ovulation induction in women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.02.062

Badawy et al. (2008). Anastrozole or letrozole for ovulation induction in clomiphene-resistant women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.05.010

Abu Hashim et al. (2010). Letrozole versus combined metformin and clomiphene citrate for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2009.07.985

El-Refaie et al. (2015). Vaginal progesterone for prevention of preterm labor in asymptomatic twin pregnancies with sonographic short cervix: A randomized clinical trial of efficacy and safety. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-015-3767-1


The two articles marked with a * above are the only ones in which I did not identify any problems; in each of these articles all of the statistical tests are marked as either "S" (significant) or "NS" (not significant) and none of the calculations that I performed resulted in the opposite verdict for any test.

