In this post I will discuss a set of 46 articles from the same institution that appear to show severe problems in many journals in the field of obstetrics and gynaecology. These are not entirely new discoveries; worrying overlaps among 35 of these articles have already been investigated in a commentary article from 2020 by Esmée Bordewijk and colleagues that critiqued 24 articles on which Dr Ahmed Badawy was lead author (19) or a co-author (5), plus 11 articles lead-authored by Dr Hatem Abu Hashim, who is Dr Badawy's colleague in the Department of Obstetrics and Gynaecology at Mansoura University in Egypt.
Bordewijk et al. reported that they had detected a large number of apparent duplications in the summary statistics across those articles, which mostly describe randomized controlled trials carried out in the Mansoura ObGyn department. Nine of these articles appear as chapters in Dr Badawy's PhD thesis, which he defended in December 2008 at the University of Utrecht in the Netherlands.
I think it is fair to say that Dr Badawy was not especially impressed by the arguments in Bordewijk et al.'s commentary; indeed, he wrote a reply with the uncompromising title of "Data integrity of randomized controlled trials: A hate speech or scientific work?" in which he questioned, among other things, the simulation techniques that Bordewijk et al. had used to demonstrate how unlikely it was that the patterns that they had observed across the 35 articles that they examined had arisen by chance.
The senior author on the Bordewijk et al. commentary was Dr Ben Mol of Monash University in Melbourne, Australia. Since their commentary was published, Dr Mol and his colleagues have been attempting to get the journal editors who published the 35 articles in question to take some form of action on them. To date, five articles have been retracted and another 10 have expressions of concern. The story, including its potential legal fallout, has been covered in considerable detail at Retraction Watch in December 2020 and again in August 2021.
However, to some extent, Dr Badawy may have a point: The evidence presented in the commentary is circumstantial and depends on a number of probabilistic assumptions, which editors may not be inclined to completely trust (although personally I find Bordewijk et al.'s analysis thoroughly convincing). And, as an editor, even if you believe that two or more articles are based on recycled data or summary statistics, how are you to know that the one in your journal is not the original ("good") one?
Fortunately (at least from the point of view of the error correction process) there is a much simpler approach to the problem at hand. It can be shown that almost all of the articles that were analysed by Bordewijk et al.—plus a few more that did not make it into their commentary—have very substantial statistical flaws at the level of each individual article. In my opinion, in most cases these errors would justify a rapid retraction based solely on the evidence that is to be found in each article's PDF file. There is no need for simulations or probability calculations; in the majority of cases, the numbers sitting there in the tables of results are demonstrably incorrect.
General description of the articles
As mentioned above, for this blog post I examined 46 articles from the Department of Obstetrics and Gynaecology at Mansoura University. Of these, 35 had already been analysed by Bordewijk et al., and the rest were included either at the suggestion of Ben Mol or after I searched for any other empirical studies that I could find in the Google Scholar profiles of Dr Badawy and Dr Abu Hashim. Seven of the 46 articles had neither Dr Badawy nor Dr Abu Hashim as co-authors, but for all seven of those Dr Tarek Shokeir was listed as a co-author (or, in one case, sole author).
These articles mostly describe RCTs of various interventions for conditions such as infertility, heavy menstrual bleeding, polycystic ovary syndrome, preterm labour, or endometriosis. Several of them have more than 100 citations according to Google Scholar. The studies seem to be well-powered, many with more than 100 participants in each group (e.g., for this one the authors claimed to have recruited 996 infertile women), and it is not hard to imagine that their findings may be affecting clinical practice around the world.
The typical article is relatively short, and contains a baseline table comparing the groups of patients (usually two), followed by one or sometimes more tables comparing the outcomes across those groups. These are usually expressed as simple unpaired comparisons of parameters (e.g., height, with mean and standard deviation reported for each group), or as tests of proportions (e.g., in the treatment group X% of N1 participants became pregnant, versus Y% of N2 participants in the control group). The statistics are therefore for the most part very simple; for example, there are no logistic regressions with covariates. This means that we can readily check most of the statistics from the tables themselves.
The t statistics
First up, I note that in about half of these articles, no t statistics at all are reported for the comparisons of continuous variables across groups. Sometimes we get just a p value. In other cases we are only told that individual comparisons (or all of the comparisons in a table, via a note at the end) were statistically significant or not; typically we are left to infer that that means p < 0.05. (In a few articles the authors reported using the Mann-Whitney U test when data were not normally distributed, but they do not generally indicate which variables are concerned by this in each case.)
In quite a few cases the errors in the implicit t statistics are visible from space, as in this example from 10.1111/j.1447-0756.2010.01383.x:
(This table has been truncated on the right in order to fit on the page.)
Have a look at the "Fasting glucose" numbers (fourth line from the bottom). The difference between the means is 5.4 (which means a minimum of 5.3 even after allowing for rounding) and just by approximating a weighted mean you can see that the pooled SD is going to be about 1.6, so this is a Cohen's d of around 3.3, which is never going to be non-significant at the 0.05 level. You don't have to carry the formula for the pooled standard error around in your head to know that with df = 136 the t statistic here is going to be huge, and indeed it is: the minimum t value is 17.31, the midpoint is 18.17, the maximum is 19.09, p = homeopathic. (Aside: physiologists might wonder about the homogeneity of the testosterone levels in the ovulation group, with an SD of just 0.01.)
The tests of proportions
In examining whether the tests of proportions had been reported correctly, I included only those articles (30 out of the total set of 46) that contained at least one exact numerical (i.e., not "NS" or "<0.001") p value from a Pearson chi-square test or Fisher's exact test of a 2x2 contingency table. If the authors also reported Χ² statistics and/or odds ratios, I also included those numbers. I then examined the extent to which these statistics matched the values that I calculated from the underlying data. When the subsample sizes were very small, I allowed the authors some more leeway, as the Pearson chi-square test does not always perform well in these cases.
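For readers who want to repeat this kind of check on a 2x2 table, here is a minimal sketch of Fisher's exact test and the sample odds ratio in plain Python. It is not the code behind the analyses in this post (those were done in Excel and R); the function names `fisher_exact_two_sided` and `odds_ratio` are made up for illustration, and the two-sided p value uses the common convention of summing the probabilities of all tables that are no more likely than the observed one. The counts in the usage line are a textbook toy example, not data from any of the articles.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Sample odds ratio (a*d)/(b*c) for the table ((a, b), (c, d)).

    Simplification: returns infinity whenever b*c is zero.
    """
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p value for the table ((a, b), (c, d)).

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c

    # Unnormalised weight of the table whose top-left cell equals k.
    # Every weight shares the denominator comb(n, col1), so the integer
    # weights can be compared exactly, with no floating-point tolerance.
    def weight(k):
        return comb(row1, k) * comb(row2, col1 - k)

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    observed = weight(a)
    extreme = sum(
        weight(k) for k in range(k_min, k_max + 1) if weight(k) <= observed
    )
    return extreme / comb(row1 + row2, col1)


# Toy example (hypothetical counts): 3/4 successes versus 1/4 successes.
print(odds_ratio(3, 1, 1, 3), fisher_exact_two_sided(3, 1, 1, 3))
```

Because the weights are exact integers, the result does not depend on the order in which the rows or columns are entered.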
As with the t test results (see previous section), the overall results revealed a large number of incorrect p values in almost every article for which I recalculated the tests of proportions.
Perhaps the most indisputable source of errors is situations in which what is effectively the same test is reported twice, with different chi-square statistics (if those are reported) and different p values, even though those values are necessarily identical. I counted 8 examples of this across 7 different articles. For example, consider this table from 10.3109/01443615.2010.508850:
You're either pregnant or you aren't. I don't make the rules.
After 6 months in the study, every participant either had, or had not, become pregnant. So the contingency table for the first outcome ("No pregnancy") is ((141, 114), (150, 107)) and for the second outcome ("Clinical pregnancy") it is ((114, 141), (107, 150)). Those will of course give exactly the same result, meaning that at least one of the Χ²/p-value pairs must be wrong. In fact the correct numbers are Χ²(1) = 0.49, p = 0.48, which means that neither of the Χ² test statistics, nor the p values, in the first two lines of the table are even remotely valid. For that matter, neither (incorrect) p value in the table even matches its corresponding (incorrect) Χ² statistic; I will let you check this for yourself as an exercise.
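The recalculation for this table is easy to reproduce. The following is a minimal sketch in plain Python, assuming an uncorrected Pearson chi-square test (no Yates continuity correction); the function name `pearson_chi2_2x2` is invented for this illustration. With one degree of freedom, the p value can be obtained from the complementary error function, so no statistics library is needed.

```python
from math import erfc, sqrt


def pearson_chi2_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square test for the table ((a, b), (c, d)).

    Returns (chi2, p). With 1 degree of freedom, the upper tail of the
    chi-square distribution equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    p = erfc(sqrt(chi2 / 2.0))
    return chi2, p


# The two outcomes from the table discussed above: swapping the columns
# describes the same data, so the results must be identical.
no_pregnancy = pearson_chi2_2x2(141, 114, 150, 107)
clinical_pregnancy = pearson_chi2_2x2(114, 141, 107, 150)

for label, (chi2, p) in [("No pregnancy", no_pregnancy),
                         ("Clinical pregnancy", clinical_pregnancy)]:
    print(f"{label}: chi2(1) = {chi2:.2f}, p = {p:.2f}")
# Both lines print chi2(1) = 0.49, p = 0.48
```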
The p values
There are a couple of basic things to keep in mind when reading tables of statistics:
- A t statistic of 1.96 with 100 or more degrees of freedom gives a rounded two-tailed p value of 0.05 (although if you want it to be strictly less than 0.05000, you need a t statistic of 1.984 with 100 dfs).
- For any given number of degrees of freedom, a larger t or Χ² statistic gives a smaller p value.
With that in mind, let's look at this table (from 10.1016/j.fertnstert.2007.05.010), which I believe to be entirely typical of the articles under discussion here:
The good news is that the percentages and the Χ² statistics check out OK!
Neither of the above bits of very basic knowledge is respected here. First we have t(228) = 2.66, p = 0.1 (when the correct p value for that t is 0.008, although in any case the correct t statistic for the given means and SDs would be between 5.05 and 5.83). Second, between FSH (follicle-stimulating hormone) and LH (luteinizing hormone), the t statistic goes up and so does the p value (which is also clearly incorrect in both cases).
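If you want to check a reported t/p pair such as t(228) = 2.66 without reaching for a statistics package, the two-tailed p value can be computed from the regularized incomplete beta function. What follows is a minimal, standard-library-only sketch using the usual continued-fraction evaluation; the names `_beta_cf`, `reg_inc_beta`, and `t_two_tailed_p` are mine, and in practice a library routine (or R's `pt()`) would be the more sensible choice.

```python
import math


def _beta_cf(a, b, x, max_iter=500, eps=1e-12):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        d = 1.0 / (d if abs(d) > tiny else tiny)
        c = 1.0 + aa / c
        c = c if abs(c) > tiny else tiny
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def reg_inc_beta(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_cf(a, b, x) / a
    return 1.0 - front * _beta_cf(b, a, 1.0 - x) / b


def t_two_tailed_p(t, df):
    """Two-tailed p value for a t statistic with df degrees of freedom."""
    return reg_inc_beta(df / 2.0, 0.5, df / (df + t * t))


print(f"t(228) = 2.66 -> p = {t_two_tailed_p(2.66, 228):.3f}")
print(f"t(100) = 1.96 -> p = {t_two_tailed_p(1.96, 100):.2f}")
```

The same function also makes the second rule of thumb easy to confirm: for a fixed number of degrees of freedom, increasing t can only decrease the p value.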
A number of the articles contain p values that are literally impossible (i.e., greater than 1.0; don't @ me to tell me about that time you did a Bonferroni correction by multiplying the p value instead of dividing the alpha). See 10.3109/01443615.2010.497873 (Table 1, "Parity", p = 1.13; see also "Other inconsistencies", below), 10.1016/j.fertnstert.2008.04.065 (Table 1, "Height", p = 1.01), and 10.1007/s00404-013-2866-0, which contains no less than four examples across its Tables 1 and 2:
1.12, 1.22, 1.32, 1.42: The impossible p values form a nice pattern.
The confidence intervals
A few of the articles have confidence intervals in the tables, perhaps added at the insistence of a reviewer or editor. But in most cases the point estimate falls outside the confidence interval. Sometimes this can become quite absurd, as in the following example (from 10.1016/j.fertnstert.2007.08.034). Those CI limits are ± 1.96 standard errors either side of... what exactly?
(Once you look beyond the CIs, there is a "bonus" waiting in the p value column here.)
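A basic coherence check on a reported confidence interval takes only a few lines. The sketch below assumes a symmetric interval of the form estimate ± 1.96 standard errors (which will not hold for, say, intervals on odds ratios), and allows for each reported number having been rounded to a given number of decimal places. The function name `check_ci` and the numbers in the usage lines are hypothetical; they are not taken from any of the articles.

```python
def check_ci(estimate, lower, upper, decimals, z=1.96):
    """Check a reported symmetric CI against its point estimate.

    Returns a dict with:
      inside     - is the estimate within [lower, upper], allowing for rounding?
      symmetric  - is the estimate at the midpoint, allowing for rounding?
      implied_se - the standard error implied by the width of the interval.
    """
    # Each reported value may be off by half a unit in the last decimal place.
    half_unit = 0.5 * 10 ** -decimals
    slack = 2 * half_unit + 1e-9  # two rounded values are compared at a time
    midpoint = (lower + upper) / 2.0
    return {
        "inside": lower - slack <= estimate <= upper + slack,
        "symmetric": abs(midpoint - estimate) <= slack,
        "implied_se": (upper - lower) / (2.0 * z),
    }


# Hypothetical coherent report: estimate 1.00, 95% CI 0.22 to 1.78.
print(check_ci(1.00, 0.22, 1.78, decimals=2))
# Hypothetical incoherent report: the estimate lies far outside its own CI.
print(check_ci(5.00, 0.22, 1.78, decimals=2))
```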
Other inconsistencies
Within these 46 articles it is hard not to notice a considerable number of other inconsistencies, which make the reader wonder how much care and attention went into both the writing and review processes. These tables from 10.3109/01443615.2010.497873 provide a particularly egregious example, with the appearance of 41 and 42 extra patients in the respective groups between baseline and outcome. (As a bonus, we also have a p value of 1.13.)
(Some white space has been removed from this table.)
These results from 10.1016/j.ejogrb.2012.09.014 make no sense. The first comparison was apparently done using Fisher's exact test, the second with Pearson's Χ² test, and the third, well, your guess is as good as mine. But there is no reason to use the two different types of test here, and even less reason to use Fisher's test for the larger case numbers and Pearson's for the smaller ones. (The p values are all incorrect, and would be even if the other test were to have been used for every variable.)
(Some white space and some results that are not relevant to the point being illustrated have been removed from this table.)
Finally, fans of GRIM might also be interested to learn that two articles show signs of possible inconsistencies in their reported means:
Conclusion
The results of these analyses seem to indicate that something has gone very badly wrong in the writing, reviewing, and publication of these articles. Even though I tried to give the published numbers the benefit of the doubt as far as possible, I estimate that across these 46 articles, 346 (64%) of the 542 parametric tests (unpaired t tests, or, occasionally, ANOVA) and 151 (61%) of the 247 contingency table tests (Pearson's Χ² or Fisher's exact test) that I was able to check were incorrectly reported. I don't think that anybody should be relying on the conclusions of these articles as a guide to practice, and I suspect that the only solution for most of them will be retraction. (As already mentioned, five have already been retracted following the publication of Bordewijk et al.'s commentary.)
I have a few aims in writing this post.
First, I want to do whatever I can to help get these misleading (at best) papers retracted from the medical literature, where they would seem to have considerable potential to do serious harm to the health of women, especially those who are pregnant or trying to overcome infertility.
Second, I aim to show some of the techniques that can be used to detect obvious errors in published articles (or in manuscripts that you might be reviewing).
Third, and the most important reason for doing all this work (it took a lot of hours to do these analyses, as you will see if you download the Excel file!), is to draw attention to the utter failure of peer review that was required in order for most of these articles to get published. They appeared in 13 different journals, none of which would appear to correspond to most people's idea of a "predatory" outlet. It is very tempting to imagine that nobody (editors, reviewers, Dr Badawy's thesis committee at the University of Utrecht, or readers of the journals, until Ben Mol and Esmée Bordewijk came along) even so much as glanced at the tables of results in these articles, given that they almost all contain multiple impossible numbers.
It is true that the majority of these articles are more than 10 years old, but I wonder how much has changed in the publication processes of medical journals since then. The reality of scientific peer review seems to be that, to a first approximation, nobody ever checks any of the numbers. I find that deeply worrying.
Supporting documents
The majority of the analyses underlying this post have been done with Microsoft Excel 2003. I have some R code that can do the same thing, but it seemed to me to make more sense to use Excel as the process of copying and pasting numbers from the tables in the articles was a lot more reliable, requiring only a text editor to replace the column separators with tab characters. I used my R code to compute the test statistics in a couple of cases where there were more than two groups and so I had to use the rpsychi package to calculate the results of a one-way ANOVA.
In my Excel file, each unpaired t test is performed on a separate line. The user enters the mean, standard deviation, and sample size for each of the two conditions, plus an indication of the rounding precision (i.e., the number of decimal places) for the means and SDs separately. The spreadsheet then calculates (using formulas that you can find in columns whose width I have in most cases reduced to zero) the minimum and maximum possible (i.e., before rounding) means and SDs, and from that it determines the minimum, notional (i.e., assuming that the rounded input values are exact), and maximum t statistics and the corresponding p values. It then highlights those t statistics (if available) and p values (or "significant/not-significant" claims) from the article that are not compatible with any point in the possible ranges of values. That is, at all times, I give the maximum benefit of the doubt to the authors. (Similar considerations apply, mutatis mutandis, to the table of chi-square tests in the same Excel file.)
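For anyone who prefers code to spreadsheets, the logic just described can be sketched in a few lines of Python. This is an illustration of the idea rather than a translation of the Excel formulas: it assumes a pooled-variance (Student) unpaired t test, strictly positive SDs, and conventional rounding to the stated number of decimal places. The names `pooled_t` and `t_range` and the input values in the usage line are hypothetical.

```python
from math import sqrt


def pooled_t(diff, sd1, n1, sd2, n2):
    """Pooled-variance t statistic for a given difference in means."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return diff / sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))


def t_range(m1, sd1, n1, m2, sd2, n2, mean_dp, sd_dp):
    """Minimum, notional, and maximum |t| compatible with rounded inputs.

    mean_dp and sd_dp are the numbers of decimal places to which the means
    and SDs were reported.
    """
    half_m = 0.5 * 10 ** -mean_dp
    half_s = 0.5 * 10 ** -sd_dp
    diff = abs(m1 - m2)
    # Smallest |t|: means as close together as rounding allows, largest SDs.
    t_min = pooled_t(max(0.0, diff - 2 * half_m),
                     sd1 + half_s, n1, sd2 + half_s, n2)
    # Notional |t|: treat the reported values as exact.
    t_mid = pooled_t(diff, sd1, n1, sd2, n2)
    # Largest |t|: means as far apart as rounding allows, smallest SDs
    # (clamped just above zero to avoid dividing by zero).
    t_max = pooled_t(diff + 2 * half_m,
                     max(sd1 - half_s, 1e-12), n1,
                     max(sd2 - half_s, 1e-12), n2)
    return t_min, t_mid, t_max


# Hypothetical inputs: 10.0 (SD 2.0, n = 50) versus 9.0 (SD 2.0, n = 50),
# all reported to one decimal place.
t_lo, t_mid, t_hi = t_range(10.0, 2.0, 50, 9.0, 2.0, 50, mean_dp=1, sd_dp=1)
print(f"{t_lo:.2f} {t_mid:.2f} {t_hi:.2f}")  # 2.20 2.50 2.82
```

A reported t statistic (or the p value derived from it) would then be flagged only if it falls outside the whole range from the minimum to the maximum, which is what giving the authors the maximum benefit of the doubt amounts to.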
The documents (an Excel file for the main analyses and some R code for the bits that I couldn't work out how to do easily in Excel) are available here. The article PDFs are all copyrighted and I cannot share them, but if you do not have institutional access then there is always the site whose name rhymes with Dry Club.
Appendix: List of examined articles
Articles that have not been retracted and have no expression of concern
Badawy et al. (2009). Gonadotropin-releasing hormone agonists for prevention of chemotherapy-induced ovarian damage: Prospective randomized study. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.12.044
Badawy et al. (2007). Induction of ovulation in idiopathic premature ovarian failure: A randomized double-blind trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60711-0
Badawy et al. (2010). Clomiphene citrate or aromatase inhibitors combined with gonadotropins for superovulation in women undergoing intrauterine insemination: A prospective randomised trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.497873
Badawy et al. (2009). Ultrasound-guided transvaginal ovarian needle drilling (UTND) for treatment of polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.044
Badawy et al. (2008). Low-molecular weight heparin in patients with recurrent early miscarriages of unknown aetiology. Journal of Obstetrics and Gynaecology. https://doi.org/10.1080/01443610802042688
Badawy et al. (2010). Laparoscopy--or not--for management of unexplained infertility. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2010.508850
Badawy et al. (2007). Plasma homocysteine and polycystic ovary syndrome: The missed link. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2006.10.015
Badawy et al. (2008). Extending clomiphene treatment in clomiphene-resistant women with PCOS: A randomized controlled trial. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)60148-4
Badawy et al. (2006). Clomiphene citrate plus N-acetyl cysteine versus clomiphene citrate for augmenting ovulation in the management of unexplained infertility: A randomized double-blind controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2006.02.097
Badawy et al. (2007). Randomized controlled trial of three doses of letrozole for ovulation induction in patients with unexplained infertility. Reproductive Biomedicine Online. https://doi.org/10.1016/s1472-6483(10)61046-2
Fawzy et al. (2007). Treatment options and pregnancy outcome in women with idiopathic recurrent miscarriage: A randomized placebo-controlled study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-007-0527-x
Gibreel et al. (2012). Endometrial scratching to improve pregnancy rate in couples with unexplained subfertility: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2012.02016.x
Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus highly purified FSH for ovulation induction in clomiphene-resistant PCOS women: A randomised controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2010.488771
Abu Hashim et al. (2010). Letrozole versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-010-1566-2
Abu Hashim et al. (2011). Laparoscopic ovarian diathermy after clomiphene failure in polycystic ovary syndrome: is it worthwhile? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1983-x
Abu Hashim et al. (2012). Contraceptive vaginal ring treatment of heavy menstrual bleeding: A randomized controlled trial with norethisterone. Contraception. https://doi.org/10.1016/j.contraception.2011.07.012
Abu Hashim et al. (2010). N-acetyl cysteine plus clomiphene citrate versus metformin and clomiphene citrate in treatment of clomiphene-resistant polycystic ovary syndrome: A randomized controlled trial. Journal of Women's Health. https://doi.org/10.1089/jwh.2009.1920
Abu Hashim et al. (2010). Combined metformin and clomiphene citrate versus laparoscopic ovarian diathermy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/j.1447-0756.2010.01383.x
Abu Hashim et al. (2011). Minimal stimulation or clomiphene citrate as first-line therapy in women with polycystic ovary syndrome: A randomized controlled trial. Gynecological Endocrinology. https://doi.org/10.3109/09513590.2011.589924
Abu Hashim et al. (2011). Does laparoscopic ovarian diathermy change clomiphene-resistant PCOS into clomiphene-sensitive? Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-011-1931-9
Abu Hashim et al. (2013). LNG-IUS treatment of non-atypical endometrial hyperplasia in perimenopausal women: A randomized controlled trial. Journal of Gynecologic Oncology. https://doi.org/10.3802/jgo.2013.24.2.128
Marzouk et al. (2014). Lavender-thymol as a new topical aromatherapy preparation for episiotomy: A randomised clinical trial. Journal of Obstetrics and Gynaecology. https://doi.org/10.3109/01443615.2014.970522
Ragab et al. (2013). Does immediate postpartum curettage of the endometrium accelerate recovery from preeclampsia-eclampsia? A randomized controlled trial. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-013-2866-0
El Refaeey et al. (2014). Combined coenzyme Q10 and clomiphene citrate for ovulation induction in clomiphene-citrate-resistant polycystic ovary syndrome. Reproductive Biomedicine Online. https://doi.org/10.1016/j.rbmo.2014.03.011
Seleem et al. (2014). Superoxide dismutase in polycystic ovary syndrome patients undergoing intracytoplasmic sperm injection. Journal of Assisted Reproduction and Genetics. https://doi.org/10.1007/s10815-014-0190-7
Shokeir (2006). Tamoxifen citrate for women with unexplained infertility. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-006-0181-8 *
Shokeir et al. (2016). Hysteroscopic-guided local endometrial injury does not improve natural cycle pregnancy rate in women with unexplained infertility: Randomized controlled trial. Journal of Obstetrics and Gynaecology Research. https://doi.org/10.1111/jog.13077
Shokeir et al. (2009). The efficacy of Implanon for the treatment of chronic pelvic pain associated with pelvic congestion: 1-year randomized controlled pilot study. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-009-0951-1 *
Shokeir & Mousa (2015). A randomized, placebo-controlled, double-blind study of hysteroscopic-guided pertubal diluted bupivacaine infusion for endometriosis-associated chronic pelvic pain. International Journal of Gynecology & Obstetrics. https://doi.org/10.1016/j.ijgo.2015.03.043
Articles that are subject to an editorial Expression of Concern
Badawy et al. (2012). Aromatase inhibitors or gonadotropin-releasing hormone agonists for the management of uterine adenomyosis: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2012.01350.x
Badawy et al. (2009). Extended letrozole therapy for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.04.065
Badawy et al. (2008). Luteal phase clomiphene citrate for ovulation induction in women with polycystic ovary syndrome: A novel protocol. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.01.016
Badawy et al. (2007). N-Acetyl cysteine and clomiphene citrate for induction of ovulation in polycystic ovary syndrome: A cross-over trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340601090337
Badawy et al. (2009). Clomiphene citrate or anastrozole for ovulation induction in women with polycystic ovary syndrome? A prospective controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.08.034
Badawy et al. (2009). Pregnancy outcome after ovulation induction with aromatase inhibitors or clomiphene citrate in unexplained infertility. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1080/00016340802638199
Abu Hashim et al. (2011). Intrauterine insemination versus timed intercourse with clomiphene citrate in polycystic ovary syndrome: A randomized controlled trial. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2010.01063.x
Abu Hashim et al. (2012). Randomized comparison of superovulation with letrozole versus clomiphene citrate in an IUI program for women with recently surgically treated minimal to mild endometriosis. Acta Obstetricia et Gynecologica Scandinavica. https://doi.org/10.1111/j.1600-0412.2011.01346.x
Shokeir et al. (2011). An RCT: use of oxytocin drip during hysteroscopic endometrial resection and its effect on operative blood loss and glycine deficit. Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2011.03.015
Shokeir et al. (2013). Does adjuvant long-acting gestagen therapy improve the outcome of hysteroscopic endometrial resection in women of low-resource settings with heavy menstrual bleeding? Journal of Minimally Invasive Gynecology. https://doi.org/10.1016/j.jmig.2012.11.006
Badawy & Gibreal (2011). Clomiphene citrate versus tamoxifen for ovulation induction in women with PCOS: A prospective randomized trial. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2011.07.015
Shokeir et al. (2013). Reducing blood loss at abdominal myomectomy with preoperative use of dinoprostone intravaginal suppository: A randomized placebo-controlled pilot study. European Journal of Obstetrics & Gynecology and Reproductive Biology. https://doi.org/10.1016/j.ejogrb.2012.09.014
Articles that have been retracted
Badawy et al. (2009). Clomiphene citrate or aromatase inhibitors for superovulation in women with unexplained infertility undergoing intrauterine insemination: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2008.06.013
Badawy et al. (2009). Clomiphene citrate or letrozole for ovulation induction in women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.02.062
Badawy et al. (2008). Anastrozole or letrozole for ovulation induction in clomiphene-resistant women with polycystic ovarian syndrome: A prospective randomized trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2007.05.010
Abu Hashim et al. (2010). Letrozole versus combined metformin and clomiphene citrate for ovulation induction in clomiphene-resistant women with polycystic ovary syndrome: A randomized controlled trial. Fertility and Sterility. https://doi.org/10.1016/j.fertnstert.2009.07.985
El-Refaie et al. (2015). Vaginal progesterone for prevention of preterm labor in asymptomatic twin pregnancies with sonographic short cervix: A randomized clinical trial of efficacy and safety. Archives of Gynecology and Obstetrics. https://doi.org/10.1007/s00404-015-3767-1
Note
The two articles marked with a * above are the only ones in which I did not identify any problems; in each of these articles all of the statistical tests are marked as either "S" (significant) or "NS" (not significant) and none of the calculations that I performed resulted in the opposite verdict for any test.