DOI: 10.1093/humrep/dead093.103 ISSN: 0268-1161

O-089 Using ChatGPT to answer patient questions about fertility: the quality of information generated by a deep learning language model

K Beilby, K Hammarberg

Abstract

Study question

What is the quality of information provided by ChatGPT when using common patient questions as prompts?

Summary answer

Overall, the quality of the information generated by ChatGPT was high, with little evidence of commercial bias.

What is known already

People seeking fertility-related information rely on internet sources when making decisions about reproductive planning and assisted conception, yet the quality of information within the commercial landscape of infertility treatment is poor. ChatGPT, a variant of Generative Pre-trained Transformer v3 (GPT-3), is a language model that uses deep learning to generate human-like text. Given a prompt, it generates an answer by predicting the next word in the sequence based on patterns learned from its training data. The training data for GPT-3 is not curated but is a snapshot of the web, which includes all kinds of information, along with whatever biases exist within those sources.
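
Because next-word prediction is the core mechanism here, a minimal sketch may help make it concrete. The Python toy below is our own illustration, not GPT-3's actual architecture: it "learns" word-transition counts from a tiny corpus and generates text by repeatedly sampling the next word. GPT-3 does the same in principle, but with a transformer network trained on a web-scale corpus rather than word-pair counts.

    from collections import Counter, defaultdict
    import random

    # Toy illustration of next-word prediction: a bigram model counts
    # which word follows which in a small corpus, then generates text
    # by sampling the next word in proportion to those counts.
    corpus = ("fertility declines with age and fertility treatment "
              "outcomes vary with age").split()

    next_words = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        next_words[prev][nxt] += 1

    def generate(start, length=6):
        words = [start]
        for _ in range(length):
            candidates = next_words.get(words[-1])
            if not candidates:
                break  # no observed continuation for this word
            choices, weights = zip(*candidates.items())
            words.append(random.choices(choices, weights=weights)[0])
        return " ".join(words)

    print(generate("fertility"))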

Study design, size, duration

Ten common patient questions were used as prompts. Three questions related to fertility awareness (the impact of female age and of male age on fertility, and the fertile window in the menstrual cycle), one to the chance of success with IVF, one to elective egg freezing, one to the benefits of add-ons, one to PCOS and pregnancy, one to choosing a fertility clinic, and one to how many IVF cycles should be attempted.
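
The abstract does not state how the responses were collected (presumably via the public ChatGPT interface), and it enumerates nine of the ten question topics. As a purely hypothetical sketch, prompts on these topics could also be gathered programmatically; the wordings below are our paraphrases, not the study's prompts, and the call uses the openai Python client as it existed in 2023.

    import openai  # pip install openai (0.x-era client, circa 2023)

    openai.api_key = "YOUR_API_KEY"  # placeholder credential

    # Paraphrases of the question topics enumerated in the abstract;
    # the study's exact prompt wordings are not given.
    prompts = [
        "How does a woman's age affect her fertility?",
        "How does a man's age affect his fertility?",
        "When in the menstrual cycle is a woman most fertile?",
        "What is the chance of success with IVF?",
        "Should I freeze my eggs electively?",
        "What are the benefits of IVF add-ons?",
        "Can I get pregnant if I have PCOS?",
        "How do I choose a fertility clinic?",
        "How many IVF cycles should I attempt?",
    ]

    responses = {}
    for prompt in prompts:
        reply = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
        )
        responses[prompt] = reply.choices[0].message.content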

Participants/materials, setting, methods

Two experts independently scored the quality of the information generated by ChatGPT using a scoring matrix with a range of 0 to 7, where higher scores indicate higher quality. Each text was rated on how well it corresponded to humanistic answers (0-3), absence of commercial bias (no = 1, yes = 0), absence of controversial claims (no = 1, yes = 0), use of accurate proportions/statistics (yes = 1, no = 0), and whether it stated that medical advice should be sought (yes = 1, no = 0).
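
Read this way, the matrix decomposes into five components whose maxima sum to the stated ceiling of 7 (3 + 1 + 1 + 1 + 1). A minimal sketch of that arithmetic, reflecting our reading of the rubric rather than the authors' published instrument:

    from dataclasses import dataclass

    @dataclass
    class AnswerRating:
        # One expert's rating of a single ChatGPT answer, per our
        # reading of the scoring matrix described above.
        correspondence: int          # agreement with humanistic answer, 0-3
        commercial_bias: bool        # evidence of commercial bias?
        controversial_claims: bool   # controversial claims present?
        accurate_statistics: bool    # proportions/statistics accurate?
        advises_seeking_care: bool   # says medical advice should be sought?

        def total(self):
            assert 0 <= self.correspondence <= 3
            return (self.correspondence
                    + (0 if self.commercial_bias else 1)
                    + (0 if self.controversial_claims else 1)
                    + (1 if self.accurate_statistics else 0)
                    + (1 if self.advises_seeking_care else 0))

    # An unbiased, accurate answer that matches the humanistic reference
    # and recommends seeing a doctor earns the maximum score:
    print(AnswerRating(3, False, False, True, True).total())  # 7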

Main results and the role of chance

The scores returned by the two experts were closely aligned, with only a one-point difference for one of the answers; this discrepancy was resolved through discussion. While none of the answers received the maximum score of 7, six of the ten scored 5 or more and three received a score of 3-4. Only one answer, the response to the question about the benefits of add-ons, scored less than 3. This was also the only response showing evidence of commercial bias, and one of only two that made claims that could be considered controversial.

Limitations, reasons for caution

The scoring method used in this study has not been validated and is exploratory in nature, as this area of evaluation is still emerging. However, expert evaluation is commonly used when assessing the performance of machine learning models, and is often used to fine-tune their parameters and improve their performance.

Wider implications of the findings

It is known that people seeking fertility-related information rely heavily on online sources such as clinic websites, consumer advocacy organisations, patient support groups and social media. Our findings suggest that ChatGPT may be a useful tool for patients seeking factual and unbiased information about fertility and fertility treatment.

Trial registration number

Not applicable
