Although artificial intelligence (AI) can generate ideas and references, clinicians must thoroughly vet and fact-check any medical research content that AI produces.
Hong-Uyen Hua, MD, a recently graduated surgical retina fellow and the study's first author, cautioned that although AI is capable of generating ideas and references, clinicians need to go a step further and thoroughly vet and fact-check any medical research content that AI produces.1 Hua, senior author Danny Mammo, MD, and colleagues are affiliated with the Cole Eye Institute, Cleveland Clinic Foundation, Cleveland.
Hua and colleagues pointed to the rapid growth in the popularity of AI chatbots and the potentially significant implications for patient education and academia. They also noted that the drawbacks of using these chatbots to generate abstracts and references have not been investigated thoroughly.
To remedy this, the research team conducted a cross-sectional comparative study evaluating and comparing the quality of ophthalmic scientific abstracts and references generated by earlier and updated versions of a popular AI chatbot.
The study used 2 versions of an AI chatbot to generate scientific abstracts and 10 references for clinical research questions across 7 ophthalmology subspecialties. Two of the authors graded the abstracts using modified DISCERN criteria and performance evaluation scores, and 2 AI output detectors also evaluated the abstracts. A so-called hallucination rate, the proportion of generated references that could not be verified, was calculated and compared between the earlier and updated versions of the chatbot.
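The hallucination-rate calculation described above can be illustrated with a short sketch. This is not the study's actual code; the function name, inputs, and verification flags are hypothetical, and it simply expresses the rate as the share of generated references that could not be matched to a real publication.

```python
# Illustrative sketch (not the study's actual code): a "hallucination
# rate" computed as the fraction of chatbot-generated references that
# could not be verified against real bibliographic records.

def hallucination_rate(references, verified_flags):
    """Return the fraction of references flagged as nonverifiable.

    references     -- list of citation strings produced by the chatbot
    verified_flags -- parallel list of booleans; True if the citation
                      was matched to a real publication
    """
    if len(references) != len(verified_flags):
        raise ValueError("references and flags must be the same length")
    if not references:
        return 0.0
    nonverifiable = sum(1 for ok in verified_flags if not ok)
    return nonverifiable / len(references)

# Example: if 3 of 10 generated references cannot be verified, the rate
# is 0.30, in line with the roughly 30% reported for both versions.
flags = [True] * 7 + [False] * 3
rate = hallucination_rate(["ref"] * 10, flags)
print(f"{rate:.0%}")  # 30%
```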
The investigators found that the “mean modified AI-DISCERN scores for the chatbot-generated abstracts were 35.9 and 38.1 out of a maximal score of 50 for the earlier and updated versions, respectively (P = 0.30). Based on the 2 AI output detectors, the mean fake scores, with a score of 100% meaning generated by AI, for the earlier and updated chatbot-generated abstracts were 65.4% and 10.8%, respectively (P = 0.01) for 1 detector and 69.5% and 42.7% (P = 0.17) for the second detector. The mean hallucination rates for nonverifiable references generated by the earlier and updated versions were 33% and 29% (P = 0.74).”
In other words, the quality of the abstracts generated by the 2 versions of the chatbot was comparable. The mean hallucination rate of the citations, at about 30%, was likewise comparable between the versions.
Considering that both versions of the chatbot produced abstracts of average quality and hallucinated citations that seemed realistic, Hua and colleagues warned clinicians to be aware of the potential for factual errors or hallucinations. Any medical content produced by AI should be carefully vetted and fact-checked before it is used for health education or academic purposes.
Hua commented, “The idea for this study initially came while I was exploring generative AI chatbots and their possible applications in ophthalmology. I quickly realized that the chatbot was making up references—a term called ‘hallucinations’ in generative AI. On top of that, the chatbot was unable to distinguish nuances in the scientific literature (e.g. oral vs intravenous dosing of steroids in optic neuritis). Current AI detectors perform poorly in detecting AI-generated text, especially with the newer version of AI chatbots. The scientific community at large must be wary of the implications of using generative AI for research purposes.”