Large language models (LLMs) show great promise in the realm of glaucoma with additional capabilities of self-correction, a recent study found.1 However, use of the technology in glaucoma is still in its infancy, and further research and validation are needed, according to first author Darren Ngiap Hao Tan, MD, a researcher from the Department of Ophthalmology, National University Hospital, Singapore, Singapore.
He and his colleagues wanted to determine if LLMs were useful in medicine. “Most LLMs available for public use are based on a general model and are not trained nor fine-tuned specifically for the medical field, let alone a specialty such as ophthalmology,” they explained.
Tan and colleagues evaluated the responses of the artificial intelligence chatbot ChatGPT (version GPT-3.5, OpenAI),2 which is based on an LLM and was trained on a massive dataset of text (570 gigabytes worth of data with a model size of 175 billion parameters).3 While previous studies4-8 showed that ChatGPT was a tool that could be leveraged in the healthcare industry, no studies have evaluated its performance in answering queries pertaining to glaucoma.
The investigators recounted that they curated 24 clinically relevant questions across 4 categories in glaucoma: diagnosis, treatment, surgeries, and ocular emergencies. An expert grader panel of 3 glaucoma specialists with combined experience of more than 30 years in the field graded the responses of the LLM to each question. When the responses were poor, the LLM was prompted to self-correct, and the expert panel then re-evaluated the subsequent responses.
The main outcome measures were the accuracy, comprehensiveness, and safety of the responses of ChatGPT. The scores were ranked from 1 to 4, where 4 represents the best score with a complete and accurate response.
The investigators reported a total of 72 graded responses to the 24 questions, a count consistent with each of the 3 specialists scoring every response.
“The mean score of the expert panel was 3.29 with a standard deviation of 0.484. Of the 24 question-response pairs, 7 (29.2%) had a mean inter-grader score of 3 or less. The mean score of the original seven question-response pairs was 2.96, which rose to 3.58 after an opportunity to self-correct (z-score −3.27, p = 0.001, Mann–Whitney U). The 7 of the 24 question-response pairs that performed poorly were given a chance to self-correct. After self-correction, the proportion of responses obtaining a full score increased from 22/72 (30.6%) to 12/21 (57.1%) (p = 0.026, χ² test),” the study authors reported.
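For readers who want to check the arithmetic behind the reported full-score comparison, the following is a minimal sketch rather than the authors' actual analysis. It assumes the χ² test was a Pearson chi-square without continuity correction on a 2×2 table (full score versus less than full score, original versus self-corrected responses) and that the two groups were treated as independent; the article does not state either detail. The function name `pearson_chi2_2x2` is hypothetical.

```python
import math


def pearson_chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Assumption: this is the variant used in the study; the article only says "chi-square test".
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom, the chi-square upper-tail probability
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Counts quoted in the article: graded responses receiving the full score of 4.
full_before, total_before = 22, 72  # original responses
full_after, total_after = 12, 21    # self-corrected responses

chi2, p_value = pearson_chi2_2x2(
    full_before, total_before - full_before,
    full_after, total_after - full_after,
)

print(f"before: {full_before / total_before:.1%}, after: {full_after / total_after:.1%}")
print(f"chi2 = {chi2:.3f}, p = {p_value:.3f}")
```

Under these assumptions the p-value comes out close to the reported 0.026, which suggests, but does not confirm, that no continuity correction was applied.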
Tan and colleagues concluded, “LLMs show great promise in the realm of glaucoma with additional capabilities of self-correction, with the caveat that the application of LLMs in glaucoma is still in its infancy and requires further research and validation.”