June 13, 2026

Reasoning prompts sharpen multimodal AI on bilingual ophthalmology exam questions

Key Takeaways

Benchmark testing used 316 bilingual items spanning cornea, uvea, glaucoma, retina, and orbit, combining 175 English single-choice BCSC questions and 141 Chinese multiple-choice senior-title questions.
Reasoning-enabled prompting increased scores for all models across languages, including ChatGPT-5 gains from 20.77 to 23.97 (English) and 19.95 to 22.00 (Chinese).
Human evaluation showed ChatGPT-5 provided the strongest reasoning quality, with substantial inter-rater agreement (κ = 0.87) and qualitatively more clinically traceable explanations in selected cases.
Safe educational or clinical use still requires rigorous auditing of multimodal reasoning because performance benefits varied by model and dataset, and image-interpretation robustness was inconsistent.

Step-by-step prompting boosts multimodal AI accuracy on bilingual ophthalmology vignettes with images—but weak subspecialty and image-reading skills remain.

Asking multimodal large language models (LLMs) to reason step by step before answering improved both their accuracy and the clinical interpretability of their responses on complex ophthalmology questions, according to a benchmark study published online June 9 in the British Journal of Ophthalmology.¹

While LLMs have repeatedly posted strong marks on text-based medical examinations, their ability to integrate images with clinical text—a routine demand in ophthalmic practice—has been far less thoroughly tested. Investigators led by corresponding author Kai Jin, MD, of Zhejiang University, set out to measure how well current vision-language systems handle that multimodal challenge and whether prompting them to show their reasoning makes a measurable difference.

Study design

The team evaluated 3 multimodal LLMs—CLM-V, ChatGPT-5, and MiniCPM-V 4.5—on 316 bilingual ophthalmology questions, each of which paired a clinical vignette with an accompanying image. The question set drew from 2 sources: 175 English single-choice items from the Basic and Clinical Science Course (BCSC) and 141 Chinese multiple-choice questions from the senior professional title examination. Content spanned the cornea, uvea, glaucoma, retina, and orbit.

Each model was run under 2 conditions—reasoning-enabled and reasoning-disabled prompts—so the investigators could isolate the effect of prompting strategy. Answers were scored against reference standards for accuracy, and the quality of each model's reasoning was assessed using an automated rubric covering accuracy, data synthesis, logic, option analysis, and safety, supplemented by expert review. Four cases were examined in depth qualitatively.

Key findings

Across both languages, reasoning-enabled prompting lifted mean AI-assisted total scores for every model tested. In the English dataset, scores rose from 14.97 to 16.07 for CLM-V, from 20.77 to 23.97 for ChatGPT-5, and from 10.83 to 12.60 for MiniCPM-V 4.5. The Chinese dataset showed the same direction of effect, with scores climbing from 9.03 to 10.27, 19.95 to 22.00, and 11.05 to 13.30, respectively.
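For readers who want to compare the size of the effect across models, the reported means can be turned into absolute and relative gains. The short Python sketch below uses only the figures quoted above. The article does not state the score scale, so the percentages are plain arithmetic on the reported means, not a result from the study, and the names `SCORES` and `gains` are illustrative.

```python
# Mean total scores as quoted in the article:
# (reasoning-disabled, reasoning-enabled) for each model and dataset.
SCORES = {
    "English": {
        "CLM-V": (14.97, 16.07),
        "ChatGPT-5": (20.77, 23.97),
        "MiniCPM-V 4.5": (10.83, 12.60),
    },
    "Chinese": {
        "CLM-V": (9.03, 10.27),
        "ChatGPT-5": (19.95, 22.00),
        "MiniCPM-V 4.5": (11.05, 13.30),
    },
}


def gains(scores):
    """Return (dataset, model, absolute gain, relative gain in %) rows."""
    rows = []
    for dataset, models in scores.items():
        for model, (disabled, enabled) in models.items():
            absolute = round(enabled - disabled, 2)
            relative = round(100 * (enabled - disabled) / disabled, 1)
            rows.append((dataset, model, absolute, relative))
    return rows


for dataset, model, absolute, relative in gains(SCORES):
    print(f"{dataset:8s} {model:14s} +{absolute:.2f} points ({relative:.1f}% relative)")
```

On these numbers, ChatGPT-5 shows the largest absolute gain on the English items, while MiniCPM-V 4.5 shows the largest relative gain in both datasets, which is consistent with the authors' point that the benefit was not uniform.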

ChatGPT-5 ranked highest on human evaluation, which the investigators reported showed substantial inter-rater agreement (κ = 0.87). The qualitative case analyses suggested that reasoning-enabled outputs were often easier to follow clinically, though the authors cautioned that the size of the benefit varied by model and by dataset rather than holding uniformly across the board.
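As background on the agreement statistic, kappa compares how often raters agree with how often they would agree by chance. The article does not say which kappa variant or how many raters the study used, so the Python sketch below assumes the simplest case, Cohen's kappa for 2 raters. The ratings in it are invented purely for illustration and are not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must label the same non-empty set of items.")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings (1 = reasoning judged acceptable, 0 = not acceptable).
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
print(round(cohen_kappa(rater_a, rater_b), 2))  # 0.8
```

In this toy example the raters agree on 9 of 10 items, chance agreement is 0.5, and kappa is (0.9 - 0.5) / (1 - 0.5) = 0.8.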

Implications

The investigators concluded that multimodal LLMs show genuine potential for ophthalmic question-answering and that prompting them to reason was associated with better interpretability and, in most settings, numerically higher performance. They tempered that takeaway with a clear caveat: gaps in subspecialty robustness and image interpretation mean that rigorous evaluation of model reasoning remains essential before these tools can be deployed safely in educational or clinical settings.

The data used in the study were derived from examination question banks and are not publicly available because of copyright and access restrictions, the authors noted; further detail is available from the corresponding author on reasonable request.

Reference

1. Yin H, Zhao K, Shi D, Grzybowski A, Jin K. Evaluating reasoning in multimodal large language models for ophthalmology: a bilingual benchmark study using clinical vignettes and imaging. Br J Ophthalmol. Published online June 9, 2026. doi:10.1136/bjo-2025-328992
