News|Articles|June 13, 2026

Reasoning prompts sharpen multimodal AI on bilingual ophthalmology exam questions

Key Takeaways

  • Benchmark testing used 316 bilingual items spanning cornea, uvea, glaucoma, retina, and orbit, combining 175 English single-choice BCSC questions and 141 Chinese multiple-choice senior-title questions.
  • Reasoning-enabled prompting increased scores for all models across languages, including ChatGPT-5 gains from 20.77 to 23.97 (English) and 19.95 to 22.00 (Chinese).

Step-by-step prompting boosts multimodal AI accuracy on bilingual ophthalmology vignettes with images—but weak subspecialty and image-reading skills remain.

Asking multimodal large language models (LLMs) to reason step by step before answering improved both their accuracy and the clinical interpretability of their responses on complex ophthalmology questions, according to a benchmark study published online June 9 in the British Journal of Ophthalmology [1].

While LLMs have repeatedly posted strong marks on text-based medical examinations, their ability to integrate images with clinical text—a routine demand in ophthalmic practice—has been far less thoroughly tested. Investigators led by corresponding author Kai Jin, MD, of Zhejiang University, set out to measure how well current vision-language systems handle that multimodal challenge and whether prompting them to show their reasoning makes a measurable difference.

Study design

The team evaluated 3 multimodal LLMs—CLM-V, ChatGPT-5, and MiniCPM-V 4.5—on 316 bilingual ophthalmology questions, each of which paired a clinical vignette with an accompanying image. The question set drew from 2 sources: 175 English single-choice items from the Basic and Clinical Science Course (BCSC) and 141 Chinese multiple-choice questions from the senior professional title examination. Content spanned the cornea, uvea, glaucoma, retina, and orbit.

Each model was run under 2 conditions—reasoning-enabled and reasoning-disabled prompts—so the investigators could isolate the effect of prompting strategy. Answers were scored against reference standards for accuracy, and the quality of each model's reasoning was assessed using an automated rubric covering accuracy, data synthesis, logic, option analysis, and safety, supplemented by expert review. Four cases were examined in depth qualitatively.

Key findings

Across both languages, reasoning-enabled prompting lifted mean AI-assisted total scores for every model tested. In the English dataset, scores rose from 14.97 to 16.07 for CLM-V, from 20.77 to 23.97 for ChatGPT-5, and from 10.83 to 12.60 for MiniCPM-V 4.5. The Chinese dataset showed the same direction of effect, with scores climbing from 9.03 to 10.27, 19.95 to 22.00, and 11.05 to 13.30, respectively.
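For readers who want to compare the size of these gains, the short Python sketch below tabulates the reported scores and computes the absolute and relative change for each model. The score pairs are taken from the paragraph above; the `gains` helper and the choice to express relative change against the reasoning-disabled score are illustrative assumptions, not part of the study's analysis.

```python
# Mean total scores as reported in the article:
# (reasoning-disabled, reasoning-enabled) for each model and dataset.
scores = {
    "English": {
        "CLM-V": (14.97, 16.07),
        "ChatGPT-5": (20.77, 23.97),
        "MiniCPM-V 4.5": (10.83, 12.60),
    },
    "Chinese": {
        "CLM-V": (9.03, 10.27),
        "ChatGPT-5": (19.95, 22.00),
        "MiniCPM-V 4.5": (11.05, 13.30),
    },
}


def gains(table):
    """Return (dataset, model, absolute gain, relative gain in %) rows.

    Relative gain is measured against the reasoning-disabled score,
    which is an assumption made here for illustration only.
    """
    rows = []
    for dataset, models in table.items():
        for model, (disabled, enabled) in models.items():
            delta = enabled - disabled
            rows.append(
                (dataset, model, round(delta, 2), round(100 * delta / disabled, 1))
            )
    return rows


for dataset, model, delta, pct in gains(scores):
    print(f"{dataset:8} {model:14} +{delta:.2f} points ({pct:.1f}% over baseline)")
```

On these figures, the largest absolute gain is ChatGPT-5 on the English set (3.20 points), and every one of the six model-dataset pairs moves in the same direction.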

ChatGPT-5 ranked highest on human evaluation, which the investigators reported showed substantial inter-rater agreement (κ = 0.87). The qualitative case analyses suggested that reasoning-enabled outputs were often easier to follow clinically, though the authors cautioned that the size of the benefit varied by model and by dataset rather than holding uniformly across the board.
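As background on the agreement statistic, the sketch below shows how a two-rater Cohen's kappa is computed: observed agreement corrected for the agreement expected by chance. The article does not say which kappa variant the investigators used, so Cohen's kappa is an assumption here, and the ratings in the example are made-up toy values, not study data.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)

    # Observed agreement: share of items given the same label by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of each rater's marginal label frequencies,
    # summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)

    if p_expected == 1.0:
        # Both raters used a single identical label; kappa is undefined,
        # so report perfect agreement by convention in this sketch.
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical quality grades (1-3) from two reviewers on eight responses.
toy_a = [1, 2, 3, 3, 2, 1, 3, 2]
toy_b = [1, 2, 3, 3, 2, 1, 2, 2]
print(f"kappa = {cohen_kappa(toy_a, toy_b):.2f}")
```

Because kappa discounts chance agreement, it is lower than the raw share of matching labels whenever the raters disagree at all; in the toy data the raters match on 7 of 8 items, yet kappa comes out below 0.875.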

Implications

The investigators concluded that multimodal LLMs show genuine potential for ophthalmic question-answering and that prompting them to reason was associated with better interpretability and, in most settings, numerically higher performance. They tempered that takeaway with a clear caveat: gaps in subspecialty robustness and image interpretation mean that rigorous evaluation of model reasoning remains essential before these tools can be deployed safely in educational or clinical settings.

The data used in the study were derived from examination question banks and are not publicly available because of copyright and access restrictions, the authors noted; further detail is available from the corresponding author on reasonable request.

Reference
  1. Yin H, Zhao K, Shi D, Grzybowski A, Jin K. Evaluating reasoning in multimodal large language models for ophthalmology: a bilingual benchmark study using clinical vignettes and imaging. Br J Ophthalmol. Published online June 9, 2026. doi:10.1136/bjo-2025-328992
