Ophthalmologist AI: Textual Question Answering Model
Summary of the Research on Foundational Models (FMs) in Ophthalmology
This research investigated the performance of several Foundational Models (FMs), namely Claude 3.5 Sonnet, GPT-4o, Qwen2.5-Max, DeepSeek V3, and Gemini Advanced, on ophthalmology questions, comparing them to ophthalmology experts, trainees, and junior physicians. Here's a breakdown of the key findings:
Methodology:
* Question Source: Questions were sourced from a textbook used for the Fellowship of the Royal College of Ophthalmologists Part 2 exam (360 questions total, 13 multimodal, 345 textual). An additional 27 multimodal questions were created, resulting in 40 image-based questions used for testing.
* FM Testing: Seven FMs were tested without customization, fine-tuning, or additional guidance. Questions were input between September 2024 and March 2025 (a minimal scoring sketch follows this list).
* Human Evaluation: Ten physicians with varying levels of ophthalmology experience also evaluated the multimodal questions.
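
For readers who want the scoring step made concrete, here is a minimal sketch of how accuracy could be tallied from keyed multiple-choice answers. All identifiers (`Question`, `correct_option`, `answers`) are hypothetical; the summary does not describe the study's actual pipeline.

```python
# Minimal, hypothetical sketch of the accuracy tally implied above.
# The study's actual scoring pipeline is not described in this summary.
from dataclasses import dataclass

@dataclass
class Question:
    qid: int
    text: str
    correct_option: str  # keyed answer, e.g. "B"

def accuracy(questions: list[Question], answers: dict[int, str]) -> float:
    """Fraction of questions where the model's chosen option matches the key."""
    hits = sum(1 for q in questions if answers.get(q.qid) == q.correct_option)
    return hits / len(questions)

# Toy example: 3 of 4 answers match the key -> 0.75
qs = [Question(i, f"Q{i}", key) for i, key in enumerate("ABCD")]
print(accuracy(qs, {0: "A", 1: "B", 2: "X", 3: "D"}))  # 0.75
```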
Key Results:
* Textual Questions:
* Claude 3.5 Sonnet performed best with an accuracy of 77.7%.
* Other models: GPT-4o (69.9%), Qwen2.5-Max (69.3%), DeepSeek V3 (63.2%), Gemini Advanced (62.6%).
* Claude 3.5 Sonnet performed comparably to ophthalmology experts (a difference of 1.3%).
* Trainees and unspecialized junior physicians performed significantly worse than Claude 3.5 Sonnet.
* Claude 3.5 Sonnet also outperformed the mean candidate score and the official pass mark.
* Multimodal Questions:
* GPT-4o had the highest accuracy (57.5%), followed by Claude 3.5 Sonnet (47.5%).
* Ophthalmology experts scored 75.7%, trainees scored 71.3%, and the FMs averaged 42%.
* GPT-4o and Claude 3.5 Sonnet showed the highest agreement with physicians (see the illustrative statistics sketch below).
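
The results above invoke two statistical notions without naming the tests used: some groups performing "significantly worse" than a model, and models' "agreement with physicians." Below is a hedged sketch of one plausible reading of each, a two-proportion z-test for accuracy differences and Cohen's kappa for rater agreement. The study's actual methods are not specified in this summary, and all inputs in the example are illustrative.

```python
# Illustrative statistics only; the study's actual tests are not given here.
from collections import Counter
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for a difference between two accuracies."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # pooled standard error
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two answer sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))  # chance
    return (p_o - p_e) / (1 - p_e)

# Using two reported textual accuracies (77.7% vs 62.6%) over 345 questions:
print(two_proportion_z(0.777, 345, 0.626, 345))    # z ~ 4.3, p << 0.05
# Hypothetical option-letter answers on five multimodal questions:
print(cohen_kappa(list("ACBBD"), list("ACDBD")))   # ~ 0.74
```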
Limitations:
* The study acknowledges that the correlation between exam performance and real-world clinical aptitude is unclear.
In essence, the study demonstrates that FMs, especially Claude 3.5 Sonnet and GPT-4o, show promising potential in answering ophthalmology questions, even rivaling the performance of experts on textual questions. However, they still lag behind experts in multimodal question answering.
