Ophthalmologist AI: Textual Question Answering Model
Summary of the Research on Foundational Models (FMs) in Ophthalmology
This research investigated the performance of several Foundational Models (FMs), namely Claude 3.5 Sonnet, GPT-4o, Qwen2.5-Max, DeepSeek V3, and Gemini Advanced, on ophthalmology questions, comparing them to ophthalmology experts, trainees, and junior physicians. Here's a breakdown of the key findings:
Methodology:
* Question Source: Questions were sourced from a textbook used for the Fellowship of the Royal College of Ophthalmologists Part 2 exam (360 questions total, 13 multimodal, 345 textual). An additional 27 multimodal questions were created, resulting in 40 image-based questions used for testing.
* FM Testing: Seven FMs were tested without customization, fine-tuning, or additional guidance. Questions were input between September 2024 and March 2025 (a minimal scoring sketch follows this list).
* Human Evaluation: Ten physicians with varying levels of ophthalmology experience also evaluated the multimodal questions.
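
For readers who want the scoring step made concrete, here is a minimal sketch of how accuracy could be tallied from keyed multiple-choice answers. All identifiers (`Question`, `correct_option`, `answers`) are hypothetical; the summary does not describe the study's actual pipeline.

```python
# Minimal, hypothetical sketch of the accuracy tally implied above.
# The study's actual scoring pipeline is not described in this summary.
from dataclasses import dataclass

@dataclass
class Question:
    qid: int
    text: str
    correct_option: str  # keyed answer, e.g. "B"

def accuracy(questions: list[Question], answers: dict[int, str]) -> float:
    """Fraction of questions where the model's chosen option matches the key."""
    hits = sum(1 for q in questions if answers.get(q.qid) == q.correct_option)
    return hits / len(questions)

# Toy example: 3 of 4 answers match the key -> 0.75
qs = [Question(i, f"Q{i}", key) for i, key in enumerate("ABCD")]
print(accuracy(qs, {0: "A", 1: "B", 2: "X", 3: "D"}))  # 0.75
```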
Key Results:
* Textual Questions:
* Claude 3.5 Sonnet performed best with an accuracy of 77.7%.
* Other models: GPT-4o (69.9%), Qwen2.5-Max (69.3%), DeepSeek V3 (63.2%), Gemini Advanced (62.6%).
* Claude 3.5 Sonnet performed comparably to ophthalmology experts (a difference of 1.3%).
* Trainees and unspecialized junior physicians performed significantly worse than Claude 3.5 Sonnet.
* Claude 3.5 Sonnet also outperformed the mean candidate score and the official pass mark.
* Multimodal Questions:
* GPT-4o had the highest accuracy (57.5%), followed by Claude 3.5 Sonnet (47.5%).
* Ophthalmology experts scored 75.7%, trainees scored 71.3%, and the FMs averaged 42%.
* GPT-4o and Claude 3.5 Sonnet showed the highest agreement with physicians (see the illustrative statistics sketch below).
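
The results above invoke two statistical notions without naming the tests used: some groups performing "significantly worse" than a model, and models' "agreement with physicians." Below is a hedged sketch of one plausible reading of each, a two-proportion z-test for accuracy differences and Cohen's kappa for rater agreement. The study's actual methods are not specified in this summary, and all inputs in the example are illustrative.

```python
# Illustrative statistics only; the study's actual tests are not given here.
from collections import Counter
from math import sqrt
from statistics import NormalDist

def two_proportion_z(p1: float, n1: int, p2: float, n2: int) -> tuple[float, float]:
    """z statistic and two-sided p-value for a difference between two accuracies."""
    p = (p1 * n1 + p2 * n2) / (n1 + n2)            # pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))     # pooled standard error
    z = (p1 - p2) / se
    return z, 2 * (1 - NormalDist().cdf(abs(z)))

def cohen_kappa(a: list[str], b: list[str]) -> float:
    """Chance-corrected agreement between two answer sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum((ca[l] / n) * (cb[l] / n) for l in set(ca) | set(cb))  # chance
    return (p_o - p_e) / (1 - p_e)

# Using two reported textual accuracies (77.7% vs 62.6%) over 345 questions:
print(two_proportion_z(0.777, 345, 0.626, 345))    # z ~ 4.3, p << 0.05
# Hypothetical option-letter answers on five multimodal questions:
print(cohen_kappa(list("ACBBD"), list("ACDBD")))   # ~ 0.74
```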
Limitations:
* The study acknowledges that the correlation between exam performance and real-world clinical aptitude is unclear.
In essence, the study demonstrates that FMs, especially Claude 3.5 Sonnet and GPT-4o, show promising potential in answering ophthalmology questions, even rivaling the performance of experts on textual questions. However, they still lag behind experts in multimodal question answering.
