Multimodal Reasoning: New Error Tracking Metric
- Computer scientists have developed elegant machine learning models capable of high performance across varied tasks.
- Models such as OpenAI's GPT-4 with Vision (GPT-4V), DeepSeek-R1, and Google Gemini are widely used to create multimodal content, including images and tailored texts.
- Researchers are assessing the reasoning abilities of these models, especially how they handle visual inputs.
A new study scrutinizes the reliability of multimodal reasoning models. The research introduces a new metric and dataset, RH-Bench, designed to track and assess how these models, including widely used ones such as GPT-4V and Gemini, generate inaccurate outputs, or hallucinations, during reasoning tasks. The study emphasizes that reasoning models often amplify these errors, a key insight for improving AI accuracy.
Benchmarking Hallucinations: New Metric Tracks Multimodal Reasoning Models
Updated June 15, 2025

(a) Outputs from reasoning and non-reasoning models on a perception task, highlighting visual hallucination. Multimodal reasoning models amplify hallucinations. (b) Model performance on reasoning and perception tasks in the RH-Bench dataset.
Credit: Liu et al.
Computer scientists have developed elegant machine learning models capable of high performance across varied tasks. Multimodal large language models (MLLMs) can process and generate different data types, including text, images, and videos.
Models such as OpenAI’s GPT-4 with Vision (GPT-4V), DeepSeek-R1, and Google Gemini are widely used to create multimodal content, including images and tailored texts.
Researchers are assessing the reasoning abilities of these models, especially how they handle visual inputs. A study by Liu et al., available on arXiv, investigates how reasoning processes can amplify hallucinations in MLLMs. The research introduces a new metric and dataset, RH-Bench, to evaluate these models.
The study, “More Thinking, Less Seeing? Assessing Amplified Hallucination in Multimodal Reasoning Models,” highlights that while MLLMs excel in many areas, they can also generate outputs that contain inaccuracies or fabrications, known as hallucinations. The researchers found that reasoning models are more prone to amplifying these hallucinations than non-reasoning models.
The RH-Bench dataset includes tasks designed to test both reasoning and perception. The results indicate that models with strong reasoning capabilities often exhibit more hallucinations. Baseline non-reasoning models typically show weaker reasoning but fewer hallucinations.
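To make the reasoning-versus-perception comparison concrete, the sketch below shows one simple way such a tally could be computed. It is not the scoring procedure used by Liu et al.; the record fields (`model`, `task`, `correct`), the function name `summarize`, and the rule that a wrong answer on a perception item counts as a hallucination are all assumptions made for illustration, and the sample records are placeholders rather than results from the study.

```python
from collections import defaultdict


def summarize(records):
    """Per-model reasoning accuracy and perception hallucination rate.

    Each record is a dict with hypothetical keys: "model", "task"
    ("reasoning" or "perception"), and "correct" (bool).
    """
    # counts[model][task] = [number correct, number seen]
    counts = defaultdict(lambda: {"reasoning": [0, 0], "perception": [0, 0]})
    for rec in records:
        bucket = counts[rec["model"]][rec["task"]]
        bucket[0] += int(rec["correct"])
        bucket[1] += 1

    summary = {}
    for model, tasks in counts.items():
        r_ok, r_n = tasks["reasoning"]
        p_ok, p_n = tasks["perception"]
        summary[model] = {
            "reasoning_accuracy": r_ok / r_n if r_n else None,
            # Assumption: a wrong answer on a perception item is a hallucination.
            "hallucination_rate": 1 - p_ok / p_n if p_n else None,
        }
    return summary


# Placeholder records for illustration only; not results from the study.
example_records = [
    {"model": "model_a", "task": "reasoning", "correct": True},
    {"model": "model_a", "task": "reasoning", "correct": True},
    {"model": "model_a", "task": "perception", "correct": True},
    {"model": "model_a", "task": "perception", "correct": False},
    {"model": "model_b", "task": "reasoning", "correct": True},
    {"model": "model_b", "task": "reasoning", "correct": False},
    {"model": "model_b", "task": "perception", "correct": True},
    {"model": "model_b", "task": "perception", "correct": True},
]

example_summary = summarize(example_records)
```

Reporting the two numbers side by side, rather than a single combined score, keeps the trade-off described above visible: a model can score higher on reasoning items while also producing more perception errors.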
What’s next
The findings suggest that future research should focus on reducing hallucinations in multimodal reasoning models to improve their reliability and accuracy in real-world applications.
