AI Agents: New Framework Cuts Tool Use & Boosts Reasoning Accuracy with HDPO
- Alibaba researchers have developed a new reinforcement learning framework, Hierarchical Decoupled Policy Optimization (HDPO), designed to address a key challenge in AI agent development: the tendency for large language models to unnecessarily invoke external tools.
- The researchers found that current AI agents suffer from a “profound metacognitive deficit,” struggling to determine when to rely on their internal knowledge versus querying external utilities.
- Excessive tool use creates operational hurdles for real-world applications, including latency, added cost, and degraded reasoning.
Alibaba researchers have developed a new reinforcement learning framework, Hierarchical Decoupled Policy Optimization (HDPO), designed to address a key challenge in AI agent development: the tendency for large language models to unnecessarily invoke external tools. This "trigger-happy" behavior leads to latency, increased costs, and degraded reasoning, according to a paper released on April 26, 2026. The framework, together with Metis, a multimodal model trained with it, significantly reduces redundant tool calls while simultaneously improving reasoning accuracy.
The researchers found that current AI agents suffer from a “profound metacognitive deficit,” struggling to determine when to rely on their internal knowledge versus querying external utilities. They often blindly invoke tools like web search or code execution even when the necessary information is already available.
The Metacognitive Deficit and HDPO’s Solution
This excessive tool use creates operational hurdles for real-world applications. Every unnecessary API call introduces processing delays and increases expenses. Redundant tool interactions introduce “noise” into the model’s context, potentially derailing the reasoning process and negatively impacting the final output.

Previous attempts to address this issue through reinforcement learning, by penalizing tool usage, proved problematic. Aggressive penalties suppressed essential tool use, while mild penalties were ineffective. The researchers explain that this created an “optimization dilemma” and “semantic ambiguity,” making it difficult for the model to learn effective tool-use strategies.
HDPO solves this dilemma by separating accuracy and efficiency into two independent optimization channels. The “accuracy channel” focuses solely on maximizing task correctness, while the “efficiency channel” optimizes for execution economy. The efficiency signal is conditional upon accuracy, meaning an incorrect response is never rewarded for being fast or using fewer tools. This decoupling allows the model to learn both goals without compromising either.
This design also creates an implicit “cognitive curriculum.” Early in training, the model prioritizes learning correct reasoning. As its reasoning capabilities improve, the efficiency signal increases, encouraging the model to refine its self-reliance and avoid redundant API calls.
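The decoupled, accuracy-gated reward described above can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the function name `hdpo_reward`, the linear efficiency shape, and the `max_tool_calls` budget are all assumptions, since the article does not specify the exact reward formulas.

```python
def hdpo_reward(correct: bool, tool_calls: int, max_tool_calls: int = 8):
    """Return (accuracy_reward, efficiency_reward) as two separate channels.

    The accuracy channel depends only on task correctness. The efficiency
    channel rewards using fewer tools, but is gated on correctness, so an
    incorrect answer never earns credit for being fast or cheap.
    (Illustrative sketch; reward shapes are assumptions, not from the paper.)
    """
    accuracy_reward = 1.0 if correct else 0.0
    if not correct:
        return accuracy_reward, 0.0  # wrong answers get no efficiency credit
    # Fewer tool calls -> higher efficiency reward (linear shape is illustrative).
    efficiency_reward = 1.0 - min(tool_calls, max_tool_calls) / max_tool_calls
    return accuracy_reward, efficiency_reward
```

Because the efficiency term only pays out on correct answers, the signal is weak early in training, when most rollouts are wrong, and strengthens as accuracy improves, which matches the implicit "cognitive curriculum" the researchers describe.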
Metis: A Demonstration of HDPO in Action
To test HDPO, the researchers developed Metis, a multimodal reasoning agent built on the Qwen3-VL-8B-Instruct vision-language model. Metis was trained in two stages: supervised fine-tuning using a curated dataset, followed by reinforcement learning using the HDPO framework.
In evaluations against other models – including LLaVA-OneVision, text-only reasoners, DeepEyes V2, and the 30-billion-parameter Skywork-R1V4 – Metis achieved state-of-the-art or highly competitive performance on visual perception and reasoning tasks. The researchers tested Metis on datasets like HRBench, V*Bench, WeMath, and MathVista.

The researchers highlighted examples of Metis’s improved decision-making. When presented with a clear image of a sign, Metis recognized the text was legible and skipped unnecessary image processing steps. In another instance, when faced with a complex chart requiring detailed visual analysis, Metis invoked Python to zoom in on a specific region, enabling accurate identification of data points.
“Our results demonstrate that strategic tool use and strong reasoning performance are not a trade-off; rather, eliminating noisy, redundant tool calls directly contributes to superior accuracy,” the researchers conclude.
The researchers released Metis, along with the code for HDPO, under the Apache 2.0 license, making it available for wider use and further development. This release could accelerate the development of more efficient and cost-effective AI agents capable of more nuanced decision-making regarding tool utilization.
