TSPO: New Framework Boosts LLM Reasoning with Turn-Level Rewards (+24% Performance Gain)

by Lisa Park - Tech Editor

Researchers at Tiansuan Lab, Ant Group Co., Ltd, are addressing a key challenge in training Large Language Models (LLMs) for complex reasoning tasks. They’ve identified a “Double Homogenization Dilemma” in current reinforcement learning (RL) frameworks that hinders the ability of LLMs to effectively learn from multi-turn interactions, particularly those involving search-augmented reasoning.

The core issue, as detailed in the team’s paper, is that existing RL methods often fail to recognize the value of individual reasoning steps. This leads to both “process homogenization,” where different reasoning paths receive the same reward, and “intra-group homogenization,” where coarse-grained rewards limit accurate advantage estimation during training. In short, the system struggles to tell good reasoning apart from bad, which hinders learning.

To overcome this, the team has developed a new framework called Turn-level Stage-aware Policy Optimization (TSPO). TSPO introduces a novel mechanism called the First-Occurrence Latent Reward (FOLR). This mechanism dynamically allocates partial rewards to the specific turn in a multi-turn interaction where the ground-truth answer first appears within retrieved evidence. This approach preserves crucial process-level signals, effectively differentiating between successful and unsuccessful reasoning steps, and increases reward variance within groups without relying on external reward models or manual annotations.
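To make the mechanism concrete, the sketch below shows one way a first-occurrence latent reward could be computed for a multi-turn trajectory. The function names, the partial-reward weight, and the data layout are illustrative assumptions, not the paper’s exact implementation.

```python
# Minimal sketch of a first-occurrence latent reward (FOLR-style), for illustration only.
from typing import List

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def folr_rewards(
    retrieved_per_turn: List[str],   # evidence text retrieved at each turn
    final_answer: str,               # model's final answer
    ground_truth: str,               # gold answer string
    latent_reward: float = 0.5,      # assumed weight for the process-level signal
    outcome_reward: float = 1.0,     # standard outcome-level reward
) -> List[float]:
    """Return one reward per turn: the latent reward goes to the first turn whose
    retrieved evidence contains the ground-truth answer; the outcome reward is
    added on the final turn if the final answer is correct."""
    rewards = [0.0] * len(retrieved_per_turn)
    gt = normalize(ground_truth)

    # Process-level signal: first occurrence of the answer in retrieved evidence.
    for t, evidence in enumerate(retrieved_per_turn):
        if gt in normalize(evidence):
            rewards[t] += latent_reward
            break

    # Outcome-level signal: correctness of the final answer.
    if normalize(final_answer) == gt:
        rewards[-1] += outcome_reward
    return rewards
```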

The researchers demonstrated the effectiveness of TSPO through extensive experiments using the Qwen2.5-3B and Qwen2.5-7B models. Results showed significant performance gains compared to state-of-the-art baseline methods. Specifically, TSPO achieved an average performance improvement of 24% on the Qwen2.5-3B model and 13.6% on the Qwen2.5-7B model across a range of question answering datasets.

This improvement stems from addressing the limitations of sparse, outcome-level rewards. Traditional methods often compress the entire reasoning process into a single scalar value, obscuring the quality of intermediate steps. While previous attempts to address this have involved process-level supervision, these often require costly annotations or rely on proprietary models with limited generalizability. TSPO circumvents these drawbacks by focusing on the initial detection of the correct information.

The research highlights the importance of granular, turn-level rewards in effectively training LLMs for complex reasoning tasks. The team analyzed trajectories along two axes, Outcome Accuracy and Process Integrity, identifying four categories: complete failure, near-miss, full success, and retrieval-free success. Their data confirmed that successful retrieval is essential for correct synthesis, and that near-miss attempts and total failures often received the same zero reward, demonstrating process-level reward homogenization.
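The four categories follow from crossing those two axes. The snippet below is a minimal illustration of that classification, using assumed boolean inputs for outcome accuracy (was the final answer correct?) and process integrity (did the answer appear in retrieved evidence?).

```python
# Illustrative classification of trajectories into the four categories named in
# the article; the input signals and labels are assumptions based on the summary.
def categorize_trajectory(answer_correct: bool, answer_retrieved: bool) -> str:
    if answer_correct and answer_retrieved:
        return "full success"
    if answer_correct and not answer_retrieved:
        return "retrieval-free success"  # correct without the answer in evidence
    if not answer_correct and answer_retrieved:
        return "near-miss"               # retrieved the answer but failed to use it
    return "complete failure"
```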

By allocating rewards based on the first occurrence of the correct answer, FOLR enhances both process signal preservation and variance within training groups, avoiding vanishing advantages during training. Importantly, this breakthrough delivers these improvements without requiring external reward models or additional human annotations.
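The effect on advantage estimation can be seen with a simple GRPO-style group-relative calculation. The numbers below are illustrative only: with outcome-only rewards, a group of failed rollouts yields identical rewards and therefore zero advantages, while partial FOLR rewards restore variance and a usable learning signal.

```python
# Sketch of group-relative advantages, showing the vanishing-advantage problem.
import statistics

def group_advantages(rewards, eps=1e-6):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Outcome-only rewards: four failed rollouts are indistinguishable.
print(group_advantages([0.0, 0.0, 0.0, 0.0]))   # all zeros -> no learning signal

# With FOLR, near-misses that retrieved the answer earn a partial reward,
# so the group has nonzero variance and informative advantages.
print(group_advantages([0.0, 0.5, 0.0, 0.5]))   # differentiates near-misses
```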

Experiments were conducted across seven diverse question answering datasets to rigorously evaluate TSPO’s performance. Performance was measured using exact match (EM) as the primary metric, allowing for a direct assessment of answer accuracy. The results consistently showed TSPO surpassing existing baseline methods.
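Exact match is a standard QA metric: a prediction scores 1 only if it equals the gold answer after normalization. The scorer below is a typical variant; the paper’s precise normalization rules are not reproduced here, so treat the details as assumptions.

```python
# A common exact-match (EM) scorer for QA evaluation.
import re
import string

def normalize_answer(s: str) -> str:
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)  # drop punctuation
    s = re.sub(r"\b(a|an|the)\b", " ", s)                        # drop articles
    return " ".join(s.split())                                   # collapse whitespace

def exact_match(prediction: str, gold: str) -> int:
    return int(normalize_answer(prediction) == normalize_answer(gold))
```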

The authors acknowledge some limitations. The FOLR mechanism relies on accurate retrieval, assuming the correct answer is present in the retrieved data. Currently, the research focuses on search-augmented reasoning tasks. Future work will concentrate on adapting TSPO to a wider range of task types and refining the FOLR mechanism for scenarios where retrieval is imperfect or unnecessary.

This research builds on earlier work exploring turn-level reward allocation for improved reasoning in LLMs. A related study, submitted to the ICLR 2026 conference, reformulates multi-turn reasoning tasks as Markov Decision Processes (MDPs) with explicit turn-level rewards, providing theoretical analysis to support this design. That study also extends popular RL algorithms, GRPO and PPO, to multi-turn variants, enabling fine-grained credit assignment.
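As a rough illustration of that MDP view, the sketch below propagates turn-level rewards backwards into per-turn returns for credit assignment. The discount factor and data layout are assumptions made for this example, not details taken from either paper.

```python
# Turn-level credit assignment: each turn has its own reward, and returns are
# accumulated backwards over turns (discount factor assumed for illustration).
def turn_level_returns(turn_rewards, gamma=1.0):
    returns = [0.0] * len(turn_rewards)
    running = 0.0
    for t in reversed(range(len(turn_rewards))):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a latent reward at turn 1 also credits the earlier turn that led to it.
print(turn_level_returns([0.0, 0.5, 0.0, 1.0]))   # [1.5, 1.5, 1.0, 1.0]
```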

The findings underscore the potential for more effective training of LLMs for complex tasks, potentially leading to advancements in areas such as open-domain question answering and mathematical reasoning. The development of TSPO represents a significant step towards more nuanced and efficient LLM training, addressing a key challenge in reinforcement learning for search-reasoning systems.
