Apple’s AI Research Exposes Reasoning Gaps in Top Models
Updated June 10, 2025
As artificial general intelligence (AGI) gains traction, Apple has injected a dose of reality into the discussion. According to its research paper, “The Illusion of Thinking,” today’s most advanced AI models, despite being touted as having “human-level reasoning,” falter when faced with intricate logic problems. The study suggests these models rely primarily on pattern recognition, drawing on their training data to predict outcomes, rather than genuine reasoning.
Apple’s researchers tested large language models (LLMs) like Anthropic’s Claude 3.7 Sonnet and DeepSeek-V3 using classic logic puzzles such as the Tower of Hanoi and River Crossing. These puzzles are standard benchmarks for assessing an AI’s planning and reasoning skills: the Tower of Hanoi evaluates recursive problem-solving, while River Crossing assesses the ability to plan and execute multi-step solutions.
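To see why the Tower of Hanoi works as a difficulty dial, consider its standard recursive solution; here is a minimal sketch in Swift (illustrative only, not code from the paper):

```swift
// Tower of Hanoi: the classic recursive solution. Moving n disks requires a
// minimum of 2^n - 1 moves, so difficulty scales exponentially with disk count.
func hanoi(_ n: Int, from source: String, to target: String, via spare: String,
           moves: inout [(String, String)]) {
    guard n > 0 else { return }
    hanoi(n - 1, from: source, to: spare, via: target, moves: &moves)  // clear the smaller disks
    moves.append((source, target))                                     // move the largest disk
    hanoi(n - 1, from: spare, to: target, via: source, moves: &moves)  // restack on top of it
}

var moves: [(String, String)] = []
hanoi(3, from: "A", to: "C", via: "B", moves: &moves)
print(moves.count)  // 7, i.e. 2^3 - 1
```

Adding a single disk doubles the minimum number of moves, which is what lets researchers turn one puzzle into a smoothly graded difficulty scale.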
The puzzles were categorized into three difficulty levels. While the models performed reasonably well on simpler tasks, their performance declined substantially as complexity increased. This held true regardless of model size, training method, or computational power. Even when given the correct algorithm, the models struggled to produce meaningful responses, suggesting a “counterintuitive scaling limit” in which reasoning effort decreases as complexity rises.
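Because each puzzle has mechanically checkable rules, a model’s answer can be scored without any human judgment. A minimal Swift sketch of such a checker (illustrative; this is not the paper’s actual evaluation harness):

```swift
// Deterministic Tower of Hanoi checker: simulate each proposed move and reject
// any that takes from an empty peg or places a larger disk on a smaller one.
func isValidSolution(disks n: Int, moves: [(from: Int, to: Int)]) -> Bool {
    // Pegs 0-2; disks numbered 1 (smallest) to n (largest), all starting on peg 0.
    var pegs: [[Int]] = [Array((1...n).reversed()), [], []]
    for move in moves {
        guard let disk = pegs[move.from].popLast() else { return false } // empty peg
        if let top = pegs[move.to].last, top < disk { return false }     // illegal stacking
        pegs[move.to].append(disk)
    }
    return pegs[2].count == n  // solved when every disk ends on the last peg
}

print(isValidSolution(disks: 2, moves: [(from: 0, to: 1), (from: 0, to: 2), (from: 1, to: 2)]))  // true
```

A checker like this makes failure modes unambiguous: a model either produces a legal, complete move sequence or it does not, no matter how fluent its accompanying explanation sounds.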
Apple argues that what is often perceived as reasoning may simply be advanced pattern-matching. This outlook offers a possible explanation for Apple’s measured approach to AI development.
The paper’s authors also criticize how reasoning is currently measured: “Current evaluations focus primarily on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this paradigm often suffers from data contamination and fails to provide insights into the structure and quality of reasoning traces… Our setup allows analysis not only of the final answers but also of the internal reasoning traces, offering insights into how Large Reasoning Models (LRMs) ‘think.’”
The research paper preceded Apple’s annual Worldwide Developers Conference (WWDC), where executives introduced the Foundation Models framework. The framework gives developers access to Apple’s on-device AI models for features such as image generation, text creation, and natural language search, without relying on cloud infrastructure. Apple also unveiled Xcode 26, featuring built-in support for integrating AI models like ChatGPT and Claude via API keys, making it easier to build intelligent applications.
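The developer-facing API is deliberately small. Here is a minimal Swift sketch of calling the on-device model through the Foundation Models framework; the names follow Apple’s WWDC examples (LanguageModelSession, respond(to:)), but the helper function and prompt are hypothetical, and exact signatures may differ in shipping documentation:

```swift
import FoundationModels

// Hypothetical helper: ask Apple's on-device model for a one-sentence summary.
// LanguageModelSession and respond(to:) follow Apple's WWDC 2025 examples;
// verify against current documentation before relying on exact signatures.
func summarize(_ text: String) async throws -> String {
    let session = LanguageModelSession(
        instructions: "Summarize the user's text in one sentence."
    )
    let response = try await session.respond(to: text)
    return response.content  // generated on device, no network round trip
}
```

Because inference runs locally, calls like this can work offline and keep user data on the device.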
What’s next
Apple’s advancements in AI, especially the Foundation Models framework and Xcode 26, signal a move toward empowering developers with local AI capabilities. This approach contrasts with the cloud-dependent strategies of some competitors, perhaps offering users enhanced privacy and efficiency in their AI interactions.
