Apple Study Flags ‘Accuracy Collapse’ in Advanced AI Systems When Faced with Complex Tasks

Apple researchers have identified significant limitations in advanced artificial intelligence models, warning of a “complete accuracy collapse” when such systems are tasked with solving complex problems.

In a paper released over the weekend, Apple stated that large reasoning models (LRMs), a form of AI designed to break down and work through intricate challenges step by step, began to underperform as task complexity increased. The research found that while standard models and LRMs performed adequately on simpler tasks, both suffered a breakdown in reasoning when presented with higher-complexity problems.

The study described this collapse as “particularly concerning”, noting that as tasks became harder, the models paradoxically reduced their reasoning efforts. “Upon approaching a critical threshold – which closely corresponds to their accuracy collapse point – models counterintuitively begin to reduce their reasoning effort despite increasing problem difficulty,” the paper noted.

The researchers suggested this behaviour points to a “fundamental scaling limitation in the thinking capabilities of current reasoning models”. The team tested a range of systems, including OpenAI’s o3, Google’s Gemini Thinking, Anthropic’s Claude 3.7 Sonnet-Thinking, and DeepSeek-R1.

The paper also criticised current model efficiency, reporting that reasoning systems expended unnecessary computing effort on simpler problems and consistently failed on more complex ones, even when given algorithms that should have led to the correct solution.

Apple concluded that these findings cast doubt on current methods for achieving more generalisable AI reasoning and suggest a possible dead end in the industry’s approach to artificial general intelligence (AGI).
