<div dir="ltr"><div dir="ltr">All,<div><br></div><div>I just came across the following -- highly recommended [by TPK] "article" on observed limitations on AI performance when dealing with complex tasks [actually requiring some real understanding of what is being researched]</div><div><br></div><div><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf</a></div><div><br></div><div><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">AI & the illusion of thinking</a></div><div><br></div><div>Apple researchers developed a series of puzzle tests and the tried to see how AI performed depending on the complexity of the challenge</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><i>The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models
via the Lens of Problem Complexity <br></i>Parshin Shojaee∗†, Iman Mirzadeh∗, Keivan Alizadeh,
Maxwell Horton, Samy Bengio, Mehrdad Farajtabar
Apple <br>Abstract <br>Recent generations of frontier language models have introduced Large Reasoning Models
(LRMs) that generate detailed thinking processes before providing answers. While these models
demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights
into the reasoning traces’ structure and quality. In this work, we systematically investigate these
gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis
of not only final answers but also the internal reasoning traces, offering insights into how LRMs
“think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs
face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then
declines despite having an adequate token budget. By comparing LRMs with their standard LLM
counterparts under equivalent inference compute, we identify three performance regimes: <br>(1) low-complexity tasks where standard models surprisingly outperform LRMs, <br>(2) medium-complexity
tasks where additional thinking in LRMs demonstrates advantage, and <br>(3) high-complexity tasks
where both models experience complete collapse. <br>We found that LRMs have limitations in exact
computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We
also investigate the reasoning traces in more depth, studying the patterns of explored solutions
and analyzing the models’ computational behavior, shedding light on their strengths, limitations,
and ultimately raising crucial questions about their true reasoning capabilities...<br></blockquote><div> </div>Hide your eyes --- here's the punchline:<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> 5 Conclusion
In this paper, we systematically examine frontier Large Reasoning Models (LRMs) through the lens
of problem complexity using controllable puzzle environments. Our findings reveal fundamental
limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to
develop generalizable reasoning capabilities beyond certain complexity thresholds. We identified
three distinct reasoning regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at
moderate complexity, and both collapse at high complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent
compute scaling limit in LRMs. Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure
on complex ones. These insights challenge prevailing assumptions about LRM capabilities and
suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.
Finally, we presented some surprising results on LRMs that lead to several open questions for future
work. Most notably, we observed their limitations in performing exact computation; for example,
when we provided the solution algorithm for the Tower of Hanoi to the models, their performance
on this puzzle did not improve. Moreover, investigating the first failure move of the models revealed
surprising behaviors. For instance, they could perform up to 100 correct moves in the Tower of
Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle. We believe our
results can pave the way for future investigations into the reasoning capabilities of these systems.
</blockquote><div> </div>
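<div>Since the conclusion mentions handing the models the Tower of Hanoi solution algorithm, here is a minimal sketch of the classic recursive solution -- my own illustration in Python, not the exact pseudocode the authors supplied -- to show the kind of explicit algorithm the models reportedly could not follow:</div><div><br></div>
<pre>
# Classic recursive Tower of Hanoi: returns the optimal move list for n disks.
# Illustrative sketch only -- not the exact algorithm text the authors gave the models.
def hanoi(n, source="A", target="C", spare="B", moves=None):
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((source, target))              # move the largest disk to the target peg
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 smaller disks on top of it
    return moves

# Optimal length is 2**n - 1, so difficulty can be dialed up smoothly:
# 7 disks already require 127 moves, 10 disks require 1023.
print(len(hanoi(7)), len(hanoi(10)))  # 127 1023
</pre>
<div><br></div><div>For scale: the "up to 100 correct moves" quoted above is less than one complete 7-disk solution (2^7 - 1 = 127 moves).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Limitations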
We acknowledge that our work has limitations. While our puzzle environments enable controlled
experimentation with fine-grained control over problem complexity, they represent a narrow slice of
reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning
problems. It is notable that most of our experiments rely on black-box API access to the closed frontier
LRMs, limiting our ability to analyze internal states or architectural components. Furthermore, the
use of deterministic puzzle simulators assumes that reasoning can be perfectly validated step by
step. However, in less structured domains, such precise validation may not be feasible, limiting the
transferability of this analysis to other more generalizable reasoning... </blockquote><div><br></div><div>Very interesting</div><div><br></div><div>Ted</div><div><br></div></div>
</div>