<div dir="ltr"><div dir="ltr">All,<div><br></div><div>I just came across the following -- highly recommended [by TPK] "article" on observed limitations on AI performance when dealing with complex tasks [actually requiring some real understanding of what is being researched]</div><div><br></div><div><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf</a></div><div><br></div><div><a href="https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf">AI & the illusion of thinking</a></div><div><br></div><div>Apple researchers developed a series of puzzle tests and the tried to see how AI performed depending on the complexity of the challenge</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><i>The Illusion of Thinking:
Understanding the Strengths and Limitations of Reasoning Models
via the Lens of Problem Complexity <br></i>Parshin Shojaee∗†, Iman Mirzadeh∗, Keivan Alizadeh,
Maxwell Horton, Samy Bengio, Mehrdad Farajtabar
Apple <br>Abstract <br>Recent generations of frontier language models have introduced Large Reasoning Models
(LRMs) that generate detailed thinking processes before providing answers. While these models
demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights
into the reasoning traces’ structure and quality. In this work, we systematically investigate these
gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis
of not only final answers but also the internal reasoning traces, offering insights into how LRMs
“think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs
face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then
declines despite having an adequate token budget. By comparing LRMs with their standard LLM
counterparts under equivalent inference compute, we identify three performance regimes: <br>(1) low-complexity tasks where standard models surprisingly outperform LRMs, <br>(2) medium-complexity
tasks where additional thinking in LRMs demonstrates advantage, and <br>(3) high-complexity tasks
where both models experience complete collapse. <br>We found that LRMs have limitations in exact
computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We
also investigate the reasoning traces in more depth, studying the patterns of explored solutions
and analyzing the models’ computational behavior, shedding light on their strengths, limitations,
and ultimately raising crucial questions about their true reasoning capabilities...<br></blockquote><div> </div>Hide your eyes --- here's the punchline:<div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"> 5 Conclusion
In this paper, we systematically examine frontier Large Reasoning Models (LRMs) through the lens
of problem complexity using controllable puzzle environments. Our findings reveal fundamental
limitations in current models: despite sophisticated self-reflection mechanisms, these models fail to
develop generalizable reasoning capabilities beyond certain complexity thresholds. We identified
three distinct reasoning regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at
moderate complexity, and both collapse at high complexity. Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical complexity, suggesting an inherent
compute scaling limit in LRMs. Our detailed analysis of reasoning traces further exposed complexity-dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure
on complex ones. These insights challenge prevailing assumptions about LRM capabilities and
suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.
Finally, we presented some surprising results on LRMs that lead to several open questions for future
work. Most notably, we observed their limitations in performing exact computation; for example,
when we provided the solution algorithm for the Tower of Hanoi to the models, their performance
on this puzzle did not improve. Moreover, investigating the first failure move of the models revealed
surprising behaviors. For instance, they could perform up to 100 correct moves in the Tower of
Hanoi but fail to provide more than 5 correct moves in the River Crossing puzzle. We believe our
results can pave the way for future investigations into the reasoning capabilities of these systems.
</blockquote><div> </div>
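<div>Since the conclusion mentions handing the models the Tower of Hanoi solution algorithm, here is a minimal sketch of the classic recursive solution -- my own illustration in Python, not the exact pseudocode the authors supplied -- to show the kind of explicit algorithm the models reportedly could not follow:</div><div><br></div>
<pre>
# Classic recursive Tower of Hanoi: returns the optimal move list for n disks.
# Illustrative sketch only -- not the exact algorithm text the authors gave the models.
def hanoi(n, source="A", target="C", spare="B", moves=None):
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the n-1 smaller disks on the spare peg
    moves.append((source, target))              # move the largest disk to the target peg
    hanoi(n - 1, spare, target, source, moves)  # restack the n-1 smaller disks on top of it
    return moves

# Optimal length is 2**n - 1, so difficulty can be dialed up smoothly:
# 7 disks already require 127 moves, 10 disks require 1023.
print(len(hanoi(7)), len(hanoi(10)))  # 127 1023
</pre>
<div><br></div><div>For scale: the "up to 100 correct moves" quoted above is less than one complete 7-disk solution (2^7 - 1 = 127 moves).</div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Limitations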
We acknowledge that our work has limitations. While our puzzle environments enable controlled
experimentation with fine-grained control over problem complexity, they represent a narrow slice of
reasoning tasks and may not capture the diversity of real-world or knowledge-intensive reasoning
problems. It is notable that most of our experiments rely on black-box API access to the closed frontier
LRMs, limiting our ability to analyze internal states or architectural components. Furthermore, the
use of deterministic puzzle simulators assumes that reasoning can be perfectly validated step by
step. However, in less structured domains, such precise validation may not be feasible, limiting the
transferability of this analysis to other more generalizable reasoning... </blockquote><div><br></div><div>Very interesting</div><div><br></div><div>Ted</div><div><br></div></div>
</div>