[Lex Computer & Tech Group/LCTG] More on AI
Ted Kochanski
tedpkphd at gmail.com
Thu Jun 12 09:47:27 PDT 2025
All,
I just came across the following -- a highly recommended [by TPK] paper
on observed limitations of AI performance on complex tasks
[i.e., tasks actually requiring some real understanding of the problem]:
AI & the illusion of thinking
<https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf>
Apple researchers developed a series of puzzle tests and then examined
how AI models performed as the complexity of the challenge increased.
> *The Illusion of Thinking: Understanding the Strengths and Limitations of
> Reasoning Models via the Lens of Problem Complexity*
> Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton,
> Samy Bengio, Mehrdad Farajtabar (Apple)
> Abstract
> Recent generations of frontier language models have introduced Large
> Reasoning Models (LRMs) that generate detailed thinking processes before
> providing answers. While these models demonstrate improved performance on
> reasoning benchmarks, their fundamental capabilities, scaling properties,
> and limitations remain insufficiently understood. Current evaluations
> primarily focus on established mathematical and coding benchmarks,
> emphasizing final answer accuracy. However, this evaluation paradigm often
> suffers from data contamination and does not provide insights into the
> reasoning traces’ structure and quality. In this work, we systematically
> investigate these gaps with the help of controllable puzzle environments
> that allow precise manipulation of compositional complexity while
> maintaining consistent logical structures. This setup enables the analysis
> of not only final answers but also the internal reasoning traces, offering
> insights into how LRMs “think”. Through extensive experimentation across
> diverse puzzles, we show that frontier LRMs face a complete accuracy
> collapse beyond certain complexities. Moreover, they exhibit a
> counterintuitive scaling limit: their reasoning effort increases with
> problem complexity up to a point, then declines despite having an adequate
> token budget. By comparing LRMs with their standard LLM counterparts under
> equivalent inference compute, we identify three performance regimes:
> (1) low-complexity tasks where standard models surprisingly outperform LRMs,
> (2) medium-complexity tasks where additional thinking in LRMs demonstrates
> an advantage, and
> (3) high-complexity tasks where both models experience complete collapse.
> We found that LRMs have limitations in exact computation: they fail to use
> explicit algorithms and reason inconsistently across puzzles. We also
> investigate the reasoning traces in more depth, studying the patterns of
> explored solutions and analyzing the models’ computational behavior,
> shedding light on their strengths, limitations, and ultimately raising
> crucial questions about their true reasoning capabilities...
>
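To make the paper's "controllable puzzle environment" idea concrete: Tower of
Hanoi is one of the puzzles they use, and its complexity knob is simply the
number of disks n -- the optimal solution takes exactly 2^n - 1 moves. Here is
a minimal Python sketch of my own [not the authors' actual harness] of how
such an environment can generate and step-validate move sequences:

# Minimal sketch (my illustration, not the paper's harness) of a
# controllable puzzle environment: Tower of Hanoi, where complexity is
# set by the disk count n and the optimal solution is 2**n - 1 moves.

def hanoi_moves(n, src=0, aux=1, dst=2):
    """Generate the optimal move sequence for n disks as (from, to) pairs."""
    if n == 0:
        return []
    return (hanoi_moves(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_moves(n - 1, aux, src, dst))

def validate(n, moves):
    """Replay a move list against the rules; return the index of the first
    illegal move, or None if every move is legal and the puzzle ends solved.
    This mirrors the paper's step-by-step checking of reasoning traces."""
    pegs = [list(range(n, 0, -1)), [], []]  # largest disk n at the bottom
    for i, (a, b) in enumerate(moves):
        if not pegs[a] or (pegs[b] and pegs[b][-1] < pegs[a][-1]):
            return i  # empty source peg, or placing onto a smaller disk
        pegs[b].append(pegs[a].pop())
    return None if pegs[2] == list(range(n, 0, -1)) else len(moves)

for n in (3, 7, 10):
    moves = hanoi_moves(n)
    print(f"n={n}: {len(moves)} moves, first bad move: {validate(n, moves)}")
    # n=3: 7 moves; n=7: 127; n=10: 1023 -- exponential growth

Note the exponential blow-up: n=10 already demands 1023 exactly-right moves in
sequence, so a collapse "beyond certain complexities" doesn't require
contrived inputs.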
Hide your eyes --- here's the punchline:
> 5 Conclusion
> In this paper, we systematically examine frontier Large
> Reasoning Models (LRMs) through the lens of problem complexity using
> controllable puzzle environments. Our findings reveal fundamental
> limitations in current models: despite sophisticated self-reflection
> mechanisms, these models fail to develop generalizable reasoning
> capabilities beyond certain complexity thresholds. We identified three
> distinct reasoning regimes: standard LLMs outperform LRMs at low
> complexity, LRMs excel at moderate complexity, and both collapse at high
> complexity. Particularly concerning is the counterintuitive reduction in
> reasoning effort as problems approach critical complexity, suggesting an
> inherent compute scaling limit in LRMs. Our detailed analysis of reasoning
> traces further exposed complexity dependent reasoning patterns, from
> inefficient “overthinking” on simpler problems to complete failure on
> complex ones. These insights challenge prevailing assumptions about LRM
> capabilities and suggest that current approaches may be encountering
> fundamental barriers to generalizable reasoning. Finally, we presented some
> surprising results on LRMs that lead to several open questions for future
> work. Most notably, we observed their limitations in performing exact
> computation; for example, when we provided the solution algorithm for the
> Tower of Hanoi to the models, their performance on this puzzle did not
> improve. Moreover, investigating the first failure move of the models
> revealed surprising behaviors. For instance, they could perform up to 100
> correct moves in the Tower of Hanoi but fail to provide more than 5 correct
> moves in the River Crossing puzzle. We believe our results can pave the way
> for future investigations into the reasoning capabilities of these systems.
> Limitations We acknowledge that our work has limitations. While our puzzle
> environments enable controlled experimentation with fine-grained control
> over problem complexity, they represent a narrow slice of reasoning tasks
> and may not capture the diversity of real-world or knowledge-intensive
> reasoning problems. It is notable that most of our experiments rely on
> black-box API access to the closed frontier LRMs, limiting our ability to
> analyze internal states or architectural components. Furthermore, the use
> of deterministic puzzle simulators assumes that reasoning can be perfectly
> validated step by step. However, in less structured domains, such precise
> validation may not be feasible, limiting the transferability of this
> analysis to other more generalizable reasoning...
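For a sense of scale on that last Hanoi-vs-River-Crossing contrast: 100
correct Hanoi moves is most of an optimal 7-disk solution (2^7 - 1 = 127
moves), while classic river-crossing puzzles are solvable in roughly a dozen
boat trips, so raw solution length alone can't explain the much earlier
failure. The paper's variant pairs actors with their agents; as a stand-in,
here is a quick brute-force search of my own over the classic
missionaries-and-cannibals version (3 of each, boat holds 2):

# Sketch: breadth-first search over the classic missionaries-and-
# cannibals river crossing, as a stand-in for the paper's actor/agent
# variant. The optimal plan is short and the state space is tiny.
from collections import deque

def legal(m, c):
    # A bank is safe if it has no missionaries or at least as many
    # missionaries as cannibals; check both banks (totals are 3 and 3).
    return (m == 0 or m >= c) and (3 - m == 0 or 3 - m >= 3 - c)

def solve():
    start, goal = (3, 3, 1), (0, 0, 0)  # (missionaries, cannibals, boat) on left bank
    parent = {start: None}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state == goal:
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return path[::-1]
        m, c, b = state
        for dm, dc in [(1, 0), (2, 0), (0, 1), (0, 2), (1, 1)]:
            s = 1 if b else -1  # boat leaves the left bank or returns to it
            nm, nc = m - s * dm, c - s * dc
            nxt = (nm, nc, 1 - b)
            if 0 <= nm <= 3 and 0 <= nc <= 3 and legal(nm, nc) and nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)

plan = solve()
print(len(plan) - 1, "boat trips")  # 11 trips for the classic puzzle

The entire state space here is at most 32 (missionaries, cannibals, boat)
triples, which is what makes a model's failure within the first 5 moves so
striking.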
Very interesting
Ted