[Lex Computer & Tech Group/LCTG] FYI: An interesting assessment of LLM and its limitations

Ted Kochanski tedpkphd at gmail.com
Fri Jul 12 13:04:30 PDT 2024


  All,

I just came across an article in MIT News about work assessing how
Large Language Models [e.g., ChatGPT] deal with problems outside their
training data.

Here's the MIT News article

Reasoning skills of large language models are often overestimated
New CSAIL research highlights how LLMs excel in familiar scenarios but
struggle in novel ones, questioning their true reasoning abilities versus
reliance on memorization.
Rachel Gordon | MIT CSAIL
Publication Date: July 11, 2024
https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711

> When it comes to artificial intelligence, appearances can be deceiving. The
> mystery surrounding the inner workings of large language models (LLMs)
> stems from their vast size, complex training methods, hard-to-predict
> behaviors, and elusive interpretability.



> MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL)
> researchers recently peered into the proverbial magnifying glass to examine
> how LLMs fare with variations of different tasks, revealing intriguing
> insights into the interplay between memorization and reasoning skills. It
> turns out that their reasoning abilities are often overestimated.



> The study compared “default tasks,” the common tasks a model is trained
> and tested on, with “counterfactual scenarios,” hypothetical situations
> deviating from default conditions — which models like GPT-4 and Claude can
> usually be expected to cope with. The researchers developed some tests
> outside the models’ comfort zones by tweaking existing tasks instead of
> creating entirely new ones. They used a variety of datasets and benchmarks
> specifically tailored to different aspects of the models' capabilities for
> things like arithmetic, chess, evaluating code, answering logical
> questions, etc...
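
To make the setup concrete: the arithmetic variant keeps the task (addition)
but swaps the familiar base-10 for an unfamiliar base such as base-9. Here's
a rough Python sketch of that default-vs-counterfactual comparison (the
prompt wording and the model() stand-in are my own illustration, not the
paper's code):

    def to_base(n, base):
        """Render a non-negative integer in the given base (base <= 10)."""
        if n == 0:
            return "0"
        digits = []
        while n:
            digits.append(str(n % base))
            n //= base
        return "".join(reversed(digits))

    def addition_prompt(a, b, base):
        # a and b are decimal ints; the prompt shows them in the target base
        return (f"Assume all numbers are written in base-{base}. "
                f"What is {to_base(a, base)} + {to_base(b, base)}? "
                f"Answer only with the result in base-{base}.")

    def accuracy(model, pairs, base):
        """model: prompt -> answer string (stand-in for an LLM API call)."""
        hits = sum(model(addition_prompt(a, b, base)).strip()
                   == to_base(a + b, base)
                   for a, b in pairs)
        return hits / len(pairs)

    # Same additions, two conditions; the paper's finding is that this
    # gap is consistently positive:
    #   gap = accuracy(llm, pairs, base=10) - accuracy(llm, pairs, base=9)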



> “We’ve uncovered a fascinating aspect of large language models: they excel
> in familiar scenarios, almost like a well-worn path, but struggle when the
> terrain gets unfamiliar. This insight is crucial as we strive to enhance
> these models’ adaptability and broaden their application horizons,” says
> Zhaofeng Wu, an MIT PhD student in electrical engineering and computer
> science, CSAIL affiliate, and the lead author on a new paper about the
> research. “As AI is becoming increasingly ubiquitous in our society, it
> must reliably handle diverse scenarios, whether familiar or not. We hope
> these insights will one day inform the design of future LLMs with improved
> robustness.”


Here's the technical article

Reasoning or Reciting? Exploring the Capabilities and Limitations of
Language Models Through Counterfactual Tasks
Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin
Wang, Najoung Kim, Jacob Andreas, Yoon Kim

The impressive performance of recent language models across a wide range of
tasks suggests that they possess a degree of abstract reasoning skills. Are
these skills general and transferable, or specialized to specific tasks
seen during pretraining? To disentangle these effects, we propose an
evaluation framework based on "counterfactual" task variants that deviate
from the default assumptions underlying standard tasks. Across a suite of
11 tasks, we observe nontrivial performance on the counterfactual variants,
but nevertheless find that performance substantially and consistently
degrades compared to the default conditions. This suggests that while
current LMs may possess abstract task-solving skills to an extent, they
often also rely on narrow, non-transferable procedures for task-solving.
These results motivate a more careful interpretation of language model
performance that teases apart these aspects of behavior.
Comments: NAACL 2024
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2307.02477 [cs.CL]
  (or arXiv:2307.02477v3 [cs.CL] for this version)

https://doi.org/10.48550/arXiv.2307.02477

Also available as a PDF via arXiv:
 https://arxiv.org/pdf/2307.02477
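
The framework itself is simple to state: every task has a default variant
and a counterfactual variant, and you compare accuracy on the two. A minimal
harness sketch under that reading (the type names and layout are my own
assumptions, not the authors' released code):

    from typing import Callable, Dict, List, Tuple

    Variant = List[Tuple[str, str]]   # (prompt, expected answer) pairs
    Model = Callable[[str], str]      # stand-in for an LLM API call

    def accuracy(model: Model, variant: Variant) -> float:
        hits = sum(model(prompt).strip() == answer
                   for prompt, answer in variant)
        return hits / len(variant)

    def evaluate(model: Model, tasks: Dict[str, Dict[str, Variant]]) -> None:
        # tasks maps a task name to its "default" and "counterfactual" sets
        for name, variants in tasks.items():
            d = accuracy(model, variants["default"])
            c = accuracy(model, variants["counterfactual"])
            print(f"{name}: default={d:.2f}  counterfactual={c:.2f}  "
                  f"gap={d - c:+.2f}")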

An interesting excerpt from the paper's conclusion (the abstract is quoted
above):

> *9 Conclusion*



> Through our counterfactual evaluation on 11 tasks, we identified
> consistent and substantial degradation of LM performance under
> counterfactual conditions. We attribute this gap to overfitting to the
> default task variants, and thus encourage future LM analyses to explicitly
> consider abstract task ability as detached from observed task performance,
> especially when these evaluated task variants might exist in abundance in
> the LM pretraining corpora.



> Furthermore, insofar as this degradation is a result of the LMs’ being
> trained only on surface form text, it would also be interesting future work
> to see if more grounded LMs (grounded in the “real” world, or some semantic
> representation, etc.) are more robust to task variations.
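
One methodological detail worth flagging: to rule out models simply failing
to understand the altered rules, the paper pairs each counterfactual task
with a "counterfactual comprehension check" (CCC), a much simpler probe
question posed under the same altered condition. A sketch of what such a
check might look like for base-9 arithmetic (the wording here is mine, not
the paper's):

    def ccc_prompt(base):
        # An easy question under the counterfactual condition; a model that
        # fails this hasn't understood the rules, so its failure on the
        # harder task says little about reasoning.
        return (f"Assume all numbers are written in base-{base}. "
                f"You have 36 (base-{base}) apples and eat one. "
                f"How many are left, in base-{base}?")

    def passes_ccc(model, base, expected):
        return model(ccc_prompt(base)).strip() == expected

    # In base-9, "36" is 3*9 + 6 = 33 in decimal; one less is 32 decimal,
    # which is written "35" in base-9.
    #   ok = passes_ccc(llm, base=9, expected="35")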


In other words: when it comes to the ability of LLMs to actually reason,
caveat emptor might be the first-order assessment.

Ted

PS: maybe we should get a talk by one of the authors?