<div dir="ltr">  All,<div><br></div><div>I just came across an article in MIT News about work assessing how large language models (e.g., ChatGPT) deal with problems outside their training.</div><div><br></div><div>Here's the MIT News article:</div><div><br></div><div><font size="4">Reasoning skills of large language models are often overestimated</font><br>New CSAIL research highlights how LLMs excel in familiar scenarios but struggle in novel ones, questioning their true reasoning abilities versus reliance on memorization.<br>Rachel Gordon | MIT CSAIL<br>Publication Date: July 11, 2024<br></div><div><a href="https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711">https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711</a><br></div><div><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.</blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers recently peered into the proverbial magnifying glass to examine how LLMs fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. 
It turns out that their reasoning abilities are often overestimated.</blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations deviating from default conditions — which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed some tests outside the models’ comfort zones by tweaking existing tasks instead of creating entirely new ones. They used a variety of datasets and benchmarks specifically tailored to different aspects of the models' capabilities for things like arithmetic, chess, evaluating code, answering logical questions, etc...</blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research. “As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.”</blockquote><div><br></div><div>Here's the technical article</div><div><br></div><div><font size="4">Reasoning or Reciting? 
Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks</font><br>Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim</div><div><br>The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. 
These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.<br>Comments:   NAACL 2024<br>Subjects:   Computation and Language (cs.CL); Artificial Intelligence (cs.AI)<br>Cite as:     arXiv:2307.02477 [cs.CL]<br>     (or arXiv:2307.02477v3 [cs.CL] for this version)<br> <br><a href="https://doi.org/10.48550/arXiv.2307.02477">https://doi.org/10.48550/arXiv.2307.02477</a><br></div><div><br></div><div>Also available as a PDF through arXiv:</div><div> <a href="https://arxiv.org/pdf/2307.02477">https://arxiv.org/pdf/2307.02477</a></div><div><br></div><div>A couple of interesting excerpts from the paper:<br><div><br></div></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><b>Abstract</b><br>The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on “counterfactual” task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. 
These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior...</blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><b>9 Conclusion</b></blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Through our counterfactual evaluation on 11 tasks, we identified consistent and substantial degradation of LM performance under counterfactual conditions. We attribute this gap to overfitting to the default task variants, and thus encourage future LM analyses to explicitly consider abstract task ability as detached from observed task performance, especially when these evaluated task variants might exist in abundance in the LM pretraining corpora.</blockquote><div> </div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Furthermore, insofar as this degradation is a result of the LMs’ being trained only on surface form text, it would also be interesting future work to see if more grounded LMs (grounded in the “real” world, or some semantic representation, etc.) are more robust to task variations.</blockquote><div><br></div><div>In other words: when it comes to the ability of LLMs to actually reason, caveat emptor might be the first-order assessment.</div><div><br></div><div>Ted</div><div><br></div><div>PS: Maybe we should get a talk by one of the authors?</div><div><br></div></div>