[Lex Computer & Tech Group/LCTG] FYI: An interesting assessment of LLMs and their limitations
Mitchell I. Wolfe
mwolfe at vinebrook.com
Mon Jul 15 06:51:57 PDT 2024
On Gordon Deal's broadcast this morning, there was an interview with Rob
Enderle about the dangers of AI's current focus on speed over quality. A
transcript excerpt is below:
"
This Morning with Gordon Deal
Published: July 15, 2024

Artificial intelligence is moving at lightning speed, but the way it's
being implemented could be dangerous. More now from our own Gordon Deal.

Speaking with Rob Enderle, founder and principal analyst at the Enderle
Group. He's written a piece for Datanami called "Why the Current Approach
for AI Is Excessively Dangerous," about how, as you point out, the focus
here is on productivity versus quality. Give an example. What have you
seen?

It's really speed versus quality. If you look at everything we're asking
these AIs to do, it really has to do with productivity and speed; in other
words, how much stuff you can turn out. When you ask one to write an
article, it writes the article. We saw that with the AI used for legal
briefs: you ask it to write a legal brief, and in a matter of seconds it
turns out a legal brief. The problem is that the stuff it's creating is
very low quality; AIs hallucinate. And so the attorneys who brought
forward that legal brief were, I believe, disbarred as a result, because
it contained a lot of citations to events that never occurred.

And judges don't have much of a sense of humor when it comes to falsified
documentation. So these things are being used to create stuff really
fast, but they're not being used to assure the quality of the things they
create.

I thought it was interesting what you said; I guess on the human side, we
still have to perform certain tasks.

Yeah, we still do. If you look at the balance here, ideally you'd want AI
to do the things that you don't like doing. Let's go to coding. When I was
a coder myself, the things I didn't like doing were planning it out, error
checking, and doing my own quality control and editing, particularly back
when we were on punch cards. It was just annoying. And it's the same with
coders today: they don't like commenting their code, and they don't like
doing quality validation.

What they like doing is writing code, and what did we get the AIs to do?
We got the AIs to write the code. In other words, they do the thing that
coders like doing, but not the things that coders don't like doing. In
fact, what we ought to be doing with AI initially is focusing it on the
things people don't like doing, as opposed to the things people do like
doing, and making sure that the stuff being turned out is of acceptable
quality, not extremely poor quality, which unfortunately has been the
outcome.

We're speaking with Rob Enderle, founder and principal analyst at the
Enderle Group. He's written a piece for Datanami called "Why the Current
Approach for AI Is Excessively Dangerous." What do you want to see here?
What needs to change now?

I'd like to see a much tighter focus on quality, to really assure the
outcome. As I said, the current approach is resulting in a lot of
low-quality output, a lot of very dangerous output. Take the AI used in
Google search: there was a search done the other day asking how you keep
the cheese from sliding off a pizza, and Google's AI said to use glue.
Not a good response. Depending on the glue that's used, that stuff could
be toxic, so it's simply not good advice. And people are using these AI
tools to search for answers: medical answers, culinary answers, and the
rest.

And they're getting answers that they really shouldn't use, because
they're dangerous, if not deadly.

Tech analyst Rob Enderle, with our own Gordon Deal.
"
On 2024-07-12 16:04, Ted Kochanski via LCTG wrote:
> All,
>
> I just came across an article in MIT News about work on assessing how
> large language models [e.g., ChatGPT] deal with problems outside of
> their training.
>
> Here's the MIT News article
>
> Reasoning skills of large language models are often overestimated
> New CSAIL research highlights how LLMs excel in familiar scenarios but
> struggle in novel ones, questioning their true reasoning abilities
> versus reliance on memorization.
> Rachel Gordon | MIT CSAIL
> Publication Date: July 11, 2024
> https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711
>
>> When it comes to artificial intelligence, appearances can be
>> deceiving. The mystery surrounding the inner workings of large
>> language models (LLMs) stems from their vast size, complex training
>> methods, hard-to-predict behaviors, and elusive interpretability.
>
>> MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL)
>> researchers recently peered into the proverbial magnifying glass to
>> examine how LLMs fare with variations of different tasks, revealing
>> intriguing insights into the interplay between memorization and
>> reasoning skills. It turns out that their reasoning abilities are
>> often overestimated.
>
>> The study compared "default tasks," the common tasks a model is
>> trained and tested on, with "counterfactual scenarios," hypothetical
>> situations deviating from default conditions -- which models like
>> GPT-4 and Claude can usually be expected to cope with. The researchers
>> developed some tests outside the models' comfort zones by tweaking
>> existing tasks instead of creating entirely new ones. They used a
>> variety of datasets and benchmarks specifically tailored to different
>> aspects of the models' capabilities for things like arithmetic, chess,
>> evaluating code, answering logical questions, etc...
>
>> "We've uncovered a fascinating aspect of large language models: they
>> excel in familiar scenarios, almost like a well-worn path, but
>> struggle when the terrain gets unfamiliar. This insight is crucial as
>> we strive to enhance these models' adaptability and broaden their
>> application horizons," says Zhaofeng Wu, an MIT PhD student in
>> electrical engineering and computer science, CSAIL affiliate, and the
>> lead author on a new paper about the research. "As AI is becoming
>> increasingly ubiquitous in our society, it must reliably handle
>> diverse scenarios, whether familiar or not. We hope these insights
>> will one day inform the design of future LLMs with improved
>> robustness."
>
> Here's the technical article
>
> Reasoning or Reciting? Exploring the Capabilities and Limitations of
> Language Models Through Counterfactual Tasks
> Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin
> Wang, Najoung Kim, Jacob Andreas, Yoon Kim
>
> Comments: NAACL 2024
> Subjects: Computation and Language (cs.CL); Artificial Intelligence
> (cs.AI)
> Cite as: arXiv:2307.02477 [cs.CL]
> (or arXiv:2307.02477v3 [cs.CL] for this version)
>
> https://doi.org/10.48550/arXiv.2307.02477
>
> also available as a PDF through arXiv
> https://arxiv.org/pdf/2307.02477
>
> a couple of interesting excerpts from the paper
>
>> Abstract
>> The impressive performance of recent language models across a wide
>> range of tasks suggests that they possess a degree of abstract
>> reasoning skills. Are these skills general and transferable, or
>> specialized to specific tasks seen during pretraining?
>> To disentangle these effects, we propose an evaluation framework based
>> on "counterfactual" task variants that deviate from the default
>> assumptions underlying standard tasks. Across a suite of 11 tasks, we
>> observe nontrivial performance on the counterfactual variants,
>> but nevertheless find that performance substantially and consistently
>> degrades compared to the default conditions.
>> This suggests that while current LMs may possess abstract task-solving
>> skills to an extent, they often also rely on narrow, non-transferable
>> procedures for task-solving. These results motivate a more careful
>> interpretation of language model performance that teases apart
>> these aspects of behavior...
>
>> 9 Conclusion
>
>> Through our counterfactual evaluation on 11 tasks, we identified
>> consistent and substantial degradation of LM performance under
>> counterfactual conditions. We attribute this gap to overfitting to the
>> default task variants, and thus encourage future LM analyses to
>> explicitly consider abstract task ability as detached from observed
>> task performance, especially when these evaluated task variants might
>> exist in abundance in the LM pretraining corpora.
>
>> Furthermore, insofar as this degradation is a result of the LMs' being
>> trained only on surface form text, it would also be interesting future
>> work to see if more grounded LMs (grounded in the "real" world, or
>> some semantic representation, etc.) are more robust to task
>> variations.
>
> In other words --- When it comes to the ability of LLMs to actually
> reason --- Caveat Emptor might be the first order assessment
>
> Ted
>
> PS: maybe we should get a talk by one of the authors?
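For anyone who wants to see what a counterfactual task variant looks like in
practice, here is a rough Python sketch, mine and not the authors' code, in
the spirit of the paper's arithmetic task: the same addition problems are
posed under the default assumption (base 10) and a counterfactual one (base
9), and the model's answers are scored against a ground-truth implementation.
ask_model is a placeholder for whatever LLM interface you have available.

# Rough sketch (not the authors' code) of a default vs. counterfactual
# arithmetic check: the same digit strings are interpreted in base 10
# (default) and base 9 (counterfactual), and the model is scored against
# a ground-truth implementation in each setting.

def to_base(n, base):
    """Render a non-negative integer as a numeral in the given base (2-10)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        n, r = divmod(n, base)
        digits.append(str(r))
    return "".join(reversed(digits))

def reference_sum(a, b, base):
    """Ground-truth sum of two numerals written in the given base."""
    return to_base(int(a, base) + int(b, base), base)

def make_prompt(a, b, base):
    return ("Assume all numbers are written in base {}. "
            "What is {} + {}? Answer with the numeral only.".format(base, a, b))

def ask_model(prompt):
    """Placeholder: substitute a call to whatever LLM you are testing."""
    raise NotImplementedError

def accuracy(pairs, base):
    """Fraction of addition problems answered correctly in this base."""
    correct = 0
    for a, b in pairs:
        answer = ask_model(make_prompt(a, b, base)).strip()
        if answer == reference_sum(a, b, base):
            correct += 1
    return correct / len(pairs)

# Comparing accuracy(pairs, 10) with accuracy(pairs, 9) on the same digit
# strings mirrors the paper's default-vs-counterfactual gap for arithmetic.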