<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>
<p>On Gordon Deal's broadcast this morning, there was an interview with Rob Enderle about the dangerous AI focus on speed vs. quality. An excerpt from the automated transcription is below:</p>
<p>"</p>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>This Morning With Gordon Deal</em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>Published: July 15, 2024</em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em><strong>Artificial intelligence is moving at lightning speed, but the way it's being implemented could be dangerous.</strong> More now from our own Gordon Deal. </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>Speaking with Rob Enderle; he's the founder and principal analyst at the Enderle Group. He's written a piece for Datanami called Why the Current Approach for AI Is Excessively Dangerous, or, as you point out, talking about how the focus here is on productivity versus quality. Give an example. What have you seen? </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>Well, it's really on speed versus quality. If you look at everything we're asking the AIs to do, it really has to do with productivity, with speed. In other words, how much stuff you can turn out. So when you ask it to write an article, it writes the article. In fact, we saw that with the AI used for legal briefs. You ask it to write a legal brief, and in a matter of seconds it turns out a legal brief. <strong>The problem is the stuff it's creating is very low quality; AIs hallucinate. And so the attorneys that brought forward the legal brief, I believe they were disbarred as a result of doing that, because it had a lot of citations to events that never occurred. </strong></em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>And judges don't have a sense of humor when it comes to falsifying documentation. So, as you would understand, these things are being used, you know, to create stuff really fast, but not being used to assure the quality of the things they create. </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>I thought it was interesting, you said, I guess, that on the human side we still have to perform tasks. </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>Yeah, I mean, we still do. If you look at the balance here, ideally you'd want AI to do the things that you don't like doing. Let's go to coding. When I was a coder myself, the things I didn't like doing: I didn't really like planning it out, I didn't like error checking, I didn't like doing my own quality control editing, particularly when we go back to punch cards. It was just annoying. And that's the same thing with coders today: they don't like commenting their code, they don't like doing quality validation. </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>What they like doing is writing code. And what did we get the AIs to do? We got the AIs to write the code. In other words, to do the thing that coders like doing, but <strong>not the things that coders don't like doing. In fact, what we ought to be doing with AI initially is focusing it on the things that people don't like doing, as opposed to the things that people do like doing, and making sure that the stuff being turned out is of acceptable quality, not extremely poor quality, which unfortunately has been the outcome. </strong></em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>We're speaking with Rob Enderle; he's the founder and principal analyst at the Enderle Group. He's written a piece for datanami.com called Why the Current Approach for AI Is Excessively Dangerous. What do you want to see here? What needs to change now? </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>I'd like to see a much tighter focus on quality, to really assure the outcome. As I said, the current approach is <strong>resulting in a lot of low-quality output, a lot of very dangerous output. </strong>Take the AI used in Google search: there was a search done the other day where they were asking, <strong>how do you keep the cheese from sliding off of pizza? And Google's AI said, use glue</strong>. Not a good response, and depending on the glue that's used, that stuff could be toxic. So it's just simply not good advice. And people are using these AI tools to search for answers: medical answers, culinary answers, and the rest. </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>And they're <strong>getting answers that they really shouldn't use because they're dangerous,</strong> if not deadly. </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em> </em></span></div>
<div style="padding-left: 40px;"><span style="color: #236fa1;"><em>Tech analyst Rob Enderle, with our own Gordon Deal. </em></span></div>
<p style="padding-left: 40px;"> </p>
<p>"</p>
<p id="reply-intro">On 2024-07-12 16:04, Ted Kochanski via LCTG wrote:</p>
<blockquote type="cite" style="padding: 0 0.4em; border-left: #1010ff 2px solid; margin: 0">
<div id="replybody1">
<div dir="ltr"> All,
<div> </div>
<div>I just came across an article in MIT News about work on assessing how large language models [e.g. ChatGPT] deal with problems outside of their training.</div>
<div> </div>
<div>Here's the MIT News article</div>
<div> </div>
<div><span style="font-size: large;">Reasoning skills of large language models are often overestimated</span><br />New CSAIL research highlights how LLMs excel in familiar scenarios but struggle in novel ones, questioning their true reasoning abilities versus reliance on memorization.<br />Rachel Gordon | MIT CSAIL<br />Publication Date: July 11, 2024</div>
<div><a href="https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711" target="_blank" rel="noopener noreferrer">https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711</a></div>
<div> </div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.</blockquote>
<div> </div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers recently peered into the proverbial magnifying glass to examine how LLMs fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.</blockquote>
<div> </div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">The study compared "default tasks," the common tasks a model is trained and tested on, with "counterfactual scenarios," hypothetical situations deviating from default conditions — which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed some tests outside the models' comfort zones by tweaking existing tasks instead of creating entirely new ones. They used a variety of datasets and benchmarks specifically tailored to different aspects of the models' capabilities for things like arithmetic, chess, evaluating code, answering logical questions, etc...</blockquote>
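<div>For concreteness, here's a minimal sketch (mine, not the paper's code) of what a default-versus-counterfactual pair might look like for the arithmetic task, assuming the base-change variant the paper describes; the prompt wording and function names are illustrative only:</div>
<pre style="font-family: monospace">def to_base(n: int, base: int) -> str:
    """Render a non-negative integer in the given base."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append("0123456789abcdef"[n % base])
        n //= base
    return "".join(reversed(digits))

def addition_prompt(a: int, b: int, base: int) -> tuple[str, str]:
    """Return (prompt, expected_answer) for a + b carried out in `base`."""
    prompt = (f"You are doing base-{base} addition. "
              f"What is {to_base(a, base)} + {to_base(b, base)}?")
    return prompt, to_base(a + b, base)

# Default task: the variant abundant in pretraining data.
print(addition_prompt(27, 68, base=10))  # answer "95"
# Counterfactual variant: same abstract skill, unfamiliar assumption.
print(addition_prompt(27, 68, base=9))   # answer "115"
</pre>
<div> </div>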
<div> </div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">"We've uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models' adaptability and broaden their application horizons," says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research. "As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness."</blockquote>
<div> </div>
<div>Here's the technical article</div>
<div> </div>
<div><span style="font-size: large;">Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks</span><br />Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin Wang, Najoung Kim, Jacob Andreas, Yoon Kim</div>
<div><br />The impressive performance of recent language models across a wide range of tasks suggests that they possess a degree of abstract reasoning skills. Are these skills general and transferable, or specialized to specific tasks seen during pretraining? To disentangle these effects, we propose an evaluation framework based on "counterfactual" task variants that deviate from the default assumptions underlying standard tasks. Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task-solving skills to an extent, they often also rely on narrow, non-transferable procedures for task-solving. These results motivate a more careful interpretation of language model performance that teases apart these aspects of behavior.<br />Comments: NAACL 2024<br />Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)<br />Cite as: arXiv:2307.02477 [cs.CL]<br /> (or arXiv:2307.02477v3 [cs.CL] for this version)<br /> <br /><a href="https://doi.org/10.48550/arXiv.2307.02477" target="_blank" rel="noopener noreferrer">https://doi.org/10.48550/arXiv.2307.02477</a></div>
<div> </div>
<div>also available as a PDF through arXiv</div>
<div> <a href="https://arxiv.org/pdf/2307.02477" target="_blank" rel="noopener noreferrer">https://arxiv.org/pdf/2307.02477</a></div>
<div> </div>
<div>a couple of interesting excerpts from the paper's conclusion<br />
<div> </div>
</div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;"><strong>9 Conclusion</strong></blockquote>
<div> </div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">Through our counterfactual evaluation on 11 tasks, we identified consistent and substantial degradation of LM performance under counterfactual conditions. We attribute this gap to overfitting to the default task variants, and thus encourage future LM analyses to explicitly consider abstract task ability as detached from observed task performance, especially when these evaluated task variants might exist in abundance in the LM pretraining corpora.</blockquote>
<div> </div>
<blockquote class="v1gmail_quote" style="margin: 0px 0px 0px 0.8ex; border-left: 1px solid #cccccc; padding-left: 1ex;">Furthermore, insofar as this degradation is a result of the LMs' being trained only on surface form text, it would also be interesting future work to see if more grounded LMs (grounded in the "real" world, or some semantic representation, etc.) are more robust to task variations.</blockquote>
<div> </div>
<div>In other words: when it comes to the ability of LLMs to actually reason, caveat emptor might be the first-order assessment.</div>
<div> </div>
<div>Ted</div>
<div> </div>
<div>PS: maybe we should get a talk by one of the authors?</div>
<div> </div>
</div>
</div>
</blockquote>
</body></html>