[Lex Computer & Tech Group/LCTG] FYI: An interesting assessment of LLM and its limitations

Ted Kochanski tedpkphd at gmail.com
Mon Jul 15 08:46:52 PDT 2024


All,

Ultimately this is just a furthering of the process, which I [and perhaps
others] identified a few decades ago, of debasing "published materials"

The argument is simple and logical -- it comes down to the relative cost
of publishing [Information versus Noise]

1) Noise is a fundamental property of the Universe -- any object at a
temperature above Absolute Zero is an emitter of Noise
2) Information is a challenge to Noise and competes for the same bandwidth
-- in whatever medium
3) S/N -- the signal-to-noise ratio -- determines whether Information can
be transmitted and reliably "detected" [see the numeric sketch below]

So:

   1. Take Ancient Greece as a proto-paradigm:
      1. anything to be "published" effectively had to be "carved into
      stone" -- e.g. the Rosetta Stone
      2. hence mostly only thoughtful Information was "published" --
      "Noise" was mostly confined to graffiti scrawled on walls [e.g.
      Pompeii]
   2. Once paper was easily available, the effort required to "publish"
   decreased, and various "noisy" material became more public
      1. however, documents were still written and copied by hand, or
      printed from hand-drawn originals [e.g. carved wooden graphic or
      text pages]
      2. and "libraries," the primary disseminators of Information, still
      chained their books down
      3. so it was mostly "Good Information" that was publicly available
   3. Once printing began with moveable type, the cost of printing was
   driven by an assembly operation using pre-cast type
      1. more Noise began to be disseminated
      2. this process continued until the first weekly and then daily
      newspapers began to be published and sold on street corners
         1. with gossip and hearsay now in print and easily available
         [e.g. Paul Revere's graphic impression of the "Boston Massacre"]
      3. the growth of Noise/Signal accelerated throughout the 19th and
      early 20th centuries as the cost of disseminating material decreased
         1. the amount of "Noise" within easy public reach increased
         [even in libraries]
         2. paperback books and scandal sheets such as the National
         Enquirer proliferated
   4. Once the WWW was a few years old [circa 1995], the "Signal to Noise"
   of published material began decreasing exponentially
      1. the cost of dissemination dropped precipitously as the advancing
      technology of storage and transmission made the cost of everything
      but production asymptotically approach zero
   5. Now Generative AI has drastically reduced the cost of production as
   well, including that of "Deep Fakes"

The challenge for the future will be finding the "Signal" floating on an
immense sea of old Noise and amid a torrential rain of new Noise


Ted

On Mon, Jul 15, 2024 at 9:52 AM Mitchell I. Wolfe <mwolfe at vinebrook.com>
wrote:

> On Gordon Deal's broadcast this morning, there was an interview with Rob
> Enderle about the dangerous AI focus on speed vs. quality. A textual
> auto-transcription excerpt is below:
>
> "
> *This Morning With Gordon Deal*
> *This Morning with Gordon Deal July 15, 2024*
>
> *Published: 7/15/2024*
>
> *Artificial intelligence is moving at lightning speed, but the way it's
> being implemented could be dangerous. More now from our own Gordon Deal. *
>
> *Speaking with Rob Enderle; he's the founder and principal analyst at the
> Enderle Group. He's written a piece for Datanami called Why the Current
> Approach for AI Is Excessively Dangerous, or, as you point out, talking
> about how the focus here is on productivity versus quality. Give an
> example. What have you seen? *
>
> *Well, it's really on speed versus quality. If you look at everything
> we're asking the AIs to do, it really has to do with productivity, speed
> -- in other words, how much stuff you can turn out. So when you ask it to
> write an article, it writes the article. In fact, we saw that with the AI
> used for legal briefs. You ask it to write a legal brief and, in a matter
> of seconds, it turns out a legal brief. The problem is the stuff it's
> creating is very low quality; AIs hallucinate. And so the attorneys that
> brought forward the legal brief, I believe they were disbarred as a
> result of doing that, because they had a lot of citations to events that
> never occurred. *
>
> *And judges don't have a sense of humor when it comes to falsifying
> documentation. So, as you would understand, these things are being used
> to create stuff really fast, but not being used to assure the quality of
> the things they create. *
>
> *I thought it was interesting -- you said, I guess, that on the human
> side, we still have to perform tasks. *
>
> *Yeah, I mean, we still do. If you look at the balance here, ideally
> you'd want AI to do things that you don't like doing. Let's go to coding.
> When I was a coder myself, the things I didn't like doing: I didn't
> really like planning it out, I didn't like error checking, I didn't like
> doing my own quality control editing -- particularly when we were back on
> punch cards, it was just annoying. And it's the same thing with coders
> today: they don't like commenting their code, and they don't like doing
> quality validation. *
>
> *What they like doing is writing code -- and what did we get the AIs to
> do? We got the AIs to write the code. In other words, to do the thing
> that coders like doing, but not the things that coders don't like doing.
> In fact, what we ought to be doing with AI initially is focusing it on
> things that people don't like doing, as opposed to things that people do
> like doing, and making sure that the stuff being turned out is of
> acceptable quality, not extremely poor quality, which unfortunately has
> been the outcome. *
>
> *Mm. We're speaking with Rob Enderle; he's the founder and principal
> analyst at the Enderle Group. He's written a piece for datanami.com
> called Why the Current Approach for AI Is Excessively Dangerous. What do
> you want to see here? What needs to change now? *
>
> *I'd like to see a much tighter focus on quality, to really assure the
> outcome. As I said, the current approach is resulting in a lot of
> low-quality output, a lot of very dangerous output. Take the AI in Google
> search: I think there was a search done the other day asking how do you
> keep the cheese from sliding off of pizza, and Google's AI said, use
> glue. Not a good response -- and depending on the glue that's used, that
> stuff could be toxic. So it's just simply not good advice. And people are
> using these AI tools to search for answers -- medical answers, culinary
> answers, and the rest. *
>
> *And they're getting answers that they really shouldn't use, because
> they're dangerous, if not deadly. *
>
> *Tech analyst Rob Enderle, with our own Gordon Deal *
>
>
>
> "
>
> On 2024-07-12 16:04, Ted Kochanski via LCTG wrote:
>
>   All,
>
> I just came across an article in MIT News about work assessing how
> Large Language Models [e.g. ChatGPT] deal with problems outside of their
> training
>
> Here's the MIT News article
>
> Reasoning skills of large language models are often overestimated
> New CSAIL research highlights how LLMs excel in familiar scenarios but
> struggle in novel ones, questioning their true reasoning abilities versus
> reliance on memorization.
> Rachel Gordon | MIT CSAIL
> Publication Date: July 11, 2024
>
> https://news.mit.edu/2024/reasoning-skills-large-language-models-often-overestimated-0711
>
>
> When it comes to artificial intelligence, appearances can be deceiving.
> The mystery surrounding the inner workings of large language models (LLMs)
> stems from their vast size, complex training methods, hard-to-predict
> behaviors, and elusive interpretability.
>
>
>
> MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL)
> researchers recently peered into the proverbial magnifying glass to examine
> how LLMs fare with variations of different tasks, revealing intriguing
> insights into the interplay between memorization and reasoning skills. It
> turns out that their reasoning abilities are often overestimated.
>
>
>
> The study compared "default tasks," the common tasks a model is trained
> and tested on, with "counterfactual scenarios," hypothetical situations
> deviating from default conditions — which models like GPT-4 and Claude can
> usually be expected to cope with. The researchers developed some tests
> outside the models' comfort zones by tweaking existing tasks instead of
> creating entirely new ones. They used a variety of datasets and benchmarks
> specifically tailored to different aspects of the models' capabilities for
> things like arithmetic, chess, evaluating code, answering logical
> questions, etc...
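>
> A minimal sketch of the idea [an illustration, not the authors' actual
> harness]: one such arithmetic variant poses the same addition task in the
> default base 10 and in an unfamiliar base such as 9. The toy "solver"
> below has only "memorized" the base-10 procedure -- a stand-in for a
> model reciting its default training distribution -- so it aces the
> default task and degrades sharply on the counterfactual one:
>
> import random
>
> def to_base(n: int, base: int) -> str:
>     """Render a non-negative integer in the given base (bases <= 10)."""
>     digits = []
>     while n:
>         digits.append(str(n % base))
>         n //= base
>     return "".join(reversed(digits)) or "0"
>
> def eval_solver(solver, base: int, n_items: int = 1000) -> float:
>     """Exact-match accuracy on random addition problems posed in `base`."""
>     rng = random.Random(0)
>     correct = 0
>     for _ in range(n_items):
>         a, b = rng.randint(1, 200), rng.randint(1, 200)
>         answer = solver(to_base(a, base), to_base(b, base), base)
>         correct += answer == to_base(a + b, base)
>     return correct / n_items
>
> def memorized_solver(x: str, y: str, base: int) -> str:
>     # Ignores the stated base and applies the familiar base-10
>     # procedure, however the problem was posed.
>     return str(int(x) + int(y))
>
> print("default task [base 10]:      ", eval_solver(memorized_solver, 10))
> print("counterfactual task [base 9]:", eval_solver(memorized_solver, 9))
>
> The paper's headline result has the same shape: nontrivial but
> substantially and consistently degraded performance on the counterfactual
> variants.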
>
>
>
> "We've uncovered a fascinating aspect of large language models: they excel
> in familiar scenarios, almost like a well-worn path, but struggle when the
> terrain gets unfamiliar. This insight is crucial as we strive to enhance
> these models' adaptability and broaden their application horizons," says
> Zhaofeng Wu, an MIT PhD student in electrical engineering and computer
> science, CSAIL affiliate, and the lead author on a new paper about the
> research. "As AI is becoming increasingly ubiquitous in our society, it
> must reliably handle diverse scenarios, whether familiar or not. We hope
> these insights will one day inform the design of future LLMs with improved
> robustness."
>
>
> Here's the technical article
>
> Reasoning or Reciting? Exploring the Capabilities and Limitations of
> Language Models Through Counterfactual Tasks
> Zhaofeng Wu, Linlu Qiu, Alexis Ross, Ekin Akyürek, Boyuan Chen, Bailin
> Wang, Najoung Kim, Jacob Andreas, Yoon Kim
>
> The impressive performance of recent language models across a wide range
> of tasks suggests that they possess a degree of abstract reasoning skills.
> Are these skills general and transferable, or specialized to specific tasks
> seen during pretraining? To disentangle these effects, we propose an
> evaluation framework based on "counterfactual" task variants that deviate
> from the default assumptions underlying standard tasks. Across a suite of
> 11 tasks, we observe nontrivial performance on the counterfactual variants,
> but nevertheless find that performance substantially and consistently
> degrades compared to the default conditions. This suggests that while
> current LMs may possess abstract task-solving skills to an extent, they
> often also rely on narrow, non-transferable procedures for task-solving.
> These results motivate a more careful interpretation of language model
> performance that teases apart these aspects of behavior.
> Comments: NAACL 2024
> Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
> Cite as: arXiv:2307.02477 [cs.CL]
>   (or arXiv:2307.02477v3 [cs.CL] for this version)
>
> https://doi.org/10.48550/arXiv.2307.02477
>
> also available as a PDF through arXiv
>  https://arxiv.org/pdf/2307.02477
>
> an interesting excerpt from the paper's Conclusion
>
>
>
> *9 Conclusion*
>
>
>
> Through our counterfactual evaluation on 11 tasks, we identified
> consistent and substantial degradation of LM performance under
> counterfactual conditions. We attribute this gap to overfitting to the
> default task variants, and thus encourage future LM analyses to explicitly
> consider abstract task ability as detached from observed task performance,
> especially when these evaluated task variants might exist in abundance in
> the LM pretraining corpora.
>
>
>
> Furthermore, insofar as this degradation is a result of the LMs' being
> trained only on surface form text, it would also be interesting future work
> to see if more grounded LMs (grounded in the "real" world, or some semantic
> representation, etc.) are more robust to task variations.
>
>
> In other words --- when it comes to the ability of LLMs to actually
> reason --- Caveat Emptor might be the first-order assessment
>
> Ted
>
> PS: maybe we should get a talk by one of the authors?
>
>

