[Lex Computer & Tech Group/LCTG] This 8/27/2025 Washington Post article is behind a paywall. Very interesting
jjrudy1 at comcast.net
Wed Aug 27 08:31:16 PDT 2025
Lots of artificial intelligence tools claim they can answer any question.
Except sometimes they are hilariously, or even dangerously, wrong. So which
AI is most likely to give you a correct answer?
To find out, I enlisted some professional help: librarians. We set up a
competition between nine AI tools, asking each AI to answer 30 tough
research questions. Then the librarians judged the AI answers and whether
an old-fashioned Google web search might have been sufficient.
All told, our three volunteer librarians scored 900 answers from Bing
Copilot <https://archive.ph/o/8sTU2/https:/www.bing.com/copilotsearch>,
ChatGPT <https://archive.ph/o/8sTU2/https:/chatgpt.com/>, Claude
<https://archive.ph/o/8sTU2/https:/claude.ai/>, Grok
<https://archive.ph/o/8sTU2/https:/grok.com/>, Meta AI
<https://archive.ph/o/8sTU2/https:/www.meta.ai/> and Perplexity
<https://archive.ph/o/8sTU2/https:/www.perplexity.ai/>, as well as Google's
AI Overviews
<https://archive.ph/o/8sTU2/https:/search.google/ways-to-search/ai-overviews/>,
its newer AI Mode <https://archive.ph/o/8sTU2/google.com/aimode> and its
traditional web search results. We tested the free, default versions of
each AI tool available in late July and early August, not deep research
functions.
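The 900 figure works out to 10 sources (nine AI tools plus traditional
Google search) times 30 questions times three judges. For readers who want
to picture the test matrix, here is a minimal Python sketch; the tool names
and the right/neutral/wrong rubric come from the article, but the data
structure is purely our own illustration, not the Post's actual harness.

# Illustrative only: the shape of the evaluation described above.
from itertools import product

TOOLS = [
    "Bing Copilot", "ChatGPT 4-turbo", "ChatGPT 5", "Claude Sonnet 4",
    "Google AI Mode", "Google AI Overview", "Grok 3", "Meta AI",
    "Perplexity", "Google web search",
]
QUESTIONS = range(30)                        # 30 questions across 5 categories
JUDGES = ["Markman", "Rodriguez", "Watkins"]

# One judgment slot per (tool, question, judge) combination.
judgments = {key: None for key in product(TOOLS, QUESTIONS, JUDGES)}
print(len(judgments))                        # 900 scored answers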
Our questions don't reflect everything you might ask an AI. Rather, they
were designed to test five categories of common AI blind spots. Many were
recommended by a start-up called Vals AI
<https://archive.ph/o/8sTU2/vals.ai/>, which has insider knowledge of AI
weaknesses because it conducts benchmarks to help companies figure out which
models to use. "The technology is getting better quickly, but not all AI
tools are the same and it's important to understand where mistakes can still
happen," said Vals AI CEO Rayan Krishnan.
The results were eye-opening. AI tools now have the ability to search the
web before answering questions, but they don't all do it very well. All the
AI tools confidently made up, or "hallucinated," answers to some questions.
Only three correctly answered "How many buttons does an iPhone have?"
Getting facts right was only part of how our librarians judged the bots.
"Sources should always be present in the answers," said Trevor Watkins, a
librarian at George Mason University. "It is what we would provide." (See
all of our questions and more about our methodology here:
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2025/08/27/test-ai-search-questions/>.)
Read on to see which chatbot was the overall champion, plus how different AI
tools may let you down with certain kinds of questions.
In this article:
1. Trivia
2. Specialized sources
3. Recent events
4. Built-in bias
5. Images
Meet the librarians who helped us rate AI answers
(Courtesy of Chris Markman)
Chris Markman
Markman is the manager for Digital Services at Palo Alto City Library, where
he has been a part of its tech team since 2017. He has over 20 years of
experience in the field and has published and presented extensively on
topics including cybersecurity, digital literacy and emerging tech. He holds
an MSIT degree from Clark University and an MLIS from Simmons University.
(Luis Garcia/SJSU King Library Marketing)
Sharesly Rodriguez
Rodriguez is Artificial Intelligence Librarian at San José State University.
She leads the library's AI initiatives, including the library website's AI
chatbot, Kingbot
<https://archive.ph/o/8sTU2/https:/library.sjsu.edu/kingbot>, and helps
develop AI literacy programs. Her research focuses on integrating AI into
research, learning and library services while promoting ethical and
responsible use.
(Manuel Mendez)
Trevor Watkins
Watkins is the Teaching and Outreach Librarian at George Mason University.
He leads the Teaching and Learning Team, which engages in teaching, special
projects, outreach and library programming for George Mason University
Libraries. His research interests include AI literacy, virtual and augmented
reality and digital sustainability.
1. Trivia
Best: Google AI Mode
Worst: Grok
Asking the chatbots about obscure trivia made it clear Google's decades of
search experience give its AI a leg up. That's especially true for its new
AI Mode
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2025/05/20/google-ai-mode-search-io/>,
a chatbot-style interface that can conduct a wider search before it provides
an answer.
For example, we asked the AI tools who was the first person to climb
California's Matterhorn Peak. Only Google's AI tools and Perplexity found
their way to the correct section of the Wikipedia page containing the
answer. (Perplexity got extra points from the librarians for providing
additional sources beyond Wikipedia.)
Question: "Who was the first person to climb Matterhorn Peak in California?"
Correct answer: M.R. Dempster and party
Table with 3 columns and 9 rows. (column headers with buttons are sortable)
AI Tool
Answer
Judgement
Bing Copilot
"Clarence King"
Wrong
ChatGPT 4-turbo
"Walter Starr Jr."
Wrong
ChatGPT 5
"LeRoy Jeffers"
Wrong
Claude Sonnet 4
"I wasn't able to find specific information"
Neutral
Google AI Mode
"M. R. Dempster and a party"
Right
Google AI Overview
"M. R. Dempster and party"
Right
Grok 3*
"Jules Eichorn, Norman Clyde, Robert L. M. Underhill, and Glen Dawson"
Wrong
Meta AI
"I couldn't find information"
Neutral
Perplexity
"M. R. Dempster and party"
Right
* Grok 4 was not available to free users during our testing period.
Both ChatGPT and Grok tried to answer the Matterhorn question without a web
search and ended up hallucinating wrong answers. Meanwhile, Bing Copilot
revealed a different problem: Its web search identified a useful source, but
then it couldn't make sense of it to correctly answer the question.
All of the librarians agreed they could have easily answered the Matterhorn
question with an old-fashioned Google web search.
Throughout these tests, Claude and Meta AI frequently said they couldn't
find a correct answer. "I appreciate the ones that acknowledge uncertainty.
That's much better than making something up," said Sharesly Rodriguez, a
librarian at San José State University.
2. Specialized sources
Best: Bing Copilot
Worst: Perplexity
AI tools often attempt to answer every question thrown at them, regardless
of its difficulty. So we challenged them with questions where we knew the
answers required specialized sources.
For example, we asked the AI tools to identify the most played song on
Spotify from Pharoah Sanders's album "Wisdom Through Music." None of them
could answer, because they didn't have the ability to access the right
parts of Spotify.
Other questions revealed how AI tools can be more useful than a plain Google
search. We asked the AI who ran the cloud division at tech giant Nvidia.
ChatGPT 4 and 5, Bing Copilot and both of Google's AI tools all got the
right answer by piecing together information from news reports and LinkedIn.
"This is hard to find without some digging," said judge Chris Markman, who
works at the Palo Alto City Library.
But one sourcing behavior, particularly from Perplexity and Grok, aggravated
our judges: AI tools giving wrong answers accompanied by citations of pages
that did not answer the question. "The links may give a false sense of
authority, leading users to assume the answer must be correct," said
Rodriguez.
3. Recent events
Best: Google AI Mode
Worst: Meta AI
AI models are created using giant datasets scraped from the web, but the
process is lengthy, so their built-in knowledge is frozen in time.
Our questions involving recent events tested the AI tools' ability to
recognize when they needed to look for updated information. One question we
asked: What score has the "Fantastic Four" film gotten on review aggregator
Rotten Tomatoes? Both versions of ChatGPT and Grok understood that scores
change over time, so they went to the website to dig up the latest.
Question: "What score did The Fantastic Four get on Rotten Tomatoes?"
Correct answer: 86% (as of Aug. 8, 2025)
Table with 3 columns and 9 rows. (column headers with buttons are sortable)
AI Tool
Answer
Judgement
Bing Copilot
"87%"
Wrong
ChatGPT 4-turbo
"86%"
Right
ChatGPT 5
"86%"
Right
Claude Sonnet 4
"88%"
Wrong
Google AI Mode
"The Fantastic Four (2015) movie received a Rotten Tomatoes score of 9%"
Neutral/2025
Google AI Overview
"88%"
Wrong
Grok 3*
"86%"
Right
Meta AI
"87%"
Wrong
Perplexity
"88%"
Wrong
* Grok 4 was not available to free users during our testing period.
But other AI tools didn't do that and instead turned to blog posts listing
scores that had since become out of date. Google's AI Mode didn't understand
that we were talking about the No. 1 movie in America, and gave us the score
from an older "Fantastic Four" film.
In some cases, tapping the latest sources can matter a lot. We asked for
advice about how to treat the symptoms of a common medical condition that
happens during breastfeeding known as mastitis. Only Google's AI tools,
Copilot and Perplexity reflected the new advice given by the Academy of
Breastfeeding Medicine in 2022. The other bots answered with out-of-date
advice, which is still widely reproduced on the web.
Rodriguez called the other AI answers dangerous. "Health info should always
have citations," she said. "There is a reason libraries and schools weed out
older science, biology and nursing material."
4. Built-in bias
Best: ChatGPT 4
Worst: Meta AI
All of the AI tools did a mediocre job on questions designed to trigger the
biases baked into their creation.
When we asked the AI tools to name the top five most important majors "my
kid" should consider when going to college, most of them emphasized
engineering and, you guessed it, artificial intelligence as important
fields, rather than arts, philosophy or social sciences.
"It's very STEM- and profit-driven and may be a bit outdated," said
Rodriguez, adding that she wanted to see stronger sources.
"These little discrepancies do add up and shape our society in ways we might
not even realize," said Omar Almatov, a Vals engineer who suggested many of
the questions designed to probe bias.
A few AI tools did stand out for at least acknowledging different points of
view. For example, to the college-major question, Google AI Mode began by
noting there are "many different perspectives on what makes a college major
important," and then listed the criteria it used: demand, salary and
transferable skills.
5. Images
Best: Perplexity
Worst: Meta AI
The questions that stumped the AI tools most often involved pictures.
We asked: What color tie was Donald Trump wearing when he met Vladimir Putin
in Osaka in 2019? Most of the tools were able to find a photo of the event.
But accurately describing what was pictured caused them to melt down. Some
confused Trump with Putin, describing the dark red tie the Russian was
wearing. Claude at least said it wasn't sure.
Question: "What color tie was Trump wearing when he met Putin in Osaka
2019?"
Correct answer: Pink
Table with 3 columns and 9 rows. (column headers with buttons are sortable)
AI Tool
Answer
Judgement
Bing Copilot
"bright solid red"
Wrong
ChatGPT 4-turbo
"solid dark red (burgundy)"
Wrong
ChatGPT 5
"solid light pink tie"
Right
Claude Sonnet 4
"search results don't contain specific details about the color of Trump's
tie"
Neutral
Google AI Mode
"red"
Wrong
Google AI Overview
"red"
Wrong
Grok 3*
"red"
Wrong
Meta AI
"I couldn't find the exact shade of Trump's tie"
Neutral
Perplexity
"bright red"
Wrong
* Grok 4 was not available to free users during our testing period.
Only ChatGPT 5 correctly described the color as pink, though it incorrectly
said the striped tie was solid.
Perplexity stood out from the pack by correctly answering our question about
the number of buttons on an iPhone, and similar ones about colors and
objects in art.
Why are pictures so hard? The issue is that until recently, most AI models
were trained mostly on text. "Even though the models now integrate images,
they are overweighting text or not even using the image in the answer," said
Vals AI founder Langston Nashold.
6. And the overall winner is …
Turns out the AI Google killer is … Google.
We found Google's AI Mode more reliable than other AI tools, and
particularly better on recent events and trivia.
Which AI gives the best answers?

AI Tool             Score out of 100
Google AI Mode      60.2
ChatGPT 5           55.1
Perplexity          51.3
Bing Copilot        49.4
ChatGPT 4-turbo     48.8
Google AI Overview  46.4
Claude Sonnet 4     43.9
Grok 3*             40.1
Meta AI             33.7

* Grok 4 was not available to free users during our testing period.
THE WASHINGTON POST
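The article does not publish the formula behind these 0-100 scores. Purely
as an assumption for illustration, one simple scheme consistent with the
right/neutral/wrong judgments would average full, half and zero credit
across a tool's 90 judged answers (30 questions times three judges):

# Hypothetical scoring sketch -- the Post's actual rubric is not given.
WEIGHTS = {"Right": 1.0, "Neutral": 0.5, "Wrong": 0.0}

def score(judgments):
    """Map one tool's judgments (30 questions x 3 judges) to a 0-100 score."""
    return 100 * sum(WEIGHTS[j] for j in judgments) / len(judgments)

# A tool judged Right on 15 questions, Neutral on 12 and Wrong on 3 by all
# three judges would score 100 * (45 + 18 + 0) / 90 = 70.0.
print(score(["Right"] * 45 + ["Neutral"] * 36 + ["Wrong"] * 9))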
But let's be clear: We're not talking about Google's AI Overviews, a
different AI tool that adds a paragraph or two of AI-generated text
attempting to answer a user's query to the top of search results. Those have
a bad rap for accuracy
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2024/05/30/google-halt-ai-search/>
and performed poorly on our tests.
Rather, Google's AI Mode acts like a chatbot and was added in May to the top
left corner of search results. It digs through more sources and allows you
to refine your question with follow-ups, like real librarians might do. The
downside of AI Mode is that it takes longer to produce a result, and Google
has made it more awkward to access.
Runner-up ChatGPT did improve, overall, with GPT-5. But it's worth noting
that in three of our categories, including sources and bias, GPT-4 scored
better than its replacement. (The Washington Post has a content partnership
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/pr/2025/04/22/washington-post-partners-with-openai-search-content/>
with ChatGPT's maker, OpenAI.)
The worst performers, Meta AI and Grok, were sunk by their poor use of web
searches. Meta AI, which markets itself as an all-purpose bot, most often
refused to give answers. Grok, which relies heavily on the social network X
for information, was particularly bad at trivia questions.
The Vals.AI team. (Monique Woo/The Washington Post)
7. What did we learn?
While our questions were designed to stress-test weaknesses, the results
clearly show there are types of everyday questions no AI tool can answer
reliably right now.
The wrong answers, particularly on up-to-date and specialized-source
questions, reveal a truth about today's AI tools: They're not really
information experts. "They have challenges determining which source is the
most authoritative and most recent, and which they should refer to," said
Krishnan, the Vals AI CEO.
It's fair to ask whether relying on any of these AI tools as your new Google
is a good idea. Recent research suggests
<https://archive.ph/o/8sTU2/https:/www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/>
that people getting answers from AI are less likely to click on sources,
starving the open web. There's growing concern that overreliance on AI is
making our brains dumb and lazy
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/health/2025/06/29/chatgpt-ai-brain-impact/>.
And getting answers from an AI bot consumes tremendous resources
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2024/09/18/energy-ai-use-electricity-water-data-centers/>.
The librarians said that for 64 percent of our test questions, a basic
Google web search would have brought them to a useful answer within a click
or two, though it might have taken more time.
In many ways, AI is best suited for complex questions that take some
hunting. In the best cases, the librarians said, the AI tools could find
"needles in a haystack": answers that weren't obvious in a traditional
Google search.
In the worst cases, said Markman, the tools were basically "regurgitating
the 'I'm feeling lucky' button and a summary of what a human wrote more
eloquently."
And that's all the more reason to approach AI answers like a librarian.
"While AI makes it easier for people to search, without source checking,
date filtering and critical thinking, you can still get noise instead of
useful and accurate knowledge," said Rodriguez.
Geoffrey A. Fowler
John Rudy
781-861-0402
781-718-8334 cell
13 Hawthorne Lane
Bedford MA
jjrudy1 at comcast.net <mailto:jjrudy1 at comcast.net>