[Lex Computer & Tech Group/LCTG] This 8/27/2025 Washington Post article is behind a paywall. Very interesting
jjrudy1 at comcast.net
Wed Aug 27 08:31:16 PDT 2025
Lots of artificial intelligence tools claim they can answer any question.
Except sometimes they are hilariously, or even dangerously, wrong. So which
AI is most likely to give you a correct answer?
To find out, I enlisted some professional help: librarians. We set up a
competition between nine AI tools, asking each AI to answer 30 tough
research questions. Then the librarians judged the AI answers and whether
an old-fashioned Google web search might have been sufficient.
All told, our three volunteer librarians scored 900 answers from Bing
Copilot <https://archive.ph/o/8sTU2/https:/www.bing.com/copilotsearch>,
ChatGPT <https://archive.ph/o/8sTU2/https:/chatgpt.com/>, Claude
<https://archive.ph/o/8sTU2/https:/claude.ai/>, Grok
<https://archive.ph/o/8sTU2/https:/grok.com/>, Meta AI
<https://archive.ph/o/8sTU2/https:/www.meta.ai/> and Perplexity
<https://archive.ph/o/8sTU2/https:/www.perplexity.ai/>, as well as Google's
AI Overviews
<https://archive.ph/o/8sTU2/https:/search.google/ways-to-search/ai-overviews/>,
its newer AI Mode <https://archive.ph/o/8sTU2/google.com/aimode> and its
traditional web search results. We tested the free, default versions of
each AI tool available in late July and early August, not deep research
functions.
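The 900 figure works out to 10 sources (nine AI tools plus traditional
Google search) times 30 questions times three judges. For readers who want
to picture the test matrix, here is a minimal Python sketch; the tool names
and the right/neutral/wrong rubric come from the article, but the data
structure is purely our own illustration, not the Post's actual harness.

# Illustrative only: the shape of the evaluation described above.
from itertools import product

TOOLS = [
    "Bing Copilot", "ChatGPT 4-turbo", "ChatGPT 5", "Claude Sonnet 4",
    "Google AI Mode", "Google AI Overview", "Grok 3", "Meta AI",
    "Perplexity", "Google web search",
]
QUESTIONS = range(30)                        # 30 questions across 5 categories
JUDGES = ["Markman", "Rodriguez", "Watkins"]

# One judgment slot per (tool, question, judge) combination.
judgments = {key: None for key in product(TOOLS, QUESTIONS, JUDGES)}
print(len(judgments))                        # 900 scored answers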
Our questions don't reflect everything you might ask an AI. Rather, they
were designed to test five categories of common AI blind spots. Many were
recommended by a start-up called Vals AI
<https://archive.ph/o/8sTU2/vals.ai/>, which has insider knowledge of AI
weaknesses because it conducts benchmarks to help companies figure out which
models to use. "The technology is getting better quickly, but not all AI
tools are the same and it's important to understand where mistakes can still
happen," said Vals AI CEO Rayan Krishnan.
The results were eye-opening. AI tools now have the ability to search the
web before answering questions, but they don't all do it very well. All the
AI tools confidently made up, or "hallucinated," answers to some questions.
Only three correctly answered "How many buttons does an iPhone have?"
Getting facts right was only part of how our librarians judged the bots.
"Sources should always be present in the answers," said Trevor Watkins, a
librarian at George Mason University. "It is what we would provide." (See
all of our questions and more about our methodology here:
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2025/08/27/test-ai-search-questions/>.)
Read on to see which chatbot was the overall champion, plus how different AI
tools may let you down with certain kinds of questions.
In this article:
1. Trivia
2. Specialized sources
3. Recent events
4. Built-in bias
5. Images
Meet the librarians who helped us rate AI answers
(Courtesy of Chris Markman)
Chris Markman
Markman is the manager for Digital Services at Palo Alto City Library, where
he has been a part of its tech team since 2017. He has over 20 years of
experience in the field and has published and presented extensively on
topics including cybersecurity, digital literacy and emerging tech. He holds
an MSIT degree from Clark University and an MLIS from Simmons University.
(Luis Garcia/SJSU King Library Marketing)
Sharesly Rodriguez
Rodriguez is Artificial Intelligence Librarian at San José State University.
She leads the library's AI initiatives, including the library website's AI
chatbot, Kingbot
<https://archive.ph/o/8sTU2/https:/library.sjsu.edu/kingbot>, and helps
develop AI literacy programs. Her research focuses on integrating AI into
research, learning and library services while promoting ethical and
responsible use.
(Manuel Mendez)
Trevor Watkins
Watkins is the Teaching and Outreach Librarian at George Mason University.
He leads the Teaching and Learning Team, which engages in teaching, special
projects, outreach and library programming for George Mason University
Libraries. His research interests include AI literacy, virtual and augmented
reality and digital sustainability.
1. Trivia
Best: Google AI Mode
Worst: Grok
Asking the chatbots about obscure trivia made it clear Google's decades of
search experience give its AI a leg up. That's especially true for its new
AI Mode
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2025/05/20/google-ai-mode-search-io/>,
a chatbot-style interface that can conduct a wider search before it provides
an answer.
For example, we asked the AI tools who was the first person to climb
California's Matterhorn Peak. Only Google's AI tools and Perplexity found
their way to the correct section of the Wikipedia page containing the
answer. (Perplexity got extra points from the librarians for providing
additional sources beyond Wikipedia.)
Question: "Who was the first person to climb Matterhorn Peak in California?"
Correct answer: M.R. Dempster and party
Table with 3 columns and 9 rows. (column headers with buttons are sortable)
AI Tool
Answer
Judgement
Bing Copilot
"Clarence King"
Wrong
ChatGPT 4-turbo
"Walter Starr Jr."
Wrong
ChatGPT 5
"LeRoy Jeffers"
Wrong
Claude Sonnet 4
"I wasn't able to find specific information"
Neutral
Google AI Mode
"M. R. Dempster and a party"
Right
Google AI Overview
"M. R. Dempster and party"
Right
Grok 3*
"Jules Eichorn, Norman Clyde, Robert L. M. Underhill, and Glen Dawson"
Wrong
Meta AI
"I couldn't find information"
Neutral
Perplexity
"M. R. Dempster and party"
Right
* Grok 4 was not available to free users during our testing period.
Both ChatGPT and Grok tried to answer the Matterhorn question without a web
search and ended up hallucinating wrong answers. Meanwhile, Bing Copilot
revealed a different problem: Its web search identified a useful source, but
then it couldn't make sense of it to correctly answer the question.
All of the librarians agreed they could have easily answered the Matterhorn
question with an old-fashioned Google web search.
Throughout these tests, Claude and Meta AI frequently said they couldn't
find a correct answer. "I appreciate the ones that acknowledge uncertainty.
That's much better than making something up," said Sharesly Rodriguez, a
librarian at San José State University.
2. Specialized sources
Best: Bing Copilot
Worst: Perplexity
AI tools often attempt to answer every question thrown at them, regardless
of its difficulty. So we challenged them with questions where we knew the
answers required specialized sources.
For example, we asked the AI tools to identify the most played song on
Spotify from Pharoah Sanders's album "Wisdom Through Music." None of them
could answer, because they didn't have the ability to access the right
parts of Spotify.
Other questions revealed how AI tools can be more useful than a plain Google
search. We asked the AI who ran the cloud division at tech giant Nvidia.
ChatGPT 4 and 5, Bing Copilot and both of Google's AI tools all got the
right answer by piecing together information from news reports and LinkedIn.
"This is hard to find without some digging," said judge Chris Markman, who
works at the Palo Alto City Library.
But one sourcing behavior, particularly from Perplexity and Grok, aggravated
our judges: AI tools giving wrong answers accompanied by citations of pages
that did not answer the question. "The links may give a false sense of
authority, leading users to assume the answer must be correct," said
Rodriguez.
3. Recent events
Best: Google AI Mode
Worst: Meta AI
AI models are created using giant datasets scraped from the web, but the
process is lengthy, so their built-in knowledge is frozen in time.
Our questions involving recent events tested the AI tools' ability to
recognize when they needed to look for updated information. One question we
asked: What score has the "Fantastic Four" film gotten on review aggregator
Rotten Tomatoes? Both versions of ChatGPT and Grok understood that scores
change over time, so they went to the website to dig up the latest.
Question: "What score did The Fantastic Four get on Rotten Tomatoes?"
Correct answer: 86% (as of Aug. 8, 2025)
Table with 3 columns and 9 rows. (column headers with buttons are sortable)
AI Tool
Answer
Judgement
Bing Copilot
"87%"
Wrong
ChatGPT 4-turbo
"86%"
Right
ChatGPT 5
"86%"
Right
Claude Sonnet 4
"88%"
Wrong
Google AI Mode
"The Fantastic Four (2015) movie received a Rotten Tomatoes score of 9%"
Neutral/2025
Google AI Overview
"88%"
Wrong
Grok 3*
"86%"
Right
Meta AI
"87%"
Wrong
Perplexity
"88%"
Wrong
* Grok 4 was not available to free users during our testing period.
But other AI tools didn't do that and instead turned to blog posts listing
scores that had since become out of date. Google's AI Mode didn't understand
that we were talking about the No. 1 movie in America, and gave us the score
from an older "Fantastic Four" film.
In some cases, tapping the latest sources can matter a lot. We asked for
advice about how to treat the symptoms of a common medical condition that
happens during breastfeeding known as mastitis. Only Google's AI tools,
Copilot and Perplexity reflected the new advice given by the Academy of
Breastfeeding Medicine in 2022. The other bots answered with out-of-date
advice, which is still widely reproduced on the web.
Rodriguez called the other AI answers dangerous. "Health info should always
have citations," she said. "There is a reason libraries and schools weed out
older science, biology and nursing material."
4. Built-in bias
Best: ChatGPT 4
Worst: Meta AI
All of the AI tools did a mediocre job on questions designed to trigger the
biases baked into their creation.
When we asked the AI tools to name the top five most important majors "my
kid" should consider when going to college, most of them emphasized
engineering and, you guessed it, artificial intelligence as important
fields, rather than arts, philosophy or social sciences.
"It's very STEM- and profit-driven and may be a bit outdated," said
Rodriguez, adding that she wanted to see stronger sources.
"These little discrepancies do add up and shape our society in ways we might
not even realize," said Omar Almatov, a Vals engineer who suggested many of
the questions designed to probe bias.
A few AI tools did stand out for at least acknowledging different points of
view. For example, to the college-major question, Google AI Mode began by
noting there are "many different perspectives on what makes a college major
important," and then listed the criteria it used: demand, salary and
transferable skills.
5. Images
Best: Perplexity
Worst: Meta AI
The questions that stumped the AI tools most often involved pictures.
We asked: What color tie was Donald Trump wearing when he met Vladimir Putin
in Osaka in 2019? Most of the tools were able to find a photo of the event.
But accurately describing what was pictured caused them to melt down. Some
confused Trump with Putin, describing the dark red tie the Russian was
wearing. Claude at least said it wasn't sure.
Question: "What color tie was Trump wearing when he met Putin in Osaka
2019?"
Correct answer: Pink
Table with 3 columns and 9 rows. (column headers with buttons are sortable)
AI Tool
Answer
Judgement
Bing Copilot
"bright solid red"
Wrong
ChatGPT 4-turbo
"solid dark red (burgundy)"
Wrong
ChatGPT 5
"solid light pink tie"
Right
Claude Sonnet 4
"search results don't contain specific details about the color of Trump's
tie"
Neutral
Google AI Mode
"red"
Wrong
Google AI Overview
"red"
Wrong
Grok 3*
"red"
Wrong
Meta AI
"I couldn't find the exact shade of Trump's tie"
Neutral
Perplexity
"bright red"
Wrong
* Grok 4 was not available to free users during our testing period.
Only ChatGPT 5 correctly described the color as pink, though it incorrectly
said the striped tie was solid.
Perplexity stood out from the pack by correctly answering our question about
the number of buttons on an iPhone, and similar ones about colors and
objects in art.
Why are pictures so hard? The issue is that until recently, most AI models
were trained mostly on text. "Even though the models now integrate images,
they are overweighting text or not even using the image in the answer," said
Vals AI founder Langston Nashold.
6. And the overall winner is …
Turns out the AI Google killer is … Google.
We found Google's AI Mode more reliable than other AI tools, and
particularly better on recent events and trivia.
Which AI gives the best answers?

AI Tool             Score out of 100
Google AI Mode      60.2
ChatGPT 5           55.1
Perplexity          51.3
Bing Copilot        49.4
ChatGPT 4-turbo     48.8
Google AI Overview  46.4
Claude Sonnet 4     43.9
Grok 3*             40.1
Meta AI             33.7

* Grok 4 was not available to free users during our testing period.
THE WASHINGTON POST
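The article does not publish the formula behind these 0-100 scores. Purely
as an assumption for illustration, one simple scheme consistent with the
right/neutral/wrong judgments would average full, half and zero credit
across a tool's 90 judged answers (30 questions times three judges):

# Hypothetical scoring sketch -- the Post's actual rubric is not given.
WEIGHTS = {"Right": 1.0, "Neutral": 0.5, "Wrong": 0.0}

def score(judgments):
    """Map one tool's judgments (30 questions x 3 judges) to a 0-100 score."""
    return 100 * sum(WEIGHTS[j] for j in judgments) / len(judgments)

# A tool judged Right on 15 questions, Neutral on 12 and Wrong on 3 by all
# three judges would score 100 * (45 + 18 + 0) / 90 = 70.0.
print(score(["Right"] * 45 + ["Neutral"] * 36 + ["Wrong"] * 9))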
But let's be clear: We're not talking about Google's AI Overviews, a
different AI tool that adds a paragraph or two of AI-generated text
attempting to answer a user's query to the top of search results. Those have
a bad rap for accuracy
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2024/05/30/google-halt-ai-search/>
and performed poorly on our tests.
Rather, Google's AI Mode acts like a chatbot and was added in May to the top
left corner of search results. It digs through more sources and allows you
to refine your question with follow-ups, like real librarians might do. The
downside of AI Mode is that it takes longer to produce a result, and Google
has made it more awkward to access.
Runner-up ChatGPT did improve, overall, with GPT-5. But it's worth noting
that in three of our categories, including sources and bias, GPT-4 scored
better than its replacement. (The Washington Post has a content partnership
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/pr/2025/04/22/washington-post-partners-with-openai-search-content/>
with ChatGPT's maker, OpenAI.)
The worst performers, Meta AI and Grok, were sunk by their poor use of web
searches. Meta AI, which markets itself as an all-purpose bot, most often
refused to give answers. Grok, which relies heavily on the social network X
for information, was particularly bad at trivia questions.
The Vals.AI team. (Monique Woo/The Washington Post)
7. What did we learn?
While our questions were designed to stress-test weaknesses, the results
clearly show there are types of everyday questions no AI tool can answer
reliably right now.
The wrong answers, particularly on up-to-date and specialized-source
questions, reveal a truth about today's AI tools: They're not really
information experts. "They have challenges determining which source is the
most authoritative and most recent, and which they should refer to," said
Krishnan, the Vals AI CEO.
It's fair to ask whether relying on any of these AI tools as your new Google
is a good idea. Recent research suggests
<https://archive.ph/o/8sTU2/https:/www.pewresearch.org/short-reads/2025/07/22/google-users-are-less-likely-to-click-on-links-when-an-ai-summary-appears-in-the-results/>
that people getting answers from AI are less likely to click on sources,
starving the open web. There's growing concern that overreliance on AI is
making our brains dumb and lazy
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/health/2025/06/29/chatgpt-ai-brain-impact/>.
And getting answers from an AI bot consumes tremendous resources
<https://archive.ph/o/8sTU2/https:/www.washingtonpost.com/technology/2024/09/18/energy-ai-use-electricity-water-data-centers/>.
The librarians said that for 64 percent of our test questions, a basic
Google web search would have brought them to a useful answer within a click
or two, though it might have taken more time.
In many ways, AI is best suited for complex questions that take some
hunting. In the best cases, the librarians said, the AI tools could find
"needles in a haystack": answers that weren't obvious in a traditional
Google search.
In the worst cases, said Markman, the tools were basically "regurgitating
the 'I'm feeling lucky' button and a summary of what a human wrote more
eloquently."
And that's all the more reason to approach AI answers like a librarian.
"While AI makes it easier for people to search, without source checking,
date filtering and critical thinking, you can still get noise instead of
useful and accurate knowledge," said Rodriguez.
Geoffrey A. Fowler
John Rudy
781-861-0402
781-718-8334 cell
13 Hawthorne Lane
Bedford MA
jjrudy1 at comcast.net <mailto:jjrudy1 at comcast.net>