
🧠 LLMs Can't Think, Here's Why

PLUS: How Australian Researchers Uncovered LLMs' Critical Limitations

Welcome back, AI prodigies!

In today's Sunday Special:

  • 📦Distribution vs. Instance-Level Tasks

  • 📚Planning, Reasoning, and Formal Logic

  • 🤖LLM Reasoning, Refuted

  • 🔑Key Takeaway

Read Time: 7 minutes

🎓Key Terms

  • Torus: the surface of a donut-shaped geometric object.

  • Large Language Models (LLMs): AI models pre-trained on vast amounts of data to generate human-like text.

  • Token: the smallest unit of data an AI model uses to process and generate text, much like how we break sentences down into words or characters.

  • LLM-Modulo: a framework that combines LLMs with external verifiers to check the generated content.

  • Retrieval-Augmented Generation (RAG): a technique that improves the accuracy of AI models by pulling relevant, up-to-date data directly related to a user's query.

🩺 PULSE CHECK

Can conversational chatbots like OpenAI's ChatGPT use logic or plan?


📦DISTRIBUTION VS. INSTANCE-LEVEL TASKS

We've often written about LLMs' intelligence, their reasoning limitations, and whether they can become more capable than humans. We once shared an anecdote about Sean M. Carroll, an American theoretical physicist and philosopher. Let's revisit that anecdote to learn why conversational chatbots may not be able to reason. In 2023, Carroll suspected OpenAI's ChatGPT didn't truly understand human prompts. To test his theory, he fed it the following prompt:

"Imagine we're playing a modified version of chess where the board is treated as a torus. From any one of the four sides, squares on the directly opposite side are counted as adjacent, and pieces can move in that direction. Is it possible to say whether white or black will generally win this kind of chess match?"

-Sean M. Carroll's "Human Prompt Hypothesis"

OpenAI's ChatGPT provided a long-winded and equivocal answer. It said chess on a torus-shaped board would open up new strategic and tactical possibilities, but it never concluded whether white or black would be more likely to win than on a standard, flat chess board. That's because OpenAI's ChatGPT analyzes strings of text and produces responses in which each subsequent word is the one most likely to occur given the words before it.
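To make that mechanism concrete, here's a minimal sketch of next-word prediction with a toy, hand-set probability table. A real LLM computes these probabilities with a neural network over an enormous vocabulary, so everything below is an illustration, not an implementation:

```python
import random

# Toy next-word model: for each three-word context, a probability table over
# possible next words. The numbers are invented purely for illustration.
NEXT_WORD_PROBS = {
    ("chess", "on", "a"): {"torus": 0.4, "board": 0.6},
    ("on", "a", "torus"): {"opens": 0.7, "wraps": 0.3},
}

def sample_next_word(context):
    """Pick the next word in proportion to its probability given the context."""
    candidates = NEXT_WORD_PROBS.get(tuple(context[-3:]))
    if candidates is None:
        return None  # unseen context; a real model always has a distribution
    words, weights = zip(*candidates.items())
    return random.choices(words, weights=weights, k=1)[0]

# Each step picks a locally likely word. Nothing in this loop models a goal,
# a board state, or the consequences of earlier choices -- which is the point.
context = ["chess", "on", "a"]
while (word := sample_next_word(context)) is not None:
    context.append(word)
print(" ".join(context))  # e.g., "chess on a torus opens"
```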

LLMs are text-generating machines that learn statistical patterns in their training data. They excel at so-called distribution-level tasks (e.g., tone, formality, and syntax) that require learning a pattern, like regurgitating the style of the great novelist Ernest Hemingway. LLMs can quickly correct text formatting or even adopt a persona set by the user's query, such as: "Explain AI like I'm in elementary school."

However, the planning and reasoning required to comprehend a prompt about a modified version of chess are instance-level tasks: they demand a deeper understanding of the individual elements of one specific problem (here, the rules of a single chess variant), not just its surface patterns. That's why LLMs like OpenAI's ChatGPT failed to answer Carroll's prompt.

Today, we'll settle the debate on LLMs' planning and reasoning capabilities. First, we'll formally define planning and reasoning. Then, we'll address three common reasons why folks often conflate LLM text generation with logical thought. And as always, we'll close with a key takeaway.

📚PLANNING, REASONING, AND FORMAL LOGIC

📆Planning?

LLMs like OpenAI's ChatGPT are primarily designed to generate text based on patterns from large amounts of data. They cannot inherently plan or sequence actions over time toward a goal. LLMs generate text one token at a time without explicitly understanding future outcomes or states. Planning requires a structured approach to foresee steps and consequences, which differs from how LLMs predict the next word in a sequence.

💬Reasoning?

LLMs can't reason if they can't generalize their knowledge to new, unseen tasks. The best example of this is multiplication and introductory algebra. Even after being fine-tuned on a vast dataset of three-digit multiplication problems, LLMs failed to solve five-digit multiplication. This suggests that while LLMs can perform well on familiar tasks, they may lack the ability to truly understand the underlying principles and apply them to novel situations.
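If you'd like to probe this yourself, here's a minimal sketch of that generalization test; `ask_llm` is a hypothetical placeholder for whichever chat API you use:

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: wire this up to whichever chat API you use."""
    raise NotImplementedError

def multiplication_accuracy(digits: int, trials: int = 100) -> float:
    """Fraction of random n-digit multiplications the model answers exactly."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask_llm(f"What is {a} * {b}? Reply with only the number.")
        correct += reply.strip() == str(a * b)  # exact arithmetic as ground truth
    return correct / trials

# The failure mode described above: scores that look fine at 3 digits but
# collapse at 5, because a pattern was memorized, not an algorithm learned.
# print(multiplication_accuracy(3), multiplication_accuracy(5))
```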

You might say that current LLMs like OpenAI's GPT-4o ("o" for "omni") can do that correctly, and you'd be right. They may appear capable of complex tasks like five-digit multiplication. Still, their underlying mechanism relies on external tools like calculators or pre-programmed algorithms within an LLM-Modulo framework, where additional computational resources augment the LLM's capabilities.
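Here's a minimal sketch of that LLM-Modulo idea under simplifying assumptions: the model proposes, an external tool re-checks the arithmetic, and failures are fed back as critiques. The `propose` argument stands in for any LLM call:

```python
import re

def verify_arithmetic(claim: str) -> bool:
    """External verifier: re-check any 'a * b = c' claim with exact arithmetic."""
    match = re.search(r"(\d+)\s*\*\s*(\d+)\s*=\s*(\d+)", claim)
    if match is None:
        return False
    a, b, c = (int(g) for g in match.groups())
    return a * b == c

def llm_modulo(prompt: str, propose, max_rounds: int = 3):
    """Generate-test loop: `propose` is any LLM call (hypothetical placeholder)."""
    feedback = ""
    for _ in range(max_rounds):
        claim = propose(prompt + feedback)
        if verify_arithmetic(claim):
            return claim  # only verified output leaves the loop
        feedback = f"\nYour previous answer '{claim}' was wrong. Try again."
    return None  # the model never produced an answer the verifier accepts
```

The reliability lives in the verifier, not the model; remove the loop and you're back to unchecked text prediction.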

💭Formal Logic?

Formal logic helps us determine whether a conclusion is a logical consequence of the given premises. It's like a set of guidelines for evaluating the reasoning behind an argument: symbols represent the argument's parts, and specific rules for manipulating those symbols determine whether the argument is valid. If you've taken a geometry or philosophy course, you might recognize the following statement: "If P implies Q, and P is true, then Q must also be true."
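That rule, known as modus ponens, is mechanical enough to check by brute force. Here's a tiny sketch that verifies it against every possible truth assignment:

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: 'P implies Q' fails only when P is true and Q is false."""
    return (not p) or q

# Modus ponens: in every world where 'P implies Q' and 'P' both hold, 'Q' holds.
for p, q in product([True, False], repeat=2):
    premises_hold = implies(p, q) and p
    assert not premises_hold or q  # no assignment satisfies the premises with Q false
print("Modus ponens is valid under every truth assignment.")
```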

The Stanford Research Institute Problem Solver (STRIPS) is an excellent example of a formal planner. This framework breaks a problem down into a series of smaller sub-problems and leverages a heuristic search algorithm to find a solution to each one. Unlike the LLM-Modulo approach, the planning system doesn't need an external verifier to determine whether its output is correct. STRIPS has been applied to a range of AI problems, including scheduling and planning tasks.
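To give a feel for the STRIPS representation, here's a minimal sketch over a made-up travel domain. It uses plain breadth-first search where real STRIPS planners use heuristic search:

```python
from collections import deque

# STRIPS-style actions: preconditions must hold; effects add and delete facts.
# Domain and facts are invented for illustration.
ACTIONS = {
    "buy_ticket": ({"at_station"}, {"has_ticket"}, set()),
    "board_train": ({"at_station", "has_ticket"}, {"on_train"}, {"at_station"}),
    "ride_to_city": ({"on_train"}, {"at_city"}, {"on_train"}),
}

def plan(start: frozenset, goal: set):
    """Breadth-first search over world states until every goal fact holds."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps  # every goal fact holds in this state
        for name, (pre, add, delete) in ACTIONS.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None  # provably no plan exists in this domain

print(plan(frozenset({"at_station"}), {"at_city"}))
# ['buy_ticket', 'board_train', 'ride_to_city'] -- every step is checkable
```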

While LLMs can generate plans, their ability to produce truly feasible and verifiable schedules is limited. Unlike formal planning systems like STRIPS, which rely on logical reasoning and constraint satisfaction, LLMs generate plans based on patterns and correlations learned from their training data. That approach can yield plans that are impractical, incomplete, or incorrect. For example, when I asked OpenAI's ChatGPT to plan a trip to Europe, it suggested itineraries that weren't verifiable: it didn't check whether train services were operational in the morning or if a museum was open on Monday. It just confidently spewed nonsense.
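The checks the chatbot skipped are exactly what an external verifier would perform. Here's a minimal sketch with hypothetical itinerary and opening-hours data:

```python
# Hypothetical itinerary and constraint data, purely for illustration.
ITINERARY = [
    {"day": "Monday", "activity": "train to Paris", "hour": 6},
    {"day": "Monday", "activity": "visit the museum", "hour": 10},
]
CONSTRAINTS = {
    "train to Paris": lambda step: step["hour"] >= 7,          # first train at 07:00
    "visit the museum": lambda step: step["day"] != "Monday",  # closed Mondays
}

def verify_plan(itinerary):
    """Return every step that violates a known constraint; empty means feasible."""
    return [
        f"{step['activity']} on {step['day']} at {step['hour']}:00"
        for step in itinerary
        if not CONSTRAINTS.get(step["activity"], lambda s: True)(step)
    ]

print(verify_plan(ITINERARY))  # both steps fail: this plan was never checked
```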

LLMs also inherently lack a predictable order or plan because their outputs are typically sampled at random. In other words, an LLM can take a different route every time it receives the same input, and consistent, reproducible outputs are a precursor to formal planning.

🤖LLM REASONING, REFUTED

Here are three common reasons why many believe LLMs can reason:

1. LLMs Can Generate Code.

They can generate and correct code, find bugs, and produce multiple working solutions with few hallucinations. Over 37,000 organizations pay for their 1.3 million developers to use the most popular AI coding tool, GitHub Copilot. So, what gives?

LLMs can retrieve code snippets, but some mistakenly believe the LLMs are deploying reasoning to write that code. In reality, LLMs aren't trained only on final, correct code but on entire code repositories, where they've seen every iteration of the code and precisely what changed in each version. When you prompt an LLM for the first time, it retrieves the first version of the code, and when you ask it to find the errors, it retrieves the next version of the same code, giving you the illusion of reasoning and understanding.

2. We Can Fine-Tune LLMs to Plan in a Specific Domain.

Fine-tuning involves training an AI model on a specific dataset to improve its performance on particular tasks.

Fine-tuning may not be as effective as we initially thought. Dr. Scott Barnett, Deputy Head of AI Centric Applications, and his team from Deakin University (DU) in Australia recently investigated the impact of fine-tuning on the performance of LLMs within Retrieval-Augmented Generation (RAG) systems. They used three open-source datasets from different scientific domains to evaluate fine-tuned LLMs against baseline AI models. The AI models assessed included Mistral AI's Mistral Chat, Meta's Llama 2, and OpenAI's GPT-4, with fine-tuning conducted on varying sizes of domain-specific data. The evaluation focused on the accuracy and completeness of the AI models' responses.

Surprisingly, the results indicate that fine-tuning generally led to a decline in performance across multiple domains. Also, increasing the sample size for fine-tuning didn't necessarily improve performance and, in some cases, decreased accuracy and completeness. This study highlights the need to properly validate fine-tuned AI models for domain-specific tasks.
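To restate the shape of that comparison (this is a sketch, not the DU team's actual code), here's a minimal evaluation harness where `ask_model` and `judge` are hypothetical placeholders for a model API call and an accuracy/completeness grader:

```python
def evaluate(model_name: str, dataset: list, ask_model, judge) -> dict:
    """Average accuracy and completeness of one model over a Q&A dataset."""
    scores = [
        judge(item["question"], item["reference"],
              ask_model(model_name, item["question"]))
        for item in dataset
    ]
    return {
        "accuracy": sum(s["accuracy"] for s in scores) / len(scores),
        "completeness": sum(s["completeness"] for s in scores) / len(scores),
    }

# The study's surprise, as a comparison you could run yourself:
# evaluate("llama-2-finetuned", data, ask_model, judge) can score *below*
# evaluate("llama-2-base", data, ask_model, judge) on domain questions.
```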

3. LLMs Outperform Humans on Expertise Tests.

While it's true that LLMs mop the floor with humans across various exams, from the Family Medicine Board Exams (FMBE) to official creativity challenges, these victories likely reflect their text-prediction capabilities.

To test this hypothesis, researchers from the University of Technology Sydney (UTS) in Australia introduced a linguistic benchmark to assess the limitations of LLMs across several cognitive domains. Despite LLMs' impressive capabilities, the study identified significant deficiencies in tasks that humans perform easily, like logical reasoning, spatial intelligence, and linguistic understanding. The best LLMs scored between 16% (Google's Gemini Pro) and 38% (OpenAI's GPT-4 Turbo), versus a human adult benchmark score of 86%.

🔑KEY TAKEAWAY

Some decisions can be outsourced to AI, such as picking a Netflix show, finding the perfect YouTube video, or discovering an Amazon item, without significant downsides. But we cannot outsource thought to LLMs because they can't think. And even if they could, we shouldn't. Technology must augment human intelligence, not degrade it.

📒FINAL NOTE

If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.

How was today's newsletter?

ā¤ļøTAIP Review of the Week

"I LOVE the Sunday Special deep dives into AI's current limitations."

-Catherine (⭐️⭐️⭐️⭐️⭐️Nailed it!)

REFER & EARN

🎉Your Friends Learn, You Earn!

You currently have 0 referrals, only 1 away from receiving the ⚙️Ultimate Prompt Engineering Guide.

Refer 5 friends to enter 🎰October's $200 Gift Card Giveaway.
