
🧠 LLMs Can't Think, Here's Why

PLUS: How Australian Researchers Uncovered LLMs' Critical Limitations

Welcome back, AI prodigies!

In today's Sunday Special:

  • 📦Distribution vs. Instance-Level Tasks

  • 📚Planning, Reasoning, and Formal Logic

  • 🤖LLM Reasoning, Refuted

  • 🔑Key Takeaway

Read Time: 7 minutes

🎓Key Terms

  • Torus: the surface of a donut-shaped geometric object.

  • Large Language Models (LLMs): AI models pre-trained on vast amounts of data to generate human-like text.

  • Token: the smallest unit of data an AI model uses to process and generate text, much like how we break sentences down into words or characters.

  • LLM-Modulo: a framework that combines LLMs with external verifiers to check the generated content.

  • Retrieval-Augmented Generation (RAG): a technique that improves the accuracy of AI models by pulling relevant, up-to-date data directly related to a user's query.

🩺 PULSE CHECK

Can conversational chatbots like OpenAI's ChatGPT use logic or plan?


📦DISTRIBUTION VS. INSTANCE-LEVEL TASKS

We've often written about LLMs' intelligence, their reasoning limitations, and whether they can become more capable than humans. We once shared an anecdote about Sean M. Carroll, an American theoretical physicist and philosopher. Let's revisit that anecdote to learn why conversational chatbots may not be able to reason. In 2023, Carroll suspected OpenAI's ChatGPT didn't truly understand human prompts. To test his theory, he fed it the following prompt:

"Imagine we're playing a modified version of chess where the board is treated as a torus. From any one of the four sides, squares on the directly opposite side are counted as adjacent, and pieces can move in that direction. Is it possible to say whether white or black will generally win this kind of chess match?"

-Sean M. Carroll's "Human Prompt Hypothesis"

OpenAI's ChatGPT provided a long-winded and equivocal answer. It said chess on a torus-shaped board would open up new strategic and tactical possibilities, but it never concluded whether white or black would be more likely to win than on a standard, flat chess board. That's because OpenAI's ChatGPT analyzes strings of text and produces responses in which each subsequent word is the one most likely to occur given the words before it.
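To make that mechanism concrete, here's a minimal sketch of next-word prediction with a toy, hand-set probability table. A real LLM computes these probabilities with a neural network over an enormous vocabulary, so everything below is an illustration, not an implementation:

```python
import random

# Toy next-word model: for each three-word context, a probability table over
# possible next words. The numbers are invented purely for illustration.
NEXT_WORD_PROBS = {
    ("chess", "on", "a"): {"torus": 0.4, "board": 0.6},
    ("on", "a", "torus"): {"opens": 0.7, "wraps": 0.3},
}

def sample_next_word(context):
    """Pick the next word in proportion to its probability given the context."""
    candidates = NEXT_WORD_PROBS.get(tuple(context[-3:]))
    if candidates is None:
        return None  # unseen context; a real model always has a distribution
    words, weights = zip(*candidates.items())
    return random.choices(words, weights=weights, k=1)[0]

# Each step picks a locally likely word. Nothing in this loop models a goal,
# a board state, or the consequences of earlier choices -- which is the point.
context = ["chess", "on", "a"]
while (word := sample_next_word(context)) is not None:
    context.append(word)
print(" ".join(context))  # e.g., "chess on a torus opens"
```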

LLMs are text-generating machines that learn statistical patterns in their training data. They excel at so-called distribution-level tasks (e.g., tone, formality, and syntax) that require learning a pattern, like regurgitating the style of the great novelist Ernest Hemingway. LLMs can quickly correct text formatting or even adopt a persona set by the user's query, such as: "Explain AI like I'm in elementary school."

However, the planning and reasoning required to comprehend a prompt about a modified version of chess are instance-level tasks: they demand a deeper understanding of the individual elements of one specific problem (here, the rules of a single chess variant), not just its surface patterns. That's why LLMs like OpenAI's ChatGPT failed to answer Carroll's prompt.

Today, we'll settle the debate on LLMs' planning and reasoning capabilities. First, we'll formally define planning and reasoning. Then, we'll address three common reasons why folks often conflate LLM text generation with logical thought. And as always, we'll close with a key takeaway.

📚PLANNING, REASONING, AND FORMAL LOGIC

📆Planning?

LLMs like OpenAI's ChatGPT are primarily designed to generate text based on patterns from large amounts of data. They cannot inherently plan or sequence actions over time toward a goal. LLMs generate text one token at a time without explicitly understanding future outcomes or states. Planning requires a structured approach to foresee steps and consequences, which differs from how LLMs predict the next word in a sequence.

💬Reasoning?

LLMs can't reason if they can't generalize their knowledge to new, unseen tasks. The best example of this is multiplication and introductory algebra. Even after being fine-tuned on a vast dataset of three-digit multiplication problems, LLMs failed to solve five-digit multiplication. This suggests that while LLMs can perform well on familiar tasks, they may lack the ability to truly understand the underlying principles and apply them to novel situations.
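If you'd like to probe this yourself, here's a minimal sketch of that generalization test; `ask_llm` is a hypothetical placeholder for whichever chat API you use:

```python
import random

def ask_llm(prompt: str) -> str:
    """Hypothetical placeholder: wire this up to whichever chat API you use."""
    raise NotImplementedError

def multiplication_accuracy(digits: int, trials: int = 100) -> float:
    """Fraction of random n-digit multiplications the model answers exactly."""
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        reply = ask_llm(f"What is {a} * {b}? Reply with only the number.")
        correct += reply.strip() == str(a * b)  # exact arithmetic as ground truth
    return correct / trials

# The failure mode described above: scores that look fine at 3 digits but
# collapse at 5, because a pattern was memorized, not an algorithm learned.
# print(multiplication_accuracy(3), multiplication_accuracy(5))
```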

You might say that current LLMs like OpenAI's GPT-4o ("o" for "omni") can do that correctly, and you'd be right. They may appear capable of complex tasks like five-digit multiplication. Still, their underlying mechanism relies on external tools like calculators or pre-programmed algorithms within an LLM-Modulo framework, where additional computational resources augment the LLM's capabilities.
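Here's a minimal sketch of that LLM-Modulo idea under simplifying assumptions: the model proposes, an external tool re-checks the arithmetic, and failures are fed back as critiques. The `propose` argument stands in for any LLM call:

```python
import re

def verify_arithmetic(claim: str) -> bool:
    """External verifier: re-check any 'a * b = c' claim with exact arithmetic."""
    match = re.search(r"(\d+)\s*\*\s*(\d+)\s*=\s*(\d+)", claim)
    if match is None:
        return False
    a, b, c = (int(g) for g in match.groups())
    return a * b == c

def llm_modulo(prompt: str, propose, max_rounds: int = 3):
    """Generate-test loop: `propose` is any LLM call (hypothetical placeholder)."""
    feedback = ""
    for _ in range(max_rounds):
        claim = propose(prompt + feedback)
        if verify_arithmetic(claim):
            return claim  # only verified output leaves the loop
        feedback = f"\nYour previous answer '{claim}' was wrong. Try again."
    return None  # the model never produced an answer the verifier accepts
```

The reliability lives in the verifier, not the model; remove the loop and you're back to unchecked text prediction.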

💭Formal Logic?

Formal logic helps us determine whether a conclusion is a logical consequence of the given premises. It's like a set of guidelines for evaluating the reasoning behind an argument: symbols represent the argument's parts, and specific rules for manipulating those symbols determine whether the argument is valid. If you've taken a geometry or philosophy course, you might recognize the following statement: "If P implies Q, and P is true, then Q must also be true."
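That rule, known as modus ponens, is mechanical enough to check by brute force. Here's a tiny sketch that verifies it against every possible truth assignment:

```python
from itertools import product

def implies(p: bool, q: bool) -> bool:
    """Material implication: 'P implies Q' fails only when P is true and Q is false."""
    return (not p) or q

# Modus ponens: in every world where 'P implies Q' and 'P' both hold, 'Q' holds.
for p, q in product([True, False], repeat=2):
    premises_hold = implies(p, q) and p
    assert not premises_hold or q  # no assignment satisfies the premises with Q false
print("Modus ponens is valid under every truth assignment.")
```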

The Stanford Research Institute Problem Solver (STRIPS) is an excellent example of a formal planner. This framework breaks a problem down into a series of smaller sub-problems and leverages a heuristic search algorithm to find a solution to each one. Unlike the LLM-Modulo approach, the planning system doesn't need an external verifier to determine whether its output is correct. STRIPS has been applied to a range of AI problems, including scheduling and planning tasks.
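To give a feel for the STRIPS representation, here's a minimal sketch over a made-up travel domain. It uses plain breadth-first search where real STRIPS planners use heuristic search:

```python
from collections import deque

# STRIPS-style actions: preconditions must hold; effects add and delete facts.
# Domain and facts are invented for illustration.
ACTIONS = {
    "buy_ticket": ({"at_station"}, {"has_ticket"}, set()),
    "board_train": ({"at_station", "has_ticket"}, {"on_train"}, {"at_station"}),
    "ride_to_city": ({"on_train"}, {"at_city"}, {"on_train"}),
}

def plan(start: frozenset, goal: set):
    """Breadth-first search over world states until every goal fact holds."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, steps = queue.popleft()
        if goal <= state:
            return steps  # every goal fact holds in this state
        for name, (pre, add, delete) in ACTIONS.items():
            if pre <= state:
                nxt = frozenset((state - delete) | add)
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, steps + [name]))
    return None  # provably no plan exists in this domain

print(plan(frozenset({"at_station"}), {"at_city"}))
# ['buy_ticket', 'board_train', 'ride_to_city'] -- every step is checkable
```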

While LLMs can generate plans, their ability to produce truly feasible and verifiable schedules is limited. Unlike formal planning systems like STRIPS, which rely on logical reasoning and constraint satisfaction, LLMs generate plans based on patterns and correlations learned from their training data. That approach can yield plans that are impractical, incomplete, or incorrect. For example, when I asked OpenAI's ChatGPT to plan a trip to Europe, it suggested itineraries that weren't verifiable: it didn't check whether train services were operational in the morning or if a museum was open on Monday. It just confidently spewed nonsense.
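The checks the chatbot skipped are exactly what an external verifier would perform. Here's a minimal sketch with hypothetical itinerary and opening-hours data:

```python
# Hypothetical itinerary and constraint data, purely for illustration.
ITINERARY = [
    {"day": "Monday", "activity": "train to Paris", "hour": 6},
    {"day": "Monday", "activity": "visit the museum", "hour": 10},
]
CONSTRAINTS = {
    "train to Paris": lambda step: step["hour"] >= 7,          # first train at 07:00
    "visit the museum": lambda step: step["day"] != "Monday",  # closed Mondays
}

def verify_plan(itinerary):
    """Return every step that violates a known constraint; empty means feasible."""
    return [
        f"{step['activity']} on {step['day']} at {step['hour']}:00"
        for step in itinerary
        if not CONSTRAINTS.get(step["activity"], lambda s: True)(step)
    ]

print(verify_plan(ITINERARY))  # both steps fail: this plan was never checked
```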

LLMs also inherently lack a predictable order or plan because their outputs are typically sampled at random. In other words, an LLM can take a different route every time it receives the same input, and consistent, reproducible outputs are a precursor to formal planning.

🤖LLM REASONING, REFUTED

Here are three common reasons why many believe LLMs can reason:

1. LLMs Can Generate Code.

They can generate and correct code, find bugs, and produce multiple working solutions with few hallucinations. Over 37,000 organizations pay for their 1.3 million developers to use the most popular AI coding tool, GitHub Copilot. So, what gives?

LLMs can retrieve code snippets, but some mistakenly believe the LLMs are deploying reasoning to write that code. In reality, LLMs aren't trained only on final, correct code but on entire code repositories, where they've seen every iteration of the code and precisely what changed in each version. When you prompt an LLM for the first time, it retrieves the first version of the code, and when you ask it to find the errors, it retrieves the next version of the same code, giving you the illusion of reasoning and understanding.

2. We Can Fine-Tune LLMs to Plan in a Specific Domain.

Fine-tuning involves training an AI model on a specific dataset to improve its performance on particular tasks.

Fine-tuning may not be as effective as we initially thought. Dr. Scott Barnett, Deputy Head of AI Centric Applications, and his team from Deakin University (DU) in Australia recently investigated the impact of fine-tuning on the performance of LLMs within Retrieval-Augmented Generation (RAG) systems. They used three open-source datasets from different scientific domains to evaluate fine-tuned LLMs against baseline AI models. The AI models assessed included Mistral AI's Mistral Chat, Meta's Llama 2, and OpenAI's GPT-4, with fine-tuning conducted on varying sizes of domain-specific data. The evaluation focused on the accuracy and completeness of the AI models' responses.

Surprisingly, the results indicate that fine-tuning generally led to a decline in performance across multiple domains. Also, increasing the sample size for fine-tuning didn't necessarily improve performance and, in some cases, decreased accuracy and completeness. This study highlights the need to properly validate fine-tuned AI models for domain-specific tasks.
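To restate the shape of that comparison (this is a sketch, not the DU team's actual code), here's a minimal evaluation harness where `ask_model` and `judge` are hypothetical placeholders for a model API call and an accuracy/completeness grader:

```python
def evaluate(model_name: str, dataset: list, ask_model, judge) -> dict:
    """Average accuracy and completeness of one model over a Q&A dataset."""
    scores = [
        judge(item["question"], item["reference"],
              ask_model(model_name, item["question"]))
        for item in dataset
    ]
    return {
        "accuracy": sum(s["accuracy"] for s in scores) / len(scores),
        "completeness": sum(s["completeness"] for s in scores) / len(scores),
    }

# The study's surprise, as a comparison you could run yourself:
# evaluate("llama-2-finetuned", data, ask_model, judge) can score *below*
# evaluate("llama-2-base", data, ask_model, judge) on domain questions.
```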

3. LLMs Outperform Humans on Expertise Tests.

While it's true that LLMs mop the floor with humans across various exams, from the Family Medicine Board Exams (FMBE) to official creativity challenges, these victories likely reflect their text-prediction capabilities.

To test this hypothesis, researchers from the University of Technology Sydney (UTS) in Australia introduced a linguistic benchmark to assess the limitations of LLMs across several cognitive domains. Despite LLMs' impressive capabilities, the study identified significant deficiencies in tasks that humans perform easily, like logical reasoning, spatial intelligence, and linguistic understanding. The best LLMs scored between 16% (Google's Gemini Pro) and 38% (OpenAI's GPT-4 Turbo), versus a human adult benchmark score of 86%.

🔑KEY TAKEAWAY

Some decisions can be outsourced to AI, such as picking a Netflix show, finding the perfect YouTube video, or discovering an Amazon item, without significant downsides. But we cannot outsource thought to LLMs because they can't think. And even if they could, we shouldn't. Technology must augment human intelligence, not degrade it.

📒FINAL NOTE

If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.

How was today's newsletter?

ā¤ļøTAIP Review of the Week

"I LOVE the Sunday Special deep dives into AI's current limitations."

-Catherine (⭐️⭐️⭐️⭐️⭐️Nailed it!)

REFER & EARN

🎉Your Friends Learn, You Earn!

You currently have 0 referrals, only 1 away from receiving the ⚙️Ultimate Prompt Engineering Guide.

Refer 5 friends to enter 🎰October's $200 Gift Card Giveaway.
