- The AI Pulse
LLMs Can't Think, Here's Why
PLUS: How Australian Researchers Uncovered LLMs' Critical Limitations
Welcome back, AI prodigies!
In today's Sunday Special:
Distribution vs. Instance-Level Tasks
Planning, Reasoning, and Formal Logic
LLM Reasoning, Refuted
Key Takeaway
Read Time: 7 minutes
Key Terms
Torus: the surface of a donut-shaped geometric object.
Large Language Models (LLMs): AI models pre-trained on vast amounts of data to generate human-like text.
Token: the smallest unit of data an AI model uses to process and generate text, much as we break sentences down into words or characters.
LLM-Modulo: a framework that combines LLMs with external verifiers to check the generated content.
Retrieval-Augmented Generation (RAG): a technique that improves the accuracy of AI models by pulling relevant, up-to-date data directly related to a user's query.
PULSE CHECK
Can conversational chatbots like OpenAI's ChatGPT use logic or plan? Vote below to view live results.
DISTRIBUTION VS. INSTANCE-LEVEL TASKS
We've often written about LLMs' intelligence, their reasoning limitations, and whether they can become more capable than humans. We once shared an anecdote about Sean M. Carroll, an American theoretical physicist and philosopher. Let's revisit that anecdote to learn why conversational chatbots may not be able to reason. In 2023, Carroll suspected OpenAI's ChatGPT didn't truly understand human prompts. To test his theory, he fed it the following prompt:
"Imagine we're playing a modified version of chess where the board is treated as a torus. From any one of the four sides, squares on the directly opposite side are counted as adjacent, and pieces can move in that direction. Is it possible to say whether white or black will generally win this kind of chess match?"
OpenAI's ChatGPT provided a long-winded, equivocal answer. It said chess on a torus-shaped board would open up new strategic and tactical possibilities but didn't conclude whether white or black would be more likely to win relative to a standard chess board. That's because OpenAI's ChatGPT analyzes text strings and produces responses in which each subsequent word is the one most likely to occur given the words before it.
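To make the torus wrap-around concrete, here's a minimal Python sketch (the board size and function name are our own, not part of Carroll's prompt). On a torus, moving off one edge wraps to the opposite edge, so even a "corner" square has eight neighbors instead of three.

```python
BOARD_SIZE = 8

def torus_neighbors(rank, file):
    """Return the squares adjacent to (rank, file), with wrap-around."""
    neighbors = []
    for dr in (-1, 0, 1):
        for df in (-1, 0, 1):
            if dr == 0 and df == 0:
                continue
            # The modulo makes opposite edges adjacent, as on a torus.
            neighbors.append(((rank + dr) % BOARD_SIZE, (file + df) % BOARD_SIZE))
    return neighbors

# A corner like (0, 0) has 8 neighbors on a torus, versus only 3
# on a standard board -- e.g., (7, 7) wraps around to touch it.
print(len(torus_neighbors(0, 0)))  # 8
```

The wrap-around is what erases the corners and edges that ordinary chess strategy depends on, which is exactly what makes Carroll's question hard to answer by pattern-matching alone.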
LLMs are text-generating machines that learn sequences in their training data. They excel at so-called distribution-level tasks (e.g., tone, formality, and syntax) that require learning a pattern, like imitating the style of the novelist Ernest Hemingway. LLMs can quickly correct text formatting or even adopt a persona set by the user's query, such as: "Explain AI like I'm in elementary school."
However, the planning and reasoning required to comprehend a prompt about a modified version of chess are instance-level tasks (e.g., word choice or sentence structure) that require a deeper understanding of the individual elements within a text; that's why LLMs like OpenAI's ChatGPT failed to answer Carroll's prompt.
Today, we'll settle the debate on LLMs' planning and reasoning capabilities. First, we'll formally define planning and reasoning. Then, we'll address three common reasons why folks often conflate LLM text generation with logical thought. And as always, we'll close with a key takeaway.
PLANNING, REASONING, AND FORMAL LOGIC
Planning?
LLMs like OpenAI's ChatGPT are primarily designed to generate text based on patterns learned from large amounts of data. They cannot inherently plan or sequence actions over time toward a goal: they generate text one token at a time without explicitly modeling future outcomes or states. Planning requires a structured approach to foreseeing steps and consequences, which differs fundamentally from predicting the next word in a sequence.
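To see why next-token generation differs from planning, here's a toy sketch. The bigram table is a deliberate stand-in for a real model; the point is that each step consults only learned patterns from what came before, never a goal or a future state.

```python
import random

# Hypothetical "model": a bigram table mapping each token to the
# tokens that followed it in training data. Each generation step
# looks only at the current token -- there is no goal state.
BIGRAMS = {
    "<start>": ["the"],
    "the": ["cat", "dog"],
    "cat": ["sat", "ran"],
    "dog": ["sat"],
    "sat": ["<end>"],
    "ran": ["<end>"],
}

def generate(max_tokens=10, seed=0):
    random.seed(seed)
    token, output = "<start>", []
    while token != "<end>" and len(output) < max_tokens:
        token = random.choice(BIGRAMS[token])  # pick a plausible next token
        if token != "<end>":
            output.append(token)
    return " ".join(output)

print(generate())  # e.g., "the dog sat"
```

Nothing in this loop checks whether the sentence is heading anywhere useful; a planner, by contrast, would evaluate each step against a goal before committing to it.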
Reasoning?
LLMs can't reason if they can't generalize their knowledge to new, unseen tasks. The best example of this is multiplication and introductory algebra. Even after being fine-tuned on a vast dataset of three-digit multiplication problems, LLMs failed to solve five-digit multiplication. This suggests that while LLMs can perform well on familiar tasks, they may lack the ability to truly understand the underlying principles and apply them to novel situations.
You might say that current LLMs like OpenAI's GPT-4o ("o" for "omni") can do that correctly, and you'd be right. They may appear capable of complex tasks like five-digit multiplication. Still, their underlying mechanism relies on external tools like calculators or pre-programmed algorithms within an LLM-Modulo framework, where additional computational resources augment the LLM's capabilities.
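The LLM-Modulo idea can be sketched in a few lines: treat the model's answer as a candidate and check it with an external tool. Here `llm_multiply` is a hypothetical stand-in for a model call, deliberately made to drift on large inputs, while the verifier uses exact arithmetic.

```python
def llm_multiply(a, b):
    # Hypothetical stand-in for an LLM call: pretend the model
    # memorized small products but drifts on larger inputs.
    result = a * b
    return result if a < 1000 and b < 1000 else result + 7

def verified_multiply(a, b):
    """LLM-Modulo-style loop: propose with the 'model', check externally."""
    candidate = llm_multiply(a, b)
    if candidate == a * b:          # external verifier: exact arithmetic
        return candidate, "accepted"
    return a * b, "corrected by verifier"

print(verified_multiply(123, 456))      # (56088, 'accepted')
print(verified_multiply(12345, 67890))  # corrected by verifier
```

The reliability comes from the verifier, not the generator, which is precisely the point: the appearance of arithmetic competence doesn't require the model itself to reason.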
Formal Logic?
Formal logic helps us determine whether a conclusion follows from the given premises. It's like a set of guidelines for evaluating the reasoning behind an argument: symbols represent the statements, and specific rules for manipulating those symbols determine the argument's validity. If you've taken a geometry or philosophy class, you might recognize the following statement: "If P implies Q, and P is true, then Q must also be true."
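That rule is modus ponens, and its validity can be checked by brute force over every truth assignment, a small Python sketch of how formal logic operates on symbols rather than meanings:

```python
from itertools import product

def implies(p, q):
    # Material implication: "P implies Q" is false only when P is
    # true and Q is false.
    return (not p) or q

# An argument is valid if the conclusion (Q) holds in every row
# where all the premises (P implies Q, and P) hold.
valid = all(
    q
    for p, q in product([True, False], repeat=2)
    if implies(p, q) and p
)
print(valid)  # True
```

No interpretation of what P or Q "mean" is needed; validity falls out of the symbol manipulation alone, which is what makes formal systems verifiable in a way LLM outputs are not.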
The Stanford Research Institute Problem Solver (STRIPS) is an excellent example of a formal planner. This framework breaks a problem down into a series of smaller sub-problems and leverages a heuristic search algorithm to find a solution to each one. Unlike an LLM-Modulo setup, the planning system doesn't need an external verifier to determine whether its solution is correct. STRIPS has been used to tackle various AI problems, including scheduling and planning.
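Here's a minimal STRIPS-style planner on a toy blocks-world domain of our own invention. Real STRIPS planners use heuristic search; plain breadth-first search is enough to show the core idea: states are sets of facts, and each action has preconditions, an add list, and a delete list.

```python
from collections import deque

# Toy domain (ours): each action maps to (preconditions, add list, delete list).
ACTIONS = {
    "pickup_A":  ({"A_on_table", "hand_empty"}, {"holding_A"}, {"A_on_table", "hand_empty"}),
    "stack_A_B": ({"holding_A", "B_on_table"}, {"A_on_B", "hand_empty"}, {"holding_A"}),
}

def plan(start, goal):
    """Breadth-first search over states until the goal facts all hold."""
    frontier = deque([(frozenset(start), [])])
    seen = {frozenset(start)}
    while frontier:
        state, steps = frontier.popleft()
        if goal <= state:                      # every goal fact is satisfied
            return steps
        for name, (pre, add, delete) in ACTIONS.items():
            if pre <= state:                   # action is applicable
                nxt = (state - delete) | add   # apply delete and add lists
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, steps + [name]))
    return None

print(plan({"A_on_table", "B_on_table", "hand_empty"}, {"A_on_B"}))
# ['pickup_A', 'stack_A_B']
```

Every plan this search returns is correct by construction, because an action can only fire when its preconditions provably hold, which is the guarantee LLM-generated plans lack.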
While LLMs can generate plans, their ability to produce truly feasible and verifiable schedules is limited. Unlike formal planning systems such as STRIPS, which rely on logical reasoning and constraint satisfaction, LLMs generate plans based on patterns and correlations learned from their training data. This approach can yield plans that are impractical, incomplete, or incorrect. For example, when I asked OpenAI's ChatGPT to plan a trip to Europe, it suggested itineraries that weren't verifiable: it didn't check whether train services ran in the morning or whether a museum was open on Monday. It just confidently spewed nonsense.
LLMs also inherently lack a predictable order or plan because their outputs are sampled probabilistically. In other words, an LLM can take a different route with the same input, making the consistent outputs that formal planning presupposes difficult to guarantee.
LLM REASONING, REFUTED
Here are three common reasons why many believe LLMs can reason:
1. LLMs Can Generate Code.
They can generate and correct code, find bugs, and develop multiple real solutions with few hallucinations. Over 37,000 organizations pay for their 1.3 million developers to use the most popular AI coding tool, GitHub Copilot. So, what gives?
LLMs can retrieve code snippets, but some mistakenly believe the models are deploying reasoning to write that code. In reality, LLMs aren't trained only on the final, correct code but on entire code repositories, where they've seen every iteration of the code and precisely what changed between versions. When you prompt an LLM for the first time, it effectively retrieves the first version of the code, and when you ask it to find the errors, it retrieves the next version of the same code, giving you the illusion of reasoning and understanding.
2. We Can Fine-Tune LLMs to Plan in a Specific Domain.
Fine-tuning involves training an AI model on a specific dataset to improve its performance on particular tasks.
Fine-tuning may not be as effective as we initially thought. Dr. Scott Barnett, Deputy Head of AI Centric Applications, and his team from Deakin University (DU) in Australia recently investigated the impact of fine-tuning on the performance of LLMs within Retrieval-Augmented Generation (RAG) systems. They used three open-source datasets from different scientific domains to evaluate fine-tuned LLMs against baseline AI models. The models assessed included Mistral AI's Mistral Chat, Meta's Llama 2, and OpenAI's GPT-4, with fine-tuning conducted on varying amounts of domain-specific data, and the evaluation focused on the accuracy and completeness of the models' responses. Surprisingly, the results indicate that fine-tuning generally led to a decline in performance across multiple domains. Increasing the sample size for fine-tuning didn't necessarily improve performance either and, in some cases, decreased accuracy and completeness. This study highlights the need to properly validate fine-tuned AI models for domain-specific tasks.
3. LLMs Outperform Humans on Expertise Tests.
While it's true that LLMs mop the floor with humans across various exams, from the Family Medicine Board Exams (FMBE) to official creativity challenges, these victories likely reflect their text-prediction capabilities.
To test this hypothesis, researchers from the University of Technology Sydney (UTS) in Australia introduced a linguistic benchmark to assess the limitations of LLMs across several cognitive domains. Despite LLMs' impressive capabilities, the study identified significant deficiencies in tasks like logical reasoning, spatial intelligence, and linguistic understanding that humans can easily perform. The best LLMs scored between 16% (Google's Gemini Pro) and 38% (OpenAI's GPT-4 Turbo), versus human adults' benchmark score of 86%.
KEY TAKEAWAY
Some decisions can be outsourced to AI, such as picking a Netflix show, finding the perfect YouTube video, or discovering an Amazon item, without significant downsides. But we cannot outsource thought to LLMs because they can't think. And even if they could, we shouldn't. Technology must augment human intelligence, not degrade it.
FINAL NOTE
If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.
How was today's newsletter?
TAIP Review of the Week
"I LOVE the Sunday Special deep dives into AI's current limitations."
REFER & EARN
Your Friends Learn, You Earn!
You currently have 0 referrals, only 1 away from receiving the Ultimate Prompt Engineering Guide.
Refer 5 friends to enter October's $200 Gift Card Giveaway.
Copy and paste this link to others: https://theaipulse.beehiiv.com/subscribe?ref=PLACEHOLDER