Chatbots Think Like Humans (Sort Of)
PLUS: Visualize the Scale and Smarts of ChatGPT

Welcome back, AI prodigies!
In today's Sunday Special:
How Chatbots Uncover Context
Prediction Machines
The Revolution: Scale and Intelligence
Key Takeaway
Read Time: 7 minutes
Key Terms
Large Language Models (LLMs): deep-learning models that understand and generate text in a human-like fashion. Deep learning involves finding patterns in data without knowing which answers are right or wrong beforehand. For example, Google Photos automatically categorizes your photos into albums.
Word Vectors: lists of numbers that let computers mathematically represent the meaning and context of words by placing them in a multi-dimensional space where words with similar meanings sit closer together.
Attention Heads: individual components in a language model that work together to help the model understand complex language patterns and relationships by focusing on different aspects of the input text.
HOW CHATBOTS UNCOVER CONTEXT
Hundreds of millions of people have used text-generating chatbots in the last year, yet few understand how they work. We've mentioned that LLMs are trained to "predict" the next word and require huge amounts of training data from the internet. But how exactly do they predict the next word? The honest answer is that we don't fully know. Researchers understand the models' inner mechanics but often have trouble pointing to the specific processes that generated a particular word. Here's what they do know: chatbots use word vectors and attention heads to interpret language, and they require extensive training to produce the results we've come to expect.
To understand how language models work, we must first know how they represent words. Humans represent English words with a sequence of letters, like D-O-G for dog. Language models instead use a long list of numbers called a word vector (see the toy sketch below). Each word vector represents a point in an imaginary "word space," and words with more similar meanings are placed closer together. For example, the words closest to dog in vector space include cat, puppy, and pet. Representing words with vectors of real numbers (as opposed to a string of letters like "D-O-G") enables mathematical operations that letters don't allow. Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. Each dimension, represented by a number, captures some piece of the contextual understanding a human might acquire while reading a story.
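For intuition, here is a minimal toy sketch in Python. The vocabulary and four-dimensional vectors are invented for illustration (real models learn vectors with hundreds or thousands of dimensions); cosine similarity then ranks which words sit closest to dog in that space.

```python
# Toy word vectors (made up for illustration, not from any real model).
import math

word_vectors = {
    "dog":    [0.90, 0.10, 0.80, 0.20],
    "puppy":  [0.85, 0.15, 0.75, 0.25],
    "cat":    [0.80, 0.20, 0.70, 0.30],
    "pet":    [0.75, 0.20, 0.80, 0.30],
    "banana": [0.10, 0.90, 0.20, 0.80],
}

def cosine_similarity(a, b):
    """Higher value = the two vectors point in more similar directions."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Rank every other word by its similarity to "dog".
neighbors = sorted(
    (w for w in word_vectors if w != "dog"),
    key=lambda w: cosine_similarity(word_vectors["dog"], word_vectors[w]),
    reverse=True,
)
print(neighbors)  # ['puppy', 'cat', 'pet', 'banana'] -- similar meanings rank highest
```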
Computers can also "reason" about words using vector arithmetic. For example, take the vector for fastest, subtract fast, and add slow. The word closest to the resulting vector is slowest. The resulting analogy: fast is to fastest as slow is to slowest. A simple word vector scheme like this still doesn't capture an essential fact about natural language: words often have multiple meanings, often with slight nuances. For instance, "set" can refer to several things, varying by part of speech and context. A dinner table can be set (adjective), a volleyball can be set (verb), or a collection of dinner plates and cutlery can be a set (noun). LLMs would use more similar vectors for the 1st and 3rd versions of "set." Clarifying these ambiguities requires understanding facts about the world. But since computers can't reason the way humans do, they have to resolve the ambiguity, and predict the next word, using math.
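The vector-arithmetic trick can be sketched the same way. The three-dimensional vectors below are made up so the arithmetic works out cleanly; a real model learns them from data.

```python
# Toy analogy: fastest - fast + slow should land nearest to slowest.
import math

vectors = {
    "fast":    [0.8, 0.1, 0.1],
    "fastest": [0.8, 0.9, 0.1],
    "slow":    [0.1, 0.1, 0.8],
    "slowest": [0.1, 0.9, 0.8],
    "dog":     [0.5, 0.2, 0.4],
}

def nearest(query, exclude):
    """Return the vocabulary word whose vector is closest to `query`."""
    return min(
        (w for w in vectors if w not in exclude),
        key=lambda w: math.dist(query, vectors[w]),
    )

query = [fst - f + s for fst, f, s in zip(vectors["fastest"], vectors["fast"], vectors["slow"])]
print(nearest(query, exclude={"fastest", "fast", "slow"}))  # slowest
```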
PREDICTION MACHINES
GPT-3, the model behind ChatGPT, uses 96 "layers" to predict the next word. Each layer takes a sequence of word vectors (a sentence) as input. Then it adds information that helps clarify the meaning of each word and better predict which word might come next. Let's say you ask ChatGPT to fill in the blank: She set the [blank] for her partner on Venice Beach. Each layer, or round of analysis, has multiple attention heads. Each attention head might identify a particular element of each word: part of speech, modifying relationship to another word, definition, or companion words (e.g., Donald is typically followed by Trump). To resolve ambiguities, an attention head might compare the numerical values assigned to "set" with the values associated with each contextual definition of "set," of which there may be hundreds. Next, it might reconcile them with the values of the words surrounding "set." Finally, the model predicts the following word (ball) based on its proximity to the coordinates associated with each word, which were fine-tuned during each layer of analysis. In its most advanced form, each layer of GPT-3 contains 96 attention heads, so GPT-3 performs 9,216 attention-based operations each time it predicts a new word.
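For a feel of what one attention head actually computes, here is a minimal NumPy sketch, not GPT-3's actual code: each word is scored against every other word in the sentence, the scores become weights, and the word vectors are blended accordingly. The sizes and random matrices are placeholders for what a real model learns during training.

```python
# A minimal sketch of a single attention head (scaled dot-product attention).
import numpy as np

def attention_head(X, Wq, Wk, Wv):
    """X: (num_words, dim) word vectors; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])         # relevance of every word to every other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: scores become mixing weights
    return weights @ V                              # blend context into each word's vector

rng = np.random.default_rng(0)
dim = 8                                             # real models use thousands of dimensions
X = rng.normal(size=(6, dim))                       # stand-ins for six word vectors in a sentence
Wq, Wk, Wv = (rng.normal(size=(dim, dim)) for _ in range(3))
print(attention_head(X, Wq, Wk, Wv).shape)          # (6, 8): one context-enriched vector per word

# GPT-3 stacks 96 layers of 96 such heads: 96 * 96 = 9,216 operations per predicted word.
```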
THE REVOLUTION: SCALE AND INTELLIGENCE
Early predictive models required humans to label training examples. For example, training data might have been photos of dogs or cats with a human-supplied label ("dog" or "cat") for each image. Thus, creating large training data sets (millions of labeled examples) was impossible without thousands of human laborers. LLMs, on the other hand, don't need explicitly labeled data. Instead, they learn by trying to predict the next word in ordinary text passages. Almost any written material, from Wikipedia pages to news articles to computer code, is suitable for training as long as it is reasonably well written. As the text (500 billion words!) is fed into the model, attention heads activate, iterating through millions of calculations to improve the model's accuracy. For context, the typical human child encounters roughly 100 million words by age 10. What's more, OpenAI estimates that it took more than 300 quintillion floating-point calculations to train GPT-3, or 40 times the number of grains of sand in the world.
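Here is a tiny sketch of why no human labels are required: the "labels" are simply the next words, which come free with any ordinary sentence.

```python
# Any sentence can be split into (context, next word) training pairs automatically.
text = "She set the ball for her partner on Venice Beach"
words = text.split()

training_pairs = [(words[:i], words[i]) for i in range(1, len(words))]
for context, target in training_pairs[:3]:
    print(" ".join(context), "->", target)
# She -> set
# She set -> the
# She set the -> ball
```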
This computational milestone has created LLMs with far more abstract "reasoning" capability than their predecessors. For example, consider the following prompt:
"Here is a bag filled with towels. There are no water bottles in the bag. Yet, the label on the bag says 'water bottles' and not 'towels.' John finds the bag. He had never seen the bag before. He cannot see what is inside the bag. He reads the label."
John likely believes the bag contains water bottles and will be surprised to discover towels inside. This capacity to reason about other people's mental states is known as "theory of mind" and is typically developed by age 5. Earlier this year, Stanford psychologist Michal Kosinski published research examining the ability of LLMs to solve theory-of-mind tasks. He gave various language models passages like the one quoted above, then asked them to complete a sentence like "he believes that the bag is full of." The correct answer is "water bottles," but an unsophisticated language model might say "towels" or something else. GPT-1 and GPT-2 flunked this test. But the first version of GPT-3 got it right almost 40 percent of the time, about the same as the average three-year-old. GPT-4 answered about 95 percent of theory-of-mind questions correctly.
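For illustration, the probe is framed as plain sentence completion. In this sketch, complete is a hypothetical stand-in for calling a real language model; it is not an actual API.

```python
# Hypothetical sketch of a theory-of-mind probe framed as next-word prediction.
scenario = (
    "Here is a bag filled with towels. There are no water bottles in the bag. "
    "Yet, the label on the bag says 'water bottles' and not 'towels.' "
    "John finds the bag. He had never seen the bag before. "
    "He cannot see what is inside the bag. He reads the label."
)
probe = "He believes that the bag is full of"

def complete(prompt: str) -> str:
    """Placeholder: a real test would send `prompt` to a language model."""
    return "water bottles"  # the answer a model that tracks John's beliefs should give

print(complete(scenario + " " + probe))
```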
KEY TAKEAWAY
To mitigate bias, hallucinations, and other inaccuracies, AI practitioners must understand how LLMs generate text before they can fine-tune them. Unfortunately, tracing how specific words are generated is so tedious and multidimensional that only a tiny fraction of the world has the expertise to try. As always, the burden falls on end-users to exercise sound judgment when using and applying text outputs in their own lives.
FINAL NOTE
If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.
How was today's newsletter?
AI Pulse Review of the Week
"It feels like the future is here. Thanks for doing this service."
NOTION TEMPLATES
Subscribe to our newsletter for free and receive these powerful Notion templates:
150 ChatGPT Prompts for Copywriting
325 ChatGPT Prompts for Email Marketing
Simple Project Management Board
Time Tracker