🧠 Chatbots Think Like Humans—Sort Of

PLUS: Visualize the Scale and Smarts of ChatGPT

Welcome back, AI prodigies!

In today’s Sunday Special:

  • 🤖How Chatbots Uncover Context

  • ⚙️Prediction Machines

  • ⚔️The Revolution: Scale and Intelligence

  • 🔑Key Takeaway

Read Time: 7 minutes

🎓Key Terms

  • Large Language Models (LLMs): deep-learning models that understand and generate text in a human-like fashion. Deep learning finds patterns in data using layered neural networks, often without being told which answers are right or wrong beforehand. For example, Google Photos automatically groups your photos into albums without anyone labeling them first.

  • Word Vectors: long lists of numbers that let computers mathematically represent the meaning and context of words by placing them in a multi-dimensional space where words with similar meanings sit closer together.

  • Attention Heads: individual components in a language model that work together to help the model understand complex language patterns and relationships by focusing on different aspects of the input text.

🤖HOW CHATBOTS UNCOVER CONTEXT

Hundreds of millions of people have used text-generating chatbots in the last year, yet few understand how they work. We’ve mentioned that LLMs are trained to “predict” the next word and require huge amounts of training data from the internet. But how exactly do they predict the next word? In truth, we don’t fully know. Researchers understand the models’ inner mechanics but often have trouble pointing to the specific processes that generated a particular word. Here’s what they do know: chatbots use word vectors and attention heads to interpret language, and they require extensive training to produce the results we’ve come to expect.

To understand how language models work, we must first know how they represent words. Humans represent English words with a sequence of letters, like D-O-G for dog. Language models instead use a long list of numbers called a word vector; the sketch below shows one toy way to represent dog as a vector. Each word vector marks a point in an imaginary “word space,” and words with more similar meanings are placed closer together. For example, the words closest to dog in vector space include cat, puppy, and pet. Representing words with vectors of real numbers (as opposed to a string of letters, like “D-O-G”) enables mathematical operations that letters don’t. Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. Each dimension, represented by a number, captures some facet of a word’s meaning or usage, loosely analogous to the contextual understanding a human might acquire while reading a story.
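
To make this concrete, here’s a minimal sketch in Python with made-up three-dimensional vectors (real models learn these values during training and use hundreds or thousands of dimensions), showing how similarity in vector space puts dog nearer to cat and puppy than to car:

```python
import numpy as np

# Toy word vectors (illustrative values only; real models learn these
# during training and use hundreds or thousands of dimensions).
vectors = {
    "dog":   np.array([0.8, 0.3, 0.1]),
    "cat":   np.array([0.7, 0.4, 0.1]),
    "puppy": np.array([0.8, 0.2, 0.2]),
    "car":   np.array([0.1, 0.9, 0.7]),
}

def cosine_similarity(a, b):
    """Measure how closely two word vectors point in the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for word in ("cat", "puppy", "car"):
    print(word, round(cosine_similarity(vectors["dog"], vectors[word]), 3))
# dog is far more similar to cat and puppy than to car.
```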

Computers can also “reason” about words using vector arithmetic. For example, take the vector for fastest, subtract fast, and add slow. The word closest to the resulting vector is slowest; in other words, fast is to fastest as slow is to slowest. Still, a simple word-vector scheme like this misses an essential fact about natural language: words often have multiple meanings, sometimes with only slight nuances between them. For instance, “set” can refer to several things, varying by part of speech and context. A dinner table can be set (adjective), a volleyball can be set (verb), or a collection of dinner plates and cutlery can be a set (noun). LLMs would use more similar vectors for the 1st and 3rd senses of “set.” Resolving these ambiguities requires understanding facts about the world, but since computers can’t reason the way humans do, they have to predict words using math.
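
And here’s a minimal sketch of that analogy arithmetic, again with toy vectors chosen so the geometry works out; a real model would use learned word vectors:

```python
import numpy as np

# Illustrative 2-D vectors laid out so the "superlative" direction is shared.
vectors = {
    "fast":    np.array([0.9, 0.1]),
    "fastest": np.array([0.9, 0.9]),
    "slow":    np.array([0.1, 0.1]),
    "slowest": np.array([0.1, 0.9]),
}

# fastest - fast + slow should land near slowest.
target = vectors["fastest"] - vectors["fast"] + vectors["slow"]

closest = min(vectors, key=lambda w: np.linalg.norm(vectors[w] - target))
print(closest)  # slowest
```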

⚙️PREDICTION MACHINES

GPT-3, the model family behind the original ChatGPT, uses 96 “layers” to predict the next word. Each layer takes a sequence of word vectors (a sentence) as input, then adds information to each vector to help clarify the meaning of that word and better predict which word might come next. Let’s say you ask ChatGPT to fill in the blank: She set the [blank] for her partner on Venice Beach. Each layer, or round of analysis, has multiple attention heads. Each attention head might identify a particular element of each word: its part of speech, a modifying relationship to another word, a definition, or companion words (e.g., Donald is typically followed by Trump). To resolve ambiguities, an attention head might compare the numerical values assigned to “set” with the values associated with each contextual definition of “set,” of which there may be hundreds. Next, it might reconcile them with the values of the words surrounding “set.” Finally, the model predicts the following word (ball) based on its proximity to the coordinates associated with each candidate word, coordinates refined during every layer of analysis. In its largest version, each of GPT-3’s 96 layers contains 96 attention heads, so the model performs 9,216 (96 × 96) attention operations each time it predicts a new word.
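
For the curious, here’s a stripped-down sketch of the scaled dot-product attention at the heart of each attention head. The five tokens, 4-dimensional vectors, and random weight matrices are toy placeholders; GPT-3’s largest version uses learned weights and 12,288-dimensional vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "sentence" of 5 tokens, each represented by a 4-dimensional vector.
tokens = ["She", "set", "the", "[blank]", "for"]
x = rng.normal(size=(5, 4))

# Each attention head has its own learned projections for queries, keys,
# and values. Here they are random placeholders.
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v

# Scaled dot-product attention: every token scores every other token,
# then mixes in information from the tokens it attends to most.
scores = Q @ K.T / np.sqrt(K.shape[-1])
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
output = weights @ V  # updated vectors, enriched with context

print(np.round(weights[1], 2))  # how strongly "set" attends to each token
```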

⚔️THE REVOLUTION: SCALE AND INTELLIGENCE

Early predictive models required humans to label training examples. For example, the training data might have been photos of dogs or cats with a human-supplied label (“dog” or “cat”) for each image. Thus, creating large training sets (millions of examples) was impossible without thousands of human laborers. LLMs, on the other hand, don’t need explicitly labeled data. Instead, they learn by trying to predict the next word in ordinary text passages. Almost any written material, from Wikipedia pages to news articles to computer code, is suitable for training as long as it’s reasonably well written. As the text (500 billion words!) is fed into the model, attention heads activate, performing vast numbers of calculations to refine the model’s predictions. For context, the typical human child encounters roughly 100 million words by age 10. What’s more, OpenAI estimates that it took more than 300 quintillion floating-point calculations to train GPT-3, or 40 times the number of grains of sand in the world.
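
Here’s a minimal sketch of the self-supervised idea: every position in ordinary text already supplies its own “label,” namely the word that actually came next. (Splitting on whitespace is a simplification; real models split text into subword tokens.)

```python
# Turn raw text into (context, next word) training pairs.
# No human labeling needed: the text itself supplies the answers.
text = "She set the ball for her partner on Venice Beach"
words = text.split()  # real models use subword tokenizers, not whitespace

examples = []
for i in range(1, len(words)):
    context = words[:i]   # everything seen so far
    target = words[i]     # the word the model must learn to predict
    examples.append((context, target))

for context, target in examples[:4]:
    print(" ".join(context), "->", target)
# She -> set
# She set -> the
# She set the -> ball
# She set the ball -> for
```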

This computational milestone has created LLMs with far more abstract “reasoning” capability than their predecessors. For example, consider the following prompt:

“Here is a bag filled with towels. There are no water bottles in the bag. Yet, the label on the bag says ‘water bottles’ and not ‘towels.’ John finds the bag. He had never seen the bag before. He cannot see what is inside the bag. He reads the label.”

(OpenAI’s ChatGPT)

John likely believes the bag contains water bottles and will be surprised to discover towels inside. This capacity to reason about other people’s mental states is known as “theory of mind,” and humans typically develop it by age 5. Earlier this year, Stanford psychologist Michal Kosinski published research examining the ability of LLMs to solve theory-of-mind tasks. He gave various language models passages like the one quoted above, then asked them to complete a sentence like “he believes that the bag is full of.” The correct answer is “water bottles,” but an unsophisticated language model might say “towels” or something else. GPT-1 and GPT-2 flunked this test, but the first version of GPT-3 got it right almost 40 percent of the time, about the same as the average three-year-old. GPT-4 answered about 95 percent of theory-of-mind questions correctly.
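
If you want to run this kind of test yourself, here’s a minimal sketch using OpenAI’s Python SDK; the model name and exact prompt wording are illustrative placeholders, not the ones used in Kosinski’s study.

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in your environment

passage = (
    "Here is a bag filled with towels. There are no water bottles in the bag. "
    "Yet, the label on the bag says 'water bottles' and not 'towels.' "
    "John finds the bag. He had never seen the bag before. "
    "He cannot see what is inside the bag. He reads the label."
)

# Ask the model to complete the belief statement, as in a theory-of-mind test.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model, not the one used in the study
    messages=[{
        "role": "user",
        "content": passage + "\n\nComplete this sentence in one or two words: "
                             "He believes that the bag is full of",
    }],
)

print(response.choices[0].message.content)  # a model with theory of mind says "water bottles"
```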

🔑KEY TAKEAWAY

To mitigate bias, hallucinations, and other inaccuracies, AI practitioners must understand how LLMs generate text before they can fine-tune them. Unfortunately, tracing how specific words are generated is so painstaking and high-dimensional that only a tiny fraction of people have the expertise to try. As always, the burden falls on end users to exercise sound judgment when applying text outputs in their own lives.

📒FINAL NOTE

If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.

How was today’s newsletter?

❤️AI Pulse Review of The Week

It feels like the future is here. Thanks for doing this service.

-Jai (⭐️⭐️⭐️⭐️⭐️Nailed it!)

🎁NOTION TEMPLATES

🚨Subscribe to our newsletter for free and receive these powerful Notion templates:

  • ⚙️150 ChatGPT prompts for Copywriting

  • ⚙️325 ChatGPT prompts for Email Marketing

  • 📆Simple Project Management Board

  • ⏱Time Tracker
