
Welcome back, AI prodigies!
In today's Sunday Special:
🤖How Chatbots Uncover Context
⚙️Prediction Machines
⚙️The Revolution: Scale and Intelligence
🔑Key Takeaway
Read Time: 7 minutes
📚Key Terms
Large Language Models (LLMs): deep-learning models that understand and generate text in a human-like fashion. Deep learning involves finding patterns in data without being told beforehand which answers are right or wrong. For example, Google Photos automatically categorizes your photos into albums.
Word Vectors: long lists of numbers that let computers mathematically represent the meaning and context of words by placing them in a multi-dimensional space, where words with similar meanings sit closer together.
Attention Heads: individual components in a language model that work together to help the model understand complex language patterns and relationships by focusing on different aspects of the input text.
🤖HOW CHATBOTS UNCOVER CONTEXT
Hundreds of millions of people have used text-generating chatbots in the last year, yet few understand how they work. We've mentioned that LLMs are trained to "predict" the next word and require huge amounts of training data from the internet. But how exactly do they predict the next word? In truth, we don't fully know. Researchers understand the models' inner mechanics but often have trouble pointing to the specific processes that generated a particular word. Here's what they do know: chatbots use word vectors and attention heads to interpret language, and they require extensive training to produce the results we've come to expect.
To understand how language models work, we must first know how they represent words. Human beings represent English words with a sequence of letters, like D-O-G for dog. Language models instead use a long list of numbers called a word vector. Each word vector represents a point in an imaginary "word space," and words with more similar meanings are placed closer together. For example, the words closest to dog in vector space include cat, puppy, and pet. Representing words with vectors of real numbers (as opposed to a string of letters, like "D-O-G") enables mathematical operations that letters don't. Words are too complex to represent in only two dimensions, so language models use vector spaces with hundreds or even thousands of dimensions. Each dimension, represented by a number, captures some facet of a word's meaning or context, much like the contextual understanding a human might acquire while reading a story.
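To make the idea concrete, here is a toy sketch in Python. The four-dimensional vectors below are invented for illustration (real models learn hundreds or thousands of dimensions from data); cosine similarity is one standard way to measure how close two word vectors are.

```python
import math

# Toy 4-dimensional word vectors. These numbers are invented for
# illustration; a real model learns its vectors during training.
vectors = {
    "dog":    [0.90, 0.80, 0.10, 0.20],
    "puppy":  [0.85, 0.90, 0.15, 0.25],
    "cat":    [0.80, 0.70, 0.20, 0.10],
    "banana": [0.10, 0.05, 0.90, 0.80],
}

def cosine_similarity(a, b):
    """Score how strongly two word vectors point in the same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Words with similar meanings sit closer together in vector space:
print(cosine_similarity(vectors["dog"], vectors["puppy"]))   # high
print(cosine_similarity(vectors["dog"], vectors["banana"]))  # low
```

The exact numbers don't matter; what matters is that "dog" scores much closer to "puppy" than to "banana," which is exactly the geometry the model exploits.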
Computers can also "reason" about words using vector arithmetic. For example, take the vector for fastest, subtract fast, and add slow. The word closest to the resulting vector is slowest. Therefore, the resulting analogy is that fast is to fastest as slow is to slowest. A simple word-vector scheme like this doesn't capture an essential fact about natural language: words often have multiple meanings, often with slight nuances. For instance, "set" can refer to several things, varying by part of speech and context. A dinner table can be set (adjective), a volleyball can be set (verb), or a collection of dinner plates and cutlery can be a set (noun). LLMs would use more similar vectors for the 1st and 3rd senses of "set." Clarifying these ambiguities requires understanding facts about the world. But since computers can't reason the way humans do, they have to predict words using math.
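That vector arithmetic can be sketched in a few lines. The two-dimensional vectors below are invented so that one dimension loosely encodes "speed direction" and the other "superlative degree"; real embeddings learn such structure implicitly.

```python
# Invented 2-D vectors: dimension 0 ~ "fast vs. slow", dimension 1 ~ "superlative degree".
vectors = {
    "fast":    [1.0, 0.0],
    "fastest": [1.0, 1.0],
    "slow":    [-1.0, 0.0],
    "slowest": [-1.0, 1.0],
}

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def nearest_word(v, vocab):
    """Return the vocabulary word whose vector is closest (squared distance) to v."""
    return min(vocab, key=lambda w: sum((x - y) ** 2 for x, y in zip(vocab[w], v)))

# fastest - fast + slow lands exactly on slowest in this toy space.
result = add(sub(vectors["fastest"], vectors["fast"]), vectors["slow"])
print(nearest_word(result, vectors))  # slowest
```

Subtracting fast removes the "fast direction," adding slow points the vector the other way, and the superlative dimension carries over, so the nearest word is slowest.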
⚙️PREDICTION MACHINES
GPT-3, a close predecessor of the model behind ChatGPT, uses 96 "layers" to predict the next word. Each layer takes a sequence of word vectors (a sentence) as input. Then, it adds information to help clarify the meaning of each word and better predict which word might come next. Let's say you ask ChatGPT to fill in the blank: She set the [blank] for her partner on Venice Beach. Each layer, or round of analysis, has multiple attention heads. Each attention head might identify a particular element of each word: part of speech, modifying relationship to another word, definition, or companion words (e.g., Donald is typically followed by Trump). To resolve ambiguities, an attention head might compare the numerical values assigned to "set" with the values associated with each contextual definition of "set," of which there may be hundreds. Next, it might reconcile them with the values of the words surrounding "set." Finally, the model will predict the following word (ball) based on its proximity to the coordinates associated with each candidate word, fine-tuned during each layer of analysis. In its most advanced form, each layer of GPT-3 contains 96 attention heads, so GPT-3 performs 9,216 attention-based operations each time it predicts a new word.
⚙️THE REVOLUTION: SCALE AND INTELLIGENCE
Early predictive models required humans to label training examples. For example, training data might have been photos of dogs or cats with a human-supplied label ("dog" or "cat") for each image. Thus, creating large training data sets (millions of examples) was impossible without thousands of human laborers. LLMs, on the other hand, don't need explicitly labeled data. Instead, they learn by trying to predict the next word in ordinary text passages. Almost any written material, from Wikipedia pages to news articles to computer code, is suitable for training as long as it is reasonably well written. As the text (500 billion words!) is fed into the model, attention heads activate, iterating through millions of calculations to refine the model's accuracy. For context, the typical human child encounters roughly 100 million words by age 10. What's more, OpenAI estimates that it took more than 300 quintillion floating-point calculations to train GPT-3, or 40 times the number of grains of sand in the world.
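The self-labeling trick is easy to sketch: every position in ordinary text yields its own (context, next word) training pair, so no human annotator is needed. The sentence below is just an illustrative stand-in for the web-scale corpora actually used.

```python
# Ordinary text labels itself: each position supplies a training pair of
# (all the words so far, the next word to predict).
text = "she set the table for her partner"
tokens = text.split()

training_pairs = [
    (tokens[:i], tokens[i])  # context so far -> word to predict
    for i in range(1, len(tokens))
]

for context, target in training_pairs:
    print(" ".join(context), "->", target)
```

Run on real corpora, this scheme turns 500 billion words of raw text into hundreds of billions of training examples for free, which is exactly why removing the human labeler unlocked today's scale.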
This computational milestone has created LLMs with far more abstract "reasoning" capability than their predecessors. For example, consider the following prompt:
"Here is a bag filled with towels. There are no water bottles in the bag. Yet, the label on the bag says 'water bottles' and not 'towels.' John finds the bag. He had never seen the bag before. He cannot see what is inside the bag. He reads the label."
John likely believes the bag contains water bottles and will be surprised to discover towels inside. This capacity to reason about other people's mental states is known as "theory of mind" and is typically developed by age 5. Earlier this year, Stanford psychologist Michal Kosinski published research examining the ability of LLMs to solve theory-of-mind tasks. He gave various language models passages like the one quoted above and then asked them to complete a sentence like "he believes that the bag is full of." The correct answer is "water bottles," but an unsophisticated language model might say "towels" or something else. GPT-1 and GPT-2 flunked this test. But the first version of GPT-3 got it right almost 40 percent of the time, about the same as the average three-year-old. GPT-4 answered about 95 percent of theory-of-mind questions correctly.
🔑KEY TAKEAWAY
AI practitioners must understand how LLMs generate text in order to fine-tune them and mitigate bias, hallucinations, and other inaccuracies. Unfortunately, tracing how specific words are generated is so tedious and multidimensional that only a tiny fraction of the world has the expertise to try. As always, the burden falls on end users to exercise sound judgment when using and applying text outputs in their own lives.
📌FINAL NOTE
If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.
How was today's newsletter?
❤️AI Pulse Review of the Week
"It feels like the future is here. Thanks for doing this service."
📝NOTION TEMPLATES
🚨Subscribe to our newsletter for free and receive these powerful Notion templates:
✍️150 ChatGPT prompts for Copywriting
✍️325 ChatGPT prompts for Email Marketing
📋Simple Project Management Board
⏱Time Tracker
