🧠 Why Meta's Tiny AI Models Matter
PLUS: What Apple Intelligence's Delayed Launch Tells Us About the Future of Computing
Welcome back, AI prodigies!
In today's Sunday Special:
⚡️Energy Is Essential
📦Supply Side Shortages
🤖A New AI Model?
The Solution
Key Takeaway
Read Time: 8 minutes
Key Terms
Small Language Models (SLMs): AI models that require less computational power and memory, making them cost-effective and quick to train and deploy.
Graphics Processing Unit (GPU): a specialized computer chip capable of parallel processing (i.e., performing mathematical calculations simultaneously), making it ideal for complex applications like generative AI (GenAI).
Search-Augmented LLMs: LLMs that retrieve up-to-date information from external knowledge bases like the internet.
Capital Expenditures (CapEx): the funds a company uses to purchase, improve, or maintain long-term assets essential for its operations.
Tokens: the smallest units of data an AI model uses to process and generate text, much like we break down sentences into words or characters.
🩺 PULSE CHECK
What do you think is the biggest challenge to widespread adoption of AI? Vote below to view live results.
⚡️ENERGY IS ESSENTIAL
Before the mass adoption of generative AI (GenAI), we must address energy constraints. At current standards, the world's power grids won't meet the expected demand for AI-enabled products and services and their accompanying infrastructure.
Given this reality, powerful sub-billion-parameter Small Language Models (SLMs) are the future. Meta has proposed various algorithmic innovations to create MobileLLM, a family of AI models optimized for on-device applications that prioritizes AI model architecture over data and parameter quantity. This new state-of-the-art AI model may soon become the standard at scale, keeping the great promises AI enthusiasts envision from remaining just promises.
Look at the distribution challenges AI will face in the coming years and you'll see a very long tail, especially given the industry's struggle to keep up with demand.
But before we convince you that the answer is sub-billion-parameter SLMs, let's consider the scale of the energy challenge.
📦SUPPLY SIDE SHORTAGES
Assuming the status quo continues, we might soon face a real Graphics Processing Unit (GPU) shortage.
Before you jump to "we already had a shortage not too long ago," yes, we did, but unprecedented Capital Expenditures (CapEx) drove that one: Nvidia couldn't keep up with Big Tech companies pouring billions of dollars into massive GPU data centers built for future demand that doesn't yet exist. In other words, there was a short-term mismatch between investments in GenAI and the technology's revenue. The next shortage could instead be a shortage of GPUs relative to end-user demand once LLMs are fully integrated into products and services like Google Search.
According to Meta {Appendix I}, in a future where most humans use LLMs just 5% of their day, we would need 100 million Nvidia H100 Tensor Core GPUs to power OpenAI's GPT-4, assuming an acceptable generation speed of 50 tokens per second and a very short average sequence length, which refers to the number of tokens processed within a prompt.
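As a rough sanity check, the arithmetic behind an estimate like that can be reproduced in a few lines of Python. The per-token compute and per-GPU throughput below are our own illustrative assumptions (roughly 2x a GPT-4-class model's active parameters per token, and a realistic sustained fraction of an H100's peak), not figures from this newsletter:

```python
# Back-of-envelope check on the 100-million-GPU estimate.
WORLD_POPULATION = 8e9        # people
USAGE_SHARE = 0.05            # LLMs used 5% of each person's day
TOKENS_PER_SECOND = 50        # per-user generation speed
FLOPS_PER_TOKEN = 300e9       # ~2x active parameters, GPT-4-class (assumption)
GPU_THROUGHPUT = 60e12        # sustained H100 FLOPs/s (assumption)

concurrent_users = WORLD_POPULATION * USAGE_SHARE         # 400 million at any instant
tokens_per_second = concurrent_users * TOKENS_PER_SECOND  # 2e10 tokens/s worldwide
flops_per_second = tokens_per_second * FLOPS_PER_TOKEN    # 6e21 FLOPs/s
gpus_needed = flops_per_second / GPU_THROUGHPUT

print(f"{gpus_needed:,.0f} GPUs")  # -> 100,000,000
```

Under those assumptions, Meta's usage scenario lands right at the 100 million figure.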
While such numbers may sound nonsensical, that future isn't far off. As noted, LLMs will supercharge Google's AI Overviews: a feature that provides AI-generated summaries at the top of Google Search results.
Google Search is used 8.5 billion times per day, and according to research by SemiAnalysis, GenAI-enhanced Google Search could consume an average of 9 watt-hours (Wh) per query.
Assuming at least 60% of all searches eventually trigger GenAI generations, the total energy demand would be roughly 17 terawatt-hours (TWh) per year, or 17 million megawatt-hours (MWh). For reference, xAI's upcoming 100,000 Nvidia H100 Tensor Core GPU cluster, the largest in the world, will draw a mere 140 megawatts (MW), about 1.2 terawatt-hours (TWh) per year if run around the clock.
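Those totals are just the per-search figures annualized; a quick check using the newsletter's own inputs:

```python
# Annualizing the per-search figures cited above.
SEARCHES_PER_DAY = 8.5e9  # daily Google searches
GENAI_SHARE = 0.60        # share of searches answered with GenAI
WH_PER_SEARCH = 9         # SemiAnalysis estimate, watt-hours per GenAI search

wh_per_year = SEARCHES_PER_DAY * GENAI_SHARE * WH_PER_SEARCH * 365
print(f"{wh_per_year / 1e12:.1f} TWh per year")  # -> 16.8 TWh, i.e., ~17 TWh
```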
Now, you might say the GPUs don't all need to sit in one data center, and these computational demands will be distributed. On the contrary: a cluster of GPUs running a single LLM must be colocated in one data center, because the AI models, due to their size, need to be distributed across hundreds of GPUs. A single LLM demands continuous GPU-to-GPU communication, which requires costly cables that triple in price beyond the 50-meter (m) mark. And that's without factoring in latency, which would degrade the user experience.
🤖A NEW AI MODEL?
Envisioning the energy constraints from a GPU perspective is already daunting, but it's just the tip of the iceberg. The energy challenges worsen when considering the projected global growth in AI use and the emergence of even more powerful frontier AI models.
If the current compute and memory cost complexity of AI models (i.e., how expensive they are to run and store) holds, the estimates from the previous segment may not even be sufficient: doubling the input sequence of an LLM quadruples its compute and memory requirements.
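That quadratic growth comes from self-attention: every token attends to every other token, so doubling the sequence length quadruples the number of token-to-token comparisons. A minimal illustration:

```python
# Self-attention compares every token with every other token, so the
# attention matrix (and the compute to fill it) grows with the square
# of the sequence length.
def attention_matrix_entries(sequence_length: int) -> int:
    return sequence_length ** 2

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_matrix_entries(n):>12,} entries")
# 1,000 tokens ->    1,000,000 entries
# 2,000 tokens ->    4,000,000 entries (2x the tokens, 4x the cost)
# 4,000 tokens ->   16,000,000 entries
```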
While LLMs have conquered memorization by regurgitating most of the internet's data, their reasoning capabilities are modest.
Many consider search-augmented LLMs, or long-inference AI models, to be the solution. These LLMs explore the solution space instead of directly responding to user queries, generating up to millions of possible responses before settling for one.
Here's a breakdown of how search-augmented LLMs work:
Understanding Your Request: The LLM analyzes the prompt to grasp its meaning and intent.
Knowledge Base Search: It then taps into a vast external knowledge base, like a super-powered search engine.
Identifying Relevant Information: The LLM sifts through the external knowledge base to find the information that aligns with the prompt.
Enhancing the Prompt: The LLM incorporates the most relevant information into a more refined and detail-oriented prompt.
Generating the Response: The LLM leverages the more refined and detail-oriented prompt to generate a response.
Here's an analogy: Imagine a student writing an essay. The LLM is like the student who first understands the essay prompt. Then, the student consults a library (i.e., an external knowledge base) to find relevant sources. After identifying the key points from those sources, the student incorporates them into the essay (i.e., enhancing the prompt). Finally, the student uses their writing skills to craft the essay (i.e., generating the response).
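For the technically curious, here's a minimal sketch of that five-step loop in Python. The hard-coded knowledge base, the keyword matching, and the `call_llm` stub are all illustrative stand-ins, not any vendor's actual API:

```python
# A toy version of the five-step search-augmented loop above.
KNOWLEDGE_BASE = {  # stand-in for the web or a vector database
    "mobilellm": "MobileLLM is Meta's family of sub-billion-parameter on-device models.",
    "h100": "The Nvidia H100 is a data center GPU used to train and serve LLMs.",
}

def retrieve(prompt: str) -> list[str]:
    """Steps 2-3: search the knowledge base for passages relevant to the prompt."""
    return [text for key, text in KNOWLEDGE_BASE.items() if key in prompt.lower()]

def augment(prompt: str, passages: list[str]) -> str:
    """Step 4: fold the retrieved passages into a more detailed prompt."""
    return "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {prompt}"

def call_llm(enriched_prompt: str) -> str:
    """Step 5 stub: a real system would call a text-generation model here."""
    return f"[model response to: {enriched_prompt!r}]"

query = "How big is MobileLLM?"
print(call_llm(augment(query, retrieve(query))))
```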
This approach not only skyrockets average token usage but likely requires the development of verifiers or additional AI models to validate the LLM's search for the solution.
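That "explore, then verify" pattern can be sketched as best-of-N sampling: generate many candidates, score each with a verifier, keep the winner. Both functions below are hypothetical placeholders for real models:

```python
import random

def generate_candidate(prompt: str) -> str:
    # Placeholder: in practice, one sampled LLM response per call.
    return f"candidate {random.randint(0, 9)} for {prompt!r}"

def verifier_score(candidate: str) -> float:
    # Placeholder: in practice, a second model that rates correctness.
    return random.random()

def best_of_n(prompt: str, n: int = 1_000) -> str:
    # Every candidate costs a full generation, so token usage scales with n.
    return max((generate_candidate(prompt) for _ in range(n)), key=verifier_score)

print(best_of_n("What is 17 * 24?"))
```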
If this is the future of AI, then the numbers we saw above will fall short, with some requests far exceeding the 9-watt-hour (Wh) mark we discussed earlier.
According to the International Energy Agency (IEA), data center electricity demand in the U.S. and China is expected to reach approximately 710 terawatt-hours (TWh) annually by 2026. For reference, that's almost as much as France and Italy's combined energy consumption in 2022 of 720 terawatt-hours (TWh).
THE SOLUTION
For all those reasons, many are looking toward Edge AI, or "on-device" language models, as a possible solution. These AI models can run on mobile devices, eliminating the need for GPU data centers. However, as Apple Intelligence's reported delay of GenAI-enhanced Siri to Spring 2025 proves, cost-competitive devices aren't up to the challenge. But why?
Training and deploying LLMs at the data center scale is a complex balancing act. While personal devices can run simpler LLMs, their capabilities are limited by processing power and battery life.
Quality-wise, the best results in AI today come from AI models with file sizes well into the terabyte (TB) range. A terabyte (TB) is 1,000 times larger than a gigabyte (GB). Considering Apple Intelligence's on-device LLM, OpenELM, is only around 1.5 gigabytes (GB), and smaller AI models sacrifice too much quality, it's no wonder Apple may be stalling the release of GenAI-enhanced Siri.
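Those file sizes follow from a simple rule of thumb: parameter count times storage precision. The 3-billion-parameter, 4-bit figures below are our assumptions for an OpenELM-class model, and the 1.8-trillion-parameter figure is a rumored GPT-4-scale number used only for illustration:

```python
# Rough model file size: parameter count x bits per weight / 8 bytes.
def model_size_gb(num_parameters: float, bits_per_weight: int) -> float:
    return num_parameters * bits_per_weight / 8 / 1e9

# A 3B-parameter model quantized to 4 bits per weight (assumption)
# lands near the 1.5 GB cited above.
print(model_size_gb(3e9, 4))       # -> 1.5 (GB)

# A frontier model at ~1.8T parameters and 16 bits (rumored figure).
print(model_size_gb(1.8e12, 16))   # -> 3600.0 (GB), i.e., 3.6 TB
```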
Battery life also presents a significant hurdle. Meta calculated that a 7-billion-parameter LLM consumes 0.7 J/token. A fully charged iPhone, with roughly 50 kJ of battery capacity, can sustain this LLM for less than two hours, with every 64 tokens draining 0.2% of the battery.
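Meta's battery math checks out with simple division; the 10 tokens-per-second conversational rate below is our assumption to make the "less than two hours" figure concrete:

```python
# Battery runtime for an on-device 7B LLM, via simple division.
ENERGY_PER_TOKEN_J = 0.7   # joules per token (Meta's figure)
IPHONE_BATTERY_J = 50_000  # ~50 kJ in a fully charged iPhone
TOKENS_PER_SECOND = 10     # conversational generation rate (assumption)

total_tokens = IPHONE_BATTERY_J / ENERGY_PER_TOKEN_J  # ~71,400 tokens per charge
hours = total_tokens / TOKENS_PER_SECOND / 3600
print(f"{hours:.1f} hours of conversation")  # -> 2.0, i.e., just under two hours
```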
While Big Tech chases billion-dollar data centers for massive AI models, the key to unlocking AI's potential lies in smaller, sub-billion-parameter AI models that deliver exceptional performance.
Meta is focused on developing several "minute" AI models that are 15 times smaller than current state-of-the-art LLMs to deploy conversational chatbots at scale on mobile devices.
KEY TAKEAWAY
The hype surrounding Artificial General Intelligence (AGI), ever-larger foundational AI models, and spending wars among Big Tech companies have pushed AI industry CapEx to $600 billion. According to venture firm Sequoia Capital, that's 20 times larger than actual revenues.
Big Tech needs to make sure it's building AI infrastructure consumers actually want, not just guessing what might be popular later. SLMs are still in their infancy, but Meta's money is in the right place.
FINAL NOTE
If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.
How was today's newsletter?
❤️TAIP Review of the Week
"Another great Sunday Special, Rohun and James! Particularly because it dives deep into the implications of AI developments within a specific profession or field. In this case, music composition. AI is spreading on all fronts, so its impact won't be a one-size-fits-all scenario. I'd like to see detailed analyses within other professions or processes. Well done!"
REFER & EARN
Your Friends Learn, You Earn!
You currently have 0 referrals, only 1 away from receiving the Ultimate Prompt Engineering Guide.
Refer 5 friends to enter 💰July's $200 Gift Card Giveaway.
Copy and paste this link to others: https://theaipulse.beehiiv.com/subscribe?ref=PLACEHOLDER