
🧠 Why Meta's Tiny AI Models Matter

PLUS: What Apple Intelligence's Delayed Launch Tells Us About the Future of Computing

Welcome back, AI prodigies!

In today's Sunday Special:

  • ⚡️Energy Is Essential

  • 📦Supply Side Shortages

  • 🤖A New AI Model?

  • 📊The Solution

  • 🔑Key Takeaway

Read Time: 8 minutes

🎓Key Terms

  • Small Language Models (SLMs): AI models that require less computational power and memory, making them cost-effective and quick to train and deploy.

  • Graphics Processing Unit (GPU): a specialized computer chip capable of parallel processing (i.e., performing mathematical calculations simultaneously), making it ideal for complex applications like generative AI (GenAI).

  • Search-Augmented LLMs: LLMs that retrieve up-to-date information from external knowledge bases like the internet.

  • Capital Expenditures (CapEx): the funds a company uses to purchase, improve, or maintain long-term assets essential for its operations.

  • Tokens: the smallest units of data an AI model uses to process and generate text, much like how we break sentences down into words or characters (see the short example after this list).
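
For instance, here's how a real tokenizer splits text. This is a quick, illustrative sketch using OpenAI's open-source tiktoken library; any tokenizer would demonstrate the same idea.

```python
# Quick illustration of tokenization using OpenAI's open-source
# tiktoken library (any tokenizer would show the same idea).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models
text = "Why Meta's tiny AI models matter."

token_ids = enc.encode(text)
print(token_ids)                              # integer token IDs
print([enc.decode([t]) for t in token_ids])   # the text chunk behind each ID
print(len(token_ids), "tokens for", len(text), "characters")
```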

🩺 PULSE CHECK

What do you think is the biggest challenge to widespread adoption of AI?


āš”ļøENERGY IS ESSENTIAL

Before the mass adoption of generative AI (GenAI), we must address energy constraints. At current standards, the world's power grids won't meet the expected demand for AI-enabled products and services and their accompanying infrastructure.

Given this reality, powerful sub-billion-parameter Small Language Models (SLMs) are the future. Meta has proposed various algorithmic innovations to create MobileLLM, a family of AI models optimized for on-device applications that prioritizes model architecture over data and parameter quantity. This new state-of-the-art approach may soon become the standard at scale and keep the great promises AI enthusiasts envision from ending up as just that: promises.

Look at the distribution issues AI will face in the foreseeable future and you'll see a very long tail, particularly given the industry's challenges in meeting even the demand it can foresee.

But before we convince you that the answer is sub-billion-parameter SLMs, let's consider the scale of the energy challenge.

📦SUPPLY SIDE SHORTAGES

Assuming the status quo continues, we might soon face a real Graphics Processing Unit (GPU) shortage.

Before you jump to "we already had a shortage not too long ago": yes, but unprecedented Capital Expenditures (CapEx) drove that one. Nvidia couldn't keep pace with Big Tech companies investing billions of dollars to build massive GPU data centers for a future demand that doesn't yet exist. In other words, there was a short-term mismatch between investments in GenAI and the technology's revenue. The coming shortage is different: a shortage of GPUs relative to end-user demand once LLMs are fully integrated into products and services like Google Search.

According to Meta {Appendix I}, in a future where most humans use LLMs just 5% of their day, we would need 100 million Nvidia H100 Tensor Core GPUs to power OpenAI's GPT-4, assuming an acceptable generation speed of 50 tokens per second and a very short average sequence length, which refers to the number of tokens processed within a prompt.
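
As a rough sanity check on that figure, here's a hypothetical back-of-envelope calculation. The per-GPU serving throughput below is purely an illustrative assumption (real throughput depends on model size, batching, and the serving stack), not a number from Meta's paper.

```python
# Hypothetical back-of-envelope check of the ~100M-GPU figure.
# The per-GPU throughput is an illustrative assumption, not Meta's input.

population = 8e9             # people worldwide
usage_share = 0.05           # each person uses an LLM 5% of their day
tokens_per_user = 50         # tokens per second per active user (from above)
gpu_throughput = 200         # assumed GPT-4-class tokens/second per H100

concurrent_users = population * usage_share        # ~400M at any given moment
total_tokens_per_s = concurrent_users * tokens_per_user
gpus_needed = total_tokens_per_s / gpu_throughput

print(f"{gpus_needed:,.0f} GPUs")  # ~100,000,000 under these assumptions
```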

While such numbers may sound nonsensical, that future isn't far off. As noted, LLMs will supercharge Google's AI Overviews: a feature that provides AI-generated summaries at the top of Google Search results.

Google Search is used 8.5 billion times per day, and according to research by SemiAnalysis, a GenAI-enhanced Google Search could cost an average of 9 watt-hours (Wh) per query.

Assuming that at least 60% of all searches will trigger GenAI generations, the total annual energy demand for those generations would be 17 terawatt-hours (TWh), or 17 million megawatt-hours (MWh). For reference, xAI's upcoming 100,000 Nvidia H100 Tensor Core GPU cluster, the largest in the world, will draw a mere 140 megawatts (MW) to run.
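
The arithmetic behind that 17 TWh figure (the 9 Wh per query is SemiAnalysis's estimate; the 60% GenAI share is the assumption above):

```python
# Reproducing the ~17 TWh/year estimate from the figures above.

searches_per_day = 8.5e9   # daily Google searches
genai_share = 0.60         # assumed share of searches using GenAI
wh_per_query = 9           # SemiAnalysis's estimated cost per GenAI query

wh_per_day = searches_per_day * genai_share * wh_per_query  # ~45.9 GWh/day
twh_per_year = wh_per_day * 365 / 1e12                      # Wh -> TWh

print(f"{twh_per_year:.1f} TWh per year")  # ~16.8, i.e., the ~17 TWh above
```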

Now, you might say that all those GPUs don't need to sit in one data center and that these computational demands can be distributed. On the contrary, the cluster of GPUs running a single LLM must be colocated in one data center, because the model's size forces it to be sharded across hundreds of GPUs. A single LLM demands continuous GPU-to-GPU communication, which requires costly cables that triple in price beyond the 50-meter (m) mark. And that's without factoring in latency, which would degrade the user experience.

🤖A NEW AI MODEL?

Envisioning the energy constraints from a GPU perspective is already daunting, but it's just the tip of the iceberg. The energy challenges worsen when considering the projected global growth in AI use and the emergence of even more powerful frontier AI models.

Suppose the current compute and memory cost complexity of AI models (i.e., how expensive they are to run and store) continues. In that case, the estimates we provided in the previous segment may not be sufficient: because self-attention scales quadratically with sequence length, doubling the input sequence of an LLM quadruples the compute and memory requirements.
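
A minimal sketch of where that quadratic cost comes from, assuming standard self-attention (the hidden dimension below is an arbitrary illustrative value):

```python
# Standard self-attention builds an L x L score matrix, so compute
# (and the memory to hold that matrix) grows with the square of L.

def attention_score_ops(seq_len: int, d_model: int = 4096) -> int:
    """Approximate multiply-adds for the QK^T score matrix alone."""
    return seq_len * seq_len * d_model  # L^2 * d

for L in (1024, 2048, 4096):
    print(f"L={L}: {attention_score_ops(L):.2e} ops")
# Each doubling of L multiplies the cost by 4.
```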

While LLMs have conquered memorization by regurgitating most of the internet's data, their reasoning capabilities are modest.

Most people consider search-augmented LLMs, or long-inference AI models, the solution. These LLMs explore the solution space instead of directly responding to user queries, generating up to millions of possible responses before settling on one.

Here's a breakdown of how search-augmented LLMs work (a code sketch follows the analogy below):

  1. Understanding Your Request: The LLM analyzes the prompt to grasp its meaning and intent.

  2. Knowledge Base Search: It then taps into a vast external knowledge base, like a super-powered search engine.

  3. Identifying Relevant Information: The LLM sifts through the external knowledge base to find the information that aligns with the prompt.

  4. Enhancing the Prompt: The LLM incorporates the most relevant information into a more refined and detailed prompt.

  5. Generating the Response: The LLM leverages that refined, detailed prompt to generate its response.

Here's an analogy: Imagine a student writing an essay. The LLM is like the student who first understands the essay prompt. Then, the student consults a library (i.e., an external knowledge base) to find relevant sources. After identifying the key points from those sources, the student incorporates them into the essay (i.e., enhancing the prompt). Finally, the student uses their writing skills to craft the essay (i.e., generate the response).
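
In code, that five-step flow looks roughly like the sketch below. Everything here is a hypothetical stand-in: search_knowledge_base, rank_by_relevance, and llm_generate are placeholders for whatever search API, retriever, and model you'd actually use.

```python
# Minimal, hypothetical sketch of a search-augmented LLM pipeline.
# The three helpers are stand-ins for a real search API, retriever, and model.

def search_knowledge_base(query: str) -> list[str]:
    # Stand-in for an external search (e.g., a web or vector-database query).
    return ["doc mentioning " + query, "another document", "unrelated document"]

def rank_by_relevance(query: str, docs: list[str]) -> list[str]:
    # Stand-in for a retriever; a real one would score embeddings.
    return sorted(docs, key=lambda d: query.lower() in d.lower(), reverse=True)

def llm_generate(prompt: str) -> str:
    # Stand-in for the actual LLM call.
    return f"[LLM response to: {prompt[:50]}...]"

def answer(prompt: str) -> str:
    query = prompt                                       # 1. understand the request
    candidates = search_knowledge_base(query)            # 2. search the knowledge base
    relevant = rank_by_relevance(query, candidates)[:3]  # 3. keep what's relevant
    context = "\n".join(relevant)
    enriched = f"Context:\n{context}\n\nQuestion: {prompt}"  # 4. enhance the prompt
    return llm_generate(enriched)                        # 5. generate the response

print(answer("Why do sub-billion-parameter models matter?"))
```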

This approach not only skyrockets average token usage but also likely requires the development of verifiers or additional AI models to validate the LLM's search for the solution.

If this is the future of AI, then the numbers we saw above will fall short, with some requests far exceeding the 9-watt-hour (Wh) mark we discussed earlier.

According to the International Energy Agency (IEA), data center electricity demand in the U.S. and China is expected to reach approximately 710 terawatt-hours (TWh) annually by 2026. For reference, that's almost as large as France and Italy's combined energy consumption of 720 terawatt-hours (TWh) in 2022.

📊THE SOLUTION

For all those reasons, many are looking toward Edge AI, or "on-device" language models, as a possible solution. These AI models can run on mobile devices, eliminating the need for GPU data centers. However, as Apple Intelligence's reported delay of GenAI-enhanced Siri to Spring 2025 suggests, cost-competitive devices aren't yet up to the challenge. But why?

Training and deploying LLMs at the data center scale is a complex balancing act. While personal devices can run simpler LLMs, their capabilities are limited by processing power and battery life.

Quality-wise, the best results in AI today come from models with file sizes well above the terabyte (TB) range; a terabyte is 1,000 times larger than a gigabyte (GB). Considering that Apple Intelligence's on-device LLM, OpenELM, is only around 1.5 gigabytes (GB), and that shrinking a model that far costs too much quality, it's no wonder Apple may be stalling the release of GenAI-enhanced Siri.

Battery life also presents a significant hurdle. Meta calculated that a 7-billion-parameter LLM consumes 0.7 joules per token (J/token). A fully charged iPhone, with roughly 50 kilojoules (kJ) of battery capacity, can sustain this LLM in conversation for less than two hours, with every 64 tokens draining 0.2% of the battery.
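
The arithmetic behind that two-hour figure (0.7 J/token and 50 kJ come from Meta's calculation; the 10-tokens-per-second conversational rate is an assumption):

```python
# Back-of-envelope battery math. The 0.7 J/token and 50 kJ figures are
# Meta's; the 10 tokens/second conversational rate is an assumption.

energy_per_token = 0.7      # joules per generated token
battery_capacity = 50_000   # joules (50 kJ)
tokens_per_second = 10      # assumed conversational generation speed

tokens_per_charge = battery_capacity / energy_per_token   # ~71,400 tokens
hours = tokens_per_charge / tokens_per_second / 3600      # ~2.0 hours

print(f"~{tokens_per_charge:,.0f} tokens, ~{hours:.1f} h of conversation")
```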

While Big Tech chases billion-dollar data centers for massive AI models, the key to unlocking AI's potential lies in smaller, sub-billion-parameter AI models that deliver exceptional performance.

Meta is focused on developing several "minute" AI models that are 15 times smaller than current state-of-the-art LLMs to deploy conversational chatbots at scale on mobile devices.

🔑KEY TAKEAWAY

The hype surrounding Artificial General Intelligence (AGI), ever-more-expansive foundational AI models, and spending wars among Big Tech companies have driven AI industry CapEx to $600 billion. According to the venture firm Sequoia Capital, that figure is 20 times larger than actual revenues.

Big Tech needs to make sure it's building AI infrastructure consumers actually want, not just guessing at what might be popular later. SLMs are still in their infancy, but Meta's money is in the right place.

📒FINAL NOTE

If you found this useful, follow us on Twitter or reply with honest feedback. It helps us improve our content.

ā¤ļøTAIP Review of the Week

"Another great Sunday Special, Rohun and James! Particularly because it dives deep into the implications of AI developments within a specific profession or field. In this case, music composition. AI is spreading on all fronts, so its impact won't be a one-size-fits-all scenario. I'd like to see detailed analyses within other professions or processes. Well done!"

-Lucan (ā­ļøā­ļøā­ļøā­ļøā­ļøNailed it!)
