🧠 Why Meta's Tiny AI Models Matter
PLUS: What Apple Intelligence's Delayed Launch Tells Us About the Future of Computing
Welcome back, AI prodigies!
In today's Sunday Special:
⚡️Energy Is Essential
📦Supply Side Shortages
🤖A New AI Model?
The Solution
Key Takeaway
Read Time: 8 minutes
Key Terms
Small Language Models (SLMs): AI models that require less computational power and memory, making them cost-effective and quick to train and deploy.
Graphics Processing Unit (GPU): a specialized computer chip capable of parallel processing (i.e., performing mathematical calculations simultaneously), making it ideal for complex applications like generative AI (GenAI).
Search-Augmented LLMs: LLMs that retrieve up-to-date information from external knowledge bases like the internet.
Capital Expenditures (CapEx): the funds a company uses to purchase, improve, or maintain long-term assets essential for its operations.
Tokens: the smallest units of data an AI model uses to process and generate text, much like we break down sentences into words or characters.
🩺 PULSE CHECK
What do you think is the biggest challenge to widespread adoption of AI? Vote below to view live results.
⚡️ENERGY IS ESSENTIAL
Before the mass adoption of generative AI (GenAI), we must address energy constraints. At current standards, the world's power grids won't meet the expected demand for AI-enabled products and services and their accompanying infrastructure.
Given this reality, powerful sub-billion-parameter Small Language Models (SLMs) are the future. Meta has proposed various algorithmic innovations to create MobileLLM, a family of AI models optimized for on-device applications that prioritizes AI model architecture over data and parameter quantity. This new state-of-the-art AI model may soon become the standard at scale, keeping the great promises AI enthusiasts envision from remaining just promises.
Look at the distribution challenges AI will face in the coming years and you'll see a very long tail, especially given the industry's struggle to keep up with demand.
But before we convince you that the answer is sub-billion-parameter SLMs, let's consider the scale of the energy challenge.
📦SUPPLY SIDE SHORTAGES
Assuming the status quo continues, we might soon face a real Graphics Processing Unit (GPU) shortage.
Before you jump to "we already had a shortage not too long ago," yes, we did, but unprecedented Capital Expenditures (CapEx) drove that one: Nvidia couldn't keep up with Big Tech companies pouring billions of dollars into massive GPU data centers built for future demand that doesn't yet exist. In other words, there was a short-term mismatch between investments in GenAI and the technology's revenue. The next shortage could instead be a shortage of GPUs relative to end-user demand once LLMs are fully integrated into products and services like Google Search.
According to Meta {Appendix I}, in a future where most humans use LLMs just 5% of their day, we would need 100 million Nvidia H100 Tensor Core GPUs to power OpenAI's GPT-4, assuming an acceptable generation speed of 50 tokens per second and a very short average sequence length, which refers to the number of tokens processed within a prompt.
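As a rough sanity check, the arithmetic behind an estimate like that can be reproduced in a few lines of Python. The per-token compute and per-GPU throughput below are our own illustrative assumptions (roughly 2x a GPT-4-class model's active parameters per token, and a realistic sustained fraction of an H100's peak), not figures from this newsletter:

```python
# Back-of-envelope check on the 100-million-GPU estimate.
WORLD_POPULATION = 8e9        # people
USAGE_SHARE = 0.05            # LLMs used 5% of each person's day
TOKENS_PER_SECOND = 50        # per-user generation speed
FLOPS_PER_TOKEN = 300e9       # ~2x active parameters, GPT-4-class (assumption)
GPU_THROUGHPUT = 60e12        # sustained H100 FLOPs/s (assumption)

concurrent_users = WORLD_POPULATION * USAGE_SHARE         # 400 million at any instant
tokens_per_second = concurrent_users * TOKENS_PER_SECOND  # 2e10 tokens/s worldwide
flops_per_second = tokens_per_second * FLOPS_PER_TOKEN    # 6e21 FLOPs/s
gpus_needed = flops_per_second / GPU_THROUGHPUT

print(f"{gpus_needed:,.0f} GPUs")  # -> 100,000,000
```

Under those assumptions, Meta's usage scenario lands right at the 100 million figure.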
While such numbers may sound nonsensical, that future isn't far off. As noted, LLMs will supercharge Google's AI Overviews: a feature that provides AI-generated summaries at the top of Google Search results.
Google Search is used 8.5 billion times per day, and according to research by SemiAnalysis, GenAI-enhanced Google Search could consume an average of 9 watt-hours (Wh) per query.
Assuming at least 60% of all searches eventually trigger GenAI generations, the total energy demand would be roughly 17 terawatt-hours (TWh) per year, or 17 million megawatt-hours (MWh). For reference, xAI's upcoming 100,000 Nvidia H100 Tensor Core GPU cluster, the largest in the world, will draw a mere 140 megawatts (MW), about 1.2 terawatt-hours (TWh) per year if run around the clock.
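Those totals are just the per-search figures annualized; a quick check using the newsletter's own inputs:

```python
# Annualizing the per-search figures cited above.
SEARCHES_PER_DAY = 8.5e9  # daily Google searches
GENAI_SHARE = 0.60        # share of searches answered with GenAI
WH_PER_SEARCH = 9         # SemiAnalysis estimate, watt-hours per GenAI search

wh_per_year = SEARCHES_PER_DAY * GENAI_SHARE * WH_PER_SEARCH * 365
print(f"{wh_per_year / 1e12:.1f} TWh per year")  # -> 16.8 TWh, i.e., ~17 TWh
```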
Now, you might say the GPUs don't all need to sit in one data center, and these computational demands will be distributed. On the contrary: a cluster of GPUs running a single LLM must be colocated in one data center, because the AI models, due to their size, need to be distributed across hundreds of GPUs. A single LLM demands continuous GPU-to-GPU communication, which requires costly cables that triple in price beyond the 50-meter (m) mark. And that's without factoring in latency, which would degrade the user experience.
🤖A NEW AI MODEL?
Envisioning the energy constraints from a GPU perspective is already daunting, but it's just the tip of the iceberg. The energy challenges worsen when considering the projected global growth in AI use and the emergence of even more powerful frontier AI models.
If the current compute and memory cost complexity of AI models (i.e., how expensive they are to run and store) holds, the estimates from the previous segment may not even be sufficient: doubling the input sequence of an LLM quadruples its compute and memory requirements.
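That quadratic growth comes from self-attention: every token attends to every other token, so doubling the sequence length quadruples the number of token-to-token comparisons. A minimal illustration:

```python
# Self-attention compares every token with every other token, so the
# attention matrix (and the compute to fill it) grows with the square
# of the sequence length.
def attention_matrix_entries(sequence_length: int) -> int:
    return sequence_length ** 2

for n in (1_000, 2_000, 4_000):
    print(f"{n:>5} tokens -> {attention_matrix_entries(n):>12,} entries")
# 1,000 tokens ->    1,000,000 entries
# 2,000 tokens ->    4,000,000 entries (2x the tokens, 4x the cost)
# 4,000 tokens ->   16,000,000 entries
```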
While LLMs have conquered memorization by regurgitating most of the internet's data, their reasoning capabilities are modest.
Many consider search-augmented LLMs, or long-inference AI models, to be the solution. These LLMs explore the solution space instead of directly responding to user queries, generating up to millions of possible responses before settling for one.
Here's a breakdown of how search-augmented LLMs work:
Understanding Your Request: The LLM analyzes the prompt to grasp its meaning and intent.
Knowledge Base Search: It then taps into a vast external knowledge base, like a super-powered search engine.
Identifying Relevant Information: The LLM sifts through the external knowledge base to find the information that aligns with the prompt.
Enhancing the Prompt: The LLM incorporates the most relevant information into a more refined and detail-oriented prompt.
Generating the Response: The LLM leverages the more refined and detail-oriented prompt to generate a response.
Here's an analogy: Imagine a student writing an essay. The LLM is like the student who first understands the essay prompt. Then, the student consults a library (i.e., an external knowledge base) to find relevant sources. After identifying the key points from those sources, the student incorporates them into the essay (i.e., enhancing the prompt). Finally, the student uses their writing skills to craft the essay (i.e., generating the response).
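For the technically curious, here's a minimal sketch of that five-step loop in Python. The hard-coded knowledge base, the keyword matching, and the `call_llm` stub are all illustrative stand-ins, not any vendor's actual API:

```python
# A toy version of the five-step search-augmented loop above.
KNOWLEDGE_BASE = {  # stand-in for the web or a vector database
    "mobilellm": "MobileLLM is Meta's family of sub-billion-parameter on-device models.",
    "h100": "The Nvidia H100 is a data center GPU used to train and serve LLMs.",
}

def retrieve(prompt: str) -> list[str]:
    """Steps 2-3: search the knowledge base for passages relevant to the prompt."""
    return [text for key, text in KNOWLEDGE_BASE.items() if key in prompt.lower()]

def augment(prompt: str, passages: list[str]) -> str:
    """Step 4: fold the retrieved passages into a more detailed prompt."""
    return "Context:\n" + "\n".join(passages) + f"\n\nQuestion: {prompt}"

def call_llm(enriched_prompt: str) -> str:
    """Step 5 stub: a real system would call a text-generation model here."""
    return f"[model response to: {enriched_prompt!r}]"

query = "How big is MobileLLM?"
print(call_llm(augment(query, retrieve(query))))
```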
This approach not only skyrockets average token usage but likely requires the development of verifiers or additional AI models to validate the LLM's search for the solution.
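That "explore, then verify" pattern can be sketched as best-of-N sampling: generate many candidates, score each with a verifier, keep the winner. Both functions below are hypothetical placeholders for real models:

```python
import random

def generate_candidate(prompt: str) -> str:
    # Placeholder: in practice, one sampled LLM response per call.
    return f"candidate {random.randint(0, 9)} for {prompt!r}"

def verifier_score(candidate: str) -> float:
    # Placeholder: in practice, a second model that rates correctness.
    return random.random()

def best_of_n(prompt: str, n: int = 1_000) -> str:
    # Every candidate costs a full generation, so token usage scales with n.
    return max((generate_candidate(prompt) for _ in range(n)), key=verifier_score)

print(best_of_n("What is 17 * 24?"))
```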
If this is the future of AI, then the numbers we saw above will fall short, with some requests far exceeding the 9-watt-hour (Wh) mark we discussed earlier.
According to the International Energy Agency (IEA), data center electricity demand in the U.S. and China is expected to reach approximately 710 terawatt-hours (TWh) annually by 2026. For reference, that's almost as much as France and Italy's combined energy consumption in 2022 of 720 terawatt-hours (TWh).
THE SOLUTION
For all those reasons, many are looking toward Edge AI, or "on-device" language models, as a possible solution. These AI models can run on mobile devices, eliminating the need for GPU data centers. However, as Apple Intelligence's reported delay of GenAI-enhanced Siri to Spring 2025 proves, cost-competitive devices aren't up to the challenge. But why?
Training and deploying LLMs at the data center scale is a complex balancing act. While personal devices can run simpler LLMs, their capabilities are limited by processing power and battery life.
Quality-wise, the best results in AI today come from AI models with file sizes well into the terabyte (TB) range. A terabyte (TB) is 1,000 times larger than a gigabyte (GB). Considering Apple Intelligence's on-device LLM, OpenELM, is only around 1.5 gigabytes (GB), and smaller AI models sacrifice too much quality, it's no wonder Apple may be stalling the release of GenAI-enhanced Siri.
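Those file sizes follow from a simple rule of thumb: parameter count times storage precision. The 3-billion-parameter, 4-bit figures below are our assumptions for an OpenELM-class model, and the 1.8-trillion-parameter figure is a rumored GPT-4-scale number used only for illustration:

```python
# Rough model file size: parameter count x bits per weight / 8 bytes.
def model_size_gb(num_parameters: float, bits_per_weight: int) -> float:
    return num_parameters * bits_per_weight / 8 / 1e9

# A 3B-parameter model quantized to 4 bits per weight (assumption)
# lands near the 1.5 GB cited above.
print(model_size_gb(3e9, 4))       # -> 1.5 (GB)

# A frontier model at ~1.8T parameters and 16 bits (rumored figure).
print(model_size_gb(1.8e12, 16))   # -> 3600.0 (GB), i.e., 3.6 TB
```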
Battery life also presents a significant hurdle. Meta calculated that a 7-billion-parameter LLM consumes 0.7 J/token. A fully charged iPhone, with roughly 50 kJ of battery capacity, can sustain this LLM for less than two hours, with every 64 tokens draining 0.2% of the battery.
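Meta's battery math checks out with simple division; the 10 tokens-per-second conversational rate below is our assumption to make the "less than two hours" figure concrete:

```python
# Battery runtime for an on-device 7B LLM, via simple division.
ENERGY_PER_TOKEN_J = 0.7   # joules per token (Meta's figure)
IPHONE_BATTERY_J = 50_000  # ~50 kJ in a fully charged iPhone
TOKENS_PER_SECOND = 10     # conversational generation rate (assumption)

total_tokens = IPHONE_BATTERY_J / ENERGY_PER_TOKEN_J  # ~71,400 tokens per charge
hours = total_tokens / TOKENS_PER_SECOND / 3600
print(f"{hours:.1f} hours of conversation")  # -> 2.0, i.e., just under two hours
```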
While Big Tech chases billion-dollar data centers for massive AI models, the key to unlocking AI's potential lies in smaller, sub-billion-parameter AI models that deliver exceptional performance.
Meta is focused on developing several "minute" AI models that are 15 times smaller than current state-of-the-art LLMs to deploy conversational chatbots at scale on mobile devices.
KEY TAKEAWAY
The hype surrounding Artificial General Intelligence (AGI), ever-larger foundational AI models, and spending wars among Big Tech companies have pushed AI industry CapEx to $600 billion. According to venture firm Sequoia Capital, that's 20 times larger than actual revenues.
Big Tech needs to make sure it's building AI infrastructure consumers actually want, not just guessing what might be popular later. SLMs are still in their infancy, but Meta's money is in the right place.
FINAL NOTE
If you found this useful, follow us on Twitter or provide honest feedback below. It helps us improve our content.
How was today's newsletter?
❤️TAIP Review of the Week
"Another great Sunday Special, Rohun and James! Particularly because it dives deep into the implications of AI developments within a specific profession or field. In this case, music composition. AI is spreading on all fronts, so its impact won't be a one-size-fits-all scenario. I'd like to see detailed analyses within other professions or processes. Well done!"
REFER & EARN
Your Friends Learn, You Earn!
You currently have 0 referrals, only 1 away from receiving the Ultimate Prompt Engineering Guide.
Refer 5 friends to enter 💰July's $200 Gift Card Giveaway.
Copy and paste this link to others: https://theaipulse.beehiiv.com/subscribe?ref=PLACEHOLDER