
🧠 Building AI Is Hard. But Implementing It? Even Harder.

PLUS: What Makes Real-Time Disease Detection So Tricky

Welcome back, AI prodigies!

In today’s Sunday Special:

  • 📜The Prelude

  • 📊Benchmarks vs. The Real World

  • 🦾Technicalities and Trust

  • ⏳Real-Time Integration

  • 🔑Key Takeaway

Read Time: 7 minutes

🎓Key Terms

  • Machine Learning (ML): Leverages data to recognize patterns and make predictions without being explicitly programmed to do so.

  • Large Language Models (LLMs): AI Models pre-trained on vast amounts of data to generate human-like text.

  • Large Reasoning Models (LRMs): AI Models designed to mimic a human’s decision-making abilities to solve complex, multi-step problems.

🩺 PULSE CHECK

When will you feel comfortable riding in self-driving cars?


📜THE PRELUDE

AI Capability is moving fast, but AI Adoption isn’t.

Each new advanced AI model seems to shatter another benchmark, acing a Ph.D.-level exam or crushing a coding competition.

These accomplishments would suggest we’re sprinting toward societal transformation. In reality, our progress resembles a walk.

Benchmarks aren’t the real world. Just because an advanced AI model is capable doesn’t mean it’s operational. To excel in the real world, advanced AI models must be reliable, secure, seamless, scalable, adaptable, user-friendly, cost-efficient, and aligned with human intent.

So, what’s slowing AI Adoption down? How can we bridge the gap between AI Capability and AI Adoption?

📊BENCHMARKS VS. THE REAL WORLD

Today, advanced AI models are celebrated for achieving human-level, or even superhuman, performance on standardized tests.

LLMs like OpenAI’s “GPT-4” beat 90% of law school graduates on the Uniform Bar Exam (UBE). LRMs like OpenAI’s “o3” scored 99% on the HumanEval benchmark, which consists of 164 hand-crafted programming problems to assess an advanced AI model’s ability to generate functional code.
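For context, a HumanEval task is a short, self-contained Python function specified by a signature and docstring, graded pass/fail against fixed unit tests. Here’s a sketch modeled on the benchmark’s first public problem (the test cases shown are illustrative):

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Return True if any two numbers in the list are closer to each
    other than the given threshold (modeled on HumanEval's first task)."""
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False

# Grading is pass/fail: the generated body must satisfy fixed unit tests.
assert has_close_elements([1.0, 2.0, 3.0], 0.5) is False
assert has_close_elements([1.0, 2.8, 3.0], 0.3) is True
```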

However, these accomplishments only measure performance on narrow, static problems that fail to reflect the complexity of real-world litigation or real-world software development.

āš–ļøWhy AI Can’t Replace Lawyers, Yet.

“GPT-4’s” score on the UBE made headlines, fueling speculation that advanced AI models might replace lawyers sooner than we think. Not only does the UBE evaluate memorization and issue-spotting, it also asks test-takers to form persuasive arguments and exercise ethical judgment. Even so, test-takers rely on a narrow set of facts, far simpler than real disputes in a court of law. Real-world litigation is deeply contextual. It depends on tacit knowledge, like non-verbal cues, societal norms, and unwritten rules, which help lawyers assess the integrity of evidence. While LLMs can summarize case law or analyze legal proceedings in a hypothetical scenario, they don’t come close to producing court-ready documents that reflect the nuance, judgment, and situational awareness required to practice law.

💻Why AI Can’t Replace Developers, Yet.

LLMs and LRMs regularly outperform developers on benchmarks like HumanEval. Yet, developers know that real software workflows involve debugging across abstraction layers, managing codebase dependencies, and ensuring the long-term maintainability of digital platforms. None of this is captured in benchmarks like HumanEval, which are built on clean, well-defined programming problems with fixed inputs and fixed outputs. While you can leverage AI-powered tools like bolt.new to draft initial codebases, they don’t replace the need for developers.

According to Addy Osmani, Head of Developer Experience at Google Chrome, developers don’t merely accept AI-generated code. Instead, they meticulously restructure it into smaller modules and rework the architecture of those modules to make sure the AI-generated code integrates properly into existing codebases. Integrating AI-generated code requires understanding its functionality and limitations.
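As a hypothetical sketch of that workflow (names and logic invented for illustration), here’s how a sprawling AI-generated draft might be reworked into smaller, testable modules:

```python
# Hypothetical AI-generated draft: validation, pricing, and persistence
# are tangled together in one function.
def process_order_draft(order, db):
    if not order["items"]:
        raise ValueError("empty order")
    total = 0
    for item in order["items"]:
        total += item["price"] * item["qty"]
    db.save(order["id"], total)
    return total

# Developer rework: small, single-purpose modules that are easier to
# test in isolation and to slot into an existing codebase.
def validate_order(order: dict) -> None:
    if not order.get("items"):
        raise ValueError("Order must contain at least one item.")

def total_price(order: dict) -> float:
    return sum(item["price"] * item["qty"] for item in order["items"])

def process_order(order: dict, save) -> float:
    validate_order(order)
    total = total_price(order)
    save(order["id"], total)  # persistence injected, not hard-coded
    return total
```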

🦾TECHNICALITIES AND TRUST

šŸ”The Reality Gap.

AI Capability is starting to reveal the Reality Gap: as advanced AI models become more capable, the trust required to deploy them grows even faster.

Even if advanced AI models perform well on benchmarks, implementing them in the real world still requires a lot of trust, and without trust, AI Adoption stalls.

🚖Self-Driving Cars?

Consider the case of self-driving cars. Waymo, an autonomous ride-hailing service owned by Alphabet, Google’s parent company, has provided fully autonomous rides for nearly a decade.

Waymo has over 40 million miles of real-world driving experience in Phoenix, AZ, San Francisco, CA, Los Angeles, CA, and Austin, TX. Compared to human drivers with the same miles in the same locations, Waymo achieved 83% fewer airbag deployment crashes, 81% fewer injury-causing crashes, and 64% fewer police-reported crashes. Of 35 crashes between July 2024 and February 2025, Waymo was at fault for just one, with the other 34 caused by human error. The verdict seems clear: self-driving cars are safer than human-driven cars.

Even if Waymo continues to overcome all the technical challenges of self-driving cars and builds a near-perfect autonomous ride-hailing service, it must gain society’s trust. Trust ultimately comes down to whether a person feels comfortable in a self-driving car.

💥Acceptable Error Rates?

To create trust, we must agree on an acceptable error rate. Self-driving cars will always cause some crashes. However, AI-caused harm introduces a new ethical paradigm. We’re willing to tolerate human error, creating cultural norms that accept imperfection, like saying: “no one’s perfect.” But will we hold self-driving cars to the same imperfect standard?

People expect machines to be better than humans. This perception stems from traditional software, which is Deterministic. It produces the same output when given the same input. Every time you press the checkout button while shopping online, you expect something to go into your shopping cart. On the other hand, AI-enabled technology like self-driving cars relies on ML, which is Probabilistic. It produces different outputs when given the same input. Until we become comfortable with the Probabilistic nature of AI-enabled technology, we may disagree on a threshold of acceptable harm for self-driving cars.
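Here’s a minimal sketch of that distinction in Python. The checkout function is hypothetical, and the classifier is a toy stand-in for an ML perception model, not real self-driving code:

```python
import random

# Deterministic: the same input always produces the same output,
# like a checkout button (hypothetical example).
def add_to_cart(cart: list, item: str) -> list:
    return cart + [item]

# Probabilistic: a toy stand-in for an ML perception model. On the
# same scene it usually, but not always, makes the same call.
def classify_obstacle(sensor_reading: float) -> str:
    confidence = 0.9 if sensor_reading > 0.5 else 0.4
    return "pedestrian" if random.random() < confidence else "shadow"

print(add_to_cart([], "book"))   # always ['book']
print(classify_obstacle(0.7))    # usually 'pedestrian', occasionally 'shadow'
```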

ā³REAL-TIME INTEGRATION

Consider the Epic Sepsis Model (ESM) Inpatient Predictive Analytic Tool, an ML framework developed by Epic Systems to identify Sepsis.

Sepsis occurs when your body mounts such an aggressive immune response to a bacterial infection that it attacks your own organs and tissues. As the third most common cause of death in U.S. hospitals, it’s notoriously difficult to diagnose. To help doctors identify Sepsis, the ESM analyzes the Electronic Health Records (EHRs) of hospitalized patients to generate Sepsis risk estimates every 20 minutes throughout their stay. Though initially promising, ESM’s performance varies significantly depending on which of the three detection stages it’s deployed in:

  1. Late-Stage Detection: After clinical signs of Sepsis had become apparent, it correctly identified high-risk Sepsis patients 87% of the time.

  2. Pre-Diagnosis Detection: When making predictions before patients met the full clinical criteria for Sepsis, it correctly identified high-risk Sepsis patients 62% of the time.

  3. Early Detection: When making predictions before any blood tests were ordered to check for bacterial infections that may lead to Sepsis, it correctly identified high-risk Sepsis patients 53% of the time.

Timing creates a trade-off between accuracy and utility that persists in real-time disease detection. Early predictions are less accurate but more useful. Late predictions are more accurate but less useful. AI can’t be trusted if it can’t generate actionable insights at the right time. We accept a doctor making a diagnostic error, but hesitate when AI misdiagnoses, even if AI’s overall accuracy is higher.
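To make the trade-off concrete, here’s a toy calculation using the three ESM accuracy figures above; the lead-time hours are invented assumptions, not Epic’s numbers:

```python
# Toy illustration of the accuracy-vs-utility trade-off. Accuracies are
# the ESM figures above; the lead-time hours are invented assumptions.
stages = [
    ("Early Detection",         0.53, 12),  # hours before diagnosis (assumed)
    ("Pre-Diagnosis Detection", 0.62,  6),
    ("Late-Stage Detection",    0.87,  1),
]

for name, accuracy, lead_time in stages:
    # One crude utility score: chance of a correct flag x time to act on it.
    utility = accuracy * lead_time
    print(f"{name:24} accuracy={accuracy:.0%}  lead_time={lead_time}h  "
          f"utility={utility:.2f}")
```

Under this crude scoring, the earliest stage is the most valuable despite being the least accurate, which is exactly why the trade-off is so hard to resolve.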

🔑KEY TAKEAWAY

Bridging the gap between AI Capability and AI Adoption requires more than just innovation; it demands trust. Trust hinges not only on performing well on benchmarks, but also on delivering consistent, context-aware results in real-world settings with near-perfect accuracy. Until we get comfortable with the Probabilistic nature of AI, AI Adoption will lag AI Capability.

📒FINAL NOTE

FEEDBACK

How would you rate today’s email?

It helps us improve the content for you!


ā¤ļøTAIP Review of The Week

“It’s timely, unique, and informative. I can tell you put tons of effort into every email.”

-Michelle (1️⃣ 👍Nailed it!)

REFER & EARN

🎉Your Friends Learn, You Earn!

You currently have 0 referrals, only 1 away from receiving 🎓3 Simple Steps to Turn ChatGPT Into an Instant Expert.
