For a decade, Nvidia sold the world a simple story: GPUs are all you need. Training? GPUs. Inference? Also GPUs. That story built a $3 trillion empire.
On March 16 at GTC 2026 in San Jose, Jensen Huang is expected to blow it up himself.
Nvidia will reportedly unveil a dedicated inference processor — not a GPU — built on technology from Groq, the inference startup it absorbed in a $20 billion deal last December. OpenAI is lined up as the first major customer. And the implications for the entire AI hardware ecosystem are enormous.
The GPU’s Inference Problem
Nvidia’s GPUs are extraordinary at training AI models. Billions of parallel computations, weeks of crunching — that’s their sweet spot. But inference — actually running trained models in production — is a different animal.
During inference, the bottleneck isn’t compute. It’s memory bandwidth. The chip spends most of its time waiting for model weights to arrive from off-chip memory. Even the H100’s impressive 3.35 TB/s of HBM bandwidth becomes a chokepoint when you’re serving millions of real-time requests.
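To see why bandwidth, not compute, sets the ceiling: generating each new token requires streaming roughly the entire set of model weights from memory once. A back-of-the-envelope sketch in Python (the 70 GB figure is an illustrative assumption, roughly a 70B-parameter model at 8-bit precision; the 3.35 TB/s is the H100 bandwidth cited above):

```python
# Rough upper bound on single-stream decode speed for a memory-bound LLM.
# Each new token requires reading (approximately) all model weights once,
# so tokens/s can't exceed bandwidth divided by model size in bytes.

def max_tokens_per_second(bandwidth_tb_s: float, model_size_gb: float) -> float:
    """Ceiling on decode rate: bytes streamed per second / bytes per token."""
    bytes_per_second = bandwidth_tb_s * 1e12
    bytes_per_token = model_size_gb * 1e9
    return bytes_per_second / bytes_per_token

# Illustrative numbers: H100-class HBM bandwidth, ~70 GB of weights.
print(round(max_tokens_per_second(3.35, 70.0), 1))  # 47.9 tokens/s ceiling
```

No amount of extra compute raises that ceiling; only more bandwidth (or smaller weights) does.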
This matters because inference is where the money is going. As AI shifts from research labs to production — chatbots, coding assistants, autonomous agents — inference workloads are exploding. By some estimates, inference already accounts for over 60% of total AI compute spending. And it’s growing faster than training.
Nvidia kept insisting GPUs handle both workloads just fine. Meanwhile, its biggest customers started shopping elsewhere.
OpenAI Forced the Issue
The catalyst, per Wall Street Journal reporting, was OpenAI’s frustration with GPU-based inference for Codex. Engineers found Nvidia’s chips too power-hungry and too slow for real-time, latency-sensitive code generation. Internal teams blamed the hardware directly.
So OpenAI started looking around. Cerebras. SambaNova. And most critically, Groq — whose Language Processing Units were already delivering inference speeds that made GPUs look sluggish.
OpenAI even signed a multibillion-dollar contract with Cerebras in January 2026. The message to Nvidia was unmistakable: fix inference, or we’ll find someone who will.
Nvidia got the message. In December 2025, it moved to neutralize the Groq threat with a $20 billion licensing deal — one of the largest acqui-hires in Silicon Valley history. Groq founder Jonathan Ross, President Sunny Madra, and the bulk of the engineering team came into Nvidia’s fold. Then Nvidia invested $30 billion in OpenAI to lock down the relationship.
Why Groq’s Architecture Is Fundamentally Different
What makes Groq’s tech worth $20 billion? The core innovation is deceptively simple: put the memory on the chip itself.
Instead of off-chip HBM, Groq’s LPU stores model weights directly in on-chip SRAM. A single LPU holds around 230 MB of SRAM and delivers roughly 80 TB/s of internal memory bandwidth — approximately 24 times what an H100 achieves.
The tradeoff is capacity. 230 MB can’t hold even a small language model, so you link hundreds of LPU chips together. But when you do, you get latency characteristics no GPU cluster can match.
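The article’s own numbers make the tradeoff concrete. A quick sketch (the 70 GB model size is a hypothetical example, roughly 70B parameters at 8-bit; the per-chip SRAM and bandwidth figures are the ones quoted above, and real deployments would also need room for activations and KV cache):

```python
import math

SRAM_PER_LPU_MB = 230  # on-chip SRAM per LPU, per the figures above

def lpus_needed(model_size_gb: float) -> int:
    """Minimum chips required just to hold the weights in SRAM."""
    return math.ceil(model_size_gb * 1000 / SRAM_PER_LPU_MB)

# A hypothetical 70 GB model:
print(lpus_needed(70.0))    # 305 chips just for the weights

# The bandwidth ratio quoted above: 80 TB/s vs the H100's 3.35 TB/s
print(round(80 / 3.35, 1))  # ~23.9x
```

Hundreds of chips per model sounds extravagant until you remember the other side of the ledger: every one of those chips reads weights at SRAM speed.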
The second innovation is deterministic execution. GPUs use hardware schedulers that dynamically manage instructions at runtime — deciding which threads run where, arbitrating memory access, handling cache misses. This creates unpredictable tail latency and jitter. Groq eliminates all of it. The compiler schedules every memory load, every operation, every packet transmission at compile time. No cache misses (no cache). No runtime decisions (no hardware autonomy).
The result? Groq demonstrations have produced 10,000 “thought tokens” in roughly two seconds. That’s the kind of speed that makes real-time AI agents — ones that reason, plan, and act autonomously — actually viable.
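Why raw token speed changes what’s viable: an autonomous agent isn’t one generation, it’s a chain of them. A sketch with illustrative numbers (the 10-step, 1,000-token agent and the 50 tokens/s GPU baseline are assumptions for comparison; the ~5,000 tokens/s comes from the demo figure above):

```python
# Latency of an agent that chains sequential reasoning steps.
# Each step must finish before the next begins, so rates compound.

def chain_latency_s(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    return steps * tokens_per_step / tokens_per_s

demo_rate = 10_000 / 2  # ~5,000 tok/s, from the demo figure above
gpu_rate = 50           # assumed single-stream GPU decode rate

print(chain_latency_s(10, 1000, demo_rate))  # 2.0 s  -- interactive
print(chain_latency_s(10, 1000, gpu_rate))   # 200.0 s -- not interactive
```

At one speed the agent feels like a conversation; at the other, like a batch job.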
LPX: The New Rack-Scale Platform
Nvidia isn’t just slapping a Groq chip into a server. It’s building an entirely new rack-scale platform called LPX.
Initial LPX racks integrate 64 LPUs, packaged as 32 RealScale ASIC tiles. Groq’s RealScale network uses a direct, switch-less topology where each LPU connects directly to others in a dragonfly-plus design. Because each chip operates in a plesiochronous regime — clock oscillators with small, predictable drift — the compiler can precompute packet timing for every transfer. The result: 576 LPUs operating as if they shared a single memory space.
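The article’s figures imply the scale of that shared space. A quick calculation, using the 230 MB per chip cited earlier:

```python
# Aggregate on-chip SRAM across a 576-LPU RealScale domain,
# at 230 MB per chip (figures quoted in the article).

lpus = 576
sram_gb = lpus * 230 / 1000
print(sram_gb)  # 132.48 GB of SRAM behaving as one address space
```

Call it roughly 130 GB of memory that all runs at SRAM speed — enough for a very large model with no HBM in the path.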
Around GTC, Nvidia plans to introduce an enhanced version with 256 LPUs per rack — a fourfold increase. Paired with new 52-layer M9 Q-glass PCBs and larger on-chip memory, LPX will sit alongside Rubin GPUs as a complementary inference solution.
This is the key strategic signal. Nvidia isn’t replacing GPUs with LPUs. It’s creating a two-architecture strategy: GPUs for training, LPUs for low-latency inference. An admission that one chip can’t rule them all.
The Ripple Effects
Cloud providers like Amazon, Google, and Microsoft have all been developing custom inference chips (Inferentia, TPUs, Maia). Nvidia entering the dedicated inference market with Groq’s technology raises the stakes dramatically. If LPX delivers, those custom silicon investments get harder to justify.
Startups like Cerebras and SambaNova just lost their competitive moat. Their pitch was “we do inference better than Nvidia.” Now Nvidia is coming for that market with a $20 billion technology acquisition and unmatched distribution.
For anyone building with AI, more efficient inference means lower costs per query, faster response times, and viable always-on agents. If LPX delivers 10x the memory bandwidth at lower power, the economics of AI applications shift fundamentally.
For the energy conversation, a dedicated inference chip using less energy per token could ease the political and environmental pressure that’s been building around AI’s power consumption.
The Strategic Paradox
Let’s be honest about what Nvidia just did. It spent $20 billion to buy technology that proves its core product isn’t good enough for a growing market segment. That takes either remarkable strategic clarity — or a recognition that standing still means falling behind.
Analyst Holger Mueller of Constellation Research nailed it: at last year’s GTC, Huang positioned exploding inference demand as a GPU win. Now he’s conceding that a different architecture is needed.
The vulnerability is real. Nvidia is essentially validating what competitors have been saying for years — GPUs aren’t optimal for inference. LPX needs to be extraordinary out of the gate, or it risks confirming the narrative that specialized chips are the future and GPUs are yesterday’s architecture.
What to Watch March 16
When Huang takes the stage:
- Benchmarks: Did Nvidia improve Groq’s architecture, or just rebadge it?
- Pricing: LPX priced at GPU levels won’t disrupt anything. The cost-per-token story has to be dramatic.
- The Rubin relationship: How does LPX fit alongside Nvidia’s next-gen GPU architecture?
- OpenAI’s commitment: Reports say OpenAI is dedicating 3 GW of capacity to the new chip. What are they building with that kind of firepower?
The era of “GPUs for everything” is over. Nvidia just said so — with a $20 billion receipt to prove it.