A single research paper from Google just wiped billions off memory chip stocks across three continents. No earnings miss. No supply chain disruption. Just math.
The algorithm is called TurboQuant. If it delivers on its promises, it rewrites the economics of running every major AI model on the planet. We’re talking 6x less memory, 8x faster inference, and zero accuracy loss.
The Bottleneck Everyone Ignored
Every AI conversation eats memory. When you chat with an AI, the model stores your context in a key-value (KV) cache — its working memory. Longer conversations mean bigger caches, which means more expensive GPU memory consumed.
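Some back-of-envelope arithmetic makes the bottleneck concrete. The sketch below uses an illustrative Llama-3.1-8B-like shape (32 layers, 8 grouped-query KV heads, head dimension 128); these numbers are assumptions for illustration, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Rough KV cache footprint: keys AND values (hence the 2x),
    stored per layer, per KV head, per token."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_value

# Illustrative Llama-3.1-8B-like shape, 16-bit values, 128k-token context.
size = kv_cache_bytes(seq_len=128_000, n_layers=32, n_kv_heads=8,
                      head_dim=128, bytes_per_value=2)
print(f"{size / 2**30:.1f} GiB")  # -> 15.6 GiB
```

Under these assumptions a single 128k-token conversation eats roughly 15.6 GiB of GPU memory for the cache alone, before the model weights are even loaded.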
This is why running frontier models costs a fortune. It’s why your phone chokes on long AI conversations. And it’s why data centers are devouring high-bandwidth memory chips like there’s no tomorrow.
TurboQuant kills this bottleneck. Published by Google Research on March 24, the algorithm compresses each value in the KV cache from 16 bits to just 3 bits. On Nvidia H100 GPUs, that translates to up to 8x faster inference throughput.
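A quick sanity check on the raw bit ratio (the cache size used below is a hypothetical figure for illustration, not from the paper):

```python
fp16_bits, turbo_bits = 16, 3
ratio = fp16_bits / turbo_bits
print(f"{ratio:.2f}x")            # -> 5.33x from the bit ratio alone

# Applied to a hypothetical 40 GB 16-bit KV cache:
print(f"{40 / ratio:.1f} GB")     # -> 7.5 GB
```

The bit ratio alone gives about 5.3x; rounder headline figures presumably also fold in the metadata that the algorithm eliminates.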
The kicker: no retraining required. Drop it into existing models as a plug-and-play optimization layer. Done.
How It Actually Works
TurboQuant isn’t brute-force compression. It’s a two-stage pipeline that solves problems traditional quantization has struggled with for years.
Stage 1 — PolarQuant: Instead of standard Cartesian coordinates, the algorithm converts KV cache vectors into polar coordinates, separating each into a magnitude and a set of angles. These angular distributions turn out to be highly concentrated and predictable, which eliminates the per-block normalization constants that typically eat 1–2 extra bits per number and undermine the whole compression effort.
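The coordinate change can be sketched in a few lines. This is plain hyperspherical conversion, one magnitude plus d-1 angles, mirroring the idea rather than the paper's exact parameterization or quantization grid:

```python
import numpy as np

def to_polar(x):
    """Split a vector into magnitude + (d-1) angles (hyperspherical form).
    Angles live in [0, pi], except the last, folded into [0, 2*pi)."""
    x = np.asarray(x, dtype=float)
    r = np.linalg.norm(x)
    angles = np.empty(len(x) - 1)
    for i in range(len(x) - 1):
        angles[i] = np.arctan2(np.linalg.norm(x[i + 1:]), x[i])
    if x[-1] < 0:                      # sign fold keeps the map invertible
        angles[-1] = 2 * np.pi - angles[-1]
    return r, angles

def from_polar(r, angles):
    """Inverse of to_polar: rebuild the Cartesian vector."""
    x = np.empty(len(angles) + 1)
    s = r
    for i, a in enumerate(angles):
        x[i] = s * np.cos(a)
        s *= np.sin(a)
    x[-1] = s
    return x

v = np.array([1.0, 2.0, -2.0])
r, ang = to_polar(v)
print(np.allclose(from_polar(r, ang), v))  # -> True
```

Because the angles concentrate in a narrow band, a coarse uniform grid over them, plus one shared magnitude, can stand in for the per-block scale factors that standard quantization has to store.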
Stage 2 — QJL Error Correction: Residual error from Stage 1 gets compressed using the Johnson-Lindenstrauss Transform, a technique that preserves distances in high-dimensional space. This squeezes the error down to a single sign bit per dimension — essentially free.
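The sign-bit step can also be sketched: project with a shared random Gaussian (JL-style) matrix, keep one sign bit per output coordinate plus the key's norm, and rescale at query time to recover inner products. The estimator below follows the general QJL recipe, but the constants and shapes are illustrative, not lifted from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 20_000                  # m = number of sign bits (illustrative)
S = rng.standard_normal((m, d))    # shared random JL projection

def encode(k):
    # One sign bit per projected dimension, plus the key's norm.
    return np.sign(S @ k), np.linalg.norm(k)

def approx_dot(bits, k_norm, q):
    # For Gaussian s: E[sign(s.k) * (s.q)] = sqrt(2/pi) * cos(angle) * ||q||,
    # so rescaling by the key norm recovers <k, q> approximately.
    return k_norm * np.sqrt(np.pi / 2) * (bits @ (S @ q)) / m

k = rng.standard_normal(d)
q = k + 0.1 * rng.standard_normal(d)   # a query correlated with the key
bits, k_norm = encode(k)
print(k @ q, approx_dot(bits, k_norm, q))  # exact vs. estimated inner product
```

The attention scores a transformer needs are exactly such inner products, which is why one bit per dimension of residual can be "essentially free" in accuracy terms.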
Result: three bits per value. No overhead. No accuracy loss.
On benchmarks including Needle in a Haystack, LongBench, ZeroSCROLLS, and RULER, TurboQuant hit perfect or near-perfect scores on models like Llama-3.1-8B and Mistral-7B. An independent developer built a PyTorch implementation within hours, tested it on a consumer RTX 4090, and reportedly achieved identical outputs at 2-bit precision — pushing beyond what Google officially published.
Wall Street Didn’t Wait for Peer Review
Within hours, memory chip stocks started bleeding. Over two days, the damage spread globally:
- SK Hynix: Down ~6%
- Samsung Electronics: Down ~5%
- Kioxia: Down ~6%
- Western Digital: Down ~4.7%
- Micron Technology: Down ~3.4%
South Korea’s KOSPI index dropped as much as 3%, dragged down by SK Hynix and Samsung. Cloudflare CEO Matthew Prince called it “Google’s DeepSeek moment” — referencing the Chinese lab’s efficiency breakthroughs in early 2025 that triggered a massive tech selloff.
The investor logic is straightforward: if AI models run on 6x less memory, maybe those explosive HBM demand forecasts were overblown. Samsung had risen nearly 200% in a year. Micron and SK Hynix were up over 300%. TurboQuant put a question mark over all of it.
The Panic Might Be Premature
Several analysts are pushing back on the selloff narrative.
Ben Barringer at Quilter Cheviot told CNBC the move was largely profit-taking: “Memory stocks have had a very strong run and this is a highly cyclical sector. The Google TurboQuant innovation has added to pressure, but this is evolutionary, not revolutionary.”
Morgan Stanley’s Shawn Kim argued the impact should actually be positive — it addresses a critical bottleneck, improving overall AI hardware capabilities.
The most compelling counterargument comes from the Jevons paradox: making something more efficient doesn’t reduce total consumption. It often increases it. If running a frontier model costs 6x less memory, companies won’t just pocket the savings. They’ll run bigger models, serve more users, and tackle problems that were previously too expensive to attempt.
Ray Wang at SemiAnalysis put it bluntly: “When you address a bottleneck, you help AI hardware become more capable. When models become more powerful, you require better hardware to support them.”
Who Wins and Who Should Pay Attention
Startups: A company spending $50K/month on GPU inference might achieve similar throughput for under $10K. That’s the difference between a viable business and burning through your Series A.
On-device AI: If TurboQuant works on consumer hardware — and that RTX 4090 test suggests it does — phones, laptops, and budget machines could run genuinely capable local AI. The dream of powerful offline AI just got a lot more real.
Google: By open-sourcing this research (full paper at ICLR 2026 in Rio de Janeiro, April), Google sets the efficiency agenda for the entire industry while directly benefiting its own infrastructure — Search, YouTube recommendations, and ad targeting all use the same vector operations.
Efficiency Is Eating Hardware Margins
TurboQuant builds on two earlier papers from the same Google team: QJL (AAAI 2025) and PolarQuant (AISTATS 2026). This is years of work by Amir Zandieh and Vahab Mirrokni reaching its payoff.
It’s also part of a broader wave. DeepSeek’s efficiency innovations, Meta’s distillation work, and now Google’s compression breakthroughs are shifting the AI narrative from “bigger models, more compute” to “smarter algorithms, better efficiency.”
That shift determines who gets access to AI. When frontier models require millions in hardware, only Big Tech plays. When clever algorithms cut those costs by 6x, universities, startups, governments, and individuals get in the game.
The memory chip market will probably recover. But the message is unmistakable: in the race to make AI ubiquitous, software efficiency is eating hardware margins for breakfast.