
AWS and Cerebras Are Ripping AI Inference Apart — On Purpose

The biggest bottleneck in AI isn’t training anymore. It’s inference — the moment a model actually does something useful. And AWS just partnered with Cerebras Systems to attack that bottleneck with an approach nobody has tried at this scale. The deal: Cerebras’ massive wafer-scale CS-3 chips will sit inside AWS data centers, accessible through Amazon Bedrock. The promise: 5x faster inference. The method: tearing the inference pipeline in half.

Splitting the Brain

Traditional AI inference runs both stages on the same GPU. You send a prompt, the chip processes it (prefill), then generates a response token by token (decode). One chip, both jobs. ...
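The two stages the excerpt describes have very different hardware profiles — prefill is compute-bound (one parallel pass over the whole prompt), decode is memory-bandwidth-bound (one token at a time). A minimal sketch of what "tearing the pipeline in half" means, with illustrative class names (`PrefillWorker`, `DecodeWorker` are not any vendor's actual API):

```python
# Conceptual sketch of disaggregated inference: prefill and decode run on
# separate workers instead of sharing one chip. Purely illustrative — not
# the AWS/Cerebras implementation.

from dataclasses import dataclass, field

@dataclass
class KVCache:
    """Key/value state produced by prefill, consumed by decode."""
    tokens: list[str]
    state: dict = field(default_factory=dict)

class PrefillWorker:
    """Compute-bound stage: processes the whole prompt in one parallel pass."""
    def run(self, prompt: str) -> KVCache:
        tokens = prompt.split()
        # Stand-in for attention over the full prompt.
        return KVCache(tokens=tokens, state={"len": len(tokens)})

class DecodeWorker:
    """Memory-bandwidth-bound stage: emits one token at a time."""
    def run(self, cache: KVCache, max_new_tokens: int) -> list[str]:
        out = []
        for i in range(max_new_tokens):
            # Each step reads the cache; a real decoder would also extend it.
            out.append(f"tok{cache.state['len'] + i}")
        return out

# Once split, each stage can be scheduled on a different hardware pool —
# e.g. prefill on wafer-scale chips, decode elsewhere.
cache = PrefillWorker().run("why is the sky blue")
generated = DecodeWorker().run(cache, max_new_tokens=3)
```

The payoff of the split is that the KV cache becomes the hand-off point: once prefill ships it over, decode never touches the prompt again, so each stage can scale independently.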

March 22, 2026 · 4 min · DBBS Tech

Nvidia GTC 2026: Vera Rubin, a $1 Trillion Bet, and the Dawn of AI's Inference Era

Jensen Huang stood in front of 18,000 people at San Jose’s SAP Center on Monday, wearing his signature black leather jacket, and casually dropped a number that would make most Fortune 500 CEOs choke on their coffee: $1 trillion. That’s the revenue opportunity Nvidia now sees for its AI chips through 2027 — doubled from the $500 billion estimate it gave investors just last month. And after a nearly three-hour keynote that covered everything from space-based data centers to Disney robots to the future of gaming graphics, one thing is crystal clear: Nvidia isn’t just riding the AI wave anymore. It’s building the ocean. ...

March 18, 2026 · 5 min · DBBS Tech

Nvidia Just Admitted GPUs Aren't Enough — Its $20B Groq Bet Changes Everything

For a decade, Nvidia sold the world a simple story: GPUs are all you need. Training? GPUs. Inference? Also GPUs. That story built a $3 trillion empire. On March 16 at GTC 2026 in San Jose, Jensen Huang is expected to blow it up himself. Nvidia will reportedly unveil a dedicated inference processor — not a GPU — built on technology from Groq, the inference startup it absorbed in a $20 billion deal last December. OpenAI is lined up as the first major customer. And the implications for the entire AI hardware ecosystem are enormous. ...

March 4, 2026 · 6 min · DBBS Tech

Taalas HC1: The Chip That Bakes AI Models Directly Into Silicon at 17,000 Tokens Per Second

What if instead of running an AI model on a chip, you turned the model into the chip? That’s the bet Taalas just went public with — and the numbers are making the entire semiconductor industry sit up straight. This 25-person startup out of Toronto emerged from stealth with $169 million in funding and a working product called the HC1: a chip that hard-wires a large language model directly into silicon transistors. No software stack. No HBM memory. No liquid cooling. Just raw, physics-level inference. ...

February 22, 2026 · 5 min · DBBS Tech