What if instead of running an AI model on a chip, you turned the model into the chip?

That’s the bet Taalas just went public with — and the numbers are making the entire semiconductor industry sit up straight. This 25-person startup out of Toronto emerged from stealth with $169 million in funding and a working product called the HC1: a chip that hard-wires a large language model directly into silicon transistors. No software stack. No HBM memory. No liquid cooling. Just raw, physics-level inference.

The result: 17,000 tokens per second per user on Llama 3.1 8B. That’s roughly 8x faster than Nvidia’s B200, at 20x lower build cost and 10x less power consumption.

If those numbers hold at scale, this isn’t another AI chip press release. It’s a different category entirely.

One Transistor Does Everything

Traditional AI inference has an ugly bottleneck: shuffling model weights between memory and compute. It’s like a chef sprinting between the pantry and the stove for every ingredient. Taalas eliminates the trip entirely.
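The trip can be quantified with a simple roofline estimate: during single-user decode, a memory-bound accelerator must stream every weight from memory for each generated token, so throughput is capped at bandwidth divided by model size. The bandwidth and quantization figures below are illustrative assumptions, not any vendor’s published spec:

```python
# Back-of-the-envelope: the memory-bandwidth ceiling on single-user decode.
# Each generated token requires streaming all model weights from memory,
# so tokens/sec <= bandwidth / model_bytes. Numbers are illustrative.

params = 8e9                  # Llama 3.1 8B parameter count
bytes_per_weight = 0.5        # assume 4-bit quantized weights
model_bytes = params * bytes_per_weight   # 4 GB of weight traffic per token

hbm_bandwidth = 8e12          # assume ~8 TB/s of HBM bandwidth
ceiling = hbm_bandwidth / model_bytes

print(f"Weight traffic per token: {model_bytes / 1e9:.0f} GB")
print(f"Single-user decode ceiling: {ceiling:,.0f} tokens/sec")
# → Weight traffic per token: 4 GB
# → Single-user decode ceiling: 2,000 tokens/sec
```

Note that the ceiling lands in the same ballpark as the ~2,200 tokens/sec the B200 actually posts; eliminating the weight traffic entirely is what lets the HC1 blow past it.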

The HC1 encodes model weights directly into transistors using a mask ROM architecture. Founder Ljubisa Bajic told The Next Platform: “We have got this scheme where we can store four bits and do the multiply related to it — everything — with a single transistor.” One transistor stores a weight and computes with it. The density is, in Bajic’s word, insane.
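Taking the one-transistor-per-four-bits claim at face value, the transistor budget for weight storage is easy to sanity-check (the 4-bit-per-weight quantization is my assumption; the rest is arithmetic):

```python
# Sanity check of the density claim: one transistor stores 4 bits
# and performs the multiply associated with that weight.
params = 8e9             # Llama 3.1 8B
bits_per_weight = 4      # assumed 4-bit quantization
total_bits = params * bits_per_weight      # 32 billion bits of weights
transistors_for_weights = total_bits / 4   # 1 transistor per 4 bits

print(f"{transistors_for_weights / 1e9:.0f} billion transistors for weights")
# → 8 billion transistors for weights
```

For scale, Nvidia’s H100 packs roughly 80 billion transistors into a die of almost identical area, so reserving ~8 billion of a 6nm die’s budget for weights, with the rest left for compute and routing, is at least physically plausible.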

The chip sits on TSMC’s 6nm process with an 815 mm² die — roughly H100-sized. But while the H100 needs exotic HBM3 stacks, advanced packaging, liquid cooling, and massive power delivery, the HC1 is a single chip on a standard PCIe card. Plug it in and go.

The Benchmarks That Broke Reddit

Here’s the comparison that made r/LocalLLaMA lose its collective mind:

Chip           Tokens/sec per user
Taalas HC1     17,000
Nvidia B200    ~2,200
Cerebras       ~2,400
Groq (LPU)     ~1,200
SambaNova      ~1,000

All running Llama 3.1 8B with 1K input/1K output. These aren’t cherry-picked internal tests — Taalas compiled them from Nvidia’s published figures and Artificial Analysis benchmarks.
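Dividing the table’s own figures through makes the gaps concrete (the ~ values are the rounded public benchmarks quoted above):

```python
# Speedup of the HC1 over each competitor, from the table's own figures.
hc1 = 17_000
competitors = {
    "Nvidia B200": 2_200,
    "Cerebras":    2_400,
    "Groq (LPU)":  1_200,
    "SambaNova":   1_000,
}
for name, tps in competitors.items():
    print(f"{name}: {hc1 / tps:.1f}x")
# → Nvidia B200: 7.7x
# → Cerebras: 7.1x
# → Groq (LPU): 14.2x
# → SambaNova: 17.0x
```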

Users who tested the company’s public demo at chatjimmy.ai reported responses that felt instantaneous. “A wall of text in the blink of an eye,” one wrote. At 17,000 tokens per second, that’s not hyperbole — it’s math.
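The math is simple enough to do inline; the response lengths below are my own examples, not benchmark settings:

```python
# Wall-clock latency of a full response at the HC1's claimed rate.
tps = 17_000   # tokens per second per user

for tokens in (100, 1_000, 8_000):
    print(f"{tokens:>5} tokens in {tokens / tps * 1000:.0f} ms")
# →   100 tokens in 6 ms
# →  1000 tokens in 59 ms
# →  8000 tokens in 471 ms
```

Even an 8,000-token wall of text lands in under half a second, well inside what users perceive as instantaneous.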

The Obvious Catch

The HC1 runs one model. Period. It’s Llama 3.1 8B, permanently etched into silicon. You can’t swap in GPT-5. When Llama 4 drops, you need a new chip.

This is both the fatal flaw and the secret weapon. By committing to one model, Taalas optimizes at a level general-purpose hardware can never reach. Scalpel vs. Swiss Army knife.

The rigidity isn’t total, though. The HC1 pairs its hard-wired mask ROM with programmable SRAM supporting configurable context windows and LoRA adapters. The base model is locked; the behavior is tunable. And Taalas claims it can go from receiving an unseen model to shipping a working chip in two months through a streamlined workflow with TSMC.
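That division maps directly onto the standard LoRA formulation, y = Wx + BAx: the large base matrix W is frozen in mask ROM, while the small low-rank factors A and B sit in rewritable SRAM. A toy NumPy sketch of the idea (shapes, scales, and the `forward` helper are all illustrative, not Taalas’s design):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8                      # hidden size, LoRA rank (toy values)

W = rng.standard_normal((d, d))   # base weight: fixed, "etched in ROM"
A = rng.standard_normal((r, d)) * 0.01   # LoRA down-projection (SRAM)
B = np.zeros((d, r))                     # LoRA up-projection (SRAM)

def forward(x, scale=1.0):
    # y = W x + scale * B (A x): the base path is immutable;
    # behavior changes only through the small A/B adapters.
    return W @ x + scale * (B @ (A @ x))

x = rng.standard_normal(d)
assert np.allclose(forward(x), W @ x)    # B starts at zero: pure base model
B = rng.standard_normal((d, r)) * 0.01   # "load" a tuned adapter into SRAM
print("adapter shifts output by", np.linalg.norm(forward(x) - W @ x))
```

The base path never changes; swapping adapters in SRAM is all it takes to retarget the chip’s behavior within the limits of the frozen model.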

Why Now: The Inference Cost Crisis

The timing isn’t accidental. The AI industry is sprinting toward an inference cost wall that most people outside the industry don’t fully grasp.

Training happens once. Inference — actually running the model for millions of users — is the recurring bill that scales with every query, every agent action, every API call. As AI agents multiply (multi-step reasoning, tool use, autonomous workflows), inference demands are exploding.

The industry’s response so far: throw money at it. Nvidia acquired Groq for $20 billion. Meta is deploying “millions” more AI chips. Microsoft is spending $50 billion on AI infrastructure. All of it general-purpose hardware solving a problem that might have a specialized answer.

A 20x reduction in hardware cost and 10x reduction in power consumption would fundamentally reshape the economics of running AI at scale. That’s what Taalas is selling.
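To see what those ratios do to a deployment budget, here is a deliberately crude total-cost sketch. Every dollar and wattage figure is a hypothetical placeholder; only the 20x build-cost and 10x power ratios come from Taalas’s claims:

```python
# Illustrative 3-year cost of ownership for a fixed-model inference box.
# All absolute figures are hypothetical; only the 20x/10x ratios are claimed.
gpu_cost, gpu_watts = 40_000, 1_000       # placeholder GPU server figures
hc1_cost, hc1_watts = gpu_cost / 20, gpu_watts / 10

kwh_price = 0.10            # placeholder electricity price, $/kWh
hours = 3 * 365 * 24        # 3-year service life, running 24/7

def tco(capex, watts):
    # capital cost plus energy cost over the service life
    return capex + watts / 1000 * hours * kwh_price

print(f"GPU 3-yr TCO: ${tco(gpu_cost, gpu_watts):,.0f}")
print(f"HC1 3-yr TCO: ${tco(hc1_cost, hc1_watts):,.0f}")
# → GPU 3-yr TCO: $42,628
# → HC1 3-yr TCO: $2,263
```

And this ignores the throughput gap entirely: fold in roughly 8x more tokens per second per chip and the per-token economics diverge even further.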

The Team Makes the Bet Credible

Bajic founded Tenstorrent, designed hybrid CPU-GPU architectures at AMD, and worked as a senior architect at Nvidia. His co-founders carry similar pedigrees. These aren’t AI researchers dabbling in chip design — they’re veteran semiconductor architects who concluded the industry was solving the wrong problem.

They recently hired Paresh Kharya (formerly managing Google Cloud’s GPU and TPU hardware) as VP of Products. That’s a commercialization signal, not a science project signal.

What’s Coming

Taalas has already demonstrated a 30-chip cluster running DeepSeek R1 at 12,000 tokens/sec per user, strong evidence that the architecture scales to larger models. A mid-sized reasoning LLM on the HC1 platform ships this spring. A frontier-class model on the next-gen HC2 chip is planned for winter 2026.

The technology works — the demo is live, the benchmarks are published, the physics checks out. The real question is whether AI’s model churn is too fast for hard-wired silicon, or whether we’re entering an era where the best models stabilize enough for specialization to dominate.

If agentic AI takes off the way everyone predicts — millions of AI agents needing millisecond-latency inference around the clock — the case for hard-wired silicon gets very strong, very fast.

The ENIAC didn’t make computing ubiquitous. The transistor did. Maybe GPUs won’t make AI ubiquitous either.


Sources: Taalas Blog, The Next Platform, EE Times, Reuters