Nvidia’s deepest moat was never the silicon. It was CUDA — the software ecosystem that made every AI developer on Earth, including China’s, completely dependent on Nvidia’s way of doing things. You could build a faster chip, but if developers had to rewrite their entire codebase to use it? Dead on arrival.

Huawei just found the side door.

The Ascend 950PR, paired with Huawei’s overhauled CANN Next software stack, has reportedly won over ByteDance and Alibaba — two of China’s largest AI consumers. After years of Beijing practically begging its tech giants to go domestic, Huawei may have finally built a chip they actually want to use.

The 910C Failure Set the Stage

Huawei’s previous flagship, the Ascend 910C, was supposed to be China’s answer to Nvidia. On paper, competitive. In practice, China’s biggest tech companies largely avoided it.

The reason wasn’t patriotism failing — it was pragmatism winning. Migrating from CUDA to Huawei’s proprietary CANN framework meant rewriting code, retraining teams, and accepting uncertain performance. Even with government subsidies, the switching costs were simply too high for companies running AI at scale.

The result embarrassed Beijing’s self-sufficiency agenda. Despite export controls, Chinese hyperscalers found workarounds — renting offshore compute, stockpiling chips before bans, allegedly using smuggled hardware. Anything to avoid leaving CUDA behind.

If You Can’t Beat CUDA, Become CUDA

CANN Next is the real breakthrough here, not the chip itself.

Instead of asking developers to learn a new programming model, Huawei adopted CUDA’s own paradigms — thread blocks, warps, kernel launches, the entire SIMT architecture that Nvidia developers know by heart. CANN Next treats CUDA as a language standard while optimizing execution for Ascend hardware underneath.

If CUDA is English, Huawei stopped trying to teach everyone Mandarin and built an English-speaking interface that thinks in Mandarin behind the scenes. Developers write code that looks and feels like CUDA. The compiler handles the translation. Near-drop-in compatibility — exactly what ByteDance and Alibaba needed to hear.
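
To make that concrete, here is a toy sketch of the SIMT launch model those paradigms describe: a kernel body that runs once per thread, indexed by block and thread IDs, with a bounds guard. This is a plain-Python emulation for illustration only; `launch` and `vector_add` are names invented here, not CANN Next's or CUDA's actual API.

```python
# Plain-Python emulation of the CUDA launch model: a "kernel" runs once per
# thread, indexed by (block, thread), with a bounds guard like real kernels.
# Illustration only; "launch" and "vector_add" are invented names here, not
# CANN Next's or CUDA's actual API.

def launch(kernel, grid_dim, block_dim, *args):
    """Sequentially emulate a 1-D grid of 1-D thread blocks."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)

def vector_add(block_idx, thread_idx, block_dim, a, b, out):
    """CUDA-style kernel body: each thread owns one output element."""
    i = block_idx * block_dim + thread_idx  # global thread index
    if i < len(out):                        # guard threads past the end
        out[i] = a[i] + b[i]

a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(a)
launch(vector_add, 2, 4, a, b, out)  # 2 blocks x 4 threads cover 5 elements
print(out)  # [11.0, 22.0, 33.0, 44.0, 55.0]
```

The bet behind CANN Next is that code organized this way, with its grid/block/thread indexing and guards, ports with minimal rewriting: the compiler, not the developer, absorbs the hardware differences.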

The Numbers That Matter

The 950PR is purpose-built for inference — running trained models at scale, not training new ones. That’s a strategic choice reflecting where China’s AI industry is heading.

  • 1 PFLOPS FP8 compute — optimized for low-precision inference math
  • 2 PFLOPS FP4 — for lighter workloads
  • 2.87x the compute performance of Nvidia’s H20 on prefill and recommendation tasks
  • 128GB HiBL 1.0 memory — Huawei’s own HBM equivalent, 1.6 TB/s bandwidth
  • 2 TB/s interconnect bandwidth for cluster scaling
  • Price: $6,900–$9,700 per card — significantly undercutting Nvidia’s data center offerings
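
Taking the listed figures at face value, the card-level price-per-compute arithmetic is straightforward. The sketch below uses only the numbers above; it ignores power, cooling, networking, and utilization, so treat it as a rough bound rather than a total-cost-of-ownership estimate.

```python
# Back-of-envelope arithmetic from the listed 950PR figures. Card list
# prices only: power, cooling, networking, and utilization are ignored,
# so this is a rough bound, not a total-cost-of-ownership estimate.

FP8_PFLOPS = 1.0                         # FP8 compute per card, as listed
PRICE_LOW, PRICE_HIGH = 6_900, 9_700     # listed per-card price range (USD)
UNITS_2026 = 750_000                     # planned 2026 volume

# Dollars per PFLOPS of FP8 compute, per card.
low = PRICE_LOW / FP8_PFLOPS
high = PRICE_HIGH / FP8_PFLOPS

# What the planned volume implies in aggregate.
total_pflops = UNITS_2026 * FP8_PFLOPS   # 750,000 PFLOPS = 750 EFLOPS
revenue_low = UNITS_2026 * PRICE_LOW
revenue_high = UNITS_2026 * PRICE_HIGH

print(f"${low:,.0f}-${high:,.0f} per FP8 PFLOPS per card")
print(f"planned fleet: {total_pflops / 1_000:,.0f} EFLOPS FP8")
print(f"implied list-price revenue: ${revenue_low / 1e9:.2f}B-${revenue_high / 1e9:.2f}B")
```

Even as a crude bound, that gives Chinese buyers a concrete number to weigh against Nvidia allocations they may not be able to get.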

Huawei plans to ship 750,000 units in 2026. Samples went out in January, mass production starts in April, and full-scale shipments follow in H2.

Perfect Timing, Terrible for Nvidia

Washington’s export controls have created regulatory whiplash for Chinese hyperscalers. Never knowing which Nvidia chips they’ll be allowed to buy next quarter — or whether existing approvals will get yanked — has made “good enough and available” enormously attractive.

The 950PR checks three boxes simultaneously: competitive performance for inference, CUDA compatibility that eliminates migration pain, and no geopolitical strings attached. At aggressive pricing, the total cost of ownership math could work heavily in Huawei’s favor for Chinese companies building massive inference clusters.

The Vertical Integration Play

One detail that’s easy to overlook: the 950PR’s HBM variant uses Huawei’s own HiBL 1.0 memory technology.

HBM has been another supply chain chokepoint. SK Hynix and Samsung dominate production, and further export restrictions on memory tech remain a constant threat. By building its own HBM equivalent, Huawei is de-risking the entire supply chain. Chip, memory, software stack — all in-house. That’s the Apple playbook: control everything, depend on no one’s export policy.

What This Really Means

The Ascend 950PR represents something bigger than one chip launch. It’s potentially the moment China’s semiconductor ecosystem becomes self-sustaining for AI workloads.

The bull case for Nvidia always included China as a massive captive market. Export controls were supposed to slow China’s AI development. Instead, they accelerated domestic chip development by forcing Huawei to solve problems it might have punted for years if Nvidia chips kept flowing freely.

The 950PR doesn’t need to match Nvidia’s Blackwell or Vera Rubin on raw performance. It needs to be good enough, available, and frictionless for Chinese developers. With CANN Next’s CUDA compatibility, strong inference specs, self-built memory, and aggressive pricing, it appears to be all three.

The Inference Shift Is Global

This connects to a deeper trend. The entire AI industry is pivoting from training to inference. Training a frontier model is a one-time event. Running it billions of times for millions of users is continuous and ever-growing.
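
A toy model makes the asymmetry obvious: a fixed training cost is eventually dwarfed by recurring serving costs. Every number below is a hypothetical placeholder, not an estimate for any real model or provider.

```python
# Toy illustration of why inference economics dominate: training is paid
# once, serving is paid on every query. All numbers are hypothetical
# placeholders, not estimates for any real model or provider.

TRAIN_COST = 50_000_000         # one-time training run, USD (hypothetical)
COST_PER_1K_QUERIES = 0.50      # serving cost per 1,000 queries, USD (hypothetical)
QUERIES_PER_DAY = 200_000_000   # sustained traffic (hypothetical)

daily_inference = QUERIES_PER_DAY / 1_000 * COST_PER_1K_QUERIES
breakeven_days = TRAIN_COST / daily_inference

print(f"daily inference spend: ${daily_inference:,.0f}")
print(f"cumulative inference passes training after {breakeven_days:,.0f} days")
```

Under these placeholder numbers, serving spend overtakes the training bill within two years, and it keeps growing with traffic while the training cost stays flat. That is the curve inference-first hardware is chasing.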

Huawei, Arm (with its new AGI CPU), Google (with TurboQuant compression) — everyone’s optimizing for inference. The companies that win inference hardware capture the most value as AI moves from labs to production at scale.

Huawei’s bet: China’s inference market is large enough and underserved enough that a “good enough” chip with great software compatibility captures enormous share. Given China’s accelerating AI deployment — open-source models, agentic frameworks, massive consumer platforms — that bet looks increasingly sound.

The Real Test Starts Soon

Full-scale shipments begin H2 2026. Can Huawei hit 750,000 units? Will CANN Next hold up at scale? How does Nvidia respond — aggressive pricing, lobbying for looser export controls, or doubling down on performance advantages that matter less for inference?

One thing seems clear: Nvidia’s unchallenged dominance in China is ending. Not with a bang, but with a CUDA-compatible compiler.