The AI industry spent last week congratulating itself. Jensen Huang declared on Lex Fridman’s podcast that “we’ve achieved AGI.” Arm named its new data center chip the “AGI CPU.” Sam Altman said they’ve “basically built AGI.”

Then François Chollet’s ARC Prize Foundation dropped ARC-AGI-3, and the results were humiliating.

Google’s Gemini 3.1 Pro — the best performer — scored 0.37%. GPT-5.4 managed 0.26%. Claude Opus 4.6 hit 0.25%. Grok-4.20 scored a clean, round zero.

Humans? They solved 100% of the environments. Without instructions. Without training. Without even being told what the goal was.

Not Your Typical Benchmark

ARC-AGI-3 doesn’t test trivia recall, coding ability, or PhD-level math. The ARC Prize Foundation built an in-house game studio and created 135 original interactive environments from scratch. The concept: drop an AI agent into an unfamiliar game-like world with zero instructions, zero stated goals, and no description of the rules. Explore. Figure it out. Solve it.

Any curious five-year-old can do this. Poke around, test boundaries, find the pattern, crack the puzzle. It’s intelligence so fundamental we barely recognize it as intelligence.

Previous ARC versions tested static visual puzzles. ARC-AGI-1 (2019) was eventually conquered by test-time training. ARC-AGI-2 lasted about a year before Gemini 3.1 Pro hit 77.1%. The labs are phenomenally good at saturating benchmarks they can train against.

Version 3 was designed to prevent exactly that.

A Scoring System That Punishes Brute Force

ARC-AGI-3 uses RHAE — Relative Human Action Efficiency. The baseline is the second-best human performance out of ten first-time players per environment (top player excluded to filter outliers).

Here’s where it gets brutal: the formula squares the penalty for inefficiency. If a human completes a task in 10 actions and the AI takes 100, the AI doesn’t get 10% — it gets 1%. Wandering, backtracking, and brute-forcing get punished hard. Being faster than the human earns zero bonus — per-level scores cap at 1.0.
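The squared-ratio scoring described above can be sketched in a few lines. This is a reconstruction from the article's own example (human: 10 actions, AI: 100 actions, score: 1%), not the Foundation's published formula; the function name and exact aggregation are assumptions.

```python
def rhae_level_score(human_actions: int, ai_actions: int) -> float:
    """Per-level Relative Human Action Efficiency, reconstructed
    from the text: the efficiency ratio (human baseline actions
    divided by AI actions) is squared, and scores cap at 1.0 so
    beating the human baseline earns no bonus.

    NOTE: an assumed formula, inferred from the article's worked
    example -- not an official implementation.
    """
    if ai_actions <= 0:
        raise ValueError("ai_actions must be positive")
    ratio = human_actions / ai_actions
    return min(1.0, ratio ** 2)

# The article's example: human solves in 10 actions, AI takes 100.
# The AI gets roughly 0.01 (1%), not 10% -- inefficiency is squared.
example = rhae_level_score(10, 100)

# Faster than the human baseline: capped at 1.0, no bonus.
capped = rhae_level_score(10, 5)
```

The squaring is what makes brute force so costly: a 10x inefficiency becomes a 100x score penalty, so wandering through permutations collapses toward zero.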

This targets exactly what current AI systems do: throw compute at problems, try every permutation, stumble into answers through volume. ARC-AGI-3 says: if you can’t be efficient like a human, your solution doesn’t count.

The 97-Point Paradox

The most revealing data point in the entire report: researchers at Duke University built a custom harness for Claude Opus 4.6 and tested it on a known environment variant. With the harness, Claude scored 97.1%. Without it, on an unfamiliar environment, it scored 0%.

A 97-point swing based entirely on whether humans pre-built the strategy.

The official leaderboard tests models via API with identical system prompts and no custom tooling. The reasoning is straightforward: the benchmark measures the AI’s general intelligence, not the human intelligence that went into building a task-specific wrapper.

A separate community leaderboard allows harness-driven results. The best agent in the month-long developer preview scored 12.58%. But the Foundation explicitly warns against interpreting these results as AGI progress.

Chollet put it simply on X: “The G in AGI stands for ‘general.’ General intelligence doesn’t mean being specifically trained for a wide range of tasks. It means facing any new task and solving it independently.”

The Marketing Term Problem

“AGI” has become a marketing term. When Huang says AGI is here, he’s not using the same definition as researchers. When companies name products “AGI CPU” or build labs for “ASI,” they’re selling futures, not describing present reality.

Chollet sees only two coherent positions: either you believe AGI is possible — in which case a true AGI system will eventually solve ARC-AGI-3, because normal humans can — or you believe AI is fundamentally an automation tool that will always need human intervention for every new task.

There’s no middle ground where current systems are “basically AGI” but can’t handle what a kindergartner does instinctively.

Some engineers have argued that the JSON-based input format disadvantages models. The Foundation rejected this: “Frame content perception and API format are not limiting factors for frontier model performance on ARC-AGI-3.” The real gap lies in reasoning and generalization, not perception.

Why ARC’s Track Record Matters

Previous ARC versions were right about what was coming.

ARC-AGI-1 was arguably the first benchmark to surface the breakthrough of frontier reasoning systems like OpenAI’s o3, at a time when most other benchmarks were already saturated. ARC-AGI-2 captured the rapid progress of reasoning models and scaffolding techniques now deployed in production tools like Claude Code and Codex.

Both were eventually saturated, but getting there required genuine breakthroughs.

ARC-AGI-3 measures the next open gap: agentic intelligence — the ability to navigate unfamiliar environments without specific training. It’s the only unsaturated general agentic intelligence benchmark as of March 2026.

If history rhymes, the techniques that crack ARC-AGI-3 will define the next major wave of AI capabilities. Right now, nobody’s even close.

$2 Million on the Line

The ARC Prize Foundation has put up $2 million across three competition tracks on Kaggle. Every winning solution must be open-sourced. You can play 25 of the environments yourself and see how easy they are for humans.

110 of the 135 environments are kept private. There’s no dataset to memorize, no training data to overfit on, no benchmark to game.

The Real Gap

The models are genuinely remarkable. They write code, pass medical exams, generate photorealistic images, and hold conversations that feel human. The progress has been staggering.

But ARC-AGI-3 reveals the gap between “impressively capable on trained tasks” and “generally intelligent.” It’s the difference between a chess grandmaster who can’t figure out tic-tac-toe with slightly different rules and a child who’s never seen either game but picks up both in minutes.

The question isn’t whether AI will crack ARC-AGI-3 — it almost certainly will. The question is how. More data and bigger models? Architectural breakthroughs? Something fundamentally new?

Whatever the answer, one thing is clear: we’re not there yet. And maybe we should stop saying we are.