Is ARC-AGI a Valid Measure of AI General Intelligence? A Look at Its Limitations

Mar 5, 2026

The Problem

I watched OpenAI’s o3 model score 87.5% on ARC-AGI-1 in its high-compute configuration, and headlines started popping up claiming we’re nearing AGI. But when I dug deeper, I found something unsettling: the same model that aced ARC-AGI couldn’t reliably write a simple program or understand a conversation.

That contradiction bothered me. If ARC-AGI truly measures “general intelligence,” shouldn’t success correlate with other cognitive abilities?

So I started researching: Is ARC-AGI actually a valid measure of AI general intelligence?

What I found surprised me.

What ARC-AGI Actually Measures

Let me first explain what ARC-AGI is testing, because understanding the test is essential to understanding the criticisms.

ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was created by François Chollet in 2019 and redesigned as ARC-AGI-2 in 2025. The benchmark tests what Chollet calls fluid intelligence - specifically, the efficiency of skill acquisition on unknown tasks.

Here’s how it works:

Training Examples:          Test Input:     Expected Output:
[[0,1,0],[1,1,1],[0,1,0]]   [[1,0,1,0],     [[?,?,?,?],
[rotate this pattern]        [0,1,0,1],      [?,?,?,?],
                             [1,0,1,0],      [?,?,?,?],
                             [0,1,0,1]]      [?,?,?,?]]

The model sees several example transformations and must infer the rule to apply to a new input. This tests pattern recognition and rule induction in a visual-spatial domain.
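To make the "infer the rule, then apply it" loop concrete, here is a minimal sketch in Python. It is not how any real ARC solver works: the three candidate transformations, the `induce_rule` helper, and the training pairs are all hypothetical, chosen only for illustration. The sketch uses colour inversion rather than rotation because the plus-shaped training grid looks the same after rotating, so rotation could not be told apart from doing nothing.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]  # (input grid, output grid)


def rotate90(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_horizontal(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]


def invert_binary(g: Grid) -> Grid:
    """Swap 0 and 1 in every cell (assumes a two-colour grid)."""
    return [[1 - cell for cell in row] for row in g]


# Hypothetical hypothesis space. Real ARC tasks need far richer rules.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "rotate90": rotate90,
    "flip_horizontal": flip_horizontal,
    "invert_binary": invert_binary,
}


def induce_rule(train: List[Pair]) -> Optional[str]:
    """Return the first candidate that explains every training pair."""
    for name, fn in CANDIDATES.items():
        if all(fn(x) == y for x, y in train):
            return name
    return None


# Made-up training pairs that all follow the same hidden rule.
train: List[Pair] = [
    ([[0, 1, 0], [1, 1, 1], [0, 1, 0]], [[1, 0, 1], [0, 0, 0], [1, 0, 1]]),
    ([[1, 0], [0, 0]], [[0, 1], [1, 1]]),
]
test_input: Grid = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

rule = induce_rule(train)
prediction = CANDIDATES[rule](test_input) if rule else None
print(rule, prediction)
```

A brute-force search like this only works when the hidden rule happens to be in the candidate list. The benchmark is hard precisely because each task uses a rule the solver has not been handed in advance.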

The human baseline is 85% on ARC-AGI-1. When o3 hit 87.5%, it supposedly “surpassed human performance.”
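It helps to see what a percentage like this counts. The sketch below assumes exact-match grading (a predicted grid either equals the expected grid or it doesn't) and a cap of two attempts per task. The two-attempt cap and the function names are my assumptions for illustration, so check the official rules before relying on them. The eight toy tasks are invented data, not real results.

```python
from typing import List, Sequence, Tuple

Grid = List[List[int]]


def task_solved(attempts: Sequence[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """A task counts only if one of the first `max_attempts` guesses matches exactly."""
    return any(a == expected for a in attempts[:max_attempts])


def benchmark_score(results: Sequence[Tuple[Sequence[Grid], Grid]]) -> float:
    """Percentage of tasks solved. No partial credit for nearly-right grids."""
    if not results:
        return 0.0
    solved = sum(task_solved(attempts, expected) for attempts, expected in results)
    return 100.0 * solved / len(results)


# Toy data: 8 one-cell "tasks". Seven are solved on the second attempt;
# the last is only right on a third attempt, which does not count.
results = []
for i in range(8):
    expected = [[i]]
    wrong = [[i + 1]]
    attempts = [wrong, expected] if i < 7 else [wrong, wrong, expected]
    results.append((attempts, expected))

score = benchmark_score(results)
print(score)  # 7 of 8 tasks solved -> 87.5
```

Under this kind of grading, a score says how many puzzles were solved outright. It says nothing about how close the misses were or how much compute each attempt consumed.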

But here’s where my confusion started.

The Core Criticisms

I found three major criticisms of ARC-AGI’s validity as a general intelligence measure:

1. Narrow Scope

ARC-AGI tests only visual-spatial reasoning through synthetic grid puzzles. That’s it.

It completely ignores:

Language understanding and comprehension
Real-world knowledge
Common-sense reasoning
Long-term planning
Embodied reasoning (physical world interaction)
Social reasoning or theory of mind

When critics say ARC-AGI “equates success on a narrow set of small, synthetic grid-transformation puzzles with broad cognitive capability,” they’re pointing out this exact limitation.

Think about it: when was the last time you solved a grid transformation puzzle in your daily life? These tasks look nothing like the cognitive challenges humans actually face.

2. Synthetic vs. Real Intelligence

The tasks in ARC-AGI are artificially constructed. They lack what researchers call ecological validity - they don’t reflect real-world cognitive demands.

A model that can invert a 3x3 pixel grid doesn’t necessarily understand:

What objects are
Physical causality
Social dynamics
Temporal sequences
Goal-oriented planning

As one Reddit commenter put it: “success on synthetic puzzles doesn’t translate to broad cognitive capability.”

3. Missing Cognitive Dimensions

ARC-AGI doesn’t measure:

Long-term planning - All tasks are immediate, single-step transformations
Recursive or hierarchical complexity - No nested goals or sub-goal decomposition
Knowledge accumulation - ARC-AGI specifically measures skill-acquisition efficiency, NOT learned knowledge
Error recovery - Tasks don’t require recovery from mistakes

This is perhaps the most damning criticism: ARC-AGI measures ONE specific type of intelligence (fluid reasoning), not “general” intelligence.

What ARC-AGI Does Well

Now, I don’t want to be unfair to ARC-AGI. The benchmark has genuine strengths:

Successfully Exposes Benchmark Gaming

This is ARC-AGI’s killer feature. By design, it resists “benchmaxxing”, the practice of optimizing models specifically for benchmark test sets.

The private test set and novel task generation make it nearly impossible to game. Models can’t memorize answers because they’ve never seen the test tasks before.

I think this is genuinely valuable. If we can’t trust benchmark scores, we have no objective way to measure AI progress.

Measures Generalization

ARC-AGI directly tests the “essence of intelligence” - generalization to novel situations. This is philosophically aligned with how we think about intelligence: the ability to apply knowledge to new problems.

Tests Efficiency, Not Memorization

Unlike tests that reward accumulated knowledge, ARC-AGI rewards learning efficiency. A model that learns quickly from few examples scores higher than one that requires many examples.

This aligns with Chollet’s definition: “intelligence is the efficiency of skill-acquisition on unknown tasks.”

The Validity Debate

Here’s my takeaway from this research:

ARC-AGI is a valuable but limited tool.

It effectively measures one dimension of intelligence (fluid reasoning) and does an excellent job preventing benchmark gaming. But it cannot be considered a complete measure of general intelligence.

The debate really comes down to definitions:

What ARC-AGI Measures           What General Intelligence Requires
Novel pattern recognition       Language understanding
Short-horizon reasoning         Long-term planning
Visual-spatial rules            Physical world reasoning
Skill acquisition efficiency    Knowledge accumulation
Single-step transformations     Hierarchical goal decomposition

No single benchmark can capture the full spectrum of AGI. ARC-AGI should be viewed as one piece of a larger evaluation puzzle.

Alternative Perspectives

Some researchers argue that ARC-AGI’s narrow focus is actually a feature, not a bug. By isolating one cognitive dimension, it provides cleaner measurement than broader benchmarks.

Other benchmarks address different dimensions:

GAIA - Tests real-world multi-step tasks requiring external knowledge
AGIEval - Evaluates human-level reasoning in diverse domains
HumanEval - Focuses on code generation capabilities

The honest answer is: we need multiple benchmarks evaluating diverse cognitive dimensions to get a complete picture of AI capabilities.
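One way to put this into practice is to report a profile across benchmarks instead of a single headline number. The sketch below is only an illustration: the `capability_profile` helper is hypothetical, and the scores for the imaginary model are made-up placeholders, not measurements of any real system.

```python
from typing import Dict, List, Union

Summary = Dict[str, Union[float, str, List[str]]]


def capability_profile(scores: Dict[str, float], threshold: float) -> Summary:
    """Summarize per-benchmark scores without hiding the weak spots."""
    if not scores:
        raise ValueError("need at least one benchmark score")
    weakest = min(scores, key=lambda name: scores[name])
    return {
        "mean": sum(scores.values()) / len(scores),
        "weakest": weakest,
        "below_threshold": sorted(n for n, s in scores.items() if s < threshold),
    }


# Placeholder numbers for an imaginary model, for illustration only.
hypothetical_scores = {
    "ARC-AGI": 80.0,
    "GAIA": 40.0,
    "AGIEval": 60.0,
    "HumanEval": 90.0,
}

profile = capability_profile(hypothetical_scores, threshold=50.0)
print(profile)
```

The mean alone would make this imaginary model look uniformly decent. The `weakest` and `below_threshold` fields are what expose the gap a single-benchmark headline would miss.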

Conclusion

I went into this research expecting to find a clear answer. Instead, I found a nuanced debate.

ARC-AGI is not a complete measure of general intelligence. That much is clear. But it’s also not useless. It measures something real and difficult: the ability to acquire skills on novel tasks efficiently.

The criticisms are valid: success on synthetic grid puzzles doesn’t automatically translate to broad cognitive capability. But the benchmark wasn’t designed to measure everything - just one specific aspect of intelligence.

If you’re evaluating AI systems, use ARC-AGI alongside other benchmarks. And when you see headlines claiming “AI surpasses human intelligence” based on a single benchmark, be skeptical. Progress is real, but it’s more complicated than any single number can capture.

My advice for beginners: Don’t get fixated on any single benchmark score. Look at performance across multiple evaluations, and understand what each benchmark actually tests. That’s the only way to get an accurate picture of AI capabilities.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 ARC-AGI Official Documentation
👨‍💻 ARC Prize 2024 Technical Report
👨‍💻 François Chollet: On the Measure of Intelligence
👨‍💻 Arc Prize Leaderboard
👨‍💻 GAIA Benchmark
👨‍💻 François Chollet Twitter Thread on ARC-AGI Validity
