Is ARC-AGI a Valid Measure of AI General Intelligence Limitations
The Problem
I watched OpenAI’s o3 model achieve 87.5% on ARC-AGI-1 in its high-compute configuration, and headlines started popping up claiming we’re nearing AGI. But when I dug deeper, I found something unsettling: the same model that aced ARC-AGI couldn’t reliably write a simple program or understand a conversation.
That contradiction bothered me. If ARC-AGI truly measures “general intelligence,” shouldn’t success correlate with other cognitive abilities?
So I started researching: Is ARC-AGI actually a valid measure of AI general intelligence?
What I found surprised me.
What ARC-AGI Actually Measures
Let me first explain what ARC-AGI is testing, because understanding the test is essential to understanding the criticisms.
ARC-AGI (Abstraction and Reasoning Corpus for Artificial General Intelligence) was created by François Chollet in 2019 and redesigned as ARC-AGI-2 in 2025. The benchmark tests what Chollet calls fluid intelligence - specifically, the efficiency of skill acquisition on unknown tasks.
Here’s how it works:
    Training Examples:           Test Input:    Expected Output:
    [[0,1,0],[1,1,1],[0,1,0]]    [[1,0,1,0],    [[?,?,?,?],
    [rotate this pattern]         [0,1,0,1],     [?,?,?,?],
                                  [1,0,1,0],     [?,?,?,?],
                                  [0,1,0,1]]     [?,?,?,?]]
The model sees several example transformations and must infer the rule to apply to a new input. This tests pattern recognition and rule induction in a visual-spatial domain.
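To make the "infer the rule, then apply it" loop concrete, here is a minimal sketch of my own. It assumes a tiny, hand-picked set of candidate transformations (`rotate_cw`, `flip_horizontal`, `invert`) and made-up training pairs. Real ARC tasks draw on a far larger and open-ended space of rules, so treat this as an illustration of the idea rather than a solver.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]


def rotate_cw(g: Grid) -> Grid:
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*g[::-1])]


def flip_horizontal(g: Grid) -> Grid:
    """Mirror the grid left to right."""
    return [row[::-1] for row in g]


def invert(g: Grid) -> Grid:
    """Swap 0 and 1 in every cell."""
    return [[1 - cell for cell in row] for row in g]


# Hypothetical hypothesis space, chosen only for this illustration.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "rotate_cw": rotate_cw,
    "flip_horizontal": flip_horizontal,
    "invert": invert,
}


def infer_rule(train_pairs: List[Pair]) -> Optional[str]:
    """Return the first candidate that reproduces every training output."""
    for name, rule in CANDIDATES.items():
        if all(rule(x) == y for x, y in train_pairs):
            return name
    return None


# Made-up training pairs that demonstrate a clockwise rotation.
train_pairs: List[Pair] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
    ([[1, 1], [0, 0]], [[0, 1], [0, 1]]),
]
test_input: Grid = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1]]

rule_name = infer_rule(train_pairs)
prediction = CANDIDATES[rule_name](test_input) if rule_name else None
print(rule_name, prediction)
```

The first training pair alone is consistent with both a rotation and a horizontal flip. Only the second pair rules out the flip, which is why tasks show several demonstrations.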
The commonly cited human baseline on ARC-AGI-1 is 85%. When o3 hit 87.5%, it supposedly “surpassed human performance.”
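It helps to see what a percentage like this counts. The sketch below assumes all-or-nothing scoring: a task counts as solved only if one of the submitted grids matches the expected output exactly. The function names and the `max_attempts=2` default are my own assumptions for illustration, not an official scoring script.

```python
from typing import List

Grid = List[List[int]]


def task_solved(attempts: List[Grid], expected: Grid, max_attempts: int = 2) -> bool:
    """A task is solved only if an allowed attempt matches the target exactly."""
    return any(attempt == expected for attempt in attempts[:max_attempts])


def benchmark_score(solved: List[bool]) -> float:
    """Percentage of tasks solved."""
    return 100.0 * sum(solved) / len(solved)


expected: Grid = [[1, 0], [0, 1]]
near_miss: Grid = [[1, 0], [0, 0]]  # one wrong cell earns no partial credit

print(task_solved([near_miss], expected))            # False
print(task_solved([near_miss, expected], expected))  # True
print(benchmark_score([True] * 7 + [False]))         # 87.5
```

Under this scheme, 87.5% simply means seven of every eight tasks were solved exactly. It says nothing about how close the misses were.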
But here’s where my confusion started.
The Core Criticisms
I found three major criticisms of ARC-AGI’s validity as a general intelligence measure:
1. Narrow Scope
ARC-AGI tests only visual-spatial reasoning through synthetic grid puzzles. That’s it.
It completely ignores:
- Language understanding and comprehension
- Real-world knowledge
- Common-sense reasoning
- Long-term planning
- Embodied reasoning (physical world interaction)
- Social reasoning or theory of mind
When critics say ARC-AGI “equates success on a narrow set of small, synthetic grid-transformation puzzles with broad cognitive capability,” they’re pointing out this exact limitation.
Think about it: when was the last time you solved a grid transformation puzzle in your daily life? These tasks are nothing like real cognitive challenges humans face.
2. Synthetic vs. Real Intelligence
The tasks in ARC-AGI are artificially constructed. They lack what researchers call ecological validity - they don’t reflect real-world cognitive demands.
A model that can invert a 3x3 pixel grid doesn’t necessarily understand:
- What objects are
- Physical causality
- Social dynamics
- Temporal sequences
- Goal-oriented planning
As one Reddit commenter put it: “success on synthetic puzzles doesn’t translate to broad cognitive capability.”
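To see how little "understanding" a grid inversion needs, here is the whole operation as a short sketch (the function name is my own). It is a cell-by-cell mapping with no notion of objects, causes, or goals.

```python
from typing import List

Grid = List[List[int]]


def invert_grid(grid: Grid) -> Grid:
    """Swap 0 and 1 in every cell, one cell at a time."""
    return [[1 - cell for cell in row] for row in grid]


plus: Grid = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
inverted = invert_grid(plus)
print(inverted)  # [[1, 0, 1], [0, 0, 0], [1, 0, 1]]
```

Nothing in this function knows that the input looks like a plus sign. The hard part of an ARC task is working out which rule applies, not carrying it out.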
3. Missing Cognitive Dimensions
ARC-AGI doesn’t measure:
- Long-term planning - All tasks are immediate, single-step transformations
- Recursive or hierarchical complexity - No nested goals or sub-goal decomposition
- Knowledge accumulation - ARC-AGI specifically measures skill-acquisition efficiency, NOT learned knowledge
- Error recovery - Tasks don’t require recovery from mistakes
This is perhaps the most damning criticism: ARC-AGI measures ONE specific type of intelligence (fluid reasoning), not “general” intelligence.
What ARC-AGI Does Well
Now, I don’t want to be unfair to ARC-AGI. The benchmark has genuine strengths:
Successfully Exposes Benchmark Gaming
This is ARC-AGI’s killer feature. By design, it prevents “benchmaxxing” - the practice of optimizing models specifically for benchmark test sets.
Because the test set is private and every task is novel, the benchmark is much harder to game: a model can’t memorize answers to tasks it has never seen.
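A toy example shows why a held-out test set defeats memorization. The `MemorizingSolver` below is a hypothetical stand-in for a model that has only stored the public answers, and the task data is made up. Every task follows the same "invert" rule, but the solver never learns it.

```python
from typing import Dict, List, Optional, Tuple

Grid = List[List[int]]
Task = Tuple[Grid, Grid]
Key = Tuple[Tuple[int, ...], ...]


def grid_key(grid: Grid) -> Key:
    """Convert a grid into a hashable lookup key."""
    return tuple(tuple(row) for row in grid)


class MemorizingSolver:
    """Stores input/output pairs verbatim and learns no rule at all."""

    def __init__(self) -> None:
        self._answers: Dict[Key, Grid] = {}

    def fit(self, tasks: List[Task]) -> None:
        for grid_in, grid_out in tasks:
            self._answers[grid_key(grid_in)] = grid_out

    def predict(self, grid_in: Grid) -> Optional[Grid]:
        return self._answers.get(grid_key(grid_in))


def accuracy(solver: MemorizingSolver, tasks: List[Task]) -> float:
    """Fraction of tasks whose output is reproduced exactly."""
    hits = sum(solver.predict(grid_in) == grid_out for grid_in, grid_out in tasks)
    return hits / len(tasks)


# Made-up tasks; all of them invert the grid.
public_tasks: List[Task] = [
    ([[0, 1], [1, 0]], [[1, 0], [0, 1]]),
    ([[1, 1], [0, 0]], [[0, 0], [1, 1]]),
]
private_tasks: List[Task] = [
    ([[0, 0], [0, 1]], [[1, 1], [1, 0]]),
    ([[1, 0], [1, 0]], [[0, 1], [0, 1]]),
]

solver = MemorizingSolver()
solver.fit(public_tasks)
print(accuracy(solver, public_tasks))   # 1.0
print(accuracy(solver, private_tasks))  # 0.0
```

A perfect score on seen tasks and zero on unseen ones is the signature of memorization, and it only becomes visible when the evaluation tasks are kept private.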
I think this is genuinely valuable. If we can’t trust benchmark scores, we have no objective way to measure AI progress.
Measures Generalization
ARC-AGI directly tests the “essence of intelligence” - generalization to novel situations. This is philosophically aligned with how we think about intelligence: the ability to apply knowledge to new problems.
Tests Efficiency, Not Memorization
Unlike tests that reward accumulated knowledge, ARC-AGI rewards learning efficiency. A model that learns quickly from few examples scores higher than one that requires many examples.
This aligns with Chollet’s definition: “intelligence is the efficiency of skill-acquisition on unknown tasks.”
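One rough way to picture "efficiency" is to count how many demonstrations a learner needs before only one explanation remains. The sketch below is a toy proxy of my own, not Chollet’s formal measure. The candidate rules and the demonstration pairs are assumptions chosen for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]
Pair = Tuple[Grid, Grid]

# Hypothetical candidate rules; a real learner would search a much larger space.
CANDIDATES: Dict[str, Callable[[Grid], Grid]] = {
    "rotate_cw": lambda g: [list(row) for row in zip(*g[::-1])],
    "flip_horizontal": lambda g: [row[::-1] for row in g],
    "invert": lambda g: [[1 - cell for cell in row] for row in g],
}


def consistent_rules(pairs: List[Pair]) -> List[str]:
    """Names of all candidates that reproduce every output seen so far."""
    return [
        name
        for name, rule in CANDIDATES.items()
        if all(rule(x) == y for x, y in pairs)
    ]


def examples_needed(pairs: List[Pair]) -> Optional[int]:
    """Smallest number of leading pairs that leaves exactly one candidate."""
    for k in range(1, len(pairs) + 1):
        if len(consistent_rules(pairs[:k])) == 1:
            return k
    return None


# Made-up demonstrations of a clockwise rotation.
pairs: List[Pair] = [
    ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),  # also explained by a horizontal flip
    ([[1, 1], [0, 0]], [[0, 1], [0, 1]]),  # only the rotation fits this one
]

print(consistent_rules(pairs[:1]))  # ['rotate_cw', 'flip_horizontal']
print(examples_needed(pairs))       # 2
```

A learner with better priors would need fewer demonstrations to reach the same answer, which is the kind of efficiency the benchmark is meant to reward.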
The Validity Debate
Here’s my takeaway from this research:
ARC-AGI is a valuable but limited tool.
It effectively measures one dimension of intelligence (fluid reasoning) and does an excellent job preventing benchmark gaming. But it cannot be considered a complete measure of general intelligence.
The debate really comes down to definitions:
| What ARC-AGI Measures | What General Intelligence Requires |
|---|---|
| Novel pattern recognition | Language understanding |
| Short-horizon reasoning | Long-term planning |
| Visual-spatial rules | Physical world reasoning |
| Skill acquisition efficiency | Knowledge accumulation |
| Single-step transformations | Hierarchical goal decomposition |
No single benchmark can capture the full spectrum of AGI. ARC-AGI should be viewed as one piece of a larger evaluation puzzle.
Alternative Perspectives
Some researchers argue that ARC-AGI’s narrow focus is actually a feature, not a bug. By isolating one cognitive dimension, it provides cleaner measurement than broader benchmarks.
Other benchmarks address different dimensions:
- GAIA - Tests real-world multi-step tasks requiring external knowledge
- AGIEval - Evaluates human-level reasoning in diverse domains
- HumanEval - Focuses on code generation capabilities
The honest answer is: we need multiple benchmarks evaluating diverse cognitive dimensions to get a complete picture of AI capabilities.
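In practice, that means reporting a profile of scores rather than a single headline number. Here is a minimal sketch of such a summary. The scores are placeholder values I invented for illustration, not real results for any model, and `summarize` is a hypothetical helper.

```python
from typing import Dict, Union


def summarize(scores: Dict[str, float]) -> Dict[str, Union[str, float]]:
    """Report the mean alongside the weakest dimension instead of one number."""
    weakest = min(scores, key=lambda name: scores[name])
    return {
        "mean": sum(scores.values()) / len(scores),
        "weakest": weakest,
        "weakest_score": scores[weakest],
    }


# Placeholder percentages, NOT measured results.
scores: Dict[str, float] = {
    "ARC-AGI": 80,
    "GAIA": 40,
    "AGIEval": 60,
    "HumanEval": 90,
}

report = summarize(scores)
print(report)  # {'mean': 67.5, 'weakest': 'GAIA', 'weakest_score': 40}
```

Showing the weakest dimension next to the mean keeps one strong benchmark from hiding a gap elsewhere.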
Conclusion
I went into this research expecting to find a clear answer. Instead, I found a nuanced debate.
ARC-AGI is not valid as a complete measure of general intelligence - that’s clear. But it’s also not useless. It measures something real and difficult: the ability to acquire skills on novel tasks efficiently.
The criticisms are valid: success on synthetic grid puzzles doesn’t automatically translate to broad cognitive capability. But the benchmark wasn’t designed to measure everything - just one specific aspect of intelligence.
If you’re evaluating AI systems, use ARC-AGI alongside other benchmarks. And when you see headlines claiming “AI surpasses human intelligence” based on a single benchmark, be skeptical. Progress is real, but it’s more complicated than any single number can capture.
My advice for beginners: Don’t get fixated on any single benchmark score. Look at performance across multiple evaluations, and understand what each benchmark actually tests. That’s the only way to get an accurate picture of AI capabilities.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- ARC-AGI Official Documentation
- ARC Prize 2024 Technical Report
- François Chollet: On the Measure of Intelligence
- ARC Prize Leaderboard
- GAIA Benchmark
- François Chollet Twitter Thread on ARC-AGI Validity
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!