What Is ARC-AGI-2 and How Does It Detect AI Benchmark Gaming?
The Problem
I was excited when I saw DeepSeek R1 and other Chinese reasoning models claiming impressive benchmark scores. Then I stumbled onto a Reddit discussion where someone pointed out that these models scored only 1-1.3% on ARC-AGI-2 - while humans averaged 60%.
Something didn’t add up. How could a model that scored 90%+ on other benchmarks fail so dramatically on ARC-AGI-2?
That’s when I learned about benchmark gaming (or “benchmaxxing”) - and why ARC-AGI-2 was specifically designed to expose it.
What is ARC-AGI-2?
ARC-AGI-2 is an AI benchmark created by François Chollet (creator of Keras) and the ARC Prize Foundation. Unlike traditional benchmarks that test specific skills like language understanding or image recognition, ARC-AGI-2 tests general fluid intelligence - the ability to solve novel problems you’ve never seen before.
The test consists of visual grid puzzles. Here’s a simplified example:
Input Grid:         Target Output:
[[0, 1, 0],         [[1, 0, 1],
 [1, 1, 1],   --->   [0, 0, 0],
 [0, 1, 0]]          [1, 0, 1]]

The model sees several training examples (input-output pairs) and must figure out the underlying rule to apply to a test input. This requires genuine abstract reasoning - not memorization.
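To make the format concrete, here is a minimal Python sketch of the toy puzzle above. The `invert` rule, the `solves_all` helper, and the test input are all made up for illustration and are not part of any official ARC tooling; real tasks use more colours and far less obvious rules.

```python
# Toy illustration only: a grid is a list of rows, each row a list of ints.
Grid = list[list[int]]

def invert(grid: Grid) -> Grid:
    """Hypothetical candidate rule: swap 0s and 1s in every cell."""
    return [[1 - cell for cell in row] for row in grid]

def solves_all(rule, pairs) -> bool:
    """A rule is accepted only if it reproduces every example output exactly."""
    return all(rule(inp) == out for inp, out in pairs)

# The single input-output pair from the example above.
train_pairs = [
    ([[0, 1, 0], [1, 1, 1], [0, 1, 0]],
     [[1, 0, 1], [0, 0, 0], [1, 0, 1]]),
]

# An invented test input; the rule is applied only after it fits the examples.
test_input = [[1, 1, 0], [0, 0, 1], [1, 0, 0]]

if solves_all(invert, train_pairs):
    prediction = invert(test_input)
    print(prediction)
```

The hard part in a real task is finding the rule in the first place; this sketch only shows how a candidate rule would be checked against the examples before being applied.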
Why Do Most AI Models Fail?
I was puzzled why frontier models like OpenAI’s o1-pro and DeepSeek R1 scored so low. These are models that can write code, reason through math problems, and pass graduate-level exams.
The answer is in how ARC-AGI-2 is designed:
- Private evaluation set - The test tasks are hidden from model trainers. You can’t optimize against what you can’t see.
- Efficiency metric - Scores are reported alongside cost per task, which discourages brute-force approaches that rely on massive compute.
- Novelty requirement - The puzzles are designed to resist pattern matching against training data. Even if you’ve seen a similar problem, you still need to figure out the rule.
- Human baseline - Over 400 humans were tested, averaging 60%. This provides meaningful context for what “intelligence” looks like.
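One way to picture the efficiency idea in the list above is to report accuracy together with the compute spent per task, so a brute-force system cannot hide its cost. The following is a minimal sketch; the `TaskResult` fields and the dollar figures are invented for illustration and are not real leaderboard data.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    solved: bool      # True only if the predicted grid matched exactly
    cost_usd: float   # hypothetical compute cost spent on this task

def summarize(results: list[TaskResult]) -> tuple[float, float]:
    """Return (accuracy, mean cost per task) for one evaluation run."""
    n = len(results)
    accuracy = sum(r.solved for r in results) / n
    mean_cost = sum(r.cost_usd for r in results) / n
    return accuracy, mean_cost

# Two invented runs with the same accuracy but very different cost.
cheap_run = [TaskResult(True, 0.10), TaskResult(False, 0.10)]
brute_force_run = [TaskResult(True, 50.0), TaskResult(False, 50.0)]

for name, run in [("cheap", cheap_run), ("brute force", brute_force_run)]:
    acc, cost = summarize(run)
    print(f"{name}: accuracy={acc:.0%}, cost per task=${cost:.2f}")
```

Looking at both numbers side by side is the point: two systems with identical accuracy are not equally impressive if one of them spends hundreds of times more per task.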
Traditional benchmarks like MMLU or HumanEval have public test sets. AI developers can:
- Fine-tune models on similar problems
- Use test-time augmentation
- Optimize specifically for the benchmark format
This is “benchmaxxing” - and it produces misleading progress claims.
How Does ARC-AGI-2 Detect Gaming?
Here’s the key insight: when you can’t see the test set, you can’t optimize for it.
Traditional Benchmark:
Train on public data -> Optimize for public test -> High score (but maybe just memorization)

ARC-AGI-2:
Train on public tasks -> Solve PRIVATE test -> True generalization ability exposed

The private evaluation set contains tasks specifically designed to be different from the training set. If a model has only memorized patterns, it will fail. If a model has learned genuine reasoning, it will transfer.
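The gap between memorizing and generalizing can be shown with a deliberately tiny Python sketch. Everything here is hypothetical: the grids, the `memorizer` and `invert` solvers, and the simplification that the public and private sets share one hidden rule (the real private set uses new rules per task).

```python
# Grids are tuples of tuples so they can be used as dictionary keys.
def invert(grid):
    """Solver that has actually captured the hidden rule (swap 0s and 1s)."""
    return tuple(tuple(1 - cell for cell in row) for row in grid)

# Invented public and private task sets: (input, expected output) pairs.
public_tasks = [(g, invert(g)) for g in [((0, 1), (1, 0)), ((1, 1), (0, 0))]]
private_tasks = [(g, invert(g)) for g in [((0, 0), (0, 1)), ((1, 0), (1, 1))]]

# A "benchmaxxed" solver: a lookup table built from the public answers.
memorized = dict(public_tasks)

def memorizer(grid):
    """Returns the stored answer, or None for any input it has not seen."""
    return memorized.get(grid)

def score(solver, tasks) -> float:
    """Fraction of tasks where the output matches exactly."""
    return sum(solver(inp) == out for inp, out in tasks) / len(tasks)

for name, solver in [("memorizer", memorizer), ("rule learner", invert)]:
    print(name, "public:", score(solver, public_tasks),
          "private:", score(solver, private_tasks))
```

On the public tasks the two solvers are indistinguishable; only the held-out set separates them, which is the argument for keeping the evaluation set private.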
This is exactly what happened with Chinese models. On public leaderboard benchmarks, they looked impressive. On ARC-AGI-2’s private test, they collapsed to 1-1.3%.
What I Learned
The ARC-AGI-2 results taught me several things:
Benchmark scores are not intelligence scores. A model can score 95% on MMLU but still fail at simple visual puzzles that a human child could solve.
Private tests matter. Any benchmark that publishes its test set will eventually be gamed. The only way to measure true capability is to keep some tasks hidden.
“Efficiency” is a feature, not a bug. Some benchmarks allow unlimited compute. ARC-AGI-2 rewards efficient solutions, which better correlates with genuine understanding.
Chinese models have a benchmark gaming problem. The Reddit discussion I found pointed to this - models that claimed to rival GPT-4 on leaderboards scored barely above zero on ARC-AGI-2.
Why This Matters for AI Research
I think about it this way:
When researchers claim “AGI is near” based on benchmark improvements, but those benchmarks can be gamed, the entire field loses reliable progress metrics.
Benchmaxxing creates a false sense of advancement. Companies can claim their models are “human-level” or “AGI-ready” by optimizing for known tests - while the models fail on any novel task.
ARC-AGI-2 represents a different approach:
- Design benchmarks that resist manipulation
- Measure true generalization, not memorization
- Keep test sets private to prevent optimization
The Bottom Line
If you’re evaluating AI models, don’t just look at benchmark scores. Ask:
- Can the model solve novel problems it hasn’t seen before?
- Is there a private test set that wasn’t gamed?
- How does performance compare to human baselines?
ARC-AGI-2 shows that the gap between “good benchmark scores” and “genuine intelligence” is still enormous - and that gap won’t close through benchmark optimization alone.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨‍💻 ARC-AGI-2 Official Dataset
- 👨‍💻 ARC Prize 2024 Technical Report
- 👨‍💻 François Chollet: On the Measure of Intelligence
- 👨‍💻 ARC Prize Leaderboard
- 👨‍💻 Reddit: Chinese models ARC-AGI-2 results
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!