Why Do Chinese AI Models Score So Low on ARC-AGI-2? What Is Going On with Qwen and Kimi
Problem
I was comparing AI models for a project and noticed something strange: Qwen2.5-Max ranks 9th on LiveBench, outperforming Gemini 2.0 Flash and DeepSeek V3. Kimi K2 shows impressive results on code generation tasks. But when I checked their scores on ARC-AGI-2, I got a shock: both are stuck at around 1%, which is close to zero.
This didn’t make sense. How can a model that beats Gemini on LiveBench solve almost nothing on another benchmark?
That’s when I discovered the “benchmaxxing” debate and realized I’d been looking at AI capabilities completely wrong.
What is ARC-AGI-2?
Let me back up and explain what ARC-AGI actually measures.
ARC stands for Abstraction and Reasoning Corpus, created by François Chollet (Keras author) in 2019. Unlike typical benchmarks that test what you’ve memorized, ARC tests fluid intelligence—the ability to solve completely novel problems you’ve never seen before.
Here’s what an ARC task looks like conceptually:
    Input Grid:
    ┌─────┐
    │ A B │
    │ C A │
    └─────┘
    Output Rule: Find the color that appears exactly twice, change all other colors to that color.

Each task gives you input-output examples that reveal a transformation rule. The model must:
- Figure out the underlying rule from examples
- Apply that rule to a new input
The rule might involve symmetry, color counting, geometric transformations, and so on.
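To make the two steps concrete, here is a minimal sketch in Python. It is an illustration, not how any ARC solver or model actually works: grids are assumed to be lists of integer color codes, and `CANDIDATE_RULES`, `infer_rule`, and `solve` are hypothetical names invented for this example. The sketch tries a handful of hand-written rules, keeps the first one that reproduces every example pair exactly, and applies it to a new input.

```python
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]          # assumption: each cell holds an integer color code
Example = Tuple[Grid, Grid]     # (input grid, expected output grid)


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(col) for col in zip(*grid)]


def fill_with_color_seen_twice(grid: Grid) -> Grid:
    """The example rule from the text: recolor every cell to the color
    that appears exactly twice. If no single such color exists, the grid
    is returned unchanged."""
    counts: Dict[int, int] = {}
    for row in grid:
        for cell in row:
            counts[cell] = counts.get(cell, 0) + 1
    twice = [color for color, n in counts.items() if n == 2]
    if len(twice) != 1:
        return [list(row) for row in grid]
    return [[twice[0]] * len(row) for row in grid]


# Hypothetical, hand-picked rule library. Real ARC tasks need far more than this.
CANDIDATE_RULES: Dict[str, Callable[[Grid], Grid]] = {
    "flip_horizontal": flip_horizontal,
    "transpose": transpose,
    "fill_with_color_seen_twice": fill_with_color_seen_twice,
}


def infer_rule(examples: List[Example]) -> Optional[str]:
    """Step 1: return the name of the first rule consistent with all examples."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(inp) == out for inp, out in examples):
            return name
    return None


def solve(examples: List[Example], test_input: Grid) -> Optional[Grid]:
    """Step 2: apply the inferred rule to a new input."""
    name = infer_rule(examples)
    return CANDIDATE_RULES[name](test_input) if name else None


TRAIN: List[Example] = [
    ([[1, 2], [3, 1]], [[1, 1], [1, 1]]),
    ([[4, 5], [5, 6]], [[5, 5], [5, 5]]),
]
TEST_INPUT: Grid = [[7, 8], [8, 2]]

if __name__ == "__main__":
    print(infer_rule(TRAIN))
    print(solve(TRAIN, TEST_INPUT))
```

A fixed rule library like this only works when the right rule has been written down in advance. That is exactly what a benchmark built from novel tasks is designed to defeat.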
The key point: You can’t memorize ARC tasks. There are 1,000 training tasks and only 120 evaluation tasks that are kept secret. Any solution that works by memorization will fail.
The Stunning Numbers
Here’s what I found when I compared model scores:
| Benchmark | Qwen/Kimi | GPT-4o | Claude 3.7 | Human |
|---|---|---|---|---|
| MMLU | ~85% | 88% | 88% | ~89% |
| LiveBench | ~72% | 70% | 72% | N/A |
| ARC-AGI-2 | ~1% | ~1% | ~1% | 60% |
Wait—what? Even GPT-4o and Claude 3.7 Sonnet score around 1% on ARC-AGI-2. This isn’t just a Chinese model problem.
Then I saw that Gemini 3.1 Pro scored 77% on ARC-AGI-2, above the 60% human figure in the table, while DeepSeek R1 scored only 1.3%.
This is when things got interesting.
Why Chinese Models Excel on Standard Benchmarks
Let me explain what’s happening. Standard benchmarks like MMLU, HumanEval, and LiveBench have something in common:
- They test learned skills (math, coding, language understanding)
- They have predictable patterns that can be optimized
- Training data can include similar problems
The argument goes that Chinese models like Qwen and Kimi were trained on massive datasets tuned toward these benchmarks, so they learned to recognize patterns that appear frequently in training:
- Mathematical reasoning patterns
- Code structure patterns
- Question-answering patterns
This is what people call “benchmaxxing”—optimizing specifically for popular evaluation datasets.
Why ARC-AGI-2 Exposes the Gap
Here’s the crucial difference: ARC-AGI-2 tests generalization, not pattern matching.
When I look at an ARC task, I can’t just match it to something I’ve seen before. I need three things:
- Symbolic interpretation: understand that shapes represent abstract concepts
- Compositional reasoning: apply multiple rules that interact
- Contextual adaptation: modify the rule based on the specific context
This is genuinely hard. Even GPT-4o struggles because these tasks require a kind of reasoning that’s different from next-token prediction.
Think of it this way:
    Standard Benchmark: "What's 2 + 2?" → Can be memorized
    ARC-AGI-2: "If red squares turn blue, and blue circles turn red, what happens to a red circle?" → Requires genuine reasoning

The Cost Factor
Here’s something important: Chinese models are dramatically cheaper.
| Model | Price (approx.) |
|---|---|
| GPT-4o | $15-30/M tokens |
| Claude 3.7 | $15-25/M tokens |
| Qwen | $0.3-2/M tokens |
| Kimi | $0.5-2/M tokens |
That’s 10-50x cheaper for most tasks.
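The arithmetic behind that ratio can be checked with a short sketch. It uses only the approximate price ranges from the table above and takes the midpoint of each range as a stand-in price, which is a simplifying assumption. The 200M tokens per month workload and the function names are hypothetical and only there for illustration.

```python
from typing import Dict, Tuple

# Approximate price ranges in USD per million tokens, copied from the table above.
PRICE_RANGES_USD_PER_M: Dict[str, Tuple[float, float]] = {
    "GPT-4o": (15.0, 30.0),
    "Claude 3.7": (15.0, 25.0),
    "Qwen": (0.3, 2.0),
    "Kimi": (0.5, 2.0),
}


def midpoint_price(model: str) -> float:
    """Assumption: use the middle of the quoted range as the price."""
    low, high = PRICE_RANGES_USD_PER_M[model]
    return (low + high) / 2


def monthly_cost(model: str, tokens_per_month: int) -> float:
    """Estimated monthly bill in USD for a given token volume."""
    return midpoint_price(model) * tokens_per_month / 1_000_000


def cost_ratio(expensive: str, cheap: str) -> float:
    """How many times more the first model costs than the second."""
    return midpoint_price(expensive) / midpoint_price(cheap)


if __name__ == "__main__":
    workload = 200_000_000  # hypothetical: 200M tokens per month
    for model in PRICE_RANGES_USD_PER_M:
        print(f"{model}: ${monthly_cost(model, workload):,.0f} per month")
    for expensive in ("GPT-4o", "Claude 3.7"):
        for cheap in ("Qwen", "Kimi"):
            print(f"{expensive} vs {cheap}: {cost_ratio(expensive, cheap):.1f}x")
```

With midpoints, every pairing lands between roughly 16x and 20x. The extremes of the quoted ranges give a wider spread, from 7.5x (15 / 2) to 100x (30 / 0.3).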
For 95% of real-world applications—building chatbots, summarizing documents, writing code—you don’t need ARC-AGI-2 level reasoning. Chinese models are exceptionally cost-effective.
The Benchmark Validity Question
There’s an ongoing debate about whether ARC-AGI-2 scores truly measure “real” intelligence, or if they’re just another benchmark that can be gamed.
The fact that Gemini 3.1 Pro scored 77% suggests that with enough specific optimization, models can improve on ARC-AGI-2. This raises questions:
- Is ARC-AGI-2 now being “benchmaxxed” too?
- Does a high ARC-AGI-2 score translate into better real-world performance?
François Chollet himself has argued that ARC measures something fundamental about intelligence that other benchmarks miss. But he’s also acknowledged that benchmarks evolve.
What Should You Choose?
Here’s my practical take:
For cost-effective general tasks (chatbots, content generation, simple coding):
- Qwen and Kimi are excellent choices
- Save 10-50x on API costs
For novel problem-solving where you need genuine reasoning:
- Consider the specific task requirements
- ARC-AGI-2 scores may predict performance on truly novel problems
For most business applications:
- The difference between 1% and 77% on ARC-AGI-2 rarely matters
- Standard benchmarks are more relevant to your actual use case
The Bigger Picture
The “benchmaxxing” controversy reveals something important: different benchmarks measure different capabilities.
When I choose an AI model, I need to ask:
- What am I actually trying to solve?
- Does this benchmark measure that capability?
- What’s the cost-benefit trade-off?
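One way to turn those questions into something you can run is to weight each benchmark by how much it matters for your task. The sketch below is only an illustration under stated assumptions: the scores are the approximate figures from the comparison table earlier in this article, while the weights and the names `weighted_score` and `rank_models` are hypothetical choices made for this example.

```python
from typing import Dict, List

# Approximate scores (percent) taken from the comparison table in this article.
SCORES: Dict[str, Dict[str, float]] = {
    "Qwen/Kimi": {"MMLU": 85.0, "LiveBench": 72.0, "ARC-AGI-2": 1.0},
    "GPT-4o": {"MMLU": 88.0, "LiveBench": 70.0, "ARC-AGI-2": 1.0},
    "Claude 3.7": {"MMLU": 88.0, "LiveBench": 72.0, "ARC-AGI-2": 1.0},
}


def weighted_score(model: str, weights: Dict[str, float]) -> float:
    """Weighted average of benchmark scores; weights express task relevance."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("at least one benchmark needs a positive weight")
    weighted_sum = sum(SCORES[model][bench] * w for bench, w in weights.items())
    return weighted_sum / total_weight


def rank_models(weights: Dict[str, float]) -> List[str]:
    """Models sorted from best to worst under the given weights."""
    return sorted(SCORES, key=lambda m: weighted_score(m, weights), reverse=True)


# Hypothetical weights for a chatbot-style workload: ARC-AGI-2 is ignored.
CHATBOT_WEIGHTS = {"MMLU": 0.5, "LiveBench": 0.5, "ARC-AGI-2": 0.0}

if __name__ == "__main__":
    for model in rank_models(CHATBOT_WEIGHTS):
        print(f"{model}: {weighted_score(model, CHATBOT_WEIGHTS):.1f}")
```

Under these weights the three entries land within two points of each other, so price ends up being the deciding factor, not benchmark score.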
Chinese models have democratized access to capable AI. Their low ARC-AGI-2 scores don’t mean they’re “bad”—they mean they optimize for different things than ARC-AGI-2 measures.
The real lesson: look at benchmarks that matter for your specific use case, not the loudest or most publicized ones.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- ARC-AGI Official Website
- ARC-AGI-2 Leaderboard
- ARC-AGI: Measuring AGI
- François Chollet on ARC
- Qwen3 Technical Report
- DeepSeek R1 Paper