Why Do Chinese AI Models Score So Low on ARC-AGI-2? What’s Really Going On with Qwen and Kimi

Mar 5, 2026

Problem

I was comparing AI models for a project and noticed something strange: Qwen2.5-Max ranks 9th on LiveBench, outperforming Gemini 2.0 Flash and DeepSeek V3. Kimi K2 shows impressive results on code generation tasks. But when I checked their scores on ARC-AGI-2, I got a shock—both are stuck at around 1%, barely above random guessing.

This didn’t make sense. How can a model that beats Gemini 2.0 Flash on LiveBench score close to zero on another benchmark?

That’s when I discovered the “benchmaxxing” debate and realized I’d been looking at AI capabilities completely wrong.

What is ARC-AGI-2?

Let me back up and explain what ARC-AGI actually measures.

ARC stands for Abstraction and Reasoning Corpus, introduced by François Chollet (the creator of Keras) in 2019. Unlike typical benchmarks, which largely reward what a model has already learned, ARC tests fluid intelligence: the ability to solve novel problems the model has never seen before.

Here’s what an ARC task looks like conceptually:

Input Grid:        Output Rule:
┌─────┐           Find the color that appears exactly twice,
│ A B │           change all other colors to that color.
│ C A │
└─────┘

Each task gives you input-output examples that reveal a transformation rule. The model must:

Figure out the underlying rule from the examples
Apply that rule to a new input

The rule might involve symmetry, color counting, geometric transformations…

The key point: you can’t memorize your way through ARC. ARC-AGI-2 provides 1,000 public training tasks, but models are scored on 120 evaluation tasks that are kept secret, so any solution that relies on memorization will fail.
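
To make the two steps concrete, here’s a minimal Python sketch. It assumes the layout of the public ARC JSON files, where a task has "train" and "test" lists and each grid is a list of rows of integer color codes. The three candidate rules and the toy task are my own illustrations, not real ARC tasks. Real tasks can’t be solved by picking from a fixed menu like this; the sketch only shows the induce-then-apply loop.

```python
from typing import Callable, Optional

Grid = list[list[int]]  # each cell is an integer color code


def flip_horizontal(grid: Grid) -> Grid:
    """Mirror every row left to right."""
    return [row[::-1] for row in grid]


def flip_vertical(grid: Grid) -> Grid:
    """Reverse the order of the rows."""
    return [row[:] for row in grid[::-1]]


def transpose(grid: Grid) -> Grid:
    """Swap rows and columns."""
    return [list(column) for column in zip(*grid)]


# Hypothetical menu of candidate rules, for illustration only.
CANDIDATE_RULES: dict[str, Callable[[Grid], Grid]] = {
    "flip_vertical": flip_vertical,
    "transpose": transpose,
    "flip_horizontal": flip_horizontal,
}


def induce_rule(train_pairs: list[dict]) -> Optional[str]:
    """Step 1: return the first rule that explains every training pair."""
    for name, rule in CANDIDATE_RULES.items():
        if all(rule(pair["input"]) == pair["output"] for pair in train_pairs):
            return name
    return None


# Toy task, made up for this example.
task = {
    "train": [
        {"input": [[1, 2], [3, 4]], "output": [[2, 1], [4, 3]]},
        {"input": [[5, 0, 0], [0, 6, 0]], "output": [[0, 0, 5], [0, 6, 0]]},
    ],
    "test": [{"input": [[7, 8, 9]]}],
}

rule_name = induce_rule(task["train"])

# Step 2: apply the induced rule to the unseen test input.
prediction = None
if rule_name is not None:
    prediction = CANDIDATE_RULES[rule_name](task["test"][0]["input"])

print(rule_name, prediction)
```

Here the only candidate consistent with both training pairs is the horizontal flip, so the test row comes back reversed. In a real ARC task the rule is not on any predefined list, which is exactly why this kind of lookup breaks down.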

The Stunning Numbers

Here’s what I found when I compared model scores:

Benchmark	Qwen/Kimi	GPT-4o	Claude 3.7	Human
MMLU	~85%	88%	88%	~89%
LiveBench	~72%	70%	72%	N/A
ARC-AGI-2	~1%	~1%	~1%	60%

Wait—what? Even GPT-4o and Claude 3.7 Sonnet score around 1% on ARC-AGI-2. This isn’t just a Chinese model problem.

Then I saw that Gemini 3.1 Pro scored 77% on ARC-AGI-2, above the 60% human figure in the table, while DeepSeek R1 scored only 1.3%.

This is when things got interesting.

Why Chinese Models Excel on Standard Benchmarks

Let me explain what’s happening. Standard benchmarks like MMLU, HumanEval, and LiveBench have something in common:

They test learned skills (math, coding, language understanding)
They have predictable patterns that can be optimized
Training data can include similar problems

Critics argue that Chinese models like Qwen and Kimi were trained on massive datasets tuned for these benchmarks, so they learned to recognize patterns that recur in that data:

Mathematical reasoning patterns
Code structure patterns
Question-answering patterns

This is what people call “benchmaxxing”—optimizing specifically for popular evaluation datasets.

Why ARC-AGI-2 Exposes the Gap

Here’s the crucial difference: ARC-AGI-2 tests generalization, not pattern matching.

When I look at an ARC task, I can’t just match it to something I’ve seen before. Solving it requires:

Symbolic interpretation - Understand that shapes represent abstract concepts
Compositional reasoning - Apply multiple rules that interact
Contextual adaptation - Modify the rule based on the specific context

This is genuinely hard. Even GPT-4o struggles because these tasks require a kind of reasoning that’s different from next-token prediction.

Think of it this way:

Standard Benchmark:  "What's 2 + 2?" → Can be memorized
ARC-AGI-2:           "If red squares turn blue, and blue circles turn red,
                      what happens to a red circle?" → Requires genuine reasoning
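
Here’s one way to read that toy puzzle in code. This is a sketch under two assumptions of mine: each rule applies only to the exact color and shape it names, and an object that matches no rule stays unchanged. The names RULES and apply_rules are hypothetical.

```python
# Each rule maps an exact (color, shape) pair to a new color.
# Assumption: rules are keyed on both attributes, not on color alone.
RULES = {
    ("red", "square"): "blue",
    ("blue", "circle"): "red",
}


def apply_rules(color: str, shape: str) -> str:
    """Return the new color, or the original color if no rule matches."""
    return RULES.get((color, shape), color)


result = apply_rules("red", "circle")
print(result)  # red: neither rule mentions a red circle
```

Under that reading the red circle stays red. The answer isn’t stored anywhere; it only falls out once you work out which attributes each rule depends on.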

The Cost Factor

Here’s something important: Chinese models are dramatically cheaper.

Model	Price (approx.)
GPT-4o	$15-30/M tokens
Claude 3.7	$15-25/M tokens
Qwen	$0.3-2/M tokens
Kimi	$0.5-2/M tokens

That’s 10-50x cheaper for most tasks.
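
As a sanity check on that multiplier, here’s a small sketch that takes the approximate price ranges from the table above and computes the smallest and largest possible ratio for each pairing. The numbers are the table’s rough figures, not quoted prices, and the helper name ratio_bounds is my own.

```python
# Approximate price ranges in USD per million tokens, copied from the table above.
PRICE_RANGES = {
    "GPT-4o": (15.0, 30.0),
    "Claude 3.7": (15.0, 25.0),
    "Qwen": (0.3, 2.0),
    "Kimi": (0.5, 2.0),
}


def ratio_bounds(expensive: str, cheap: str) -> tuple[float, float]:
    """Return (smallest, largest) price ratio between two models."""
    expensive_low, expensive_high = PRICE_RANGES[expensive]
    cheap_low, cheap_high = PRICE_RANGES[cheap]
    # Smallest gap: cheapest end of the expensive model vs. priciest end of the cheap one.
    # Largest gap: the opposite extremes.
    return expensive_low / cheap_high, expensive_high / cheap_low


for expensive in ("GPT-4o", "Claude 3.7"):
    for cheap in ("Qwen", "Kimi"):
        low, high = ratio_bounds(expensive, cheap)
        print(f"{expensive} vs {cheap}: {low:.1f}x to {high:.1f}x")
```

With these ranges the gap runs from 7.5x at the narrow end to 100x at the widest (GPT-4o vs Qwen), so 10-50x is a reasonable middle-of-the-range summary rather than a precise figure.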

For the large majority of real-world applications (building chatbots, summarizing documents, writing code), you don’t need ARC-AGI-2 level reasoning. For those, Chinese models are exceptionally cost-effective.

The Benchmark Validity Question

There’s an ongoing debate about whether ARC-AGI-2 scores truly measure “real” intelligence, or if they’re just another benchmark that can be gamed.

The fact that Gemini 3.1 Pro scored 77% suggests that with enough specific optimization, models can improve on ARC-AGI-2. This raises questions:

Is ARC-AGI-2 now being “benchmaxxed” too?
Does a high ARC-AGI-2 score translate into better real-world performance?

François Chollet himself has argued that ARC measures something fundamental about intelligence that other benchmarks miss. But he’s also acknowledged that benchmarks evolve.

What Should You Choose?

Here’s my practical take:

For cost-effective general tasks (chatbots, content generation, simple coding):

Qwen and Kimi are excellent choices
Save 10-50x on API costs

For novel problem-solving where you need genuine reasoning:

Consider the specific task requirements
ARC-AGI-2 scores may predict performance on truly novel problems

For most business applications:

The difference between 1% and 77% on ARC-AGI-2 rarely matters
Standard benchmarks are more relevant to your actual use case

The Bigger Picture

The “benchmaxxing” controversy reveals something important: different benchmarks measure different capabilities.

When I choose an AI model, I need to ask:

What am I actually trying to solve?
Does this benchmark measure that capability?
What’s the cost-benefit trade-off?
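
One rough way to turn those three questions into numbers is a weighted score per dollar. The sketch below uses the benchmark scores from the comparison table earlier in this post. The weights and the single price points are hypothetical values I picked for a chatbot-style use case, so treat the output as an illustration of the method, not a ranking.

```python
# Benchmark scores (percent) copied from the comparison table earlier in this post.
SCORES = {
    "Qwen/Kimi": {"MMLU": 85, "LiveBench": 72, "ARC-AGI-2": 1},
    "GPT-4o": {"MMLU": 88, "LiveBench": 70, "ARC-AGI-2": 1},
}

# Hypothetical price points (USD per million tokens), picked from inside the
# approximate ranges in the price table.
PRICE_PER_M_TOKENS = {"Qwen/Kimi": 1.0, "GPT-4o": 20.0}

# Hypothetical weights: how much each benchmark matters for a chatbot-style task.
WEIGHTS = {"MMLU": 0.5, "LiveBench": 0.5, "ARC-AGI-2": 0.0}


def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Average the benchmark scores, weighted by relevance to the use case."""
    total_weight = sum(weights.values())
    return sum(scores[name] * weight for name, weight in weights.items()) / total_weight


def score_per_dollar(model: str) -> float:
    """Weighted score divided by the price per million tokens."""
    return weighted_score(SCORES[model], WEIGHTS) / PRICE_PER_M_TOKENS[model]


for model in SCORES:
    quality = weighted_score(SCORES[model], WEIGHTS)
    print(f"{model}: score {quality:.1f}, per dollar {score_per_dollar(model):.2f}")
```

With these made-up weights the two models land within a point of each other on quality, and price dominates the comparison. Shift weight onto ARC-AGI-2 for a task that needs novel reasoning and the picture changes, which is the whole point of asking the questions first.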

Chinese models have democratized access to capable AI. Their low ARC-AGI-2 scores don’t mean they’re “bad”—they mean they optimize for different things than ARC-AGI-2 measures.

The real lesson: look at benchmarks that matter for your specific use case, not the loudest or most publicized ones.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.
