AI Model Performance Benchmark: UCloud Speed and Cost Comparison 2026

Mar 25, 2026

I wasted an entire afternoon waiting for AI responses. My workflow was simple: ask a question, wait, get a response, repeat. But by the fifth “thinking…” spinner, I realized the model I picked was killing my productivity.

The real problem? I optimized for quality but ignored speed and cost. I picked the most capable model without considering that I’d make hundreds of API calls daily. When each call takes 20+ seconds, those delays compound into hours of lost time.

So I ran a benchmark. I tested 15+ models through UCloud’s standardized environment with identical prompts to measure actual throughput (tokens/second) and cost efficiency. Here’s what I found.

The Benchmark Setup

I used a single creative writing prompt requesting approximately 200 words. Sending the identical prompt to every model keeps the comparison consistent, so differences in the numbers come from the models rather than from the task. The test measured:

Speed: Tokens generated per second (t/s)
Cost: Price per 1,000 tokens (¥)
Latency: Total time to complete the request
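
For readers who want to reproduce the speed figure, here is a minimal sketch of how tokens per second can be derived from one timed request. The `generate` callable and the `fake_model` stand-in are hypothetical placeholders, not UCloud's API; swap in whatever client you actually use. Because the timer wraps the whole request, the result includes time spent waiting for the first token.

```python
import time
from typing import Callable, Tuple


def measure_speed(generate: Callable[[str], int], prompt: str) -> Tuple[float, float]:
    """Time one request and return (elapsed_seconds, tokens_per_second).

    `generate` is a hypothetical callable that sends the prompt to a model
    and returns the number of tokens in the response.
    """
    start = time.perf_counter()
    n_tokens = generate(prompt)
    elapsed = time.perf_counter() - start
    return elapsed, n_tokens / elapsed


def fake_model(prompt: str) -> int:
    """Stand-in for a real API call: sleeps briefly, then reports 270 tokens."""
    time.sleep(0.05)
    return 270


elapsed, tps = measure_speed(fake_model, "Write about 200 words on autumn.")
print(f"{elapsed:.2f} s total, {tps:.0f} t/s")
```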

The results surprised me.

Speed Rankings: Who’s Actually Fast?

Model                    | Speed (t/s) | Relative to GPT-5.1
-------------------------|-------------|--------------------
MiniMax-M2.1             | 85.4        | 4.8x faster
GPT-5.1-codex-mini       | 69.8        | 3.9x faster
Kimi-K2.5                | 51.0        | 2.8x faster
Claude-Haiku-4.5         | 46.9        | 2.6x faster
DeepSeek-V3.2            | 34.6        | 1.9x faster
Claude-Sonnet-4.5        | 28.2        | 1.6x faster
Claude-Opus-4.6          | 28.1        | 1.6x faster
GPT-5.1                  | 17.9        | baseline
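
The "Relative to GPT-5.1" column is just each model's speed divided by the baseline speed. A small sketch that recomputes it from the measured t/s values in the table (the variable names are my own):

```python
# Measured speeds from the table, in tokens per second
speeds = {
    "MiniMax-M2.1": 85.4,
    "GPT-5.1-codex-mini": 69.8,
    "Kimi-K2.5": 51.0,
    "Claude-Haiku-4.5": 46.9,
    "DeepSeek-V3.2": 34.6,
    "Claude-Sonnet-4.5": 28.2,
    "Claude-Opus-4.6": 28.1,
    "GPT-5.1": 17.9,
}

baseline = speeds["GPT-5.1"]

# Speed-up factor of every model relative to the baseline
relative = {name: tps / baseline for name, tps in speeds.items()}

for name, factor in relative.items():
    print(f"{name:<20} {speeds[name]:>5.1f} t/s  {factor:.1f}x")
```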

The speed difference is massive. MiniMax-M2.1 generates tokens nearly 5x faster than GPT-5.1. For a typical 500-token code completion, that’s:

MiniMax-M2.1:    ~5.9 seconds
GPT-5.1:          ~27.9 seconds
Difference:       22 seconds per request

If I make 100 completions per day, that’s 36+ minutes saved daily, or 220+ hours per year.
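
The arithmetic behind those savings, as a short sketch. It assumes generation time is simply tokens divided by t/s (ignoring network overhead and time to first token) and that I work 365 days a year, which is generous.

```python
def completion_seconds(tokens: int, tokens_per_second: float) -> float:
    """Estimated generation time for a completion of the given length."""
    return tokens / tokens_per_second


fast = completion_seconds(500, 85.4)   # MiniMax-M2.1
slow = completion_seconds(500, 17.9)   # GPT-5.1

saved_per_request = slow - fast              # seconds saved per completion
daily_minutes = saved_per_request * 100 / 60  # 100 completions per day
yearly_hours = daily_minutes * 365 / 60       # assumes 365 working days

print(f"Per request: {saved_per_request:.1f} s")
print(f"Per day:     {daily_minutes:.1f} min")
print(f"Per year:    {yearly_hours:.0f} h")
```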

But speed isn’t everything. Let’s talk cost.

Cost Efficiency: Who’s Actually Cheap?

Model                    | Cost/1K     | Value Rating
-------------------------|-------------|---------------
GPT-5.1-codex-mini       | ¥0.043      | Excellent
Kimi-K2.5                | ¥0.255      | Good
DeepSeek-V3.2            | ¥0.267      | Good
MiniMax-M2.1             | ¥0.348      | Good
Claude-Haiku-4.5         | ¥0.40       | Fair
Claude-Sonnet-4.5        | ¥1.0        | Premium
Claude-Opus-4.6          | ¥1.996      | Luxury

The price spread is 46x between the cheapest and most expensive. GPT-5.1-codex-mini costs ¥0.043 per 1K tokens, while Claude-Opus-4.6 costs ¥1.996. For 1 million tokens of processing:

GPT-5.1-codex-mini:   ¥43
Claude-Opus-4.6:       ¥1,996
Savings:              ¥1,953 (98% cheaper)
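
Here is the same calculation as a small sketch, so you can plug in your own token volume. It assumes a single blended price per 1K tokens, as in the table above; real pricing often differs between input and output tokens.

```python
def cost_yuan(tokens: int, price_per_1k: float) -> float:
    """Cost in ¥ for a given token volume at a flat price per 1,000 tokens."""
    return tokens / 1000 * price_per_1k


cheap = cost_yuan(1_000_000, 0.043)    # GPT-5.1-codex-mini
premium = cost_yuan(1_000_000, 1.996)  # Claude-Opus-4.6

savings = premium - cheap
spread = 1.996 / 0.043                 # price ratio between the two models
percent_cheaper = savings / premium * 100

print(f"Cheap: ¥{cheap:,.0f}  Premium: ¥{premium:,.0f}")
print(f"Savings: ¥{savings:,.0f} ({percent_cheaper:.0f}% cheaper, {spread:.0f}x spread)")
```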

But here’s the catch: cheap models may struggle with complex reasoning. I learned this the hard way when I tried using the budget model for architecture decisions and got superficial suggestions.

The Trade-off Matrix

I created a decision matrix to help me choose:

# Model data from benchmark
models = {
    "DeepSeek-V3.2": {"speed": 34.6, "cost_per_1k": 0.267, "use_case": "reasoning"},
    "Kimi-K2.5": {"speed": 51.0, "cost_per_1k": 0.255, "use_case": "speed"},
    "Claude-Haiku-4.5": {"speed": 46.9, "cost_per_1k": 0.40, "use_case": "balanced"},
    "MiniMax-M2.1": {"speed": 85.4, "cost_per_1k": 0.348, "use_case": "fast"},
    "GPT-5.1-codex-mini": {"speed": 69.8, "cost_per_1k": 0.043, "use_case": "budget"},
    "Claude-Opus-4.6": {"speed": 28.1, "cost_per_1k": 1.996, "use_case": "quality"},
}

def recommend_model(priority, budget_per_1k=1.0):
    """Recommend model based on priority and budget"""
    filtered = {k: v for k, v in models.items()
                if v["cost_per_1k"] <= budget_per_1k}

    if not filtered:
        return None, None

    if priority == "speed":
        return max(filtered.items(), key=lambda x: x[1]["speed"])
    elif priority == "cost":
        return min(filtered.items(), key=lambda x: x[1]["cost_per_1k"])
    else:  # balanced
        return max(filtered.items(),
                   key=lambda x: x[1]["speed"] / x[1]["cost_per_1k"])

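# My typical use cases (the default budget of ¥1.0 per 1K excludes Claude-Opus-4.6)
# Each call returns a (name, data) tuple; the comment shows the winning name.
print(recommend_model("speed"))     # MiniMax-M2.1 (highest t/s)
print(recommend_model("cost"))      # GPT-5.1-codex-mini (lowest price)
print(recommend_model("balanced"))  # GPT-5.1-codex-mini (highest speed/cost ratio)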

This helped me realize I needed different models for different tasks.

My Actual Workflow Now

I don’t use one model for everything anymore. Instead, I match the model to the task:

Task Type              | Model Choice           | Why
-----------------------|------------------------|---------------------------
Quick code completion  | GPT-5.1-codex-mini     | Fast + cheap
Code review            | Claude-Haiku-4.5       | Good enough + fast
Architecture decisions | Claude-Opus-4.6        | Need maximum quality
Documentation drafts   | DeepSeek-V3.2          | Reasoning + reasonable cost
Real-time chat         | MiniMax-M2.1           | Speed priority
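
In code, this table boils down to a lookup. A minimal sketch, where the task-type keys and the fallback to Claude-Haiku-4.5 are my own illustrative choices rather than anything the benchmark prescribes:

```python
# Task-to-model routing taken from the table above.
# The keys are hypothetical labels; use whatever names fit your tooling.
TASK_MODEL = {
    "code_completion": "GPT-5.1-codex-mini",
    "code_review": "Claude-Haiku-4.5",
    "architecture": "Claude-Opus-4.6",
    "documentation": "DeepSeek-V3.2",
    "realtime_chat": "MiniMax-M2.1",
}

# Assumption: unknown task types fall back to the balanced model.
DEFAULT_MODEL = "Claude-Haiku-4.5"


def pick_model(task_type: str) -> str:
    """Return the model name for a task type, or the default if unknown."""
    return TASK_MODEL.get(task_type, DEFAULT_MODEL)


print(pick_model("architecture"))
print(pick_model("something_else"))
```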

This multi-model approach cut my API costs by 60% while actually improving my overall experience because I’m not waiting on slow responses for simple tasks.

What I Got Wrong Initially

My first mistake was thinking “faster is always better.” MiniMax-M2.1 is the fastest at 85.4 t/s, but when I used it for complex code reasoning, the quality wasn’t there. I had to re-prompt multiple times, which actually made it slower overall.
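
To see how re-prompting erases a speed advantage, here is a rough sketch. The attempt counts are hypothetical (I didn't log them), and the comparison with Claude-Haiku-4.5 is only an illustration using the measured t/s values.

```python
def effective_seconds(tokens: int, tokens_per_second: float, attempts: int) -> float:
    """Total generation time when a request has to be repeated `attempts` times."""
    return attempts * tokens / tokens_per_second


# Hypothetical scenario: the fastest model needs three tries,
# a slower model gets it right the first time.
fast_with_retries = effective_seconds(500, 85.4, attempts=3)  # MiniMax-M2.1
slower_first_try = effective_seconds(500, 46.9, attempts=1)   # Claude-Haiku-4.5

print(f"MiniMax-M2.1, 3 attempts:    {fast_with_retries:.1f} s")
print(f"Claude-Haiku-4.5, 1 attempt: {slower_first_try:.1f} s")
```

At these speeds, even a second attempt is enough to make the faster model the slower option overall.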

My second mistake was over-optimizing for cost. I switched everything to GPT-5.1-codex-mini because it was 46x cheaper than the most expensive model. Then I spent hours debugging bad suggestions for edge cases. The time cost far exceeded the money I saved.

The sweet spot turned out to be task-specific selection. I use:

Budget model for simple, high-volume tasks (completions, formatting)
Balanced model for everyday coding (Haiku for most things)
Quality model for critical decisions (Opus for architecture and high-stakes reviews)

The Numbers That Matter

Here’s the summary of what I tested:

Model                    | Speed    | Cost/1K   | Speed/Cost Ratio
-------------------------|----------|-----------|------------------
MiniMax-M2.1             | 85.4 t/s | ¥0.348    | 245.4
GPT-5.1-codex-mini       | 69.8 t/s | ¥0.043    | 1623.3 (best value)
Kimi-K2.5                | 51.0 t/s | ¥0.255    | 200.0
Claude-Haiku-4.5         | 46.9 t/s | ¥0.40     | 117.3
DeepSeek-V3.2            | 34.6 t/s | ¥0.267    | 129.6
Claude-Sonnet-4.5        | 28.2 t/s | ¥1.0      | 28.2
Claude-Opus-4.6          | 28.1 t/s | ¥1.996    | 14.1
GPT-5.1                  | 17.9 t/s | N/A       | N/A

The “speed/cost ratio” is my homemade metric: tokens per second divided by cost in ¥ per 1K tokens. Higher is better. GPT-5.1-codex-mini dominates this metric, but remember that it doesn’t account for quality.
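
The ratio column can be recomputed with a few lines. This sketch uses the speed and cost figures from the table and leaves out GPT-5.1, which has no cost figure:

```python
# (speed in t/s, cost in ¥ per 1K tokens) from the summary table
benchmark = {
    "MiniMax-M2.1": (85.4, 0.348),
    "GPT-5.1-codex-mini": (69.8, 0.043),
    "Kimi-K2.5": (51.0, 0.255),
    "Claude-Haiku-4.5": (46.9, 0.40),
    "DeepSeek-V3.2": (34.6, 0.267),
    "Claude-Sonnet-4.5": (28.2, 1.0),
    "Claude-Opus-4.6": (28.1, 1.996),
}

# Speed/cost ratio: higher means more tokens per second for each ¥ per 1K
ratios = {name: speed / cost for name, (speed, cost) in benchmark.items()}

# Print models from best to worst ratio
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {ratio:8.1f}")

best = max(ratios, key=ratios.get)
print(f"Best value: {best}")
```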

Quick Selection Guide

If you just want a recommendation:

| Priority          | Model               | Speed    | Cost      | Why                    |
|-------------------|---------------------|----------|-----------|------------------------|
| Maximum speed     | MiniMax-M2.1        | 85.4 t/s | ¥0.348/1K | Fastest in test        |
| Budget coding     | GPT-5.1-codex-mini  | 69.8 t/s | ¥0.043/1K | Cheapest, still fast   |
| Best balance      | Claude-Haiku-4.5    | 46.9 t/s | ¥0.40/1K  | Quality + speed        |
| Complex reasoning | Claude-Opus-4.6     | 28.1 t/s | ¥1.996/1K | Highest quality        |
| Fast + capable    | Kimi-K2.5           | 51.0 t/s | ¥0.255/1K | Speed + reasoning      |

What This Doesn’t Cover

This benchmark measures speed and cost. It doesn’t measure:

Code quality: Can the model actually write good code?
Instruction following: How well does it follow complex prompts?
Context handling: How does it perform with long contexts?
Specialized tasks: Performance on specific domains (SQL, Rust, etc.)

For those, you need task-specific benchmarks. But for understanding raw throughput and cost efficiency, these numbers tell the story.

The Takeaway

Model selection is a three-variable optimization problem: speed, cost, and quality. You can’t maximize all three simultaneously. I usually pick two based on the task, and occasionally just one:

Speed + Cost → GPT-5.1-codex-mini (for simple, high-volume work)
Speed + Quality → Claude-Haiku-4.5 (for everyday coding)
Cost + Quality → DeepSeek-V3.2 (for reasoning tasks on a budget)
Quality only → Claude-Opus-4.6 (for critical decisions)

The right choice depends on what you’re optimizing for. But you can’t make that choice without the data. Now you have it.

Final Words + More Resources

My intention with this article was to share my knowledge and experience so that it can help others. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 UCloud AI Model Benchmark
👨‍💻 Claude Model Comparison
👨‍💻 OpenAI GPT Models
👨‍💻 DeepSeek AI

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!