Rate Limit Errors vs. Quota: Why Your AI Service Throttles You Even With Usage Remaining

Mar 11, 2026

I was in the middle of a critical automation task when suddenly, bam: a 429 error. Rate limited. But wait, I checked my dashboard and it clearly showed 97% quota remaining. What gives?

If you’ve experienced this frustration, you’re not alone. I’ve seen this question pop up repeatedly on Reddit, especially in communities like r/ZaiGLM. Users report getting rate limited even when their quota shows plenty of room. One user put it perfectly: “My GLM is so slow I can’t get it to count to 3 in under 5 minutes at which point I get a 429.”

The confusion stems from mixing up two separate mechanisms: rate limits and quotas.

The Gas Tank Analogy

I find it helpful to think of it this way:

Quota     = Gas tank capacity (total fuel you have)
Rate Limit = Speed limit (how fast you can use it)

You can have a completely full gas tank but still get a speeding ticket. Your quota tells you how much you can use in total, but rate limits control how quickly you can make those requests.
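To make the analogy concrete, here is a toy model of my own (not any provider’s actual logic). The `Account` class, its field names, and the returned status strings are all hypothetical; the point is only that the two checks are independent of each other.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Account:
    """Toy model with a total quota and a per-minute request cap (both hypothetical)."""
    quota_remaining: int   # the "gas tank": total requests left
    max_per_minute: int    # the "speed limit": requests allowed per 60-second window
    recent: deque = field(default_factory=deque)  # timestamps of accepted requests

    def request(self, now: float) -> str:
        """Classify a request made at time `now` (in seconds)."""
        # Forget accepted requests that have left the 60-second window.
        while self.recent and now - self.recent[0] >= 60:
            self.recent.popleft()
        if self.quota_remaining <= 0:
            return "quota_exhausted"
        if len(self.recent) >= self.max_per_minute:
            return "rate_limited"  # plenty of quota left, but going too fast
        self.recent.append(now)
        self.quota_remaining -= 1
        return "ok"


account = Account(quota_remaining=1000, max_per_minute=3)
results = [account.request(now=t) for t in (0, 1, 2, 3)]
print(results)                  # ['ok', 'ok', 'ok', 'rate_limited']
print(account.quota_remaining)  # 997
```

The fourth request is rejected with 997 of 1000 requests still in the tank, which is exactly the situation on my dashboard.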

Why Multiple Rate Limit Dimensions Exist

Here’s where it gets tricky. Rate limits aren’t just about request count. I discovered there are multiple dimensions:

| Limit Type              | What It Controls                    |
|------------------------|-------------------------------------|
| Requests Per Minute    | How many API calls you can make     |
| Tokens Per Minute      | Total tokens processed per minute   |
| Concurrent Requests    | Simultaneous connections allowed    |
| Model-Specific Limits  | Different limits per model tier     |

This explains why I’d sometimes hit limits unexpectedly. I might be under my request count but over my token limit. Or I might be hitting concurrent request limits when running parallel processes.
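Here is a rough sketch of how two of these dimensions can interact. It assumes a simple sliding 60-second window, which may not match how a given provider actually counts; `MultiLimit` and the numbers used are made up for illustration.

```python
from collections import deque


class MultiLimit:
    """Sliding window over both request count and token count (illustrative only)."""

    def __init__(self, max_requests: int, max_tokens: int, window: float = 60.0):
        self.max_requests = max_requests
        self.max_tokens = max_tokens
        self.window = window
        self.events = deque()  # (timestamp, tokens) for each accepted request

    def check(self, now: float, tokens: int) -> str:
        """Return "ok" and record the request, or name the limit that blocks it."""
        # Drop requests that are older than the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()
        if len(self.events) + 1 > self.max_requests:
            return "requests_per_minute"
        if sum(t for _, t in self.events) + tokens > self.max_tokens:
            return "tokens_per_minute"
        self.events.append((now, tokens))
        return "ok"


limiter = MultiLimit(max_requests=60, max_tokens=10_000)
print(limiter.check(now=0, tokens=6_000))  # ok
print(limiter.check(now=1, tokens=6_000))  # tokens_per_minute
```

The second call is only the second request of the minute, far below the request cap, yet it is blocked because the two prompts together exceed the token budget.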

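Different models have different limits too. Larger, more expensive models usually come with stricter limits: GPT-4 has typically been more constrained than GPT-3.5, and Claude Opus more than Claude Sonnet or Haiku. The usage tier you’re on matters just as much, so check your provider’s current limits rather than assuming.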

Peak Usage and Dynamic Adjustments

Another factor I’ve noticed: rate limits can be dynamically adjusted based on server load. During peak usage times, you might hit limits even when you normally wouldn’t. This isn’t always documented, but it’s a pattern I’ve observed across ChatGPT, Claude, and other AI services.

The Solution: Exponential Backoff

The most reliable fix I’ve implemented is exponential backoff with jitter. Here’s a pattern that works well:

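import time
import random


class RateLimitError(Exception):
    """Placeholder: substitute the rate-limit (HTTP 429) exception of your client library."""


class MaxRetriesExceeded(Exception):
    """Raised when every attempt was rate limited."""


def retry_with_backoff(func, max_retries=5):
    """
    Retry a function with exponential backoff and jitter.

    Args:
        func: The function to retry (called with no arguments)
        max_retries: Maximum number of attempts

    Returns:
        The result of the function if successful

    Raises:
        MaxRetriesExceeded: When all retries are exhausted
    """
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            # Exponential backoff with jitter: 2**attempt seconds
            # plus up to one extra second of randomness
            wait = (2 ** attempt) + random.random()
            time.sleep(wait)
    raise MaxRetriesExceeded()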

The key elements here:

Exponential growth - Wait time doubles with each retry (1s, 2s, 4s, 8s, 16s)
Jitter - Adding randomness prevents thundering herd problems when multiple clients retry simultaneously
Maximum retries - Prevent infinite loops
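One refinement worth considering is a cap on the wait, so that a long retry chain does not end up sleeping for minutes. The helper below is a sketch of that idea; `backoff_delay`, the 30-second cap, and the injectable `rng` parameter are my own choices for illustration, not part of any library.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng=random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential part doubles each time but never exceeds `cap`;
    `rng` adds up to one second of jitter and can be replaced in tests.
    """
    exponential = min(cap, base * (2 ** attempt))
    return exponential + rng()


# With jitter switched off, the schedule is easy to read:
print([backoff_delay(a, rng=lambda: 0.0) for a in range(6)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 30.0]
```

Passing the random source as a parameter is purely a convenience for testing; in normal use the default `random.random` supplies the jitter.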

Practical Tips That Worked for Me

Beyond implementing backoff, here are strategies that actually helped:

Add delays between requests. I started with 1-2 second delays between API calls. This alone reduced my 429 errors significantly.
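A fixed delay can be wrapped in a small helper so that time already spent on other work counts toward the gap. This is a minimal sketch; the `Pacer` class is hypothetical, and the `clock` and `sleep` parameters exist only so the behavior can be tested without real waiting.

```python
import time


class Pacer:
    """Enforce a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self._last = None  # time at which the previous call was released

    def wait(self) -> float:
        """Block until min_interval has passed since the last call; return seconds slept."""
        now = self.clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self.sleep(delay)
        self._last = now + delay
        return delay


pacer = Pacer(min_interval=1.5)
# Call pacer.wait() immediately before each API request.
```

The first call returns at once; later calls sleep only for whatever part of the 1.5 seconds has not already elapsed.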

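Monitor rate limit headers. Many APIs return headers like X-RateLimit-Remaining or X-RateLimit-Reset, and a 429 response often carries a Retry-After header. The exact names and units vary by provider, so check the documentation. I now track these and adjust my request pace accordingly.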
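As a sketch of what tracking these headers can look like: the function below assumes that Retry-After holds a number of seconds and that X-RateLimit-Reset means "seconds until the window resets". Both are assumptions; some services send an HTTP date or a Unix timestamp instead, and many use differently named headers, so adapt it to what your provider documents.

```python
def seconds_to_wait(headers: dict) -> float:
    """Suggest a pause (in seconds) based on rate limit response headers."""
    # Header names are case-insensitive, so normalize them first.
    h = {k.lower(): v for k, v in headers.items()}

    retry_after = h.get("retry-after")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # could be an HTTP date; not handled in this sketch

    remaining = h.get("x-ratelimit-remaining")
    reset = h.get("x-ratelimit-reset")
    if remaining is not None and reset is not None:
        try:
            if int(remaining) <= 0:
                # Assumption: the value is seconds until reset.
                return max(0.0, float(reset))
        except ValueError:
            pass

    return 0.0  # nothing tells us to slow down


print(seconds_to_wait({"Retry-After": "7"}))  # 7.0
print(seconds_to_wait({"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "12"}))  # 12.0
print(seconds_to_wait({"X-RateLimit-Remaining": "5", "X-RateLimit-Reset": "12"}))  # 0.0
```

Sleeping for the returned value before the next request is usually gentler on the service than retrying blindly.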

Reduce token usage. Shorter, more focused prompts mean fewer tokens per request. I’ve started being more concise in my queries.

Check your tier’s limits. I was surprised to find my tier’s actual limits were different from what I assumed. Documentation is your friend here.

Cross-Platform Pattern

This isn’t unique to one service. I’ve seen similar patterns across:

OpenAI’s API (ChatGPT, GPT-4)
Anthropic’s Claude API
Google’s Gemini
Local AI services under heavy load

The solution is generally the same: respect rate limits as speed limits, implement proper retry logic, and pace your requests.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 r/ZaiGLM Community

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!