Free LLM API Rate Limits Compared: Which Provider Fits Your Use Case?

Mar 23, 2026

I was building a prototype chatbot using free LLM APIs when I hit a wall. My app kept crashing mid-conversation with cryptic rate limit errors. Turns out, I had hit Google Gemini’s daily limit of just 20 requests—in less than 5 minutes of testing.

That’s when I realized: free tier rate limits vary wildly across providers. Groq offers 14,400 requests per day, while Google Gemini gives you only 20. That’s a 720x difference.

The Problem: Rate Limits Are Hidden Landmines

I thought all free LLM APIs would be roughly similar. I was wrong.

After getting burned by Gemini’s restrictive limits, I spent a weekend testing every major free LLM provider. I discovered three critical issues:

Daily limits matter more than minute limits - Everyone advertises RPM (requests per minute), but RPD (requests per day) is often the real bottleneck
No standardization - Some providers limit by requests, others by tokens, one even uses “neurons” (Cloudflare)
Documentation is sparse - Most providers bury rate limit details in fine print

Here’s what I found:

Provider	RPM	RPD	Monthly Quota	Notes
Groq	30	14,400	Unlimited	Best for prototyping
Cerebras	30	14,400	Unlimited	Similar to Groq
NVIDIA NIM	40	Unknown	Unknown	Good for steady traffic
Mistral AI	60	Unknown	1B tokens/month	Token-based limits
OpenRouter	20	50	Unknown	Aggregates multiple models
GitHub Models	10-15	50-150	Unknown	Integrates with dev workflow
Cohere	20	Unknown	1K requests/month	Strict monthly cap
Google Gemini	10	20	Unknown	Most restrictive
Cloudflare Workers AI	Variable	10K neurons/day	Unknown	Uses “neurons” metric
Hugging Face	Variable	Variable	$0.10/month credits	Credit-based system

Why This Matters: Prototyping vs. Production

I learned this the hard way. I built my prototype on Groq because of its generous limits. But when I started planning for production, I realized I hadn’t tested:

Cost per request on paid tiers
Latency under load
Fallback strategies when rate limits hit

The trap: Building on a generous free tier, then discovering production costs are prohibitive or migration is painful.

Prototyping Phase Mistakes I Made

My first mistake was treating rate limits as an afterthought. I’d just pick a provider, start coding, and deal with limits when I hit them. This led to:

App crashes during demo (hit Gemini’s 20 RPD mid-presentation)
Lost work when I couldn’t test my code (hit Cohere’s 1K monthly quota in week 2)
Rewrites when I had to switch providers (different base URLs, model names)

Production Phase Reality Check

When I started thinking about production, I asked:

Can my users tolerate rate limit errors?
What’s the upgrade cost if I need 10x more requests?
Do I need fallback providers?

Free tiers are for evaluation, not production. The providers know this. That’s why they offer generous limits to get you hooked, then charge premium rates when you need to scale.

My Solution: Multi-Provider Architecture

I learned to design for provider switching from day one. The good news: most major free LLM providers expose OpenAI SDK-compatible endpoints. This means one codebase can target several providers by changing only the base URL, API key, and model name.

Here’s the architecture I now use:

import os
from typing import Optional

from openai import OpenAI, RateLimitError


class MultiProviderLLM:
    """Handles multiple free tier LLM providers with automatic fallback."""

    PROVIDERS = {
        'groq': {
            'base_url': 'https://api.groq.com/openai/v1',
            'rpm_limit': 30,
            'rpd_limit': 14400,
        },
        'cerebras': {
            'base_url': 'https://api.cerebras.ai/v1',
            'rpm_limit': 30,
            'rpd_limit': 14400,
        },
        'openrouter': {
            'base_url': 'https://openrouter.ai/api/v1',
            'rpm_limit': 20,
            'rpd_limit': 50,
        },
    }

    def __init__(self):
        # Local request counters per provider; reset by the methods below.
        self.request_counts = {p: {'minute': 0, 'day': 0} for p in self.PROVIDERS}
        self.current_provider = 'groq'

    def get_client(self, provider: str) -> OpenAI:
        config = self.PROVIDERS[provider]
        return OpenAI(
            base_url=config['base_url'],
            api_key=self._get_api_key(provider),
        )

    def check_rate_limit(self, provider: str) -> bool:
        """Check if provider has capacity remaining."""
        config = self.PROVIDERS[provider]
        counts = self.request_counts[provider]

        if counts['minute'] >= config['rpm_limit']:
            return False
        if counts['day'] >= config['rpd_limit']:
            return False
        return True

    def complete(self, messages: list, model: str) -> Optional[str]:
        """Try current provider, fallback to alternatives on rate limit.

        Model identifiers differ between providers, so `model` must be
        valid on every provider that may end up serving the request.
        """
        providers = [self.current_provider] + [
            p for p in self.PROVIDERS if p != self.current_provider
        ]

        for provider in providers:
            if not self.check_rate_limit(provider):
                continue

            try:
                client = self.get_client(provider)
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                )
            except RateLimitError:
                # The provider answered HTTP 429. Mark its minute budget as
                # spent so it is skipped until the next reset, then fall back.
                # Any other exception propagates to the caller unchanged.
                limit = self.PROVIDERS[provider]['rpm_limit']
                self.request_counts[provider]['minute'] = limit
                continue

            self.request_counts[provider]['minute'] += 1
            self.request_counts[provider]['day'] += 1
            return response.choices[0].message.content

        raise RuntimeError("All providers rate limited")

    def reset_minute_counts(self):
        """Call every minute via scheduler."""
        for p in self.request_counts:
            self.request_counts[p]['minute'] = 0

    def reset_day_counts(self):
        """Call every day via scheduler."""
        for p in self.request_counts:
            self.request_counts[p]['day'] = 0

    def _get_api_key(self, provider: str) -> str:
        """Retrieve API key from environment variables."""
        key_map = {
            'groq': 'GROQ_API_KEY',
            'cerebras': 'CEREBRAS_API_KEY',
            'openrouter': 'OPENROUTER_API_KEY',
        }
        return os.environ.get(key_map[provider], '')

The key insight: I track both per-minute and per-day request counts. When one provider hits its limit, I automatically fall back to the next.
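The reset methods in the class rely on an external scheduler calling them on time. One alternative is to timestamp each request and count only the ones still inside the window. Here is a minimal sketch of that idea; the `WindowCounter` name and the injectable `clock` parameter are my own illustrative choices, and providers may enforce fixed rather than sliding windows, so treat the result as an approximation.

```python
import time
from collections import deque
from typing import Callable, Deque


class WindowCounter:
    """Counts events inside a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds: float, limit: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_seconds = window_seconds
        self.limit = limit
        self._clock = clock  # Injectable so the window can be tested without waiting.
        self._stamps: Deque[float] = deque()

    def _evict(self) -> None:
        # Drop timestamps that have aged out of the window.
        cutoff = self._clock() - self.window_seconds
        while self._stamps and self._stamps[0] <= cutoff:
            self._stamps.popleft()

    def has_capacity(self) -> bool:
        """True if another request fits inside the current window."""
        self._evict()
        return len(self._stamps) < self.limit

    def record(self) -> None:
        """Register one request at the current time."""
        self._stamps.append(self._clock())


# One counter per limit; the numbers are Groq's RPM and RPD from the table.
per_minute = WindowCounter(window_seconds=60, limit=30)
per_day = WindowCounter(window_seconds=86_400, limit=14_400)
```

A request would go ahead only when both `per_minute.has_capacity()` and `per_day.has_capacity()` are true, followed by a `record()` call on each counter.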

Provider Selection Strategy

I developed a simple decision tree based on my use case:

def select_provider(use_case: str) -> str:
    """Choose the best free tier provider for your use case."""

    recommendations = {
        'prototyping': 'groq',      # High daily limits, fast
        'production': 'mistral',    # Token-based limits, reliable
        'hobby': 'openrouter',      # Model variety, moderate limits
        'evaluation': 'google',     # Gemini quality, low volume
    }

    return recommendations.get(use_case, 'groq')

When to Use Each Provider

For rapid prototyping: Groq or Cerebras

14,400 daily requests let you iterate quickly
Fast inference speeds (Groq’s LPU is impressive)
Perfect for development and debugging

For production planning: Mistral AI

1B tokens per month gives you breathing room
Token-based limits are more predictable than request-based
Clear upgrade path to paid tiers

For hobby projects: OpenRouter or GitHub Models

Moderate limits fit intermittent usage
OpenRouter gives you access to multiple models
GitHub Models integrates with your existing dev workflow

For enterprise evaluation: Cohere or Google Gemini

Lower limits force focused, deliberate testing
Good for proof-of-concept before committing budget
Enterprise tiers available when you need to scale

Common Mistakes I’ve Seen (and Made)

1. Ignoring Daily Limits

I focused on RPM (requests per minute) and ignored RPD (requests per day). Big mistake.

Example: Google Gemini’s 20 RPD vs 10 RPM. In theory, you can make 10 requests per minute. In practice, you’ll hit the daily limit in just 2 minutes of burst usage.

Fix: Always check both RPM and RPD. For sustained usage, RPD is often more restrictive.
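To make the comparison concrete, here is a small sketch that divides RPD by RPM to estimate how long a daily quota survives at the full per-minute rate. The function name is my own; the limits are the ones from the comparison table.

```python
def minutes_to_exhaust_daily_quota(rpm: int, rpd: int) -> float:
    """Minutes of sending at the full per-minute rate before the daily cap is hit."""
    return rpd / rpm


# (RPM, RPD) pairs taken from the comparison table earlier in the article.
limits = {
    "Groq": (30, 14_400),
    "OpenRouter": (20, 50),
    "Google Gemini": (10, 20),
}

for name, (rpm, rpd) in limits.items():
    minutes = minutes_to_exhaust_daily_quota(rpm, rpd)
    print(f"{name}: daily quota gone after {minutes:g} minutes at full RPM")
```

At full speed, Groq's quota lasts 480 minutes (8 hours), while OpenRouter's lasts 2.5 minutes and Gemini's 2 minutes.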

2. Not Planning for Scale

I built on free tiers without checking paid pricing. When I needed to scale, I discovered:

Some providers charge 10x more than competitors
Upgrade paths aren’t always clear
Migration can be costly

Fix: Test pricing early. Know what you’ll pay when you need 10x, 100x, 1000x scale.
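A rough projection is easy to script. The sketch below is a back-of-the-envelope estimate only: every input is a placeholder I made up, and it assumes a single blended price per million tokens, whereas real price lists usually charge input and output tokens differently.

```python
def monthly_cost_usd(requests_per_day: int, tokens_per_request: int,
                     usd_per_million_tokens: float, days: int = 30) -> float:
    """Estimated monthly spend for a given traffic level and blended token price."""
    tokens = requests_per_day * tokens_per_request * days
    return tokens / 1_000_000 * usd_per_million_tokens


# PLACEHOLDER inputs: substitute your own traffic and the provider's published price.
BASELINE_REQUESTS_PER_DAY = 1_000
TOKENS_PER_REQUEST = 1_500
USD_PER_MILLION_TOKENS = 0.50  # Hypothetical, not any provider's real price.

for factor in (1, 10, 100, 1000):
    cost = monthly_cost_usd(
        BASELINE_REQUESTS_PER_DAY * factor,
        TOKENS_PER_REQUEST,
        USD_PER_MILLION_TOKENS,
    )
    print(f"{factor:>5}x traffic: ${cost:,.2f}/month")
```

Running this once per candidate provider, with that provider's actual prices, shows quickly whether a 10x gap in unit price matters at your scale.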

3. Single Provider Lock-in

I hardcoded one provider’s base URL and model names. When that provider had issues, I was stuck.

Fix: Implement fallback logic from day one. The OpenAI SDK makes this easy—just swap base_url and api_key.

4. Misunderstanding Token Limits

Some providers limit by tokens, not requests. This tripped me up when I sent long prompts.

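Example: Mistral’s 1B tokens per month. That sounds enormous, but chat APIs are stateless, so every turn resends the whole conversation history. Token usage grows much faster than your request count suggests, and long prompts draw down the quota far quicker than short ones.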

Fix: Monitor token usage, not just request counts. Implement token budgeting in your app.
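Here is one way token budgeting could look, as a minimal sketch. `TokenBudget` and `history_tokens` are hypothetical helpers of mine, not part of any SDK. I am assuming the provider fills in the `usage.total_tokens` field that OpenAI-compatible chat responses normally carry.

```python
class TokenBudget:
    """Tracks tokens spent against a monthly allowance (hypothetical helper)."""

    def __init__(self, monthly_limit: int) -> None:
        self.monthly_limit = monthly_limit
        self.used = 0

    def remaining(self) -> int:
        return max(self.monthly_limit - self.used, 0)

    def can_spend(self, estimated_tokens: int) -> bool:
        """True if a request of roughly this size still fits in the budget."""
        return estimated_tokens <= self.remaining()

    def record(self, total_tokens: int) -> None:
        # With the OpenAI SDK this value would come from response.usage.total_tokens.
        self.used += total_tokens


def history_tokens(turns: int, tokens_per_turn: int) -> int:
    """Total prompt tokens sent when every turn resends all earlier turns."""
    # Turn k sends k chunks of history, so the total is 1 + 2 + ... + turns chunks.
    return tokens_per_turn * turns * (turns + 1) // 2


budget = TokenBudget(monthly_limit=1_000_000_000)  # Mistral's 1B tokens/month
budget.record(history_tokens(turns=50, tokens_per_turn=500))
print(budget.remaining())
```

A 50-turn conversation at 500 tokens per turn sends 637,500 prompt tokens in total, versus 25,000 if each turn were an independent request.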

5. Forgetting Cold Starts

Some free tiers have latency penalties for idle models. I built a chatbot that felt slow because of cold starts.

Fix: Test actual response times under realistic conditions. Don’t just measure throughput.
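A simple way to spot cold starts is to time several sequential calls and compare the first one against the rest. This is a minimal sketch: `call` stands in for whatever function sends your request, and the helper names are mine.

```python
import statistics
import time
from typing import Callable, List, Optional


def measure_latencies(call: Callable[[], object], runs: int = 5) -> List[float]:
    """Time `runs` sequential invocations of `call`, in seconds."""
    latencies = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies: List[float]) -> dict:
    """Separate the first (possibly cold) call from the warm ones."""
    warm = latencies[1:]
    warm_median: Optional[float] = statistics.median(warm) if warm else None
    return {"first_call_s": latencies[0], "warm_median_s": warm_median}
```

If `first_call_s` is consistently several times larger than `warm_median_s` after an idle period, a cold start is the likely cause.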

OpenAI SDK Compatibility: Most major free LLM providers support OpenAI SDK-compatible endpoints. This means you can use the same code with different providers by just changing:

base_url: The provider’s API endpoint
api_key: Your provider-specific key
model: The model identifier (varies by provider)
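Keeping those three values together in one config object makes a provider switch a one-line change. Here is a sketch under my own naming (`ProviderConfig`, `client_kwargs`). The base URLs are the ones used earlier in this article, while the model strings are deliberate placeholders that you need to replace with real identifiers from each provider's docs.

```python
import os
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ProviderConfig:
    base_url: str
    api_key_env: str  # Name of the environment variable holding the key.
    model: str


# Model values are PLACEHOLDERS, not real model identifiers.
CONFIGS: Dict[str, ProviderConfig] = {
    "groq": ProviderConfig(
        "https://api.groq.com/openai/v1", "GROQ_API_KEY", "<groq-model-id>"
    ),
    "cerebras": ProviderConfig(
        "https://api.cerebras.ai/v1", "CEREBRAS_API_KEY", "<cerebras-model-id>"
    ),
}


def client_kwargs(provider: str) -> Dict[str, str]:
    """Keyword arguments for openai.OpenAI(...) for the given provider."""
    cfg = CONFIGS[provider]
    return {"base_url": cfg.base_url, "api_key": os.environ.get(cfg.api_key_env, "")}
```

With the OpenAI SDK installed, usage would look like `OpenAI(**client_kwargs("groq"))`, passing `CONFIGS["groq"].model` as the `model` argument of each request.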

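Rate Limit Headers: Most providers return rate limit info in response headers. Exact header names and value formats vary by provider, so check each provider's docs, but a common pattern is:

X-RateLimit-Limit: Maximum requests allowed
X-RateLimit-Remaining: Requests remaining in current window
X-RateLimit-Reset: When the limit resets (often a Unix timestamp, sometimes a duration)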

You can parse these headers to implement more sophisticated rate limit handling.
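As a starting point, here is a small sketch that reads the remaining-request count from a headers mapping. The function names are my own, and the default header name is an assumption you should adjust per provider. With the OpenAI SDK, my understanding is that headers are available through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers` and `.parse()`.

```python
from typing import Mapping, Optional


def remaining_requests(headers: Mapping[str, str],
                       name: str = "x-ratelimit-remaining") -> Optional[int]:
    """Return the remaining-request count from response headers, if present."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {key.lower(): value for key, value in headers.items()}
    value = lowered.get(name.lower())
    if value is None:
        return None
    try:
        return int(value)
    except ValueError:
        return None  # Header present but not a plain integer.


def should_throttle(headers: Mapping[str, str], floor: int = 1) -> bool:
    """True when the provider reports fewer than `floor` requests left."""
    remaining = remaining_requests(headers)
    return remaining is not None and remaining < floor
```

Checking `should_throttle` after each response lets you switch providers before the next request fails, instead of reacting to a 429.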

Exponential Backoff: When you hit rate limits, use exponential backoff instead of immediately retrying:

import time
import random

def exponential_backoff(attempt: int, base_delay: float = 1.0) -> float:
    """Sleep with exponential backoff plus jitter; return the delay used."""
    backoff = base_delay * (2 ** attempt)
    jitter = random.uniform(0, 0.1 * backoff)
    delay = min(backoff + jitter, 60)  # Cap at 60 seconds
    time.sleep(delay)
    return delay
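To show where the delay fits in, here is a self-contained retry wrapper built on the same backoff idea. It is a sketch, not a drop-in utility: `retry_with_backoff` and its parameters are names I chose, and the `sleep` argument exists only so the behavior can be checked without real waiting.

```python
import random
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(call: Callable[[], T],
                       retry_on: Tuple[Type[BaseException], ...],
                       max_attempts: int = 5,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `call`, retrying on the given exception types with backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Double the wait each attempt, cap it at 60 seconds, add up to 10% jitter.
            delay = min(base_delay * (2 ** attempt), 60)
            sleep(delay + random.uniform(0, 0.1 * delay))
    raise ValueError("max_attempts must be at least 1")
```

With the OpenAI SDK you would pass `retry_on=(openai.RateLimitError,)` and wrap the request in a lambda or small function.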

Final Thoughts

Free LLM API rate limits span an enormous range—from 20 to 14,400 requests per day. The right provider depends entirely on your use case:

Usage pattern: Burst vs. sustained, single-threaded vs. parallel
Growth trajectory: Does the provider offer affordable paid tiers?
Integration effort: Most providers expose OpenAI SDK-compatible endpoints, minimizing switching costs
Fallback strategy: Implement multi-provider support from day one

Quick recommendations:

Maximum testing capacity: Groq or Cerebras (14,400 RPD)
Best for production scaling: Mistral AI (1B tokens/month)
Simplest for hobbyists: OpenRouter or GitHub Models
Quality over quantity: Google Gemini (strict limits but strong model)

Remember: Free tiers are for evaluation and prototyping. Plan your upgrade path before hitting rate limits in production.
