Free LLM API Rate Limits Compared: Which Provider Fits Your Use Case?
I was building a prototype chatbot using free LLM APIs when I hit a wall. My app kept crashing mid-conversation with cryptic rate limit errors. Turns out, I had hit Google Gemini’s daily limit of just 20 requests—in less than 5 minutes of testing.
That’s when I realized: free tier rate limits vary wildly across providers. Groq offers 14,400 requests per day, while Google Gemini gives you only 20. That’s a 720x difference.
The Problem: Rate Limits Are Hidden Landmines
I thought all free LLM APIs would be roughly similar. I was wrong.
After getting burned by Gemini’s restrictive limits, I spent a weekend testing every major free LLM provider. I discovered three critical issues:
- Daily limits matter more than minute limits - Everyone advertises RPM (requests per minute), but RPD (requests per day) is often the real bottleneck
- No standardization - Some providers limit by requests, others by tokens, one even uses “neurons” (Cloudflare)
- Documentation is sparse - Most providers bury rate limit details in fine print
Here’s what I found:
| Provider | RPM | RPD | Monthly Quota | Notes |
|---|---|---|---|---|
| Groq | 30 | 14,400 | Unlimited | Best for prototyping |
| Cerebras | 30 | 14,400 | Unlimited | Similar to Groq |
| NVIDIA NIM | 40 | Unknown | Unknown | Good for steady traffic |
| Mistral AI | 60 | Unknown | 1B tokens/month | Token-based limits |
| OpenRouter | 20 | 50 | Unknown | Aggregates multiple models |
| GitHub Models | 10-15 | 50-150 | Unknown | Integrates with dev workflow |
| Cohere | 20 | Unknown | 1K requests/month | Strict monthly cap |
| Google Gemini | 10 | 20 | Unknown | Most restrictive |
| Cloudflare Workers AI | Variable | 10K neurons/day | Unknown | Uses “neurons” metric |
| Hugging Face | Variable | Variable | $0.10/month credits | Credit-based system |
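To make the table easier to query, here is a minimal sketch that encodes the request-based rows and filters them by expected load. The function name `providers_for` and the dictionary layout are illustrative assumptions; rows with "Unknown" or non-request-based limits are left out or marked `None`, and GitHub Models uses the lower bound of its published ranges.

```python
from typing import Optional

# (rpm, rpd) copied from the table above; None marks an "Unknown" limit.
FREE_TIER_LIMITS: dict[str, tuple[Optional[int], Optional[int]]] = {
    "Groq": (30, 14_400),
    "Cerebras": (30, 14_400),
    "NVIDIA NIM": (40, None),
    "Mistral AI": (60, None),
    "OpenRouter": (20, 50),
    "GitHub Models": (10, 50),  # lower bound of the 10-15 / 50-150 ranges
    "Cohere": (20, None),
    "Google Gemini": (10, 20),
}


def providers_for(requests_per_day: int, peak_rpm: int) -> list[str]:
    """Return providers whose listed RPM and RPD both cover the given load."""
    matches = []
    for name, (rpm, rpd) in FREE_TIER_LIMITS.items():
        if rpm is None or rpd is None:
            continue  # unknown limit: cannot confirm that it fits
        if rpm >= peak_rpm and rpd >= requests_per_day:
            matches.append(name)
    return matches
```

For example, `providers_for(500, 10)` keeps only the providers whose daily cap is at least 500 requests, which in this table means Groq and Cerebras.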
Why This Matters: Prototyping vs. Production
I learned this the hard way. I built my prototype on Groq because of its generous limits. But when I started planning for production, I realized I hadn’t tested:
- Cost per request on paid tiers
- Latency under load
- Fallback strategies when rate limits hit
The trap: Building on a generous free tier, then discovering production costs are prohibitive or migration is painful.
Prototyping Phase Mistakes I Made
My first mistake was treating rate limits as an afterthought. I’d just pick a provider, start coding, and deal with limits when I hit them. This led to:
- App crashes during demo (hit Gemini’s 20 RPD mid-presentation)
- Lost work when I couldn’t test my code (hit Cohere’s 1K monthly quota in week 2)
- Rewrites when I had to switch providers (different base URLs, model names)
Production Phase Reality Check
When I started thinking about production, I asked:
- Can my users tolerate rate limit errors?
- What’s the upgrade cost if I need 10x more requests?
- Do I need fallback providers?
Free tiers are for evaluation, not production. The providers know this. That’s why they offer generous limits to get you hooked, then charge premium rates when you need to scale.
My Solution: Multi-Provider Architecture
I learned to design for provider switching from day one. The good news: most major free LLM providers expose OpenAI-compatible endpoints, so one codebase built on the OpenAI SDK can talk to several of them.
Here’s the architecture I now use:
```python
import os
from typing import Optional

from openai import OpenAI, RateLimitError


class MultiProviderLLM:
    """Handles multiple free tier LLM providers with automatic fallback."""

    PROVIDERS = {
        'groq': {
            'base_url': 'https://api.groq.com/openai/v1',
            'rpm_limit': 30,
            'rpd_limit': 14400,
        },
        'cerebras': {
            'base_url': 'https://api.cerebras.ai/v1',
            'rpm_limit': 30,
            'rpd_limit': 14400,
        },
        'openrouter': {
            'base_url': 'https://openrouter.ai/api/v1',
            'rpm_limit': 20,
            'rpd_limit': 50,
        },
    }

    def __init__(self):
        self.request_counts = {p: {'minute': 0, 'day': 0} for p in self.PROVIDERS}
        self.current_provider = 'groq'

    def get_client(self, provider: str) -> OpenAI:
        config = self.PROVIDERS[provider]
        return OpenAI(
            base_url=config['base_url'],
            api_key=self._get_api_key(provider),
        )

    def check_rate_limit(self, provider: str) -> bool:
        """Check if provider has capacity remaining."""
        config = self.PROVIDERS[provider]
        counts = self.request_counts[provider]

        if counts['minute'] >= config['rpm_limit']:
            return False
        if counts['day'] >= config['rpd_limit']:
            return False
        return True

    def complete(self, messages: list, model: str) -> Optional[str]:
        """Try current provider, fallback to alternatives on rate limit.

        Note: `model` must be an identifier that the serving provider accepts.
        """
        providers = [self.current_provider] + [
            p for p in self.PROVIDERS if p != self.current_provider
        ]

        for provider in providers:
            if not self.check_rate_limit(provider):
                continue

            try:
                client = self.get_client(provider)
                response = client.chat.completions.create(
                    model=model,
                    messages=messages,
                )
                self.request_counts[provider]['minute'] += 1
                self.request_counts[provider]['day'] += 1
                return response.choices[0].message.content
            except RateLimitError:
                # HTTP 429 from this provider: move on to the next one.
                continue

        raise RuntimeError("All providers rate limited")

    def reset_minute_counts(self):
        """Call every minute via scheduler."""
        for p in self.request_counts:
            self.request_counts[p]['minute'] = 0

    def reset_day_counts(self):
        """Call every day via scheduler."""
        for p in self.request_counts:
            self.request_counts[p]['day'] = 0

    def _get_api_key(self, provider: str) -> str:
        """Retrieve API key from environment variables."""
        key_map = {
            'groq': 'GROQ_API_KEY',
            'cerebras': 'CEREBRAS_API_KEY',
            'openrouter': 'OPENROUTER_API_KEY',
        }
        return os.environ.get(key_map[provider], '')
```

The key insight: I track both per-minute and per-day request counts. When one provider hits its limit, I automatically fall back to the next.
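The reset methods in the class depend on an external scheduler calling them on time. One possible alternative, sketched here under assumptions and not taken from the original design, is a sliding-window counter that forgets old requests by itself. The class name `SlidingWindowCounter` and the injectable `clock` parameter are hypothetical, and a rolling 24-hour window is only an approximation if a provider resets its daily quota at a fixed time.

```python
import time
from collections import deque


class SlidingWindowCounter:
    """Counts events inside a trailing time window (e.g. 60 s or 86,400 s)."""

    def __init__(self, limit: int, window_seconds: float, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock      # injectable so the counter can be tested
        self.events = deque()   # timestamps of recorded requests, oldest first

    def _evict(self) -> None:
        """Drop timestamps that have fallen out of the window."""
        cutoff = self.clock() - self.window
        while self.events and self.events[0] <= cutoff:
            self.events.popleft()

    def has_capacity(self) -> bool:
        self._evict()
        return len(self.events) < self.limit

    def record(self) -> None:
        self.events.append(self.clock())
```

One counter per limit (for instance one with `limit=30, window_seconds=60` and one with `limit=14400, window_seconds=86400` for Groq) could stand in for the `minute` and `day` integers, with `has_capacity()` checked before each call and `record()` called after it.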
Provider Selection Strategy
I developed a simple lookup based on my use case:

```python
def select_provider(use_case: str) -> str:
    """Choose the best free tier provider for your use case."""
    recommendations = {
        'prototyping': 'groq',     # High daily limits, fast
        'production': 'mistral',   # Token-based limits, reliable
        'hobby': 'openrouter',     # Model variety, moderate limits
        'evaluation': 'google',    # Gemini quality, low volume
    }
    # Unknown use cases fall back to Groq.
    return recommendations.get(use_case, 'groq')
```

When to Use Each Provider
For rapid prototyping: Groq or Cerebras
- 14,400 daily requests let you iterate quickly
- Fast inference speeds (Groq’s LPU is impressive)
- Perfect for development and debugging
For production planning: Mistral AI
- 1B tokens per month gives you breathing room
- Token-based limits are more predictable than request-based
- Clear upgrade path to paid tiers
For hobby projects: OpenRouter or GitHub Models
- Moderate limits fit intermittent usage
- OpenRouter gives you access to multiple models
- GitHub Models integrates with your existing dev workflow
For enterprise evaluation: Cohere or Google Gemini
- Lower limits force focused, deliberate testing
- Good for proof-of-concept before committing budget
- Enterprise tiers available when you need to scale
Common Mistakes I’ve Seen (and Made)
1. Ignoring Daily Limits
I focused on RPM (requests per minute) and ignored RPD (requests per day). Big mistake.
Example: Google Gemini’s 20 RPD vs 10 RPM. In theory, you can make 10 requests per minute. In practice, you’ll hit the daily limit in just 2 minutes of burst usage.
Fix: Always check both RPM and RPD. For sustained usage, RPD is often more restrictive.
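The arithmetic behind that comparison fits in two small helpers. This is a sketch with hypothetical function names, using only the RPM and RPD figures from the table.

```python
def minutes_until_daily_cap(rpm: int, rpd: int) -> float:
    """Minutes of sending at the full per-minute rate before the daily cap is reached."""
    return rpd / rpm


def sustained_rpm(rpd: int) -> float:
    """Average requests per minute that spreads the daily cap evenly over 24 hours."""
    return rpd / (24 * 60)


print(minutes_until_daily_cap(rpm=10, rpd=20))      # Gemini: 2.0 minutes
print(minutes_until_daily_cap(rpm=30, rpd=14_400))  # Groq: 480.0 minutes (8 hours)
print(sustained_rpm(14_400))                        # Groq: 10.0 requests per minute
```

Even Groq’s 30 RPM cannot be sustained all day: spread over 24 hours, 14,400 requests average out to 10 per minute.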
2. Not Planning for Scale
I built on free tiers without checking paid pricing. When I needed to scale, I discovered:
- Some providers charge 10x more than competitors
- Upgrade paths aren’t always clear
- Migration can be costly
Fix: Test pricing early. Know what you’ll pay when you need 10x, 100x, 1000x scale.
3. Single Provider Lock-in
I hardcoded one provider’s base URL and model names. When that provider had issues, I was stuck.
Fix: Implement fallback logic from day one. The OpenAI SDK makes this easy—just swap base_url and api_key.
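One way to avoid hardcoding is to keep everything provider-specific in a single table. The sketch below is an assumption about how that could look: `ProviderProfile`, `PROFILES`, and `client_kwargs` are hypothetical names, and the model strings are placeholders to be replaced with identifiers from each provider’s model list. The base URLs and environment variable names are the ones used earlier in this article.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderProfile:
    base_url: str
    api_key_env: str
    model: str  # provider-specific model identifier (placeholder below)


PROFILES = {
    "groq": ProviderProfile(
        "https://api.groq.com/openai/v1", "GROQ_API_KEY", "your-groq-model-id"
    ),
    "openrouter": ProviderProfile(
        "https://openrouter.ai/api/v1", "OPENROUTER_API_KEY", "your-openrouter-model-id"
    ),
}


def client_kwargs(provider: str) -> dict:
    """Build the keyword arguments for OpenAI(...) for one provider."""
    profile = PROFILES[provider]
    api_key = os.environ.get(profile.api_key_env)
    if not api_key:
        # Fail early instead of sending requests with an empty key.
        raise RuntimeError(f"Set {profile.api_key_env} before using {provider}")
    return {"base_url": profile.base_url, "api_key": api_key}
```

Switching providers then means changing one dictionary key: `OpenAI(**client_kwargs("groq"))` for the client and `PROFILES["groq"].model` for the model name.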
4. Misunderstanding Token Limits
Some providers limit by tokens, not requests. This tripped me up when I sent long prompts.
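Example: Mistral’s 1B tokens per month. The quota is counted in tokens, not requests, so a chatbot that resends a long conversation history on every turn burns through it far faster than its request count suggests.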
Fix: Monitor token usage, not just request counts. Implement token budgeting in your app.
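A minimal sketch of token budgeting, assuming a fixed monthly allowance and that the caller supplies per-request token counts. The `TokenBudget` class is hypothetical; it does no tokenizing itself.

```python
class TokenBudget:
    """Tracks tokens spent against a fixed monthly allowance."""

    def __init__(self, monthly_limit: int):
        self.monthly_limit = monthly_limit
        self.used = 0

    def record(self, total_tokens: int) -> None:
        """Add the tokens consumed by one completed request."""
        self.used += total_tokens

    def remaining(self) -> int:
        return max(self.monthly_limit - self.used, 0)

    def can_afford(self, estimated_tokens: int) -> bool:
        """Check whether a request of the estimated size still fits the budget."""
        return estimated_tokens <= self.remaining()
```

With the OpenAI SDK, the per-request count is typically available as `response.usage.total_tokens` when the provider populates the `usage` field, so `budget.record(response.usage.total_tokens)` after each call keeps the tally current.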
5. Forgetting Cold Starts
Some free tiers have latency penalties for idle models. I built a chatbot that felt slow because of cold starts.
Fix: Test actual response times under realistic conditions. Don’t just measure throughput.
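A rough way to surface cold starts is to time the same call several times, with an optional idle gap between runs. This is a sketch only; `measure_latency` and its parameters are hypothetical, and `call` stands for any zero-argument function that performs one request.

```python
import statistics
import time
from typing import Callable


def measure_latency(
    call: Callable[[], object], runs: int = 5, idle_seconds: float = 0.0
) -> dict:
    """Time `call` several times, optionally idling between runs to expose cold starts."""
    samples = []
    for i in range(runs):
        if i > 0 and idle_seconds > 0:
            time.sleep(idle_seconds)  # let the model go idle before the next run
        start = time.perf_counter()
        call()
        samples.append(time.perf_counter() - start)
    return {
        "first": samples[0],
        "median": statistics.median(samples),
        "max": max(samples),
    }
```

A first request that is much slower than the median, or a large gap between median and max after idle periods, is a hint that cold starts are affecting response times.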
Related Knowledge
OpenAI SDK Compatibility: Most major free LLM providers expose OpenAI-compatible endpoints, so the same code can target different providers by changing three things:
- base_url: The provider’s API endpoint
- api_key: Your provider-specific key
- model: The model identifier (varies by provider)
Rate Limit Headers: Many providers return rate limit info in response headers. Exact header names vary by provider, so check each provider’s documentation, but a common pattern is:
- X-RateLimit-Limit: Maximum requests allowed
- X-RateLimit-Remaining: Requests remaining in current window
- X-RateLimit-Reset: When the limit resets (often a Unix timestamp)
You can parse these headers to implement more sophisticated rate limit handling.
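As a sketch of that parsing, assuming the three generic header names listed above (a given provider may use different names or formats), the hypothetical helper below reads them from any mapping of header names to strings:

```python
from typing import Mapping, Optional


def parse_rate_limit_headers(headers: Mapping[str, str]) -> dict[str, Optional[int]]:
    """Extract limit, remaining, and reset values, tolerating missing or malformed headers."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {k.lower(): v for k, v in headers.items()}

    def as_int(name: str) -> Optional[int]:
        try:
            return int(lowered[name])
        except (KeyError, ValueError):
            return None

    return {
        "limit": as_int("x-ratelimit-limit"),
        "remaining": as_int("x-ratelimit-remaining"),
        "reset": as_int("x-ratelimit-reset"),
    }
```

With the OpenAI Python SDK, raw headers should be reachable through `client.chat.completions.with_raw_response.create(...)`, whose result exposes `.headers`; treat that as something to confirm against the SDK version you have installed.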
Exponential Backoff: When you hit rate limits, use exponential backoff instead of immediately retrying:
```python
import random
import time


def exponential_backoff(attempt: int, base_delay: float = 1.0) -> None:
    """Sleep for a delay computed with exponential backoff and jitter."""
    max_delay = base_delay * (2 ** attempt)
    jitter = random.uniform(0, 0.1 * max_delay)
    delay = min(max_delay + jitter, 60)  # Cap at 60 seconds
    time.sleep(delay)
```

Final Thoughts
Free LLM API rate limits span an enormous range—from 20 to 14,400 requests per day. The right provider depends entirely on your use case:
- Usage pattern: Burst vs. sustained, single-threaded vs. parallel
- Growth trajectory: Does the provider offer affordable paid tiers?
- Integration effort: OpenAI-compatible endpoints on most providers keep switching costs low
- Fallback strategy: Implement multi-provider support from day one
Quick recommendations:
- Maximum testing capacity: Groq or Cerebras (14,400 RPD)
- Best for production scaling: Mistral AI (1B tokens/month)
- Simplest for hobbyists: OpenRouter or GitHub Models
- Quality over quantity: Google Gemini (strict limits but strong model)
Remember: Free tiers are for evaluation and prototyping. Plan your upgrade path before hitting rate limits in production.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.