
How Do I Reduce Claude API Token Costs When Building Apps?

Problem

When I built my first app with Claude API, my monthly bill hit $600.

I was shocked. I expected maybe $50-100. What went wrong?

I checked my usage dashboard and found I was sending my entire codebase with every request. Each API call included 500,000+ tokens of context. No caching. No context management. Just brute force.

When I asked on Reddit, I found others making the same mistake:

“You’re manually pasting your app into project knowledge every time, that’s burning tokens like crazy because you’re sending the entire codebase in context on every message, plus extended thinking on Opus.” - Reddit user (131 upvotes)

This post shows how I reduced my costs from $600/month to under $100/month.

What I Was Doing Wrong

My API usage pattern looked like this:

My expensive approach
Request 1: Send full codebase (500k tokens) + question → $15
Request 2: Send full codebase (500k tokens) + question → $15
Request 3: Send full codebase (500k tokens) + question → $15
...
Total: $600+/month

I was making three critical mistakes:

  1. Resending everything: My entire codebase went with every request
  2. Using Opus for everything: Even simple tasks used the most expensive model
  3. No caching: The same context was processed fresh every time

Solution 1: Use Prompt Caching (Up to 90% Off Cached Tokens)

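The biggest win came from prompt caching. When I cache repeated content, the first request writes it to the cache at a 25% premium over the normal input price. Later requests that reuse the identical prefix within the cache lifetime read it back at 10% of the normal input price, so those cached tokens cost up to 90% less.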

Here’s how I implemented it:

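prompt_caching.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# My large system prompt with project context
system_prompt = """
You are a code assistant for my e-commerce app.
Project structure:
- src/
  - api/
  - models/
  - utils/
[... 50,000 tokens of project documentation ...]
"""

# First call: writes the system prompt to the cache (cache writes cost 1.25x base input).
# Prompts below the minimum cacheable length (1,024 tokens for Sonnet) are not cached.
message = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": system_prompt,
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[
        {"role": "user", "content": "Add a login function"}
    ],
)

# Check that caching works: the first call reports cache_creation_input_tokens,
# later calls with the identical system prompt report cache_read_input_tokens
print(message.usage.cache_creation_input_tokens, message.usage.cache_read_input_tokens)

# Subsequent calls: cached tokens cost 10% of the base input price.
# The cache lives about 5 minutes, and each cache hit resets that timer.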

The cost difference is dramatic:

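Cost comparison (input tokens only, assuming $15 per million input tokens)
Without caching:
- 10 requests x 50k system prompt = 500k tokens processed
- Cost: ~$7.50
With caching:
- 1st request: 50k tokens written to the cache at 1.25x price = 62.5k effective tokens
- 9 cached requests: 50k tokens each at 10% price = 45k effective tokens
- Total: ~62.5k + 45k = 107.5k effective tokens
- Cost: ~$1.61
Savings: ~78%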
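To redo this arithmetic with your own prompt size and prices, here is a small sketch. It assumes the default 5-minute cache pricing (writes at 1.25x the base input price, reads at 0.1x), that every request arrives within the cache lifetime, and it ignores output tokens. `input_cost` is a hypothetical helper, not part of the Anthropic SDK.

```python
def input_cost(requests, prompt_tokens, price_per_mtok, cached=False,
               write_multiplier=1.25, read_multiplier=0.10):
    """Input-token cost (USD) of sending the same prompt prefix `requests` times."""
    if not cached:
        effective_tokens = requests * prompt_tokens
    else:
        # The first request writes the cache; every later request reads from it
        effective_tokens = (prompt_tokens * write_multiplier
                            + (requests - 1) * prompt_tokens * read_multiplier)
    return effective_tokens * price_per_mtok / 1_000_000

without = input_cost(10, 50_000, 15.0)
with_cache = input_cost(10, 50_000, 15.0, cached=True)
print(f"without: ${without:.2f}  with: ${with_cache:.2f}  savings: {1 - with_cache / without:.1%}")
```

With 10 requests of a 50k-token prompt, the cached version comes out at roughly a fifth of the uncached cost even after the write premium. The savings grow with the number of cache hits per write.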

Solution 2: Pick the Right Model

I used Claude Opus for everything. That was wasteful.

Here’s what I learned about model selection:

Model selection guide
Task type | Model | Cost savings
-----------------------------|----------|-------------
Complex architecture design | Opus | 0% (baseline)
Bug fixing in existing code | Sonnet | 60%
Code completion | Sonnet | 60%
Simple refactoring | Haiku | 90%+
Documentation generation | Haiku | 90%+

I now use this logic in my app:

model_selection.py
def get_model_for_task(task_type):
    """Choose the cheapest model that gets the job done."""
    # Model IDs change over time; check Anthropic's models overview for current ones
    if task_type == "architecture_decision":
        # Complex reasoning needs Opus
        return "claude-opus-4-20250514"
    elif task_type in ["bug_fix", "feature_implementation", "code_review"]:
        # Sonnet handles most coding tasks
        return "claude-sonnet-4-20250514"  # 60% cheaper than Opus
    else:
        # Simple tasks don't need advanced models
        return "claude-haiku-4-5"  # 90%+ cheaper

Solution 3: Manage Context Intelligently

The Reddit discussion highlighted my biggest mistake:

“Consider that a token is 4 characters and usage is per token. Ask Claude to create an onboarding document and separate markdown files for different areas. Created MCP server for larger files.” - Reddit user

Instead of sending my entire codebase, I now:

  1. Create focused context:
context_management.py
# read_entire_project, semantic_search and build_context are placeholders for your own helpers

# BAD: Send everything
full_codebase = read_entire_project()  # 500k tokens

# GOOD: Send only what's relevant
def get_relevant_context(question, codebase_index):
    """Find only the files related to the question."""
    relevant_files = semantic_search(question, codebase_index)
    return build_context(relevant_files)  # ~5k tokens
  2. Use structured documentation:
Project structure
/docs
  api.md           # API endpoints documentation
  models.md        # Database models
  utils.md         # Helper functions
  architecture.md  # System design overview
  3. Keep conversations focused:

Claude Code manages context automatically by compacting (summarizing) older parts of the conversation when the context fills up. For custom implementations, I track token usage myself:

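token_tracking.py
MAX_CONTEXT = 100_000  # 100k-token budget for conversation history

def count_tokens(messages):
    """Rough estimate at ~4 characters per token; accepts one message or a list.
    For exact counts, the Python SDK offers client.messages.count_tokens(...)."""
    if isinstance(messages, dict):
        messages = [messages]
    return sum(len(str(m["content"])) for m in messages) // 4

def summarize_older_messages(history, keep_last_n=5):
    """Simplest version: keep only the last N messages (a fuller one could summarize the rest)."""
    return history[-keep_last_n:]

def manage_context(conversation_history, new_message):
    current_tokens = count_tokens(conversation_history)
    if current_tokens + count_tokens(new_message) > MAX_CONTEXT:
        # Summarize or prune older messages
        conversation_history = summarize_older_messages(conversation_history, keep_last_n=5)
    return conversation_history + [new_message]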
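The `semantic_search` helper from step 1 isn't shown in this post. As a runnable stand-in, here is a sketch that ranks files by plain keyword overlap with the question. This is a much cruder technique than embedding-based semantic search (it misses synonyms and rewards repeated words), and `rank_files`, `build_focused_context` and the stopword list are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Common words that would otherwise match almost every file (hypothetical list, tune it)
STOPWORDS = {"the", "and", "for", "how", "does", "what", "this", "that", "with"}

def words(text):
    """Lowercase alphanumeric runs; snake_case names split on the underscore."""
    return re.findall(r"[a-z0-9]+", text.lower())

def rank_files(question, files, top_k=3):
    """Return up to top_k paths whose contents mention the question's keywords most often."""
    query = set(words(question)) - STOPWORDS
    scores = {}
    for path, source in files.items():
        counts = Counter(words(source))
        scores[path] = sum(counts[w] for w in query)
    ranked = sorted(files, key=lambda p: scores[p], reverse=True)
    return [p for p in ranked[:top_k] if scores[p] > 0]

def build_focused_context(question, files, top_k=3):
    """Join only the top-ranked files into one prompt-ready string."""
    return "\n\n".join(f"### {p}\n{files[p]}" for p in rank_files(question, files, top_k))

# Usage with a tiny in-memory "codebase" (path -> source)
files = {
    "src/api/auth.py": "def login(user, password):\n    return check_password(user, password)",
    "src/models/order.py": "class Order:\n    total = 0",
    "src/utils/dates.py": "def format_date(d):\n    return d.isoformat()",
}
print(build_focused_context("How does login check the password?", files))
```

In a real project `files` would be read from disk, and an embedding-based search could replace `rank_files` without changing how the context is assembled.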

Token Economics: What Counts

Understanding what consumes tokens helped me optimize:

Token breakdown
1 token = ~4 characters (roughly 3/4 of a word)
What counts toward your API bill:
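- Your input text (questions, code snippets)
- System prompts (instructions, context)
- Claude's output (responses, code), billed at a higher per-token rate than input
- Conversation history (previous messages)
- Tool definitions and tool outputs (if using MCP or function calling)
- Extended thinking tokens (billed as output)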
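The ~4-characters-per-token rule is enough for back-of-the-envelope checks, such as how large a request gets once the system prompt and history are included, or how many tokens your whole codebase would cost to send. A minimal sketch; the helper names and the default file suffixes are assumptions, and real tokenization varies by content.

```python
import math
from pathlib import Path

CHARS_PER_TOKEN = 4  # rule of thumb from above; actual counts vary

def estimate_tokens(text: str) -> int:
    """Rough token estimate from the character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)

def estimate_input_tokens(system: str, history: list, question: str) -> int:
    """Everything sent with a request counts: system prompt, prior messages, new question."""
    return (estimate_tokens(system)
            + sum(estimate_tokens(m) for m in history)
            + estimate_tokens(question))

def estimate_project_tokens(root: str, suffixes=(".py", ".js", ".ts", ".md")) -> int:
    """Estimate the tokens needed to send every matching file under `root`."""
    return sum(
        estimate_tokens(p.read_text(encoding="utf-8", errors="ignore"))
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix in suffixes
    )

# Usage: print(estimate_project_tokens("src"))
```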

This means:

Optimization impact
Optimization | Token reduction
--------------------------|----------------
Remove unused imports | 5-10%
Delete commented code | 5-15%
Use shorter variable names| 2-5% (but hurts readability)
Split large files | Better context targeting
Use caching | None (cached tokens still count, but cost up to 90% less)
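The "Delete commented code" row can be automated before code goes into a prompt. A rough sketch for Python sources, assuming comment lines start with `#`; it would also drop lines beginning with `#` inside multi-line strings, which is acceptable for context you send to Claude but not for code you intend to run. `strip_comment_lines` is a hypothetical helper.

```python
def strip_comment_lines(source: str) -> str:
    """Remove lines that contain only a comment, such as commented-out code."""
    kept = [line for line in source.splitlines() if not line.lstrip().startswith("#")]
    return "\n".join(kept)

before = "x = 1\n# old implementation\n    # y = compute(x)\nz = 3"
after = strip_comment_lines(before)
print(f"{len(before)} -> {len(after)} characters")
```

Inline comments after code on the same line are kept, since they usually explain the code rather than duplicate it.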

My New Cost Structure

After implementing these changes:

Before and after comparison
| Before | After | Savings
--------------------|-----------|-----------|--------
Model mix | 100% Opus | 20% Opus | 60%
| | 60% Sonnet|
| | 20% Haiku |
Caching | None | Enabled | 90%*
Context/request | 500k tok. | ~10k tok. | 98%
Monthly cost | $600 | $80 | 87%

*90% savings on cached tokens, not total cost

Common Mistakes I See

Mistake 1: Sending the entire codebase every time

This is the #1 cost driver. Only send relevant files.

Mistake 2: Using Opus for simple tasks

Opus is for complex reasoning. In my experience, Sonnet handles about 90% of coding tasks at 60% lower cost.

Mistake 3: Ignoring prompt caching

If you send the same context repeatedly, cache it. The first call pays a 25% premium to write the cache; subsequent calls within the cache lifetime pay only 10% of the normal input price for the cached part.

Mistake 4: Long conversations without pruning

Every request in a conversation resends all previous messages as input, so each new turn costs more than the last. Long chats get expensive fast.
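A quick calculation shows why. If every request resends the whole history, the input billed over a conversation is the sum of the history size at each turn, which grows roughly with the square of the number of turns. A minimal sketch, where `cumulative_input_tokens` is a hypothetical helper and each list entry stands for the tokens one turn adds to the history:

```python
from itertools import accumulate

def cumulative_input_tokens(tokens_per_turn):
    """Total input tokens billed when request i resends turns 1..i."""
    # accumulate() yields the history size at each request; summing those gives the bill
    return sum(accumulate(tokens_per_turn))

print(cumulative_input_tokens([1_000] * 10))  # 55000
```

Ten turns of 1,000 tokens each bill 55,000 input tokens instead of 10,000, and twenty such turns would bill 210,000.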

Mistake 5: Extended thinking for routine tasks

Extended thinking generates additional thinking tokens, which are billed as output tokens. Reserve it for problems that actually need deep reasoning.
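One way to follow this is to enable extended thinking per request instead of globally. The sketch below builds keyword arguments for `client.messages.create`. The `thinking` parameter with `budget_tokens` is part of the Messages API, but `build_request`, the `needs_deep_reasoning` flag and the specific token numbers are illustrative assumptions.

```python
def build_request(prompt, needs_deep_reasoning=False, model="claude-sonnet-4-20250514"):
    """Return kwargs for client.messages.create, enabling extended thinking only on request."""
    request = {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }
    if needs_deep_reasoning:
        # Thinking tokens are billed as output, so cap them with budget_tokens.
        # budget_tokens must be at least 1,024 and smaller than max_tokens.
        request["max_tokens"] = 16_000
        request["thinking"] = {"type": "enabled", "budget_tokens": 8_000}
    return request

# Usage (requires the anthropic package and an API key):
# client.messages.create(**build_request("Rename this variable"))
# client.messages.create(**build_request("Design the sharding strategy", needs_deep_reasoning=True))
```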

Quick Reference: Cost Optimization Checklist

Cost reduction checklist
[ ] Enable prompt caching for repeated context
[ ] Use Sonnet (not Opus) for standard coding tasks
[ ] Use Haiku for simple operations
[ ] Send only relevant code files
[ ] Create onboarding docs instead of sending full codebase
[ ] Prune conversation history when it gets long
[ ] Monitor token usage in Anthropic dashboard
[ ] Set budget alerts at 50% and 75% of your limit
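For the "monitor token usage" item, each response's `usage` object reports `input_tokens`, `output_tokens`, `cache_creation_input_tokens` and `cache_read_input_tokens`, which is enough to log the cost of every request. A sketch under stated assumptions: `Prices` and `request_cost` are hypothetical helpers, and the Sonnet 4 list prices below (USD per million tokens) should be checked against Anthropic's current pricing page.

```python
from dataclasses import dataclass
from types import SimpleNamespace

@dataclass
class Prices:
    """USD per million tokens for one model."""
    input: float
    output: float
    cache_write: float  # 5-minute cache writes, 1.25x input
    cache_read: float   # cache reads, 0.1x input

def request_cost(usage, prices):
    """Cost in USD of one request, from the usage object returned by client.messages.create."""
    return (
        usage.input_tokens * prices.input
        + (usage.cache_creation_input_tokens or 0) * prices.cache_write
        + (usage.cache_read_input_tokens or 0) * prices.cache_read
        + usage.output_tokens * prices.output
    ) / 1_000_000

sonnet = Prices(input=3.0, output=15.0, cache_write=3.75, cache_read=0.30)
# Stand-in for message.usage: a small question on top of a 50k-token cached system prompt
example = SimpleNamespace(input_tokens=200, output_tokens=500,
                          cache_creation_input_tokens=0, cache_read_input_tokens=50_000)
print(f"${request_cost(example, sonnet):.4f}")
```

Summing these per-request costs in your own logs gives an early warning well before the dashboard alerts fire.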

Summary

In this post, I showed how I reduced my Claude API costs from $600/month to under $100/month.

The three key strategies are:

  1. Prompt caching: Cache repeated content to pay up to 90% less on those cached input tokens
  2. Model selection: Use Sonnet for most tasks (60% cheaper), Haiku for simple ones (90%+ cheaper)
  3. Context management: Send only relevant code, not entire repositories

The Reddit discussion made it clear: sending entire codebases repeatedly is the most common mistake. With proper caching and context management, costs drop dramatically without sacrificing code quality.

Start by checking your API dashboard. If you’re burning tokens on repeated context, implement caching today. It’s the fastest way to cut costs.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
