
How to Reduce Token Costs with History Compression in Koog AI Agents

Purpose/Problem

I was building a long-running AI agent in Java using Koog, and I noticed something troubling: the more tasks my agent completed, the slower and more expensive each LLM call became.

Here’s what happened. My agent was designed to handle multiple user requests in a session. Every LLM call, every tool invocation, every response - they all accumulated in the conversation history. After about 20-30 interactions, my costs had doubled. After 50 interactions, the agent started failing with context window limit errors.

The root cause? Token bloat from unmanaged conversation history.

The Discovery

I dug into the Koog documentation and found exactly what I needed: history compression. This feature intelligently summarizes or extracts key information from conversation history to keep token counts manageable.

The problem is real: longer context means slower LLM responses, higher costs (you pay per token), and eventually hitting context window limits where your agent just stops working.

How It Works

Koog’s approach is straightforward. Instead of keeping every message in history, you can compress it - either summarizing entire conversations or extracting just the essential facts.

You call ctx.compressHistory() in your functional strategy when the history gets too long. The key is choosing the right compression strategy for your use case.

Here are the available strategies:

CompressionStrategies.java
// 1. WholeHistory - compress entire history into a summary
HistoryCompressionStrategy.WholeHistory
// 2. FromLastNMessages - only compress last N messages
HistoryCompressionStrategy.FromLastNMessages(100)
// 3. Chunked - compress in chunks of N messages
HistoryCompressionStrategy.Chunked(20)
// 4. RetrieveFactsFromHistory - extract specific facts
// Example: "What's the user's name?" or "Which operations were performed?"

The Implementation

Here’s how I integrated history compression into my agent:

AgentWithCompression.java
var agentWithCompression = AIAgent.builder()
    .promptExecutor(promptExecutor)
    .llmModel(OpenAIModels.Chat.GPT5_2)
    .functionalStrategy("compressed", (ctx, userInput) -> {
        var response = ctx.requestLLM(userInput);
        // Agent logic here...
        // Compress the accumulated history
        // (in a real agent, guard this call with a size check instead of running it every time)
        ctx.compressHistory();
        return response;
    })
    .build();

The first time I tried this, I made a mistake: I called compressHistory() after every single LLM call. That’s overkill. Compression is itself an LLM request, so it adds its own tokens and latency. Compress strategically instead: maybe every 10-20 messages, or when you detect the history growing beyond a certain size.
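One way to make "compress strategically" concrete is a small trigger that checks message count and a rough token estimate before compressing. The sketch below is plain Java and not part of Koog: the class name CompressionTrigger, both thresholds, and the four-characters-per-token heuristic are all assumptions for illustration.

```java
import java.util.List;

/** Hypothetical helper (not a Koog class): decides when a compression pass is worthwhile. */
final class CompressionTrigger {
    private final int maxMessages;
    private final int maxEstimatedTokens;

    CompressionTrigger(int maxMessages, int maxEstimatedTokens) {
        this.maxMessages = maxMessages;
        this.maxEstimatedTokens = maxEstimatedTokens;
    }

    /** Rough heuristic, assumed here: about 4 characters per token for English text. */
    static int estimateTokens(List<String> messages) {
        int chars = 0;
        for (String message : messages) {
            chars += message.length();
        }
        return (chars + 3) / 4; // round up
    }

    /** True once either the message count or the estimated token count reaches its limit. */
    boolean shouldCompress(List<String> messages) {
        return messages.size() >= maxMessages
                || estimateTokens(messages) >= maxEstimatedTokens;
    }
}
```

Inside the strategy lambda, the idea would be to call ctx.compressHistory() only when shouldCompress(...) returns true. How you obtain the current message texts from the agent context depends on the Koog API and is not shown here.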

Choosing the Right Strategy

Each strategy has its place:

WholeHistory - Use this when you want a complete summary of everything that happened. Good for agents that need to maintain a high-level understanding of the entire conversation.

FromLastNMessages(N) - Use this to summarize only the last N messages. According to the Koog documentation, earlier messages are discarded rather than kept intact, so it fits cases where only the agent’s most recent progress matters.

Chunked(N) - Use this to compress in batches. Good for very long conversations where you want gradual, predictable compression.

RetrieveFactsFromHistory - Use this when you need specific information extracted. Instead of summarizing, it answers questions like “What’s the user’s name?” or “Which files were modified?”
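To picture how the first and third strategies reshape a history, here is a toy model in plain Java. It is not Koog's implementation: the class CompressionShapes and the counting "summarizer" are stand-ins, where a real summarizer would ask the LLM for a TLDR of each span.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy model (not Koog code) of how two compression shapes reduce a message list. */
final class CompressionShapes {

    /** WholeHistory-style: the entire history collapses into a single summary message. */
    static List<String> wholeHistory(List<String> history,
                                     Function<List<String>, String> summarize) {
        List<String> out = new ArrayList<>();
        if (!history.isEmpty()) {
            out.add(summarize.apply(history));
        }
        return out;
    }

    /** Chunked-style: each run of chunkSize messages becomes one summary message. */
    static List<String> chunked(List<String> history, int chunkSize,
                                Function<List<String>, String> summarize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> out = new ArrayList<>();
        for (int start = 0; start < history.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, history.size());
            out.add(summarize.apply(history.subList(start, end)));
        }
        return out;
    }

    /** Placeholder summarizer: only reports how many messages it replaced. */
    static String countingSummary(List<String> span) {
        return "[summary of " + span.size() + " messages]";
    }
}
```

With 45 messages and a chunk size of 20, the chunked shape leaves three summary messages (covering 20, 20, and 5 messages), while the whole-history shape leaves one. That is the trade-off between detail retained and tokens saved.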

The Cost Impact

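The input cost of each call scales linearly with context size, and because every call resends the whole history, the total cost of a session grows much faster than the number of interactions. For agents handling many interactions, I’ve seen compression reduce costs by 50-80% while maintaining agent effectiveness.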
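A back-of-the-envelope model shows where savings of that size can come from. The sketch below is an illustration with made-up numbers, not a measurement. It assumes every turn adds a fixed number of tokens, every call resends the full history, and each compression pass reads the history once and replaces it with a fixed-size summary. Output tokens are ignored.

```java
/** Toy cost model (assumed numbers, not measurements): input tokens sent over a session. */
final class HistoryCostModel {

    /** Total input tokens over `turns` calls when the history is never compressed. */
    static long uncompressed(int turns, int tokensPerTurn) {
        long total = 0;
        long history = 0;
        for (int turn = 1; turn <= turns; turn++) {
            history += tokensPerTurn;
            total += history; // each call resends the whole history
        }
        return total;
    }

    /** Same session, but every `every` turns the history is replaced by a summary. */
    static long compressed(int turns, int tokensPerTurn, int every, int summaryTokens) {
        long total = 0;
        long history = 0;
        for (int turn = 1; turn <= turns; turn++) {
            history += tokensPerTurn;
            total += history;
            if (turn % every == 0) {
                total += history;        // the compression request reads the history once
                history = summaryTokens; // afterwards only the summary remains
            }
        }
        return total;
    }
}
```

With 50 turns of 500 tokens each, compressing every 10 turns down to a 300-token summary, this model sends 175,700 input tokens instead of 637,500, a reduction of roughly 72%. Real savings depend on how large your summaries are and how often you compress.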

But here’s the catch: compress too aggressively and you lose critical context. I learned this the hard way when my agent forgot important user preferences after a too-aggressive compression. Test your agent behavior after implementing compression.
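A cheap guard against this kind of regression is to list the facts that must survive compression and check the compressed history for them. The helper below is a hypothetical sketch in plain Java. The class name and the case-insensitive substring matching are assumptions, and a substring check is only a first line of defence, not a substitute for testing actual agent behavior.

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

/** Hypothetical test helper (not part of Koog): finds facts lost during compression. */
final class CompressionRegressionCheck {

    /** Returns the facts from mustSurvive that no longer appear in the compressed text. */
    static List<String> missingFacts(String compressedHistory, List<String> mustSurvive) {
        String haystack = compressedHistory.toLowerCase(Locale.ROOT);
        return mustSurvive.stream()
                .filter(fact -> !haystack.contains(fact.toLowerCase(Locale.ROOT)))
                .collect(Collectors.toList());
    }
}
```

In a test, you would feed it the text of the compressed history and fail the test if the returned list is not empty.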

Common Mistakes to Avoid

  1. Waiting until you hit context limits - Compress proactively, not reactively. By the time you hit the limit, your costs are already inflated.

  2. Compressing too aggressively - You might lose important context. Start with conservative strategies and adjust based on your agent’s needs.

  3. Not testing after compression - Your agent might behave differently after history compression. Test thoroughly.

Final Thoughts

History compression in Koog is a practical solution to a real problem. If you’re building long-running agents, you’ll hit this wall eventually. The solution is simple: call ctx.compressHistory() with an appropriate strategy before your history becomes unmanageable.

For domain-specific needs, you can implement custom compression strategies. But start with the built-in ones - they cover most use cases.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.

