How to Add Persistence and Fault Tolerance to Java AI Agents with Koog

Purpose/Problem

I was building a banking assistant AI agent that handles multi-step tasks—checking account balances, processing transactions, sending notifications. These operations sometimes take several minutes to complete.

Then my server crashed mid-execution.

When I restarted the agent, it began from scratch. All those expensive LLM API calls I’d already made? Wasted. I had to pay for them again.

This is the persistence problem: AI agents execute long-running workflows, and without a way to save their state, any crash means starting over from the beginning.

The Solution: Koog’s Persistence Feature

Koog provides a built-in persistence feature that saves your agent’s state after each step. When configured correctly, your agent can resume from exactly where it stopped after a crash—no repeated LLM calls.

Let me show you how to set this up.

Configuring Persistence Storage

First, you need a place to store checkpoints. Koog supports several options:

  • PostgreSQL - Good for production with multiple agents
  • S3 - For distributed systems
  • Local disk - For development and single-instance deployments

Here’s how to configure PostgreSQL storage:

PostgresPersistenceConfig.java
import javax.sql.DataSource;
import ai.koog.persistence.PostgresJdbcPersistenceStorageProvider;

// Configure your Postgres connection
// (createDataSource is your own helper; it is not shown here)
DataSource dataSource = createDataSource(
    "jdbc:postgresql://localhost:5432/agent_db",
    "username",
    "password"
);

// Create storage provider for checkpoints
var storage = new PostgresJdbcPersistenceStorageProvider(
    dataSource,
    "banking_agent_checkpoints" // Table name for storing state
);

The storage provider handles serialization and deserialization of agent state. You don’t need to write any SQL—the provider creates the table schema automatically.
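
The snippet above calls createDataSource without showing it. One minimal way to write such a helper with only the JDK is sketched below. SimpleDataSources is a hypothetical name and not part of Koog, the sketch opens a fresh connection on every call, and it assumes the PostgreSQL JDBC driver is on the classpath at runtime.

```java
import java.io.PrintWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.logging.Logger;
import javax.sql.DataSource;

/** Hypothetical helper (not a Koog class): a bare-bones, unpooled DataSource. */
final class SimpleDataSources {

    private SimpleDataSources() {}

    /** Returns a DataSource that opens a new JDBC connection on every call. */
    static DataSource createDataSource(String url, String user, String password) {
        return new DataSource() {
            @Override public Connection getConnection() throws SQLException {
                return DriverManager.getConnection(url, user, password);
            }
            @Override public Connection getConnection(String u, String p) throws SQLException {
                return DriverManager.getConnection(url, u, p);
            }
            // Logging is not wired up in this sketch.
            @Override public PrintWriter getLogWriter() { return null; }
            @Override public void setLogWriter(PrintWriter out) { }
            @Override public void setLoginTimeout(int seconds) {
                DriverManager.setLoginTimeout(seconds);
            }
            @Override public int getLoginTimeout() {
                return DriverManager.getLoginTimeout();
            }
            @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException {
                throw new SQLFeatureNotSupportedException("java.util.logging is not used");
            }
            @Override public <T> T unwrap(Class<T> iface) throws SQLException {
                if (iface.isInstance(this)) {
                    return iface.cast(this);
                }
                throw new SQLException("Not a wrapper for " + iface.getName());
            }
            @Override public boolean isWrapperFor(Class<?> iface) {
                return iface.isInstance(this);
            }
        };
    }
}
```

Nothing connects until getConnection() is called, so the helper can be built at startup even if the database is briefly unavailable. For production traffic, a pooled DataSource is usually the better thing to pass to the storage provider.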

Installing the Persistence Feature

With storage configured, install the Persistence feature on your agent:

RecoverableAgent.java
import ai.koog.agent.AIAgent;
import ai.koog.persistence.Persistence;
import ai.koog.prompt.executor.PromptExecutor;
import ai.koog.llm.OpenAIModels;

// Build your agent with persistence enabled
// (promptExecutor and toolRegistry are assumed to be created elsewhere)
var recoverableAgent = AIAgent.builder()
    .promptExecutor(promptExecutor)
    .llmModel(OpenAIModels.Chat.GPT5_2)
    .systemPrompt("You're a banking assistant that helps customers with account inquiries.")
    .toolRegistry(toolRegistry)
    .install(Persistence.Feature, config -> {
        config.setStorage(storage);
        config.setEnableAutomaticPersistence(true); // Key setting!
    })
    .build();

The critical line is setEnableAutomaticPersistence(true). This tells Koog to save the agent’s state after each node in your graph executes. If the agent crashes during node 5, it restarts at node 5 using the checkpoint written after node 4, instead of starting from the beginning.

Using Session IDs for Recovery

Here’s where I made a mistake initially: I didn’t use session IDs.

When you run your agent, you need to pass a session ID. This ID ties the checkpoint data to a specific user session. Without it, multiple concurrent users would conflict with each other.

AgentExecution.java
// First run - starts fresh
String sessionId = "user-session-0123";
recoverableAgent.run("Help me check my account balance", sessionId);
// Server crashes mid-execution...
// System restarts...
// Second run with SAME session ID - automatically recovers
// Resumes from the exact node where it stopped
recoverableAgent.run("Help me check my account balance", sessionId);

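The session ID must be identical between runs for recovery to work. I use user IDs combined with timestamps or UUIDs; whatever you pick, store it somewhere that survives the crash (for example, alongside the user’s request), because an ID regenerated at startup will not match the existing checkpoint.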
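
One way to avoid storing the session ID at all is to derive it deterministically from values you already have after a restart. The sketch below is an illustration, not something Koog provides: SessionIds, forTask, and the taskKey parameter are hypothetical names, and it assumes each user runs at most one workflow per task key at a time.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

/** Hypothetical helper (not a Koog class) for reproducible session IDs. */
final class SessionIds {

    private SessionIds() {}

    /**
     * Derives the same session ID every time for the same user and task,
     * so the ID can be recomputed after a crash instead of being looked up.
     */
    static String forTask(String userId, String taskKey) {
        String name = userId + ":" + taskKey;
        // Name-based (type 3) UUID: identical input bytes give an identical UUID.
        UUID id = UUID.nameUUIDFromBytes(name.getBytes(StandardCharsets.UTF_8));
        return "session-" + id;
    }
}
```

Calling SessionIds.forTask("user-0123", "transfer-2024-05-01") before and after a restart yields the same string, which can then be passed as the second argument to run. If a user can start the same task twice on purpose, add something that distinguishes the attempts (such as a request ID) to taskKey.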

What Gets Saved

When automatic persistence is enabled, Koog saves:

  • Current node position in the graph
  • Variables and their values
  • Conversation history
  • Tool call results

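This means if your agent already called an expensive LLM or made a bank API call in a node that finished before the crash, those results are preserved and the agent won’t repeat them. A node that was still running when the process died has no checkpoint yet, so expect it to run again on recovery.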

Storage Choice Trade-offs

I experimented with different storage options:

Storage Comparison
PostgreSQL:
+ ACID guarantees
+ Concurrent access support
+ Works with existing infrastructure
- Requires database setup
- Network latency

S3:
+ Virtually unlimited scale
+ Works across regions
+ No database maintenance
- Higher latency
- No multi-object transactions (single-object reads are strongly consistent)

Local Disk:
+ Lowest latency (no network hop)
+ No external dependencies
+ Simple setup
- Not shared across instances
- Lost if machine fails

For production with multiple agent instances, PostgreSQL or S3 are the practical choices. Local disk works fine for development or single-instance deployments.

Testing Recovery

I strongly recommend testing recovery scenarios. Here’s my approach:

RecoveryTest.java
// 1. Start agent with a session ID
agent.run("Process my transaction", "test-session-001");
// 2. Simulate crash by killing the process mid-execution
// (Use a breakpoint or kill -9)
// 3. Restart and run with same session ID
agent.run("Process my transaction", "test-session-001");
// 4. Verify agent resumes from checkpoint
// Check logs for "Resuming from checkpoint" message

I’ve caught several bugs this way—mostly related to non-serializable objects in my agent state.
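
A cheap way to catch serialization bugs before a full crash test is to round-trip the state object in a unit test. The sketch below assumes the state is a plain Java object written with java.io serialization; Koog’s storage provider may use a different format, so treat this as an illustration of the idea only. StateRoundTrip and BankingState are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical test helper (not a Koog class). */
final class StateRoundTrip {

    private StateRoundTrip() {}

    /** Example state: plain data only, with the live resource marked transient. */
    static final class BankingState implements Serializable {
        private static final long serialVersionUID = 1L;

        final String accountId;
        final List<String> completedSteps = new ArrayList<>();
        transient Thread worker; // skipped on write; must be re-created after recovery

        BankingState(String accountId) {
            this.accountId = accountId;
        }
    }

    /**
     * Writes the value to a byte array and reads it back.
     * Throws NotSerializableException (an IOException) if any reachable,
     * non-transient field cannot be serialized.
     */
    @SuppressWarnings("unchecked")
    static <T extends Serializable> T roundTrip(T value)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (T) in.readObject();
        }
    }
}
```

After the round trip, accountId and completedSteps come back with their values and worker comes back as null. Removing the transient keyword from worker makes roundTrip throw, which is the kind of failure that otherwise only shows up during recovery.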

Common Mistakes

I made these mistakes so you don’t have to:

Not using session IDs: Multiple users will overwrite each other’s checkpoints. Always use unique session IDs.

Non-serializable state: If your agent holds references to threads, sockets, or file handles, persistence will fail. Keep your state simple and serializable.

Wrong storage for scale: Local disk doesn’t work when you have multiple agent instances. Plan for your deployment architecture.

Not testing recovery: You only know recovery works once you have tested it. Add crash simulation to your test suite.

How It Works Under the Hood

Koog implements checkpointing by wrapping your graph nodes. After each node executes:

  1. Serialize the current graph state
  2. Write to configured storage with session ID as key
  3. Move to next node

On restart, the agent:

  1. Checks storage for existing state with given session ID
  2. If found, deserializes and restores state
  3. Continues execution from the saved node

This happens transparently—you don’t need to modify your graph logic.
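
The save-then-advance loop described above can be sketched in a few lines of plain Java. This is a toy model and not Koog’s implementation: CheckpointingRunner and Checkpoint are hypothetical names, an in-memory map stands in for Postgres, S3, or disk, and nodes are modelled as simple functions over a string map.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.UnaryOperator;

/** Toy model of checkpoint-after-each-node execution (not Koog's code). */
final class CheckpointingRunner {

    /** Snapshot written after a node completes: index of the next node plus state so far. */
    record Checkpoint(int nextNode, Map<String, String> state) {}

    /** Stand-in for the configured storage, keyed by session ID. */
    private final Map<String, Checkpoint> storage = new ConcurrentHashMap<>();
    private final List<UnaryOperator<Map<String, String>>> nodes;

    CheckpointingRunner(List<UnaryOperator<Map<String, String>>> nodes) {
        this.nodes = List.copyOf(nodes);
    }

    /** Runs the nodes in order, resuming from the stored checkpoint if one exists. */
    Map<String, String> run(String sessionId) {
        // 1. Check storage for existing state with the given session ID.
        Checkpoint start = storage.getOrDefault(sessionId, new Checkpoint(0, Map.of()));
        // 2. Restore the state (empty if this is a fresh session).
        Map<String, String> state = new LinkedHashMap<>(start.state());
        // 3. Continue from the saved node.
        for (int i = start.nextNode(); i < nodes.size(); i++) {
            state = new LinkedHashMap<>(nodes.get(i).apply(state));
            // Save only after the node finished; a node that throws leaves
            // the previous checkpoint in place and is re-run on the next call.
            storage.put(sessionId, new Checkpoint(i + 1, Map.copyOf(state)));
        }
        return state;
    }
}
```

If the second of two nodes throws on the first call to run("s1"), a second call to run("s1") skips the first node and re-runs only the second. That also shows why a node with side effects should be safe to execute twice: the node that was in flight at the time of the crash is the one that gets repeated.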

When Persistence Matters

Not every agent needs persistence. It’s most valuable when:

  • Your agent makes expensive API calls (LLM, third-party services)
  • Execution takes more than a few seconds
  • Your service has frequent deployments
  • You’re running in an unstable environment

For simple, fast agents that complete in under a second, persistence adds overhead without much benefit.

Summary

Koog’s persistence feature solves the crash recovery problem for AI agents. Configure storage, install the feature with automatic persistence enabled, and use consistent session IDs. Your agents will recover from failures without repeating expensive operations.

The key is setEnableAutomaticPersistence(true)—this single setting turns a fragile agent into one that can survive server restarts, deployments, and crashes.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Here are the most important links from this article, along with some further resources on this topic:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
