How Do I Prevent My AI Agent from Getting Stuck in an Infinite Loop?
Purpose
This post shows how to prevent AI agents from getting stuck in infinite loops.
Problem
I built my first autonomous AI agent last month. It worked great in testing. Then I deployed it to production and went to bed.
The next morning, I woke up to an $847 notification on my phone from my API provider.
    Date: March 10, 2026
    Tokens Used: 127,843,291
    Cost: $847.23

My agent had been running in circles all night. It kept “thinking” and calling the same API over and over. I had forgotten one thing: infinite loop prevention.
Environment
- Python 3.12
- LangChain for agent framework
- OpenAI/Claude API
- Redis for state tracking
What Went Wrong
My agent code looked like this:
    class BadAgent:
        async def run(self, goal: str):
            while not self.goal_reached:
                action = await self.decide_next_action()
                result = await self.execute(action)
                self.context = self.update(result)
            return self.result

This looks fine. But what happens when the agent can’t reach the goal? It loops forever.
Here’s what I saw in the logs:
    [02:15:33] Action: search_web("weather tokyo")
    [02:15:34] Result: No results found, retrying...
    [02:15:35] Action: search_web("weather tokyo")
    [02:15:36] Result: No results found, retrying...
    [02:15:37] Action: search_web("weather tokyo")
    ... (repeated 15,847 times until I killed it)

The agent hit an API error and kept retrying the same action. No timeout. No iteration limit. No loop detection. Just an endless reasoning cycle burning my API credits.
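A retry that can give up would have stopped this particular failure on its own. The following is a minimal sketch of bounded retries with exponential backoff, not the code from my agent: `call_with_retries`, `RetriesExhausted`, and the default values are names and numbers I made up for illustration.

```python
import asyncio
import random


class RetriesExhausted(Exception):
    """Raised when an action still fails after the allowed attempts."""


async def call_with_retries(action, max_attempts: int = 3, base_delay: float = 0.5):
    """Run `action` (an async callable) at most `max_attempts` times.

    Waits base_delay * 2**attempt (plus a little jitter) between attempts,
    then gives up instead of retrying forever.
    """
    last_error = None
    for attempt in range(max_attempts):
        try:
            return await action()
        except Exception as exc:  # narrow this to your API client's error types
            last_error = exc
            if attempt < max_attempts - 1:
                delay = base_delay * (2 ** attempt)
                await asyncio.sleep(delay + random.uniform(0, delay / 10))
    raise RetriesExhausted(f"gave up after {max_attempts} attempts") from last_error
```

The important part is the final `raise`: after the last attempt the caller gets an exception it can turn into a partial result, rather than another identical request.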
Why This Happens
AI agents get stuck in loops for three main reasons:
- No hard stop - The loop has no maximum iteration count
- State confusion - The agent forgets what it already tried
- No progress tracking - The agent keeps trying the same failed action
A Reddit thread about OpenClaw described this perfectly: “Millions of ghost agents running 24/7 in infinite reasoning loops, slamming 128k context windows into APIs like a punching bag.”
These runaway agents can accidentally DDoS cloud providers and drain your budget in hours.
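Since the damage shows up as token spend, one extra guard worth considering is a hard token budget per run. This is a rough sketch under my own assumptions: `TokenBudget` and `BudgetExceeded` are hypothetical names, and you would feed `charge()` with the usage numbers your API client reports after each call.

```python
class BudgetExceeded(Exception):
    """Raised when the run has spent more tokens than allowed."""


class TokenBudget:
    """Track tokens spent across a run and stop once a cap is passed."""

    def __init__(self, max_tokens: int):
        self.max_tokens = max_tokens
        self.used = 0

    def charge(self, tokens: int) -> None:
        """Record usage from one API call; raise if the cap is exceeded."""
        self.used += tokens
        if self.used > self.max_tokens:
            raise BudgetExceeded(
                f"used {self.used:,} tokens, budget is {self.max_tokens:,}"
            )


budget = TokenBudget(max_tokens=200_000)
budget.charge(128_000)       # first call with a full 128k context: allowed
try:
    budget.charge(128_000)   # second call pushes past the cap
except BudgetExceeded as exc:
    print(f"Stopping agent: {exc}")
```

A budget like this does not detect loops, but it bounds how expensive an undetected loop can get.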
Solution: Three-Layer Defense
I implemented three safeguards to prevent this:
    +---------------------------+
    | Layer 1: Max Iterations   | <-- Hard stop after N loops
    +---------------------------+
    | Layer 2: Timeouts         | <-- Stop after X seconds
    +---------------------------+
    | Layer 3: Loop Detection   | <-- Detect repeated states
    +---------------------------+

Layer 1: Maximum Iterations
The first layer is a hard iteration limit. LangChain makes this easy:
    from langchain.agents import AgentExecutor

    agent_executor = AgentExecutor(
        agent=agent,
        tools=tools,
        max_iterations=15,                 # Stop after 15 iterations
        max_execution_time=300,            # Or 5 minutes, whichever comes first
        early_stopping_method="generate",  # Return partial result (not every agent type
                                           # supports this; the default is "force")
        handle_parsing_errors=True,
    )

Now my agent stops after 15 iterations or 5 minutes, whichever comes first. Without the time limit, 15 iterations at 2 minutes each would still mean 30 minutes of wasted time. For my own agent loop, I needed timeouts too.
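If you are not using LangChain, the same hard stop is easy to write by hand. Here is a framework-free sketch of an iteration cap; `run_bounded` and the shape of `step` (return an answer string when done, `None` to keep going) are assumptions made for this example.

```python
from typing import Callable, Optional


def run_bounded(step: Callable[[dict], Optional[str]], max_iterations: int = 15) -> dict:
    """Call `step(context)` until it returns an answer or the cap is hit.

    `step` is a hypothetical callable: it returns a final answer string when
    the goal is reached, or None to request another iteration.
    """
    context: dict = {"history": []}
    for iteration in range(1, max_iterations + 1):
        answer = step(context)
        if answer is not None:
            return {"success": True, "answer": answer, "iterations": iteration}
        context["history"].append(iteration)
    # The for loop cannot run past max_iterations, so this is the hard stop.
    return {"success": False, "answer": None, "iterations": max_iterations}
```

Using a `for` loop over a fixed range instead of `while not goal_reached` means the exit condition no longer depends on the agent ever succeeding.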
Layer 2: Timeouts
The second layer is time-based protection. I built a timeout manager:
    import asyncio
    from datetime import datetime

    class AgentTimeoutManager:
        def __init__(
            self,
            total_timeout: int = 600,      # 10 minutes max
            iteration_timeout: int = 30,   # 30 seconds per step
            idle_timeout: int = 120        # 2 minutes without progress
        ):
            self.total_timeout = total_timeout
            self.iteration_timeout = iteration_timeout
            self.idle_timeout = idle_timeout
            self.start_time = datetime.now()
            self.last_progress_time = datetime.now()

        async def run_with_timeout(self, agent_step, context: dict) -> dict:
            """Execute agent step with timeout protection"""

            # Check total time budget
            elapsed = (datetime.now() - self.start_time).total_seconds()
            if elapsed > self.total_timeout:
                raise TimeoutError(f"Total timeout exceeded: {elapsed:.1f}s")

            # Check idle time (no progress)
            idle_time = (datetime.now() - self.last_progress_time).total_seconds()
            if idle_time > self.idle_timeout:
                raise TimeoutError(f"No progress for {idle_time:.1f}s")

            # Execute step with iteration timeout
            try:
                result = await asyncio.wait_for(
                    agent_step(context),
                    timeout=self.iteration_timeout
                )
                self.last_progress_time = datetime.now()
                return result
            except asyncio.TimeoutError:
                raise TimeoutError(f"Step exceeded {self.iteration_timeout}s")

Now my agent has multiple time limits:
- Total runtime: 10 minutes max
- Per-step timeout: 30 seconds
- Idle detection: Stop if no progress for 2 minutes
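To see the per-step limit in isolation, here is a small self-contained sketch. `slow_step` and `guarded_step` are hypothetical names for this demo, and the 0.05 second timeout is only there so the example finishes quickly.

```python
import asyncio


async def slow_step(context: dict) -> dict:
    """Hypothetical agent step that hangs far longer than it should."""
    await asyncio.sleep(10)
    return {"done": True}


async def guarded_step(step, context: dict, timeout: float) -> dict:
    """Run one step, converting a timeout into a structured result."""
    try:
        return await asyncio.wait_for(step(context), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for has already cancelled the step by the time we get here.
        return {"done": False, "error": f"step exceeded {timeout}s"}


result = asyncio.run(guarded_step(slow_step, {}, timeout=0.05))
print(result)  # {'done': False, 'error': 'step exceeded 0.05s'}
```

Returning a result instead of letting the exception escape keeps the agent loop in control of what happens next, such as returning a partial answer.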
But there was still one problem. What if the agent keeps doing the same thing but slightly differently?
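One way to catch near-duplicates, sketched here as an assumption rather than something my agent shipped with, is to normalize an action and its arguments before comparing them. `action_fingerprint` is a hypothetical helper, and the normalization rules (lowercase, collapse whitespace, sort keys) are deliberately simple.

```python
import hashlib
import json
import re


def action_fingerprint(name: str, args: dict) -> str:
    """Hash an action so trivially different calls compare equal.

    Strings are lowercased and have whitespace collapsed; dict keys are
    sorted. Adjust this to what counts as "the same" for your tools.
    """
    def normalize(value):
        if isinstance(value, str):
            return re.sub(r"\s+", " ", value).strip().lower()
        if isinstance(value, dict):
            return {k: normalize(v) for k, v in sorted(value.items())}
        if isinstance(value, (list, tuple)):
            return [normalize(v) for v in value]
        return value

    payload = json.dumps({"name": name, "args": normalize(args)}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()


a = action_fingerprint("search_web", {"query": "weather tokyo"})
b = action_fingerprint("search_web", {"query": "  Weather   Tokyo "})
c = action_fingerprint("search_web", {"query": "weather osaka"})
print(a == b, a == c)  # True False
```

Fingerprints like these can be stored in a history window in place of raw action names. Rephrasings that change the words themselves still slip through, which is why the iteration and time limits remain the backstop.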
Layer 3: Loop Detection
The third layer detects when the agent revisits the same state or action:
    from collections import deque
    from hashlib import md5
    import json

    class LoopDetector:
        """Detect when agent is revisiting same states/actions"""

        def __init__(self, window_size: int = 5):
            self.state_history = deque(maxlen=window_size)
            self.action_history = deque(maxlen=window_size)

        def check_for_loop(self, current_state: dict, current_action: str) -> tuple[bool, str]:
            """Returns: (is_loop_detected, diagnostic_message)"""

            # Create hash of current state (default=str covers values JSON can't encode)
            state_hash = md5(
                json.dumps(current_state, sort_keys=True, default=str).encode()
            ).hexdigest()

            # Check for exact state repetition within the window
            if state_hash in self.state_history:
                return True, "Exact state repeated"

            # Check whether this action already appears 3+ times in the window
            if list(self.action_history).count(current_action) >= 3:
                return True, f"Action '{current_action}' repeated 3+ times"

            # Update histories
            self.state_history.append(state_hash)
            self.action_history.append(current_action)

            return False, ""

This detects when my agent keeps trying the same thing. Let me show you how I use it:
    class SafeAgent:
        # _decide_action, _execute_action, _update, _is_goal_reached and
        # _generate_partial_result are agent-specific and not shown here.
        def __init__(self):
            self.timeout_manager = AgentTimeoutManager()
            self.loop_detector = LoopDetector()
            self.iteration_count = 0
            self.max_iterations = 20

        async def run(self, goal: str) -> dict:
            context = {'goal': goal}

            while True:
                try:
                    # Check iteration limit
                    self.iteration_count += 1
                    if self.iteration_count > self.max_iterations:
                        return self._generate_partial_result("Max iterations reached")

                    # Get next action with timeout
                    action = await self.timeout_manager.run_with_timeout(
                        self._decide_action, context
                    )

                    # Check for loops
                    is_loop, message = self.loop_detector.check_for_loop(
                        context, action['name']
                    )
                    if is_loop:
                        return self._generate_partial_result(f"Loop detected: {message}")

                    # Execute action
                    result = await self.timeout_manager.run_with_timeout(
                        self._execute_action,
                        {'action': action, 'context': context}
                    )

                    context = self._update(context, result)

                    if self._is_goal_reached(context):
                        return {'success': True, 'result': context}

                except TimeoutError as e:
                    return self._generate_partial_result(str(e))

Now my agent has three layers of protection. Let me test it:
    python safe_agent.py "Find the best pizza in Tokyo"

    # Output
    [00:00:15] Action: search("best pizza tokyo")
    [00:00:16] Action: read_reviews(3 results)
    [00:00:18] Action: compare_ratings()
    [00:00:20] Success: Found top-rated pizza place
    Total iterations: 4
    Total time: 5.2 seconds

And when something goes wrong:
    python safe_agent.py "Find impossible thing"

    # Output
    [00:00:15] Action: search("impossible thing")
    [00:00:16] Action: search("impossible thing")
    [00:00:17] Action: search("impossible thing")
    [00:00:18] Loop detected: Action 'search' repeated 3+ times
    Returning partial result...
    Total iterations: 3
    Total time: 3.1 seconds
    Cost saved: ~$50

Common Mistakes I Made
Mistake 1: Only Using max_iterations
    # WRONG: Only iteration limit
    agent = AgentExecutor(agent=llm, tools=tools, max_iterations=10)
    # Problem: Each iteration can take 5 minutes = 50 minutes total!

Mistake 2: No Progress Tracking
    # WRONG: Binary goal check
    while not goal_reached:
        # Agent might make zero progress for 100 iterations
        pass

    # RIGHT: Track progress
    class ProgressTracker:
        def __init__(self, patience: int = 5):
            self.no_progress_count = 0
            self.patience = patience

        def check(self, old_state, new_state) -> bool:
            if old_state == new_state:
                self.no_progress_count += 1
                if self.no_progress_count >= self.patience:
                    return False  # No progress, should stop
            else:
                self.no_progress_count = 0
            return True

Mistake 3: Letting Context Grow Forever
    # WRONG: Unbounded context
    context += f"\nAction: {action}\nResult: {result}"
    # Context grows until you hit token limits and costs explode

    # RIGHT: Sliding window with summarization
    from collections import deque

    class ContextManager:
        def __init__(self, max_messages: int = 20):
            self.max_messages = max_messages
            # deque(maxlen=...) drops the oldest message once the window is full
            self.messages = deque(maxlen=max_messages)

        def add(self, message: str):
            self.messages.append(message)
            if len(self.messages) == self.max_messages:
                # Summarize old messages, keep recent ones
                self._summarize_old()

        def _summarize_old(self):
            # Placeholder: replace the oldest messages with a short summary
            pass

Quick Reference
Here’s the minimal code I use now for every agent:
    import time

    from langchain.agents import AgentExecutor

    # Minimal safeguards - always include these
    agent_executor = AgentExecutor(
        agent=agent,
        tools=tools,
        max_iterations=15,                 # 1. Iteration limit
        max_execution_time=300,            # 2. Time limit
        early_stopping_method="generate",  # 3. Graceful exit
        handle_parsing_errors=True,
    )

    # For custom agents, use this pattern:
    class SafeLoop:
        def __init__(self, max_iter=20, timeout=300):
            self.max_iter = max_iter
            self.timeout = timeout
            self.iterations = 0
            self.seen_actions = set()
            self.start = time.monotonic()

        def should_continue(self, action: str) -> bool:
            self.iterations += 1
            if self.iterations > self.max_iter:
                return False  # Iteration limit reached
            if time.monotonic() - self.start > self.timeout:
                return False  # Time limit reached
            if action in self.seen_actions:
                return False  # Already tried this
            self.seen_actions.add(action)
            return True

Summary
In this post, I showed how to prevent AI agents from getting stuck in infinite loops. The key is implementing three safeguards:
- Iteration limits - Stop after N loops (use max_iterations)
- Timeouts - Stop after X seconds (use max_execution_time)
- Loop detection - Detect repeated states and actions
Without these safeguards, a single runaway agent can cost hundreds of dollars and accidentally DDoS cloud providers. With them, your agent fails gracefully and returns a partial result.
Start with LangChain’s built-in max_iterations and max_execution_time. Add custom loop detection for production systems. The few lines of defensive code will save you debugging time and API costs.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: OpenClaw DDOS Thread
- LangChain AgentExecutor Documentation
- Circuit Breaker Pattern
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!