How Do I Prevent My AI Agent from Getting Stuck in an Infinite Loop?

Mar 11, 2026

Purpose

This post shows how to prevent AI agents from getting stuck in infinite loops.

Problem

I built my first autonomous AI agent last month. It worked great in testing. Then I deployed it to production and went to bed.

The next morning, I woke up to a billing notification on my phone from my API provider:

Date: March 10, 2026
Tokens Used: 127,843,291
Cost: $847.23

My agent had been running in circles all night. It kept “thinking” and calling the same API over and over. I had forgotten one thing: infinite loop prevention.

Environment

Python 3.12
LangChain for agent framework
OpenAI/Claude API
Redis for state tracking

What Went Wrong

My agent code looked like this:

class BadAgent:
    async def run(self, goal: str):
        while not self.goal_reached:
            action = await self.decide_next_action()
            result = await self.execute(action)
            self.context = self.update(result)
        return self.result

This looks fine. But what happens when the agent can’t reach the goal? It loops forever.

Here’s what I saw in the logs:

[02:15:33] Action: search_web("weather tokyo")
[02:15:34] Result: No results found, retrying...
[02:15:35] Action: search_web("weather tokyo")
[02:15:36] Result: No results found, retrying...
[02:15:37] Action: search_web("weather tokyo")
... (repeated 15,847 times until I killed it)

The search kept coming back empty, and the agent kept retrying the same action. No timeout. No iteration limit. No loop detection. Just an endless reasoning cycle burning my API credits.

Why This Happens

AI agents get stuck in loops for three main reasons:

No hard stop - The loop has no maximum iteration count
State confusion - The agent forgets what it already tried
No progress tracking - The agent keeps trying the same failed action

A Reddit thread about OpenClaw described this perfectly: “Millions of ghost agents running 24/7 in infinite reasoning loops, slamming 128k context windows into APIs like a punching bag.”

These runaway agents can accidentally DDoS cloud providers and drain your budget in hours.

Solution: Three-Layer Defense

I implemented three safeguards to prevent this:

+--------------------------------+
|     Layer 1: Max Iterations    |  <-- Hard stop after N loops
+--------------------------------+
|     Layer 2: Timeouts          |  <-- Stop after X seconds
+--------------------------------+
|     Layer 3: Loop Detection    |  <-- Detect repeated states
+--------------------------------+

Layer 1: Maximum Iterations

The first layer is a hard iteration limit. LangChain makes this easy:

from langchain.agents import AgentExecutor

agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=15,           # Stop after 15 iterations
    max_execution_time=300,      # Or 5 minutes, whichever comes first
    early_stopping_method="generate",  # Final LLM pass for an answer (support varies by agent type and version)
    handle_parsing_errors=True
)

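Now my agent stops after 15 iterations. An iteration cap alone is not enough, though: if each iteration takes 2 minutes, that’s still 30 minutes of wasted time. max_execution_time covers this for AgentExecutor, but my custom agent loop needed its own timeouts.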

Layer 2: Timeouts

The second layer is time-based protection. I built a timeout manager:

import asyncio
import time

class AgentTimeoutManager:
    def __init__(
        self,
        total_timeout: int = 600,      # 10 minutes max
        iteration_timeout: int = 30,   # 30 seconds per step
        idle_timeout: int = 120        # 2 minutes without progress
    ):
        self.total_timeout = total_timeout
        self.iteration_timeout = iteration_timeout
        self.idle_timeout = idle_timeout
        # Monotonic clock: unaffected by system clock adjustments
        self.start_time = time.monotonic()
        self.last_progress_time = self.start_time

    async def run_with_timeout(self, agent_step, context: dict) -> dict:
        """Execute agent step with timeout protection"""

        # Check total time budget
        elapsed = time.monotonic() - self.start_time
        if elapsed > self.total_timeout:
            raise TimeoutError(f"Total timeout exceeded: {elapsed:.1f}s")

        # Check idle time (time since the last step completed)
        idle_time = time.monotonic() - self.last_progress_time
        if idle_time > self.idle_timeout:
            raise TimeoutError(f"No progress for {idle_time:.1f}s")

        # Execute step with iteration timeout
        try:
            result = await asyncio.wait_for(
                agent_step(context),
                timeout=self.iteration_timeout
            )
            # Any completed step counts as progress here
            self.last_progress_time = time.monotonic()
            return result
        except asyncio.TimeoutError as exc:
            raise TimeoutError(f"Step exceeded {self.iteration_timeout}s") from exc

Now my agent has multiple time limits:

Total runtime: 10 minutes max
Per-step timeout: 30 seconds
Idle detection: Stop if no progress for 2 minutes
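
To see the per-step limit in isolation, here is a minimal sketch built on asyncio.wait_for. The names slow_step and run_step are made up for this example, and the 0.1-second timeout is only there to keep the demo fast.

```python
import asyncio


async def slow_step(context: dict) -> dict:
    """Hypothetical agent step that hangs far longer than it should."""
    await asyncio.sleep(5)
    return {"done": True}


async def run_step(step, context: dict, timeout: float) -> dict:
    """Run one step; turn an overrun into a result the caller can inspect."""
    try:
        return await asyncio.wait_for(step(context), timeout=timeout)
    except asyncio.TimeoutError:
        # wait_for cancels the pending step before raising
        return {"done": False, "error": f"step exceeded {timeout}s"}


outcome = asyncio.run(run_step(slow_step, {"goal": "demo"}, timeout=0.1))
print(outcome)  # {'done': False, 'error': 'step exceeded 0.1s'}
```

One caveat: cancellation only takes effect at an await point. A step that blocks inside a synchronous call (for example, a non-async HTTP client) will not be interrupted by this timeout.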

But there was still one problem. What if every step finishes well within the time limits, but the agent just keeps repeating the same action with minor variations?

Layer 3: Loop Detection

The third layer detects when the agent revisits the same state or action:

from collections import deque
from hashlib import md5
import json

class LoopDetector:
    """Detect when agent is revisiting same states/actions"""

    def __init__(self, window_size: int = 5):
        self.state_history = deque(maxlen=window_size)
        self.action_history = deque(maxlen=window_size)

    def check_for_loop(self, current_state: dict, current_action: str) -> tuple[bool, str]:
        """Returns: (is_loop_detected, diagnostic_message)"""

        # Create hash of current state
        state_hash = md5(json.dumps(current_state, sort_keys=True).encode()).hexdigest()

        # Check for exact state repetition
        if state_hash in self.state_history:
            return True, "Exact state repeated"

        # Check for action repetition (3+ times)
        if list(self.action_history).count(current_action) >= 3:
            return True, f"Action '{current_action}' repeated 3+ times"

        # Update histories
        self.state_history.append(state_hash)
        self.action_history.append(current_action)

        return False, ""

This detects when my agent keeps trying the same thing. Let me show you how I use it:

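class SafeAgent:
    # _decide_action, _execute_action, _update, _is_goal_reached and
    # _generate_partial_result are agent-specific and not shown here.
    def __init__(self, max_iterations: int = 20):
        self.max_iterations = max_iterations

    async def run(self, goal: str) -> dict:
        # Create fresh guards per run so limits don't carry over between calls
        self.timeout_manager = AgentTimeoutManager()
        self.loop_detector = LoopDetector()
        self.iteration_count = 0
        context = {'goal': goal}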

        while True:
            try:
                # Check iteration limit
                self.iteration_count += 1
                if self.iteration_count > self.max_iterations:
                    return self._generate_partial_result("Max iterations reached")

                # Get next action with timeout
                action = await self.timeout_manager.run_with_timeout(
                    self._decide_action,
                    context
                )

                # Check for loops
                is_loop, message = self.loop_detector.check_for_loop(
                    context,
                    action['name']
                )
                if is_loop:
                    return self._generate_partial_result(f"Loop detected: {message}")

                # Execute action
                result = await self.timeout_manager.run_with_timeout(
                    self._execute_action,
                    {'action': action, 'context': context}
                )

                context = self._update(context, result)

                if self._is_goal_reached(context):
                    return {'success': True, 'result': context}

            except TimeoutError as e:
                return self._generate_partial_result(str(e))
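
SafeAgent relies on helper methods that this post doesn't show, so it won't run on its own. For trying the three layers end to end, here is a self-contained toy version. It is a sketch under assumptions, not the real agent: stub_search and run_guarded are hypothetical names, and the "model" always picks the same action.

```python
import asyncio
import time
from collections import Counter


async def stub_search(query: str) -> str:
    """Hypothetical tool that never finds anything."""
    await asyncio.sleep(0)
    return "No results found"


async def run_guarded(goal: str, max_iterations: int = 20,
                      total_timeout: float = 5.0, step_timeout: float = 1.0,
                      max_repeats: int = 3) -> dict:
    started = time.monotonic()
    attempts = Counter()

    def stopped(reason: str, executed: int) -> dict:
        return {"success": False, "reason": reason, "steps_executed": executed}

    for step in range(max_iterations):                   # Layer 1: hard iteration cap
        if time.monotonic() - started > total_timeout:   # Layer 2: total time budget
            return stopped("total timeout", step)

        action = ("search", goal)  # A real agent would ask the model for this
        attempts[action] += 1
        if attempts[action] > max_repeats:               # Layer 3: repeated action
            return stopped(f"loop detected: '{action[0]}' repeated", step)

        try:
            # Layer 2 again: per-step timeout
            result = await asyncio.wait_for(stub_search(goal), timeout=step_timeout)
        except asyncio.TimeoutError:
            return stopped("step timeout", step)

        if result != "No results found":
            return {"success": True, "result": result, "steps_executed": step + 1}

    return stopped("max iterations reached", max_iterations)


outcome = asyncio.run(run_guarded("impossible thing"))
print(outcome)
# {'success': False, 'reason': "loop detected: 'search' repeated", 'steps_executed': 3}
```

The repeat check runs before the action executes, so the fourth identical search is refused without spending another API call.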

Now my agent has three layers of protection. Let me test it:

python safe_agent.py "Find the best pizza in Tokyo"

# Output
[00:00:15] Action: search("best pizza tokyo")
[00:00:16] Action: read_reviews(3 results)
[00:00:18] Action: compare_ratings()
[00:00:20] Success: Found top-rated pizza place
Total iterations: 4
Total time: 5.2 seconds

And when something goes wrong:

python safe_agent.py "Find impossible thing"

# Output
[00:00:15] Action: search("impossible thing")
[00:00:16] Action: search("impossible thing")
[00:00:17] Action: search("impossible thing")
[00:00:18] Loop detected: Action 'search' repeated 3+ times
Returning partial result...
Total iterations: 3
Total time: 3.1 seconds
Cost saved: ~$50
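
Counting by action name alone is coarse: three legitimate searches for different things would trip the detector too. One possible refinement is to key the count on the action name plus its normalized arguments. The sketch below assumes actions arrive as a name and a dict of arguments; action_fingerprint and ActionRepeatCounter are hypothetical names, not part of any library.

```python
import json
from collections import Counter
from hashlib import sha256


def action_fingerprint(name: str, args: dict) -> str:
    """Hash an action by its name plus case- and whitespace-normalized arguments."""
    normalized = {
        key: " ".join(value.lower().split()) if isinstance(value, str) else value
        for key, value in args.items()
    }
    payload = json.dumps({"name": name, "args": normalized}, sort_keys=True, default=str)
    return sha256(payload.encode()).hexdigest()


class ActionRepeatCounter:
    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts = Counter()

    def is_repeat(self, name: str, args: dict) -> bool:
        """Record the action and report whether it has now exceeded max_repeats."""
        key = action_fingerprint(name, args)
        self.counts[key] += 1
        return self.counts[key] > self.max_repeats


counter = ActionRepeatCounter(max_repeats=2)
print(counter.is_repeat("search", {"query": "weather tokyo"}))    # False
print(counter.is_repeat("search", {"query": "Weather  Tokyo "}))  # False (same fingerprint, 2nd time)
print(counter.is_repeat("search", {"query": "WEATHER TOKYO"}))    # True  (3rd time, over the limit)
print(counter.is_repeat("search", {"query": "weather osaka"}))    # False (different query)
```

This only catches differences in case and spacing. A query that is genuinely reworded produces a new fingerprint, so the iteration and time limits still need to stay in place as the backstop.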

Common Mistakes I Made

Mistake 1: Only Using max_iterations

# WRONG: Only iteration limit
agent_executor = AgentExecutor(agent=agent, tools=tools, max_iterations=10)
# Problem: Each iteration can take 5 minutes = 50 minutes total!

Mistake 2: No Progress Tracking

# WRONG: Binary goal check
while not goal_reached:
    # Agent might make zero progress for 100 iterations
    pass

# RIGHT: Track progress
class ProgressTracker:
    def __init__(self, patience: int = 5):
        self.no_progress_count = 0
        self.patience = patience

    def check(self, old_state, new_state) -> bool:
        if old_state == new_state:
            self.no_progress_count += 1
            if self.no_progress_count >= self.patience:
                return False  # No progress, should stop
        else:
            self.no_progress_count = 0
        return True

Mistake 3: Letting Context Grow Forever

# WRONG: Unbounded context
context += f"\nAction: {action}\nResult: {result}"
# Context grows until you hit token limits and costs explode

# RIGHT: Sliding window with summarization
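from collections import deque

class ContextManager:
    def __init__(self, max_messages: int = 20):
        self.max_messages = max_messages  # Stored so add() can check against it
        self.messages = deque(maxlen=max_messages)

    def add(self, message: str):
        self.messages.append(message)
        if len(self.messages) == self.max_messages:
            # Summarize old messages, keep recent ones
            # (_summarize_old is agent-specific and not shown here)
            self._summarize_old()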
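
The summarization step itself is not shown above, so here is one way it could look. This is a sketch under assumptions: SummarizingContext, keep_recent, and truncate_summary are hypothetical names, and the placeholder summarizer just truncates each message where a real agent would probably call a model.

```python
from collections import deque
from typing import Callable


def truncate_summary(messages: list[str]) -> str:
    """Placeholder summarizer: keeps the first 40 characters of each message."""
    return "Summary: " + " | ".join(m[:40] for m in messages)


class SummarizingContext:
    def __init__(self, max_messages: int = 20, keep_recent: int = 5,
                 summarize: Callable[[list[str]], str] = truncate_summary):
        if not 0 < keep_recent < max_messages:
            raise ValueError("keep_recent must be between 1 and max_messages - 1")
        self.max_messages = max_messages
        self.keep_recent = keep_recent
        self.summarize = summarize
        # No maxlen here: a bounded deque would silently drop the oldest
        # message before it could be summarized.
        self.messages: deque[str] = deque()

    def add(self, message: str) -> None:
        self.messages.append(message)
        if len(self.messages) > self.max_messages:
            self._summarize_old()

    def _summarize_old(self) -> None:
        """Collapse everything except the most recent messages into one summary."""
        items = list(self.messages)
        old, recent = items[:-self.keep_recent], items[-self.keep_recent:]
        self.messages = deque([self.summarize(old), *recent])


ctx = SummarizingContext(max_messages=4, keep_recent=2)
for i in range(1, 6):
    ctx.add(f"step {i}")
print(list(ctx.messages))
# ['Summary: step 1 | step 2 | step 3', 'step 4', 'step 5']
```

Passing the summarizer in as a function keeps the windowing logic testable without any API calls.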

Quick Reference

Here’s the minimal code I use now for every agent:

from langchain.agents import AgentExecutor

# Minimal safeguards - always include these
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    max_iterations=15,           # 1. Iteration limit
    max_execution_time=300,      # 2. Time limit
    early_stopping_method="generate",  # 3. Graceful exit (support varies by agent type and version)
    handle_parsing_errors=True
)

# For custom agents, use this pattern:
import time

class SafeLoop:
    def __init__(self, max_iter=20, timeout=300):
        self.max_iter = max_iter
        self.timeout = timeout
        self.iterations = 0
        self.seen_actions = set()
        self.started = time.monotonic()

    def should_continue(self, action: str) -> bool:
        self.iterations += 1
        if self.iterations > self.max_iter:
            return False
        if time.monotonic() - self.started > self.timeout:
            return False  # Out of time
        if action in self.seen_actions:
            return False  # Already tried this
        self.seen_actions.add(action)
        return True

Summary

In this post, I showed how to prevent AI agents from getting stuck in infinite loops. The key is implementing three safeguards:

Iteration limits - Stop after N loops (use max_iterations)
Timeouts - Stop after X seconds (use max_execution_time)
Loop detection - Detect repeated states and actions

Without these safeguards, a single runaway agent can cost hundreds of dollars and accidentally DDoS cloud providers. With them, your agent fails gracefully and returns a partial result.

Start with LangChain’s built-in max_iterations and max_execution_time. Add custom loop detection for production systems. The few lines of defensive code will save you debugging time and API costs.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can email me.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit: OpenClaw DDOS Thread
👨‍💻 LangChain AgentExecutor Documentation
👨‍💻 Circuit Breaker Pattern

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!