How Do You Get an AI Agent to Run for Several Hours?

Mar 22, 2026

Problem

Excited about the possibilities, I asked Codex to build a full-stack app. I typed:

Build me a full-stack app with user authentication, a dashboard, and API endpoints.

Twenty minutes later, it stopped. I got some boilerplate code, a basic login page, and nothing else. No tests. No error handling. No deployment config.

I wanted hours of work, but I got minutes.

So I asked on Reddit: “Has anyone tried running Codex for several hours?” The answers showed me what I was doing wrong.

What happened?

My prompt was too vague. I gave Codex a destination (“build an app”) but no map. It did what it could in a reasonable time, then stopped.

A highly upvoted comment explained the difference:

20-minute request: "Build me xyz app to do abc thing"

10-hour request: "Build this app by completing all 30 of these
sequential tasks, and use a Planner, Builder, and QA subagent
one at a time to ensure each task is of maximally high quality.
Be patient with each subagent, and be precise in your instructions
to each one. Do not stop orchestrating until all 30 tasks have
been completed."

The difference is clear:

20-minute prompt: High-level goal, no structure
10-hour prompt: Decomposed tasks, subagent pattern, exit criteria

I needed to change my approach.

How to solve it?

I broke down the solution into three components: decomposition, subagent pattern, and exit criteria.

Component 1: Task Decomposition

Instead of handing over one big task, I split the work into 20-50 small tasks. Each task should be completable in 10-30 minutes.

Task 1: Create project structure with folders for src, tests, config
Task 2: Set up package.json with required dependencies
Task 3: Create database schema for users table
Task 4: Write migration script for user table
Task 5: Create User model with CRUD methods
Task 6: Write unit tests for User model
Task 7: Create authentication middleware
Task 8: Write tests for authentication middleware
Task 9: Create login endpoint /api/auth/login
Task 10: Write tests for login endpoint
... (continue to Task 30)

The key is that each task is atomic, testable, and has a clear definition of done.
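To make "a clear definition of done" concrete, here is a minimal Python sketch of a task list kept as data, so you can spot tasks that lack exit criteria before you write the prompt. The `Task` class and `find_underspecified` helper are hypothetical names of my own, not part of Codex or any library.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One atomic unit of work with its own definition of done."""
    number: int
    title: str
    exit_criteria: list[str] = field(default_factory=list)


def find_underspecified(tasks: list[Task]) -> list[int]:
    """Return the numbers of tasks that have no exit criteria yet."""
    return [task.number for task in tasks if not task.exit_criteria]


# Illustrative subset of a task list, not the full 30 tasks.
tasks = [
    Task(1, "Project setup", ["npm install runs without errors"]),
    Task(5, "Create User model with CRUD methods",
         ["User.create() inserts a user and returns user ID",
          "User.findById() returns user object or null"]),
    Task(6, "Write unit tests for User model"),  # definition of done still missing
]

print(find_underspecified(tasks))  # [6]
```

Any task number this prints is a task the agent would have to judge "done" on its own.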

Component 2: Subagent Pattern

I use three subagents in sequence:

+----------+     +----------+     +----------+
|  Planner | --> |  Builder | --> |    QA    |
+----------+     +----------+     +----------+
     |                |                |
     v                v                v
   Plan task      Implement task    Verify task
   Break down      Write code       Run tests
   Define specs    Handle errors    Check quality

Planner reads the task and creates a detailed plan with:

Files to create or modify
Functions to implement
Expected behavior
Test cases to write

Builder implements the plan:

Writes the code
Handles errors
Follows the specifications

QA verifies the implementation:

Runs all tests
Checks edge cases
Validates against the plan
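The hand-off between the three roles can be sketched as a small pipeline. This is only an illustration under my own assumptions: in practice each role is a subagent call, while here they are stub functions, and `run_task` and `QAReport` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QAReport:
    passed: bool
    notes: str


def run_task(task: str,
             planner: Callable[[str], str],
             builder: Callable[[str], str],
             qa: Callable[[str, str], QAReport]) -> QAReport:
    """Run one task through Planner, then Builder, then QA, in that order."""
    plan = planner(task)      # Planner: turn the task into a concrete plan
    result = builder(plan)    # Builder: implement exactly that plan
    return qa(plan, result)   # QA: validate the result against the plan


# Stubs standing in for real subagent calls (assumptions for illustration).
def stub_planner(task: str) -> str:
    return f"PLAN for: {task}"


def stub_builder(plan: str) -> str:
    return f"CODE implementing [{plan}]"


def stub_qa(plan: str, result: str) -> QAReport:
    # A real QA step would run tests; this stub only checks the plan was used.
    return QAReport(passed=plan in result, notes="checked result against plan")


report = run_task("Create User model with CRUD methods",
                  stub_planner, stub_builder, stub_qa)
print(report.passed)  # True
```

The point of the structure is that QA receives the plan as well as the result, so it verifies against the specification rather than against whatever the Builder happened to produce.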

Component 3: Exit Criteria

Each task has explicit success conditions:

Task 5: Create User model with CRUD methods

Exit Criteria:
1. User.create() inserts a user and returns user ID
2. User.findById() returns user object or null
3. User.update() modifies user and returns true
4. User.delete() removes user and returns true
5. All methods have error handling for invalid input
6. Unit tests cover all methods with >80% coverage
7. No linting errors in the file

DO NOT proceed to Task 6 until ALL criteria pass.

The “do not proceed” instruction is critical. Without it, the agent might move on to the next task while tests are still failing.
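The gating rule is easy to state in code. The sketch below assumes each exit criterion can be expressed as a function returning True or False; the names `failed_criteria` and `run_in_order` are hypothetical, and the lambdas stand in for real checks such as running a test suite.

```python
from typing import Callable

Check = Callable[[], bool]


def failed_criteria(criteria: dict[str, Check]) -> list[str]:
    """Return the names of all criteria that do not pass."""
    return [name for name, check in criteria.items() if not check()]


def run_in_order(tasks: list[tuple[str, dict[str, Check]]]) -> tuple[list[str], list[str]]:
    """Advance task by task and stop at the first task with a failing criterion.

    Returns (names of completed tasks, failing criteria of the blocking task).
    """
    completed: list[str] = []
    for name, criteria in tasks:
        failures = failed_criteria(criteria)
        if failures:
            return completed, failures  # do not proceed to the next task
        completed.append(name)
    return completed, []


# Placeholder checks; real ones would run the migration, tests, linter, etc.
tasks = [
    ("Task 4", {"migration runs": lambda: True}),
    ("Task 5", {"unit tests pass": lambda: True,
                "coverage above 80%": lambda: False,
                "no linting errors": lambda: True}),
    ("Task 6", {"unit tests pass": lambda: True}),
]

print(run_in_order(tasks))  # (['Task 4'], ['coverage above 80%'])
```

Task 6 is never attempted here, even though its own checks would pass, because Task 5 still has an open criterion.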

The result

Here’s my prompt now:

Build a full-stack user management app by completing these 30 tasks.

For each task, follow this workflow:
1. PLANNER: Read task, create detailed implementation plan
2. BUILDER: Implement the plan
3. QA: Run tests and verify all exit criteria pass

Task 1: Project setup
- Create folder structure: src/, tests/, config/
- Initialize package.json with express, pg, jest
- Create .env.example with DATABASE_URL, PORT
Exit criteria: npm install runs without errors

Task 2: Database configuration
- Create config/database.js
- Connect to PostgreSQL using pg library
- Export connection pool
Exit criteria: Connection test passes

Task 3: User table schema
- Create migrations/001_create_users.sql
- Columns: id, email, password_hash, created_at, updated_at
- Add unique constraint on email
Exit criteria: Migration runs successfully

[... continue for all 30 tasks ...]

Task 30: Deployment config
- Create Dockerfile with node:20-alpine
- Create docker-compose.yml with app and postgres
- Add health check endpoint
Exit criteria: docker-compose up runs the app

IMPORTANT: Do not stop until all 30 tasks are complete.
If a task fails, debug and retry until it passes.
Only move to next task when all exit criteria are verified.

This prompt gives the agent:

Clear structure (30 tasks)
A workflow to follow (Planner-Builder-QA)
Explicit stop condition (“do not stop until all 30 tasks are complete”)
Retry behavior (“if a task fails, debug and retry”)
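Writing 30 tasks by hand in one prompt is error-prone, so one option is to generate the prompt from a task list. The following is a minimal sketch; `render_prompt` and the tuple layout are my own assumptions, and the wording simply mirrors the prompt above.

```python
def render_prompt(goal, tasks):
    """Render a goal and a task list into one long-running prompt.

    `tasks` is a list of (title, steps, exit_criteria) tuples.
    """
    total = len(tasks)
    lines = [
        f"{goal} by completing these {total} tasks.",
        "",
        "For each task, follow this workflow:",
        "1. PLANNER: Read task, create detailed implementation plan",
        "2. BUILDER: Implement the plan",
        "3. QA: Run tests and verify all exit criteria pass",
        "",
    ]
    for number, (title, steps, exit_criteria) in enumerate(tasks, start=1):
        lines.append(f"Task {number}: {title}")
        lines.extend(f"- {step}" for step in steps)
        lines.append(f"Exit criteria: {exit_criteria}")
        lines.append("")
    lines += [
        f"IMPORTANT: Do not stop until all {total} tasks are complete.",
        "If a task fails, debug and retry until it passes.",
        "Only move to next task when all exit criteria are verified.",
    ]
    return "\n".join(lines)


prompt = render_prompt(
    "Build a full-stack user management app",
    [
        ("Project setup",
         ["Create folder structure: src/, tests/, config/",
          "Initialize package.json with express, pg, jest"],
         "npm install runs without errors"),
        ("Database configuration",
         ["Create config/database.js", "Export connection pool"],
         "Connection test passes"),
    ],
)
print(prompt)
```

Because the task count in the header and in the stop condition both come from `len(tasks)`, they cannot drift apart when you add or remove tasks.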

Why this works

Token efficiency: The agent doesn’t waste tokens figuring out what to do next. Each task is pre-defined.

Quality assurance: The QA phase catches issues before the agent moves on, so defects are less likely to pile up across tasks.

Autonomy: With clear exit criteria, the agent knows when a task is truly done. No guessing.

Reproducibility: Anyone can run the same prompt and should get broadly similar results, although model output still varies from run to run.

The Reddit thread had another approach worth mentioning:

My workflow:
1. Create detailed PRD (Product Requirements Document)
2. Create TECH_SPEC.md with architecture decisions
3. Create TODO.md with numbered tasks
4. Ask Codex to complete TODOs one by one

This works too. The key is that you provide the structure up front, so the agent can focus on execution.
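If you drive the agent from a TODO.md, a small helper can tell you which task to hand over next. The commenter did not describe their file format, so this sketch assumes numbered lines with `[ ]` and `[x]` checkboxes; `next_open_task` is a hypothetical helper.

```python
import re

# Assumed line format: "<number>. [ ] description" or "<number>. [x] description"
TODO_LINE = re.compile(r"^\s*(\d+)\.\s*\[([ xX])\]\s*(.+?)\s*$", re.MULTILINE)


def next_open_task(todo_text):
    """Return (number, description) of the first unchecked task, or None."""
    for number, mark, description in TODO_LINE.findall(todo_text):
        if mark == " ":
            return int(number), description
    return None


todo = """\
1. [x] Set up database
2. [ ] Create models
3. [ ] Build API
"""

print(next_open_task(todo))  # (2, 'Create models')
```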

Common mistakes

I made these mistakes before I learned the right approach:

Mistake 1: Vague scope

Build me an app that does user management.

No task count. No exit criteria. No structure. Result: 20 minutes of work.

Mistake 2: No verification

Complete these tasks:
1. Set up database
2. Create models
3. Build API

Each task is done when… what? Without tests or criteria, the agent decides arbitrarily.

Mistake 3: Single-phase thinking

Write all the code for a user management system.

No separation between planning, coding, and testing. Quality suffers.

Mistake 4: Missing “continue” instruction

Complete tasks 1-30.

After task 1 fails, should it retry? Move to task 2? Stop? Without explicit instruction, it might do any of these.
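An explicit failure policy removes that ambiguity. My prompt says "retry until it passes"; the sketch below deliberately swaps in a capped retry instead, which is my own assumption to avoid an endless loop. `run_with_retry` is a hypothetical name, and the lambda stands in for "debug, rebuild, and rerun QA".

```python
from typing import Callable, Optional


def run_with_retry(attempt: Callable[[int], bool], max_attempts: int) -> Optional[int]:
    """Call attempt(n) for n = 1..max_attempts until it returns True.

    Returns the attempt number that succeeded, or None if every attempt failed,
    in which case the caller should stop and report instead of moving on.
    """
    for n in range(1, max_attempts + 1):
        if attempt(n):
            return n
    return None


# Stand-in for a task whose QA only passes on the third try.
print(run_with_retry(lambda n: n >= 3, max_attempts=5))  # 3
```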

Summary

In this post, I showed how to structure prompts for long-running AI coding tasks. The key points are decomposing the work into 20-50 sequential tasks, using the Planner-Builder-QA pattern, and defining explicit exit criteria.

A 20-minute prompt like “build an app” gives you 20 minutes of work. A 10-hour prompt with 30 tasks, subagent workflows, and exit criteria gives you 10 hours of work.

The difference is not magic. It is structure. The agent needs to know what to do, how to do it, and when it is done.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Reddit Discussion: Running Codex for several hours

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!