How Do You Get an AI Agent to Run for Several Hours?

Problem

I tried asking Codex to build a full-stack app. I got excited about the possibilities. I typed:

before-prompt.txt
Build me a full-stack app with user authentication, a dashboard, and API endpoints.

Twenty minutes later, it stopped. I got some boilerplate code, a basic login page, and nothing else. No tests. No error handling. No deployment config.

I wanted hours of work, but I got minutes.

So I asked on Reddit: “Has anyone tried running Codex for several hours?” The answers showed me what I was doing wrong.

What happened?

My prompt was too vague. I gave Codex a destination (“build an app”) but no map. It did what it could in a reasonable time, then stopped.

A highly upvoted comment explained the difference:

reddit-insight.txt
20-minute request: "Build me xyz app to do abc thing"
10-hour request: "Build this app by completing all 30 of these
sequential tasks, and use a Planner, Builder, and QA subagent
one at a time to ensure each task is of maximally high quality.
Be patient with each subagent, and be precise in your instructions
to each one. Do not stop orchestrating until all 30 tasks have
been completed."

The difference is clear:

  • 20-minute prompt: High-level goal, no structure
  • 10-hour prompt: Decomposed tasks, subagent pattern, exit criteria

I needed to change my approach.

How to solve it?

I broke down the solution into three components: decomposition, subagent pattern, and exit criteria.

Component 1: Task Decomposition

Instead of handing over one big task, I split the work into 20-50 small tasks. Each task should be completable in 10-30 minutes.

task-list-example.txt
Task 1: Create project structure with folders for src, tests, config
Task 2: Set up package.json with required dependencies
Task 3: Create database schema for users table
Task 4: Write migration script for user table
Task 5: Create User model with CRUD methods
Task 6: Write unit tests for User model
Task 7: Create authentication middleware
Task 8: Write tests for authentication middleware
Task 9: Create login endpoint /api/auth/login
Task 10: Write tests for login endpoint
... (continue to Task 30)

The key is: each task is atomic, testable, and has a clear definition of done.
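One way to keep a task list honest is to store it as data and check it before handing it to the agent. The sketch below is only an illustration: the `Task` fields, the `validate_tasks` helper, and the idea of attaching a time estimate to each task are my assumptions, not something Codex requires.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One atomic unit of work for the agent (hypothetical structure)."""
    number: int
    title: str
    estimate_minutes: int
    exit_criteria: list[str] = field(default_factory=list)


def validate_tasks(tasks: list[Task]) -> list[str]:
    """Return human-readable problems; an empty list means the plan looks sound."""
    problems = []
    # The 20-50 and 10-30 ranges come from the guideline in the text above.
    if not 20 <= len(tasks) <= 50:
        problems.append(f"expected 20-50 tasks, got {len(tasks)}")
    for position, task in enumerate(tasks, start=1):
        if task.number != position:
            problems.append(f"task at position {position} is numbered {task.number}")
        if not 10 <= task.estimate_minutes <= 30:
            problems.append(
                f"Task {task.number}: estimate of {task.estimate_minutes} min is outside 10-30"
            )
        if not task.exit_criteria:
            problems.append(f"Task {task.number}: no definition of done")
    return problems


if __name__ == "__main__":
    draft = [
        Task(1, "Create project structure with folders for src, tests, config", 10,
             ["src/, tests/, and config/ exist"]),
        Task(2, "Set up package.json with required dependencies", 45),
    ]
    for problem in validate_tasks(draft):
        print(problem)
```

Running the check on a draft list flags tasks that are too large or have no definition of done, which are the ones most likely to stall a long run.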

Component 2: Subagent Pattern

I use three subagents in sequence:

+----------+      +----------+      +----------+
| Planner  | -->  | Builder  | -->  |    QA    |
+----------+      +----------+      +----------+
     |                 |                 |
     v                 v                 v
 Plan task       Implement task     Verify task
 Break down      Write code         Run tests
 Define specs    Handle errors      Check quality

Planner reads the task and creates a detailed plan with:

  • Files to create or modify
  • Functions to implement
  • Expected behavior
  • Test cases to write

Builder implements the plan:

  • Writes the code
  • Handles errors
  • Follows the specifications

QA verifies the implementation:

  • Runs all tests
  • Checks edge cases
  • Validates against the plan
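The Planner, Builder, and QA hand-off described above can be sketched as three plain functions called in sequence. This is a minimal sketch under assumptions: in a real run each role would be a prompt or subagent call, and the function signatures, the `QAReport` type, and the `max_attempts` cap are all hypothetical names I introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class QAReport:
    passed: bool
    notes: str = ""


# Hypothetical role signatures; none of these are a real Codex API.
Planner = Callable[[str], str]        # task -> plan
Builder = Callable[[str, str], str]   # (plan, QA feedback) -> implementation
QA = Callable[[str, str], QAReport]   # (plan, implementation) -> report


def run_task(task: str, planner: Planner, builder: Builder, qa: QA,
             max_attempts: int = 3) -> str:
    """Run one task through Planner -> Builder -> QA, one role at a time."""
    plan = planner(task)
    feedback = ""
    for _ in range(max_attempts):
        implementation = builder(plan, feedback)
        report = qa(plan, implementation)
        if report.passed:
            return implementation
        feedback = report.notes  # hand the QA findings back to the Builder
    raise RuntimeError(f"{task!r} still failing QA after {max_attempts} attempts")
```

The plan is produced once and reused, so the Builder and QA are always judged against the same specification. Capping the attempts is my own design choice so that the sketch cannot loop forever.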

Component 3: Exit Criteria

Each task has explicit success conditions:

exit-criteria-example.txt
Task 5: Create User model with CRUD methods
Exit Criteria:
1. User.create() inserts a user and returns user ID
2. User.findById() returns user object or null
3. User.update() modifies user and returns true
4. User.delete() removes user and returns true
5. All methods have error handling for invalid input
6. Unit tests cover all methods with >80% coverage
7. No linting errors in the file
DO NOT proceed to Task 6 until ALL criteria pass.

The “do not proceed” instruction is critical. Without it, the agent might skip to the next task when tests fail.
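The same gate can be expressed in code: every criterion is a named check, and the next task is unlocked only when none of them fail. This is a sketch under assumptions. The `Criterion` shape and both helper names are hypothetical, and the coverage and lint numbers in the demo are made up for the example.

```python
from typing import Callable

# A criterion is a description plus a zero-argument check (hypothetical shape).
Criterion = tuple[str, Callable[[], bool]]


def failing_criteria(criteria: list[Criterion]) -> list[str]:
    """Evaluate every criterion and return the descriptions of those that fail."""
    failures = []
    for description, check in criteria:
        try:
            ok = check()
        except Exception as exc:  # a crashing check counts as a failure
            ok = False
            description = f"{description} (check raised {type(exc).__name__})"
        if not ok:
            failures.append(description)
    return failures


def may_proceed(criteria: list[Criterion]) -> bool:
    """True only when ALL criteria pass."""
    return not failing_criteria(criteria)


if __name__ == "__main__":
    coverage_percent = 76.0  # made-up measurement for the demo
    lint_errors = 0          # made-up measurement for the demo
    task_5 = [
        ("Unit tests cover all methods with >80% coverage", lambda: coverage_percent > 80),
        ("No linting errors in the file", lambda: lint_errors == 0),
    ]
    print(failing_criteria(task_5))
    print(may_proceed(task_5))
```

All criteria are evaluated even after the first failure, so the agent (or you) sees the full list of what still needs fixing instead of one item at a time.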

The result

Here’s my prompt now:

after-prompt.txt
Build a full-stack user management app by completing these 30 tasks.
For each task, follow this workflow:
1. PLANNER: Read task, create detailed implementation plan
2. BUILDER: Implement the plan
3. QA: Run tests and verify all exit criteria pass
Task 1: Project setup
- Create folder structure: src/, tests/, config/
- Initialize package.json with express, pg, jest
- Create .env.example with DATABASE_URL, PORT
Exit criteria: npm install runs without errors
Task 2: Database configuration
- Create config/database.js
- Connect to PostgreSQL using pg library
- Export connection pool
Exit criteria: Connection test passes
Task 3: User table schema
- Create migrations/001_create_users.sql
- Columns: id, email, password_hash, created_at, updated_at
- Add unique constraint on email
Exit criteria: Migration runs successfully
[... continue for all 30 tasks ...]
Task 30: Deployment config
- Create Dockerfile with node:20-alpine
- Create docker-compose.yml with app and postgres
- Add health check endpoint
Exit criteria: docker-compose up runs the app
IMPORTANT: Do not stop until all 30 tasks are complete.
If a task fails, debug and retry until it passes.
Only move to next task when all exit criteria are verified.

This prompt gives the agent:

  • Clear structure (30 tasks)
  • A workflow to follow (Planner-Builder-QA)
  • Explicit stop condition (“do not stop until all 30 tasks are complete”)
  • Retry behavior (“if a task fails, debug and retry”)
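Writing 30 task blocks by hand gets tedious, so a prompt with this shape can be generated from a task list. The sketch below is one possible way to do that, assuming a hypothetical `Task` record and `render_prompt` helper; the wording of the workflow and closing instructions is copied from the prompt above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical record for one task block in the prompt."""
    title: str
    steps: list[str]
    exit_criteria: str


WORKFLOW = """For each task, follow this workflow:
1. PLANNER: Read task, create detailed implementation plan
2. BUILDER: Implement the plan
3. QA: Run tests and verify all exit criteria pass"""


def render_prompt(goal: str, tasks: list[Task]) -> str:
    """Assemble goal, workflow, numbered tasks, and stop/retry instructions."""
    total = len(tasks)
    lines = [f"{goal} by completing these {total} tasks.", WORKFLOW]
    for number, task in enumerate(tasks, start=1):
        lines.append(f"Task {number}: {task.title}")
        lines.extend(f"- {step}" for step in task.steps)
        lines.append(f"Exit criteria: {task.exit_criteria}")
    lines.append(f"IMPORTANT: Do not stop until all {total} tasks are complete.")
    lines.append("If a task fails, debug and retry until it passes.")
    lines.append("Only move to next task when all exit criteria are verified.")
    return "\n".join(lines)


if __name__ == "__main__":
    tasks = [
        Task("Project setup",
             ["Create folder structure: src/, tests/, config/",
              "Initialize package.json with express, pg, jest"],
             "npm install runs without errors"),
    ]
    print(render_prompt("Build a full-stack user management app", tasks))
```

Because the task count is computed from the list, the number in the opening line and the number in the stop condition cannot drift apart when tasks are added or removed.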

Why this works

Token efficiency: The agent doesn’t waste tokens figuring out what to do next. Each task is pre-defined.

Quality assurance: The QA phase catches issues before moving on, so defects are less likely to pile up as technical debt.

Autonomy: With clear exit criteria, the agent knows when a task is truly done. No guessing.

Reproducibility: Anyone can run the same prompt and get broadly similar results, although model output is not deterministic, so runs will not be identical.

The Reddit thread had another approach worth mentioning:

alternative-workflow.txt
My workflow:
1. Create detailed PRD (Product Requirements Document)
2. Create TECH_SPEC.md with architecture decisions
3. Create TODO.md with numbered tasks
4. Ask Codex to complete TODOs one by one

This works too. The key is: you provide the structure upfront, so the agent can focus on execution.
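If TODO.md is kept as a checklist, picking the next task can be automated. The commenter did not say what their file looks like, so the format here is purely my assumption: one line per task, written as `- [ ] 1. Title` when open and `- [x] 1. Title` when done. The `next_open_task` helper is hypothetical.

```python
import re

# Assumed line format: "- [ ] 3. Build API" (open) or "- [x] 3. Build API" (done).
TODO_LINE = re.compile(r"^\s*- \[(?P<mark>[ xX])\] (?P<number>\d+)\. (?P<title>.+)$")


def next_open_task(todo_text: str):
    """Return (number, title) of the first unchecked task, or None if all are done."""
    for line in todo_text.splitlines():
        match = TODO_LINE.match(line)
        if match and match["mark"] == " ":
            return int(match["number"]), match["title"].strip()
    return None


if __name__ == "__main__":
    todo = """# TODO
- [x] 1. Set up database
- [ ] 2. Create models
- [ ] 3. Build API
"""
    print(next_open_task(todo))
```

Lines that do not match the assumed format, such as headings or notes, are simply ignored.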

Common mistakes

I made these mistakes before I learned the right approach:

Mistake 1: Vague scope

mistake-vague.txt
Build me an app that does user management.

No task count. No exit criteria. No structure. Result: 20 minutes of work.

Mistake 2: No verification

mistake-no-verify.txt
Complete these tasks:
1. Set up database
2. Create models
3. Build API

Each task is done when… what? Without tests or criteria, the agent decides arbitrarily.

Mistake 3: Single-phase thinking

mistake-single-phase.txt
Write all the code for a user management system.

No separation between planning, coding, and testing. Quality suffers.

Mistake 4: Missing “continue” instruction

mistake-no-continue.txt
Complete tasks 1-30.

After task 1 fails, should it retry? Move to task 2? Stop? Without explicit instruction, it might do any of these.
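The three possible reactions to a failed task (retry, move on, stop) can be made explicit as a policy. This is a minimal sketch with hypothetical names (`OnFailure`, `run_tasks`, `attempt`). It also bounds the number of retries, which is my own choice rather than something the prompts above specify.

```python
from enum import Enum
from typing import Callable


class OnFailure(Enum):
    RETRY = "retry"  # try the same task again, up to max_retries more times
    SKIP = "skip"    # record the failure and move to the next task
    STOP = "stop"    # halt the whole run


def run_tasks(tasks: list[str], attempt: Callable[[str], bool],
              on_failure: OnFailure, max_retries: int = 2) -> dict[str, str]:
    """Run tasks in order; `attempt` returns True when a task's criteria pass."""
    status: dict[str, str] = {}
    for task in tasks:
        tries = 1 + (max_retries if on_failure is OnFailure.RETRY else 0)
        # any() stops at the first successful attempt.
        passed = any(attempt(task) for _ in range(tries))
        status[task] = "done" if passed else "failed"
        if not passed and on_failure is not OnFailure.SKIP:
            break  # STOP, or RETRY with its retries exhausted
    return status
```

Whichever policy you prefer, the point is the same as in the prompt: state it up front so the agent does not have to guess.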

Summary

In this post, I showed how to structure prompts for long-running AI coding tasks. The key points are decomposing the work into 20-50 sequential tasks, using the Planner-Builder-QA pattern, and defining explicit exit criteria.

A 20-minute prompt like “build an app” gives you 20 minutes of work. A 10-hour prompt with 30 tasks, subagent workflows, and exit criteria gives you 10 hours of work.

The difference is not magic. It is structure. The agent needs to know what to do, how to do it, and when it is done.

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.
