How Do You Get an AI Agent to Run for Several Hours?
Problem
I tried asking Codex to build a full-stack app. I got excited about the possibilities. I typed:
Build me a full-stack app with user authentication, a dashboard, and API endpoints.

Twenty minutes later, it stopped. I got some boilerplate code, a basic login page, and nothing else. No tests. No error handling. No deployment config.
I wanted hours of work, but I got minutes.
So I asked on Reddit: “Has anyone tried running Codex for several hours?” The answers showed me what I was doing wrong.
What happened?
My prompt was too vague. I gave Codex a destination (“build an app”) but no map. It did what it could in a reasonable time, then stopped.
A highly-upvoted comment explained the difference:
20-minute request: "Build me xyz app to do abc thing"
10-hour request: "Build this app by completing all 30 of these sequential tasks, and use a Planner, Builder, and QA subagent one at a time to ensure each task is of maximally high quality. Be patient with each subagent, and be precise in your instructions to each one. Do not stop orchestrating until all 30 tasks have been completed."

The difference is clear:
- 20-minute prompt: High-level goal, no structure
- 10-hour prompt: Decomposed tasks, subagent pattern, exit criteria
I needed to change my approach.
How to solve it?
I broke down the solution into three components: decomposition, subagent pattern, and exit criteria.
Component 1: Task Decomposition
Instead of one big task, I split it into 20-50 small tasks. Each task should be completable in 10-30 minutes.
Task 1: Create project structure with folders for src, tests, config
Task 2: Set up package.json with required dependencies
Task 3: Create database schema for users table
Task 4: Write migration script for user table
Task 5: Create User model with CRUD methods
Task 6: Write unit tests for User model
Task 7: Create authentication middleware
Task 8: Write tests for authentication middleware
Task 9: Create login endpoint /api/auth/login
Task 10: Write tests for login endpoint
... (continue to Task 30)

The key is: each task is atomic, testable, and has a clear definition of done.
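One way to keep a task list honest is to hold it as data and check that every task has a definition of done before it goes into the prompt. The sketch below is only an illustration: the `Task` class, the helper names, and the sample exit criterion are assumptions of mine, not part of any Codex tooling.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One atomic unit of work with its own definition of done (hypothetical type)."""
    number: int
    title: str
    exit_criteria: list[str] = field(default_factory=list)


def find_underspecified(tasks: list[Task]) -> list[int]:
    """Return the numbers of tasks that have no exit criteria yet."""
    return [t.number for t in tasks if not t.exit_criteria]


def render_task_list(tasks: list[Task]) -> str:
    """Render tasks as 'Task N: title' lines, one per task."""
    return "\n".join(f"Task {t.number}: {t.title}" for t in tasks)


tasks = [
    Task(1, "Create project structure with folders for src, tests, config",
         ["src/, tests/, and config/ exist"]),  # illustrative criterion
    Task(2, "Set up package.json with required dependencies"),  # no criteria yet
]

print(render_task_list(tasks))
print("Missing a definition of done:", find_underspecified(tasks))
```

Running the check before sending the prompt flags tasks like Task 2 here, which would otherwise leave "done" up to the agent.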
Component 2: Subagent Pattern
I use three subagents in sequence:
+----------+     +----------+     +----------+
| Planner  | --> | Builder  | --> |    QA    |
+----------+     +----------+     +----------+
     |                |                |
     v                v                v
 Plan task      Implement task    Verify task
 Break down     Write code        Run tests
 Define specs   Handle errors     Check quality

Planner reads the task and creates a detailed plan with:
- Files to create or modify
- Functions to implement
- Expected behavior
- Test cases to write
Builder implements the plan:
- Writes the code
- Handles errors
- Follows the specifications
QA verifies the implementation:
- Runs all tests
- Checks edge cases
- Validates against the plan
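The hand-off between the three roles can be sketched as a small pipeline. Everything here is a stand-in: in a real run each phase would be a separate subagent call, while the toy functions below only show how the data flows (task to plan, plan to artifact, then a verdict).

```python
from typing import Callable

# Hypothetical signatures for the three roles.
Planner = Callable[[str], str]       # task -> plan
Builder = Callable[[str], str]       # plan -> artifact (e.g. code)
QA = Callable[[str, str], bool]      # (plan, artifact) -> passed?


def run_task(task: str, planner: Planner, builder: Builder, qa: QA) -> dict:
    """Run one task through Planner, Builder, and QA, strictly in sequence."""
    plan = planner(task)
    artifact = builder(plan)
    passed = qa(plan, artifact)
    return {"task": task, "plan": plan, "artifact": artifact, "passed": passed}


# Toy stand-ins so the sketch runs without any agent behind it.
def toy_planner(task: str) -> str:
    return f"PLAN[{task}]"


def toy_builder(plan: str) -> str:
    return f"CODE[{plan}]"


def toy_qa(plan: str, artifact: str) -> bool:
    # "Validates against the plan": here, just check the plan is reflected.
    return plan in artifact


result = run_task("Create User model", toy_planner, toy_builder, toy_qa)
print(result["passed"])
```

Keeping the phases as separate functions mirrors the point of the pattern: QA judges the artifact against the plan, not against whatever the Builder happened to produce.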
Component 3: Exit Criteria
Each task has explicit success conditions:
Task 5: Create User model with CRUD methods
Exit Criteria:
1. User.create() inserts a user and returns user ID
2. User.findById() returns user object or null
3. User.update() modifies user and returns true
4. User.delete() removes user and returns true
5. All methods have error handling for invalid input
6. Unit tests cover all methods with >80% coverage
7. No linting errors in the file

DO NOT proceed to Task 6 until ALL criteria pass.

The “do not proceed” instruction is critical. Without it, the agent might skip to the next task when tests fail.
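The same gate can be expressed in code: a task is done only when every criterion passes, and a check that crashes counts as a failure. This is a minimal sketch under my own assumptions; the criterion names echo Task 5, but the lambdas are placeholders for real test, coverage, and lint runs.

```python
from typing import Callable

Check = Callable[[], bool]


def failing_criteria(criteria: dict[str, Check]) -> list[str]:
    """Return the names of all criteria that fail or raise."""
    failed = []
    for name, check in criteria.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check is treated as a failed criterion
        if not ok:
            failed.append(name)
    return failed


def may_proceed(criteria: dict[str, Check]) -> bool:
    """Only allow the next task when ALL criteria pass."""
    return not failing_criteria(criteria)


# Placeholder checks; real ones would run the test suite, coverage, and linter.
task_5_criteria = {
    "unit tests pass": lambda: True,
    "coverage above 80%": lambda: 85 > 80,
    "no linting errors": lambda: False,
}

print(failing_criteria(task_5_criteria))
print(may_proceed(task_5_criteria))
```

Returning the list of failed criteria, rather than a bare boolean, gives the agent something concrete to debug before it retries.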
The result
Here’s my prompt now:
Build a full-stack user management app by completing these 30 tasks.
For each task, follow this workflow:
1. PLANNER: Read task, create detailed implementation plan
2. BUILDER: Implement the plan
3. QA: Run tests and verify all exit criteria pass

Task 1: Project setup
- Create folder structure: src/, tests/, config/
- Initialize package.json with express, pg, jest
- Create .env.example with DATABASE_URL, PORT
Exit criteria: npm install runs without errors

Task 2: Database configuration
- Create config/database.js
- Connect to PostgreSQL using pg library
- Export connection pool
Exit criteria: Connection test passes

Task 3: User table schema
- Create migrations/001_create_users.sql
- Columns: id, email, password_hash, created_at, updated_at
- Add unique constraint on email
Exit criteria: Migration runs successfully

[... continue for all 30 tasks ...]

Task 30: Deployment config
- Create Dockerfile with node:20-alpine
- Create docker-compose.yml with app and postgres
- Add health check endpoint
Exit criteria: docker-compose up runs the app

IMPORTANT: Do not stop until all 30 tasks are complete.
If a task fails, debug and retry until it passes.
Only move to next task when all exit criteria are verified.

This prompt gives the agent:
- Clear structure (30 tasks)
- A workflow to follow (Planner-Builder-QA)
- Explicit stop condition (“do not stop until all 30 tasks are complete”)
- Retry behavior (“if a task fails, debug and retry”)
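Because the prompt has such a regular shape, it can be generated from a task list instead of typed by hand. The sketch below assembles the same four ingredients (task count, workflow, per-task exit criteria, stop and retry rules). `TaskSpec` and `build_prompt` are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass
class TaskSpec:
    """One task block in the prompt (hypothetical type)."""
    title: str
    steps: list[str]
    exit_criteria: str


WORKFLOW = (
    "For each task, follow this workflow:\n"
    "1. PLANNER: Read task, create detailed implementation plan\n"
    "2. BUILDER: Implement the plan\n"
    "3. QA: Run tests and verify all exit criteria pass"
)


def build_prompt(goal: str, specs: list[TaskSpec]) -> str:
    """Assemble goal, workflow, numbered tasks, and stop/retry rules."""
    n = len(specs)
    parts = [f"{goal} by completing these {n} tasks.", WORKFLOW]
    for i, spec in enumerate(specs, start=1):
        lines = [f"Task {i}: {spec.title}"]
        lines += [f"- {step}" for step in spec.steps]
        lines.append(f"Exit criteria: {spec.exit_criteria}")
        parts.append("\n".join(lines))
    parts.append(
        f"IMPORTANT: Do not stop until all {n} tasks are complete.\n"
        "If a task fails, debug and retry until it passes.\n"
        "Only move to next task when all exit criteria are verified."
    )
    return "\n\n".join(parts)


specs = [
    TaskSpec("Project setup",
             ["Create folder structure: src/, tests/, config/"],
             "npm install runs without errors"),
    TaskSpec("Database configuration",
             ["Create config/database.js"],
             "Connection test passes"),
]
prompt = build_prompt("Build a full-stack user management app", specs)
print(prompt)
```

Deriving the task count from the list keeps the opening line and the final "do not stop" instruction in agreement when tasks are added or removed.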
Why this works
Token efficiency: The agent doesn’t waste tokens figuring out what to do next. Each task is pre-defined.
Quality assurance: The QA phase catches issues before moving on. No accumulating technical debt.
Autonomy: With clear exit criteria, the agent knows when a task is truly done. No guessing.
Reproducibility: Anyone can run the same prompt and get broadly similar results, because the task list and exit criteria stay fixed from run to run.
The Reddit thread had another approach worth mentioning:
My workflow:
1. Create detailed PRD (Product Requirements Document)
2. Create TECH_SPEC.md with architecture decisions
3. Create TODO.md with numbered tasks
4. Ask Codex to complete TODOs one by one

This works too. The key is: you provide the structure upfront, so the agent can focus on execution.
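For the TODO.md variant, "one by one" comes down to always picking the first unfinished item. The commenter did not describe a file format, so the checkbox syntax below (`- [ ] 1. ...`) is purely my assumption, as is the helper name.

```python
import re
from typing import Optional

# Assumed line format: "- [ ] 3. Task text" (open) or "- [x] 3. Task text" (done).
TODO_LINE = re.compile(r"^- \[( |x)\] (\d+)\. (.+)$")


def next_open_task(todo_text: str) -> Optional[tuple[int, str]]:
    """Return (number, text) of the first unchecked task, or None if all are done."""
    for line in todo_text.splitlines():
        match = TODO_LINE.match(line.strip())
        if match and match.group(1) == " ":
            return int(match.group(2)), match.group(3)
    return None


sample_todo = """\
- [x] 1. Set up database
- [ ] 2. Create models
- [ ] 3. Build API
"""

print(next_open_task(sample_todo))
```

A file the agent checks off as it goes also doubles as a progress log if the session is interrupted.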
Common mistakes
I made these mistakes before I learned the right approach:
Mistake 1: Vague scope
Build me an app that does user management.

No task count. No exit criteria. No structure. Result: 20 minutes of work.
Mistake 2: No verification
Complete these tasks:
1. Set up database
2. Create models
3. Build API

Each task is done when… what? Without tests or criteria, the agent decides arbitrarily.
Mistake 3: Single-phase thinking
Write all the code for a user management system.

No separation between planning, coding, and testing. Quality suffers.
Mistake 4: Missing “continue” instruction
Complete tasks 1-30.

After task 1 fails, should it retry? Move to task 2? Stop? Without explicit instruction, it might do any of these.
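Spelling the policy out removes the ambiguity: retry the failing task, never skip ahead, and report where the run got stuck. The sketch below makes one assumption beyond the prompt in this post, which says to retry until the task passes: it caps retries with a `max_attempts` parameter so a hopeless task cannot loop forever. All names are hypothetical.

```python
from typing import Callable, Optional


def run_all(tasks: list[str],
            attempt: Callable[[str, int], bool],
            max_attempts: int = 3) -> dict:
    """Run tasks in order; retry failures, never skip, stop when one stays broken."""
    completed: list[str] = []
    blocked_on: Optional[str] = None
    for task in tasks:
        for n in range(1, max_attempts + 1):
            if attempt(task, n):
                completed.append(task)
                break
        else:
            # Every attempt failed: stop here instead of moving on.
            blocked_on = task
            break
    return {"completed": completed, "blocked_on": blocked_on}


# Toy attempt function: Task 2 needs a second try, Task 3 never passes.
def toy_attempt(task: str, attempt_number: int) -> bool:
    attempts_needed = {"Task 1": 1, "Task 2": 2}
    return attempt_number >= attempts_needed.get(task, 99)


outcome = run_all(["Task 1", "Task 2", "Task 3", "Task 4"], toy_attempt)
print(outcome)
```

Task 4 is never attempted in this run, which is the behavior the "only move to next task" instruction asks for.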
Summary
In this post, I showed how to structure prompts for long-running AI coding tasks. The key points are decomposing work into 20-50 sequential tasks, using the Planner-Builder-QA pattern, and defining explicit exit criteria.
A 20-minute prompt like “build an app” gives you 20 minutes of work. A 10-hour prompt with 30 tasks, subagent workflows, and exit criteria gives you 10 hours of work.
The difference is not magic. It is structure. The agent needs to know what to do, how to do it, and when it is done.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.