
How to Build a Deep Research Agent from Scratch in Python

I wanted to build a deep research agent - something like Perplexity or OpenAI’s deep research feature. As a side project to showcase my skills for a founding engineer role, I needed to decide: should I use LangGraph, or build it with raw Python?

After posting on r/LangChain, the community response surprised me. The consensus was clear: for a portfolio project targeting startup roles, raw Python with minimal abstraction is actually the stronger signal. It shows I understand what the framework is doing underneath.

Problem

I started with LangGraph because everyone said it’s the “modern” way to build agents. Here’s what I tried first:

langgraph-attempt.py
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from typing import TypedDict

class AgentState(TypedDict):
    query: str
    sources: list
    findings: list

def search_node(state: AgentState):
    # LangGraph handles state management
    results = search_api.query(state["query"])
    return {"sources": results}

def extract_node(state: AgentState):
    # But what if I need custom caching?
    # What if I want to track visited URLs differently?
    # The framework fights me at every turn.
    pass

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("search", search_node)
workflow.add_node("extract", extract_node)
# ... more nodes

I got frustrated quickly. Every time I wanted to customize something - like my own caching strategy or URL tracking - I had to work around the framework. The abstraction helped with the simple cases but got in the way as soon as I needed real customization.

The Realization

One comment on my Reddit post hit home:

“Agents run tools in loop to achieve a goal, that’s it. You will need to handroll some of the scaffolding… but I promise you that it’s worthwhile (and you will have a much better understanding of how these things work)”

Another pointed out:

“For a side project showcasing skills, raw Python with minimal abstraction is actually the stronger signal to a founding engineer hiring manager. It shows you understand what the framework is doing underneath.”

That’s when I realized: I was optimizing for the wrong thing. I should optimize for understanding and control, not framework adoption.

Solution: Raw Python Architecture

I rebuilt the agent from scratch. The architecture is simple:

  1. Search - Query search APIs, get URLs
  2. Extract - Fetch URLs, extract text content
  3. Remember - Track visited URLs, cache results
  4. Synthesize - Combine findings, cite sources
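
Before the real components, the four steps can be sketched as one plain loop. This is only a skeleton under my own assumptions: `research_loop` and its three callable parameters are hypothetical names, and each callable stands in for a component built later in the article.

```python
from typing import Callable, List, Optional, Set, Tuple

Finding = Tuple[str, str]  # (url, extracted text)


def research_loop(
    query: str,
    search: Callable[[str], List[str]],
    extract: Callable[[str], Optional[str]],
    synthesize: Callable[[str, List[Finding]], str],
) -> str:
    """Run search -> extract -> remember -> synthesize once."""
    visited: Set[str] = set()       # Remember: URLs already handled
    findings: List[Finding] = []

    for url in search(query):       # Search: get candidate URLs
        if url in visited:
            continue
        visited.add(url)
        text = extract(url)         # Extract: fetch and clean the page
        if text:
            findings.append((url, text))

    return synthesize(query, findings)  # Synthesize: combine and cite
```

Passing the steps in as callables keeps the loop testable with fakes, which is handy before any API keys are wired up.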

Core Data Structures

First, I defined the data structures:

models.py
from dataclasses import dataclass, field
from typing import List, Optional
import hashlib

@dataclass
class Source:
    url: str
    title: str
    content: str
    relevance_score: float = 0.0

@dataclass
class ResearchMemory:
    visited_urls: set = field(default_factory=set)
    content_cache: dict = field(default_factory=dict)
    findings: List[Source] = field(default_factory=list)

    def has_visited(self, url: str) -> bool:
        return url in self.visited_urls

    def get_cached(self, url: str) -> Optional[str]:
        content_hash = hashlib.md5(url.encode()).hexdigest()
        return self.content_cache.get(content_hash)

    def cache_content(self, url: str, content: str):
        content_hash = hashlib.md5(url.encode()).hexdigest()
        self.content_cache[content_hash] = content

These dataclasses give me exactly what I need: tracking for visited URLs, caching for extracted content, and a list of findings.
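
One thing to watch: both the visited set and the cache key off the raw URL string, so `https://example.com/docs/` and `https://example.com/docs#intro` count as different pages. A possible refinement is to normalize URLs before storing them. The sketch below is an assumption of mine, not part of the agent above; `normalize_url` is a hypothetical helper, and treating a trailing slash as insignificant is a simplification that will not hold for every site.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Return a canonical form of a URL for visited/cache lookups."""
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    # Drop trailing slashes, but keep "/" for the site root
    path = parts.path.rstrip("/") or "/"
    # Keep the query string, drop the fragment (it never reaches the server)
    return urlunsplit((scheme, netloc, path, parts.query, ""))
```

Calling this inside `has_visited`, `get_cached`, and `cache_content` would make all three agree on what counts as the same page.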

The Search Component

For search, I chose Tavily because it’s purpose-built for AI agents. It returns structured results with relevance scores:

search.py
import os
from typing import List
from tavily import TavilyClient
from models import Source

class SearchEngine:
    def __init__(self, api_key: str = None):
        self.client = TavilyClient(api_key=api_key or os.getenv("TAVILY_API_KEY"))

    def search(self, query: str, max_results: int = 10) -> List[Source]:
        """Search for relevant sources."""
        response = self.client.search(
            query=query,
            search_depth="advanced",
            max_results=max_results
        )
        sources = []
        for result in response.get("results", []):
            sources.append(Source(
                url=result["url"],
                title=result["title"],
                content=result.get("content", ""),
                relevance_score=result.get("score", 0.0)
            ))
        return sources

I tried SerpAPI first, but Tavily’s structured output saved me parsing time.

Content Extraction

This is where things got interesting. I needed to extract text from URLs, handle different content types, and deal with JavaScript-rendered pages.

extractor.py
from bs4 import BeautifulSoup
import requests
from typing import Optional
from models import ResearchMemory

class ContentExtractor:
    def __init__(self, memory: ResearchMemory, timeout: int = 10):
        self.memory = memory
        self.timeout = timeout

    def fetch(self, url: str) -> Optional[str]:
        """Fetch and extract text from a URL."""
        # Check cache first
        cached = self.memory.get_cached(url)
        if cached:
            return cached
        try:
            response = requests.get(
                url,
                timeout=self.timeout,
                headers={"User-Agent": "Mozilla/5.0 Research Agent"}
            )
            response.raise_for_status()
            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')
            # Remove scripts, styles, and page chrome
            for element in soup(["script", "style", "nav", "footer", "header"]):
                element.decompose()
            # Extract text from paragraphs
            paragraphs = soup.find_all('p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)
            # Cache the result
            self.memory.cache_content(url, text)
            return text
        except requests.RequestException as e:
            print(f"Failed to fetch {url}: {e}")
            return None

The key insight: remove scripts, styles, and page chrome (nav, header, footer) before extracting text. Otherwise you get menu items mixed with content.
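
To see the effect without any network access, here is a small stdlib-only illustration of the same idea. It swaps BeautifulSoup for `html.parser` purely so the example has no dependencies; `ParagraphText` and `extract_paragraphs` are hypothetical names, and this is a sketch, not a replacement for the extractor above.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "header"}


class ParagraphText(HTMLParser):
    """Collect text from <p> tags that are not inside a skipped tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # >0 while inside nav/footer/script/...
        self.in_paragraph = False
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data


def extract_paragraphs(html: str) -> str:
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(p.strip() for p in parser.paragraphs if p.strip())


page = (
    "<nav><p>Home</p></nav>"
    "<p>Real content.</p>"
    "<footer><p>Copyright</p></footer>"
)
print(extract_paragraphs(page))
```

The paragraphs inside `nav` and `footer` are dropped, so only the body paragraph survives.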

For JavaScript-rendered pages, I added Playwright as a fallback:

js_extractor.py
from typing import Optional
from playwright.sync_api import sync_playwright

class JSContentExtractor:
    def fetch(self, url: str, wait_time: int = 2000) -> Optional[str]:
        """Fetch JavaScript-rendered pages."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            try:
                page.goto(url, timeout=30000)
                # Give client-side scripts time to render (milliseconds)
                page.wait_for_timeout(wait_time)
                # Get text content
                text = page.inner_text("body")
                return text
            except Exception as e:
                print(f"JS extraction failed: {e}")
                return None
            finally:
                browser.close()

I don’t use this for every page - only when the static extractor returns empty content.
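
The wiring for that fallback is not shown above, so here is one way it might look. `fetch_with_fallback` and the `min_chars` threshold are assumptions of mine; the two fetchers are passed in as plain callables so the sketch stays independent of the classes above.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(
    url: str,
    static_fetch: Fetcher,
    js_fetch: Fetcher,
    min_chars: int = 1,
) -> Optional[str]:
    """Try the cheap static fetch first; use the JS fetch only if it comes back empty."""
    text = static_fetch(url)
    if text and len(text.strip()) >= min_chars:
        return text
    # Static extraction failed or returned (almost) nothing: render the page
    return js_fetch(url)
```

With the classes from this article, the call would be roughly `fetch_with_fallback(url, extractor.fetch, JSContentExtractor().fetch)`. Raising `min_chars` is a crude way to also catch pages whose static HTML contains only a loading message.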

Relevance Scoring

Not every search result is relevant. I needed to filter out noise:

scorer.py
from openai import OpenAI

class RelevanceScorer:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def score(self, content: str, query: str) -> float:
        """Score content relevance to query (0.0 to 1.0)."""
        prompt = f"""Rate the relevance of this content to the research query.
Query: {query}
Content: {content[:1000]}
Return only a number from 0.0 to 1.0 where:
- 1.0 = Directly answers the query
- 0.5 = Partially relevant
- 0.0 = Not relevant"""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        try:
            return float(response.choices[0].message.content.strip())
        except (AttributeError, ValueError):
            # Treat an empty or non-numeric reply as not relevant
            return 0.0

Using GPT-4o-mini keeps costs down. I only send the first 1000 characters - enough to judge relevance without burning tokens.
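
One weak spot is the parsing: a reply like "Relevance: 0.8" fails `float()` and is silently scored 0.0. A more forgiving parser could pull out the first number and clamp it to the expected range. This is a sketch under my own assumptions; `parse_score` is a hypothetical helper, and it will still misread replies such as "8/10".

```python
import re
from typing import Optional

# First signed integer or decimal anywhere in the reply
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_score(reply: Optional[str], default: float = 0.0) -> float:
    """Extract a relevance score from a model reply and clamp it to [0.0, 1.0]."""
    if not reply:
        return default
    match = _NUMBER.search(reply)
    if match is None:
        return default
    return min(1.0, max(0.0, float(match.group())))
```

Clamping matters because the threshold comparison downstream assumes the score never leaves the 0.0 to 1.0 range.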

The Main Agent

Now I can put it all together:

agent.py
from typing import List
from search import SearchEngine
from extractor import ContentExtractor
from scorer import RelevanceScorer
from models import ResearchMemory, Source

class DeepResearchAgent:
    def __init__(
        self,
        search_api_key: str = None,
        openai_api_key: str = None,
        min_relevance: float = 0.6,
        max_sources: int = 10
    ):
        # Note: openai_api_key is accepted but not passed on below;
        # OpenAI() falls back to the OPENAI_API_KEY environment variable.
        self.memory = ResearchMemory()
        self.search = SearchEngine(api_key=search_api_key)
        self.extractor = ContentExtractor(self.memory)
        self.scorer = RelevanceScorer()
        self.min_relevance = min_relevance
        self.max_sources = max_sources

    def research(self, query: str) -> str:
        """Conduct deep research on a query."""
        print(f"Researching: {query}")
        # 1. Search for sources
        candidates = self.search.search(query, max_results=self.max_sources * 2)
        print(f"Found {len(candidates)} candidate sources")
        # 2. Extract and score each source
        for candidate in candidates:
            if self.memory.has_visited(candidate.url):
                continue
            content = self.extractor.fetch(candidate.url)
            if not content:
                continue
            # Score relevance
            relevance = self.scorer.score(content, query)
            if relevance >= self.min_relevance:
                self.memory.findings.append(Source(
                    url=candidate.url,
                    title=candidate.title,
                    content=content,
                    relevance_score=relevance
                ))
                print(f"  + {candidate.title} (relevance: {relevance:.2f})")
            self.memory.visited_urls.add(candidate.url)
            if len(self.memory.findings) >= self.max_sources:
                break
        # 3. Synthesize findings
        return self._synthesize(query)

    def _synthesize(self, query: str) -> str:
        """Combine findings into a coherent response."""
        if not self.memory.findings:
            return "No relevant sources found."
        # Build context from findings
        context_parts = []
        for i, source in enumerate(self.memory.findings, 1):
            context_parts.append(
                f"[{i}] {source.title}\n"
                f"URL: {source.url}\n"
                f"Content: {source.content[:500]}...\n"
            )
        context = "\n".join(context_parts)
        # Generate synthesis
        prompt = f"""Research question: {query}
Sources:
{context}
Please provide a comprehensive answer with citations [1], [2], etc."""
        # Using OpenAI for synthesis
        from openai import OpenAI
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )
        return response.choices[0].message.content

Using the Agent

Here’s how I use it:

main.py
from agent import DeepResearchAgent

# Initialize
agent = DeepResearchAgent(
    min_relevance=0.6,  # Only include sources with 60%+ relevance
    max_sources=10      # Maximum sources to include
)

# Research
result = agent.research("What are the best practices for building AI agents?")
print(result)
print("\nSources used:")
for i, source in enumerate(agent.memory.findings, 1):
    print(f"[{i}] {source.title} - {source.url}")

Running this gives me:

output.txt
Researching: What are the best practices for building AI agents?
Found 20 candidate sources
+ Building Production AI Agents (relevance: 0.92)
+ AI Agent Architecture Patterns (relevance: 0.88)
+ LangGraph vs Raw Python (relevance: 0.85)
+ ...
[Synthesized response with citations]
Sources used:
[1] Building Production AI Agents - https://example.com/ai-agents
[2] AI Agent Architecture Patterns - https://example.com/patterns
...

What I Learned

Building this from scratch taught me more than using LangGraph ever would have:

  1. State management is simple - A few dataclasses handle everything
  2. Caching matters - Without it, I’d hit rate limits fast
  3. Relevance filtering is critical - Raw search results are noisy
  4. The loop is the framework - Search, extract, score, synthesize - that’s it

The total code is under 300 lines. Compare that to learning LangGraph’s concepts: StateGraph, Node, Edge, ConditionalEdge, checkpointer, memory…

When to Use Frameworks

Frameworks like LangGraph aren’t bad. They’re great for:

  • Teams that need standardization
  • Complex multi-agent systems
  • When you need persistence and human-in-the-loop
  • Quick prototyping without understanding internals

But for a portfolio project showing deep understanding? Raw Python wins.

Next Steps

I’m extending this agent with:

  1. Recursive link following - Extract links from pages, follow relevant ones
  2. Semantic deduplication - Use embeddings to avoid similar content
  3. Parallel extraction - Use asyncio to fetch multiple URLs simultaneously
  4. Cost tracking - Monitor LLM API costs per research session

The beauty of raw Python: I can add any of these without fighting a framework.
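
As an example, parallel extraction (item 3) could start out as small as this. It is a sketch, not the finished feature: `fetch_all` and `max_concurrency` are hypothetical names, it needs Python 3.9+ for `asyncio.to_thread`, and it assumes the blocking fetch function is safe to call from several threads at once.

```python
import asyncio
from typing import Callable, Dict, List, Optional


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Optional[str]],
    max_concurrency: int = 5,
) -> Dict[str, Optional[str]]:
    """Run a blocking fetch function over many URLs concurrently."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[str]:
        # The semaphore caps how many fetches are in flight at once
        async with semaphore:
            # Run the blocking call in a worker thread
            return await asyncio.to_thread(fetch, url)

    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(zip(urls, results))
```

With the extractor from this article, usage would look like `asyncio.run(fetch_all(urls, extractor.fetch))`. Running the existing `requests`-based fetch in threads avoids rewriting it around an async HTTP client.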

Final Words + More Resources

My intention with this article was to share my knowledge and experience in the hope that it helps others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
