How to Build a Deep Research Agent from Scratch in Python
I wanted to build a deep research agent - something like Perplexity or OpenAI’s deep research feature. Because this was a side project meant to showcase my skills for a founding engineer role, I had to decide: should I use LangGraph, or build it with raw Python?
After posting on r/LangChain, the community response surprised me. The consensus was clear: for a portfolio project targeting startup roles, raw Python with minimal abstraction is actually the stronger signal. It shows I understand what the framework is doing underneath.
Problem
I started with LangGraph because everyone said it’s the “modern” way to build agents. Here’s what I tried first:
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from typing import TypedDict

class AgentState(TypedDict):
    query: str
    sources: list
    findings: list

def search_node(state: AgentState):
    # LangGraph handles state management
    results = search_api.query(state["query"])
    return {"sources": results}

def extract_node(state: AgentState):
    # But what if I need custom caching?
    # What if I want to track visited URLs differently?
    # The framework fights me at every turn.
    pass

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("search", search_node)
workflow.add_node("extract", extract_node)
# ... more nodes

I got frustrated quickly. Every time I wanted to customize something - like my own caching strategy or URL tracking - I had to work around the framework. The abstraction was helping with simple cases but hurting with real complexity.
The Realization
One comment on my Reddit post hit home:
“Agents run tools in loop to achieve a goal, that’s it. You will need to handroll some of the scaffolding… but I promise you that it’s worthwhile (and you will have a much better understanding of how these things work)”
Another pointed out:
“For a side project showcasing skills, raw Python with minimal abstraction is actually the stronger signal to a founding engineer hiring manager. It shows you understand what the framework is doing underneath.”
That’s when I realized: I was optimizing for the wrong thing. I should optimize for understanding and control, not framework adoption.
Solution: Raw Python Architecture
I rebuilt the agent from scratch. The architecture is simple:
- Search - Query search APIs, get URLs
- Extract - Fetch URLs, extract text content
- Remember - Track visited URLs, cache results
- Synthesize - Combine findings, cite sources
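The four stages above compose into a single loop. Here is a minimal sketch of that control flow, assuming each stage is passed in as a plain callable. The name `run_pipeline` and its parameters are illustrative only and are not part of the final agent shown later in this post.

```python
from typing import Callable, List, Optional, Set, Tuple


def run_pipeline(
    query: str,
    search: Callable[[str], List[str]],
    extract: Callable[[str], Optional[str]],
    synthesize: Callable[[str, List[Tuple[str, str]]], str],
) -> str:
    """Search -> extract -> remember -> synthesize, with every stage injected."""
    visited: Set[str] = set()               # Remember: URLs already handled
    findings: List[Tuple[str, str]] = []    # (url, text) pairs kept for synthesis

    for url in search(query):               # Search: candidate URLs
        if url in visited:
            continue
        visited.add(url)
        text = extract(url)                 # Extract: page text, or None on failure
        if text:
            findings.append((url, text))

    return synthesize(query, findings)      # Synthesize: answer built from findings
```

Because the stages are injected, the loop can be exercised with stub functions before any API key or network call is involved.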
Core Data Structures
First, I defined the data structures:
from dataclasses import dataclass, field
from typing import List, Optional
import hashlib

@dataclass
class Source:
    url: str
    title: str
    content: str
    relevance_score: float = 0.0

@dataclass
class ResearchMemory:
    visited_urls: set = field(default_factory=set)
    content_cache: dict = field(default_factory=dict)
    findings: List[Source] = field(default_factory=list)

    def has_visited(self, url: str) -> bool:
        return url in self.visited_urls

    def get_cached(self, url: str) -> Optional[str]:
        content_hash = hashlib.md5(url.encode()).hexdigest()
        return self.content_cache.get(content_hash)

    def cache_content(self, url: str, content: str):
        content_hash = hashlib.md5(url.encode()).hexdigest()
        self.content_cache[content_hash] = content

These dataclasses give me exactly what I need: tracking for visited URLs, caching for extracted content, and a list of findings.
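To make the caching behaviour concrete, here is a small standalone check. The class is repeated in trimmed form so the snippet runs on its own, and the URL is a placeholder, not one the agent actually visited.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ResearchMemory:
    """Trimmed copy of the class above, so this snippet runs standalone."""
    visited_urls: set = field(default_factory=set)
    content_cache: dict = field(default_factory=dict)

    def get_cached(self, url: str) -> Optional[str]:
        key = hashlib.md5(url.encode()).hexdigest()
        return self.content_cache.get(key)

    def cache_content(self, url: str, content: str) -> None:
        key = hashlib.md5(url.encode()).hexdigest()
        self.content_cache[key] = content


memory = ResearchMemory()
url = "https://example.com/article"  # placeholder URL

first_lookup = memory.get_cached(url)     # nothing cached yet
memory.cache_content(url, "page text")
second_lookup = memory.get_cached(url)    # now served from the cache

# default_factory gives every instance its own containers, so two
# research sessions never share state by accident.
other = ResearchMemory()
shares_cache = other.content_cache is memory.content_cache
```

The `default_factory` detail matters: a plain mutable default such as `set()` is rejected by `dataclass`, and a shared class-level container would leak state between agents.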
The Search Component
For search, I chose Tavily because it’s purpose-built for AI agents. It returns structured results with relevance scores:
import os
from typing import List
from tavily import TavilyClient
from models import Source

class SearchEngine:
    def __init__(self, api_key: str = None):
        self.client = TavilyClient(api_key=api_key or os.getenv("TAVILY_API_KEY"))

    def search(self, query: str, max_results: int = 10) -> List[Source]:
        """Search for relevant sources."""
        response = self.client.search(
            query=query,
            search_depth="advanced",
            max_results=max_results
        )

        sources = []
        for result in response.get("results", []):
            sources.append(Source(
                url=result["url"],
                title=result["title"],
                content=result.get("content", ""),
                relevance_score=result.get("score", 0.0)
            ))

        return sources

I tried SerpAPI first, but Tavily’s structured output saved me parsing time.
Content Extraction
This is where things got interesting. I needed to extract text from URLs, handle different content types, and deal with JavaScript-rendered pages.
from bs4 import BeautifulSoup
import requests
from typing import Optional
from models import ResearchMemory

class ContentExtractor:
    def __init__(self, memory: ResearchMemory, timeout: int = 10):
        self.memory = memory
        self.timeout = timeout

    def fetch(self, url: str) -> Optional[str]:
        """Fetch and extract text from a URL."""
        # Check cache first ("is not None" so an empty cached page is not refetched)
        cached = self.memory.get_cached(url)
        if cached is not None:
            return cached

        try:
            response = requests.get(
                url,
                timeout=self.timeout,
                headers={"User-Agent": "Mozilla/5.0 Research Agent"}
            )
            response.raise_for_status()

            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove scripts, styles, and page chrome (nav, footer, header)
            for element in soup(["script", "style", "nav", "footer", "header"]):
                element.decompose()

            # Extract text from paragraphs
            paragraphs = soup.find_all('p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)

            # Cache the result
            self.memory.cache_content(url, text)

            return text

        except requests.RequestException as e:
            print(f"Failed to fetch {url}: {e}")
            return None

The key insight: remove navigation, scripts, and styles before extracting text. Otherwise you get menu items mixed with content.
For JavaScript-rendered pages, I added Playwright as a fallback:
from typing import Optional
from playwright.sync_api import sync_playwright

class JSContentExtractor:
    def fetch(self, url: str, wait_time: int = 2000) -> Optional[str]:
        """Fetch JavaScript-rendered pages."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            try:
                page.goto(url, timeout=30000)
                # Give client-side scripts time to render (milliseconds)
                page.wait_for_timeout(wait_time)

                # Get text content
                text = page.inner_text("body")
                return text

            except Exception as e:
                print(f"JS extraction failed: {e}")
                return None

            finally:
                browser.close()

I don’t use this for every page - only when the static extractor returns empty content.
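The fallback rule can be captured in a small helper. This is a sketch rather than code from the original project: `fetch_with_fallback` is a hypothetical name, and both extractors are passed in as callables so the logic can be tried without network access or a browser.

```python
from typing import Callable, Optional

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, static_fetch: Fetcher, js_fetch: Fetcher) -> Optional[str]:
    """Try the cheap static fetch first; use the browser only if it yields nothing."""
    text = static_fetch(url)
    if text and text.strip():
        return text
    # Static extraction gave None, "" or only whitespace: likely a JS-rendered page.
    return js_fetch(url)
```

With the two classes above, the call would presumably look like `fetch_with_fallback(url, ContentExtractor(memory).fetch, JSContentExtractor().fetch)`, which keeps the slow headless browser off the common path.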
Relevance Scoring
Not every search result is relevant. I needed to filter out noise:
from openai import OpenAI
from models import Source

class RelevanceScorer:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def score(self, content: str, query: str) -> float:
        """Score content relevance to query (0.0 to 1.0)."""
        prompt = f"""Rate the relevance of this content to the research query.

Query: {query}

Content: {content[:1000]}

Return only a number from 0.0 to 1.0 where:
- 1.0 = Directly answers the query
- 0.5 = Partially relevant
- 0.0 = Not relevant"""

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

        try:
            return float(response.choices[0].message.content.strip())
        except ValueError:
            return 0.0

Using GPT-4o-mini keeps costs down. I only send the first 1000 characters - enough to judge relevance without burning tokens.
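One fragile spot is parsing the model's reply: `float()` raises on anything like "Relevance: 0.8", and nothing stops the model from returning an out-of-range number. A more forgiving parser might look like the sketch below. `parse_score` is a hypothetical helper, not part of the original scorer, and it deliberately ignores a leading minus sign.

```python
import re
from typing import Optional

# First run of digits with an optional fractional part, e.g. "0.85", "1", ".5"
_NUMBER = re.compile(r"\d+(?:\.\d+)?|\.\d+")


def parse_score(reply: Optional[str], default: float = 0.0) -> float:
    """Pull the first number out of a model reply and clamp it to [0.0, 1.0]."""
    if not reply:
        return default
    match = _NUMBER.search(reply)
    if match is None:
        return default
    return max(0.0, min(1.0, float(match.group())))
```

Taking the first number is a simplification: a reply that mentions another number before the score would be misread, so the "return only a number" instruction in the prompt still matters.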
The Main Agent
Now I can put it all together:
from typing import List
from search import SearchEngine
from extractor import ContentExtractor
from scorer import RelevanceScorer
from models import ResearchMemory, Source

class DeepResearchAgent:
    def __init__(
        self,
        search_api_key: str = None,
        openai_api_key: str = None,
        min_relevance: float = 0.6,
        max_sources: int = 10
    ):
        self.memory = ResearchMemory()
        self.search = SearchEngine(api_key=search_api_key)
        self.extractor = ContentExtractor(self.memory)
        self.scorer = RelevanceScorer()
        self.min_relevance = min_relevance
        self.max_sources = max_sources

    def research(self, query: str) -> str:
        """Conduct deep research on a query."""
        print(f"Researching: {query}")

        # 1. Search for sources
        candidates = self.search.search(query, max_results=self.max_sources * 2)
        print(f"Found {len(candidates)} candidate sources")

        # 2. Extract and score each source
        for candidate in candidates:
            if self.memory.has_visited(candidate.url):
                continue

            content = self.extractor.fetch(candidate.url)
            if not content:
                continue

            # Score relevance
            relevance = self.scorer.score(content, query)

            if relevance >= self.min_relevance:
                self.memory.findings.append(Source(
                    url=candidate.url,
                    title=candidate.title,
                    content=content,
                    relevance_score=relevance
                ))
                print(f"  + {candidate.title} (relevance: {relevance:.2f})")

            self.memory.visited_urls.add(candidate.url)

            if len(self.memory.findings) >= self.max_sources:
                break

        # 3. Synthesize findings
        return self._synthesize(query)

    def _synthesize(self, query: str) -> str:
        """Combine findings into a coherent response."""
        if not self.memory.findings:
            return "No relevant sources found."

        # Build context from findings
        context_parts = []
        for i, source in enumerate(self.memory.findings, 1):
            context_parts.append(
                f"[{i}] {source.title}\n"
                f"URL: {source.url}\n"
                f"Content: {source.content[:500]}...\n"
            )

        context = "\n".join(context_parts)

        # Generate synthesis
        prompt = f"""Research question: {query}

Sources:
{context}

Please provide a comprehensive answer with citations [1], [2], etc."""

        # Using OpenAI for synthesis
        from openai import OpenAI
        client = OpenAI()
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )

        return response.choices[0].message.content

Using the Agent
Here’s how I use it:
from agent import DeepResearchAgent

# Initialize
agent = DeepResearchAgent(
    min_relevance=0.6,  # Only include sources with 60%+ relevance
    max_sources=10      # Maximum sources to include
)

# Research
result = agent.research("What are the best practices for building AI agents?")

print(result)
print("\nSources used:")
for i, source in enumerate(agent.memory.findings, 1):
    print(f"[{i}] {source.title} - {source.url}")

Running this gives me:
Researching: What are the best practices for building AI agents?
Found 20 candidate sources
  + Building Production AI Agents (relevance: 0.92)
  + AI Agent Architecture Patterns (relevance: 0.88)
  + LangGraph vs Raw Python (relevance: 0.85)
  + ...

[Synthesized response with citations]

Sources used:
[1] Building Production AI Agents - https://example.com/ai-agents
[2] AI Agent Architecture Patterns - https://example.com/patterns
...

What I Learned
Building this from scratch taught me more than using LangGraph ever would have:
- State management is simple - A few dataclasses handle everything
- Caching matters - Without it, I’d hit rate limits fast
- Relevance filtering is critical - Raw search results are noisy
- The loop is the framework - Search, extract, score, synthesize - that’s it
The total code is under 300 lines. Compare that to learning LangGraph’s concepts: StateGraph, Node, Edge, ConditionalEdge, checkpointer, memory…
When to Use Frameworks
Frameworks like LangGraph aren’t bad. They’re great for:
- Teams that need standardization
- Complex multi-agent systems
- When you need persistence and human-in-the-loop
- Quick prototyping without understanding internals
But for a portfolio project showing deep understanding? Raw Python wins.
Next Steps
I’m extending this agent with:
- Recursive link following - Extract links from pages, follow relevant ones
- Semantic deduplication - Use embeddings to avoid similar content
- Parallel extraction - Use asyncio to fetch multiple URLs simultaneously
- Cost tracking - Monitor LLM API costs per research session
The beauty of raw Python: I can add any of these without fighting a framework.
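As an illustration of the parallel extraction item, here is one way it could be sketched with `asyncio`. The assumptions: the existing blocking `fetch` method is reused by running it in worker threads through `asyncio.to_thread` (Python 3.9+), and `fetch_all` and `max_concurrency` are hypothetical names that do not exist in the current code.

```python
import asyncio
from typing import Callable, Dict, List, Optional


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Optional[str]],
    max_concurrency: int = 5,
) -> Dict[str, Optional[str]]:
    """Run a blocking fetch function over many URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[str]:
        async with semaphore:
            # to_thread keeps the event loop free while the blocking call runs.
            return await asyncio.to_thread(fetch, url)

    # gather preserves input order, so results line up with urls.
    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(zip(urls, results))
```

A call might look like `pages = asyncio.run(fetch_all(urls, extractor.fetch))`. The shared `ResearchMemory` would then be touched from several threads, so that is worth reviewing before relying on this approach.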
Final Words + More Resources
I wrote this article to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email: Email me
Here are the most important links from this article, along with some further resources on the topic:
- 👨‍💻 Reddit: Raw Python vs LangGraph for Deep Research Agent
- 👨‍💻 Tavily Search API
- 👨‍💻 Beautiful Soup Documentation
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!