How to Build a Deep Research Agent from Scratch in Python

Mar 18, 2026

I wanted to build a deep research agent - something like Perplexity or OpenAI’s deep research feature. As a side project to showcase my skills for a founding engineer role, I needed to decide: should I use LangGraph, or build it with raw Python?

After posting on r/LangChain, the community response surprised me. The consensus was clear: for a portfolio project targeting startup roles, raw Python with minimal abstraction is actually the stronger signal. It shows I understand what the framework is doing underneath.

Problem

I started with LangGraph because everyone said it’s the “modern” way to build agents. Here’s what I tried first:

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from typing import TypedDict

class AgentState(TypedDict):
    query: str
    sources: list
    findings: list

def search_node(state: AgentState):
    # LangGraph handles state management
    # (search_api stands in for whichever search client is configured)
    results = search_api.query(state["query"])
    return {"sources": results}

def extract_node(state: AgentState):
    # But what if I need custom caching?
    # What if I want to track visited URLs differently?
    # The framework fights me at every turn.
    pass

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("search", search_node)
workflow.add_node("extract", extract_node)
# ... more nodes

I got frustrated quickly. Every time I wanted to customize something - like my own caching strategy or URL tracking - I had to work around the framework. The abstraction was helping with simple cases but hurting with real complexity.

The Realization

One comment on my Reddit post hit home:

“Agents run tools in loop to achieve a goal, that’s it. You will need to handroll some of the scaffolding… but I promise you that it’s worthwhile (and you will have a much better understanding of how these things work)”

Another pointed out:

“For a side project showcasing skills, raw Python with minimal abstraction is actually the stronger signal to a founding engineer hiring manager. It shows you understand what the framework is doing underneath.”

That’s when I realized: I was optimizing for the wrong thing. I should optimize for understanding and control, not framework adoption.

Solution: Raw Python Architecture

I rebuilt the agent from scratch. The architecture is simple:

Search - Query search APIs, get URLs
Extract - Fetch URLs, extract text content
Remember - Track visited URLs, cache results
Synthesize - Combine findings, cite sources
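
The four stages above can be sketched as one plain loop. This is a minimal illustration rather than the final agent: `run_pipeline` and the three stub functions are hypothetical names, and the stubs stand in for real search, fetch, and LLM calls.

```python
from typing import Callable, Dict, List, Set


def run_pipeline(
    query: str,
    search: Callable[[str], List[str]],
    extract: Callable[[str], str],
    synthesize: Callable[[str, Dict[str, str]], str],
) -> str:
    visited: Set[str] = set()       # Remember: URLs already handled
    findings: Dict[str, str] = {}   # Remember: extracted text per URL

    for url in search(query):       # Search
        if url in visited:
            continue
        visited.add(url)
        text = extract(url)         # Extract
        if text:
            findings[url] = text

    return synthesize(query, findings)  # Synthesize


# Stubs for illustration only (no network or LLM calls)
PAGES = {"https://example.com/a": "Text A", "https://example.com/b": ""}


def fake_search(query: str) -> List[str]:
    # The duplicate URL shows why visited-URL tracking is needed
    return ["https://example.com/a", "https://example.com/b", "https://example.com/a"]


def fake_extract(url: str) -> str:
    return PAGES.get(url, "")


def fake_synthesize(query: str, findings: Dict[str, str]) -> str:
    cites = ", ".join(f"[{i}] {url}" for i, url in enumerate(findings, 1))
    return f"{query} -> {cites}"


report = run_pipeline("demo", fake_search, fake_extract, fake_synthesize)
print(report)
```

Passing the stages in as callables keeps the loop testable without API keys; the real components slot in once they exist.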

Core Data Structures

First, I defined the data structures:

from dataclasses import dataclass, field
from typing import List, Optional
import hashlib

@dataclass
class Source:
    url: str
    title: str
    content: str
    relevance_score: float = 0.0

@dataclass
class ResearchMemory:
    visited_urls: set = field(default_factory=set)
    content_cache: dict = field(default_factory=dict)
    findings: List[Source] = field(default_factory=list)

    def has_visited(self, url: str) -> bool:
        return url in self.visited_urls

    def get_cached(self, url: str) -> Optional[str]:
        # The cache key is a hash of the URL, not of the page content
        url_key = hashlib.md5(url.encode()).hexdigest()
        return self.content_cache.get(url_key)

    def cache_content(self, url: str, content: str):
        url_key = hashlib.md5(url.encode()).hexdigest()
        self.content_cache[url_key] = content

These dataclasses give me exactly what I need: tracking for visited URLs, caching for extracted content, and a list of findings.
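
A quick usage sketch of the memory object, assuming the class defined above. A trimmed copy of `ResearchMemory` is repeated so the snippet runs on its own, and the URL is a placeholder.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class ResearchMemory:
    # Trimmed copy of the class above so this snippet runs standalone
    visited_urls: Set[str] = field(default_factory=set)
    content_cache: Dict[str, str] = field(default_factory=dict)

    def has_visited(self, url: str) -> bool:
        return url in self.visited_urls

    def get_cached(self, url: str) -> Optional[str]:
        return self.content_cache.get(hashlib.md5(url.encode()).hexdigest())

    def cache_content(self, url: str, content: str) -> None:
        self.content_cache[hashlib.md5(url.encode()).hexdigest()] = content


memory = ResearchMemory()
url = "https://example.com/post"  # placeholder URL

first_lookup = memory.get_cached(url)    # cache miss, nothing stored yet
memory.cache_content(url, "Extracted text")
memory.visited_urls.add(url)
second_lookup = memory.get_cached(url)   # cache hit

# default_factory gives every instance its own set and dict,
# so a second research session starts with empty state
fresh = ResearchMemory()
print(first_lookup, second_lookup, fresh.has_visited(url))
```

Note that `field(default_factory=...)` is what keeps two memory objects from sharing one set or dict; a bare mutable default is rejected by `dataclass` for exactly that reason.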

The Search Component

For search, I chose Tavily because it’s purpose-built for AI agents. It returns structured results with relevance scores:

import os
from typing import List, Optional
from tavily import TavilyClient
from models import Source

class SearchEngine:
    def __init__(self, api_key: Optional[str] = None):
        self.client = TavilyClient(api_key=api_key or os.getenv("TAVILY_API_KEY"))

    def search(self, query: str, max_results: int = 10) -> List[Source]:
        """Search for relevant sources."""
        response = self.client.search(
            query=query,
            search_depth="advanced",
            max_results=max_results
        )

        sources = []
        for result in response.get("results", []):
            sources.append(Source(
                url=result["url"],
                title=result["title"],
                content=result.get("content", ""),
                relevance_score=result.get("score", 0.0)
            ))

        return sources

I tried SerpAPI first, but Tavily’s structured output saved me parsing time.

Content Extraction

This is where things got interesting. I needed to extract text from URLs, handle different content types, and deal with JavaScript-rendered pages.

from bs4 import BeautifulSoup
import requests
from typing import Optional
from models import ResearchMemory

class ContentExtractor:
    def __init__(self, memory: ResearchMemory, timeout: int = 10):
        self.memory = memory
        self.timeout = timeout

    def fetch(self, url: str) -> Optional[str]:
        """Fetch and extract text from a URL."""
        # Check cache first (an empty string is still a valid cached result)
        cached = self.memory.get_cached(url)
        if cached is not None:
            return cached

        try:
            response = requests.get(
                url,
                timeout=self.timeout,
                headers={"User-Agent": "Mozilla/5.0 Research Agent"}
            )
            response.raise_for_status()

            # Parse HTML
            soup = BeautifulSoup(response.text, 'html.parser')

            # Remove scripts, styles, and page chrome (nav, footer, header)
            for element in soup(["script", "style", "nav", "footer", "header"]):
                element.decompose()

            # Extract text from paragraphs
            paragraphs = soup.find_all('p')
            text = ' '.join(p.get_text(strip=True) for p in paragraphs)

            # Cache the result
            self.memory.cache_content(url, text)

            return text

        except requests.RequestException as e:
            print(f"Failed to fetch {url}: {e}")
            return None

The key insight: remove navigation, scripts, and styles before extracting text. Otherwise you get menu items mixed with content.
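
To see the effect in isolation, here is a standard-library-only sketch of the same idea. It swaps BeautifulSoup for `html.parser` so it runs without extra installs; `ParagraphText`, `extract_text`, and the sample HTML are illustrative names and data, and the parser assumes well-formed markup with explicitly closed `<p>` tags.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "nav", "footer", "header"}


class ParagraphText(HTMLParser):
    """Collect text inside <p> tags, ignoring anything nested in SKIP_TAGS."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a skipped element
        self.p_depth = 0     # > 0 while inside a <p>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag == "p":
            self.p_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag == "p" and self.p_depth:
            self.p_depth -= 1

    def handle_data(self, data):
        # Keep text only when inside a paragraph and outside page chrome
        if self.p_depth and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


SAMPLE_HTML = """
<html><body>
<nav><p>Home | About | Pricing</p></nav>
<article><p>Agents run tools in a loop.</p><p>Caching avoids refetching.</p></article>
<footer><p>Copyright notice</p></footer>
</body></html>
"""

print(extract_text(SAMPLE_HTML))
```

The menu and footer paragraphs are dropped even though they sit inside `<p>` tags, which is the case that a paragraphs-only filter would miss.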

For JavaScript-rendered pages, I added Playwright as a fallback:

from typing import Optional

from playwright.sync_api import sync_playwright

class JSContentExtractor:
    def fetch(self, url: str, wait_time: int = 2000) -> Optional[str]:
        """Fetch JavaScript-rendered pages."""
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()

            try:
                page.goto(url, timeout=30000)
                page.wait_for_timeout(wait_time)

                # Get text content
                text = page.inner_text("body")
                return text

            except Exception as e:
                print(f"JS extraction failed: {e}")
                return None

            finally:
                browser.close()

I don’t use this for every page - only when the static extractor returns empty content.
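
The fallback rule can be sketched as a small wrapper. This is one possible wiring, not necessarily how the agent does it: `fetch_with_fallback` is a hypothetical helper, and the two stub fetchers stand in for `ContentExtractor.fetch` and `JSContentExtractor.fetch`.

```python
from typing import Callable, List, Optional

Fetcher = Callable[[str], Optional[str]]


def fetch_with_fallback(url: str, static_fetch: Fetcher, js_fetch: Fetcher) -> Optional[str]:
    """Try the cheap static fetch first; render in a browser only if it yields nothing."""
    text = static_fetch(url)
    if text and text.strip():
        return text
    return js_fetch(url)


# Stub fetchers for illustration only (no network or browser involved)
js_calls: List[str] = []


def fake_static(url: str) -> Optional[str]:
    return "Static article text" if url.endswith("/static") else ""


def fake_js(url: str) -> Optional[str]:
    js_calls.append(url)
    return "Rendered text"


static_result = fetch_with_fallback("https://example.com/static", fake_static, fake_js)
spa_result = fetch_with_fallback("https://example.com/spa", fake_static, fake_js)
print(static_result, "|", spa_result, "|", js_calls)
```

Because launching a headless browser is far slower than a plain HTTP request, keeping it behind this check means most pages never pay that cost.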

Relevance Scoring

Not every search result is relevant. I needed to filter out noise:

from openai import OpenAI
from models import Source

class RelevanceScorer:
    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()
        self.model = model

    def score(self, content: str, query: str) -> float:
        """Score content relevance to query (0.0 to 1.0)."""
        prompt = f"""Rate the relevance of this content to the research query.

Query: {query}

Content: {content[:1000]}

Return only a number from 0.0 to 1.0 where:
- 1.0 = Directly answers the query
- 0.5 = Partially relevant
- 0.0 = Not relevant"""

        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )

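        try:
            score = float(response.choices[0].message.content.strip())
        except (AttributeError, ValueError):
            # The reply was empty or not a bare number
            return 0.0

        # Clamp in case the model returns a value outside the requested range
        return max(0.0, min(1.0, score))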

Using GPT-4o-mini keeps costs down. I only send the first 1000 characters - enough to judge relevance without burning tokens.
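
Models do not always reply with a bare number, so a more forgiving parser can help. The sketch below is an optional hardening step, not part of the scorer above: `parse_score` is a hypothetical helper that takes the first number it finds in the reply and clamps it to the 0.0 to 1.0 range.

```python
import re
from typing import Optional

# Matches forms such as "1", "0.85", ".5", or "-0.2"
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_score(reply: Optional[str], default: float = 0.0) -> float:
    """Extract a relevance score from an LLM reply, falling back to `default`."""
    if not reply:
        return default
    match = _NUMBER.search(reply)
    if match is None:
        return default
    # Clamp so out-of-range replies cannot slip past the threshold check
    return max(0.0, min(1.0, float(match.group())))


for reply in ["0.85", "Relevance: 0.7", "1.5", "not sure", None]:
    print(repr(reply), "->", parse_score(reply))
```

One caveat: the helper reads the first number only, so a reply like "7 out of 10" would be clamped to 1.0. Keeping the prompt strict about the output format remains the first line of defense.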

The Main Agent

Now I can put it all together:

import os
from typing import List
from search import SearchEngine
from extractor import ContentExtractor
from scorer import RelevanceScorer
from models import ResearchMemory, Source

class DeepResearchAgent:
    def __init__(
        self,
        search_api_key: str = None,
        openai_api_key: str = None,
        min_relevance: float = 0.6,
        max_sources: int = 10
    ):
        # OpenAI() reads OPENAI_API_KEY from the environment, so exporting an
        # explicitly passed key makes every client created afterwards use it.
        if openai_api_key:
            os.environ["OPENAI_API_KEY"] = openai_api_key

        self.memory = ResearchMemory()
        self.search = SearchEngine(api_key=search_api_key)
        self.extractor = ContentExtractor(self.memory)
        self.scorer = RelevanceScorer()
        self.min_relevance = min_relevance
        self.max_sources = max_sources

    def research(self, query: str) -> str:
        """Conduct deep research on a query."""
        print(f"Researching: {query}")

        # 1. Search for sources
        candidates = self.search.search(query, max_results=self.max_sources * 2)
        print(f"Found {len(candidates)} candidate sources")

        # 2. Extract and score each source
        for candidate in candidates:
            if self.memory.has_visited(candidate.url):
                continue

            content = self.extractor.fetch(candidate.url)
            if not content:
                continue

            # Score relevance
            relevance = self.scorer.score(content, query)

            if relevance >= self.min_relevance:
                self.memory.findings.append(Source(
                    url=candidate.url,
                    title=candidate.title,
                    content=content,
                    relevance_score=relevance
                ))
                print(f"  + {candidate.title} (relevance: {relevance:.2f})")

            self.memory.visited_urls.add(candidate.url)

            if len(self.memory.findings) >= self.max_sources:
                break

        # 3. Synthesize findings
        return self._synthesize(query)

    def _synthesize(self, query: str) -> str:
        """Combine findings into a coherent response."""
        if not self.memory.findings:
            return "No relevant sources found."

        # Build context from findings
        context_parts = []
        for i, source in enumerate(self.memory.findings, 1):
            context_parts.append(
                f"[{i}] {source.title}\n"
                f"URL: {source.url}\n"
                f"Content: {source.content[:500]}...\n"
            )

        context = "\n".join(context_parts)

        # Generate synthesis
        prompt = f"""Research question: {query}

Sources:
{context}

Please provide a comprehensive answer with citations [1], [2], etc."""

        # Reuse the scorer's OpenAI client for synthesis
        client = self.scorer.client
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.3
        )

        return response.choices[0].message.content

Using the Agent

Here’s how I use it:

from agent import DeepResearchAgent

# Initialize
agent = DeepResearchAgent(
    min_relevance=0.6,  # Only include sources with 60%+ relevance
    max_sources=10       # Maximum sources to include
)

# Research
result = agent.research("What are the best practices for building AI agents?")

print(result)
print("\nSources used:")
for i, source in enumerate(agent.memory.findings, 1):
    print(f"[{i}] {source.title} - {source.url}")

Running this gives me:

Researching: What are the best practices for building AI agents?
Found 20 candidate sources
  + Building Production AI Agents (relevance: 0.92)
  + AI Agent Architecture Patterns (relevance: 0.88)
  + LangGraph vs Raw Python (relevance: 0.85)
  + ...

[Synthesized response with citations]

Sources used:
[1] Building Production AI Agents - https://example.com/ai-agents
[2] AI Agent Architecture Patterns - https://example.com/patterns
...

What I Learned

Building this from scratch taught me more than using LangGraph ever would:

State management is simple - A few dataclasses handle everything
Caching matters - Without it, I’d hit rate limits fast
Relevance filtering is critical - Raw search results are noisy
The loop is the framework - Search, extract, score, synthesize - that’s it

The total code is under 300 lines. Compare that to learning LangGraph’s concepts: StateGraph, nodes, edges, conditional edges, checkpointers, memory…

When to Use Frameworks

Frameworks like LangGraph aren’t bad. They’re great for:

Teams that need standardization
Complex multi-agent systems
When you need persistence and human-in-the-loop
Quick prototyping without understanding internals

But for a portfolio project showing deep understanding? Raw Python wins.

Next Steps

I’m extending this agent with:

Recursive link following - Extract links from pages, follow relevant ones
Semantic deduplication - Use embeddings to avoid similar content
Parallel extraction - Use asyncio to fetch multiple URLs simultaneously
Cost tracking - Monitor LLM API costs per research session

The beauty of raw Python: I can add any of these without fighting a framework.
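
As an illustration of the parallel extraction item, here is a minimal sketch using `asyncio.to_thread` (Python 3.9+) to run a blocking fetch function concurrently. `fetch_all`, `max_concurrency`, and `slow_fetch` are hypothetical names; `slow_fetch` simulates a network call and would be replaced by something like `ContentExtractor.fetch`.

```python
import asyncio
import time
from typing import Callable, Dict, List, Optional


async def fetch_all(
    urls: List[str],
    fetch: Callable[[str], Optional[str]],
    max_concurrency: int = 5,
) -> Dict[str, Optional[str]]:
    """Run a blocking fetch function for many URLs with bounded concurrency."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def fetch_one(url: str) -> Optional[str]:
        async with semaphore:
            # to_thread runs the blocking call in a worker thread
            return await asyncio.to_thread(fetch, url)

    # gather preserves input order, so results line up with urls
    results = await asyncio.gather(*(fetch_one(u) for u in urls))
    return dict(zip(urls, results))


def slow_fetch(url: str) -> Optional[str]:
    time.sleep(0.1)  # stands in for network latency
    return f"content of {url}"


urls = [f"https://example.com/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls, slow_fetch))
print(len(pages), "pages fetched")
```

The semaphore caps how many requests are in flight at once, which matters for staying under rate limits. If the fetch function writes to a shared cache, those writes happen from worker threads, so that part deserves a second look before relying on it.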

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can reach me by email.

Here are the most important links from this article, along with some further resources on this topic:

👨‍💻 Reddit: Raw Python vs LangGraph for Deep Research Agent
👨‍💻 Tavily Search API
👨‍💻 Beautiful Soup Documentation

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!