How Do You Chunk Documents Correctly for RAG Systems? A Practical Guide
Problem
When I built my first RAG system, I thought chunking was simple—just split text into pieces and embed them. But then I started getting weird results. My retrieval would return chunks that seemed related but missed the actual answer. Or it would return fragments without enough context to be useful.
The real problem hit me when a user asked about a specific policy, and my RAG system returned three different chunks—each containing part of the answer, but none complete enough to be helpful.
Environment
- Python 3.11
- LangChain for text splitting
- OpenAI embeddings (text-embedding-3-small)
- Pinecone vector database
What happened?
I used the default approach from most tutorials:
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
    )
    chunks = text_splitter.split_text(document)

This approach treated all documents the same way. Technical documentation, legal contracts, conversational logs: everything got split at 1000 characters with no overlap.
The results were predictable in hindsight:
- Semantic meaning destroyed: A sentence like “The refund policy allows returns within 30 days, but only for items in original packaging” got split across two chunks. Neither chunk alone explained the full policy.
- Context lost: Chunks had no connection to their source document structure. A chunk from section 3.2 had no indication it was part of a larger policy discussion.
- Wrong retrieval: Users searching for “refund policy” would get chunks mentioning “refund” or “policy” but not necessarily the actual policy section.
A comment from u/Lucky-Duck-2968 on Reddit captured it well: “Chunking sounds simple until you realize bad splits destroy meaning.”
How to solve it?
I learned that the chunking strategy has to match the document type. There’s no universal approach.
Step 1: Identify Your Document Type
    ┌─────────────────┐
    │ Structured      │ → Code, Markdown, JSON, XML
    └─────────────────┘
             ↓
    ┌─────────────────┐
    │ Semi-structured │ → PDFs with sections, emails, chat logs
    └─────────────────┘
             ↓
    ┌─────────────────┐
    │ Unstructured    │ → Plain text articles, transcripts
    └─────────────────┘

Step 2: Choose the Right Strategy
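Choosing a strategy starts from the classification in Step 1. As a rough sketch, that classification can be a small lookup keyed on file extension. The extension lists and the names `DOC_TYPES`, `STRATEGIES`, `classify`, and `pick_strategy` are illustrative assumptions, not part of any library, and chat logs or transcripts will usually need a content-based check rather than an extension.

```python
from pathlib import Path

# Hypothetical extension lists; adjust them to the files you actually ingest.
DOC_TYPES = {
    "structured": {".py", ".md", ".json", ".xml"},
    "semi-structured": {".pdf", ".eml"},
    "unstructured": {".txt"},
}

# Strategy labels mirror the three approaches described in this post.
STRATEGIES = {
    "structured": "structure-aware",
    "semi-structured": "recursive with overlap",
    "unstructured": "semantic",
}


def classify(filename: str) -> str:
    """Return the document type for a filename, defaulting to unstructured."""
    suffix = Path(filename).suffix.lower()
    for doc_type, suffixes in DOC_TYPES.items():
        if suffix in suffixes:
            return doc_type
    return "unstructured"


def pick_strategy(filename: str) -> str:
    """Map a filename to the name of the chunking strategy to apply."""
    return STRATEGIES[classify(filename)]
```

Keeping the mapping in data rather than in branching code makes it easy to add a document type later without touching the splitting logic.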
For structured documents (like Markdown), I now use structure-aware chunking:
    from langchain.text_splitter import MarkdownHeaderTextSplitter

    markdown_splitter = MarkdownHeaderTextSplitter(
        headers_to_split_on=[("#", "h1"), ("##", "h2"), ("###", "h3")]
    )
    chunks = markdown_splitter.split_text(document)

This preserves section headers in metadata, so each chunk knows its context.
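To make "headers in metadata" concrete, here is a simplified pure-Python stand-in for the idea. It is not LangChain's implementation: `split_by_headers` and the `{"text", "metadata"}` dict shape are assumptions for illustration, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re

# Matches Markdown headers of level 1 to 3, e.g. "## Refunds".
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Split Markdown at #, ##, ### headers and record the header path per chunk."""
    chunks = []
    path = {}   # current header path, e.g. {"h1": "Policies", "h2": "Refunds"}
    body = []   # lines collected since the last header

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({"text": text, "metadata": dict(path)})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces any header at the same or a deeper level.
            for key in [k for k in path if int(k[1]) >= level]:
                del path[key]
            path[f"h{level}"] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Policies\n\nIntro text.\n\n## Refunds\n\nReturns within 30 days.\n\n## Shipping\n\nShips in 2 days.\n"
for chunk in split_by_headers(doc):
    print(chunk["metadata"], "|", chunk["text"])
```

Each chunk carries the full header path, so a chunk from a subsection can still be traced back to the section it belongs to.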
For semi-structured documents, I use overlapping chunks:
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,  # 20% overlap
        separators=["\n\n", "\n", ". ", " ", ""],
    )
    chunks = text_splitter.split_text(document)

The overlap repeats the tail of each chunk at the start of the next, so a sentence that straddles a boundary is more likely to survive intact in at least one chunk.
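The effect of overlap is easy to check on the refund sentence from earlier. The sketch below uses plain fixed-size character windows, which is an assumption made to keep the example small; `RecursiveCharacterTextSplitter` additionally prefers to cut at the listed separators. The 50-character window and the name `split_with_overlap` are illustrative.

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Cut text into windows of `size` characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]


sentence = (
    "The refund policy allows returns within 30 days, "
    "but only for items in original packaging."
)
# This phrase links the rule to its condition and sits across the 50-character mark.
phrase = "within 30 days, but only"

no_overlap = split_with_overlap(sentence, size=50, overlap=0)
with_overlap = split_with_overlap(sentence, size=50, overlap=20)

print(any(phrase in chunk for chunk in no_overlap))    # False
print(any(phrase in chunk for chunk in with_overlap))  # True
```

Without overlap the phrase is cut in two; with a 20-character overlap the second window contains it whole.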
For unstructured text, I consider semantic chunking:
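    from semantic_text_splitter import TextSplitter

    # In the semantic-text-splitter releases I am aware of, the capacity is passed
    # positionally and chunks() does the splitting; check your installed version.
    splitter = TextSplitter(500)
    # Cuts at the largest natural boundary (paragraph, sentence, word) that fits in 500 characters
    chunks = splitter.chunks(document)

Step 3: Use Parent-Child Chunking for Context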
For documents where context matters, I implemented parent-child chunking:
    class ParentChildChunker:
        def __init__(self, parent_size=2000, child_size=400, overlap=50):
            self.parent_size = parent_size
            self.child_size = child_size
            self.overlap = overlap

        def _split_by_size(self, text, size, overlap=0):
            # Helper not shown in the original snippet; a plain sliding character
            # window is assumed here. Each window starts (size - overlap) after the last.
            step = size - overlap
            return [text[i:i + size] for i in range(0, len(text), step)]

        def chunk(self, document):
            # Large parent chunks for context
            parents = self._split_by_size(document, self.parent_size)
            # Small child chunks for retrieval
            children = []
            for parent in parents:
                child_chunks = self._split_by_size(parent, self.child_size, self.overlap)
                for child in child_chunks:
                    children.append({
                        "child_text": child,
                        "parent_text": parent,  # Full context available
                    })
            return children

The retrieval happens on small child chunks, but the LLM gets the full parent context.
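One detail worth handling at query time is that several retrieved children often share the same parent, and that parent only needs to reach the LLM once. A minimal sketch, assuming search results arrive as a ranked list of the `{"child_text", "parent_text"}` dicts produced above (the function name `parents_for_hits` and the sample hits are hypothetical):

```python
def parents_for_hits(hits: list[dict]) -> list[str]:
    """Return each matched child's parent once, in the order the hits were ranked."""
    seen = set()
    parents = []
    for hit in hits:
        parent = hit["parent_text"]
        if parent not in seen:
            seen.add(parent)
            parents.append(parent)
    return parents


# Hypothetical search results: two children from one parent, one from another.
hits = [
    {"child_text": "returns within 30 days", "parent_text": "Section 3.2: Refunds ..."},
    {"child_text": "original packaging", "parent_text": "Section 3.2: Refunds ..."},
    {"child_text": "ships in 2 days", "parent_text": "Section 4.1: Shipping ..."},
]

# The prompt context is built from unique parents, not from the raw child hits.
context = "\n\n".join(parents_for_hits(hits))
print(context)
```

Deduplicating by parent keeps the prompt short and avoids giving one section extra weight just because it produced more matching children.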
Step 4: Test Your Chunking Strategy
I found the chunky tool helpful for analyzing chunking before deploying:
    pip install chunky
    chunky analyze my_document.pdf --strategy recursive --chunk-size 1000

u/Just-Message-9899 recommended this on Reddit, and it saved me from deploying a bad chunking strategy.
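Independent of any tool, retrieval quality can be checked with a handful of question and expected-answer pairs. The sketch below computes a simple hit rate: the share of questions for which one of the top-k chunks contains the expected text. The names `hit_rate` and `keyword_retriever` are assumptions for illustration, and the keyword retriever is only a toy stand-in; in a real setup `retrieve` would wrap the vector store query.

```python
from typing import Callable


def hit_rate(
    cases: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of (question, expected_text) cases where a top-k chunk contains expected_text."""
    if not cases:
        return 0.0
    hits = 0
    for question, expected in cases:
        top = retrieve(question)[:k]
        if any(expected.lower() in chunk.lower() for chunk in top):
            hits += 1
    return hits / len(cases)


def keyword_retriever(chunks: list[str]) -> Callable[[str], list[str]]:
    """Toy stand-in for vector search: rank chunks by number of words shared with the query."""
    def retrieve(query: str) -> list[str]:
        query_words = set(query.lower().split())
        return sorted(
            chunks,
            key=lambda chunk: len(query_words & set(chunk.lower().split())),
            reverse=True,
        )
    return retrieve


chunks = [
    "Returns are accepted within 30 days of purchase.",
    "Shipping takes 2 business days.",
    "Contact support by email.",
]
cases = [
    ("how many days for returns", "30 days"),
    ("how long does shipping take", "2 business days"),
]
print(hit_rate(cases, keyword_retriever(chunks), k=1))
```

Running the same cases against two chunking configurations gives a quick, comparable number before anything is deployed.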
The reason
I think the key reason chunking fails is that tutorials treat it as a preprocessing step rather than a core design decision. u/yafitzdev on Reddit said it well: “I figured each doc type needed a different retrieval harness altogether.”
The problems I had:
- One-size-fits-all approach: I used the same strategy for all documents.
- No overlap: Information at chunk boundaries got lost.
- Ignoring structure: I split purely by character count, not document organization.
- No testing: I deployed without verifying retrieval quality.
Summary
In this post, I showed how to match chunking strategies to document types. The key point is that chunking isn’t a preprocessing step—it’s a core design decision that affects your entire RAG system.
For structured documents, respect their structure. For unstructured text, use semantic boundaries. And always test your retrieval quality before deploying.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Reddit Discussion: Most RAG tutorials are misleading
- 👨💻 chunky - Data extraction and chunking analysis tool
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!