How to Fix RAG Chunking Breaking at Wrong Boundaries
My RAG system was returning garbage. The retrieved chunks contained half-sentences, broken context, and incomplete thoughts. When I asked about a specific feature, the LLM gave me answers that made no sense.
The problem wasn’t the embedding model. It wasn’t the retrieval algorithm. It was the chunking.
The Problem: Dumb Chunking
I had started with the simplest approach—just split text every N characters:
    # DON'T DO THIS
    def split_text(text, chunk_size=1000):
        return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

This is what I got:
    Chunk 1: "...the most important feature of this system is the ability to process"
    Chunk 2: " large documents efficiently. However, when the documents contain"
    Chunk 3: " complex structures, you need to use a smarter approach..."

See the problem? The chunk broke right in the middle of “process large documents.” The semantic meaning was destroyed.
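To make the failure measurable, here is a minimal stdlib-only sketch that counts how many fixed-size cuts land inside a word. The helper names `split_fixed` and `count_mid_word_cuts` are made up for this illustration, and the sample text and chunk size are arbitrary.

```python
def split_fixed(text: str, chunk_size: int) -> list[str]:
    """Naive splitter: cut every chunk_size characters, ignoring structure."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def count_mid_word_cuts(text: str, chunk_size: int) -> int:
    """Count cut positions that fall between two non-whitespace characters."""
    cuts = 0
    for pos in range(chunk_size, len(text), chunk_size):
        # A cut at pos separates text[pos - 1] from text[pos].
        if not text[pos - 1].isspace() and not text[pos].isspace():
            cuts += 1
    return cuts


sample = (
    "The most important feature of this system is the ability to process "
    "large documents efficiently. However, when the documents contain "
    "complex structures, you need to use a smarter approach."
)
chunks = split_fixed(sample, 40)
print(len(chunks), "chunks,", count_mid_word_cuts(sample, 40), "cut mid-word")
```

Running the same count on your own corpus gives a rough sense of how often a fixed-size splitter damages word boundaries before you change anything else.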
The Solution: RecursiveCharacterTextSplitter
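LangChain’s RecursiveCharacterTextSplitter solves this by trying to split at natural boundaries, in order of preference. With the separator list configured below, that order is:
- Paragraph breaks ("\n\n"), highest priority
- Line breaks ("\n")
- Sentence boundaries (". ")
- Word boundaries (" ")
- Character level (""), fallback only
The library's default list is ["\n\n", "\n", " ", ""], so the sentence separator has to be passed explicitly.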
Here’s the correct setup:
    # Newer LangChain releases expose this class from the langchain_text_splitters package.
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=[
            "\n\n",  # Paragraph breaks (try this first)
            "\n",    # Line breaks (then this)
            ". ",    # Sentence boundaries (then this)
            " ",     # Word boundaries (then this)
            "",      # Character-level (last resort)
        ],
        length_function=len,
    )

The splitter tries each separator in order. If a chunk is still too big after splitting by paragraphs, it tries line breaks, then sentences, and so on. Each chunk therefore ends at the largest natural boundary that still fits within chunk_size.
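The fallback behavior is easier to see in a stripped-down form. The sketch below is not LangChain's implementation: it is a simplified stdlib-only illustration of the same idea, with no overlap and no regex separators. The function name `recursive_split` is hypothetical, and the way it keeps each separator attached to the end of the preceding piece is a simplification chosen for this sketch.

```python
def recursive_split(text: str, chunk_size: int, separators: list[str]) -> list[str]:
    """Split text into chunks of at most chunk_size characters,
    preferring separators that appear earlier in the list."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    # Keep the separator attached to each piece so no text is lost.
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > chunk_size:
            # This piece is too big on its own: flush, then try finer separators.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, rest))
        elif len(current) + len(piece) <= chunk_size:
            current += piece  # Greedily pack small pieces together.
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = (
    "Para one. It has two sentences.\n\n"
    "Para two is here.\n\n"
    "A third paragraph that is quite a bit longer than the others "
    "and must be split further."
)
for chunk in recursive_split(text, 40, ["\n\n", "\n", ". ", " ", ""]):
    print(repr(chunk))
```

Short paragraphs come out whole, and only the long third paragraph falls through to the finer separators, which is the behavior the real splitter is designed around.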
Why Chunk Overlap is Critical
The chunk_overlap=200 parameter was the game-changer I almost skipped.
Without overlap, a key sentence might straddle two chunks:
    Chunk 1: "...the system uses vector embeddings for semantic"
    Chunk 2: "search to find relevant documents..."

With 200-character overlap:

    Chunk 1: "...the system uses vector embeddings for semantic search to find relevant documents..."
    Chunk 2: "search to find relevant documents. The embeddings are stored in..."

Now the full sentence survives intact in Chunk 1, and Chunk 2 repeats its tail, so context carries across the boundary.
Rule of thumb: Set overlap to about 20% of chunk_size. For chunk_size=1000, use chunk_overlap=200.
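As a rough illustration of the rule of thumb, the sketch below derives the overlap from the chunk size and shows how a sliding window repeats the tail of one chunk at the head of the next. Both helpers (`overlap_for`, `sliding_chunks`) are hypothetical names for this example, and the window works on a list of words rather than on LangChain's separator-aware pieces.

```python
def overlap_for(chunk_size: int, fraction: float = 0.2) -> int:
    """Return the overlap for a chunk size, defaulting to the 20% rule of thumb."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return round(chunk_size * fraction)


def sliding_chunks(units: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Window over units so each chunk repeats the last `overlap` units of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # The last window already reached the end.
    return chunks


words = "the system uses vector embeddings for semantic search to find relevant documents".split()
for window in sliding_chunks(words, chunk_size=6, overlap=2):
    print(" ".join(window))
```

The 20% figure is a starting point from this article's experience, not a fixed rule, so it is worth re-checking retrieval quality after changing it.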
Token-Based Splitting for OpenAI Models
If you’re using OpenAI embeddings or models, character-based splitting isn’t accurate. You should count tokens instead:
    import tiktoken
    from langchain.text_splitter import RecursiveCharacterTextSplitter

    # Look up the tokenizer once instead of on every length check
    encoder = tiktoken.encoding_for_model("gpt-4")

    def tiktoken_len(text: str) -> int:
        """Return the number of tokens in text."""
        return len(encoder.encode(text))

    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,     # Now measured in tokens, not characters
        chunk_overlap=200,   # Also measured in tokens
        length_function=tiktoken_len,
        separators=["\n\n", "\n", ". ", " ", ""],
    )

This keeps chunk sizes in the same unit as the model's limits, so a 1000-token budget means 1000 tokens rather than roughly 1000 characters.
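If you want to confirm that no chunk exceeds your budget, a small audit helper is enough. This is a sketch under assumptions: `oversized_chunks` is a hypothetical helper, and `word_count` is only a stand-in so the example runs without extra packages. In practice you would pass the same token-counting function you gave the splitter.

```python
from typing import Callable


def oversized_chunks(
    chunks: list[str],
    limit: int,
    length_function: Callable[[str], int] = len,
) -> list[tuple[int, int]]:
    """Return (index, measured_length) for every chunk longer than limit."""
    offenders = []
    for index, chunk in enumerate(chunks):
        measured = length_function(chunk)
        if measured > limit:
            offenders.append((index, measured))
    return offenders


def word_count(text: str) -> int:
    """Stand-in length function; swap in a real token counter."""
    return len(text.split())


sample_chunks = ["vector search works", "embeddings are stored in a vector database"]
print(oversized_chunks(sample_chunks, limit=4, length_function=word_count))
```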
Full Example: PDF Document Chunking
    from langchain.text_splitter import RecursiveCharacterTextSplitter
    from langchain_community.document_loaders import PyPDFLoader

    # Load PDF
    loader = PyPDFLoader("technical_manual.pdf")
    documents = loader.load()

    # Configure splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ". ", " ", ""],
        length_function=len,
    )

    # Split documents
    chunks = text_splitter.split_documents(documents)

    print(f"Original: {len(documents)} pages")
    print(f"Split into: {len(chunks)} chunks")

    # Inspect first few chunks
    for i, chunk in enumerate(chunks[:3]):
        print(f"\n--- Chunk {i} ({len(chunk.page_content)} chars) ---")
        print(chunk.page_content[:200] + "...")

Common Mistakes I Made
1. Too little overlap
I started with chunk_overlap=50 for a 1000-character chunk. That’s only 5%. Information kept getting lost at boundaries. Going to 200 (20%) fixed it.
2. Wrong separator order
I had ". " before "\n". This meant the splitter tried to break at sentences before trying line breaks. For code or markdown documents, this destroyed formatting. Put "\n\n" and "\n" first.
3. Not tuning for document type
Markdown documents need different separators than plain text or code:
    # For Markdown
    markdown_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=[
            "\n## ",   # H2 headers (trailing space so it does not also match "###")
            "\n### ",  # H3 headers
            "\n\n",    # Paragraphs
            "\n",      # Lines
            ". ",      # Sentences
            " ",       # Words
            "",        # Characters
        ],
    )

How I Verified the Fix
After implementing the fix, I tested with a simple query:
Q: "What is the main benefit of vector embeddings?"
Before (bad chunking): “The system uses vector… [incomplete chunk]”
After (good chunking): “Vector embeddings capture semantic meaning, allowing the system to find relevant documents based on conceptual similarity rather than keyword matching.”
The retrieved chunks now contained complete, meaningful context. The LLM could generate accurate answers.
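A single query is a useful spot check, but a crude numeric check helps too. The sketch below is one possible heuristic, not the test used in this article: it measures the share of chunks that end on sentence-final punctuation. The names `ends_cleanly` and `clean_boundary_ratio` are hypothetical, and the punctuation set is an assumption you may need to adjust for your documents.

```python
SENTENCE_ENDINGS = (".", "!", "?", ":")


def ends_cleanly(chunk: str) -> bool:
    """True if the chunk, ignoring trailing whitespace, ends on sentence-final punctuation."""
    return chunk.rstrip().endswith(SENTENCE_ENDINGS)


def clean_boundary_ratio(chunks: list[str]) -> float:
    """Fraction of chunks that end at a sentence boundary (0.0 for an empty list)."""
    if not chunks:
        return 0.0
    return sum(ends_cleanly(c) for c in chunks) / len(chunks)


before = ["The system uses vector", "embeddings for semantic"]
after = ["Vector embeddings capture semantic meaning.", "They are stored in an index."]
print("before:", clean_boundary_ratio(before))
print("after:", clean_boundary_ratio(after))
```

Comparing the ratio before and after a splitter change gives a quick signal, though it says nothing about whether the retrieved chunks actually answer the query.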
Summary
The fix was straightforward:
- Use RecursiveCharacterTextSplitter, not simple character splitting
- Set separators to ["\n\n", "\n", ". ", " ", ""] for prose documents
- Set chunk_overlap to 20% of chunk_size
- Use token-based length function for OpenAI models
- Adjust separators for document type (markdown, code, etc.)
The key insight: chunk at natural boundaries, not arbitrary character positions. Your retrieval quality depends on it.
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- LangChain Text Splitters Documentation
- RecursiveCharacterTextSplitter API Reference
- Reddit Discussion: RAG Chunking Best Practices