
How to Fix RAG Chunking Breaking at Wrong Boundaries

My RAG system was returning garbage. The retrieved chunks contained half-sentences, broken context, and incomplete thoughts. When I asked about a specific feature, the LLM gave me answers that made no sense.

The problem wasn’t the embedding model. It wasn’t the retrieval algorithm. It was the chunking.

The Problem: Dumb Chunking

I had started with the simplest approach—just split text every N characters:

bad_chunker.py
# DON'T DO THIS
def split_text(text, chunk_size=1000):
    return [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]

This is what I got:

chunk_output.txt
Chunk 1: "...the most important feature of this system is the ability to process"
Chunk 2: " large documents efficiently. However, when the documents contain"
Chunk 3: " complex structures, you need to use a smarter approach..."

See the problem? The chunk broke right in the middle of “process large documents.” The semantic meaning was destroyed.
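
To check how often fixed-width splitting cuts inside a word, a quick script like the one below can help. It is a minimal sketch: `split_text` is the naive function from above, while `cuts_mid_word` and the sample sentence are illustrative names and data assumed here, not part of the original pipeline.

```python
def split_text(text, chunk_size=1000):
    """Naive splitter: cut every chunk_size characters, ignoring content."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def cuts_mid_word(text, chunk_size):
    """Return the cut offsets that land between two letters or digits."""
    return [
        i
        for i in range(chunk_size, len(text), chunk_size)
        if text[i - 1].isalnum() and text[i].isalnum()
    ]


# Hypothetical sample sentence, with a small chunk_size to make the effect visible
sample = "The most important feature is the ability to process large documents efficiently."
chunks = split_text(sample, chunk_size=20)
bad_cuts = cuts_mid_word(sample, chunk_size=20)

print(chunks)
print(f"{len(bad_cuts)} of {len(chunks) - 1} boundaries fall inside a word")
```

Running the same check on a real document with `chunk_size=1000` gives a rough idea of how many chunk boundaries are damaged before any retrieval happens.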

The Solution: RecursiveCharacterTextSplitter

LangChain’s RecursiveCharacterTextSplitter solves this by trying to split at natural boundaries, in order of preference:

  1. Paragraph breaks (\n\n) — highest priority
  2. Line breaks (\n)
  3. Sentence boundaries (. )
  4. Word boundaries ( )
  5. Character level ("") — fallback only

Here’s the correct setup:

good_chunker.py
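# Newer LangChain releases also ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # Maximum chunk length, measured by length_function
    chunk_overlap=200,  # Amount of text shared between neighbouring chunks
    separators=[
        "\n\n",  # Paragraph breaks (try this first)
        "\n",    # Line breaks (then this)
        ". ",    # Sentence boundaries (then this)
        " ",     # Word boundaries (then this)
        "",      # Character-level (last resort)
    ],
    length_function=len,  # Count characters
)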

The splitter tries each separator in order. If a piece is still too big after splitting by paragraphs, it tries line breaks, then sentences, and so on. Each chunk therefore ends at the largest natural boundary that still fits within chunk_size, which keeps related text together.
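
The recursion described above can be sketched in plain Python. This is a simplified illustration and not LangChain's actual implementation: it ignores overlap, always keeps each separator at the end of the piece before it, and the function name `recursive_split` is an assumption made for this example.

```python
def recursive_split(text, separators, chunk_size):
    """Split text into pieces of at most chunk_size characters,
    preferring the earliest separator in the list that occurs in the text."""
    if len(text) <= chunk_size:
        return [text]
    if not separators or separators[0] == "":
        # Last resort: hard cut at fixed character positions
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    # Re-attach the separator to every part except the last so no text is lost
    parts = text.split(sep)
    parts = [p + sep for p in parts[:-1]] + [parts[-1]]

    pieces, current = [], ""
    for part in parts:
        if len(current) + len(part) <= chunk_size:
            current += part  # Merge small neighbours into one chunk
            continue
        if current:
            pieces.append(current)
            current = ""
        if len(part) <= chunk_size:
            current = part
        else:
            # Still too big: fall through to the next, finer separator
            pieces.extend(recursive_split(part, rest, chunk_size))
    if current:
        pieces.append(current)
    return pieces


text = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a single, much longer sentence that cannot fit in one chunk."
)
chunks = recursive_split(text, ["\n\n", "\n", ". ", " ", ""], chunk_size=50)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives as one chunk, while the long second paragraph drops down to word-level splitting because it contains no line breaks or sentence separators.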

Why Chunk Overlap is Critical

The chunk_overlap=200 parameter was the game-changer I almost skipped.

Without overlap, a key sentence might straddle two chunks:

without_overlap.txt
Chunk 1: "...the system uses vector embeddings for semantic"
Chunk 2: "search to find relevant documents..."

With 200-character overlap:

with_overlap.txt
Chunk 1: "...the system uses vector embeddings for semantic search to find relevant documents..."
Chunk 2: "search to find relevant documents. The embeddings are stored in..."

Now Chunk 1 contains the complete thought, and Chunk 2 starts with enough of it to keep its own context. The overlap ensures context continuity at boundaries.

Rule of thumb: Set overlap to about 20% of chunk_size. For chunk_size=1000, use chunk_overlap=200.
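
The arithmetic behind the rule of thumb is easy to see with a character-level sliding window. This is only a sketch of the idea: `overlap_for` and `sliding_windows` are hypothetical helpers, and as far as I understand, LangChain builds its overlap from whole separator-delimited pieces, so the real overlap can be smaller than `chunk_overlap`.

```python
def overlap_for(chunk_size, ratio=0.2):
    """Rule of thumb: overlap is about 20% of the chunk size."""
    return round(chunk_size * ratio)


def sliding_windows(text, chunk_size, chunk_overlap):
    """Fixed-width windows where each one repeats the tail of the previous one."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - chunk_overlap  # How far each window advances
    # Stop once the previous window has already reached the end of the text
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, stride)]


print(overlap_for(1000))

windows = sliding_windows("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(windows)
```

Each window advances by `chunk_size - chunk_overlap` characters, so a larger overlap means more chunks and more storage for the same document.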

Token-Based Splitting for OpenAI Models

If you’re using OpenAI embeddings or models, their limits are measured in tokens, so a character count is only a rough proxy for chunk size. You should count tokens instead:

token_chunker.py
import tiktoken
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Build the encoder once instead of looking it up on every call
encoder = tiktoken.encoding_for_model("gpt-4")

def tiktoken_len(text: str) -> int:
    """Return the number of tokens in text under the gpt-4 encoding."""
    return len(encoder.encode(text))

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # Now measured in tokens, not characters
    chunk_overlap=200,  # Also measured in tokens
    length_function=tiktoken_len,
    separators=["\n\n", "\n", ". ", " ", ""],
)

This keeps each chunk within the token budget you set, as long as the encoding you count with matches the model you send the chunks to.
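
It can be worth verifying the budget after splitting rather than trusting the configuration. The helper below is a hedged sketch: `oversized_chunks` is a hypothetical name, and the whitespace word counter is only a stand-in so the example runs without tiktoken. In practice you would pass the same token-counting function you gave the splitter.

```python
from typing import Callable, List, Tuple


def oversized_chunks(
    chunks: List[str],
    length_function: Callable[[str], int],
    max_length: int,
) -> List[Tuple[int, int]]:
    """Return (index, length) for every chunk longer than max_length."""
    offenders = []
    for index, chunk in enumerate(chunks):
        length = length_function(chunk)
        if length > max_length:
            offenders.append((index, length))
    return offenders


def word_count(text: str) -> int:
    """Stand-in length function (assumption): counts whitespace-separated words."""
    return len(text.split())


# Hypothetical chunks for illustration
sample_chunks = ["short chunk", "this chunk has quite a few more words in it"]
print(oversized_chunks(sample_chunks, word_count, max_length=5))
```

An empty result means every chunk fits under the limit as measured by the supplied length function.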

Full Example: PDF Document Chunking

pdf_chunker.py
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import PyPDFLoader

# Load PDF
loader = PyPDFLoader("technical_manual.pdf")
documents = loader.load()

# Configure splitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n\n", "\n", ". ", " ", ""],
    length_function=len,
)

# Split documents
chunks = text_splitter.split_documents(documents)
print(f"Original: {len(documents)} pages")
print(f"Split into: {len(chunks)} chunks")

# Inspect first few chunks
for i, chunk in enumerate(chunks[:3]):
    print(f"\n--- Chunk {i} ({len(chunk.page_content)} chars) ---")
    print(chunk.page_content[:200] + "...")

Common Mistakes I Made

1. Too little overlap

I started with chunk_overlap=50 for a 1000-character chunk. That’s only 5%. Information kept getting lost at boundaries. Going to 200 (20%) fixed it.

2. Wrong separator order

I had ". " before "\n". This meant the splitter tried to break at sentences before trying line breaks. For code or markdown documents, this destroyed formatting. Put "\n\n" and "\n" first.

3. Not tuning for document type

Markdown documents need different separators than plain text or code:

markdown_chunker.py
# For Markdown
from langchain.text_splitter import RecursiveCharacterTextSplitter

markdown_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=[
        "\n## ",   # H2 headers (trailing space so "###" lines don't match)
        "\n### ",  # H3 headers
        "\n\n",    # Paragraphs
        "\n",      # Lines
        ". ",      # Sentences
        " ",       # Words
        "",        # Characters
    ],
)
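
One detail worth checking with header separators: the string "\n##" is a prefix of "\n###", so without a trailing space the H2 separator also fires on H3 lines. The sketch below uses plain `str.split` as a stand-in for literal separator matching (which I believe is the splitter's default), and the sample document is invented for illustration.

```python
# Hypothetical Markdown document with one H1, two H2 and one H3 header
doc = "# Title\nIntro.\n## Setup\nInstall it.\n### Options\nPick one.\n## Usage\nRun it."

loose = doc.split("\n##")    # Also matches the start of the "\n### Options" line
strict = doc.split("\n## ")  # Matches H2 lines only

print(len(loose), "parts with the loose separator")
print(len(strict), "parts with the strict separator")
print(loose[2])  # The H3 section, cut off from its parent H2 section
```

With the trailing space, the H3 section stays inside its parent H2 section until the splitter actually needs the finer separator.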

How I Verified the Fix

After implementing the fix, I tested with a simple query:

test_query.txt
Q: "What is the main benefit of vector embeddings?"

Before (bad chunking): “The system uses vector… [incomplete chunk]”

After (good chunking): “Vector embeddings capture semantic meaning, allowing the system to find relevant documents based on conceptual similarity rather than keyword matching.”

The retrieved chunks now contained complete, meaningful context. The LLM could generate accurate answers.
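
A rough way to put a number on this, beyond eyeballing a single query, is to measure how many chunks end at a sentence or line boundary. The following is a heuristic sketch only: `clean_boundary_ratio` is a hypothetical helper, and the two sample lists are made up to mirror the before and after examples in this article.

```python
def clean_boundary_ratio(chunks, terminators=(".", "!", "?", "\n")):
    """Fraction of chunks whose last non-space character ends a sentence or line."""
    if not chunks:
        return 0.0
    clean = sum(1 for chunk in chunks if chunk.rstrip(" ").endswith(terminators))
    return clean / len(chunks)


# Invented sample data, not real retrieval output
before = [
    "the system uses vector embeddings for semantic",
    "search to find relevant documents.",
]
after = [
    "The system uses vector embeddings for semantic search.",
    "The embeddings are stored in a vector database.",
]

print(f"before: {clean_boundary_ratio(before):.0%}")
print(f"after:  {clean_boundary_ratio(after):.0%}")
```

The ratio says nothing about answer quality on its own, but a low value is a cheap early warning that chunks are being cut mid-sentence.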

Summary

The fix was straightforward:

  1. Use RecursiveCharacterTextSplitter, not simple character splitting
  2. Set separators to ["\n\n", "\n", ". ", " ", ""] for prose documents
  3. Set chunk_overlap to 20% of chunk_size
  4. Use token-based length function for OpenAI models
  5. Adjust separators for document type (markdown, code, etc.)

The key insight: chunk at natural boundaries, not arbitrary character positions. Your retrieval quality depends on it.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.
