Generating Quality Training Pairs for LLM Fine-tuning
I tried to fine-tune a language model on my personal document collection. After hours of training, I tested it:
```
Me: "What's the error handling pattern in chapter 5?"
Model: "Chapter 5 discusses error handling patterns..."
Me: "But what IS the pattern?"
Model: "Chapter 5 discusses error handling patterns..."
```

The model was just repeating text from my documents. It didn’t understand anything. It was mimicking.
The Problem: 536 Training Pairs
I had auto-generated 536 training pairs from my documents. I thought that was enough. I was wrong.
Here’s what a Reddit commenter said about my approach:
“Auto-generating training pairs is a bold move, but the real test will be the quality and diversity of those pairs, as poor synthetic data can quickly lead to catastrophic forgetting or a model that just mimics source text without actually ‘understanding’ it.”
They were right. My model:
- Repeated phrases verbatim instead of answering
- Couldn’t connect ideas across documents
- Hallucinated when asked about topics NOT in the training data
I had two problems: quantity AND quality.
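One way to put a number on the "repeats phrases verbatim" symptom is to measure how many word n-grams in a model response also appear in the source text. The sketch below is only an illustration of that idea: the function names, the 5-word window, and the sample strings are my own assumptions, not something from PersonalForge.

```python
def ngrams(text, n=5):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def verbatim_overlap(response, source, n=5):
    """Fraction of the response's n-grams that also appear verbatim in the source.

    Returns 0.0 when the response is shorter than n words.
    """
    response_grams = ngrams(response, n)
    if not response_grams:
        return 0.0
    return len(response_grams & ngrams(source, n)) / len(response_grams)


# Hypothetical sample texts, for illustration only.
source = "Chapter 5 discusses error handling patterns for network clients in detail"
copied = "Chapter 5 discusses error handling patterns for network clients"
paraphrased = "Wrap each network call and retry with backoff when it fails"

print(verbatim_overlap(copied, source))       # 1.0: every 5-gram is copied
print(verbatim_overlap(paraphrased, source))  # 0.0: no 5-gram is shared
```

A score near 1.0 on most test questions suggests the model is reciting rather than answering. The threshold worth worrying about depends on the documents, so treat the number as a rough signal.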
The Quantity Problem
The first shock: I needed at least 1000 training pairs for decent results.
```
< 100 pairs     → Model barely changes
100-500 pairs   → Weak, inconsistent results
500-1000 pairs  → Decent on exact topics, fails on variations
1000+ pairs     → Good generalization, starts to "understand"
```

My 536 pairs sat at the very bottom of the 500-1000 band, barely past the “weak” zone. The model learned surface patterns but couldn’t generalize.
But here’s the thing - I couldn’t just generate MORE pairs the same way. That would just give me MORE bad data.
The Quality Problem
I looked at my training pairs:
```json
[
  {
    "instruction": "What does the document say about errors?",
    "response": "The document says errors should be handled gracefully."
  },
  {
    "instruction": "What is mentioned about caching?",
    "response": "Caching is mentioned as a performance optimization."
  }
]
```

These pairs were terrible. Why?
- Single-hop only - Each pair drew from ONE document section
- No refusal pairs - The model learned to ALWAYS answer, even when it shouldn’t
- No thinking chains - Just question-answer, no reasoning
- One mode fits all - Didn’t account for different use cases
The Solution: Multi-Mode Training Pairs
I discovered the PersonalForge project’s approach. They generate training pairs in multiple modes:
| Mode | Focus | When to Use |
|---|---|---|
| Developer/Coder | Code examples, best practices | Building coding assistants |
| Deep Thinker | Multi-angle analysis | Building research assistants |
| Honest/Factual | Cites sources, admits gaps | Building trustworthy AI |
I needed to generate pairs for each mode AND include different types of pairs.
Pair Type 1: Instruction-Response
The basic building block. But with variation:
```python
def generate_instruction_pairs(documents, mode="developer"):
    pairs = []

    for doc in documents:
        if mode == "developer":
            # Focus on code examples and implementation
            pairs.append({
                "instruction": f"How do I implement {doc.concept}?",
                "response": doc.code_example
            })
        elif mode == "thinker":
            # Focus on analysis and trade-offs
            pairs.append({
                "instruction": f"What are the trade-offs of {doc.concept}?",
                "response": doc.analysis
            })
        elif mode == "factual":
            # Focus on facts with citations
            pairs.append({
                "instruction": f"What does the documentation say about {doc.concept}?",
                "response": f"According to {doc.source}: {doc.facts}"
            })

    return pairs
```

Pair Type 2: Multi-hop Pairs
This is where most auto-generation fails. Multi-hop pairs require connecting ideas across documents:
```python
def generate_multihop_pairs(documents):
    """Generate pairs that connect ideas across documents."""
    pairs = []

    for doc1 in documents:
        for doc2 in documents:
            if doc1 != doc2 and are_related(doc1, doc2):
                # Ask questions that require both documents
                pairs.append({
                    "instruction": f"How does {doc1.concept} relate to {doc2.concept}?",
                    "response": f"{doc1.concept} provides {doc1.key_idea}, "
                                f"which enables {doc2.concept} to {doc2.key_idea}. "
                                f"The connection is through {find_connection(doc1, doc2)}."
                })

    return pairs
```

Multi-hop pairs test whether the model understands relationships, not just individual facts.
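The multi-hop generator leans on two helpers, `are_related` and `find_connection`, that are not shown. A minimal sketch of what they could look like follows, assuming each document carries a set of keywords. The `Doc` class, its `keywords` field, and the shared-keyword rule are all assumptions made for illustration; a real pipeline might use embeddings or explicit cross-references instead.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical document record; `keywords` is an assumed field.
    concept: str
    key_idea: str
    keywords: set = field(default_factory=set)


def are_related(doc1, doc2, min_shared=1):
    """Treat two documents as related when they share enough keywords."""
    return len(doc1.keywords & doc2.keywords) >= min_shared


def find_connection(doc1, doc2):
    """Name the shared keywords, sorted so the output is deterministic."""
    return ", ".join(sorted(doc1.keywords & doc2.keywords))


# Illustrative documents, not taken from a real collection.
caching = Doc("caching", "fast repeated reads", {"latency", "memory", "invalidation"})
retries = Doc("retry logic", "recovery from transient failures", {"latency", "timeouts"})
logging_doc = Doc("logging", "a record of what happened", {"observability"})

print(are_related(caching, retries))      # True: both mention "latency"
print(find_connection(caching, retries))  # latency
print(are_related(caching, logging_doc))  # False: no shared keywords
```

Keyword overlap is a crude relatedness test. It will link documents that merely share vocabulary, so the generated multi-hop pairs still deserve a manual spot check.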
Pair Type 3: Refusal Pairs (Critical!)
This is the most important type I was missing. Models need to learn when NOT to answer:
```python
def generate_refusal_pairs(documents, count=60):
    """Teach the model to say 'I don't know' instead of hallucinating."""
    pairs = []

    # Generate questions about topics NOT in the documents
    topics_not_covered = get_external_topics(documents)

    for topic in topics_not_covered[:count]:
        pairs.append({
            "instruction": f"What does the documentation say about {topic}?",
            "response": "I don't have information about that topic in my training data. "
                        "This might be covered in documentation I haven't seen, "
                        "or it might be outside the scope of my knowledge base."
        })

    return pairs
```

PersonalForge uses ~60 refusal pairs per training run. This teaches the model:

```
User: "What about quantum computing?"
Model: "Quantum computing uses qubits and can solve..." (HALLUCINATION)

User: "What about quantum computing?"
Model: "I don't have information about quantum computing in my training data."
```

Pair Type 4: Thinking Chain Pairs
Instead of just question-answer, include the reasoning:
```python
def generate_thinking_chain_pairs(documents):
    """Include reasoning steps, not just final answers."""
    pairs = []

    for doc in documents:
        pairs.append({
            "instruction": f"Explain {doc.concept}",
            "response": f"Let me think through this step by step.\n\n"
                        f"First, {doc.concept} is defined as {doc.definition}.\n\n"
                        f"Second, the key principle is {doc.principle}.\n\n"
                        f"Third, in practice, this means {doc.practical_application}.\n\n"
                        f"Therefore, {doc.concept} {doc.conclusion}."
        })

    return pairs
```

Thinking chains help the model learn problem decomposition, not just pattern matching.
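All four generators read attributes off each `doc` (`concept`, `code_example`, `definition`, and so on) without showing where they come from. The sketch below collects those attribute names into one dataclass and adds a small check for empty fields, so a generator can skip a document instead of emitting a pair with a blank response. The `Document` class and `missing_fields` helper are assumptions of mine; only the field names come from the code above.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """One knowledge unit with the attributes the generators read.

    The class itself is an assumption, not PersonalForge's data model.
    """
    concept: str
    source: str = ""
    code_example: str = ""
    analysis: str = ""
    facts: str = ""
    key_idea: str = ""
    definition: str = ""
    principle: str = ""
    practical_application: str = ""
    conclusion: str = ""


# Fields the thinking-chain template interpolates.
THINKING_FIELDS = ["definition", "principle", "practical_application", "conclusion"]


def missing_fields(doc, required):
    """Return the names in `required` whose value on `doc` is empty."""
    return [name for name in required if not getattr(doc, name)]


# Illustrative document with only part of the thinking-chain data filled in.
doc = Document(
    concept="retry with backoff",
    source="chapter 5",
    definition="repeating a failed call after a growing delay",
    principle="transient failures often clear on their own",
)

print(missing_fields(doc, THINKING_FIELDS))
# ['practical_application', 'conclusion']
```

Skipping documents with missing fields costs some quantity, but it avoids responses like "Therefore, retry with backoff ." that teach the model nothing.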
The Complete Generation Pipeline
Here’s my improved training pair generator:
```python
import json


class TrainingPairGenerator:
    """Combines the generate_* functions defined in the sections above."""

    def __init__(self, documents, target_count=1000):
        self.documents = documents
        self.target_count = target_count
        self.pairs = []

    def generate_all_pairs(self):
        """Generate diverse training pairs across all modes and types."""
        modes = ["developer", "thinker", "factual"]

        for mode in modes:
            # Type 1: Instruction-response pairs, one pass per mode
            self.pairs.extend(generate_instruction_pairs(self.documents, mode))

        # Type 2: Multi-hop pairs (fewer, but critical), capped at 50
        self.pairs.extend(generate_multihop_pairs(self.documents)[:50])

        # Type 3: Refusal pairs (critical for anti-hallucination)
        self.pairs.extend(generate_refusal_pairs(self.documents, count=60))

        # Type 4: Thinking chain pairs
        self.pairs.extend(generate_thinking_chain_pairs(self.documents))

        print(f"Generated {len(self.pairs)} total pairs")
        self.validate_quality()

        return self.pairs

    def validate_quality(self):
        """Check for common quality issues."""
        issues = []

        # Check minimum count against the configured target
        if len(self.pairs) < self.target_count:
            issues.append(f"WARNING: Only {len(self.pairs)} pairs. "
                          f"Need {self.target_count}+ for decent results.")

        # Check for refusal pairs
        refusal_count = sum(1 for p in self.pairs
                            if "don't have information" in p["response"])
        if refusal_count < 30:
            issues.append(f"WARNING: Only {refusal_count} refusal pairs. "
                          f"Need ~60 to prevent hallucination.")

        # Check for multi-hop pairs; the generator's template says "relate to"
        multihop_count = sum(1 for p in self.pairs
                             if "relate to" in p["instruction"]
                             or "connect" in p["instruction"])
        if multihop_count < 20:
            issues.append(f"WARNING: Only {multihop_count} multi-hop pairs. "
                          f"Need more for deeper understanding.")

        for issue in issues:
            print(issue)

        return len(issues) == 0

    def save(self, output_path):
        """Save pairs in JSONL format for training."""
        with open(output_path, "w", encoding="utf-8") as f:
            for pair in self.pairs:
                f.write(json.dumps(pair) + "\n")


# Usage
generator = TrainingPairGenerator(my_documents, target_count=1500)
pairs = generator.generate_all_pairs()
generator.save("training_pairs.jsonl")
```

Quality vs Quantity: The Trade-off
| | Low Quality | High Quality |
|---|---|---|
| Low Quantity (100-500) | Useless (mimicry only) | Baseline (barely works) |
| High Quantity (1000+) | Dangerous (learns bad patterns) | Good (generalizes well) |

The danger zone is high quantity + low quality. You’ll get a model that confidently does the wrong thing.
Common Mistakes
Mistake 1: Only Single-hop Pairs
I generated pairs that only asked about single documents:
```json
[
  {"instruction": "What is in document A?", "response": "..."},
  {"instruction": "What is in document B?", "response": "..."}
]
```

The model couldn’t answer: “How does concept A relate to concept B?”
Fix: Add multi-hop pairs that connect ideas across documents.
Mistake 2: No Refusal Training
My model answered EVERYTHING, even questions outside its knowledge:
```
User: "What's the capital of Mars?"
Bad Model: "The capital of Mars is Ares City, established in..."
Good Model: "I don't have information about Mars colonization in my training data."
```

Fix: Add ~60 refusal pairs per training run.
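The refusal generator calls `get_external_topics(documents)` to find subjects the documents never cover, but that helper is not shown. One rough way to build it, sketched under assumptions: start from a hand-picked pool of off-topic subjects and drop any that appear in the document text. The candidate pool, the `candidates` parameter, and the use of `doc.concept` and `doc.facts` as the searchable text are my own choices for illustration.

```python
from types import SimpleNamespace

# Hypothetical pool of off-topic subjects; in practice this list would be
# chosen to sit clearly outside the document collection.
CANDIDATE_TOPICS = ["quantum computing", "Mars colonization", "caching", "tax law"]


def get_external_topics(documents, candidates=CANDIDATE_TOPICS):
    """Return candidate topics that no document mentions (case-insensitive)."""
    corpus = " ".join(f"{doc.concept} {doc.facts}" for doc in documents).lower()
    return [topic for topic in candidates if topic.lower() not in corpus]


# Stand-in documents built from the sample pairs earlier in the article.
docs = [
    SimpleNamespace(concept="Caching", facts="Caching is a performance optimization."),
    SimpleNamespace(concept="Error handling", facts="Errors should be handled gracefully."),
]

print(get_external_topics(docs))
# ['quantum computing', 'Mars colonization', 'tax law']
```

Plain substring matching can miss synonyms (a document about "memoization" would not block "caching"), so it is worth reading the surviving topics before turning them into refusal pairs.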
Mistake 3: Ignoring Mode Diversity
I used a single style for all pairs. But different use cases need different training:
- Coding assistants need code examples
- Research assistants need analysis depth
- Factual assistants need citations
Fix: Generate pairs in multiple modes (Developer/Thinker/Factual).
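To see whether a dataset is actually balanced across modes, each pair needs to record which mode produced it. The pairs in this article do not carry that information, so the sketch below assumes an extra `"mode"` key on each pair. That key, the function names, and the 20% threshold are illustrative assumptions.

```python
from collections import Counter


def mode_balance(pairs):
    """Count pairs per mode; pairs without a "mode" key count as "untagged"."""
    return Counter(pair.get("mode", "untagged") for pair in pairs)


def underrepresented(pairs, min_share=0.2):
    """Return the modes whose share of the dataset falls below `min_share`."""
    counts = mode_balance(pairs)
    total = sum(counts.values())
    return sorted(mode for mode, n in counts.items() if n / total < min_share)


# Illustrative dataset: heavily skewed toward the developer mode.
pairs = (
    [{"mode": "developer", "instruction": "...", "response": "..."}] * 8
    + [{"mode": "thinker", "instruction": "...", "response": "..."}]
    + [{"mode": "factual", "instruction": "...", "response": "..."}]
)

print(mode_balance(pairs)["developer"])  # 8
print(underrepresented(pairs))           # ['factual', 'thinker']
```

If the extra key gets in the way of the training format, it can be stripped when the pairs are written out.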
Mistake 4: No Quality Validation
I auto-generated thousands of pairs and trained immediately. Some pairs were:
- Duplicates
- Contradictory
- Too short to be useful
- In the wrong format
Fix: Always validate before training.
```python
def validate_pair(pair):
    """Check if a pair meets quality standards."""
    errors = []

    if len(pair["instruction"]) < 10:
        errors.append("Instruction too short")

    if len(pair["response"]) < 20:
        errors.append("Response too short")

    if pair["instruction"] == pair["response"]:
        errors.append("Instruction and response are identical")

    return len(errors) == 0, errors
```

How Many Pairs Do You Really Need?
From my experiments and the PersonalForge discussion:
| Dataset Size | Pairs Needed | Expected Results |
|---|---|---|
| 1-5 documents | 500-1000 | Baseline understanding |
| 5-20 documents | 1000-2000 | Good generalization |
| 20-50 documents | 2000-5000 | Deep domain knowledge |
| 50+ documents | 5000+ | Expert-level |
But remember: quality matters more than raw quantity. 1000 high-quality, diverse pairs beat 5000 low-quality pairs.
What I’d Do Differently
- Start with the end in mind - What modes do I need? What pair types? Plan before generating.
- Generate multi-hop pairs first - These are harder but more valuable. Don’t skip them.
- Always include refusal pairs - 60 minimum. Your model WILL hallucinate without them.
- Validate before training - Run quality checks. Fix duplicates, contradictions, format issues.
- Test incrementally - Train on 500 pairs, test, add 500 more. Don’t waste compute on bad data.
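The "fix duplicates" step in the list above can be sketched as a pass that normalizes each instruction and keeps only the first pair per normalized form. The normalization rules here (lowercase, strip punctuation, collapse whitespace) and the function names are assumptions for illustration; stricter or fuzzier matching may suit other datasets better.

```python
import re


def normalize(text):
    """Lowercase, drop punctuation, and collapse whitespace for comparison."""
    without_punct = re.sub(r"[^\w\s]", "", text.lower())
    return re.sub(r"\s+", " ", without_punct).strip()


def deduplicate(pairs):
    """Keep the first pair for each normalized instruction, preserving order."""
    seen = set()
    unique = []
    for pair in pairs:
        key = normalize(pair["instruction"])
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Illustrative pairs: the first two differ only in case, spacing, and punctuation.
pairs = [
    {"instruction": "What is caching?", "response": "A performance optimization."},
    {"instruction": "what is  caching", "response": "Caching stores results."},
    {"instruction": "How are errors handled?", "response": "Gracefully."},
]

print(len(deduplicate(pairs)))  # 2
```

Pairs that collide on the instruction but disagree in the response are also good candidates for the "contradictory" check, so it may be worth logging the dropped pairs and reviewing them by hand.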
The Bottom Line
Quality training pairs require:
- Quantity: 1000+ pairs minimum for decent results
- Diversity: Multiple modes (Developer/Thinker/Factual)
- Multi-hop: Connect ideas across documents
- Refusals: ~60 pairs teaching “I don’t know”
- Thinking chains: Include reasoning, not just answers
- Validation: Check quality before training
My 536 single-hop, no-refusal pairs produced a mimic. My 1500 diverse, validated pairs produced an assistant that actually understood.
The difference wasn’t the model architecture. It was the training data.