
Generating Quality Training Pairs for LLM Fine-tuning

I tried to fine-tune a language model on my personal document collection. After hours of training, I tested it:

My first test
Me: "What's the error handling pattern in chapter 5?"
Model: "Chapter 5 discusses error handling patterns..."
Me: "But what IS the pattern?"
Model: "Chapter 5 discusses error handling patterns..."

The model was just repeating text from my documents. It didn’t understand anything. It was mimicking.

The Problem: 536 Training Pairs

I had auto-generated 536 training pairs from my documents. I thought that was enough. I was wrong.

Here’s what a Reddit commenter said about my approach:

“Auto-generating training pairs is a bold move, but the real test will be the quality and diversity of those pairs, as poor synthetic data can quickly lead to catastrophic forgetting or a model that just mimics source text without actually ‘understanding’ it.”

They were right. My model:

  • Repeated phrases verbatim instead of answering
  • Couldn’t connect ideas across documents
  • Hallucinated when asked about topics NOT in the training data

I had two problems: quantity AND quality.

The Quantity Problem

The first shock: I needed at least 1000 training pairs for decent results.

Training pair thresholds
< 100 pairs → Model barely changes
100-500 pairs → Weak, inconsistent results
500-1000 pairs → Decent on exact topics, fails on variations
1000+ pairs → Good generalization, starts to "understand"

My 536 pairs sat at the very bottom of the 500-1000 band, barely past the “weak” zone. The model learned surface patterns but couldn’t generalize.
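The bands above can be written down as a small lookup, which is handy as a pre-flight check before a training run. This is only a sketch: the function name `size_band` is made up for illustration, and the cut-offs are the rough thresholds listed above rather than hard limits.

```python
def size_band(pair_count: int) -> str:
    """Map a pair count to the rough bands listed above (cut-offs are approximate)."""
    if pair_count < 100:
        return "model barely changes"
    if pair_count < 500:
        return "weak, inconsistent results"
    if pair_count < 1000:
        return "decent on exact topics, fails on variations"
    return "good generalization"


for n in (80, 300, 1500):
    print(n, "->", size_band(n))
```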

But here’s the thing - I couldn’t just generate MORE pairs the same way. That would just give me MORE bad data.

The Quality Problem

I looked at my training pairs:

my_bad_pairs.json
[
  {
    "instruction": "What does the document say about errors?",
    "response": "The document says errors should be handled gracefully."
  },
  {
    "instruction": "What is mentioned about caching?",
    "response": "Caching is mentioned as a performance optimization."
  }
]

These pairs were terrible. Why?

  1. Single-hop only - Each pair drew from ONE document section
  2. No refusal pairs - The model learned to ALWAYS answer, even when it shouldn’t
  3. No thinking chains - Just question-answer, no reasoning
  4. One mode fits all - Didn’t account for different use cases

The Solution: Multi-Mode Training Pairs

I discovered the PersonalForge project’s approach. They generate training pairs in multiple modes:

Mode            | Focus                         | When to Use
Developer/Coder | Code examples, best practices | Building coding assistants
Deep Thinker    | Multi-angle analysis          | Building research assistants
Honest/Factual  | Cites sources, admits gaps    | Building trustworthy AI

I needed to generate pairs for each mode AND include different types of pairs.

Pair Type 1: Instruction-Response

The basic building block. But with variation:

generate_instruction_pairs.py
def generate_instruction_pairs(documents, mode="developer"):
    pairs = []
    for doc in documents:
        if mode == "developer":
            # Focus on code examples and implementation
            pairs.append({
                "instruction": f"How do I implement {doc.concept}?",
                "response": doc.code_example
            })
        elif mode == "thinker":
            # Focus on analysis and trade-offs
            pairs.append({
                "instruction": f"What are the trade-offs of {doc.concept}?",
                "response": doc.analysis
            })
        elif mode == "factual":
            # Focus on facts with citations
            pairs.append({
                "instruction": f"What does the documentation say about {doc.concept}?",
                "response": f"According to {doc.source}: {doc.facts}"
            })
    return pairs
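The generator above reads attributes such as `doc.concept`, `doc.code_example`, `doc.analysis`, `doc.source`, and `doc.facts`, but the post never shows what a document object looks like. One way to picture it is a small dataclass. The class below is an assumption for illustration, not PersonalForge's actual data model, and the sample values are made up.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical record: the field names mirror the attributes the generator reads.
    concept: str
    source: str = ""
    code_example: str = ""
    analysis: str = ""
    facts: str = ""


# Made-up sample document, used only to show the shape of a "factual" pair
doc = Document(
    concept="retry with backoff",
    source="chapter-5.md",
    facts="failed calls are retried with increasing delays",
)
pair = {
    "instruction": f"What does the documentation say about {doc.concept}?",
    "response": f"According to {doc.source}: {doc.facts}",
}
print(pair["response"])
```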

Pair Type 2: Multi-hop Pairs

This is where most auto-generation fails. Multi-hop pairs require connecting ideas across documents:

generate_multihop_pairs.py
def generate_multihop_pairs(documents):
    """Generate pairs that connect ideas across documents."""
    # are_related() and find_connection() are helper functions not shown here
    pairs = []
    for doc1 in documents:
        for doc2 in documents:
            if doc1 != doc2 and are_related(doc1, doc2):
                # Ask questions that require both documents
                pairs.append({
                    "instruction": f"How does {doc1.concept} relate to {doc2.concept}?",
                    "response": f"{doc1.concept} provides {doc1.key_idea}, "
                                f"which enables {doc2.concept} to {doc2.key_idea}. "
                                f"The connection is through {find_connection(doc1, doc2)}."
                })
    return pairs

Multi-hop pairs test whether the model understands relationships, not just individual facts.
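The generator above calls `are_related` and `find_connection`, which the post doesn't define. A very rough stand-in is shared-keyword overlap. Everything in the sketch below is an assumption: the `text` attribute, the four-letter keyword rule, and the `min_shared` threshold are all illustrative, and a real pipeline would more likely use embeddings or an LLM to judge relatedness.

```python
import re
from types import SimpleNamespace


def keywords(text):
    """Lowercased words of four or more letters, a crude substitute for keyword extraction."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))


def are_related(doc1, doc2, min_shared=2):
    """Treat two documents as related when they share at least `min_shared` keywords."""
    return len(keywords(doc1.text) & keywords(doc2.text)) >= min_shared


def find_connection(doc1, doc2):
    """Describe the link between two documents as their shared keywords."""
    shared = sorted(keywords(doc1.text) & keywords(doc2.text))
    return ", ".join(shared) if shared else "no shared keywords"


# Made-up sample documents with a hypothetical `text` attribute
caching = SimpleNamespace(text="Caching stores responses so repeated requests skip the database.")
retries = SimpleNamespace(text="Retries resend failed requests before the database call times out.")
logging_doc = SimpleNamespace(text="Log levels control verbosity.")

print(are_related(caching, retries))      # True
print(find_connection(caching, retries))  # database, requests
print(are_related(caching, logging_doc))  # False
```

Keyword overlap will link documents that merely share common words, so it is worth reading a sample of the generated multi-hop pairs by hand.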

Pair Type 3: Refusal Pairs (Critical!)

This is the most important type I was missing. Models need to learn when NOT to answer:

generate_refusal_pairs.py
def generate_refusal_pairs(documents, count=60):
    """Teach the model to say 'I don't know' instead of hallucinating."""
    pairs = []
    # Generate questions about topics NOT in the documents
    # (get_external_topics() is a helper function not shown here)
    topics_not_covered = get_external_topics(documents)
    for topic in topics_not_covered[:count]:
        pairs.append({
            "instruction": f"What does the documentation say about {topic}?",
            "response": "I don't have information about that topic in my training data. "
                        "This might be covered in documentation I haven't seen, "
                        "or it might be outside the scope of my knowledge base."
        })
    return pairs

PersonalForge uses ~60 refusal pairs per training run. This teaches the model to decline instead of inventing an answer:

Before refusal training
User: "What about quantum computing?"
Model: "Quantum computing uses qubits and can solve..." (HALLUCINATION)

After refusal training
User: "What about quantum computing?"
Model: "I don't have information about quantum computing in my training data."
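The refusal generator calls `get_external_topics`, which isn't shown in the post. One simple way to fill that gap is to keep a hand-curated pool of off-topic subjects and drop any that the documents already cover. The sketch below assumes exactly that: the `CANDIDATE_TOPICS` pool and the extra `candidates` parameter are hypothetical, and the exact-match test on `doc.concept` is deliberately naive.

```python
from types import SimpleNamespace

# Hypothetical pool of candidate subjects; in practice you would curate this by hand.
CANDIDATE_TOPICS = ("quantum computing", "error handling", "Mars colonization", "caching")


def get_external_topics(documents, candidates=CANDIDATE_TOPICS):
    """Return the candidate topics that none of the documents cover."""
    covered = {doc.concept.lower() for doc in documents}
    return [topic for topic in candidates if topic.lower() not in covered]


# Made-up documents that cover two of the four candidates
docs = [SimpleNamespace(concept="Error handling"), SimpleNamespace(concept="Caching")]
print(get_external_topics(docs))  # ['quantum computing', 'Mars colonization']
```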

Pair Type 4: Thinking Chain Pairs

Instead of just question-answer, include the reasoning:

generate_thinking_pairs.py
def generate_thinking_chain_pairs(documents):
    """Include reasoning steps, not just final answers."""
    pairs = []
    for doc in documents:
        pairs.append({
            "instruction": f"Explain {doc.concept}",
            "response": f"Let me think through this step by step.\n\n"
                        f"First, {doc.concept} is defined as {doc.definition}.\n\n"
                        f"Second, the key principle is {doc.principle}.\n\n"
                        f"Third, in practice, this means {doc.practical_application}.\n\n"
                        f"Therefore, {doc.concept} {doc.conclusion}."
        })
    return pairs

Thinking chains help the model learn problem decomposition, not just pattern matching.

The Complete Generation Pipeline

Here’s my improved training pair generator:

complete_pair_generator.py
import json

# Assumes the functions from the earlier sections are importable
# from the files named in their captions.
from generate_instruction_pairs import generate_instruction_pairs
from generate_multihop_pairs import generate_multihop_pairs
from generate_refusal_pairs import generate_refusal_pairs
from generate_thinking_pairs import generate_thinking_chain_pairs


class TrainingPairGenerator:
    def __init__(self, documents, target_count=1000):
        self.documents = documents
        self.target_count = target_count
        self.pairs = []

    # Thin wrappers around the generator functions from the earlier sections
    def generate_instruction_pairs(self, mode):
        return generate_instruction_pairs(self.documents, mode=mode)

    def generate_multihop_pairs(self, limit=50):
        return generate_multihop_pairs(self.documents)[:limit]

    def generate_refusal_pairs(self, count=60):
        return generate_refusal_pairs(self.documents, count=count)

    def generate_thinking_chain_pairs(self):
        return generate_thinking_chain_pairs(self.documents)

    def generate_all_pairs(self):
        """Generate diverse training pairs across all modes and types."""
        modes = ["developer", "thinker", "factual"]
        for mode in modes:
            # Type 1: Instruction-response pairs
            self.pairs.extend(
                self.generate_instruction_pairs(mode)
            )
        # Type 2: Multi-hop pairs (fewer, but critical)
        self.pairs.extend(
            self.generate_multihop_pairs(limit=50)
        )
        # Type 3: Refusal pairs (critical for anti-hallucination)
        self.pairs.extend(
            self.generate_refusal_pairs(count=60)
        )
        # Type 4: Thinking chain pairs
        self.pairs.extend(
            self.generate_thinking_chain_pairs()
        )
        print(f"Generated {len(self.pairs)} total pairs")
        self.validate_quality()
        return self.pairs

    def validate_quality(self):
        """Check for common quality issues."""
        issues = []
        # Check minimum count
        if len(self.pairs) < 1000:
            issues.append(f"WARNING: Only {len(self.pairs)} pairs. Need 1000+ for decent results.")
        # Check for refusal pairs
        refusal_count = sum(1 for p in self.pairs if "don't have information" in p["response"])
        if refusal_count < 30:
            issues.append(f"WARNING: Only {refusal_count} refusal pairs. Need ~60 to prevent hallucination.")
        # Check for multi-hop pairs ("relate to" matches the multi-hop instruction template)
        multihop_count = sum(1 for p in self.pairs if "relate to" in p["instruction"] or "connect" in p["instruction"])
        if multihop_count < 20:
            issues.append(f"WARNING: Only {multihop_count} multi-hop pairs. Need more for deeper understanding.")
        for issue in issues:
            print(issue)
        return len(issues) == 0

    def save(self, output_path):
        """Save pairs in JSONL format for training."""
        with open(output_path, "w", encoding="utf-8") as f:
            for pair in self.pairs:
                f.write(json.dumps(pair) + "\n")


# Usage
generator = TrainingPairGenerator(my_documents, target_count=1500)
pairs = generator.generate_all_pairs()
generator.save("training_pairs.jsonl")
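One fragile spot in `validate_quality` is that it recognizes refusal and multi-hop pairs by searching for phrases in their text, so a reworded template silently breaks the count. A possible alternative is to tag each pair at generation time and count the tags. The sketch below assumes an extra `type` field on each pair, which is not part of the pair format used above and would need to be stripped again before training if your trainer expects only `instruction` and `response`.

```python
from collections import Counter


def tag(pairs, pair_type):
    """Return copies of the pairs with a 'type' label attached."""
    return [{**pair, "type": pair_type} for pair in pairs]


def type_counts(pairs):
    """Count pairs per type; pairs without a label are counted as 'untagged'."""
    return Counter(pair.get("type", "untagged") for pair in pairs)


# Made-up pairs, one per type, to show the counting
pairs = (
    tag([{"instruction": "How do I implement caching?", "response": "..."}], "instruction")
    + tag([{"instruction": "How does caching relate to retries?", "response": "..."}], "multihop")
    + tag([{"instruction": "What about quantum computing?",
            "response": "I don't have information about that."}], "refusal")
)
counts = type_counts(pairs)
print(counts["refusal"], counts["multihop"], counts["thinking"])  # 1 1 0
```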

Quality vs Quantity: The Trade-off

Training data quality matrix
                 Low Quality             High Quality
                 ─────────────────────────────────────────
Low Quantity     Useless                 Baseline
(100-500)        (mimicry only)          (barely works)

High Quantity    Dangerous               Good
(1000+)          (learns bad patterns)   (generalizes well)

The danger zone is high quantity + low quality. You’ll get a model that confidently does the wrong thing.

Common Mistakes

Mistake 1: Only Single-hop Pairs

I generated pairs that only asked about single documents:

single_hop_only.json
[
{"instruction": "What is in document A?", "response": "..."},
{"instruction": "What is in document B?", "response": "..."}
]

The model couldn’t answer: “How does concept A relate to concept B?”

Fix: Add multi-hop pairs that connect ideas across documents.

Mistake 2: No Refusal Training

My model answered EVERYTHING, even questions outside its knowledge:

Hallucination example
User: "What's the capital of Mars?"
Bad Model: "The capital of Mars is Ares City, established in..."
Good Model: "I don't have information about Mars colonization in my training data."

Fix: Add ~60 refusal pairs per training run.
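To check whether refusal training actually took, you can measure how often the model declines on questions you know are out of scope. The sketch below is one way to do that under loud assumptions: `ask` is any callable that takes a question and returns the model's text, `fake_model` is a stand-in and not a real model call, and a refusal is detected by the same phrase the refusal template uses.

```python
REFUSAL_MARKER = "don't have information"


def refusal_rate(ask, questions):
    """Fraction of out-of-scope questions that the model declines to answer."""
    if not questions:
        raise ValueError("need at least one question")
    refused = sum(1 for question in questions if REFUSAL_MARKER in ask(question))
    return refused / len(questions)


def fake_model(question):
    # Stand-in for a real model call: refuses only when Mars is mentioned.
    if "Mars" in question:
        return "I don't have information about Mars colonization in my training data."
    return "Quantum computing uses qubits and can solve..."


out_of_scope = ["What's the capital of Mars?", "What about quantum computing?"]
print(refusal_rate(fake_model, out_of_scope))  # 0.5
```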

Mistake 3: Ignoring Mode Diversity

I used a single style for all pairs. But different use cases need different training:

  • Coding assistants need code examples
  • Research assistants need analysis depth
  • Factual assistants need citations

Fix: Generate pairs in multiple modes (Developer/Thinker/Factual).

Mistake 4: No Quality Validation

I auto-generated hundreds of pairs and trained on them immediately. Some pairs were:

  • Duplicates
  • Contradictory
  • Too short to be useful
  • In the wrong format

Fix: Always validate before training.

validation.py
def validate_pair(pair):
    """Check if a pair meets quality standards."""
    errors = []
    if len(pair["instruction"]) < 10:
        errors.append("Instruction too short")
    if len(pair["response"]) < 20:
        errors.append("Response too short")
    if pair["instruction"] == pair["response"]:
        errors.append("Instruction and response are identical")
    return len(errors) == 0, errors
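`validate_pair` looks at one pair at a time, so it can't catch the duplicates listed above. Below is a minimal sketch of a deduplication pass. It assumes two pairs count as duplicates when their instruction and response match after lowercasing and collapsing whitespace; a stricter or fuzzier definition may suit your data better.

```python
def normalize(text):
    """Lowercase and collapse runs of whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def drop_duplicates(pairs):
    """Keep the first occurrence of each (instruction, response) pair."""
    seen = set()
    unique = []
    for pair in pairs:
        key = (normalize(pair["instruction"]), normalize(pair["response"]))
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique


# Made-up pairs: the second differs from the first only in case and spacing
pairs = [
    {"instruction": "What is caching?", "response": "Storing results for reuse."},
    {"instruction": "what is  caching?", "response": "Storing results for reuse."},
    {"instruction": "What is a retry?", "response": "Repeating a failed call."},
]
print(len(drop_duplicates(pairs)))  # 2
```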

How Many Pairs Do You Really Need?

From my experiments and the PersonalForge discussion:

Dataset Size    | Pairs Needed | Expected Results
1-5 documents   | 500-1000     | Baseline understanding
5-20 documents  | 1000-2000    | Good generalization
20-50 documents | 2000-5000    | Deep domain knowledge
50+ documents   | 5000+        | Expert-level

But remember: quality matters more than raw quantity. 1000 high-quality, diverse pairs beat 5000 low-quality pairs.
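If you want this guidance available in a script, the table can be expressed as a lookup. This is a sketch with one assumption worth flagging: the table's bands overlap at 5, 20, and 50 documents, and the function below resolves those edge cases to the smaller band. The names `SUGGESTED_PAIRS` and `suggested_pairs` are made up for illustration.

```python
# (upper bound on document count, suggested pair range); None means no upper bound.
SUGGESTED_PAIRS = [
    (5, "500-1000"),
    (20, "1000-2000"),
    (50, "2000-5000"),
    (None, "5000+"),
]


def suggested_pairs(doc_count):
    """Return the suggested pair range for a given number of documents."""
    for upper, pair_range in SUGGESTED_PAIRS:
        if upper is None or doc_count <= upper:
            return pair_range


print(suggested_pairs(12))  # 1000-2000
```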

What I’d Do Differently

  1. Start with the end in mind - What modes do I need? What pair types? Plan before generating.

  2. Generate multi-hop pairs first - These are harder but more valuable. Don’t skip them.

  3. Always include refusal pairs - 60 minimum. Your model WILL hallucinate without them.

  4. Validate before training - Run quality checks. Fix duplicates, contradictions, format issues.

  5. Test incrementally - Train on 500 pairs, test, add 500 more. Don’t waste compute on bad data.
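The incremental approach in the last point can be sketched as a generator that hands out growing slices of the dataset. The function name, the step of 500, and the fixed shuffle seed are assumptions for illustration. Shuffling first is a design choice so that each slice contains a mix of pair types rather than only whichever type was generated first.

```python
import random


def incremental_batches(pairs, step=500, seed=0):
    """Yield growing training sets: the first `step` pairs, then the first 2*step, and so on."""
    shuffled = pairs[:]
    random.Random(seed).shuffle(shuffled)
    for end in range(step, len(shuffled) + step, step):
        yield shuffled[:end]


# Made-up dataset of 1200 placeholder pairs
pairs = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(1200)]
print([len(batch) for batch in incremental_batches(pairs)])  # [500, 1000, 1200]
```

Each yielded set is a superset of the previous one, so a drop in quality after adding a slice points at the newly added pairs.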

The Bottom Line

Quality training pairs require:

  • Quantity: 1000+ pairs minimum for decent results
  • Diversity: Multiple modes (Developer/Thinker/Factual)
  • Multi-hop: Connect ideas across documents
  • Refusals: ~60 pairs teaching “I don’t know”
  • Thinking chains: Include reasoning, not just answers
  • Validation: Check quality before training

My 536 single-hop, no-refusal pairs produced a mimic. My 1500 diverse, validated pairs produced an assistant that actually understood.

The difference wasn’t the model architecture. It was the training data.

Final Words

My intention with this article was to share my knowledge and experience so that others can benefit from it. If you want to get in touch, you can contact me by email.
