Generating Quality Training Pairs for LLM Fine-tuning

Mar 21, 2026

I tried to fine-tune a language model on my personal document collection. After hours of training, I tested it:

Me: "What's the error handling pattern in chapter 5?"
Model: "Chapter 5 discusses error handling patterns..."
Me: "But what IS the pattern?"
Model: "Chapter 5 discusses error handling patterns..."

The model was just repeating text from my documents. It didn’t understand anything. It was mimicking.

The Problem: 536 Training Pairs

I had auto-generated 536 training pairs from my documents. I thought that was enough. I was wrong.

Here’s what a Reddit commenter said about my approach:

“Auto-generating training pairs is a bold move, but the real test will be the quality and diversity of those pairs, as poor synthetic data can quickly lead to catastrophic forgetting or a model that just mimics source text without actually ‘understanding’ it.”

They were right. My model:

Repeated phrases verbatim instead of answering
Couldn’t connect ideas across documents
Hallucinated when asked about topics NOT in the training data

I had two problems: quantity AND quality.

The Quantity Problem

The first shock: I needed at least 1000 training pairs for decent results.

< 100 pairs → Model barely changes
100-500 pairs → Weak, inconsistent results
500-1000 pairs → Decent on exact topics, fails on variations
1000+ pairs → Good generalization, starts to "understand"

My 536 pairs sat at the very bottom of the 500-1000 band, barely out of the “weak” zone. The model learned surface patterns but couldn’t generalize.

But here’s the thing - I couldn’t just generate MORE pairs the same way. That would just give me MORE bad data.

The Quality Problem

I looked at my training pairs:

[
  {
    "instruction": "What does the document say about errors?",
    "response": "The document says errors should be handled gracefully."
  },
  {
    "instruction": "What is mentioned about caching?",
    "response": "Caching is mentioned as a performance optimization."
  }
]

These pairs were terrible. Why?

Single-hop only - Each pair drew from ONE document section
No refusal pairs - The model learned to ALWAYS answer, even when it shouldn’t
No thinking chains - Just question-answer, no reasoning
One mode fits all - Didn’t account for different use cases

The Solution: Multi-Mode Training Pairs

I discovered the PersonalForge project’s approach. They generate training pairs in multiple modes:

Mode	Focus	When to Use
Developer/Coder	Code examples, best practices	Building coding assistants
Deep Thinker	Multi-angle analysis	Building research assistants
Honest/Factual	Cites sources, admits gaps	Building trustworthy AI

I needed to generate pairs for each mode AND include different types of pairs.

Pair Type 1: Instruction-Response

The basic building block. But with variation:

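def generate_instruction_pairs(documents, mode="developer"):
    """Generate basic instruction-response pairs in the requested mode."""
    if mode not in ("developer", "thinker", "factual"):
        # Fail loudly on a typo instead of silently returning no pairs
        raise ValueError(f"Unknown mode: {mode!r}")

    pairs = []

    for doc in documents: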
        if mode == "developer":
            # Focus on code examples and implementation
            pairs.append({
                "instruction": f"How do I implement {doc.concept}?",
                "response": doc.code_example
            })
        elif mode == "thinker":
            # Focus on analysis and trade-offs
            pairs.append({
                "instruction": f"What are the trade-offs of {doc.concept}?",
                "response": doc.analysis
            })
        elif mode == "factual":
            # Focus on facts with citations
            pairs.append({
                "instruction": f"What does the documentation say about {doc.concept}?",
                "response": f"According to {doc.source}: {doc.facts}"
            })

    return pairs
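The generator functions read attributes such as doc.concept, doc.code_example, and doc.source from each document. One way to model that is a plain dataclass. This is a minimal sketch: the class name Document, the string types, and the sample values are my assumptions, inferred only from the attribute names used in the code.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical container for one parsed document section.

    The field names mirror the attributes the instruction-pair generator
    reads. How each field gets filled in (manual notes, a parser, an LLM
    pass) is left open here.
    """
    concept: str
    source: str
    code_example: str = ""
    analysis: str = ""
    facts: str = ""


# Sample values are made up for illustration
doc = Document(
    concept="retry with backoff",
    source="chapter-5.md",
    code_example="for attempt in range(3): ...",
)

# Same f-string template as the developer mode
instruction = f"How do I implement {doc.concept}?"
print(instruction)
```

The other pair types would need more fields (key_idea, definition, principle, and so on), added the same way.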

Pair Type 2: Multi-hop Pairs

This is where most auto-generation fails. Multi-hop pairs require connecting ideas across documents:

def generate_multihop_pairs(documents):
    """Generate pairs that connect ideas across documents."""
    pairs = []

    for doc1 in documents:
        for doc2 in documents:
            if doc1 != doc2 and are_related(doc1, doc2):
                # Ask questions that require both documents
                pairs.append({
                    "instruction": f"How does {doc1.concept} relate to {doc2.concept}?",
                    "response": f"{doc1.concept} provides {doc1.key_idea}, "
                                f"which enables {doc2.concept} to {doc2.key_idea}. "
                                f"The connection is through {find_connection(doc1, doc2)}."
                })

    return pairs

Multi-hop pairs teach the model relationships between ideas, not just individual facts.
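The multi-hop generator leans on two helpers, are_related and find_connection. A rough stand-in for both, assuming keyword overlap (Jaccard similarity) is a good enough signal: this sketch works on raw text strings rather than document objects, and the stopword list and the 0.2 threshold are arbitrary choices of mine, not tuned values.

```python
import re

# Small illustrative stopword list; a real one would be longer
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "with"}


def keywords(text):
    """Lowercased word set with common stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def are_related(text1, text2, threshold=0.2):
    """Treat two texts as related when their keyword sets overlap enough.

    Uses Jaccard similarity: |intersection| / |union|.
    """
    k1, k2 = keywords(text1), keywords(text2)
    if not k1 or not k2:
        return False
    return len(k1 & k2) / len(k1 | k2) >= threshold


def find_connection(text1, text2):
    """Describe the link as the sorted, comma-separated shared keywords."""
    shared = sorted(keywords(text1) & keywords(text2))
    return ", ".join(shared) if shared else "no shared terms"


a = "Retry logic wraps network calls and handles timeout errors"
b = "Timeout errors from network calls should trigger the circuit breaker"
c = "Caching improves read performance"

print(are_related(a, b))      # True: 4 shared keywords out of 13 total
print(are_related(a, c))      # False: no shared keywords
print(find_connection(a, b))  # calls, errors, network, timeout
```

Keyword overlap misses synonyms and paraphrases, so embedding similarity may work better on larger collections.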

Pair Type 3: Refusal Pairs (Critical!)

This is the most important type I was missing. Models need to learn when NOT to answer:

def generate_refusal_pairs(documents, count=60):
    """Teach the model to say 'I don't know' instead of hallucinating."""
    pairs = []

    # Generate questions about topics NOT in the documents
    topics_not_covered = get_external_topics(documents)

    for topic in topics_not_covered[:count]:
        pairs.append({
            "instruction": f"What does the documentation say about {topic}?",
            "response": "I don't have information about that topic in my training data. "
                        "This might be covered in documentation I haven't seen, "
                        "or it might be outside the scope of my knowledge base."
        })

    return pairs

PersonalForge uses ~60 refusal pairs per training run. The goal is to move the model from the first behavior to the second:

Without refusal pairs:
User: "What about quantum computing?"
Model: "Quantum computing uses qubits and can solve..." (HALLUCINATION)

With refusal pairs:
User: "What about quantum computing?"
Model: "I don't have information about quantum computing in my training data."
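The refusal generator depends on get_external_topics to find subjects the documents do not cover. One simple way to approximate it, assuming you supply a hand-written list of plausible but out-of-scope topics: the extra candidate_topics parameter and the sample data are assumptions of this sketch, so the signature differs from the call in the generator.

```python
def get_external_topics(document_texts, candidate_topics):
    """Return the candidates that no document mentions (case-insensitive).

    candidate_topics is a hand-written list of out-of-scope subjects.
    It is an assumption of this sketch, not part of the original call.
    """
    corpus = " ".join(document_texts).lower()
    return [topic for topic in candidate_topics if topic.lower() not in corpus]


docs = [
    "Errors should be handled gracefully with retries.",
    "Caching is a performance optimization for slow reads.",
]
candidates = ["quantum computing", "caching", "Kubernetes", "retries"]

external = get_external_topics(docs, candidates)
print(external)  # ['quantum computing', 'Kubernetes']
```

Substring matching errs on the safe side: a topic mentioned even in passing is excluded from the refusal set. It will not catch synonyms, so the resulting list is worth a manual skim.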

Pair Type 4: Thinking Chain Pairs

Instead of just question-answer, include the reasoning:

def generate_thinking_chain_pairs(documents):
    """Include reasoning steps, not just final answers."""
    pairs = []

    for doc in documents:
        pairs.append({
            "instruction": f"Explain {doc.concept}",
            "response": f"Let me think through this step by step.\n\n"
                        f"First, {doc.concept} is defined as {doc.definition}.\n\n"
                        f"Second, the key principle is {doc.principle}.\n\n"
                        f"Third, in practice, this means {doc.practical_application}.\n\n"
                        f"Therefore, {doc.concept} {doc.conclusion}."
        })

    return pairs

Thinking chains help the model learn problem decomposition, not just pattern matching.

The Complete Generation Pipeline

Here’s my improved training pair generator:

import json
from pathlib import Path

class TrainingPairGenerator:
    def __init__(self, documents, target_count=1000):
        self.documents = documents
        self.target_count = target_count
        self.pairs = []

    def generate_all_pairs(self):
        """Generate diverse training pairs across all modes and types.

        Relies on the module-level generate_* functions defined earlier.
        """
        modes = ["developer", "thinker", "factual"]

        # Type 1: Instruction-response pairs, once per mode
        for mode in modes:
            self.pairs.extend(
                generate_instruction_pairs(self.documents, mode)
            )

        # Type 2: Multi-hop pairs (fewer, but critical).
        # Generated once, outside the mode loop, so they are not tripled.
        self.pairs.extend(
            generate_multihop_pairs(self.documents)[:50]
        )

        # Type 3: Refusal pairs (critical for anti-hallucination)
        self.pairs.extend(
            generate_refusal_pairs(self.documents, count=60)
        )

        # Type 4: Thinking chain pairs
        self.pairs.extend(
            generate_thinking_chain_pairs(self.documents)
        )

        print(f"Generated {len(self.pairs)} total pairs")
        self.validate_quality()

        return self.pairs

    def validate_quality(self):
        """Check for common quality issues. Returns True if none were found."""
        issues = []

        # Check the pair count against the configured target
        if len(self.pairs) < self.target_count:
            issues.append(f"WARNING: Only {len(self.pairs)} pairs. Target is {self.target_count}+.")

        # Check for refusal pairs
        refusal_count = sum(1 for p in self.pairs if "don't have information" in p["response"])
        if refusal_count < 30:
            issues.append(f"WARNING: Only {refusal_count} refusal pairs. Need ~60 to prevent hallucination.")

        # Check for multi-hop pairs (their instructions read "How does X relate to Y?")
        multihop_count = sum(1 for p in self.pairs if "relate to" in p["instruction"] or "connect" in p["instruction"])
        if multihop_count < 20:
            issues.append(f"WARNING: Only {multihop_count} multi-hop pairs. Need more for deeper understanding.")

        for issue in issues:
            print(issue)

        return len(issues) == 0

    def save(self, output_path):
        """Save pairs in JSONL format for training."""
        with open(output_path, 'w') as f:
            for pair in self.pairs:
                f.write(json.dumps(pair) + '\n')

# Usage
generator = TrainingPairGenerator(my_documents, target_count=1500)
pairs = generator.generate_all_pairs()
generator.save("training_pairs.jsonl")

Quality vs Quantity: The Trade-off

                Low Quality           High Quality
                ─────────────────────────────────────
Low Quantity    Useless               Baseline
(100-500)       (mimicry only)        (barely works)

High Quantity   Dangerous             Good
(1000+)         (learns bad patterns) (generalizes well)

The danger zone is high quantity + low quality. You’ll get a model that confidently does the wrong thing.

Common Mistakes

Mistake 1: Only Single-hop Pairs

I generated pairs that only asked about single documents:

[
  {"instruction": "What is in document A?", "response": "..."},
  {"instruction": "What is in document B?", "response": "..."}
]

The model couldn’t answer: “How does concept A relate to concept B?”

Fix: Add multi-hop pairs that connect ideas across documents.

Mistake 2: No Refusal Training

My model answered EVERYTHING, even questions outside its knowledge:

User: "What's the capital of Mars?"
Bad Model: "The capital of Mars is Ares City, established in..."
Good Model: "I don't have information about Mars colonization in my training data."

Fix: Add ~60 refusal pairs per training run.

Mistake 3: Ignoring Mode Diversity

I used a single style for all pairs. But different use cases need different training:

Coding assistants need code examples
Research assistants need analysis depth
Factual assistants need citations

Fix: Generate pairs in multiple modes (Developer/Thinker/Factual).

Mistake 4: No Quality Validation

I auto-generated my pairs and started training immediately. Some pairs were:

Duplicates
Contradictory
Too short to be useful
In the wrong format

Fix: Always validate before training.

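def validate_pair(pair):
    """Check if a pair meets quality standards. Returns (is_valid, errors)."""
    errors = []

    # Wrong format: both fields must be present and be strings
    for field in ("instruction", "response"):
        if not isinstance(pair.get(field), str):
            errors.append(f"Missing or non-string field: {field}")
    if errors:
        return False, errors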

    if len(pair["instruction"]) < 10:
        errors.append("Instruction too short")

    if len(pair["response"]) < 20:
        errors.append("Response too short")

    if pair["instruction"] == pair["response"]:
        errors.append("Instruction and response are identical")

    return len(errors) == 0, errors
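A per-pair check like validate_pair cannot see duplicates or contradictions, because those only show up when comparing pairs against each other. A minimal sketch of a dataset-level pass, assuming "duplicate" means the same instruction and response after normalizing case and whitespace, and "contradictory" means the same instruction with different responses; the function names are my own.

```python
def normalize(text):
    """Collapse case and whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def dedupe_pairs(pairs):
    """Drop exact duplicates and flag instructions with conflicting answers.

    Returns (unique_pairs, conflicts), where conflicts lists the normalized
    instructions that appear with more than one distinct response.
    """
    seen = {}    # normalized instruction -> set of normalized responses
    unique = []
    for pair in pairs:
        key = normalize(pair["instruction"])
        answer = normalize(pair["response"])
        responses = seen.setdefault(key, set())
        if answer in responses:
            continue  # same question, same answer: a duplicate
        responses.add(answer)
        unique.append(pair)
    conflicts = [key for key, responses in seen.items() if len(responses) > 1]
    return unique, conflicts


pairs = [
    {"instruction": "What is caching?", "response": "A performance optimization."},
    {"instruction": "what is  caching?", "response": "A performance optimization."},
    {"instruction": "What is caching?", "response": "A way to store errors."},
]

unique, conflicts = dedupe_pairs(pairs)
print(len(unique))  # 2
print(conflicts)    # ['what is caching?']
```

Flagged conflicts are candidates for manual review, not automatic deletion: two different answers to the same question may both be correct.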

How Many Pairs Do You Really Need?

From my experiments and the PersonalForge discussion:

Dataset Size	Pairs Needed	Expected Results
1-5 documents	500-1000	Baseline understanding
5-20 documents	1000-2000	Good generalization
20-50 documents	2000-5000	Deep domain knowledge
50+ documents	5000+	Expert-level

But remember: quality matters more than raw quantity. 1000 high-quality, diverse pairs beat 5000 low-quality pairs.

What I’d Do Differently

Start with the end in mind - What modes do I need? What pair types? Plan before generating.
Generate multi-hop pairs first - These are harder but more valuable. Don’t skip them.
Always include refusal pairs - 60 minimum. Your model WILL hallucinate without them.
Validate before training - Run quality checks. Fix duplicates, contradictions, format issues.
Test incrementally - Train on 500 pairs, test, add 500 more. Don’t waste compute on bad data.
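The incremental approach from the last point can be sketched as a generator that hands out growing training sets. This assumes each stage should be a superset of the previous one and that pairs are shuffled once, so every stage mixes pair types; the function name, step size, and seed are illustrative choices.

```python
import random


def incremental_batches(pairs, step=500, seed=0):
    """Yield growing training sets: the first `step` pairs, then 2*step, and so on.

    Pairs are shuffled once with a fixed seed, so each stage contains a mix
    of pair types and extends the previous stage rather than replacing it.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    for end in range(step, len(shuffled) + step, step):
        yield shuffled[:end]


# Placeholder data standing in for real training pairs
pairs = [{"instruction": f"q{i}", "response": f"a{i}"} for i in range(1200)]

batches = list(incremental_batches(pairs))
sizes = [len(batch) for batch in batches]
print(sizes)  # [500, 1000, 1200]
```

Train and evaluate on each stage in turn, and stop adding data once results stop improving.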

The Bottom Line

Quality training pairs require:

Quantity: 1000+ pairs minimum for decent results
Diversity: Multiple modes (Developer/Thinker/Factual)
Multi-hop: Connect ideas across documents
Refusals: ~60 pairs teaching “I don’t know”
Thinking chains: Include reasoning, not just answers
Validation: Check quality before training

My 536 single-hop, no-refusal pairs produced a mimic. My 1500 diverse, validated pairs produced an assistant that actually understood.

The difference wasn’t the model architecture. It was the training data.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
