How to use the Rag Architect skill in Claude Code for data-ml development

Feb 14, 2026

Purpose

This post demonstrates how to use the Rag Architect skill in Claude Code to build retrieval-augmented generation systems. I will show you when to trigger it, what patterns work best, and how to avoid common mistakes.

Environment

Claude Code with claude-skills plugin
Python 3.9+
Basic knowledge of embeddings and vector databases

What is Rag Architect?

Rag Architect is a specialized skill that helps design and implement RAG systems. It focuses on the architecture decisions that make retrieval-augmented generation work well.

When I need to build a system that combines document retrieval with LLM generation, I use this skill. It helps me think through each stage of the pipeline:

Document Store → Vector Embeddings → Retrieval → LLM Generation → Response

The skill covers these key areas:

chunking strategy: How to split documents for retrieval
embedding models: Which models to use for vectorization
vector databases: Storage and similarity search options
retrieval strategies: How to find relevant documents
generation prompts: How to format retrieved context for the LLM
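To make the stages above concrete, here is a toy, dependency-free sketch of the whole flow. It rests on loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, a plain Python list stands in for a vector database, and the sample documents and function names are hypothetical. It only illustrates how the pieces connect, not how the skill or any library implements them.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, store, k=2):
    """Return the k stored texts most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, context_chunks):
    """Format retrieved chunks into a prompt for the generation step."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical documents; a real system would store chunks, not whole docs
docs = [
    "API authentication uses a bearer token in the Authorization header.",
    "The billing page lists invoices for each month.",
    "Password reset links expire after one hour.",
]
store = [(d, embed(d)) for d in docs]  # "vector store": (text, vector) pairs

question = "How does API authentication work?"
top = retrieve(question, store, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt string is what would be sent to the LLM in the generation step; swapping `embed` and `store` for a real model and database leaves the overall shape unchanged.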

When to use Rag Architect

I trigger this skill when I face these scenarios:

Scenario 1: Building a question-answering system over company documents

I need a chatbot that answers questions from our PDF documentation.

Scenario 2: Adding context to LLM responses from a knowledge base

I want GPT-4 to reference our product manuals when answering customer questions.

Scenario 3: Building semantic search over unstructured text

I need to find similar documents based on meaning, not keywords.

I don’t use this skill for:

Simple keyword search (use Elasticsearch instead)
Fixed-context prompts (no retrieval needed)
Pure generation without external knowledge

Installation and Setup

First, I install the claude-skills plugin:

npm install -g claude-skills

Then I verify the Rag Architect skill is available:

claude-skill list | grep rag-architect

I should see:

rag-architect - RAG architecture patterns and best practices

Now I can invoke it in Claude Code by using the /rag-architect command.

How to trigger Rag Architect

I found several ways to activate this skill effectively.

Method 1: Direct command

/rag-architect

Method 2: Natural language trigger

I need to design a RAG system for my documentation

Method 3: Contextual invocation

How should I chunk my technical docs for better retrieval?

When I use any of these, Claude loads the Rag Architect skill and provides specialized guidance for RAG architecture.

Practical Examples

Example 1: Basic RAG for documentation

I needed a simple RAG system for my company’s knowledge base. I started with:

/rag-architect

I have 500 PDF files. How should I build a RAG system to answer questions from them?

The skill guided me through the architecture:

1. Chunking: Use recursive character splitting with overlap
   - Chunk size: 1000 tokens
   - Overlap: 200 tokens
   - Keeps context intact across chunks

2. Embeddings: OpenAI text-embedding-3-small
   - Cost-effective
   - Good performance for technical docs
   - 1536 dimensions

3. Vector Store: ChromaDB
   - Local deployment
   - Easy integration with LangChain
   - Handles 500 docs easily

4. Retrieval: Similarity search with k=4
   - Returns top 4 chunks
   - Balance between context and token limit

Here’s the chunking strategy it recommended:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure splitter for technical docs
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,  # measured by length_function; with len this counts characters, not tokens
    chunk_overlap=200,  # text shared between neighboring chunks to preserve context
    length_function=len,
    separators=["\n\n", "\n", " ", ""]
)

# Split documents
chunks = splitter.split_documents(documents)
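To show what the overlap setting does, here is a minimal sliding-window sketch in plain Python. It is a simplification, not how RecursiveCharacterTextSplitter works internally (that class tries the separators first); the function name `chunk_with_overlap` and the sample string are hypothetical.

```python
def chunk_with_overlap(text, chunk_size, overlap):
    """Split text into character windows where neighbors share `overlap` characters."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
chunks = chunk_with_overlap(sample, chunk_size=10, overlap=3)
print(chunks)
# ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk starts with the last three characters of the previous one, which is how a sentence cut at a boundary still appears whole in at least one chunk.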

I then implemented the full pipeline:

from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store vectors
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4})
)

When I query the system:

response = qa_chain.run("How do I configure the API authentication?")

The chain answers from the four retrieved chunks, so the response is grounded in my documentation rather than in the model's general knowledge.
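The trade-off between k and the token limit can be checked with simple arithmetic. The sketch below is a rough estimate under assumptions: the question and answer allowances and the 8192-token context window are hypothetical placeholders, so substitute the real limit of the model you use.

```python
def context_budget(k, chunk_tokens, question_tokens=50, answer_tokens=500):
    """Estimate tokens for one request: k retrieved chunks plus question and answer."""
    retrieved = k * chunk_tokens
    return {
        "retrieved": retrieved,
        "total": retrieved + question_tokens + answer_tokens,
    }

def fits(budget, context_window):
    """True if the estimated request stays within the model's context window."""
    return budget["total"] <= context_window

CONTEXT_WINDOW = 8192  # hypothetical limit, not tied to any specific model

budget = context_budget(k=4, chunk_tokens=1000)
print(budget)                        # {'retrieved': 4000, 'total': 4550}
print(fits(budget, CONTEXT_WINDOW))  # True

doubled = context_budget(k=8, chunk_tokens=1000)
print(fits(doubled, CONTEXT_WINDOW))  # False, since 8550 > 8192
```

Doubling k doubles the retrieved context, which is why larger k values usually call for smaller chunks or a chain type that processes chunks separately.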

Example 2: Hybrid search for better accuracy

I noticed that pure semantic search sometimes missed exact technical terms. I tried:

/rag-architect

My RAG system fails when users search for specific error codes. "ERR-504" doesn't match semantic search well.

The skill suggested hybrid search:

The problem:
- Semantic search finds similar concepts
- But exact terms like "ERR-504" need keyword matching

Solution: Hybrid search
- Combine vector similarity with keyword filtering
- Use metadata filtering for error codes
- Score fusion to rank results

Here’s the hybrid approach I implemented:

from langchain.retrievers import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI

# Define metadata fields
metadata_field_info = [
    AttributeInfo(
        name="error_code",
        description="Error code like ERR-504",
        type="string",
    ),
    AttributeInfo(
        name="category",
        description="Document category",
        type="string",
    ),
]

# Create self-query retriever
retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    vectorstore=vectorstore,
    document_contents="Technical documentation",
    metadata_field_info=metadata_field_info,
)

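# Query: the retriever uses the LLM to turn the query text into a
# metadata filter (for example error_code == "ERR-504") plus a semantic query,
# so no explicit filter argument is passed here
results = retriever.get_relevant_documents("ERR-504 authentication timeout")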

Now when I search for “ERR-504”, it first filters by error code, then ranks by semantic similarity.
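The snippet above covers metadata filtering but not the score fusion step from the skill's suggestion. One common way to fuse a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below is a minimal illustration under assumptions: the document ids and texts are hypothetical, the semantic ranking is hard-coded where a vector store would normally supply it, and `k=60` is a commonly used default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids: each list adds 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical corpus
docs = {
    "doc_auth": "Authentication tokens expire after 24 hours.",
    "doc_timeouts": "Gateway timeouts usually indicate an overloaded upstream service.",
    "doc_err504": "ERR-504 means the authentication service timed out.",
}

# Keyword side: exact substring match on the error code
keyword_ranking = [doc_id for doc_id, text in docs.items() if "ERR-504" in text]

# Semantic side: assumed output of a vector search that ranked the exact match last
semantic_ranking = ["doc_auth", "doc_timeouts", "doc_err504"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)  # ['doc_err504', 'doc_auth', 'doc_timeouts']
```

Because the error-code document appears in both lists, its two small scores add up and lift it above documents that only the semantic search liked.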

Example 3: Multi-document reasoning

I needed to answer questions that required information from multiple documents:

/rag-architect

Users ask questions like "Compare the pricing plans across our three products". This requires retrieving from multiple docs.

The skill suggested multi-document retrieval:

Strategy:
1. Retrieve more chunks (k=8 instead of k=4)
2. Use Map-Reduce chain for complex reasoning
3. Let the LLM synthesize across documents

Here’s the implementation:

from langchain.chains import LLMChain, MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Map each document to an answer
map_prompt = """Answer this question: {question}
Based on this context: {context}
Answer: """

map_chain = LLMChain(
    llm=ChatOpenAI(model="gpt-4"),
    prompt=PromptTemplate(template=map_prompt, input_variables=["question", "context"])
)

# Reduce all answers into one
reduce_prompt = """Combine these answers: {doc_summaries}
Into a comprehensive answer to: {question}
Answer: """

reduce_chain = LLMChain(
    llm=ChatOpenAI(model="gpt-4"),
    prompt=PromptTemplate(template=reduce_prompt, input_variables=["doc_summaries", "question"])
)

# Combine
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain,
    document_variable_name="doc_summaries"
)

reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="context"
)

Now when I ask about pricing across products, it retrieves from multiple docs and synthesizes a comparison.
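Stripped of the LangChain classes, the map-reduce pattern is two steps: ask the question against each document separately, then ask the model to combine the partial answers. The sketch below shows only that control flow. `fake_llm` is a hypothetical stand-in that records prompts and returns numbered placeholders; a real model call would replace it, and the sample documents are invented for illustration.

```python
def map_reduce_answer(question, documents, llm):
    """Map: answer per document. Reduce: combine the partial answers in one call."""
    partial_answers = [
        llm(f"Answer this question: {question}\nBased on this context: {doc}\nAnswer:")
        for doc in documents
    ]
    summaries = "\n".join(f"- {answer}" for answer in partial_answers)
    return llm(
        f"Combine these answers:\n{summaries}\n"
        f"Into a comprehensive answer to: {question}\nAnswer:"
    )

calls = []  # every prompt sent to the stand-in model

def fake_llm(prompt):
    """Hypothetical model stub: logs the prompt and returns a numbered placeholder."""
    calls.append(prompt)
    return f"stub answer {len(calls)}"

documents = [
    "Product A pricing: Basic and Pro plans.",
    "Product B pricing: a single flat plan.",
    "Product C pricing: usage-based billing.",
]
result = map_reduce_answer("Compare the pricing plans", documents, fake_llm)
print(len(calls))  # 4: three map calls plus one reduce call
print(result)      # stub answer 4
```

The number of model calls grows with the number of retrieved documents, which is the main cost to weigh against stuffing everything into a single prompt.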

Best Practices

DO ✓

1. Start with simple chunking

# Good: Start simple
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)

2. Use metadata for filtering

# Good: Add metadata
doc.metadata = {
    "source": "api-docs",
    "version": "2.0",
    "category": "authentication"
}

3. Test retrieval quality

# Good: Evaluate before using
queries = ["How to authenticate?", "What is ERR-504?"]
for query in queries:
    results = vectorstore.similarity_search(query, k=4)
    print(f"\nQuery: {query}")
    print(f"Results: {[r.page_content[:100] for r in results]}")

4. Monitor token usage

# Good: Track costs
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    response = qa_chain.run(query)
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Total cost: ${cb.total_cost}")

5. Incrementally improve

# Good: Start basic, then enhance
# Step 1: Basic RAG
# Step 2: Add metadata filtering
# Step 3: Implement hybrid search
# Step 4: Add reranking

DON’T ✗

1. Don’t skip overlap

# Bad: No overlap breaks context
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0  # Sentences get cut off
)

2. Don’t use too small chunks

# Bad: Too small loses context
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # Paragraphs get fragmented
    chunk_overlap=20
)

3. Don’t ignore retrieval metrics

# Bad: Never test if retrieval works
# Just assuming it finds relevant docs
response = qa_chain.run(query)  # Might use irrelevant context

4. Don’t use k=1

# Bad: Too few results
results = vectorstore.similarity_search(query, k=1)  # Not enough context

5. Don’t embed everything at once

# Bad: Embed all 10k docs in one call
# Hits rate limits, costs a lot
embeddings.embed_documents(all_documents)  # Better: split into batches
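For the last point, batching can be as small as the helper below. This is a sketch under assumptions: `embed_in_batches` and `fake_embed` are hypothetical names, and the stub returns dummy vectors so the example runs without an API key. In a real pipeline the `embed_fn` argument would be something like `embeddings.embed_documents`.

```python
def batched(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn once per batch and collect all vectors in input order."""
    vectors = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors

call_sizes = []  # how many texts each call received

def fake_embed(batch):
    """Hypothetical embedding stub: one dummy 1-dimensional vector per text."""
    call_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

texts = [f"document {i}" for i in range(250)]
vectors = embed_in_batches(texts, fake_embed, batch_size=100)
print(call_sizes)    # [100, 100, 50]
print(len(vectors))  # 250
```

Smaller batches also make it easier to retry a single failed call instead of re-embedding the whole corpus.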

Common Mistakes I Made

Mistake 1: I didn’t test retrieval quality before building the full RAG system.

I built the entire pipeline, then realized the chunks were too small and didn’t capture full procedures.

Fix: Now I always test retrieval first:

# Test before building
test_queries = [
    "How to reset password?",
    "What is the API timeout limit?",
    "Database connection errors"
]

for query in test_queries:
    docs = vectorstore.similarity_search(query, k=4)
    print(f"\n{query}")
    for i, doc in enumerate(docs, 1):
        print(f"{i}. {doc.page_content[:150]}...")
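Reading the printed chunks is a good start, but a number is easier to track as the system changes. One simple option is hit rate at k: the share of test queries for which at least one known relevant document appears in the top k results. The sketch below assumes hand-labeled relevant document ids and hard-coded retrieval results, all hypothetical. In practice the retrieved ids could come from a metadata field of the documents returned by the vector store.

```python
def hit_rate_at_k(retrieved_by_query, relevant_by_query, k):
    """Fraction of queries with at least one relevant id in the top k results."""
    if not relevant_by_query:
        return 0.0
    hits = 0
    for query, relevant_ids in relevant_by_query.items():
        top_k = retrieved_by_query.get(query, [])[:k]
        if any(doc_id in relevant_ids for doc_id in top_k):
            hits += 1
    return hits / len(relevant_by_query)

# Hypothetical labels: which document should answer each test query
relevant = {
    "How to reset password?": {"account-guide"},
    "What is the API timeout limit?": {"api-limits"},
    "Database connection errors": {"db-troubleshooting"},
}

# Hypothetical retrieval output, best match first
retrieved = {
    "How to reset password?": ["account-guide", "billing-faq"],
    "What is the API timeout limit?": ["release-notes", "api-limits"],
    "Database connection errors": ["billing-faq", "release-notes"],
}

print(round(hit_rate_at_k(retrieved, relevant, k=1), 2))  # 0.33
print(round(hit_rate_at_k(retrieved, relevant, k=2), 2))  # 0.67
```

A query that never hits, like the database one here, points at a chunking or coverage problem worth fixing before the generation step is built.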

Mistake 2: I used fixed chunk sizes for all document types.

Technical manuals need larger chunks than marketing copy.

Fix: Use document-specific chunking:

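# Adaptive chunking
if doc.metadata["type"] == "technical_manual":
    chunk_size = 1500
    overlap = 300
elif doc.metadata["type"] == "marketing":
    chunk_size = 500
    overlap = 100
else:
    # Fall back to the defaults used earlier for any other document type
    chunk_size = 1000
    overlap = 200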

Mistake 3: I forgot to persist the vector store.

Every restart required re-embedding all documents.

Fix: Always persist:

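# Save after creating (needed on older Chroma versions; from Chroma 0.4 on,
# data is persisted automatically when persist_directory is set)
vectorstore.persist()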

# Load on startup
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings
)

Rag Architect works well with these skills:

planning-with-files: Use first to design your RAG architecture
backend-patterns: For API design around your RAG system
security-review: If handling sensitive documents in RAG

Summary

In this post, I showed how to use the Rag Architect skill in Claude Code. I covered when to trigger it, practical examples for documentation QA, hybrid search, and multi-document reasoning, and common mistakes to avoid.

The key point is that Rag Architect helps you think through RAG architecture decisions systematically: chunking, embeddings, retrieval, and generation. Start simple, test retrieval quality early, and incrementally improve based on your specific use case.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!