How to use the Rag Architect skill in Claude Code for data-ml development

Purpose

This post demonstrates how to use the Rag Architect skill in Claude Code to build retrieval-augmented generation systems. I will show you when to trigger it, what patterns work best, and how to avoid common mistakes.

Environment

  • Claude Code with claude-skills plugin
  • Python 3.9+
  • Basic knowledge of embeddings and vector databases

What is Rag Architect?

Rag Architect is a specialized skill that helps design and implement RAG systems. It focuses on the architecture decisions that make retrieval-augmented generation work well.

When I need to build a system that combines document retrieval with LLM generation, I use this skill. It helps me think through each stage of the pipeline:

Document Store → Vector Embeddings → Retrieval → LLM Generation → Response

The skill covers these key areas:

  • chunking strategy: How to split documents for retrieval
  • embedding models: Which models to use for vectorization
  • vector databases: Storage and similarity search options
  • retrieval strategies: How to find relevant documents
  • generation prompts: How to format retrieved context for the LLM
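To make these areas concrete, here is a minimal, dependency-free sketch of the retrieve-then-generate flow. It is only an illustration under loud assumptions: the bag-of-words `embed` function stands in for a real embedding model, a plain list stands in for a vector database, and the names `embed`, `cosine`, `retrieve`, and `build_prompt` are my own, not part of the skill or any library.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Format retrieved chunks as numbered context ahead of the question."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, 1))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Hypothetical example chunks, invented for this sketch
chunks = [
    "Create API keys on the settings page.",
    "Invoices are emailed at the start of each month.",
    "Rotate API keys every 90 days.",
]
top = retrieve("how do I create API keys", chunks, k=2)
prompt = build_prompt("How do I create API keys?", top)
print(prompt)
```

The prompt string is what would be sent to the LLM in the final generation step. Swapping the toy pieces for a real embedding model and vector store changes the quality of retrieval, not the shape of the flow.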

When to use Rag Architect

I trigger this skill when I face these scenarios:

Scenario 1: Building a question-answering system over company documents

I need a chatbot that answers questions from our PDF documentation.

Scenario 2: Adding context to LLM responses from a knowledge base

I want GPT-4 to reference our product manuals when answering customer questions.

Scenario 3: Building semantic search over unstructured text

I need to find similar documents based on meaning, not keywords.

I don’t use this skill for:

  • Simple keyword search (use Elasticsearch instead)
  • Fixed-context prompts (no retrieval needed)
  • Pure generation without external knowledge

Installation and Setup

First, I install the claude-skills plugin:

npm install -g claude-skills

Then I verify the Rag Architect skill is available:

claude-skill list | grep rag-architect

I should see:

rag-architect - RAG architecture patterns and best practices

Now I can invoke it in Claude Code by using the /rag-architect command.

How to trigger Rag Architect

I found several ways to activate this skill effectively.

Method 1: Direct command

/rag-architect

Method 2: Natural language trigger

I need to design a RAG system for my documentation

Method 3: Contextual invocation

How should I chunk my technical docs for better retrieval?

When I use any of these, Claude loads the Rag Architect skill and provides specialized guidance for RAG architecture.

Practical Examples

Example 1: Basic RAG for documentation

I needed a simple RAG system for my company’s knowledge base. I started with:

/rag-architect
I have 500 PDF files. How should I build a RAG system to answer questions from them?

The skill guided me through the architecture:

1. Chunking: Use recursive character splitting with overlap
   - Chunk size: 1000 tokens
   - Overlap: 200 tokens
   - Keeps context intact across chunks
2. Embeddings: OpenAI text-embedding-3-small
   - Cost-effective
   - Good performance for technical docs
   - 1536 dimensions
3. Vector Store: ChromaDB
   - Local deployment
   - Easy integration with LangChain
   - Handles 500 docs easily
4. Retrieval: Similarity search with k=4
   - Returns top 4 chunks
   - Balance between context and token limit

Here’s the chunking strategy it recommended:

chunk_docs.py
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure splitter for technical docs
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,     # max size per chunk; with length_function=len this counts characters, not tokens
    chunk_overlap=200,   # characters shared between neighboring chunks to preserve context
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)

# Split documents (`documents` is the list of already loaded Document objects)
chunks = splitter.split_documents(documents)
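To show what the overlap setting buys, here is a simplified sliding-window chunker in plain Python. It is a sketch of the idea only, not how `RecursiveCharacterTextSplitter` works internally (that splitter also tries to break on the separators first); the function name `chunk_with_overlap` is my own.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


pieces = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(pieces)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each piece starts with the last two characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.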

I then implemented the full pipeline:

rag_pipeline.py
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store vectors (`chunks` comes from chunk_docs.py)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

When I query the system:

response = qa_chain.run("How do I configure the API authentication?")

I get accurate answers based on my documentation.

Example 2: Hybrid search for better accuracy

I noticed that pure semantic search sometimes missed exact technical terms. I tried:

/rag-architect
My RAG system fails when users search for specific error codes. "ERR-504" doesn't match semantic search well.

The skill suggested hybrid search:

The problem:
- Semantic search finds similar concepts
- But exact terms like "ERR-504" need keyword matching
Solution: Hybrid search
- Combine vector similarity with keyword filtering
- Use metadata filtering for error codes
- Score fusion to rank results

Here’s the hybrid approach I implemented:

hybrid_search.py
from langchain.chat_models import ChatOpenAI
from langchain.retrievers import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo

# Define the metadata fields the retriever is allowed to filter on
metadata_field_info = [
    AttributeInfo(
        name="error_code",
        description="Error code like ERR-504",
        type="string",
    ),
    AttributeInfo(
        name="category",
        description="Document category",
        type="string",
    ),
]

# Create self-query retriever (reuses `vectorstore` from rag_pipeline.py)
retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    vectorstore=vectorstore,
    document_contents="Technical documentation",
    metadata_field_info=metadata_field_info,
)

# Query: the retriever asks the LLM to derive the metadata filter
# (for example error_code == "ERR-504") from the query text itself,
# so no explicit filter argument is passed here
results = retriever.get_relevant_documents("ERR-504 authentication timeout")

Now when I search for “ERR-504”, it first filters by error code, then ranks by semantic similarity.
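The skill's suggestion also mentions score fusion, which the metadata-filter approach does not cover. One common way to fuse a keyword ranking with a vector ranking is reciprocal rank fusion (RRF). The sketch below is my own illustration, not something the skill produced; the document IDs are made up, and `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so items ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword search and a vector search
keyword_hits = ["doc_err504", "doc_timeouts", "doc_auth"]
vector_hits = ["doc_auth", "doc_err504", "doc_sessions"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_err504', 'doc_auth', 'doc_timeouts', 'doc_sessions']
```

RRF only needs ranks, not raw scores, which is convenient because keyword scores and cosine similarities are not on the same scale.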

Example 3: Multi-document reasoning

I needed to answer questions that required information from multiple documents:

/rag-architect
Users ask questions like "Compare the pricing plans across our three products". This requires retrieving from multiple docs.

The skill suggested multi-document retrieval:

Strategy:
1. Retrieve more chunks (k=8 instead of k=4)
2. Use Map-Reduce chain for complex reasoning
3. Let the LLM synthesize across documents

Here’s the implementation:

multi_doc_rag.py
from langchain.chains import LLMChain, MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Map each document to an answer
map_prompt = """Answer this question: {question}
Based on this context: {context}
Answer: """
map_chain = LLMChain(
    llm=ChatOpenAI(model="gpt-4"),
    prompt=PromptTemplate(template=map_prompt, input_variables=["question", "context"]),
)

# Reduce all answers into one
# (the placeholder name must match document_variable_name below)
reduce_prompt = """Combine these answers: {doc_summaries}
Into a comprehensive answer to: {question}
Answer: """
reduce_chain = LLMChain(
    llm=ChatOpenAI(model="gpt-4"),
    prompt=PromptTemplate(template=reduce_prompt, input_variables=["doc_summaries", "question"]),
)

# Combine
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain,
    document_variable_name="doc_summaries",
)
reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
)
map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="context",
)

Now when I ask about pricing across products, it retrieves from multiple docs and synthesizes a comparison.
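Stripped of the chain classes, the map-reduce pattern is small. The sketch below shows the control flow in plain Python, reusing the wording of the two prompts above. The `llm` parameter is a hypothetical callable that takes a prompt and returns text, and `fake_llm` is a stand-in so the example runs without an API key.

```python
from typing import Callable


def map_reduce_answer(question: str, docs: list[str], llm: Callable[[str], str]) -> str:
    """Answer a question per document (map), then merge the answers (reduce)."""
    # Map step: one independent LLM call per document
    partial_answers = [
        llm(f"Answer this question: {question}\nBased on this context: {doc}\nAnswer: ")
        for doc in docs
    ]
    # Reduce step: a single call that synthesizes the partial answers
    summaries = "\n".join(f"- {answer}" for answer in partial_answers)
    return llm(
        f"Combine these answers:\n{summaries}\n"
        f"Into a comprehensive answer to: {question}\nAnswer: "
    )


# Stand-in LLM that records every prompt it receives (illustration only)
calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"reply {len(calls)}"


answer = map_reduce_answer(
    "Compare the pricing plans",
    ["Plan A costs $10", "Plan B costs $20", "Plan C costs $30"],
    fake_llm,
)
print(len(calls))  # 4: three map calls plus one reduce call
```

The cost consequence is visible here: n documents mean n + 1 LLM calls, which is worth keeping in mind when raising k.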

Best Practices

DO ✓

1. Start with simple chunking

# Good: Start simple
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)

2. Use metadata for filtering

# Good: Add metadata
doc.metadata = {
    "source": "api-docs",
    "version": "2.0",
    "category": "authentication",
}

3. Test retrieval quality

# Good: Evaluate before using
queries = ["How to authenticate?", "What is ERR-504?"]
for query in queries:
    results = vectorstore.similarity_search(query, k=4)
    print(f"\nQuery: {query}")
    print(f"Results: {[r.page_content[:100] for r in results]}")

4. Monitor token usage

# Good: Track costs
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    response = qa_chain.run(query)
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Total cost: ${cb.total_cost}")

5. Incrementally improve

# Good: Start basic, then enhance
# Step 1: Basic RAG
# Step 2: Add metadata filtering
# Step 3: Implement hybrid search
# Step 4: Add reranking

DON’T ✗

1. Don’t skip overlap

# Bad: No overlap breaks context
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,  # Sentences get cut off
)

2. Don’t use too small chunks

# Bad: Too small loses context
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # Paragraphs get fragmented
    chunk_overlap=20,
)

3. Don’t ignore retrieval metrics

# Bad: Never test if retrieval works
# Just assuming it finds relevant docs
response = qa_chain.run(query) # Might use irrelevant context

4. Don’t use k=1

# Bad: Too few results
results = vectorstore.similarity_search(query, k=1) # Not enough context

5. Don’t embed everything at once

# Bad: Embed all 10k docs in one call
# Hits rate limits, costs a lot
embeddings.embed_documents(all_documents)  # Better: split into batches
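A batched version could look like the sketch below. The helper names `batched` and `embed_in_batches` are my own, the batch size is an arbitrary choice, and `fake_embed` is a stand-in so the example runs offline; any callable that maps a list of texts to a list of vectors can be passed as `embed_fn`.

```python
from typing import Callable, Iterator


def batched(items: list[str], batch_size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_in_batches(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    batch_size: int = 100,
) -> list[list[float]]:
    """Embed texts batch by batch, keeping the vectors in input order."""
    vectors: list[list[float]] = []
    for batch in batched(texts, batch_size):
        vectors.extend(embed_fn(batch))
    return vectors


# Stand-in embedding function that records batch sizes (illustration only)
seen_batch_sizes: list[int] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    seen_batch_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]


vectors = embed_in_batches(["a", "bb", "ccc", "dddd", "eeeee"], fake_embed, batch_size=2)
print(seen_batch_sizes)  # [2, 2, 1]
```

With the pipeline from Example 1, `embed_fn` would be `embeddings.embed_documents`. A sleep or retry between batches is a natural place to handle rate limits, though I have left that out here.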

Common Mistakes I Made

Mistake 1: I didn’t test retrieval quality before building the full RAG system.

I built the entire pipeline, then realized the chunks were too small and didn’t capture full procedures.

Fix: Now I always test retrieval first:

# Test before building
test_queries = [
    "How to reset password?",
    "What is the API timeout limit?",
    "Database connection errors",
]
for query in test_queries:
    docs = vectorstore.similarity_search(query, k=4)
    print(f"\n{query}")
    for i, doc in enumerate(docs, 1):
        print(f"{i}. {doc.page_content[:150]}...")
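Reading printed chunks by eye works for a handful of queries. To track retrieval quality over time, a simple number helps. The sketch below computes hit rate at k: the fraction of test queries whose expected document shows up in the top k results. The function `hit_rate_at_k`, the document IDs, and the `fake_search` stand-in are all hypothetical; the labeled query set is something you would have to write yourself.

```python
from typing import Callable


def hit_rate_at_k(
    labeled_queries: dict[str, str],
    search: Callable[[str, int], list[str]],
    k: int = 4,
) -> float:
    """Fraction of queries whose expected document ID is among the top-k results."""
    if not labeled_queries:
        return 0.0
    hits = sum(
        1
        for query, expected_id in labeled_queries.items()
        if expected_id in search(query, k)
    )
    return hits / len(labeled_queries)


# Stand-in retriever with canned results (illustration only)
canned_results = {
    "How to reset password?": ["doc_password", "doc_login"],
    "What is the API timeout limit?": ["doc_rate_limits", "doc_billing"],
}


def fake_search(query: str, k: int) -> list[str]:
    return canned_results.get(query, [])[:k]


labels = {
    "How to reset password?": "doc_password",
    "What is the API timeout limit?": "doc_timeouts",
}
score = hit_rate_at_k(labels, fake_search, k=4)
print(score)  # 0.5
```

Against the real vector store, `search` could wrap `vectorstore.similarity_search(query, k=k)` and return an identifier from each result's metadata, assuming every chunk carries one (for example a `source` field).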

Mistake 2: I used fixed chunk sizes for all document types.

Technical manuals need larger chunks than marketing copy.

Fix: Use document-specific chunking:

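# Adaptive chunking
if doc.metadata["type"] == "technical_manual":
    chunk_size = 1500
    overlap = 300
elif doc.metadata["type"] == "marketing":
    chunk_size = 500
    overlap = 100
else:
    # Fall back to the defaults from Example 1 for other document types
    chunk_size = 1000
    overlap = 200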

Mistake 3: I forgot to persist the vector store.

Every restart required re-embedding all documents.

Fix: Always persist:

# Save after creating
# (newer Chroma versions persist automatically when persist_directory is set)
vectorstore.persist()

# Load on startup
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)

Rag Architect works well with these skills:

  • planning-with-files: Use first to design your RAG architecture
  • backend-patterns: For API design around your RAG system
  • security-review: If handling sensitive documents in RAG

Summary

In this post, I showed how to use the Rag Architect skill in Claude Code. I covered when to trigger it, practical examples for documentation QA, hybrid search, and multi-document reasoning, and common mistakes to avoid.

The key point is that Rag Architect helps you think through RAG architecture decisions systematically: chunking, embeddings, retrieval, and generation. Start simple, test retrieval quality early, and incrementally improve based on your specific use case.

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to contact me, you can reach me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
