How to use the Rag Architect skill in Claude Code for data-ml development
Purpose
This post demonstrates how to use the Rag Architect skill in Claude Code to build retrieval-augmented generation systems. I will show you when to trigger it, what patterns work best, and how to avoid common mistakes.
Environment
- Claude Code with claude-skills plugin
- Python 3.9+
- Basic knowledge of embeddings and vector databases
What is Rag Architect?
Rag Architect is a specialized skill that helps design and implement RAG systems. It focuses on the architecture decisions that make retrieval-augmented generation work well.
When I need to build a system that combines document retrieval with LLM generation, I use this skill. It helps me think through:
Document Store → Vector Embeddings → Retrieval → LLM Generation → Response
The skill covers these key areas:
- chunking strategy: How to split documents for retrieval
- embedding models: Which models to use for vectorization
- vector databases: Storage and similarity search options
- retrieval strategies: How to find relevant documents
- generation prompts: How to format retrieved context for the LLM
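To make the stages concrete, here is a minimal sketch of the whole flow in plain Python. It is a toy under stated assumptions: the "embedding" is just a bag-of-words count (a stand-in for a real embedding model), and the function names embed, cosine, retrieve, and build_prompt are my own for illustration, not part of the skill or of any library.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, context_chunks: list[str]) -> str:
    """Format retrieved context for the generation step."""
    context = "\n---\n".join(context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


chunks = [
    "API keys are created in the dashboard under Settings.",
    "The API timeout limit is 30 seconds per request.",
    "Passwords can be reset from the login page.",
]
question = "What is the API timeout limit?"
top = retrieve(question, chunks, k=1)
prompt = build_prompt(question, top)
print(prompt)
```

The prompt string is what would be sent to the LLM in the generation step. Each function maps to one arrow in the diagram, which is also where each architecture decision gets made.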
When to use Rag Architect
I trigger this skill when I face these scenarios:
Scenario 1: Building a question-answering system over company documents
I need a chatbot that answers questions from our PDF documentation.
Scenario 2: Adding context to LLM responses from a knowledge base
I want GPT-4 to reference our product manuals when answering customer questions.
Scenario 3: Building semantic search over unstructured text
I need to find similar documents based on meaning, not keywords.
I don’t use this skill for:
- Simple keyword search (use Elasticsearch instead)
- Fixed-context prompts (no retrieval needed)
- Pure generation without external knowledge
Installation and Setup
First, I install the claude-skills plugin:
npm install -g claude-skills
Then I verify the Rag Architect skill is available:
claude-skill list | grep rag-architect
I should see:
rag-architect - RAG architecture patterns and best practices
Now I can invoke it in Claude Code by using the /rag-architect command.
How to trigger Rag Architect
I found several ways to activate this skill effectively.
Method 1: Direct command
/rag-architect
Method 2: Natural language trigger
I need to design a RAG system for my documentation
Method 3: Contextual invocation
How should I chunk my technical docs for better retrieval?
When I use any of these, Claude loads the Rag Architect skill and provides specialized guidance for RAG architecture.
Practical Examples
Example 1: Basic RAG for documentation
I needed a simple RAG system for my company’s knowledge base. I started with:
/rag-architect
I have 500 PDF files. How should I build a RAG system to answer questions from them?
The skill guided me through the architecture:
1. Chunking: Use recursive character splitting with overlap
   - Chunk size: 1000 tokens
   - Overlap: 200 tokens
   - Keeps context intact across chunks
2. Embeddings: OpenAI text-embedding-3-small
   - Cost-effective
   - Good performance for technical docs
   - 1536 dimensions
3. Vector Store: ChromaDB
   - Local deployment
   - Easy integration with LangChain
   - Handles 500 docs easily
4. Retrieval: Similarity search with k=4
   - Returns top 4 chunks
   - Balance between context and token limit
Here’s the chunking strategy it recommended:
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Configure splitter for technical docs
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,      # max chunk length; counted in characters because length_function=len
    chunk_overlap=200,    # context preservation (same unit as chunk_size)
    length_function=len,
    separators=["\n\n", "\n", " ", ""],
)

# Split documents (documents is the list of already-loaded Document objects)
chunks = splitter.split_documents(documents)
I then implemented the full pipeline:
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Create embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Store vectors
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4"),
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
When I query the system:
response = qa_chain.run("How do I configure the API authentication?")
I get accurate answers based on my documentation.
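To see what the overlap setting actually does, here is a minimal sliding-window sketch in plain Python. It is a simplification, not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators such as paragraph boundaries). The function name chunk_with_overlap is my own, and lengths are counted in characters, as they are when length_function=len.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size windows where neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early enough that the last window is not fully contained in the previous one.
    last_start = max(len(text) - overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, last_start, step)]


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each chunk repeats the last two characters of the previous one. With chunk_size=1000 and chunk_overlap=200 the same idea applies at a larger scale: a sentence that straddles a boundary still appears whole in at least one chunk, as long as it is shorter than the overlap.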
Example 2: Hybrid search for better accuracy
I noticed that pure semantic search sometimes missed exact technical terms. I tried:
/rag-architect
My RAG system fails when users search for specific error codes. "ERR-504" doesn't match semantic search well.
The skill suggested hybrid search:
The problem:
- Semantic search finds similar concepts
- But exact terms like "ERR-504" need keyword matching
Solution: Hybrid search
- Combine vector similarity with keyword filtering
- Use metadata filtering for error codes
- Score fusion to rank results
Here’s the hybrid approach I implemented:
from langchain.retrievers import SelfQueryRetriever
from langchain.chains.query_constructor.base import AttributeInfo
from langchain.chat_models import ChatOpenAI

# Define metadata fields
metadata_field_info = [
    AttributeInfo(
        name="error_code",
        description="Error code like ERR-504",
        type="string",
    ),
    AttributeInfo(
        name="category",
        description="Document category",
        type="string",
    ),
]

# Create self-query retriever
retriever = SelfQueryRetriever.from_llm(
    llm=ChatOpenAI(model="gpt-4"),
    vectorstore=vectorstore,
    document_contents="Technical documentation",
    metadata_field_info=metadata_field_info,
)

# Query: the self-query retriever asks the LLM to derive the metadata filter
# (here error_code == "ERR-504") from the query text, so no filter argument is passed.
# This only works if the chunks were stored with an error_code metadata field.
results = retriever.get_relevant_documents("ERR-504 authentication timeout")
Now when I search for “ERR-504”, the retriever first filters by error code, then ranks the remaining chunks by semantic similarity.
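The "score fusion" item from the skill's suggestion is not covered by the self-query retriever itself. One common way to do it is reciprocal rank fusion (RRF), which I sketch below in plain Python. This is a technique I am swapping in for illustration, not something the skill output prescribed; the document IDs are made up, and k=60 is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    so documents ranked well by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword search and a vector search
keyword_hits = ["doc_err504", "doc_timeouts", "doc_auth"]
vector_hits = ["doc_auth", "doc_err504", "doc_retries"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_err504', 'doc_auth', 'doc_timeouts', 'doc_retries']
```

RRF only needs ranks, not raw scores, so it avoids the problem of keyword scores and cosine similarities living on different scales.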
Example 3: Multi-document reasoning
I needed to answer questions that required information from multiple documents:
/rag-architect
Users ask questions like "Compare the pricing plans across our three products". This requires retrieving from multiple docs.
The skill suggested multi-document retrieval:
Strategy:
1. Retrieve more chunks (k=8 instead of k=4)
2. Use Map-Reduce chain for complex reasoning
3. Let the LLM synthesize across documents
Here’s the implementation:
from langchain.chains import LLMChain, MapReduceDocumentsChain, ReduceDocumentsChain
from langchain.chains.combine_documents.stuff import StuffDocumentsChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Map each document to an answer
map_prompt = """Answer this question: {question}
Based on this context: {context}
Answer: """

map_chain = LLMChain(
    llm=ChatOpenAI(model="gpt-4"),
    prompt=PromptTemplate(template=map_prompt, input_variables=["question", "context"]),
)

# Reduce all answers into one
# The placeholder must match document_variable_name below: doc_summaries (no space)
reduce_prompt = """Combine these answers: {doc_summaries}
Into a comprehensive answer to: {question}
Answer: """

reduce_chain = LLMChain(
    llm=ChatOpenAI(model="gpt-4"),
    prompt=PromptTemplate(template=reduce_prompt, input_variables=["doc_summaries", "question"]),
)

# Combine
combine_documents_chain = StuffDocumentsChain(
    llm_chain=reduce_chain,
    document_variable_name="doc_summaries",
)

reduce_documents_chain = ReduceDocumentsChain(
    combine_documents_chain=combine_documents_chain,
    collapse_documents_chain=combine_documents_chain,
)

map_reduce_chain = MapReduceDocumentsChain(
    llm_chain=map_chain,
    reduce_documents_chain=reduce_documents_chain,
    document_variable_name="context",
)
Now when I ask about pricing across products, it retrieves from multiple docs and synthesizes a comparison.
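Stripped of the chain classes, the map-reduce pattern is small enough to sketch in plain Python. The version below is an illustration under assumptions: map_reduce_answer and fake_llm are names I made up, and fake_llm is a stand-in that only records the prompts it receives instead of calling a model.

```python
from typing import Callable, List


def map_reduce_answer(question: str, docs: List[str], llm: Callable[[str], str]) -> str:
    """Map: answer the question per document. Reduce: merge the partial answers."""
    partial_answers = [
        llm(f"Answer this question: {question}\nBased on this context: {doc}\nAnswer:")
        for doc in docs
    ]
    summaries = "\n".join(partial_answers)
    return llm(
        f"Combine these answers:\n{summaries}\n"
        f"Into a comprehensive answer to: {question}\nAnswer:"
    )


# Hypothetical stand-in for a model call; swap in a real LLM client here.
prompts_seen: List[str] = []


def fake_llm(prompt: str) -> str:
    prompts_seen.append(prompt)
    return f"partial-{len(prompts_seen)}"


docs = ["Plan A costs $10.", "Plan B costs $20.", "Plan C costs $30."]
final = map_reduce_answer("Compare the pricing plans", docs, fake_llm)
print(len(prompts_seen))  # 4: one map call per document plus one reduce call
```

The call count is the main cost to keep in mind: with k=8 retrieved chunks, every question triggers nine model calls instead of one.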
Best Practices
DO ✓
1. Start with simple chunking
# Good: Start simple
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)
2. Use metadata for filtering
# Good: Add metadata
doc.metadata = {
    "source": "api-docs",
    "version": "2.0",
    "category": "authentication",
}
3. Test retrieval quality
# Good: Evaluate before using
queries = ["How to authenticate?", "What is ERR-504?"]
for query in queries:
    results = vectorstore.similarity_search(query, k=4)
    print(f"\nQuery: {query}")
    print(f"Results: {[r.page_content[:100] for r in results]}")
4. Monitor token usage
# Good: Track costs
from langchain.callbacks import get_openai_callback

with get_openai_callback() as cb:
    response = qa_chain.run(query)
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Total cost: ${cb.total_cost}")
5. Incrementally improve
# Good: Start basic, then enhance
# Step 1: Basic RAG
# Step 2: Add metadata filtering
# Step 3: Implement hybrid search
# Step 4: Add reranking
DON’T ✗
1. Don’t skip overlap
# Bad: No overlap breaks context
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=0,  # Sentences get cut off
)
2. Don’t use too small chunks
# Bad: Too small loses context
splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,  # Paragraphs get fragmented
    chunk_overlap=20,
)
3. Don’t ignore retrieval metrics
# Bad: Never test if retrieval works
# Just assuming it finds relevant docs
response = qa_chain.run(query)  # Might use irrelevant context
4. Don’t use k=1
# Bad: Too few results
results = vectorstore.similarity_search(query, k=1)  # Not enough context
5. Don’t embed everything at once
# Bad: Embed all 10k docs in one call
# Hits rate limits, costs a lot
embeddings.embed_documents(all_documents)  # Better: split into batches
Common Mistakes I Made
Mistake 1: I didn’t test retrieval quality before building the full RAG system.
I built the entire pipeline, then realized the chunks were too small and didn’t capture full procedures.
Fix: Now I always test retrieval first:
# Test before building
test_queries = [
    "How to reset password?",
    "What is the API timeout limit?",
    "Database connection errors",
]

for query in test_queries:
    docs = vectorstore.similarity_search(query, k=4)
    print(f"\n{query}")
    for i, doc in enumerate(docs, 1):
        print(f"{i}. {doc.page_content[:150]}...")
Mistake 2: I used fixed chunk sizes for all document types.
Technical manuals need larger chunks than marketing copy.
Fix: Use document-specific chunking:
# Adaptive chunking
if doc.metadata["type"] == "technical_manual":
    chunk_size = 1500
    overlap = 300
elif doc.metadata["type"] == "marketing":
    chunk_size = 500
    overlap = 100
Mistake 3: I forgot to persist the vector store.
Every restart required re-embedding all documents.
Fix: Always persist:
# Save after creating
vectorstore.persist()

# Load on startup
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
)
Related Skills
Rag Architect works well with these skills:
- planning-with-files: Use first to design your RAG architecture
- backend-patterns: For API design around your RAG system
- security-review: If handling sensitive documents in RAG
Summary
In this post, I showed how to use the Rag Architect skill in Claude Code. I covered when to trigger it, practical examples for documentation QA, hybrid search, and multi-document reasoning, and common mistakes to avoid.
The key point is that Rag Architect helps you think through RAG architecture decisions systematically: chunking, embeddings, retrieval, and generation. Start simple, test retrieval quality early, and incrementally improve based on your specific use case.
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Claude Skills Documentation
- Claude Skills GitHub Repository
- RAG Architecture Best Practices
- Vector Database Guide
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!