How to Handle RAG Over CSV/Excel Files with Spring Boot and Spring AI
Problem
When I tried to apply my existing RAG pipeline to CSV files, the results were unreliable.
I had a working RAG system for PDF and Markdown documents. The chunking strategy worked fine: split by paragraphs, embed with OpenAI, store in vector database, retrieve with semantic search. Users could ask questions and get accurate answers.
But when I tried loading a CSV file of customer data and running the same pipeline, I got nonsensical answers.
Query: "Which customers have Platinum tier?"Result: "Based on the data, customers with high revenue include..."The response missed the exact tier filter. When I looked at the retrieved chunks, I saw the problem: traditional token-based chunking had split rows across multiple chunks. A customer’s name might be in chunk 3, but their tier and revenue were in chunk 4.
Environment
- Spring Boot 3.2
- Spring AI 1.0.0-M4
- OpenAI embeddings (text-embedding-3-small)
- PostgreSQL with pgvector extension
- Apache Commons CSV for CSV parsing
- Apache POI for Excel files
What happened?
I tried loading a CSV file with 10,000 customer records using the same TokenTextSplitter I used for PDFs:
TokenTextSplitter splitter = new TokenTextSplitter();List <Document> chunks = splitter.apply(csvDocuments);The problem is that CSV data has structure that matters. Each row is a complete unit:
name,email,tier,revenueJohn Smith,[email protected],Platinum,150000Jane Doe,[email protected],Gold,75000When the splitter tokenizes this purely by character count, it might create a chunk like:
Chunk 3: "John Smith,[email protected],Platinum,150000\nJane Doe,jane@exa"Chunk 4: "mple.com,Gold,75000\n..."Jane’s email is split across chunks. Now when a user asks “What’s Jane Doe’s email?”, the vector search retrieves chunk 4, but it only contains “mple.com” without context.
I also lost column metadata. In PDFs, I can store page numbers as metadata and it’s useful. But in CSV files, the column names ARE the context. Without knowing that “Platinum” belongs to the “tier” column, the LLM can’t filter accurately.
So I had two core issues:
- Row完整性 was broken by chunking
- Column context was lost without metadata
How to solve it?
I tried three approaches before finding the right pattern.
First attempt: Row-level documents
I stopped chunking entirely. Instead, I treated each CSV row as a separate Document:
Reader reader = new BufferedReader(new FileReader("customers.csv"));CSVParser parser = CSVFormat.DEFAULT.withHeader().parse(reader);
List <Document> documents = parser.getRecords().stream() .map(record -> { String content = String.format( "Customer: %s, Email: %s, Tier: %s, Revenue: %s", record.get("name"), record.get("email"), record.get("tier"), record.get("revenue") );
Map <String, Object> metadata = new HashMap <>(); metadata.put("source", "customers.csv"); metadata.put("row_number", record.getRecordNumber()); metadata.put("tier", record.get("tier")); metadata.put("revenue", record.get("revenue")); metadata.put("data_type", "structured_customer");
return new Document(content, metadata); }) .collect(Collectors.toList());This worked better. Each row stayed intact as a semantic unit. The vector search could now retrieve complete customer records.
But I still had the second problem: when users asked “Show me Platinum customers”, the vector search alone wasn’t enough. The embeddings captured semantic similarity, but couldn’t enforce exact matches on the tier column.
Second attempt: Hybrid retrieval
I needed to combine vector similarity with structured filters. Spring AI supports filter expressions on metadata:
@Servicepublic class StructuredDataRAGService {
private final VectorStore vectorStore; private final ChatClient chatClient;
public String query(String userQuery, String tierFilter, Double minRevenue) { // Build search request with metadata filters SearchRequest.Builder requestBuilder = SearchRequest.query(userQuery) .withTopK(5) .withSimilarityThreshold(0.7);
// Add exact filters if (tierFilter != null) { requestBuilder.withFilterExpression( Expression.FilterExpression.builder() .eq("tier", tierFilter) .build() ); }
if (minRevenue != null) { requestBuilder.withFilterExpression( Expression.FilterExpression.builder() .gte("revenue", minRevenue) .build() ); }
List <Document> relevantDocs = vectorStore.similaritySearch(requestBuilder.build());
// Build context from filtered results String context = relevantDocs.stream() .map(Document::getContent) .collect(Collectors.joining("\n\n"));
String prompt = String.format( "User question: %s\n\nRelevant customer data:\n%s\n\nPlease answer based on this data.", userQuery, context );
return chatClient.call(prompt); }}Now when I query:
ragService.query("high-value customers", "Platinum", 100000.0);The system:
- Uses vector search to find semantically similar rows
- Filters to only rows where tier=Platinum AND revenue>=100000
- Augments the prompt with filtered results
- Generates accurate response
Third attempt: Table-aware chunking for wide tables
The row-level approach works for narrow tables. But what about wide tables with 50+ columns? A single row might exceed token limits if I serialize all columns.
I tried a table-aware chunker that groups multiple rows while preserving headers:
public class TableAwareChunker {
private final int maxRowsPerChunk; private final List <String> headers;
public List <Document> chunkRows(List <CSVRecord> records) { List <Document> chunks = new ArrayList <>();
for (int i = 0; i < records.size(); i += maxRowsPerChunk) { int end = Math.min(i + maxRowsPerChunk, records.size()); List <CSVRecord> chunkRecords = records.subList(i, end);
// Create content with headers repeated for context String content = chunkRecords.stream() .map(record -> headers.stream() .map(col -> String.format("%s=%s", col, record.get(col))) .collect(Collectors.joining(", "))) .collect(Collectors.joining("\n"));
Map <String, Object> metadata = Map.of( "source", "customers.csv", "row_start", i + 1, "row_end", end, "headers", headers.toString() );
chunks.add(new Document(content, metadata)); }
return chunks; }}For a 50-column table, I might set maxRowsPerChunk=3 to stay within token limits while keeping column names visible in each chunk.
The reason
I think the key difference between structured and unstructured RAG is完整性 preservation.
With PDFs or Markdown, splitting a paragraph across chunks still preserves semantic meaning. The LLM can understand sentence fragments and reconstruct context.
But with CSV/Excel:
- Row完整性 is critical - a split row loses the relationship between columns
- Column names are metadata - without them, values lose meaning
- Queries are often structured - “tier=Platinum” needs exact filtering, not semantic similarity
The solution requires:
- Treating each row (or small row groups) as a semantic unit
- Storing column values as metadata for filtering
- Using hybrid retrieval: vector search finds similar rows, metadata filters enforce constraints
Architecture pattern
Here’s the flow that works:
CSV/Excel Files ↓Apache Commons CSV / Apache POI (parse rows) ↓Row-Level Documents (each row = 1 Document) ↓Metadata Enrichment (add columns as metadata) ↓Vector Embedding (OpenAI text-embedding-3-small) ↓Vector Database (PostgreSQL pgvector with indexed metadata) ↓Hybrid Retrieval (vector similarity + metadata filters) ↓LLM Generation (ChatGPT with filtered context)For large Excel files (100K+ rows), I use streaming with Apache POI to avoid loading the entire file into memory:
try (InputStream is = new FileInputStream("large.xlsx"); Workbook workbook = WorkbookFactory.create(is)) {
Sheet sheet = workbook.getSheetAt(0); Row headerRow = sheet.getRow(0);
for (int i = 1; i <= sheet.getLastRowNum(); i++) { Row row = sheet.getRow(i); // Process one row at a time Document doc = createRowDocument(row, headerRow); vectorStore.add(List.of(doc)); }}Summary
In this post, I showed how to implement RAG over structured CSV/Excel data with Spring Boot and Spring AI. The key point is to treat each row as a semantic unit with column metadata, then use hybrid retrieval combining vector search with structured filters.
The main difference from unstructured RAG:
- Create row-level documents instead of chunking by tokens
- Store column values as metadata for exact filtering
- Use
SearchRequest.withFilterExpression()for hybrid queries - Stream large Excel files to avoid memory issues
This approach preserves the row/column relationships that make structured data meaningful, while still leveraging semantic search for natural language queries.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
- 👨💻 Spring AI Reference Documentation
- 👨💻 Apache Commons CSV
- 👨💻 Apache POI - Java API to Handle Microsoft Office Formats
- 👨💻 Retrieval-Augmented Generation (RAG) on Wikipedia
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments