Skip to content

How to Handle RAG Over CSV/Excel Files with Spring Boot and Spring AI

Problem

When I tried to apply my existing RAG pipeline to CSV files, the results were unreliable.

I had a working RAG system for PDF and Markdown documents. The chunking strategy worked fine: split by paragraphs, embed with OpenAI, store in vector database, retrieve with semantic search. Users could ask questions and get accurate answers.

But when I tried loading a CSV file of customer data and running the same pipeline, I got nonsensical answers.

Terminal window
Query: "Which customers have Platinum tier?"
Result: "Based on the data, customers with high revenue include..."

The response missed the exact tier filter. When I looked at the retrieved chunks, I saw the problem: traditional token-based chunking had split rows across multiple chunks. A customer’s name might be in chunk 3, but their tier and revenue were in chunk 4.

Environment

  • Spring Boot 3.2
  • Spring AI 1.0.0-M4
  • OpenAI embeddings (text-embedding-3-small)
  • PostgreSQL with pgvector extension
  • Apache Commons CSV for CSV parsing
  • Apache POI for Excel files

What happened?

I tried loading a CSV file with 10,000 customer records using the same TokenTextSplitter I used for PDFs:

TokenTextSplitter splitter = new TokenTextSplitter();
List <Document> chunks = splitter.apply(csvDocuments);

The problem is that CSV data has structure that matters. Each row is a complete unit:

name,email,tier,revenue
John Smith,[email protected],Platinum,150000
Jane Doe,[email protected],Gold,75000

When the splitter tokenizes this purely by character count, it might create a chunk like:

Chunk 3: "John Smith,[email protected],Platinum,150000\nJane Doe,jane@exa"
Chunk 4: "mple.com,Gold,75000\n..."

Jane’s email is split across chunks. Now when a user asks “What’s Jane Doe’s email?”, the vector search retrieves chunk 4, but it only contains “mple.com” without context.

I also lost column metadata. In PDFs, I can store page numbers as metadata and it’s useful. But in CSV files, the column names ARE the context. Without knowing that “Platinum” belongs to the “tier” column, the LLM can’t filter accurately.

So I had two core issues:

  1. Row完整性 was broken by chunking
  2. Column context was lost without metadata

How to solve it?

I tried three approaches before finding the right pattern.

First attempt: Row-level documents

I stopped chunking entirely. Instead, I treated each CSV row as a separate Document:

Reader reader = new BufferedReader(new FileReader("customers.csv"));
CSVParser parser = CSVFormat.DEFAULT.withHeader().parse(reader);
List <Document> documents = parser.getRecords().stream()
.map(record -> {
String content = String.format(
"Customer: %s, Email: %s, Tier: %s, Revenue: %s",
record.get("name"),
record.get("email"),
record.get("tier"),
record.get("revenue")
);
Map <String, Object> metadata = new HashMap <>();
metadata.put("source", "customers.csv");
metadata.put("row_number", record.getRecordNumber());
metadata.put("tier", record.get("tier"));
metadata.put("revenue", record.get("revenue"));
metadata.put("data_type", "structured_customer");
return new Document(content, metadata);
})
.collect(Collectors.toList());

This worked better. Each row stayed intact as a semantic unit. The vector search could now retrieve complete customer records.

But I still had the second problem: when users asked “Show me Platinum customers”, the vector search alone wasn’t enough. The embeddings captured semantic similarity, but couldn’t enforce exact matches on the tier column.

Second attempt: Hybrid retrieval

I needed to combine vector similarity with structured filters. Spring AI supports filter expressions on metadata:

@Service
public class StructuredDataRAGService {
private final VectorStore vectorStore;
private final ChatClient chatClient;
public String query(String userQuery, String tierFilter, Double minRevenue) {
// Build search request with metadata filters
SearchRequest.Builder requestBuilder = SearchRequest.query(userQuery)
.withTopK(5)
.withSimilarityThreshold(0.7);
// Add exact filters
if (tierFilter != null) {
requestBuilder.withFilterExpression(
Expression.FilterExpression.builder()
.eq("tier", tierFilter)
.build()
);
}
if (minRevenue != null) {
requestBuilder.withFilterExpression(
Expression.FilterExpression.builder()
.gte("revenue", minRevenue)
.build()
);
}
List <Document> relevantDocs = vectorStore.similaritySearch(requestBuilder.build());
// Build context from filtered results
String context = relevantDocs.stream()
.map(Document::getContent)
.collect(Collectors.joining("\n\n"));
String prompt = String.format(
"User question: %s\n\nRelevant customer data:\n%s\n\nPlease answer based on this data.",
userQuery, context
);
return chatClient.call(prompt);
}
}

Now when I query:

ragService.query("high-value customers", "Platinum", 100000.0);

The system:

  1. Uses vector search to find semantically similar rows
  2. Filters to only rows where tier=Platinum AND revenue>=100000
  3. Augments the prompt with filtered results
  4. Generates accurate response

Third attempt: Table-aware chunking for wide tables

The row-level approach works for narrow tables. But what about wide tables with 50+ columns? A single row might exceed token limits if I serialize all columns.

I tried a table-aware chunker that groups multiple rows while preserving headers:

public class TableAwareChunker {
private final int maxRowsPerChunk;
private final List <String> headers;
public List <Document> chunkRows(List <CSVRecord> records) {
List <Document> chunks = new ArrayList <>();
for (int i = 0; i < records.size(); i += maxRowsPerChunk) {
int end = Math.min(i + maxRowsPerChunk, records.size());
List <CSVRecord> chunkRecords = records.subList(i, end);
// Create content with headers repeated for context
String content = chunkRecords.stream()
.map(record -> headers.stream()
.map(col -> String.format("%s=%s", col, record.get(col)))
.collect(Collectors.joining(", ")))
.collect(Collectors.joining("\n"));
Map <String, Object> metadata = Map.of(
"source", "customers.csv",
"row_start", i + 1,
"row_end", end,
"headers", headers.toString()
);
chunks.add(new Document(content, metadata));
}
return chunks;
}
}

For a 50-column table, I might set maxRowsPerChunk=3 to stay within token limits while keeping column names visible in each chunk.

The reason

I think the key difference between structured and unstructured RAG is完整性 preservation.

With PDFs or Markdown, splitting a paragraph across chunks still preserves semantic meaning. The LLM can understand sentence fragments and reconstruct context.

But with CSV/Excel:

  • Row完整性 is critical - a split row loses the relationship between columns
  • Column names are metadata - without them, values lose meaning
  • Queries are often structured - “tier=Platinum” needs exact filtering, not semantic similarity

The solution requires:

  1. Treating each row (or small row groups) as a semantic unit
  2. Storing column values as metadata for filtering
  3. Using hybrid retrieval: vector search finds similar rows, metadata filters enforce constraints

Architecture pattern

Here’s the flow that works:

CSV/Excel Files
Apache Commons CSV / Apache POI (parse rows)
Row-Level Documents (each row = 1 Document)
Metadata Enrichment (add columns as metadata)
Vector Embedding (OpenAI text-embedding-3-small)
Vector Database (PostgreSQL pgvector with indexed metadata)
Hybrid Retrieval (vector similarity + metadata filters)
LLM Generation (ChatGPT with filtered context)

For large Excel files (100K+ rows), I use streaming with Apache POI to avoid loading the entire file into memory:

try (InputStream is = new FileInputStream("large.xlsx");
Workbook workbook = WorkbookFactory.create(is)) {
Sheet sheet = workbook.getSheetAt(0);
Row headerRow = sheet.getRow(0);
for (int i = 1; i <= sheet.getLastRowNum(); i++) {
Row row = sheet.getRow(i);
// Process one row at a time
Document doc = createRowDocument(row, headerRow);
vectorStore.add(List.of(doc));
}
}

Summary

In this post, I showed how to implement RAG over structured CSV/Excel data with Spring Boot and Spring AI. The key point is to treat each row as a semantic unit with column metadata, then use hybrid retrieval combining vector search with structured filters.

The main difference from unstructured RAG:

  • Create row-level documents instead of chunking by tokens
  • Store column values as metadata for exact filtering
  • Use SearchRequest.withFilterExpression() for hybrid queries
  • Stream large Excel files to avoid memory issues

This approach preserves the row/column relationships that make structured data meaningful, while still leveraging semantic search for natural language queries.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments