How to Handle CSV and Excel Files in RAG Applications
Purpose
This post demonstrates how to handle CSV and Excel files in RAG applications built with Spring Boot and Spring AI, without losing the semantic structure of tabular data.
Environment
- Spring Boot 3.2
- Spring AI 1.0.0
- Apache Tika 2.9
- Java 21
The Problem
When I built my first RAG application, I started with PDFs, DOCX files, and Markdown documents. Spring AI’s TikaDocumentReader worked great for those formats. Text extraction was straightforward, and I could chunk documents arbitrarily.
But when I tried to upload a CSV file with employee data, the RAG responses were terrible. I asked “Who is the oldest employee?” and got nonsense answers. Let me show you what went wrong.
Here’s the CSV file I used:
Name,Age,City,DepartmentJohn,25,NYC,EngineeringJane,30,LA,SalesBob,45,SalesAlice,35,NYC,MarketingAnd here’s my naive processing code:
@Componentclass NaiveCsvProcessor {
List <Document> loadCsv(Resource resource) { TikaDocumentReader reader = new TikaDocumentReader(resource); List <Document> docs = reader.read();
// Generic text splitter TokenTextSplitter splitter = new TokenTextSplitter(); return splitter.apply(docs); }}The problem? After Tika extracted the text, I got:
Name, Age, City, DepartmentJohn, 25, NYC, EngineeringJane, 30, LA, SalesBob, 45, SalesAlice, 35, NYC, MarketingThen the text splitter chunked it arbitrarily:
Chunk 1: "Name, Age, City, Department\nJohn, 25, NYC,"Chunk 2: "Engineering\nJane, 30, LA, Sales\nBob, 45,"Chunk 3: "Sales\nAlice, 35, NYC, Marketing"See the issue? Chunk 1 has “John, 25, NYC” but lost “Engineering”. Chunk 3 has “Sales” but we don’t know whose department it is. The row relationships are broken.
When I asked “Who works in Sales?”, the vector search found “Sales” in chunks 2 and 3, but the context was incomplete. The LLM couldn’t connect Jane or Bob to the Sales department reliably.
The Solution
I realized I needed to preserve table structure during extraction and chunking. Here’s what I did.
Step 1: Parse the table structure
Instead of treating CSV as plain text, I parse it to extract headers and rows:
@Componentclass TableAwareCsvProcessor {
List <Document> loadTableWithStructure(Resource resource) throws Exception { // Extract raw content with Tika TikaDocumentReader reader = new TikaDocumentReader(resource); Document rawDoc = reader.read().get(0);
String content = rawDoc.getContent(); String[] lines = content.split("\n");
if (lines.length == 0) return List.of();
// First row is headers String[] headers = parseCsvLine(lines[0]);
List <Document> rowDocs = new ArrayList <>();
// Create one document per row for (int i = 1; i < lines.length; i++) { String[] cells = parseCsvLine(lines[i]);
// Build semantically rich content StringBuilder rowContent = new StringBuilder(); rowContent.append("Table row ").append(i).append(": ");
for (int j = 0; j < headers.length && j < cells.length; j++) { rowContent.append(headers[j]) .append("=") .append(cells[j]) .append(", "); }
// Add metadata for filtering Map <String, Object> metadata = new HashMap <>(); metadata.put("file_type", "csv"); metadata.put("row_index", i); metadata.put("source_file", resource.getFilename());
Document rowDoc = new Document( rowContent.toString(), metadata );
rowDocs.add(rowDoc); }
return rowDocs; }
private String[] parseCsvLine(String line) { // Use OpenCSV in production return line.split(","); }}Now each row becomes a separate document with content like:
Table row 1: Name=John, Age=25, City=NYC, Department=Engineering,Table row 2: Name=Jane, Age=30, City=LA, Department=Sales,Table row 3: Name=Bob, Age=45, City=null, Department=Sales,The key insight? By prepending column names to values, the embeddings capture semantic meaning. “Age=25” is very different from “Department=Sales”. And each document is a complete row, no broken relationships.
Step 2: Store with rich metadata
The metadata lets me filter searches later:
@Serviceclass TableRagService {
private final VectorStore vectorStore; private final ChatModel chatModel;
public void ingestTableFile(Resource resource) throws Exception { TableAwareCsvProcessor processor = new TableAwareCsvProcessor(); List <Document> rowDocs = processor.loadTableWithStructure(resource);
vectorStore.add(rowDocs);
System.out.println("Added " + rowDocs.size() + " table rows to vector store from " + resource.getFilename()); }
public String queryTable(String userQuery) { SearchRequest request = SearchRequest.builder() .query(userQuery) .topK(5) .similarityThreshold(0.7f) .build();
List <Document> relevantRows = vectorStore.similaritySearch(request);
String context = relevantRows.stream() .map(Document::getContent) .collect(Collectors.joining("\n\n"));
String prompt = """ Context from spreadsheet data: %s
User query: %s
Answer the query based on the table data above. """.formatted(context, userQuery);
return chatModel.generate(prompt); }}Step 3: Test the improvement
When I query “Who works in Sales?”, the vector search now returns complete rows:
Table row 2: Name=Jane, Age=30, City=LA, Department=Sales,Table row 3: Name=Bob, Age=45, City=null, Department=Sales,The LLM can now correctly answer “Jane and Bob work in Sales.”
Why This Works
Let me explain the key differences:
Naive approach:
CSV → Plain text → Arbitrary chunks → Embeddings (loses structure)Table-aware approach:
CSV → Structured rows → Header:Value pairs → Embeddings (preserves relationships)By converting “John, 25” into “Name=John, Age=25”, the embeddings capture the semantic relationship between the column name and its value. When someone searches for “oldest employee”, the vector similarity finds “Age=45” with high relevance.
Handling Excel Files
Excel files are similar but with multiple sheets. Apache Tika can extract sheet metadata:
List <Document> loadExcelWorkbook(Resource resource) { TikaDocumentReader reader = new TikaDocumentReader(resource);
return reader.read().stream() .filter(doc -> doc.getMetadata().containsKey("sheet_name")) .map(doc -> { // Enrich with sheet metadata Map <String, Object> enhancedMeta = new HashMap <>(doc.getMetadata()); enhancedMeta.put("workbook", resource.getFilename()); enhancedMeta.put("sheet", doc.getMetadata().get("sheet_name"));
// Apply same row-aware processing return processSheetRows(doc, enhancedMeta); }) .flatMap(List::stream) .toList();}Now I can filter queries by sheet: “Find employees in Q1 data” by filtering metadata.sheet == "Q1".
Common Mistakes I Made
Mistake 1: Using generic text splitters
I tried TokenTextSplitter, RecursiveCharacterTextSplitter, and other splitters designed for prose. They all broke table rows. The solution? Don’t split tables. One row = one document.
Mistake 2: Ignoring headers Initially I just used cell values: “John, 25, NYC”. Queries like “Who is 25 years old?” worked, but “Who is the oldest?” failed because the LLM didn’t know 25 was an age. Prepending headers fixed this.
Mistake 3: Not adding metadata
I forgot to store row_index and file_type in metadata. Later I wanted to filter results by row number but couldn’t. Always add metadata during ingestion.
Dependencies
Here’s what I added to my pom.xml:
<dependencies> <!-- Spring AI Core --> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-spring-boot-autoconfigure</artifactId> </dependency>
<!-- Apache Tika Document Reader --> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-tika-document-reader</artifactId> </dependency>
<!-- Vector Store (PGVector example) --> <dependency> <groupId>org.springframework.ai</groupId> <artifactId>spring-ai-pgvector-store</artifactId> </dependency>
<!-- For production CSV parsing --> <dependency> <groupId>com.opencsv</groupId> <artifactId>opencsv</artifactId> <version>5.9</version> </dependency></dependencies>Summary
In this post, I showed how to handle CSV and Excel files in RAG applications by preserving table structure during extraction. The key points are:
- Parse tables to extract headers and rows
- Create one document per row with header:value format
- Store rich metadata for filtering
- Don’t use generic text splitters on tabular data
This approach maintains semantic relationships between columns and values, enabling accurate RAG responses for analytical queries on spreadsheet data.
Final Words + More Resources
My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me
Here are also the most important links from this article along with some further resources that will help you in this scope:
Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!
Comments