How to Handle CSV and Excel Files in RAG Applications

Feb 27, 2026

Purpose

This post demonstrates how to handle CSV and Excel files in RAG applications built with Spring Boot and Spring AI, without losing the semantic structure of tabular data.

Environment

Spring Boot 3.2
Spring AI 1.0.0
Apache Tika 2.9
Java 21

The Problem

When I built my first RAG application, I started with PDFs, DOCX files, and Markdown documents. Spring AI’s TikaDocumentReader worked great for those formats. Text extraction was straightforward, and I could chunk documents arbitrarily.

But when I tried to upload a CSV file with employee data, the RAG responses were terrible. I asked “Who is the oldest employee?” and got nonsense answers. Let me show you what went wrong.

Here’s the CSV file I used:

Name,Age,City,Department
John,25,NYC,Engineering
Jane,30,LA,Sales
Bob,45,Sales
Alice,35,NYC,Marketing

And here’s my naive processing code:

@Component
class NaiveCsvProcessor {

    List <Document> loadCsv(Resource resource) {
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        List <Document> docs = reader.read();

        // Generic text splitter
        TokenTextSplitter splitter = new TokenTextSplitter();
        return splitter.apply(docs);
    }
}

The problem? After Tika extracted the text, I got:

Name, Age, City, Department
John, 25, NYC, Engineering
Jane, 30, LA, Sales
Bob, 45, Sales
Alice, 35, NYC, Marketing

Then the text splitter chunked it arbitrarily:

Chunk 1: "Name, Age, City, Department\nJohn, 25, NYC,"
Chunk 2: "Engineering\nJane, 30, LA, Sales\nBob, 45,"
Chunk 3: "Sales\nAlice, 35, NYC, Marketing"

See the issue? Chunk 1 has “John, 25, NYC” but lost “Engineering”. Chunk 3 has “Sales” but we don’t know whose department it is. The row relationships are broken.

When I asked “Who works in Sales?”, the vector search found “Sales” in chunks 2 and 3, but the context was incomplete. The LLM couldn’t connect Jane or Bob to the Sales department reliably.

The Solution

I realized I needed to preserve table structure during extraction and chunking. Here’s what I did.

Step 1: Parse the table structure

Instead of treating CSV as plain text, I parse it to extract headers and rows:

@Component
class TableAwareCsvProcessor {

    List <Document> loadTableWithStructure(Resource resource) throws Exception {
        // Extract raw content with Tika
        TikaDocumentReader reader = new TikaDocumentReader(resource);
        Document rawDoc = reader.read().get(0);

        String content = rawDoc.getContent();
        String[] lines = content.split("\n");

        if (lines.length == 0) return List.of();

        // First row is headers
        String[] headers = parseCsvLine(lines[0]);

        List <Document> rowDocs = new ArrayList <>();

        // Create one document per row
        for (int i = 1; i < lines.length; i++) {
            String[] cells = parseCsvLine(lines[i]);

            // Build semantically rich content
            StringBuilder rowContent = new StringBuilder();
            rowContent.append("Table row ").append(i).append(": ");

            for (int j = 0; j < headers.length && j < cells.length; j++) {
                rowContent.append(headers[j])
                          .append("=")
                          .append(cells[j])
                          .append(", ");
            }

            // Add metadata for filtering
            Map <String, Object> metadata = new HashMap <>();
            metadata.put("file_type", "csv");
            metadata.put("row_index", i);
            metadata.put("source_file", resource.getFilename());

            Document rowDoc = new Document(
                rowContent.toString(),
                metadata
            );

            rowDocs.add(rowDoc);
        }

        return rowDocs;
    }

    private String[] parseCsvLine(String line) {
        // Use OpenCSV in production
        return line.split(",");
    }
}

Now each row becomes a separate document with content like:

Table row 1: Name=John, Age=25, City=NYC, Department=Engineering,
Table row 2: Name=Jane, Age=30, City=LA, Department=Sales,
Table row 3: Name=Bob, Age=45, City=null, Department=Sales,

The key insight? By prepending column names to values, the embeddings capture semantic meaning. “Age=25” is very different from “Department=Sales”. And each document is a complete row, no broken relationships.

Step 2: Store with rich metadata

The metadata lets me filter searches later:

@Service
class TableRagService {

    private final VectorStore vectorStore;
    private final ChatModel chatModel;

    public void ingestTableFile(Resource resource) throws Exception {
        TableAwareCsvProcessor processor = new TableAwareCsvProcessor();
        List <Document> rowDocs = processor.loadTableWithStructure(resource);

        vectorStore.add(rowDocs);

        System.out.println("Added " + rowDocs.size() +
            " table rows to vector store from " +
            resource.getFilename());
    }

    public String queryTable(String userQuery) {
        SearchRequest request = SearchRequest.builder()
            .query(userQuery)
            .topK(5)
            .similarityThreshold(0.7f)
            .build();

        List <Document> relevantRows = vectorStore.similaritySearch(request);

        String context = relevantRows.stream()
            .map(Document::getContent)
            .collect(Collectors.joining("\n\n"));

        String prompt = """
            Context from spreadsheet data:
            %s

            User query: %s

            Answer the query based on the table data above.
            """.formatted(context, userQuery);

        return chatModel.generate(prompt);
    }
}

Step 3: Test the improvement

When I query “Who works in Sales?”, the vector search now returns complete rows:

Table row 2: Name=Jane, Age=30, City=LA, Department=Sales,
Table row 3: Name=Bob, Age=45, City=null, Department=Sales,

The LLM can now correctly answer “Jane and Bob work in Sales.”

Why This Works

Let me explain the key differences:

Naive approach:

CSV → Plain text → Arbitrary chunks → Embeddings
     (loses structure)

Table-aware approach:

CSV → Structured rows → Header:Value pairs → Embeddings
      (preserves relationships)

By converting “John, 25” into “Name=John, Age=25”, the embeddings capture the semantic relationship between the column name and its value. When someone searches for “oldest employee”, the vector similarity finds “Age=45” with high relevance.

Handling Excel Files

Excel files are similar but with multiple sheets. Apache Tika can extract sheet metadata:

List <Document> loadExcelWorkbook(Resource resource) {
    TikaDocumentReader reader = new TikaDocumentReader(resource);

    return reader.read().stream()
        .filter(doc -> doc.getMetadata().containsKey("sheet_name"))
        .map(doc -> {
            // Enrich with sheet metadata
            Map <String, Object> enhancedMeta = new HashMap <>(doc.getMetadata());
            enhancedMeta.put("workbook", resource.getFilename());
            enhancedMeta.put("sheet", doc.getMetadata().get("sheet_name"));

            // Apply same row-aware processing
            return processSheetRows(doc, enhancedMeta);
        })
        .flatMap(List::stream)
        .toList();
}

Now I can filter queries by sheet: “Find employees in Q1 data” by filtering metadata.sheet == "Q1".

Common Mistakes I Made

Mistake 1: Using generic text splitters I tried TokenTextSplitter, RecursiveCharacterTextSplitter, and other splitters designed for prose. They all broke table rows. The solution? Don’t split tables. One row = one document.

Mistake 2: Ignoring headers Initially I just used cell values: “John, 25, NYC”. Queries like “Who is 25 years old?” worked, but “Who is the oldest?” failed because the LLM didn’t know 25 was an age. Prepending headers fixed this.

Mistake 3: Not adding metadata I forgot to store row_index and file_type in metadata. Later I wanted to filter results by row number but couldn’t. Always add metadata during ingestion.

Dependencies

Here’s what I added to my pom.xml:

<dependencies>
    <!-- Spring AI Core -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-spring-boot-autoconfigure</artifactId>
    </dependency>

    <!-- Apache Tika Document Reader -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-tika-document-reader</artifactId>
    </dependency>

    <!-- Vector Store (PGVector example) -->
    <dependency>
        <groupId>org.springframework.ai</groupId>
        <artifactId>spring-ai-pgvector-store</artifactId>
    </dependency>

    <!-- For production CSV parsing -->
    <dependency>
        <groupId>com.opencsv</groupId>
        <artifactId>opencsv</artifactId>
        <version>5.9</version>
    </dependency>
</dependencies>

Summary

In this post, I showed how to handle CSV and Excel files in RAG applications by preserving table structure during extraction. The key points are:

Parse tables to extract headers and rows
Create one document per row with header:value format
Store rich metadata for filtering
Don’t use generic text splitters on tabular data

This approach maintains semantic relationships between columns and values, enabling accurate RAG responses for analytical queries on spreadsheet data.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!