Is Apache Tika suitable for tabular data in RAG systems?

Problem

When I built a RAG system with Spring Boot and Spring AI that needed to process Excel files, I used Apache Tika’s TikaDocumentReader to extract the data. But when I queried the vector database for “employees over 30 in Engineering”, I got poor results.

The issue wasn’t my embedding model or chunking strategy. It was that Tika had stripped away all the table structure.

What happened?

I was building a RAG pipeline for financial reports stored in Excel files. My setup was simple:

Example.java
TikaDocumentReader reader = new TikaDocumentReader(excelResource);
List<Document> docs = reader.read();
vectorStore.add(docs);

This worked fine for PDFs and Word documents, but not for spreadsheets.

When Tika processed an Excel file like this:

Name  | Age | Department
Alice | 30  | Engineering
Bob   | 25  | Sales

The output became:

"Name Age Department Alice 30 Engineering Bob 25 Sales"

The column headers were there, but the relationship between “30” and “Age”, or “Engineering” and “Department”, was lost. The text just flowed together.

So when I searched for “employees over 30 in Engineering”, the embeddings didn’t capture that “30” was an age value rather than an arbitrary number. Retrieval quality suffered because the structural context was missing.

Why Tika fails at tables

I looked into how Tika works, and I found the root cause: Tika is designed for text extraction, not structure preservation.

Tika’s job is to take 1000+ file formats and give you plain text. It’s amazing at that. But for spreadsheets, it treats cells like words in a paragraph, not like data points in a grid.

The output Tika produces is useful for search indexing (find all files mentioning “Alice”), but not for semantic retrieval (find all rows where Age > 30). The column-row relationships that give tables their meaning get flattened.
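To make the difference concrete, here is a minimal sketch in plain Java. It only mimics the flattened output shown earlier and is not Tika's code; the class name FlattenDemo and both method names are made up for illustration. With the grid intact, a column-aware question is a lookup by header. Once the cells are joined into one string, nothing records which column a value came from.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the flattened text described above, not Tika's implementation.
final class FlattenDemo {

    // Joins every cell with a space, discarding which column each value came from.
    static String flatten(String[][] grid) {
        List<String> cells = new ArrayList<>();
        for (String[] row : grid) {
            cells.addAll(Arrays.asList(row));
        }
        return String.join(" ", cells);
    }

    // With the grid intact, "who is older than N" is a lookup by header position.
    static List<String> namesWhereAgeOver(String[][] grid, int minAge) {
        List<String> header = Arrays.asList(grid[0]);
        int nameCol = header.indexOf("Name");
        int ageCol = header.indexOf("Age");
        List<String> names = new ArrayList<>();
        for (int i = 1; i < grid.length; i++) {
            if (Integer.parseInt(grid[i][ageCol]) > minAge) {
                names.add(grid[i][nameCol]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        String[][] grid = {
            {"Name", "Age", "Department"},
            {"Alice", "30", "Engineering"},
            {"Bob", "25", "Sales"}
        };
        System.out.println(flatten(grid));               // Name Age Department Alice 30 Engineering Bob 25 Sales
        System.out.println(namesWhereAgeOver(grid, 26)); // [Alice]
    }
}
```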

When Tika is actually OK

I realized Tika works fine for some cases:

  • Simple CSV files where rows are independent (like log lines)
  • Mixed-content pipelines where you want one parser for everything (PDFs + docs + spreadsheets)
  • Quick prototyping when you just need “any text” to embed

But for my use case—financial data where queries depend on column relationships—Tika wasn’t enough.

What I tried instead

Attempt 1: Apache POI directly

I switched from Tika to Apache POI, which preserves Excel structure:

ExcelReader.java
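import java.util.*;
import org.apache.poi.ss.usermodel.*;
import org.springframework.ai.document.Document;

// Assumes excelFile is a java.io.File, documents is a List<Document>, and
// metadata is a Map<String, Object>. getCellValue(Cell) is a helper (not shown)
// that converts a cell to a value and tolerates null cells.
try (Workbook workbook = WorkbookFactory.create(excelFile)) {
    Sheet sheet = workbook.getSheetAt(0);

    // The first row holds the column names
    Row headerRow = sheet.getRow(0);
    List<String> headers = new ArrayList<>();
    for (Cell cell : headerRow) {
        headers.add(cell.getStringCellValue());
    }

    // Turn each data row into one document that carries its column names
    for (int i = 1; i <= sheet.getLastRowNum(); i++) {
        Row row = sheet.getRow(i);
        if (row == null) {
            continue; // getRow returns null for rows that were never written
        }
        Map<String, Object> rowData = new LinkedHashMap<>();
        for (int j = 0; j < headers.size(); j++) {
            rowData.put(headers.get(j), getCellValue(row.getCell(j)));
        }
        String text = String.format("Row %d: %s", i + 1, rowData);
        documents.add(new Document(text, metadata));
    }
}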

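Now each row became a self-contained document such as “Row 2: {Name=Alice, Age=30, Department=Engineering}”, which is how the LinkedHashMap prints. This assumes the getCellValue helper renders the numeric cell as 30. POI itself returns numeric cells as double, so an unformatted value would print as 30.0. The column names stayed with their values.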

The retrieval quality improved immediately. Queries like “employees over 30 in Engineering” now found the right documents because the context was preserved.
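One detail worth handling in a helper like getCellValue is number formatting. POI returns numeric cells as double, so an age can end up as “30.0” in the embedded text. The sketch below is a hypothetical post-processing step in plain Java, not the helper from my project. The class name CellText and its render method are made up for illustration.

```java
import java.math.BigDecimal;

// Hypothetical helper: turns an already-extracted cell value into clean text.
final class CellText {

    // Whole-number doubles lose the trailing ".0"; null becomes an empty string.
    static String render(Object value) {
        if (value == null) {
            return "";
        }
        if (value instanceof Double) {
            double d = (Double) value;
            if (Double.isNaN(d) || Double.isInfinite(d)) {
                return String.valueOf(d);
            }
            // Only cast to long when the value is whole and safely inside the long range
            if (d == Math.rint(d) && Math.abs(d) < 1e15) {
                return String.valueOf((long) d);
            }
            return BigDecimal.valueOf(d).stripTrailingZeros().toPlainString();
        }
        return value.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(render(30.0)); // 30
        System.out.println(render(2.5));  // 2.5
    }
}
```

If you would rather get the text exactly as Excel displays it, POI's own DataFormatter is probably the better starting point than a hand-rolled conversion like this one.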

Attempt 2: OpenCSV for CSV files

For CSV files, I used OpenCSV instead of Tika:

CsvReader.java
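import com.opencsv.CSVReader;
import java.io.FileReader;
import java.util.*;
import java.util.stream.Collectors;
import org.springframework.ai.document.Document;

// Assumes csvFile is a java.io.File, documents is a List<Document>, and
// metadata is a Map<String, Object>. readNext() throws checked exceptions
// (IOException, and CsvValidationException in OpenCSV 5+) that the caller must handle.
try (CSVReader reader = new CSVReader(new FileReader(csvFile))) {
    String[] headers = reader.readNext(); // the first line holds the column names
    String[] row;
    while ((row = reader.readNext()) != null) {
        Map<String, String> rowData = new LinkedHashMap<>();
        for (int i = 0; i < headers.length; i++) {
            // A row shorter than the header would otherwise throw ArrayIndexOutOfBoundsException
            rowData.put(headers[i], i < row.length ? row[i] : "");
        }
        String text = rowData.entrySet().stream()
                .map(e -> e.getKey() + ": " + e.getValue())
                .collect(Collectors.joining(", "));
        // "Name: Alice, Age: 30, Department: Engineering"
        documents.add(new Document(text, metadata));
    }
}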

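This gave me proper CSV parsing (quoted fields, embedded commas) and kept the column context intact. Note that CSVReader returns every field as a String, so any type conversion is still up to you.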
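If you do want typed values, for example to store Age as a number in the document metadata, a small conversion step is enough. The following is a rough sketch in plain Java under my own assumptions about which formats count as numbers. The class name CsvTypes and its parse method are hypothetical and not part of OpenCSV.

```java
// Hypothetical helper: guesses a type for a raw CSV field.
final class CsvTypes {

    // Returns a Long, Double, or Boolean when the text clearly looks like one,
    // otherwise the trimmed String.
    static Object parse(String raw) {
        String s = raw == null ? "" : raw.trim();
        if (s.matches("[+-]?\\d{1,18}")) {
            return Long.parseLong(s); // at most 18 digits, so it always fits in a long
        }
        if (s.matches("[+-]?\\d*\\.\\d+")) {
            return Double.parseDouble(s);
        }
        if (s.equalsIgnoreCase("true") || s.equalsIgnoreCase("false")) {
            return Boolean.parseBoolean(s);
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(parse("30").getClass().getSimpleName());          // Long
        System.out.println(parse("Engineering").getClass().getSimpleName()); // String
    }
}
```

Typed metadata like this could complement similarity search with an exact numeric filter, provided your vector store supports metadata filtering.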

Chunking strategy

I also learned that how you chunk tables matters. Instead of treating the whole table as one document, I chunked by rows:

┌─────────────────────────────────────┐
│ Whole Table (BAD for retrieval)     │
│ "Alice 30 Engineering Bob 25 Sales" │
└─────────────────────────────────────┘

            Split by rows

┌──────────────────────┐ ┌──────────────────────┐
│ Row 1 (GOOD)         │ │ Row 2 (GOOD)         │
│ "Name: Alice,        │ │ "Name: Bob,          │
│  Age: 30,            │ │  Age: 25,            │
│  Department: Eng"    │ │  Department: Sales"  │
└──────────────────────┘ └──────────────────────┘

Each row becomes a standalone document with its own embedding. This makes retrieval more precise because queries match against specific rows, not the entire table text.
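Independent of the parser, the row-level chunking itself can be sketched in a few lines of plain Java. This is a simplified illustration rather than the exact code from my project. The class name RowChunker and its chunkByRow method are made up, and in a real pipeline each returned string would be wrapped in a Document before going into the vector store.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringJoiner;

// Hypothetical helper: one text chunk per data row, each carrying the column names.
final class RowChunker {

    static List<String> chunkByRow(String[] headers, List<String[]> rows) {
        List<String> chunks = new ArrayList<>();
        for (String[] row : rows) {
            StringJoiner text = new StringJoiner(", ");
            for (int i = 0; i < headers.length; i++) {
                // Missing trailing cells become empty values instead of failing
                String value = i < row.length ? row[i] : "";
                text.add(headers[i] + ": " + value);
            }
            chunks.add(text.toString());
        }
        return chunks;
    }

    public static void main(String[] args) {
        String[] headers = {"Name", "Age", "Department"};
        List<String[]> rows = List.of(
            new String[] {"Alice", "30", "Engineering"},
            new String[] {"Bob", "25", "Sales"});
        // Prints one line per chunk, starting with
        // Name: Alice, Age: 30, Department: Engineering
        chunkByRow(headers, rows).forEach(System.out::println);
    }
}
```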

The tradeoff

Using POI or OpenCSV instead of Tika requires more code. Tika is a one-liner, while POI needs manual iteration. But the extra effort pays off in retrieval quality.

Here’s how the options compare:

Tool        | Structure preserved?        | Code complexity | Best for
Apache Tika | No (raw text only)          | Low (one-liner) | Mixed content, quick prototypes
Apache POI  | Yes (sheets/rows)           | High (manual)   | Complex Excel workbooks
OpenCSV     | Yes (rows as string arrays) | Medium          | Simple CSV files

Summary

In this post, I showed why Apache Tika isn’t ideal for tabular data in RAG systems. The key point is that Tika extracts text but loses the table structure that gives data its meaning.

For RAG systems where column-row relationships matter:

  • Use Apache POI for Excel to preserve workbook structure
  • Use OpenCSV for CSV files to get robust parsing with the header row kept separate
  • Chunk tables by rows, not as whole documents
  • Keep column names in each chunk for context

Keep Tika for mixed-content pipelines where simplicity matters more than structure. But for structured data where retrieval quality depends on table semantics, specialized parsers work better.
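In a mixed pipeline, that choice can be made per file. Below is a minimal sketch of such a router in plain Java, assuming the file extension is a good enough signal. The class name ParserRouter and its enum are hypothetical, and the actual parser calls are left out.

```java
import java.util.Locale;

// Hypothetical router: picks a parser per file based on its extension.
final class ParserRouter {

    enum Parser { POI, OPENCSV, TIKA }

    static Parser choose(String filename) {
        String name = filename.toLowerCase(Locale.ROOT);
        if (name.endsWith(".xlsx") || name.endsWith(".xls")) {
            return Parser.POI;      // structure-preserving Excel parsing
        }
        if (name.endsWith(".csv")) {
            return Parser.OPENCSV;  // header-aware CSV parsing
        }
        return Parser.TIKA;         // plain-text extraction for everything else
    }

    public static void main(String[] args) {
        System.out.println(choose("report.xlsx")); // POI
        System.out.println(choose("notes.pdf"));   // TIKA
    }
}
```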

Final Words + More Resources

My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.

