Skip to content

How to preserve table structure when generating embeddings from CSV/XLS in RAG applications

The Problem

When I tried to build a RAG application that searches CSV and Excel files, I got this result:

User query: "Find products under $100"
System returns: Nothing matched
But my CSV clearly has:
Product Price Stock
Widget A 50 25
Widget B 500 15

I was using Apache Tika to extract text from my files, then splitting that text into chunks for embeddings. The problem? Tika turned my structured table into flat text:

Widget A 50 25 Widget B 500 15

When I searched for “cheap products” or “items under $100”, the embeddings couldn’t tell which number was the price and which was the stock quantity. The context was gone.

Environment

  • Spring Boot 3.2
  • Spring AI 1.0.0
  • Apache Tika 2.9
  • PostgreSQL with pgvector extension
  • Java 21

What happened?

I was building a RAG system for an e-commerce catalog. The data sat in CSV and Excel files with product information:

Product Name Price Category Stock Rating
Widget Pro 299.99 Electronics 45 4.5
Basic Widget 49.99 Home 120 3.8
Premium Widget 599.99 Electronics 8 4.9

Here’s my initial approach using Tika:

TikaTableExtractor.java
@Component
class TableExtractor {
private final Tika tika = new Tika();
public List <Document> extractTable(File file) throws Exception {
// Extract all text as one big string
String text = tika.parseToString(file);
// Result: "Widget Pro 299.99 Electronics 45 4.5 Basic Widget 49.99..."
// Split into chunks by token count
TokenTextSplitter splitter = new TokenTextSplitter();
return splitter.apply(List.of(new Document(text)));
}
}

I can explain the key parts:

  • Tika.parseToString() converts the entire file to plain text
  • The tokenizer chunks by 500 tokens, regardless of row boundaries
  • No metadata about column names or data types

So when I embedded this text and searched, the system saw “299.99 45 4.5” as just three numbers. It couldn’t tell which was price, stock, or rating.

I tried searching for:

Query: "high-priced electronics"
Expected: Widget Pro ($299.99) and Premium Widget ($599.99)
Actual: Nothing returned
Query: "affordable items under $100"
Expected: Basic Widget ($49.99)
Actual: Nothing returned

The embeddings captured the text, but the semantic meaning was lost.

How to solve it?

I tried three approaches before finding one that worked.

Attempt 1: Better prompts (didn’t work)

I tried making my queries more specific:

"products with price less than 100 dollars"
"items where the first number is under 100"

This didn’t help because the underlying embeddings still contained “49.99” without the label “price”.

Attempt 2: Custom chunking (partial fix)

I tried to preserve row boundaries during chunking:

RowAwareSplitter.java
// Split by newlines first to keep rows together
String[] lines = text.split("\\n");
List <String> rowChunks = new ArrayList <>();
StringBuilder currentChunk = new StringBuilder();
for (String line : lines) {
currentChunk.append(line).append("\n");
if (currentChunk.length() > 400) {
rowChunks.add(currentChunk.toString());
currentChunk = new StringBuilder();
}
}

This kept rows together, but still lost column headers. The chunks looked like:

"Widget Pro 299.99 Electronics 45 4.5"

Better than before, but still not labeled.

Attempt 3: Row-based chunking with metadata (the solution)

I realized I needed to stop using Tika for tables and parse them properly instead.

For CSV files:

CsvEmbeddingService.java
@Component
class CsvEmbeddingService {
private final EmbeddingModel embeddingModel;
private final VectorStore vectorStore;
public List <Document> embedCsv(String tableName, InputStream csvStream) {
var reader = new CsvReader(csvStream);
return reader.stream()
.map(row -> {
// Convert row to semantic text with column labels
String semanticText = buildSemanticText(row);
// Store structural metadata
Map <String, Object> metadata = Map.of(
"table", tableName,
"row_index", row.getRowNumber(),
"product_name", row.get("Product_Name"),
"price", row.get("Price"),
"category", row.get("Category"),
"stock", row.get("Stock"),
"rating", row.get("Rating")
);
return new Document(semanticText, metadata);
})
.toList();
}
private String buildSemanticText(CsvRow row) {
return String.format(
"Product: %s, Price: %s USD, Category: %s, Stock: %s units, Rating: %s stars",
row.get("Product_Name"),
row.get("Price"),
row.get("Category"),
row.get("Stock"),
row.get("Rating")
);
}
}

For Excel files:

ExcelEmbeddingService.java
@Component
class ExcelEmbeddingService {
public List <Document> embedExcel(String tableName, File file) throws Exception {
List <Document> documents = new ArrayList <>();
try (Workbook workbook = WorkbookFactory.create(file)) {
Sheet sheet = workbook.getSheetAt(0);
Row headerRow = sheet.getRow(0);
for (int i = 1; i <= sheet.getLastRowNum(); i++) {
Row row = sheet.getRow(i);
if (row == null) continue;
// Build row data with headers
Map <String, String> rowData = new HashMap <>();
for (Cell cell : row) {
String header = headerRow.getCell(cell.getColumnIndex())
.getStringCellValue();
String value = getCellValueAsString(cell);
rowData.put(header, value);
}
// Create labeled text representation
String semanticText = rowData.entrySet().stream()
.map(e -> e.getKey() + ": " + e.getValue())
.collect(Collectors.joining(", "));
// Enrich with metadata
Map <String, Object> metadata = new HashMap <>();
metadata.put("table", tableName);
metadata.put("row_index", i);
metadata.put("sheet_name", sheet.getSheetName());
// Add column-specific metadata for filtering
rowData.forEach((k, v) ->
metadata.put("col_" + k.toLowerCase().replace(" ", "_"), v)
);
documents.add(new Document(semanticText, metadata));
}
}
return documents;
}
}

Now test the search:

TableRAGService.java
@Service
class TableRAGService {
public List <Document> searchProducts(String query) {
return vectorStore.similaritySearch(
SearchRequest.query(query).withTopK(5)
);
}
// Filter by metadata
public List <Document> searchByPriceRange(String tableName, double maxPrice) {
return vectorStore.similaritySearch(
SearchRequest.query("affordable products low price")
.withTopK(10)
.withFilterExpression("table == :tableName")
.withParameters(Map.of("tableName", tableName))
).stream()
.filter(doc -> {
String priceStr = doc.getMetadata().get("col_price");
double price = Double.parseDouble(priceStr);
return price <= maxPrice;
})
.toList();
}
}

Query results:

Query: "high-priced electronics"
Returns:
- Product: Premium Widget, Price: 599.99 USD, Category: Electronics...
- Product: Widget Pro, Price: 299.99 USD, Category: Electronics...
Query: "affordable items under $100"
Returns:
- Product: Basic Widget, Price: 49.99 USD, Category: Home, Stock: 120 units...

You can see that I succeeded to preserve table context while enabling semantic search.

Why this works

The key insight is that embeddings need labeled context to understand tabular data. Here’s what changed:

Before (Tika):

Text: "Widget Pro 299.99 Electronics 45 4.5"
Metadata: {}

The embedding sees “299.99” as just a number. It could be anything.

After (Row-based with metadata):

Text: "Product: Widget Pro, Price: 299.99 USD, Category: Electronics, Stock: 45 units, Rating: 4.5 stars"
Metadata: {table: products, col_price: 299.99, col_category: Electronics, col_stock: 45, col_rating: 4.5}

Now the embedding understands:

  • The number is a price (not stock or rating)
  • The unit is USD
  • The context is Electronics

This approach gives you four benefits:

  1. Accurate retrieval: Queries like “expensive electronics” match the labeled price field
  2. Metadata filtering: You can filter by price range or category after the vector search
  3. Context preservation: When the LLM generates answers, it has structured metadata to work with
  4. Multi-table support: You distinguish between “Price” in Products vs “Price” in Orders tables

Visual comparison

Here’s the difference in data flow:

Tika Approach (loses structure):
┌─────────┐ ┌──────────┐ ┌────────────┐
│ CSV/XLS │ ──→│ Tika │ ──→│ "Widget A │
│ Table │ │ Extract │ │ 299.99 │
└─────────┘ └──────────┘ │ Elec 45" │
└────────────┘
┌────────────┐
│ Embeddings │
│ (no label) │
└────────────┘
Row-based Approach (preserves structure):
┌─────────┐ ┌──────────┐ ┌─────────────────────┐
│ CSV/XLS │ ──→│ Parse │ ──→│ "Product: Widget A, │
│ Table │ │ Rows │ │ Price: 299.99 USD" │
└─────────┘ └──────────┘ │ Metadata: { │
│ price: 299.99, │
│ category: Elec │
│ } │
└─────────────────────┘
┌────────────┐
│ Embeddings │
│ (labeled) │
└────────────┘

The reason

I think the key reason for the original problem is that Tika was designed for document indexing (full-text search), not structured data extraction. When you feed it a spreadsheet, it gives you the text content, which is exactly what it’s supposed to do. But for RAG applications, you need to preserve the structure.

The metadata enrichment is what makes this work. By storing column names as metadata fields, you can:

  • Filter search results by specific columns
  • Reconstruct the original row context when generating answers
  • Enable the LLM to understand field relationships

Summary

In this post, I showed how to preserve table structure when generating embeddings from CSV/XLS files. The key point is to use row-based chunking with metadata enrichment instead of plain text extraction.

By parsing each row as a semantic unit with labeled column values and storing structural metadata in the Document object, you maintain the relationships that make tabular data meaningful while enabling vector similarity search.

This approach ensures that queries like “high-priced electronics” return accurate results because the embedding model understands the labeled context, not just raw numbers.

If you’re working with multiple tables, consider adding table name to your metadata filters to avoid confusion between similarly named columns across different datasets.

Final Words + More Resources

My intention with this article was to help others share my knowledge and experience. If you want to contact me, you can contact by email: Email me

Here are also the most important links from this article along with some further resources that will help you in this scope:

Oh, and if you found these resources useful, don’t forget to support me by starring the repo on GitHub!

Comments