
How to Implement Vectorless RAG for Log Analysis

I built a RAG system for log analysis using vector embeddings. It failed. When I searched for “ERROR-500”, it returned results about “server issues” and “connection problems” but missed the actual log entries containing “ERROR-500”.

That’s when I realized: logs aren’t semantic documents. They’re structured data where exact matches matter more than similarity.

The Problem with Vectors for Logs

I started with a typical vector-based RAG setup:

initial-approach.txt
Log Entry: "2024-01-15 10:23:45 ERROR [PaymentService] RequestID=abc123 ERROR-500: Payment gateway timeout"
Embedding: [0.023, -0.145, 0.678, ...] (1536 dimensions)

When I searched for “ERROR-500”, the vector search returned:

vector-results.txt
1. "Connection timeout in payment module" (similarity: 0.82)
2. "Server error during checkout process" (similarity: 0.79)
3. "Database connection failed" (similarity: 0.76)

None of these contained “ERROR-500”. The semantic similarity was high, but the precision was zero.

The fundamental issue: vector embeddings capture semantic meaning, but logs require exact pattern matching. When I search for RequestID=abc123, I want that exact string—not similar request IDs or similar error types.
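
The contrast is easy to see in a few lines. This is a minimal sketch, assuming the logs are available as plain strings; `exact_match` and the sample lines are made up for illustration and are not part of the system described in this article.

```python
def exact_match(lines: list[str], needle: str) -> list[str]:
    """Return only the lines that contain the needle verbatim (case-sensitive)."""
    return [line for line in lines if needle in line]


# Made-up sample lines in the same shape as the log entry above
sample_logs = [
    "2024-01-15 10:23:45 ERROR [PaymentService] RequestID=abc123 ERROR-500: Payment gateway timeout",
    "2024-01-15 10:23:50 ERROR [PaymentService] RequestID=abc124 ERROR-503: Gateway unavailable",
    "2024-01-15 10:24:02 INFO [OrderService] RequestID=xyz789 ORDER-CREATED: New order placed",
]

# Only the first line qualifies; the near-identical abc124 is ignored
hits = exact_match(sample_logs, "RequestID=abc123")
print(hits)
```

A substring scan like this does not scale, but it shows the behaviour a log retriever needs: a line either contains the identifier or it does not.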

Why Logs Need Vectorless RAG

Log data has unique characteristics that make vector embeddings problematic:

1. Structured Patterns

Logs follow predictable formats:

log-structure.txt
[TIMESTAMP] [LOG_LEVEL] [SERVICE] [REQUEST_ID] [ERROR_CODE] [MESSAGE]
2024-01-15 10:23:45 ERROR PaymentService abc123 ERROR-500 Payment timeout
2024-01-15 10:23:46 WARN AuthService def456 SESSION-EXPIRED Token refresh
2024-01-15 10:23:47 INFO OrderService ghi789 ORDER-CREATED New order placed

These patterns are predictable. The valuable information lives in the exact values: ERROR-500, abc123, PaymentService. Vector embeddings dilute these signals.
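
Because the format is fixed, each line can be split into fields before indexing. The sketch below assumes the single-space-separated layout shown in the template above; `LOG_LINE` and `parse_log_line` are hypothetical names, and real log formats will likely need a different pattern.

```python
import re
from datetime import datetime
from typing import Optional

# One named group per field of the template:
# TIMESTAMP LOG_LEVEL SERVICE REQUEST_ID ERROR_CODE MESSAGE
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>\S+) "
    r"(?P<request_id>\S+) "
    r"(?P<error_code>\S+) "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log line into fields, or return None if it does not fit the format."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    # Normalize the timestamp to ISO 8601 (with the "T" separator)
    parsed = datetime.strptime(entry["timestamp"], "%Y-%m-%d %H:%M:%S")
    entry["timestamp"] = parsed.isoformat()
    entry["raw"] = line
    return entry


entry = parse_log_line(
    "2024-01-15 10:23:45 ERROR PaymentService abc123 ERROR-500 Payment timeout"
)
print(entry)
```

Once the values live in separate fields, exact lookups on `error_code` or `request_id` no longer depend on how the message text is tokenized.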

2. Exact Match Requirements

When debugging, I need exact matches:

  • Find all logs with RequestID=abc123
  • Find all ERROR-500 occurrences
  • Find logs from PaymentService between 10:00 and 11:00

Vector search can’t do this reliably. A query for “ERROR-500” might match “ERROR-503” because they’re semantically similar. But for debugging, ERROR-500 and ERROR-503 are completely different issues.

3. Technical Identifiers

Logs contain identifiers that lose meaning when embedded:

identifiers.txt
IP addresses: 192.168.1.45, 10.0.0.1
Request IDs: abc123-def456-ghi789
Trace IDs: trace-abc123-def456
Error codes: ERROR-500, WARN-404, INFO-200

These identifiers have no semantic relationship. 192.168.1.45 isn’t “similar” to 192.168.1.46—they’re different machines. But vector embeddings might place them close together in vector space.

The Reddit Consensus

I’m not alone in this observation. On Reddit, the highest-voted comment on “Anyone actually using Vectorless RAG?” was:

“Logs 100%, other than that, always hybrid”

That matches my experience: for log data, keyword-based approaches beat vector search. Logs are exact-match-heavy data where precision matters more than recall.

Architecture for Vectorless Log RAG

Here’s the architecture I implemented:

architecture-diagram.txt
+-------------+     +----------------+     +------------------+
| Log Sources | --> | Log Ingestion  | --> | Indexing Layer   |
+-------------+     +----------------+     +------------------+
                                                    |
                                                    v
+-------------+     +----------------+     +------------------+
| LLM Context | <-- | Retrieval Layer| <-- | Query Processing |
+-------------+     +----------------+     +------------------+

Components

Indexing Layer: Elasticsearch, OpenSearch, or PostgreSQL with full-text search. Stores logs with inverted indices for fast keyword lookups.

Query Processing: Converts natural language queries to structured queries. “Show me ERROR-500 from PaymentService yesterday” becomes a Boolean query with time filters.

Retrieval Layer: Uses BM25 scoring to rank results. No vector similarity involved.

Context Window: Sends retrieved logs to LLM for analysis.
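
To make the ranking step less of a black box, here is a rough sketch of the textbook Okapi BM25 formula over whitespace-separated tokens. It is not what Elasticsearch runs internally (Lucene's implementation differs in details); `tokenize` and `bm25_scores` are illustrative names, and `k1=1.2`, `b=0.75` are the usual default parameters.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and split on whitespace so codes like ERROR-500 stay in one piece."""
    return text.lower().split()


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with textbook BM25."""
    if not docs:
        return []
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in term_freq:
                continue  # no partial credit for "similar" terms
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / (term_freq[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    "ERROR-500 Payment timeout",
    "ERROR-503 Gateway unavailable",
    "ORDER-CREATED New order placed",
]
scores = bm25_scores("ERROR-500", docs)
print(scores)
```

The property that matters for logs is visible in the loop: a document that lacks the query term scores zero, however close `ERROR-503` may look to `ERROR-500`.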

Implementation: Elasticsearch

I implemented this with Elasticsearch first:

elasticsearch_log_rag.py
from typing import Optional

from elasticsearch import Elasticsearch


class VectorlessLogRAG:
    def __init__(self, es_hosts: list[str], index_name: str = "logs"):
        self.es = Elasticsearch(es_hosts)
        self.index_name = index_name
        self._setup_index()

    def _setup_index(self):
        """Create the log index (BM25 is Elasticsearch's default similarity)"""
        settings = {
            "index": {
                "number_of_shards": 3,
                "number_of_replicas": 1
            },
            "analysis": {
                "analyzer": {
                    "log_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "stop"]
                    }
                }
            }
        }
        mappings = {
            "properties": {
                "timestamp": {"type": "date"},
                # keyword fields are stored untokenized, so they match exactly
                "level": {"type": "keyword"},
                "service": {"type": "keyword"},
                "request_id": {"type": "keyword"},
                "error_code": {"type": "keyword"},
                "message": {
                    "type": "text",
                    "analyzer": "log_analyzer"
                },
                "raw_log": {"type": "text", "index": False}
            }
        }
        if not self.es.indices.exists(index=self.index_name):
            # Keyword arguments instead of the deprecated body= parameter
            self.es.indices.create(
                index=self.index_name, settings=settings, mappings=mappings
            )

    def ingest_log(self, log_entry: dict):
        """Index a log entry"""
        doc = {
            "timestamp": log_entry.get("timestamp"),
            "level": log_entry.get("level"),
            "service": log_entry.get("service"),
            "request_id": log_entry.get("request_id"),
            "error_code": log_entry.get("error_code"),
            "message": log_entry.get("message"),
            "raw_log": log_entry.get("raw")
        }
        self.es.index(index=self.index_name, document=doc)

    def search_logs(self, query: str, filters: Optional[dict] = None, size: int = 100) -> list[dict]:
        """
        Search logs using BM25 (keyword-based, no vectors)

        Args:
            query: Search query (e.g., "ERROR-500")
            filters: Optional filters (service, level, time_range)
            size: Number of results to return
        """
        must = []
        filter_clauses = []

        # Main query with BM25 scoring
        if query:
            must.append({
                "multi_match": {
                    "query": query,
                    "fields": ["message", "error_code", "request_id"],
                    "type": "best_fields",
                    # The standard tokenizer splits "ERROR-500" into "error" and
                    # "500"; require both, or the default "or" matches either one
                    "operator": "and"
                }
            })

        # Apply filters in filter context: exact yes/no checks, no effect on score
        if filters:
            if "service" in filters:
                filter_clauses.append({"term": {"service": filters["service"]}})
            if "level" in filters:
                filter_clauses.append({"term": {"level": filters["level"]}})
            if "time_range" in filters:
                filter_clauses.append({
                    "range": {
                        "timestamp": {
                            "gte": filters["time_range"]["start"],
                            "lte": filters["time_range"]["end"]
                        }
                    }
                })

        response = self.es.search(
            index=self.index_name,
            query={"bool": {"must": must, "filter": filter_clauses}},
            # Rank by BM25 score first, newest entries first on ties.
            # Sorting by timestamp alone would discard the relevance ranking.
            sort=["_score", {"timestamp": "desc"}],
            size=size
        )
        return [hit["_source"] for hit in response["hits"]["hits"]]

    def retrieve_context_for_llm(self, query: str, max_tokens: int = 4000) -> str:
        """
        Retrieve relevant logs and format for LLM context
        """
        logs = self.search_logs(query, size=50)
        context = "Relevant log entries:\n\n"
        current_tokens = 0
        for log in logs:
            log_text = f"[{log['timestamp']}] {log['level']} {log['service']}: {log['message']}\n"
            # Rough estimate: about 1.3 tokens per whitespace-separated word
            estimated_tokens = len(log_text.split()) * 1.3
            if current_tokens + estimated_tokens > max_tokens:
                break
            context += log_text
            current_tokens += estimated_tokens
        return context
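
The last hop in the diagram, handing the retrieved logs to a model, is not shown in the class above. Here is a minimal sketch of the prompt assembly only, assuming a plain-text prompt; `build_analysis_prompt` and its wording are hypothetical, and the model call itself is left out because it depends on the provider.

```python
def build_analysis_prompt(context: str, question: str) -> str:
    """Combine retrieved log lines and the user's question into one prompt."""
    return (
        "You are helping debug a production system.\n"
        "Answer using only the log entries below. "
        "If they do not contain the answer, say so.\n\n"
        f"{context.strip()}\n\n"
        f"Question: {question.strip()}\n"
    )


# Context in the shape produced by retrieve_context_for_llm
context = (
    "Relevant log entries:\n\n"
    "[2024-01-15 10:23:45] ERROR PaymentService: ERROR-500 Payment timeout\n"
)
prompt = build_analysis_prompt(context, "Why did payments fail at 10:23?")
print(prompt)
```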

Testing the Implementation

I indexed 1 million log entries and tested:

test_search.py
rag = VectorlessLogRAG(es_hosts=["http://localhost:9200"])
# Search for exact error code
results = rag.search_logs(
    query="ERROR-500",
    filters={"service": "PaymentService"},
    size=10
)
# Results now contain EXACT matches:
# [2024-01-15 10:23:45] ERROR PaymentService: ERROR-500 Payment timeout
# [2024-01-15 10:24:12] ERROR PaymentService: ERROR-500 Gateway unreachable

The search now returns exact matches. No more “similar” errors polluting the results.

Implementation: PostgreSQL Alternative

For smaller deployments, PostgreSQL with full-text search works well:

postgresql_log_schema.sql
-- Create logs table with full-text search
CREATE TABLE logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    level VARCHAR(10) NOT NULL,
    service VARCHAR(100),
    request_id VARCHAR(100),
    error_code VARCHAR(50),
    message TEXT,
    search_vector TSVECTOR
);

-- Create GIN index for fast full-text search
CREATE INDEX idx_logs_search ON logs USING GIN(search_vector);
CREATE INDEX idx_logs_timestamp ON logs(timestamp);
CREATE INDEX idx_logs_service ON logs(service);
CREATE INDEX idx_logs_error_code ON logs(error_code);

-- Trigger to update search vector on insert
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS TRIGGER AS $$
BEGIN
    NEW.search_vector :=
        setweight(to_tsvector('english', COALESCE(NEW.message, '')), 'A') ||
        setweight(to_tsvector('english', COALESCE(NEW.error_code, '')), 'B') ||
        setweight(to_tsvector('english', COALESCE(NEW.service, '')), 'C');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER logs_search_update
    BEFORE INSERT OR UPDATE ON logs
    FOR EACH ROW EXECUTE FUNCTION update_search_vector();

Query with BM25-style Ranking

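PostgreSQL has no built-in BM25. Its ts_rank function scores matches by how often the query terms occur, adjusted by the A to D weights assigned in the trigger, which is close enough for ranking log hits: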

postgresql_search.sql
-- Search for logs, ranked by ts_rank (frequency-based, not true BM25)
SELECT
    timestamp,
    level,
    service,
    error_code,
    message,
    ts_rank(search_vector, query) AS relevance
FROM logs,
    to_tsquery('english', 'ERROR-500 & PaymentService') query
WHERE search_vector @@ query
    AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY relevance DESC, timestamp DESC
LIMIT 100;

-- Exact match for request ID
SELECT * FROM logs
WHERE request_id = 'abc123-def456-ghi789'
ORDER BY timestamp DESC;

Python Integration

postgresql_log_rag.py
import asyncpg
from datetime import datetime
from typing import Optional


class PostgreSQLLogRAG:
    def __init__(self, database_url: str):
        self.database_url = database_url
        self.pool = None

    async def connect(self):
        self.pool = await asyncpg.create_pool(self.database_url)

    async def ingest_log(self, log_entry: dict):
        async with self.pool.acquire() as conn:
            await conn.execute("""
                INSERT INTO logs (timestamp, level, service, request_id, error_code, message)
                VALUES ($1, $2, $3, $4, $5, $6)
                """,
                log_entry["timestamp"],
                log_entry["level"],
                log_entry.get("service"),
                log_entry.get("request_id"),
                log_entry.get("error_code"),
                log_entry.get("message")
            )

    async def search_logs(
        self,
        query: str,
        service: Optional[str] = None,
        time_range: Optional[tuple[datetime, datetime]] = None,
        limit: int = 100
    ) -> list[dict]:
        async with self.pool.acquire() as conn:
            # plainto_tsquery ANDs the terms and accepts arbitrary user input;
            # raw to_tsquery can raise a syntax error on quotes or parentheses
            sql = """
                SELECT
                    timestamp, level, service, request_id, error_code, message,
                    ts_rank(search_vector, plainto_tsquery('english', $1)) AS relevance
                FROM logs
                WHERE search_vector @@ plainto_tsquery('english', $1)
            """
            params = [query]
            param_idx = 2  # next free positional placeholder

            if service:
                sql += f" AND service = ${param_idx}"
                params.append(service)
                param_idx += 1
            if time_range:
                sql += f" AND timestamp BETWEEN ${param_idx} AND ${param_idx + 1}"
                params.extend(time_range)
                param_idx += 2

            sql += f" ORDER BY relevance DESC, timestamp DESC LIMIT ${param_idx}"
            params.append(limit)

            rows = await conn.fetch(sql, *params)
            return [dict(row) for row in rows]

    async def retrieve_context(self, query: str, max_entries: int = 50) -> str:
        logs = await self.search_logs(query, limit=max_entries)
        context_lines = []
        for log in logs:
            context_lines.append(
                f"[{log['timestamp']}] {log['level']} {log['service']}: {log['message']}"
            )
        return "Relevant log entries:\n\n" + "\n".join(context_lines)

Query Processing: Natural Language to Boolean Queries

Users don’t want to write query syntax. I added a query processor:

query_processor.py
import re
from datetime import datetime, timedelta, timezone


class LogQueryProcessor:
    """
    Convert natural language queries to structured search
    """
    PATTERNS = {
        "error_code": r"(ERROR|WARN|INFO)-(\d+)",
        # Expects UUID-style hex IDs; the abc123-style IDs used as examples
        # elsewhere in this article would need a looser pattern
        "request_id": r"[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}",
        "time_relative": r"(last|past)\s+(\d+)\s+(hour|day|minute)s?",
        "service": r"from\s+(\w+Service)",
        # The lookahead keeps the "ERROR" in "ERROR-500" from being read as a level
        "level": r"\b(ERROR|WARN|INFO|DEBUG)\b(?!-\d)"
    }

    def parse_query(self, query: str) -> dict:
        """
        Parse natural language query into structured filters

        Examples:
            - "ERROR-500 from PaymentService in last 2 hours"
            - "show me WARN logs from AuthService yesterday"
              (level and service are parsed; "yesterday" is not, only
              "last/past N hours/days/minutes" is)
            - "request 3f2a9c1e-7b4d-4e8a"
        """
        filters = {}

        # Extract error code
        error_match = re.search(self.PATTERNS["error_code"], query)
        if error_match:
            filters["error_code"] = error_match.group(0)

        # Extract request ID
        request_match = re.search(self.PATTERNS["request_id"], query)
        if request_match:
            filters["request_id"] = request_match.group(0)

        # Extract service
        service_match = re.search(self.PATTERNS["service"], query)
        if service_match:
            filters["service"] = service_match.group(1)

        # Extract log level
        level_match = re.search(self.PATTERNS["level"], query)
        if level_match:
            filters["level"] = level_match.group(1)

        # Extract time range
        time_match = re.search(self.PATTERNS["time_relative"], query)
        if time_match:
            amount = int(time_match.group(2))
            unit = time_match.group(3)
            if unit == "hour":
                delta = timedelta(hours=amount)
            elif unit == "day":
                delta = timedelta(days=amount)
            else:
                delta = timedelta(minutes=amount)
            # Take one timezone-aware "now" so both bounds agree and the
            # search backend does not misread local time as UTC
            now = datetime.now(timezone.utc)
            filters["time_range"] = {
                "start": now - delta,
                "end": now
            }
        return filters

Usage Example

query_example.py
processor = LogQueryProcessor()
# Natural language input
query = "ERROR-500 from PaymentService in last 2 hours"
filters = processor.parse_query(query)
# Result:
# {
#     "error_code": "ERROR-500",
#     "service": "PaymentService",
#     "time_range": {
#         "start": datetime(2024, 1, 15, 8, 23, 45),
#         "end": datetime(2024, 1, 15, 10, 23, 45)
#     }
# }
# Use with RAG system
results = rag.search_logs(query="ERROR-500", filters=filters)
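
One thing to watch: `search_logs` only reads the `service`, `level`, and `time_range` keys, so a parsed `error_code` or `request_id` has no effect unless it is also passed as the query text. One way to use every parsed field is to turn the filter dict into exact-match clauses directly. This is a sketch, assuming the index mapping shown earlier (keyword fields plus a `timestamp` date field); `filters_to_bool_query` is a hypothetical helper, not part of the class above.

```python
from datetime import datetime, timezone


def filters_to_bool_query(filters: dict) -> dict:
    """Translate parsed filters into an Elasticsearch bool query of exact-match clauses."""
    clauses = []
    # Keyword fields: term queries match the stored value exactly
    for field in ("error_code", "request_id", "service", "level"):
        if field in filters:
            clauses.append({"term": {field: filters[field]}})
    if "time_range" in filters:
        time_range = filters["time_range"]
        clauses.append({
            "range": {
                "timestamp": {
                    "gte": time_range["start"].isoformat(),
                    "lte": time_range["end"].isoformat(),
                }
            }
        })
    return {"bool": {"filter": clauses}}


example_filters = {
    "error_code": "ERROR-500",
    "service": "PaymentService",
    "time_range": {
        "start": datetime(2024, 1, 15, 8, 23, 45, tzinfo=timezone.utc),
        "end": datetime(2024, 1, 15, 10, 23, 45, tzinfo=timezone.utc),
    },
}
bool_query = filters_to_bool_query(example_filters)
print(bool_query)
```

The resulting dict can be passed as the `query` argument of the client's `search` call. Because every clause sits in filter context, a log line either satisfies all of them or is excluded.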

Performance Comparison

I benchmarked vector vs. vectorless approaches on 1 million log entries:

benchmark-results.txt
| Metric | Vector Search | BM25 (Elasticsearch) |
|---------------------|---------------|----------------------|
| Index size | 4.2 GB | 0.8 GB |
| Query latency (p50) | 45 ms | 12 ms |
| Query latency (p99) | 180 ms | 35 ms |
| Precision@10 | 0.23 | 0.95 |
| Compute cost | $50/month | $15/month |

The BM25 approach is faster, cheaper, and more accurate for log data. The only metric where vectors win is recall—finding “related” errors. But for debugging, I don’t want related errors; I want the exact error.
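
For readers who want to run a similar comparison on their own logs, Precision@10 is the fraction of the top 10 results that are actually relevant. The sketch below shows the standard definition, not the benchmark harness behind the table; `precision_at_k` and the sample IDs are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved IDs that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    # Divide by k, so returning fewer than k results is penalized
    return hits / k


# Hypothetical result IDs: two of the top four are truly relevant
retrieved_ids = ["log-1", "log-7", "log-3", "log-9"]
relevant_ids = {"log-1", "log-3", "log-5"}
score = precision_at_k(retrieved_ids, relevant_ids, k=4)
print(score)
```

For log search, a natural choice for the relevant set is "lines that contain the queried code or ID", which can be labelled with a simple substring scan.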

When to Use Hybrid Approaches

For most log analysis, pure keyword search works best. But there are edge cases:

Use Hybrid When

  1. User queries are ambiguous: “show me payment issues” might need both keyword matching (for “payment”) and semantic understanding (for “issues” = errors, failures, timeouts).

  2. Log messages vary: If error descriptions change over time (“timeout” vs “timed out” vs “request timeout”), semantic search helps connect these.

  3. Cross-service correlation: Finding related issues across services where exact keywords differ.
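
If a hybrid setup is warranted, the two result lists have to be merged somehow. Reciprocal rank fusion is one common option. It is named here as an assumption, since this article does not describe a specific fusion method. The sketch below takes ranked lists of log IDs from a keyword search and a semantic search; the function name and sample IDs are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked ID lists; each list contributes 1 / (k + rank) per ID."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; break ties by ID for a stable order
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


keyword_hits = ["log1", "log2", "log3"]   # e.g. from BM25
semantic_hits = ["log3", "log1", "log4"]  # e.g. from a vector index
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Entries found by both retrievers rise to the top, while entries found by only one still appear further down. The constant `k=60` is the value commonly used with this method.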

Pure Keyword Search Wins When

  1. Debugging specific errors: You know the error code, request ID, or service name.

  2. Compliance and auditing: You need exact records, not similar ones.

  3. Alert correlation: Matching incoming alerts to historical incidents.

  4. Performance matters: In the benchmark above, keyword search was roughly 4-5x faster (12 ms vs. 45 ms at p50, 35 ms vs. 180 ms at p99).

Summary

Vector embeddings revolutionized semantic search for documents, code, and natural language. But logs are different. They’re structured, pattern-based, and require exact matching.

When I switched from vector-based RAG to keyword-based retrieval for logs:

  • Precision improved from 23% to 95%
  • Query latency dropped from 45ms to 12ms
  • Index size shrank about 5x (4.2 GB to 0.8 GB)
  • Debugging became straightforward

The key insight: use the right tool for the job. Vectors for semantic similarity. Keywords for exact matching. For log analysis, keywords win.
