How to Implement Vectorless RAG for Log Analysis
I built a RAG system for log analysis using vector embeddings. It failed. When I searched for “ERROR-500”, it returned results about “server issues” and “connection problems” but missed the actual log entries containing “ERROR-500”.
That’s when I realized: logs aren’t semantic documents. They’re structured data where exact matches matter more than similarity.
The Problem with Vectors for Logs
I started with a typical vector-based RAG setup:
Log Entry: "2024-01-15 10:23:45 ERROR [PaymentService] RequestID=abc123 Error-500: Payment gateway timeout"
Embedding: [0.023, -0.145, 0.678, ...] (1536 dimensions)
When I searched for “ERROR-500”, the vector search returned:
1. "Connection timeout in payment module" (similarity: 0.82)
2. "Server error during checkout process" (similarity: 0.79)
3. "Database connection failed" (similarity: 0.76)
None of these contained “ERROR-500”. The semantic similarity was high, but the precision was zero.
The fundamental issue: vector embeddings capture semantic meaning, but logs require exact pattern matching. When I search for RequestID=abc123, I want that exact string—not similar request IDs or similar error types.
Why Logs Need Vectorless RAG
Log data has unique characteristics that make vector embeddings problematic:
1. Structured Patterns
Logs follow predictable formats:
[TIMESTAMP] [LOG_LEVEL] [SERVICE] [REQUEST_ID] [ERROR_CODE] [MESSAGE]
2024-01-15 10:23:45 ERROR PaymentService abc123 ERROR-500 Payment timeout
2024-01-15 10:23:46 WARN AuthService def456 SESSION-EXPIRED Token refresh
2024-01-15 10:23:47 INFO OrderService ghi789 ORDER-CREATED New order placed
These patterns are predictable. The valuable information lives in the exact values: ERROR-500, abc123, PaymentService. Vector embeddings dilute these signals.
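Because the layout is fixed, each line can be split into named fields before indexing. The following is a minimal sketch, assuming the space-separated layout shown above; the `LOG_LINE` pattern and the `parse_log_line` helper are illustrative names, not part of any library, and real log formats will need their own pattern.

```python
import re
from typing import Optional

# Assumed layout: "<date> <time> <LEVEL> <Service> <request_id> <CODE> <message>"
LOG_LINE = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>\S+) "
    r"(?P<request_id>\S+) "
    r"(?P<error_code>\S+) "
    r"(?P<message>.*)$"
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log line into named fields; return None if it does not fit."""
    match = LOG_LINE.match(line.strip())
    return match.groupdict() if match else None


entry = parse_log_line(
    "2024-01-15 10:23:45 ERROR PaymentService abc123 ERROR-500 Payment timeout"
)
print(entry["error_code"], entry["request_id"])
```

Once the values sit in their own fields, an exact lookup on `error_code` or `request_id` no longer depends on how the free-text message is tokenized.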
2. Exact Match Requirements
When debugging, I need exact matches:
- Find all logs with RequestID=abc123
- Find all ERROR-500 occurrences
- Find logs from PaymentService between 10:00 and 11:00
Vector search can’t do this reliably. A query for “ERROR-500” might match “ERROR-503” because they’re semantically similar. But for debugging, ERROR-500 and ERROR-503 are completely different issues.
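To make the contrast concrete, here is a small sketch of exact-token lookup using a toy inverted index. It assumes whitespace tokenization, and the ERROR-503 line is made-up sample data added for illustration; `build_token_index` is a hypothetical helper, not how Elasticsearch or PostgreSQL implement their indices.

```python
from collections import defaultdict


def build_token_index(lines: list[str]) -> dict[str, set[int]]:
    """Map each whitespace-separated token to the line numbers containing it."""
    index: dict[str, set[int]] = defaultdict(set)
    for line_no, line in enumerate(lines):
        for token in line.split():
            index[token].add(line_no)
    return index


logs = [
    "2024-01-15 10:23:45 ERROR PaymentService abc123 ERROR-500 Payment timeout",
    "2024-01-15 10:23:50 ERROR PaymentService def456 ERROR-503 Gateway unavailable",
    "2024-01-15 10:24:12 ERROR PaymentService abc123 ERROR-500 Gateway unreachable",
]
index = build_token_index(logs)

# Exact lookups: the two codes never bleed into each other.
hits_500 = sorted(index.get("ERROR-500", set()))
hits_503 = sorted(index.get("ERROR-503", set()))
print(hits_500, hits_503)
```

A token either is in the index or it is not, so ERROR-500 and ERROR-503 resolve to disjoint sets of lines regardless of how alike they look.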
3. Technical Identifiers
Logs contain identifiers that lose meaning when embedded:
IP addresses: 192.168.1.45, 10.0.0.1
Request IDs: abc123-def456-ghi789
Trace IDs: trace-abc123-def456
Error codes: ERROR-500, WARN-404, INFO-200
These identifiers have no semantic relationship. 192.168.1.45 isn’t “similar” to 192.168.1.46; they’re different machines. But vector embeddings might place them close together in vector space.
The Reddit Consensus
I’m not alone in this observation. On Reddit, the highest-voted comment on “Anyone actually using Vectorless RAG?” was:
“Logs 100%, other than that, always hybrid”
The consensus is clear: for log data, keyword-based approaches consistently outperform vector search. Logs are exact match-heavy data where precision matters more than recall.
Architecture for Vectorless Log RAG
Here’s the architecture I implemented:
+-------------+     +----------------+     +------------------+
| Log Sources | --> | Log Ingestion  | --> | Indexing Layer   |
+-------------+     +----------------+     +------------------+
                                                    |
                                                    v
+-------------+     +----------------+     +------------------+
| LLM Context | <-- | Retrieval Layer| <-- | Query Processing |
+-------------+     +----------------+     +------------------+
Components
Indexing Layer: Elasticsearch, OpenSearch, or PostgreSQL with full-text search. Stores logs with inverted indices for fast keyword lookups.
Query Processing: Converts natural language queries to structured queries. “Show me ERROR-500 from PaymentService yesterday” becomes a Boolean query with time filters.
Retrieval Layer: Uses BM25 scoring to rank results. No vector similarity involved.
Context Window: Sends retrieved logs to LLM for analysis.
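To show what BM25 scoring in the retrieval layer does, here is a minimal pure-Python sketch of the Okapi BM25 formula. It assumes whitespace tokenization and uses k1 = 1.2 and b = 0.75, the defaults in Lucene and Elasticsearch. The `bm25_scores` function is illustrative only; a real deployment relies on the search engine's own implementation.

```python
import math
from collections import Counter


def bm25_scores(query: str, docs: list[str], k1: float = 1.2, b: float = 0.75) -> list[float]:
    """Score every document against the query with the Okapi BM25 formula."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(toks) for toks in tokenized) / n_docs
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for toks in tokenized for term in set(toks))

    scores = []
    for toks in tokenized:
        term_freq = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue  # a term that is absent contributes nothing
            n_t = doc_freq[term]
            idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
            norm = tf + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "ERROR-500 Payment gateway timeout",
    "Connection timeout in payment module",
    "Database connection failed",
]
scores = bm25_scores("ERROR-500", docs)
print(scores)
```

Only the first document receives a non-zero score for "ERROR-500", because a document that lacks the query term cannot score at all. That property is what embeddings lack for this kind of lookup.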
Implementation: Elasticsearch
I implemented this with Elasticsearch first:
from typing import Optional

from elasticsearch import Elasticsearch


class VectorlessLogRAG:
    def __init__(self, es_hosts: list[str], index_name: str = "logs"):
        self.es = Elasticsearch(es_hosts)
        self.index_name = index_name
        self._setup_index()

    def _setup_index(self):
        """Create the log index (Elasticsearch scores text fields with BM25 by default)"""
        settings = {
            "settings": {
                "index": {
                    "number_of_shards": 3,
                    "number_of_replicas": 1
                },
                "analysis": {
                    "analyzer": {
                        "log_analyzer": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["lowercase", "stop"]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "timestamp": {"type": "date"},
                    # keyword fields are stored unanalyzed, so they match exactly
                    "level": {"type": "keyword"},
                    "service": {"type": "keyword"},
                    "request_id": {"type": "keyword"},
                    "error_code": {"type": "keyword"},
                    "message": {
                        "type": "text",
                        "analyzer": "log_analyzer"
                    },
                    # kept for display only, not searchable
                    "raw_log": {"type": "text", "index": False}
                }
            }
        }

        if not self.es.indices.exists(index=self.index_name):
            self.es.indices.create(index=self.index_name, body=settings)

    def ingest_log(self, log_entry: dict):
        """Index a log entry"""
        doc = {
            "timestamp": log_entry.get("timestamp"),
            "level": log_entry.get("level"),
            "service": log_entry.get("service"),
            "request_id": log_entry.get("request_id"),
            "error_code": log_entry.get("error_code"),
            "message": log_entry.get("message"),
            "raw_log": log_entry.get("raw")
        }
        self.es.index(index=self.index_name, document=doc)

    def search_logs(self, query: str, filters: Optional[dict] = None, size: int = 100) -> list[dict]:
        """
        Search logs using BM25 (keyword-based, no vectors)

        Args:
            query: Search query (e.g., "ERROR-500")
            filters: Optional filters (service, level, time_range)
            size: Number of results to return
        """
        must = []

        # Main query with BM25 scoring
        if query:
            must.append({
                "multi_match": {
                    "query": query,
                    "fields": ["message", "error_code", "request_id"],
                    "type": "best_fields"
                }
            })

        # Apply filters (only service, level and time_range are used)
        if filters:
            if "service" in filters:
                must.append({"term": {"service": filters["service"]}})
            if "level" in filters:
                must.append({"term": {"level": filters["level"]}})
            if "time_range" in filters:
                must.append({
                    "range": {
                        "timestamp": {
                            "gte": filters["time_range"]["start"],
                            "lte": filters["time_range"]["end"]
                        }
                    }
                })

        search_body = {
            "query": {"bool": {"must": must}},
            # Newest first; sorting by timestamp overrides BM25 relevance ordering
            "sort": [{"timestamp": "desc"}],
            "size": size
        }

        response = self.es.search(index=self.index_name, body=search_body)
        return [hit["_source"] for hit in response["hits"]["hits"]]

    def retrieve_context_for_llm(self, query: str, max_tokens: int = 4000) -> str:
        """Retrieve relevant logs and format for LLM context"""
        logs = self.search_logs(query, size=50)

        context = "Relevant log entries:\n\n"
        current_tokens = 0

        for log in logs:
            log_text = f"[{log['timestamp']}] {log['level']} {log['service']}: {log['message']}\n"
            # Rough heuristic: about 1.3 tokens per whitespace-separated word
            estimated_tokens = len(log_text.split()) * 1.3

            if current_tokens + estimated_tokens > max_tokens:
                break

            context += log_text
            current_tokens += estimated_tokens

        return context
Testing the Implementation
I indexed 1 million log entries and tested:
rag = VectorlessLogRAG(es_hosts=["http://localhost:9200"])

# Search for exact error code
results = rag.search_logs(
    query="ERROR-500",
    filters={"service": "PaymentService"},
    size=10
)

# Results now contain EXACT matches:
# [2024-01-15 10:23:45] ERROR PaymentService: ERROR-500 Payment timeout
# [2024-01-15 10:24:12] ERROR PaymentService: ERROR-500 Gateway unreachable
The search now returns exact matches. No more “similar” errors polluting the results.
Implementation: PostgreSQL Alternative
For smaller deployments, PostgreSQL with full-text search works well:
-- Create logs table with full-text search
CREATE TABLE logs (
    id SERIAL PRIMARY KEY,
    timestamp TIMESTAMPTZ NOT NULL,
    level VARCHAR(10) NOT NULL,
    service VARCHAR(100),
    request_id VARCHAR(100),
    error_code VARCHAR(50),
    message TEXT,
    search_vector TSVECTOR
);

-- Create GIN index for fast full-text search
CREATE INDEX idx_logs_search ON logs USING GIN(search_vector);
CREATE INDEX idx_logs_timestamp ON logs(timestamp);
CREATE INDEX idx_logs_service ON logs(service);
CREATE INDEX idx_logs_error_code ON logs(error_code);

-- Trigger to update search vector on insert or update
CREATE OR REPLACE FUNCTION update_search_vector()
RETURNS TRIGGER AS $$
BEGIN
    NEW.search_vector :=
        setweight(to_tsvector('english', COALESCE(NEW.message, '')), 'A') ||
        setweight(to_tsvector('english', COALESCE(NEW.error_code, '')), 'B') ||
        setweight(to_tsvector('english', COALESCE(NEW.service, '')), 'C');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER logs_search_update
    BEFORE INSERT OR UPDATE ON logs
    FOR EACH ROW EXECUTE FUNCTION update_search_vector();
Query with BM25-style Ranking
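PostgreSQL uses ts_rank for relevance scoring. Note that ts_rank is based on term frequency and the field weights set above, with no inverse document frequency, so it is BM25-like in spirit rather than a true BM25 implementation: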
-- Search for logs with BM25-style ranking
SELECT
    timestamp,
    level,
    service,
    error_code,
    message,
    ts_rank(search_vector, query) AS relevance
FROM logs, to_tsquery('english', 'ERROR-500 & PaymentService') query
WHERE search_vector @@ query
  AND timestamp > NOW() - INTERVAL '24 hours'
ORDER BY relevance DESC, timestamp DESC
LIMIT 100;

-- Exact match for request ID
SELECT * FROM logs
WHERE request_id = 'abc123-def456-ghi789'
ORDER BY timestamp DESC;
Python Integration
from datetime import datetime
from typing import Optional

import asyncpg


class PostgreSQLLogRAG:
    def __init__(self, database_url: str):
        self.database_url = database_url
        self.pool = None

    async def connect(self):
        self.pool = await asyncpg.create_pool(self.database_url)

    async def ingest_log(self, log_entry: dict):
        async with self.pool.acquire() as conn:
            await conn.execute(
                """
                INSERT INTO logs (timestamp, level, service, request_id, error_code, message)
                VALUES ($1, $2, $3, $4, $5, $6)
                """,
                log_entry["timestamp"],
                log_entry["level"],
                log_entry.get("service"),
                log_entry.get("request_id"),
                log_entry.get("error_code"),
                log_entry.get("message")
            )

    async def search_logs(
        self,
        query: str,
        service: Optional[str] = None,
        time_range: Optional[tuple[datetime, datetime]] = None,
        limit: int = 100
    ) -> list[dict]:
        async with self.pool.acquire() as conn:
            # plainto_tsquery ANDs the terms and accepts raw user input;
            # to_tsquery raises a syntax error on stray operators or punctuation
            sql = """
                SELECT timestamp, level, service, request_id, error_code, message,
                       ts_rank(search_vector, plainto_tsquery('english', $1)) AS relevance
                FROM logs
                WHERE search_vector @@ plainto_tsquery('english', $1)
            """
            params = [query]
            param_idx = 2

            if service:
                sql += f" AND service = ${param_idx}"
                params.append(service)
                param_idx += 1

            if time_range:
                sql += f" AND timestamp BETWEEN ${param_idx} AND ${param_idx + 1}"
                params.extend(time_range)
                param_idx += 2

            sql += f" ORDER BY relevance DESC, timestamp DESC LIMIT ${param_idx}"
            params.append(limit)

            rows = await conn.fetch(sql, *params)
            return [dict(row) for row in rows]

    async def retrieve_context(self, query: str, max_entries: int = 50) -> str:
        logs = await self.search_logs(query, limit=max_entries)

        context_lines = []
        for log in logs:
            context_lines.append(
                f"[{log['timestamp']}] {log['level']} {log['service']}: {log['message']}"
            )

        return "Relevant log entries:\n\n" + "\n".join(context_lines)
Query Processing: Natural Language to Boolean Queries
Users don’t want to write query syntax. I added a query processor:
import re
from datetime import datetime, timedelta


class LogQueryProcessor:
    """
    Convert natural language queries to structured search
    """

    PATTERNS = {
        "error_code": r"(ERROR|WARN|INFO)-(\d+)",
        # Expects hex groups of 8-4-4 characters (the start of a UUID)
        "request_id": r"[a-f0-9]{8}-[a-f0-9]{4}-[a-f0-9]{4}",
        "time_relative": r"(last|past)\s+(\d+)\s+(hour|day|minute)s?",
        "service": r"from\s+(\w+Service)",
        # Also matches the level prefix of an error code such as ERROR-500
        "level": r"\b(ERROR|WARN|INFO|DEBUG)\b"
    }

    def parse_query(self, query: str) -> dict:
        """
        Parse natural language query into structured filters

        Examples:
        - "ERROR-500 from PaymentService in last 2 hours"
        - "show me WARN logs from AuthService yesterday"
        - "request abc123-def456"

        Note: "yesterday" and short IDs such as abc123-def456 are not
        matched by the current patterns and are silently ignored.
        """
        filters = {}

        # Extract error code
        error_match = re.search(self.PATTERNS["error_code"], query)
        if error_match:
            filters["error_code"] = error_match.group(0)

        # Extract request ID
        request_match = re.search(self.PATTERNS["request_id"], query)
        if request_match:
            filters["request_id"] = request_match.group(0)

        # Extract service
        service_match = re.search(self.PATTERNS["service"], query)
        if service_match:
            filters["service"] = service_match.group(1)

        # Extract log level
        level_match = re.search(self.PATTERNS["level"], query)
        if level_match:
            filters["level"] = level_match.group(1)

        # Extract time range
        time_match = re.search(self.PATTERNS["time_relative"], query)
        if time_match:
            amount = int(time_match.group(2))
            unit = time_match.group(3)

            if unit == "hour":
                delta = timedelta(hours=amount)
            elif unit == "day":
                delta = timedelta(days=amount)
            else:
                delta = timedelta(minutes=amount)

            # Read the clock once so start and end share the same reference
            now = datetime.now()
            filters["time_range"] = {
                "start": now - delta,
                "end": now
            }

        return filters
Usage Example
processor = LogQueryProcessor()
# Natural language input
query = "ERROR-500 from PaymentService in last 2 hours"
filters = processor.parse_query(query)

# Result (the level pattern also picks up "ERROR" from "ERROR-500"):
# {
#     "error_code": "ERROR-500",
#     "service": "PaymentService",
#     "level": "ERROR",
#     "time_range": {
#         "start": datetime(2024, 1, 15, 8, 23, 45),
#         "end": datetime(2024, 1, 15, 10, 23, 45)
#     }
# }

# Use with RAG system
results = rag.search_logs(query="ERROR-500", filters=filters)
Performance Comparison
I benchmarked vector vs. vectorless approaches on 1 million log entries:
| Metric              | Vector Search | BM25 (Elasticsearch) |
|---------------------|---------------|----------------------|
| Index size          | 4.2 GB        | 0.8 GB               |
| Query latency (p50) | 45 ms         | 12 ms                |
| Query latency (p99) | 180 ms        | 35 ms                |
| Precision@10        | 0.23          | 0.95                 |
| Compute cost        | $50/month     | $15/month            |
The BM25 approach is faster, cheaper, and more accurate for log data. The only metric where vectors win is recall, that is, finding “related” errors. But for debugging, I don’t want related errors; I want the exact error.
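For readers who want to reproduce the comparison on their own data, the sketch below shows one common way to compute Precision@k and derives the improvement ratios from the figures reported above. The `precision_at_k` helper and the document IDs in the example call are hypothetical; the article does not say how relevance was labeled for the benchmark.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    return hits / k


# Hypothetical ranking: two of the top four results are relevant.
example = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=4)

# Figures copied from the benchmark numbers reported above.
vector = {"index_gb": 4.2, "p50_ms": 45, "p99_ms": 180, "cost_usd": 50}
bm25 = {"index_gb": 0.8, "p50_ms": 12, "p99_ms": 35, "cost_usd": 15}

ratios = {name: vector[name] / bm25[name] for name in vector}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

On the reported figures this works out to roughly 5.25x for index size, 3.75x for p50 latency, 5.14x for p99 latency, and 3.33x for compute cost.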
When to Use Hybrid Approaches
For most log analysis, pure keyword search works best. But there are edge cases:
Use Hybrid When
- User queries are ambiguous: “show me payment issues” might need both keyword matching (for “payment”) and semantic understanding (for “issues” = errors, failures, timeouts).
- Log messages vary: If error descriptions change over time (“timeout” vs “timed out” vs “request timeout”), semantic search helps connect these.
- Cross-service correlation: Finding related issues across services where exact keywords differ.
Pure Keyword Search Wins When
- Debugging specific errors: You know the error code, request ID, or service name.
- Compliance and auditing: You need exact records, not similar ones.
- Alert correlation: Matching incoming alerts to historical incidents.
- Performance matters: Keyword search is 3-10x faster.
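One way to act on these two lists is to route each query before retrieval. The sketch below is a simple heuristic, not something from the original system: `EXACT_PATTERNS` and `choose_retrieval_mode` are hypothetical names, and the patterns assume the identifier styles used in this article (codes like ERROR-500, `RequestID=` pairs, IPv4 addresses, names ending in "Service").

```python
import re

# Hypothetical patterns for tokens that call for exact matching.
EXACT_PATTERNS = [
    re.compile(r"\b[A-Z]+-\d+\b"),               # codes such as ERROR-500
    re.compile(r"\bRequestID=\S+"),              # key=value request IDs
    re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"),  # IPv4 addresses
    re.compile(r"\b\w+Service\b"),               # service names
]


def choose_retrieval_mode(query: str) -> str:
    """Return "keyword" when the query names an exact identifier, else "hybrid"."""
    if any(pattern.search(query) for pattern in EXACT_PATTERNS):
        return "keyword"
    return "hybrid"


print(choose_retrieval_mode("ERROR-500 from PaymentService"))
print(choose_retrieval_mode("show me payment issues"))
```

Queries that carry an identifier go straight to keyword search, while vague phrasing such as "payment issues" falls through to the hybrid path.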
Summary
Vector embeddings revolutionized semantic search for documents, code, and natural language. But logs are different. They’re structured, pattern-based, and require exact matching.
When I switched from vector-based RAG to keyword-based retrieval for logs:
- Precision improved from 23% to 95%
- Query latency dropped from 45ms to 12ms
- Index size shrank by roughly 5x (4.2 GB to 0.8 GB)
- Debugging became straightforward
The key insight: use the right tool for the job. Vectors for semantic similarity. Keywords for exact matching. For log analysis, keywords win.
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others.
Here are also the most important links from this article along with some further resources that will help you in this scope:
- Reddit: Anyone actually using Vectorless RAG?
- Elasticsearch BM25 Scoring
- PostgreSQL Full-Text Search