Best Vector Search Approach for AI Agent Knowledge Bases
I built a research aggregation agent that collected articles from multiple sources. The agent stored everything in LanceDB with vector embeddings. When I queried for “Python async patterns”, it returned results about async programming - but also returned an article about async patterns in JavaScript from 2019, a Reddit thread about async UI issues, and a tutorial that had been superseded by a newer version.
The problem: pure vector search returns semantically similar but contextually irrelevant results. Same topic, wrong entity, outdated info, different programming language.
Why Pure Vector Search Fails for Agents
I asked on Reddit which vector search approach works best for agent knowledge bases. The response surprised me: vector search alone isn’t enough. You need:
- Entity resolution - Same concept across sources should map to one canonical record
- Source provenance - Every piece of knowledge needs to trace back to its origin
- Deduplication - Content hash prevents storing the same article multiple times
- Recency decay - Old information should score lower than fresh data
- Confidence scoring - Not all retrieved knowledge is equally reliable
The expensive part isn’t the vector database - it’s the knowledge governance layer on top.
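To make that governance layer concrete, here is a minimal sketch of what one stored record might carry so that all five requirements have somewhere to live. The class name KnowledgeRecord and every field name are my own assumptions for illustration; none of them come from LanceDB or any framework.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class KnowledgeRecord:
    """One canonical piece of knowledge (hypothetical schema, not a LanceDB type)."""
    canonical_id: str                                   # entity resolution: one ID per article or concept
    content: str
    content_hash: str                                   # deduplication key
    sources: list[str] = field(default_factory=list)    # provenance: every origin URL
    first_seen: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )                                                   # input for recency decay
    confidence: float = 0.5                             # reliability estimate in [0, 1]

    def add_source(self, url: str) -> None:
        """Record another origin without creating a second record."""
        if url not in self.sources:
            self.sources.append(url)


record = KnowledgeRecord(
    canonical_id="python-async-patterns",
    content="Use asyncio.gather to run coroutines concurrently.",
    content_hash="placeholder-hash",
    sources=["https://example.com/blog/async"],
)
record.add_source("https://example.org/mirror/async")
record.add_source("https://example.com/blog/async")  # already known, ignored
print(len(record.sources))  # 2
```

The point of the sketch is that a mirror adds a source to an existing record instead of adding a new row.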
What I Tried First (And Why It Failed)
My initial implementation was straightforward:
    from lancedb import connect

    db = connect("./knowledge_db")
    table = db.open_table("agent_memory")

    def retrieve(query: str, top_k: int = 10):
        # embed() is the embedding function, defined elsewhere
        query_embedding = embed(query)
        # to_pandas() replaces the deprecated to_df() in current LanceDB releases
        results = table.search(query_embedding).limit(top_k).to_pandas()
        return results

    def add_knowledge(content: str, source: str):
        embedding = embed(content)
        table.add([
            {'embedding': embedding, 'content': content, 'source': source}
        ])

This worked for a demo. Then I ran it for a week and discovered:
- The same article appeared 3 times (Reddit cross-post, blog mirror, newsletter archive)
- A query for “React hooks” returned articles from 2018 - before the API stabilized
- I couldn’t trace which source a piece of knowledge came from
- Confidence scores were meaningless - everything scored 0.85+ because vectors cluster together
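The first failure in that list, one article stored three times, is the easiest to reproduce and to catch. A minimal sketch, assuming that the copies differ only in whitespace and letter case: normalize the text, hash it with SHA-256, and treat a repeated hash as an extra source instead of a new record. The body of hash_content() and the helper add_if_new() are my guesses at what such helpers could look like, not the code I actually ran.

```python
import hashlib
import re


def hash_content(content: str) -> str:
    """Hash normalized text so trivially different copies share one hash."""
    normalized = re.sub(r"\s+", " ", content).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


seen: dict[str, list[str]] = {}  # content_hash -> source URLs that carried it


def add_if_new(content: str, source: str) -> bool:
    """Return True if the content is new, False if it only added a source."""
    key = hash_content(content)
    if key in seen:
        seen[key].append(source)
        return False
    seen[key] = [source]
    return True


article = "Async patterns in Python:\n  use asyncio.gather()."
print(add_if_new(article, "https://blog.example.com/async"))             # True
print(add_if_new(article.upper(), "https://mirror.example.org/async"))   # False
print(add_if_new("  " + article + "\n", "https://news.example.net/42"))  # False
```

An exact hash only catches copies that are identical after normalization. A mirror that wraps the article in its own header or footer would need fuzzy matching, which this sketch does not attempt.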
The Production Approach: Layered Retrieval
I rebuilt the retrieval system with multiple processing layers:
    from lancedb import connect
    import numpy as np
    from datetime import datetime

    db = connect("./knowledge_db")
    table = db.open_table("agent_memory")

    # embed(), hash_content(), resolve_entities() and get_provenance() are
    # helpers defined elsewhere. resolve_entities() is expected to return
    # dicts with 'id', 'score' and 'timestamp' keys.

    def retrieve_with_context(query: str, top_k: int = 10):
        # Step 1: Vector search (base similarity), over-fetching 3x
        query_embedding = embed(query)
        # to_list() returns one dict per row; iterating a DataFrame
        # would yield column names instead of records
        similar = table.search(query_embedding).limit(top_k * 3).to_list()

        # Step 2: Entity resolution (canonical records)
        resolved = resolve_entities(similar)  # Dedupe by content_hash

        # Step 3: Source provenance (traceability)
        for record in resolved:
            record['sources'] = get_provenance(record['id'])

        # Step 4: Recency decay (freshness weighting)
        now = datetime.now()
        for record in resolved:
            age_days = (now - record['timestamp']).days
            record['decay_score'] = record['score'] * np.exp(-age_days / 30)

        # Step 5: Confidence scoring (quality threshold)
        confident = [r for r in resolved if r['decay_score'] > 0.5]

        # Rank by the decayed score so the slice keeps the best records
        confident.sort(key=lambda r: r['decay_score'], reverse=True)
        return confident[:top_k]

    def add_knowledge(content: str, source: str, metadata: dict):
        embedding = embed(content)
        content_hash = hash_content(content)

        # Check for duplicates before inserting
        existing = table.search(embedding).limit(5).to_list()
        for record in existing:
            if record['content_hash'] == content_hash:
                return record['id']  # Return existing canonical ID

        # Insert new canonical record
        table.add([
            {
                'embedding': embedding,
                'content': content,
                'content_hash': content_hash,
                'source': source,
                'timestamp': datetime.now(),
                'metadata': metadata
            }
        ])

The key changes:
- Over-fetch then filter: I retrieve top_k * 3 results, then filter down. This gives room for deduplication.
- Content hash: hash_content() creates a canonical identifier. Same content from different sources maps to the same hash.
- Recency decay: The np.exp(-age_days / 30) formula multiplies the score by 1/e (about 0.37) for every 30 days of age. A 60-day-old article scores about 13.5% of a fresh one. (Halving the score every 30 days would need 0.5 ** (age_days / 30) instead.)
- Return existing on duplicate: Instead of inserting duplicate content, I return the existing canonical ID. This prevents “the same article 3 times”.
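To make the decay numbers concrete, here is a minimal sketch that evaluates the multiplier from np.exp(-age_days / 30) at a few ages, next to a variant with an exact 30-day half-life. The function names are mine, and math.exp stands in for np.exp to keep the example dependency-free.

```python
import math


def exp_decay(age_days: float, time_constant_days: float = 30.0) -> float:
    """Multiplier from the retrieval code: exp(-age / 30)."""
    return math.exp(-age_days / time_constant_days)


def half_life_decay(age_days: float, half_life_days: float = 30.0) -> float:
    """Alternative multiplier that halves exactly every half_life_days."""
    return 0.5 ** (age_days / half_life_days)


for age in (0, 30, 60, 90):
    print(f"{age:>3} days  exp: {exp_decay(age):.3f}  half-life: {half_life_decay(age):.3f}")
```

The exponential column comes out as 1.000, 0.368, 0.135, 0.050 and the half-life column as 1.000, 0.500, 0.250, 0.125. One consequence worth checking against your own data: if similarity scores are at most 1, the 0.5 threshold in step 5 can only be passed while exp(-age_days / 30) is above 0.5, which means nothing older than about 21 days survives no matter how well it matches.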
Multi-Tenant Knowledge Base with Postgres
For systems with multiple users or organizations, tenant isolation becomes critical. LanceDB works well for personal use, but Postgres with pgvector is a better fit for multi-tenant production:
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@localhost/agent_db")

    def retrieve_tenant_knowledge(tenant_id: str, query_embedding: list):
        with conn.cursor() as cur:
            # Vector search with tenant isolation.
            # psycopg2 expects %(name)s placeholders, not :name.
            cur.execute("""
                SELECT id, content, source, timestamp,
                       1 - (embedding <=> %(query_vec)s::vector) AS similarity,
                       created_at
                FROM knowledge_base
                WHERE tenant_id = %(tenant_id)s
                  AND is_canonical = true
                ORDER BY embedding <=> %(query_vec)s::vector
                LIMIT 20
            """, {
                'tenant_id': tenant_id,
                'query_vec': str(query_embedding)
            })

            results = cur.fetchall()

        # Apply recency decay (apply_decay_scoring() is defined elsewhere)
        scored = apply_decay_scoring(results)

        return scored[:10]

The <=> operator is pgvector’s cosine distance. The is_canonical flag ensures only deduplicated master records are retrieved.
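The query above hands its rows to apply_decay_scoring(), which is not shown. A minimal sketch of what it could look like, assuming rows arrive as tuples in the SELECT order (id, content, source, timestamp, similarity, created_at), that timestamp is timezone-aware, and that the same 30-day exponential decay applies. The keyword arguments now and time_constant_days are my additions to make the function testable.

```python
import math
from datetime import datetime, timedelta, timezone


def apply_decay_scoring(rows, now=None, time_constant_days=30.0):
    """Re-rank rows by similarity * exp(-age_days / time_constant_days).

    Assumes each row is (id, content, source, timestamp, similarity, created_at)
    and that timestamp is timezone-aware (for example a timestamptz column).
    """
    now = now or datetime.now(timezone.utc)
    scored = []
    for row_id, content, source, timestamp, similarity, _created_at in rows:
        # Clamp at zero so a timestamp slightly in the future cannot boost a row
        age_days = max((now - timestamp).total_seconds() / 86400.0, 0.0)
        decay_score = float(similarity) * math.exp(-age_days / time_constant_days)
        scored.append({
            "id": row_id,
            "content": content,
            "source": source,
            "similarity": float(similarity),
            "age_days": age_days,
            "decay_score": decay_score,
        })
    scored.sort(key=lambda r: r["decay_score"], reverse=True)
    return scored


now = datetime(2024, 6, 1, tzinfo=timezone.utc)
rows = [
    (1, "Old but very similar", "https://example.com/a", now - timedelta(days=90), 0.95, now),
    (2, "Fresh, slightly less similar", "https://example.com/b", now - timedelta(days=2), 0.80, now),
]
ranked = apply_decay_scoring(rows, now=now)
print([r["id"] for r in ranked])  # [2, 1]
```

Doing the decay in Python after a LIMIT 20 keeps the SQL simple, but it also means a fresh article ranked 21st by raw similarity never gets a chance. Raising the limit is the cheap workaround.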
Stack Recommendations by Scale
Scale              | Vector DB        | Why
-------------------|------------------|------------------------------------------
Personal/Small     | LanceDB          | Embedded, zero-config, works on VPS
Medium/Multi-tenant| Postgres+pgvector| Existing infra, tenant isolation, SQL
Large/High-volume  | Qdrant/Milvus    | Dedicated vector infra, advanced filtering
Knowledge graphs   | Kuzu             | Entity relationships, structured queries

I use LanceDB for my personal agent because it’s embedded - no separate server to manage. For a production system with multiple users, Postgres with pgvector is the pragmatic choice because:
- Tenant isolation with WHERE tenant_id = X
- Existing backup and monitoring infrastructure
- SQL queries combine vector search with metadata filters
- Row-level security for compliance requirements
The Missing Pieces in Most Frameworks
I tried Hermes Agent and similar frameworks. They handle prompts and workflows well, but skip:
- Canonical entity storage - No mechanism to deduplicate knowledge across sources
- Source provenance - Can’t trace which URL/article a fact came from
- Multi-tenant memory - No tenant isolation by default
- Confidence scoring - No quality thresholding on retrieved knowledge
- Auditability - No audit trail for what the agent “learned”
These gaps become production failures. An agent that can’t distinguish fresh from stale knowledge, or can’t trace sources, produces unreliable output.
My Current Architecture
I ended up with a multi-layer approach:
    Source Content
          |
          v
    [Content Hash + Deduplication]
          |
          v
    [Embedding Generation] --> LanceDB (vector index)
          |
          v
    [Entity Resolution] --> Canonical records
          |
          v
    [Provenance Tracking] --> Source mapping table
          |
          v
    [Decay Scoring] --> Timestamp-weighted retrieval
          |
          v
    Agent Context

The vector database is just the storage layer. The intelligence is in the governance pipeline above it.
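The entity resolution and provenance stages of that pipeline can be sketched together in a few lines. This is a simplified in-memory version, assuming each search hit is a dict with content_hash, source, score and content keys. It only merges exact hash matches, so it is the deduplication half of entity resolution and not a full solution for the same concept described in different words.

```python
def resolve_entities(hits):
    """Collapse hits that share a content_hash into one canonical record.

    The first hit seen for a hash becomes the canonical record. Later hits
    contribute their source URL and, if higher, their score.
    """
    canonical = {}
    for hit in hits:
        key = hit["content_hash"]
        if key not in canonical:
            record = dict(hit)  # copy so the caller's dict is not modified
            record["sources"] = [record.pop("source")]
            canonical[key] = record
            continue
        record = canonical[key]
        if hit["source"] not in record["sources"]:
            record["sources"].append(hit["source"])  # provenance: keep every origin
        record["score"] = max(record["score"], hit["score"])  # keep the best match
    return list(canonical.values())


hits = [
    {"content_hash": "abc", "source": "https://blog.example.com/hooks",
     "score": 0.91, "content": "React hooks guide"},
    {"content_hash": "abc", "source": "https://reddit.example.com/r/react/1",
     "score": 0.93, "content": "React hooks guide"},
    {"content_hash": "def", "source": "https://docs.example.com/async",
     "score": 0.88, "content": "Async patterns"},
]
resolved = resolve_entities(hits)
print(len(resolved))           # 2
print(resolved[0]["sources"])  # both URLs for the duplicated article
print(resolved[0]["score"])    # 0.93
```

In a real deployment the sources list would live in the source mapping table from the diagram, keyed by canonical ID, so it survives across queries.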
What I Would Do Differently
- Start with deduplication, not embeddings. I spent weeks tuning embedding models before realizing my biggest problem was duplicate content.
- Add provenance from day one. Every piece of knowledge needs a source URL. Without this, debugging agent decisions is impossible.
- Test with stale content. My test data was all fresh. Production data includes articles from 2019, outdated APIs, deprecated libraries. Decay scoring catches this.
- Use content hash, not URL hash. The same article at different URLs should be one canonical record. URL hashing creates duplicates from mirrors.
- Measure retrieval quality, not just latency. I optimized for fast queries. Then discovered 40% of retrieved content was irrelevant. Quality metrics matter more than speed metrics.
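For the last point, the simplest quality metric I know of is precision at k: of the top k results, what fraction did a human mark as relevant? A minimal sketch, assuming you have a small hand-labelled set of queries with the IDs of their relevant documents. All IDs and the canned retriever below are made up for illustration.

```python
def precision_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved IDs that were labelled relevant."""
    top = list(retrieved_ids)[:k]
    if not top:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)


def mean_precision(labelled, retrieve_ids, k=10):
    """Average precision@k over a dict of query -> relevant IDs."""
    scores = [precision_at_k(retrieve_ids(q), rel, k) for q, rel in labelled.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hand-labelled relevance judgments (hypothetical IDs).
labelled = {
    "python async patterns": {"a1", "a2", "a3"},
    "react hooks": {"r1"},
}
# Stand-in for the real retriever: canned result IDs per query.
canned = {
    "python async patterns": ["a1", "js9", "a2", "ui4", "a3"],
    "react hooks": ["r1", "old7"],
}

print(precision_at_k(canned["python async patterns"],
                     labelled["python async patterns"], k=5))  # 0.6
print(mean_precision(labelled, canned.__getitem__, k=5))
```

A precision of 0.6 on the first query is the same situation as 40% of retrieved content being irrelevant. Tracking this number before and after each pipeline change shows whether deduplication or decay actually helped.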
Final Words + More Resources
My intention with this article was to share my knowledge and experience with others. If you want to contact me, you can reach me by email.