What is Embedding Dimension? How I Learned to Stop Over-Provisioning My Vector Database
1. Purpose
I recently hit a storage wall with my vector database. After loading 2 million documents into Pinecone, my monthly bill looked like a phone number. When I investigated, I realized the culprit wasn’t my data volume—it was my embedding dimension choice.
This post explains what embedding dimension is, why the choice matters, and how I learned to match dimensions to actual use cases instead of blindly picking the largest available.
2. The Problem: My Vector Database Was Too Fat
I was building a semantic search system for a client’s documentation portal. Here’s what my initial setup looked like:
```python
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()

# I picked OpenAI's ada-002 because "bigger is better", right?
def embed_texts(texts):
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # 1536 dimensions
        input=texts
    )
    return [item.embedding for item in response.data]

# Embedding 100k documents
documents = load_documents()  # 100,000 docs
embeddings = embed_texts(documents)

# Storage calculation I didn't do upfront:
# 100,000 docs * 1536 floats * 4 bytes/float = ~614 MB raw
# Plus indexing overhead = ~1.8 GB
print(f"Embedding shape: {len(embeddings[0])}")  # 1536
```

I chose ada-002 (1536 dimensions) because it was the “standard” choice. But for simple semantic search over technical documentation, I was over-provisioning by a factor of 4.
The client’s docs were mostly API references, configuration guides, and troubleshooting steps. They didn’t need subtle semantic nuance—they needed fast, accurate keyword-ish matching with some synonym awareness.
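The back-of-the-envelope calculation I skipped is easy to script. Here is a minimal sketch, assuming float32 vectors (4 bytes per value) and ignoring index overhead, which varies by database. `raw_storage_mb` is a hypothetical helper name I made up for illustration, not part of any library:

```python
def raw_storage_mb(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> float:
    """Raw size of the vectors alone, in decimal megabytes (no index overhead)."""
    return num_vectors * dimension * bytes_per_value / 1_000_000


# Compare the raw footprint of 100,000 vectors at three common dimensions
for dim in (384, 768, 1536):
    print(f"{dim:>4} dims: {raw_storage_mb(100_000, dim):,.1f} MB raw")
```

Running this before picking a model would have shown me the 4x gap between 384 and 1536 dimensions right away.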
3. What Is Embedding Dimension, Really?
After my storage shock, I dove into understanding what dimension actually means.
Embedding dimension is simply the length of the vector that represents your data. If you have a 384-dimensional embedding, you have a list of 384 floating-point numbers. Each number captures some learned feature of your input.
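To make that concrete, here is a small sketch using made-up 4-dimensional vectors (real models produce hundreds of values, and these numbers come from no model). It assumes float32 storage and shows that both the size of a vector and the work needed to compare two vectors follow directly from its length:

```python
import math
import struct


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; the work grows with their length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" with illustrative values only
doc_vec = [0.12, -0.45, 0.78, 0.33]
query_vec = [0.10, -0.40, 0.80, 0.30]

dimension = len(doc_vec)
bytes_per_vector = dimension * struct.calcsize("<f")  # float32 is 4 bytes

print(f"dimension: {dimension}, bytes per vector: {bytes_per_vector}")
print(f"cosine similarity: {cosine_similarity(doc_vec, query_vec):.3f}")
```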
Here’s how I visualized it:
```text
Small dimension (384):
[0.12, -0.45, 0.78, ..., 0.33]   # 384 numbers
-> Less semantic detail captured
-> Lower storage (384 * 4 = 1,536 bytes per embedding)
-> Faster similarity search

Large dimension (1536):
[0.12, -0.45, 0.78, ..., -0.21]  # 1536 numbers
-> More semantic detail captured
-> Higher storage (1536 * 4 = 6,144 bytes per embedding)
-> Slower similarity search
```

3.1 Common Models and Their Dimensions
I created a quick reference table:
| Model | Dimension | When to Use |
|---|---|---|
| all-MiniLM-L6-v2 | 384 | Simple search, edge devices |
| all-mpnet-base-v2 | 768 | Balanced workloads |
| text-embedding-ada-002 | 1536 | Complex RAG, nuanced content |
| Cohere embed-v3 | 1024 | Enterprise search |
| Gemini Embedding | Varies | Google AI applications |
4. Testing Different Dimensions
I ran a practical comparison to see the real impact:
```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
import time

# Test documents
docs = [
    "How to reset the admin password",
    "Changing administrator credentials",
    "Password recovery for admin users",
    "Configuring database connections",
    "Setting up MySQL connection pool"
]

# Model 1: Small dimension (384)
model_small = SentenceTransformer('all-MiniLM-L6-v2')
embeddings_small = model_small.encode(docs)

# Model 2: Large dimension (768)
model_large = SentenceTransformer('all-mpnet-base-v2')
embeddings_large = model_large.encode(docs)

print(f"Small model dimension: {embeddings_small.shape[1]}")  # 384
print(f"Large model dimension: {embeddings_large.shape[1]}")  # 768
print(f"Storage ratio: {embeddings_large.nbytes / embeddings_small.nbytes:.1f}x")

# Compare similarity results
def find_similar(query, embeddings, model):
    query_emb = model.encode([query])
    similarities = cosine_similarity(query_emb, embeddings)[0]
    return sorted(zip(docs, similarities), key=lambda x: x[1], reverse=True)

query = "reset admin password"
print("\nSmall model results:")
for doc, score in find_similar(query, embeddings_small, model_small)[:3]:
    print(f"  {score:.3f}: {doc}")

print("\nLarge model results:")
for doc, score in find_similar(query, embeddings_large, model_large)[:3]:
    print(f"  {score:.3f}: {doc}")
```

Output:
```text
Small model dimension: 384
Large model dimension: 768
Storage ratio: 2.0x

Small model results:
  0.892: How to reset the admin password
  0.745: Password recovery for admin users
  0.612: Changing administrator credentials

Large model results:
  0.912: How to reset the admin password
  0.768: Password recovery for admin users
  0.634: Changing administrator credentials
```

Both models found the right answer. The larger model had slightly higher similarity scores, but the ranking was identical. Cosine scores from two different models are not on the same scale, so the roughly 0.02 gap says little about accuracy; what matters for search is the ranking. For my use case, nothing in these results justified 2x the storage.
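Because raw scores from two different models are hard to compare directly, I find it more useful to compare the rankings. The sketch below does that with the scores copied from the output above. `ranking` and `top_k_overlap` are hypothetical helper names, and a three-document check like this is only a sanity test, not a proper retrieval evaluation:

```python
# (document, score) pairs copied from the output above
small_results = [
    ("How to reset the admin password", 0.892),
    ("Password recovery for admin users", 0.745),
    ("Changing administrator credentials", 0.612),
]
large_results = [
    ("How to reset the admin password", 0.912),
    ("Password recovery for admin users", 0.768),
    ("Changing administrator credentials", 0.634),
]


def ranking(results):
    """Return the documents ordered by descending score."""
    return [doc for doc, _ in sorted(results, key=lambda r: r[1], reverse=True)]


def top_k_overlap(a, b, k):
    """Fraction of the top-k documents that two result lists share."""
    return len(set(ranking(a)[:k]) & set(ranking(b)[:k])) / k


print("Same order:", ranking(small_results) == ranking(large_results))
print("Top-3 overlap:", top_k_overlap(small_results, large_results, 3))
```

On a real project I would run the same comparison over a set of representative queries before committing to a model.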
5. Why Dimension Choice Matters
The trade-offs became clear when I mapped them out:
5.1 Larger Dimensions (1024-3072+)
Pros:
- Capture subtle semantic distinctions
- Better for long-form, complex content
- Superior for cross-lingual retrieval
Cons:
- Linear storage increase (1536 dims = 4x storage of 384)
- Slower vector similarity calculations
- More GPU memory during inference
- Higher downstream costs (embedding APIs typically bill per input token regardless of dimension, but vector database storage and query costs grow with vector size)
5.2 Smaller Dimensions (384-512)
Pros:
- Fast indexing and queries
- Lower storage footprint
- Better for edge/mobile deployment
- Often “good enough” for simple tasks
Cons:
- May miss subtle semantic nuances
- Poorer performance on cross-lingual tasks
- Less effective for very long documents
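The speed side of these trade-offs can be felt with a toy benchmark. This is a rough sketch under loud assumptions: random vectors instead of real embeddings, a pure-Python exhaustive scan instead of the approximate indexes and vectorized math that real vector databases use, and hypothetical helper names. Absolute timings will differ on every machine; the point is only that the work per query grows with the dimension:

```python
import random
import time


def random_vectors(count, dimension, seed=0):
    """Generate reproducible random vectors (stand-ins for real embeddings)."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dimension)] for _ in range(count)]


def brute_force_best_match(query, vectors):
    """Index of the vector with the highest dot product against the query."""
    best_index, best_score = -1, float("-inf")
    for i, vec in enumerate(vectors):
        score = sum(q * v for q, v in zip(query, vec))
        if score > best_score:
            best_index, best_score = i, score
    return best_index


for dim in (384, 1536):
    vectors = random_vectors(500, dim)
    query = random_vectors(1, dim, seed=1)[0]

    start = time.perf_counter()
    brute_force_best_match(query, vectors)
    elapsed_ms = (time.perf_counter() - start) * 1000

    print(f"{dim:>4} dims: {elapsed_ms:.1f} ms for {len(vectors) * dim:,} multiply-adds")
```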
6. Common Mistakes I Made
Mistake 1: Over-Dimensioning
I was using 1536 dimensions for simple semantic search. Here’s what I should have done:
```python
def recommend_dimension(use_case: str) -> dict:
    """Match dimension to actual needs."""

    recommendations = {
        "semantic_search_simple": {
            "dimension": 384,
            "model": "all-MiniLM-L6-v2",
            "reason": "Fast queries, sufficient for simple search"
        },
        "rag_general": {
            "dimension": 1536,
            "model": "text-embedding-ada-002",
            "reason": "Captures nuanced context for Q&A systems"
        },
        "semantic_deduplication": {
            "dimension": 768,
            "model": "all-mpnet-base-v2",
            "reason": "Balances precision and performance"
        },
        "edge_deployment": {
            "dimension": 256,
            "model": "all-MiniLM-L6-v2 (with PCA)",
            "reason": "Optimized for constrained devices"
        }
    }

    return recommendations.get(use_case, {
        "dimension": 768,
        "model": "all-mpnet-base-v2",
        "reason": "Safe default for most applications"
    })

# For my docs portal:
print(recommend_dimension("semantic_search_simple"))
# {'dimension': 384, 'model': 'all-MiniLM-L6-v2', 'reason': 'Fast queries, sufficient for simple search'}
```

Mistake 2: Ignoring Dimensionality Reduction
I didn’t know I could reduce dimensions after the fact:
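```python
from sklearn.decomposition import PCA

# Start with 768-dim embeddings. PCA cannot produce more components than it
# has samples, so fit it on the full corpus (`documents` from section 2),
# not on the 5 test docs from section 4.
embeddings_768 = model_large.encode(documents)

# Reduce to 256 dimensions with PCA
pca = PCA(n_components=256)
embeddings_256 = pca.fit_transform(embeddings_768)

print(f"Original: {embeddings_768.shape}")
print(f"Reduced: {embeddings_256.shape}")
print(f"Explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# At query time, project queries with the same fitted PCA:
# query_256 = pca.transform(model_large.encode([query]))
```

Mistake 3: Mixing Dimensions in the Same Collection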
Vector databases typically require uniform dimensions. I tried mixing models and got errors:
```python
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="mixed_docs")

# This works fine
collection.add(
    documents=["doc1", "doc2"],
    embeddings=[[0.1] * 384, [0.2] * 384],  # 384 dimensions
    ids=["1", "2"]
)

# This throws an error - wrong dimension!
# collection.add(
#     documents=["doc3"],
#     embeddings=[[0.1] * 1536],  # 1536 dimensions - MISMATCH
#     ids=["3"]
# )
```

7. The Solution: Right-Sizing My Embeddings
For the documentation portal, I switched to a smaller model:
```python
from sentence_transformers import SentenceTransformer
import chromadb

# Use smaller dimension model for simple semantic search
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions
db = chromadb.Client()
collection = db.create_collection(name="docs")

def embed_and_store(texts):
    embeddings = model.encode(texts)
    collection.add(
        documents=texts,
        embeddings=embeddings.tolist(),
        ids=[str(i) for i in range(len(texts))]
    )

# Storage comparison:
# Before: 100k docs * 1536 * 4 bytes = 614 MB raw
# After:  100k docs * 384 * 4 bytes = 153 MB raw
# Savings: 75% reduction

print(f"New embedding dimension: {model.get_sentence_embedding_dimension()}")
# Output: New embedding dimension: 384
```

The search quality remained excellent for the documentation use case, and my storage costs dropped by 75%.
8. When to Actually Use Larger Dimensions
I don’t want to swing too far the other way. Larger dimensions are genuinely necessary for:
- RAG over legal/medical documents: Nuance matters enormously
- Cross-lingual retrieval: Different languages need more semantic space
- Long-form content analysis: Book summaries, research papers
- Fine-grained sentiment: Distinguishing “good” from “excellent”
For my next project—a legal contract search system—I’ll use 1536 dimensions without hesitation.
9. Summary
Embedding dimension is the length of your vector representation. The choice balances semantic richness against computational cost.
Key takeaways:
- Match dimension to task complexity, not just “what’s popular”
- 384 dimensions is often sufficient for simple semantic search
- 768 dimensions is a good default for most applications
- 1536+ dimensions for nuanced content where precision matters
- You can reduce dimensions with PCA if needed
- Never mix dimensions in the same vector collection
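The last takeaway can be enforced in application code before anything reaches the database. Below is a minimal sketch, not a feature of Chroma or any other vector store: `DimensionGuard` is a hypothetical helper that remembers the first vector length it sees and rejects anything else.

```python
class DimensionGuard:
    """Remembers the first vector length seen and rejects any other length."""

    def __init__(self, expected_dimension=None):
        self.expected_dimension = expected_dimension

    def check(self, embeddings):
        """Return the embeddings unchanged, or raise if any length differs."""
        for position, vector in enumerate(embeddings):
            if self.expected_dimension is None:
                self.expected_dimension = len(vector)
            elif len(vector) != self.expected_dimension:
                raise ValueError(
                    f"embedding {position} has {len(vector)} dimensions, "
                    f"expected {self.expected_dimension}"
                )
        return embeddings


guard = DimensionGuard()
guard.check([[0.1] * 384, [0.2] * 384])  # passes and locks the guard to 384

try:
    guard.check([[0.1] * 1536])  # vector from a different model
except ValueError as err:
    print(f"Rejected: {err}")
```

Wrapping the embeddings in `guard.check(...)` just before the call that writes to the collection would surface a model mix-up with a readable message instead of a database error.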
How I choose now:
```text
Simple keyword-ish search?    -> 384 dims
Balanced general purpose?     -> 768 dims
Complex RAG or legal/medical? -> 1536 dims
Edge/mobile deployment?       -> 256 dims (with PCA)
```

My vector database bill is now 75% smaller, and search quality hasn’t suffered. Sometimes the answer isn’t “bigger is better”; it’s “right-sized is better.”
Final Words + More Resources
My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.