
What is Embedding Dimension? How I Learned to Stop Over-Provisioning My Vector Database

1. Purpose

I recently hit a storage wall with my vector database. After loading 2 million documents into Pinecone, my monthly bill looked like a phone number. When I investigated, I realized the culprit wasn’t my data volume—it was my embedding dimension choice.

This post explains what embedding dimension is, why the choice matters, and how I learned to match dimensions to actual use cases instead of blindly picking the largest available.

2. The Problem: My Vector Database Was Too Fat

I was building a semantic search system for a client’s documentation portal. Here’s what my initial setup looked like:

initial_setup.py
from openai import OpenAI
import chromadb

client = OpenAI()
db = chromadb.Client()

# I picked OpenAI's ada-002 because "bigger is better", right?
def embed_texts(texts, batch_size=100):
    embeddings = []
    # Send the texts in batches: one request cannot carry 100k inputs.
    # The batch size is a guess; tune it to your provider's request limits.
    for start in range(0, len(texts), batch_size):
        response = client.embeddings.create(
            model="text-embedding-ada-002",  # 1536 dimensions
            input=texts[start:start + batch_size],
        )
        embeddings.extend(item.embedding for item in response.data)
    return embeddings

# Embedding 100k documents
documents = load_documents()  # 100,000 docs (loader not shown)
embeddings = embed_texts(documents)

# Storage calculation I didn't do upfront:
# 100,000 docs * 1536 floats * 4 bytes/float = ~614 MB raw
# Plus indexing overhead = ~1.8 GB
print(f"Embedding shape: {len(embeddings[0])}")  # 1536

I chose ada-002 (1536 dimensions) because it was the “standard” choice. But for simple semantic search over technical documentation, I was over-provisioning by a factor of 4.
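
The storage arithmetic from the comments above can be sketched as a small helper. The function names are mine, not from any library, and the estimate assumes float32 values (4 bytes each) and ignores index overhead, which varies by database.

```python
def raw_embedding_bytes(num_vectors: int, dimension: int, bytes_per_value: int = 4) -> int:
    """Raw size of the stored vectors, excluding any index overhead."""
    return num_vectors * dimension * bytes_per_value


def to_megabytes(num_bytes: int) -> float:
    """Convert bytes to decimal megabytes (1 MB = 1,000,000 bytes)."""
    return num_bytes / 1_000_000


# Compare the raw footprint of 100k documents at three common dimensions
for dim in (384, 768, 1536):
    size_mb = to_megabytes(raw_embedding_bytes(100_000, dim))
    print(f"{dim:>5} dims: {size_mb:.1f} MB raw")
```

Running this before picking a model makes the factor of 4 between 384 and 1536 dimensions visible up front.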

The client’s docs were mostly API references, configuration guides, and troubleshooting steps. They didn’t need subtle semantic nuance—they needed fast, accurate keyword-ish matching with some synonym awareness.

3. What Is Embedding Dimension, Really?

After my storage shock, I dove into understanding what dimension actually means.

Embedding dimension is simply the length of the vector that represents your data. If you have a 384-dimensional embedding, you have a list of 384 floating-point numbers. Each number captures some learned feature of your input.

Here’s how I visualized it:

Dimension comparison
Small dimension (384):
    [0.12, -0.45, 0.78, ..., 0.33]   # 384 numbers
    -> Less semantic detail captured
    -> Lower storage (384 * 4 = 1,536 bytes per embedding)
    -> Faster similarity search

Large dimension (1536):
    [0.12, -0.45, 0.78, ..., -0.21]  # 1536 numbers
    -> More semantic detail captured
    -> Higher storage (1536 * 4 = 6,144 bytes per embedding)
    -> Slower similarity search
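
To see why search slows down as dimension grows, here is a minimal sketch of cosine similarity in plain Python. It assumes a brute-force scan over every stored vector. Real vector databases usually use approximate nearest-neighbor indexes, so treat the linear scaling as a rough guide rather than a measurement. `multiply_adds_per_query` is a hypothetical helper of mine.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))  # one multiply-add per dimension
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def multiply_adds_per_query(num_vectors: int, dimension: int) -> int:
    """Multiply-adds spent on dot products in one brute-force scan."""
    return num_vectors * dimension


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal
ratio = multiply_adds_per_query(100_000, 1536) / multiply_adds_per_query(100_000, 384)
print(f"1536 vs 384 dims: {ratio:.0f}x the dot-product work per query")
```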

3.1 Common Models and Their Dimensions

I created a quick reference table:

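| Model                  | Dimension | When to Use                  |
|------------------------|-----------|------------------------------|
| all-MiniLM-L6-v2       | 384       | Simple search, edge devices  |
| all-mpnet-base-v2      | 768       | Balanced workloads           |
| text-embedding-ada-002 | 1536      | Complex RAG, nuanced content |
| Cohere embed-v3        | 1024      | Enterprise search            |
| Gemini Embedding       | Varies    | Google AI applications       |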

4. Testing Different Dimensions

I ran a practical comparison to see the real impact:

dimension_comparison.py
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Test documents
docs = [
    "How to reset the admin password",
    "Changing administrator credentials",
    "Password recovery for admin users",
    "Configuring database connections",
    "Setting up MySQL connection pool",
]

# Model 1: Small dimension (384)
model_small = SentenceTransformer('all-MiniLM-L6-v2')
embeddings_small = model_small.encode(docs)

# Model 2: Large dimension (768)
model_large = SentenceTransformer('all-mpnet-base-v2')
embeddings_large = model_large.encode(docs)

print(f"Small model dimension: {embeddings_small.shape[1]}")  # 384
print(f"Large model dimension: {embeddings_large.shape[1]}")  # 768
print(f"Storage ratio: {embeddings_large.nbytes / embeddings_small.nbytes:.1f}x")

# Compare similarity results
def find_similar(query, embeddings, model):
    # Embed the query with the same model that produced the document embeddings
    query_emb = model.encode([query])
    similarities = cosine_similarity(query_emb, embeddings)[0]
    return sorted(zip(docs, similarities), key=lambda x: x[1], reverse=True)

query = "reset admin password"

print("\nSmall model results:")
for doc, score in find_similar(query, embeddings_small, model_small)[:3]:
    print(f" {score:.3f}: {doc}")

print("\nLarge model results:")
for doc, score in find_similar(query, embeddings_large, model_large)[:3]:
    print(f" {score:.3f}: {doc}")

Output:

Comparison output
Small model dimension: 384
Large model dimension: 768
Storage ratio: 2.0x

Small model results:
 0.892: How to reset the admin password
 0.745: Password recovery for admin users
 0.612: Changing administrator credentials

Large model results:
 0.912: How to reset the admin password
 0.768: Password recovery for admin users
 0.634: Changing administrator credentials

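Both models found the right answer. The larger model’s similarity scores were about 0.02 higher, but the ranking was identical. Scores from different models aren’t directly comparable anyway, so the ranking is what matters for search. For my use case, an identical ranking wasn’t worth 2x the storage.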
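
Since the ranking is what counts, it helps to compare rankings directly instead of eyeballing scores. Below is a minimal sketch using the scores from the output above. The helper names `ranking` and `top_k_overlap` are my own, and a real evaluation would use many queries with known relevant documents rather than a single one.

```python
from typing import Dict, List


def ranking(scores: Dict[str, float]) -> List[str]:
    """Documents ordered from most to least similar."""
    return sorted(scores, key=lambda doc: scores[doc], reverse=True)


def top_k_overlap(a: Dict[str, float], b: Dict[str, float], k: int) -> float:
    """Fraction of the top-k documents that two result sets share."""
    return len(set(ranking(a)[:k]) & set(ranking(b)[:k])) / k


# Scores copied from the comparison output
small = {
    "How to reset the admin password": 0.892,
    "Password recovery for admin users": 0.745,
    "Changing administrator credentials": 0.612,
}
large = {
    "How to reset the admin password": 0.912,
    "Password recovery for admin users": 0.768,
    "Changing administrator credentials": 0.634,
}

print("Same order:", ranking(small) == ranking(large))
print("Top-3 overlap:", top_k_overlap(small, large, k=3))
```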

5. Why Dimension Choice Matters

The trade-offs became clear when I mapped them out:

5.1 Larger Dimensions (1024-3072+)

Pros:

  • Capture subtle semantic distinctions
  • Better for long-form, complex content
  • Superior for cross-lingual retrieval

Cons:

  • Linear storage increase (1536 dims = 4x storage of 384)
  • Slower vector similarity calculations
  • More GPU memory during inference
  • Higher vector database costs, since storage scales with dimension (embedding API fees are typically charged per input token, not per dimension)

5.2 Smaller Dimensions (384-512)

Pros:

  • Fast indexing and queries
  • Lower storage footprint
  • Better for edge/mobile deployment
  • Often “good enough” for simple tasks

Cons:

  • May miss subtle semantic nuances
  • Poorer performance on cross-lingual tasks
  • Less effective for very long documents

6. Common Mistakes I Made

Mistake 1: Over-Dimensioning

I was using 1536 dimensions for simple semantic search. Here’s what I should have done:

dimension_choice.py
def recommend_dimension(use_case: str) -> dict:
    """Match dimension to actual needs."""
    recommendations = {
        "semantic_search_simple": {
            "dimension": 384,
            "model": "all-MiniLM-L6-v2",
            "reason": "Fast queries, sufficient for simple search"
        },
        "rag_general": {
            "dimension": 1536,
            "model": "text-embedding-ada-002",
            "reason": "Captures nuanced context for Q&A systems"
        },
        "semantic_deduplication": {
            "dimension": 768,
            "model": "all-mpnet-base-v2",
            "reason": "Balances precision and performance"
        },
        "edge_deployment": {
            "dimension": 256,
            "model": "all-MiniLM-L6-v2 (with PCA)",
            "reason": "Optimized for constrained devices"
        }
    }
    # Fall back to a mid-sized model for unknown use cases
    return recommendations.get(use_case, {
        "dimension": 768,
        "model": "all-mpnet-base-v2",
        "reason": "Safe default for most applications"
    })

# For my docs portal:
print(recommend_dimension("semantic_search_simple"))
# {'dimension': 384, 'model': 'all-MiniLM-L6-v2', 'reason': 'Fast queries, sufficient for simple search'}

Mistake 2: Ignoring Dimensionality Reduction

I didn’t know I could reduce dimensions after the fact:

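dimensionality_reduction.py
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import PCA

model_large = SentenceTransformer('all-mpnet-base-v2')

# PCA can keep at most min(n_samples, n_features) components, so fit it on
# the full corpus (at least 256 docs here), not on a handful of test docs
corpus = load_documents()  # loader not shown
embeddings_768 = model_large.encode(corpus)

# Reduce to 256 dimensions with PCA
pca = PCA(n_components=256)
embeddings_256 = pca.fit_transform(embeddings_768)

# Queries must go through the same fitted PCA before searching:
# query_256 = pca.transform(model_large.encode([query]))
print(f"Original: {embeddings_768.shape}")
print(f"Reduced: {embeddings_256.shape}")
print(f"Explained variance: {sum(pca.explained_variance_ratio_):.2%}")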
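
Two details of PCA are easy to miss: it can keep at most as many components as the smaller of the sample count and the feature count, and queries have to pass through the same fitted projection as the documents. The sketch below illustrates both with NumPy's SVD as a stand-in for scikit-learn's `PCA`. The random vectors are placeholders for real embeddings, and `fit_pca` and `project` are hypothetical helper names.

```python
import numpy as np


def fit_pca(embeddings: np.ndarray, n_components: int):
    """Return (mean, components) for a PCA projection fitted via SVD."""
    n_samples, n_features = embeddings.shape
    if n_components > min(n_samples, n_features):
        raise ValueError("n_components must be <= min(n_samples, n_features)")
    mean = embeddings.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    return mean, vt[:n_components]


def project(vectors: np.ndarray, mean: np.ndarray, components: np.ndarray) -> np.ndarray:
    """Apply an already fitted projection to documents or queries."""
    return (vectors - mean) @ components.T


rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 768)).astype(np.float32)  # placeholder embeddings
query = rng.normal(size=(1, 768)).astype(np.float32)      # placeholder query

mean, components = fit_pca(corpus, n_components=256)
corpus_256 = project(corpus, mean, components)
query_256 = project(query, mean, components)  # same projection as the corpus
print(corpus_256.shape, query_256.shape)  # (1000, 256) (1, 256)
```

Fitting on five test documents and asking for 256 components raises an error in this sketch, which is the same constraint scikit-learn enforces.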

Mistake 3: Mixing Dimensions in the Same Collection

Vector databases typically require uniform dimensions. I tried mixing models and got errors:

dimension_mismatch.py
import chromadb

client = chromadb.Client()
collection = client.create_collection(name="mixed_docs")

# This works fine
collection.add(
    documents=["doc1", "doc2"],
    embeddings=[[0.1] * 384, [0.2] * 384],  # 384 dimensions
    ids=["1", "2"]
)

# This throws an error - wrong dimension!
# collection.add(
#     documents=["doc3"],
#     embeddings=[[0.1] * 1536],  # 1536 dimensions - MISMATCH
#     ids=["3"]
# )
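
One way to catch this earlier is to validate vector lengths before anything reaches the database. This is a minimal sketch; `check_dimensions` is a hypothetical helper of mine, not part of chromadb, and the expected dimension has to come from whichever model the collection was built with.

```python
from typing import Sequence


def check_dimensions(embeddings: Sequence[Sequence[float]], expected_dim: int) -> None:
    """Raise ValueError if any embedding has the wrong length."""
    for position, vector in enumerate(embeddings):
        if len(vector) != expected_dim:
            raise ValueError(
                f"embedding {position} has {len(vector)} dimensions, "
                f"expected {expected_dim}"
            )


# Matching vectors pass silently
check_dimensions([[0.1] * 384, [0.2] * 384], expected_dim=384)

# A vector from a different model is rejected before it is stored
try:
    check_dimensions([[0.1] * 1536], expected_dim=384)
except ValueError as err:
    print(f"Rejected: {err}")
```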

7. The Solution: Right-Sizing My Embeddings

For the documentation portal, I switched to a smaller model:

optimized_setup.py
from sentence_transformers import SentenceTransformer
import chromadb

# Use smaller dimension model for simple semantic search
model = SentenceTransformer('all-MiniLM-L6-v2')  # 384 dimensions
db = chromadb.Client()
collection = db.create_collection(name="docs")

def embed_and_store(texts):
    embeddings = model.encode(texts)
    collection.add(
        documents=texts,
        embeddings=embeddings.tolist(),
        ids=[str(i) for i in range(len(texts))]
    )

# Storage comparison:
# Before: 100k docs * 1536 * 4 bytes = ~614 MB raw
# After: 100k docs * 384 * 4 bytes = ~154 MB raw
# Savings: 75% reduction
print(f"New embedding dimension: {model.get_sentence_embedding_dimension()}")
# Output: New embedding dimension: 384

The search quality remained excellent for the documentation use case, and my storage costs dropped by 75%.

8. When to Actually Use Larger Dimensions

I don’t want to swing too far the other way. Larger dimensions are genuinely necessary for:

  1. RAG over legal/medical documents: Nuance matters enormously
  2. Cross-lingual retrieval: Different languages need more semantic space
  3. Long-form content analysis: Book summaries, research papers
  4. Fine-grained sentiment: Distinguishing “good” from “excellent”

For my next project—a legal contract search system—I’ll use 1536 dimensions without hesitation.

9. Summary

Embedding dimension is the length of your vector representation. The choice balances semantic richness against computational cost.

Key takeaways:

  • Match dimension to task complexity, not just “what’s popular”
  • 384 dimensions is often sufficient for simple semantic search
  • 768 dimensions is a good default for most applications
  • 1536+ dimensions for nuanced content where precision matters
  • You can reduce dimensions with PCA if needed
  • Never mix dimensions in the same vector collection

How I choose now:

Decision flow
Simple keyword-ish search? -> 384 dims
Balanced general purpose? -> 768 dims
Complex RAG or legal/medical? -> 1536 dims
Edge/mobile deployment? -> 256 dims (with PCA)

My vector database bill is now 75% smaller, and search quality hasn’t suffered. Sometimes the answer isn’t “bigger is better”—it’s “right-sized is better.”

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

