
How Multimodal Embeddings Work in RAG Pipelines

I was building a RAG system last month and hit a wall. My application needed to search across PDFs, screenshots, and audio recordings. I found myself creating a separate pipeline for each data type - a text embedder here, CLIP for images there, Whisper-based audio embeddings somewhere else. The complexity was spiraling out of control.

Then Google released Gemini Embedding, and I realized my entire architecture was built on a flawed assumption: that each modality needs its own embedding model and vector collection.

The Problem: Modality Silos

My original RAG architecture looked like this:

Architecture Overview
Text Pipeline: Text -> OpenAI ada-002 -> Vector DB Collection A
Image Pipeline: Image -> CLIP -> Vector DB Collection B
Audio Pipeline: Audio -> Whisper -> Vector DB Collection C

Each pipeline required:

  • A different embedding model with different output dimensions
  • A separate vector collection in Qdrant
  • Different chunking/preprocessing logic
  • Different similarity thresholds

The worst part? A text query like “show me architecture diagrams” could never retrieve the actual diagram images I had stored. Cross-modal retrieval was impossible.

traditional_rag.py
from langchain_openai import OpenAIEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Separate pipelines for different modalities
# Only handles text; reads OPENAI_API_KEY from the environment
text_embedder = OpenAIEmbeddings(model="text-embedding-ada-002")
# Would need CLIP for images, a separate audio embedder, etc.

client = QdrantClient(":memory:")

# Create separate collections per modality
client.create_collection(
    collection_name="text_documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)
client.create_collection(
    collection_name="images",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # CLIP dimension
)

# Query is text-only
query = "What is retrieval augmented generation?"
query_vector = text_embedder.embed_query(query)

# Can only retrieve text documents - images are in a different collection!
results = client.query_points(
    collection_name="text_documents",
    query=query_vector,
    limit=5,
).points

I had three different collections, three different embedding dimensions, and zero ability to search across modalities.
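To make the silo problem concrete, here is a minimal sketch of what happens if you try to compare a text query vector with a CLIP image vector directly. The `cosine_similarity` helper and the stand-in vectors are illustrative assumptions, not part of my actual pipeline; only the two dimensions (1536 and 512) come from the setup above.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity, which is only defined for vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in vectors with the two dimensions used above (values are arbitrary).
text_query_vector = [0.1] * 1536  # text embedder output size
image_vector = [0.1] * 512        # CLIP output size

try:
    cosine_similarity(text_query_vector, image_vector)
    comparable = True
except ValueError as err:
    comparable = False
    print(err)
```

Padding or truncating the vectors to the same length would not fix this: the two models were trained independently, so their axes do not mean the same thing.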

The Solution: Shared Vector Space

Gemini Embedding (like similar multimodal models) solves this by encoding text, images, audio, and video into a single shared vector space. This is fundamentally different from running multiple embedding models and storing the results together - the model itself learns the semantic relationships between modalities, so vectors from different modalities can be compared directly.
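A toy sketch of what "shared space" buys you: once every item lives in the same space, one similarity search ranks all modalities together. The 3-dimensional vectors and item IDs below are hand-made stand-ins (real embeddings have hundreds of dimensions), and the brute-force `search` function is only for illustration, not how a vector database does it internally.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical items: different modalities, one vector space.
items = [
    {"id": "doc-1", "type": "text", "vector": [0.9, 0.1, 0.0]},
    {"id": "img-1", "type": "image", "vector": [0.8, 0.2, 0.1]},
    {"id": "aud-1", "type": "audio", "vector": [0.0, 0.1, 0.9]},
]


def search(query_vector, items, limit=5):
    """Brute-force nearest-neighbour search over all items, regardless of modality."""
    scored = [(cosine_similarity(query_vector, item["vector"]), item) for item in items]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:limit]


query_vector = [1.0, 0.0, 0.0]  # stand-in for an embedded text query
hits = search(query_vector, items, limit=2)
for score, item in hits:
    print(f"{item['id']} ({item['type']}): {score:.3f}")
```

Here the text document and the image both rank above the audio clip for the same query, which is exactly the behaviour separate collections cannot give you.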

Unified Architecture
All Modalities -> Gemini Embedding -> Single Vector DB Collection
                        |
                        v
           768-dimensional shared space

Here’s what the unified architecture looks like:

multimodal_rag.py
from pathlib import Path

from google import genai
from google.genai import types
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Model ID as used in this article. Check Google's docs to confirm that the
# model you pick accepts image and audio input before relying on this.
MODEL = "gemini-embedding-exp-03-07"
# Request 768-dimensional vectors so they match the collection size below
config = types.EmbedContentConfig(output_dimensionality=768)

# Single client for all modalities
client = genai.Client(api_key="YOUR_API_KEY")
qdrant = QdrantClient(":memory:")

# One collection for everything - same dimension for all modalities
qdrant.create_collection(
    collection_name="multimodal_content",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Embed different modalities into the SAME space
text_embedding = client.models.embed_content(
    model=MODEL,
    contents="RAG combines retrieval with generation",
    config=config,
)
image_embedding = client.models.embed_content(
    model=MODEL,
    contents=types.Part.from_bytes(data=Path("diagram.png").read_bytes(), mime_type="image/png"),
    config=config,
)
audio_embedding = client.models.embed_content(
    model=MODEL,
    contents=types.Part.from_bytes(data=Path("lecture.mp3").read_bytes(), mime_type="audio/mp3"),
    config=config,
)

# Store all in ONE collection (each response holds a list of embeddings; we sent one item)
points = [
    PointStruct(id=1, vector=text_embedding.embeddings[0].values, payload={"type": "text", "content": "..."}),
    PointStruct(id=2, vector=image_embedding.embeddings[0].values, payload={"type": "image", "path": "diagram.png"}),
    PointStruct(id=3, vector=audio_embedding.embeddings[0].values, payload={"type": "audio", "path": "lecture.mp3"}),
]
qdrant.upsert(collection_name="multimodal_content", points=points)

Cross-Modal Retrieval: The Game Changer

The real magic happens at query time. I can now query with text and retrieve images, or query with an image and retrieve documents:

cross_modal_search.py
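# Cross-modal retrieval: Query with text, retrieve images!
# Reuses `client` and `qdrant` from multimodal_rag.py
from google.genai import types

query_result = client.models.embed_content(
    model="gemini-embedding-exp-03-07",
    contents="Show me architecture diagrams",
    config=types.EmbedContentConfig(output_dimensionality=768),
)
results = qdrant.query_points(
    collection_name="multimodal_content",
    query=query_result.embeddings[0].values,
    limit=5,
).points

# Results include BOTH text documents AND images
for result in results:
    payload = result.payload
    # Media items carry a file path; text items carry their content
    detail = f"Path: {payload['path']}" if "path" in payload else f"Content: {payload['content']}"
    print(f"Score: {result.score:.2f}, Type: {payload['type']}, {detail}")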

Output:

Search Results
Score: 0.89, Type: image, Path: system-architecture.png
Score: 0.85, Type: text, Content: Our system architecture consists of...
Score: 0.82, Type: image, Path: data-flow-diagram.png
Score: 0.78, Type: audio, Path: architecture-deep-dive.mp3

The query “show me architecture diagrams” returned images, text, and audio - all semantically related to architecture.

Why This Is Different from CLIP

I initially thought this was just CLIP with more modalities. It’s not.

CLIP aligns text and images into a shared space, but:

  1. CLIP only handles two modalities (text + images)
  2. CLIP requires loading both text and image encoders separately
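  3. CLIP's text encoder is built for short image captions (its context is limited to 77 tokens), so its text embeddings are weaker than dedicated text models for document retrieval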

Gemini Embedding puts text, images, audio, and video on the same “chart” using a single model. As one Reddit user put it:

“I thought Gemini push it a little bit, by putting images, music/voice, text on a single ‘chart’, which all the other models, they always put them on separate. For my Qdrant database this sound like a real change.”

The key insight: it’s not about having multiple models with compatible outputs, it’s about one model that natively understands all modalities.

Pinecone Implementation

If you’re using Pinecone, the pattern is similar:

pinecone_multimodal.py
from pathlib import Path

from google import genai
from google.genai import types
from pinecone import Pinecone

MODEL = "gemini-embedding-exp-03-07"  # same model ID as in multimodal_rag.py

client = genai.Client(api_key="YOUR_API_KEY")
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
# Assumes this index already exists and its dimension matches the embeddings
index = pc.Index("multimodal-rag")

def embed(contents):
    """Return one embedding vector for a string or a types.Part."""
    response = client.models.embed_content(model=MODEL, contents=contents)
    return response.embeddings[0].values

# Upsert multimodal content
image_bytes = Path("diagram.png").read_bytes()
text_emb = embed("...text...")
image_emb = embed(types.Part.from_bytes(data=image_bytes, mime_type="image/png"))
index.upsert(vectors=[
    ("doc-1", text_emb, {"modality": "text", "title": "..."}),
    ("img-1", image_emb, {"modality": "image", "description": "..."}),
])

# Unified search across all modalities
query_emb = embed("find relevant content")
results = index.query(vector=query_emb, top_k=10, include_metadata=True)

Lessons Learned

After rebuilding my RAG system with multimodal embeddings, here’s what I learned:

Distance thresholds need recalibration. The optimal similarity threshold for cross-modal retrieval differs from text-only. I had to re-tune my threshold from 0.75 to around 0.82.
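One way to redo that tuning is to hand-label a small set of query/result pairs as relevant or not, then pick the cutoff that maximizes F1. The sketch below assumes that approach; the `best_threshold` name and the sample judgments are made up for illustration and are not real measurements.

```python
def best_threshold(labeled_scores):
    """Return (threshold, f1) for the score cutoff that maximizes F1.

    labeled_scores: list of (similarity_score, is_relevant) pairs.
    Each observed score is tried as a candidate cutoff (keep if score >= cutoff).
    """
    total_relevant = sum(1 for _, relevant in labeled_scores if relevant)
    best = (0.0, 0.0)
    for threshold in sorted({score for score, _ in labeled_scores}):
        kept = [relevant for score, relevant in labeled_scores if score >= threshold]
        true_positives = sum(kept)
        if true_positives == 0:
            continue
        precision = true_positives / len(kept)
        recall = true_positives / total_relevant
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Hypothetical relevance judgments: (similarity score, was the hit relevant?)
samples = [
    (0.91, True), (0.88, True), (0.84, True), (0.80, False),
    (0.79, True), (0.76, False), (0.70, False),
]
threshold, f1 = best_threshold(samples)
print(f"best threshold: {threshold}, F1: {f1:.3f}")
```

F1 is just one choice of objective. If false positives hurt more than missed results in your application, optimizing for precision at a minimum recall may be a better fit.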

Metadata still matters. Even though everything is in one collection, I still track modality type in the payload for filtering and result ranking.
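As a rough sketch of what that payload field enables, the snippet below filters and groups hits by modality on the client side. The hits reuse the example search output from earlier in this post; the helper names are my own for illustration. Qdrant can also apply this kind of condition server-side with a payload filter, which is usually preferable for large collections.

```python
from collections import defaultdict

# Hits shaped like the search output shown earlier (score plus payload).
hits = [
    {"score": 0.89, "payload": {"type": "image", "path": "system-architecture.png"}},
    {"score": 0.85, "payload": {"type": "text", "content": "Our system architecture consists of..."}},
    {"score": 0.82, "payload": {"type": "image", "path": "data-flow-diagram.png"}},
    {"score": 0.78, "payload": {"type": "audio", "path": "architecture-deep-dive.mp3"}},
]


def filter_by_modality(hits, allowed):
    """Keep only hits whose payload type is in the allowed set."""
    return [hit for hit in hits if hit["payload"]["type"] in allowed]


def group_by_modality(hits):
    """Bucket hits by payload type, preserving score order within each bucket."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit["payload"]["type"]].append(hit)
    return dict(groups)


images_only = filter_by_modality(hits, {"image"})
grouped = group_by_modality(hits)
print([hit["payload"]["path"] for hit in images_only])
print({modality: len(group) for modality, group in grouped.items()})
```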

Latency is higher but acceptable. Embedding an image takes longer than embedding text, but the architectural simplicity more than compensates.

Cost is different, not necessarily higher. I went from paying for three separate embedding APIs to one, but Gemini Embedding has its own pricing model. Do the math for your use case.
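A back-of-the-envelope way to do that math is sketched below. Every number is a placeholder I invented to show the calculation, not a real price from any provider, and `monthly_embedding_cost` is a hypothetical helper. Plug in your own volumes and the current published rates.

```python
def monthly_embedding_cost(items_per_month, price_per_item):
    """Sum volume * unit price over modalities. Raises KeyError if a price is missing."""
    return sum(count * price_per_item[modality] for modality, count in items_per_month.items())


# PLACEHOLDER volumes and prices - replace with your own numbers.
items_per_month = {"text": 100_000, "image": 5_000, "audio": 200}
separate_api_prices = {"text": 0.0001, "image": 0.002, "audio": 0.01}
unified_api_prices = {"text": 0.00015, "image": 0.001, "audio": 0.02}

separate_total = monthly_embedding_cost(items_per_month, separate_api_prices)
unified_total = monthly_embedding_cost(items_per_month, unified_api_prices)
print(f"separate: {separate_total:.2f}, unified: {unified_total:.2f}")
```

With these made-up inputs the unified setup comes out slightly more expensive, which is the point: the answer depends entirely on your modality mix, so it is worth computing rather than assuming.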

Comparison Summary

| Aspect | Traditional RAG | Multimodal RAG |
|---|---|---|
| Pipeline complexity | Multiple pipelines per modality | Single unified pipeline |
| Cross-modal search | Not possible | Native support |
| Model management | Multiple embedding models | One embedding model |
| Vector DB overhead | Multiple collections | Single collection |
| Query flexibility | Text-only queries | Any modality as query |

When to Use Multimodal Embeddings

This approach shines when:

  • You need cross-modal retrieval (text-to-image, image-to-document)
  • Your content mix is diverse (documents, screenshots, recordings)
  • You want to simplify your architecture
  • You’re building a new system and don’t have legacy constraints

Stick with traditional text embeddings when:

  • You only have text content
  • You’re already invested in a specific text embedding model
  • Latency is critical and you don’t need cross-modal search

Final Words + More Resources

My intention with this article was to share my knowledge and experience with others. If you want to get in touch, you can reach me by email.

