How Multimodal Embeddings Work in RAG Pipelines
I was building a RAG system last month and hit a wall. My application needed to search across PDFs, screenshots, and audio recordings. I found myself creating a separate pipeline for each data type: a text embedder here, CLIP for images there, Whisper embeddings somewhere else. The complexity was spiraling out of control.
Then Google released Gemini Embedding, and I realized my entire architecture was built on a flawed assumption: that each modality needs its own embedding model and vector collection.
The Problem: Modality Silos
My original RAG architecture looked like this:
```
Text Pipeline:  Text  -> OpenAI ada-002 -> Vector DB Collection A
Image Pipeline: Image -> CLIP           -> Vector DB Collection B
Audio Pipeline: Audio -> Whisper        -> Vector DB Collection C
```
Each pipeline required:
- A different embedding model with different output dimensions
- A separate vector collection in Qdrant
- Different chunking/preprocessing logic
- Different similarity thresholds
The worst part? A text query like “show me architecture diagrams” could never retrieve the actual diagram images I had stored. Cross-modal retrieval was impossible.
```python
from langchain.embeddings import OpenAIEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Separate pipelines for different modalities
text_embedder = OpenAIEmbeddings()  # Only handles text
# Would need CLIP for images, a separate audio embedder, etc.

client = QdrantClient(":memory:")

# Create separate collections per modality
client.create_collection(
    collection_name="text_documents",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),  # ada-002 dimension
)

client.create_collection(
    collection_name="images",
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),  # CLIP dimension
)

# Query is text-only
query = "What is retrieval augmented generation?"
query_vector = text_embedder.embed_query(query)

# Can only retrieve text documents - images are in a different collection!
results = client.search(
    collection_name="text_documents",
    query_vector=query_vector,
    limit=5,
)
```
I had three different collections, three different embedding dimensions, and zero ability to search across modalities.
The Solution: Shared Vector Space
Gemini Embedding (and similar multimodal models) solve this by encoding text, images, audio, and video into a single shared vector space. This is fundamentally different from running multiple embedding models and storing results together - the model itself understands the semantic relationships between modalities.
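To make the difference concrete, here is a minimal sketch using hand-written toy vectors (illustrative values, not real model outputs). Vectors that come from separate models cannot be scored against each other, while vectors in one shared space can be ranked in a single cosine-similarity pass. The `cosine_similarity` helper and the item names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Siloed setup: toy stand-ins for a 4-d text vector and a 3-d image vector
# produced by two unrelated models.
text_vec_model_a = [0.1, 0.3, 0.5, 0.7]
image_vec_model_b = [0.2, 0.4, 0.6]
try:
    cosine_similarity(text_vec_model_a, image_vec_model_b)
    comparable = True
except ValueError:
    comparable = False  # the two spaces cannot be compared directly

# Shared space: every item, whatever its modality, has the same dimension
# and was (by assumption) produced by the same model.
query = [1.0, 0.0, 0.0]
items = {
    "diagram.png": [0.9, 0.1, 0.0],
    "notes.txt": [0.6, 0.8, 0.0],
    "lecture.mp3": [0.0, 0.0, 1.0],
}
ranked = sorted(
    items,
    key=lambda name: cosine_similarity(query, items[name]),
    reverse=True,
)
print(comparable, ranked)
```

Matching dimensions alone would not fix the siloed case: two models can both emit 768-dimensional vectors and still place unrelated concepts near each other. The scores only mean something when one model produced every vector.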
```
All Modalities -> Gemini Embedding -> Single Vector DB Collection
                        |
                        v
           768-dimensional shared space
```
Here’s what the unified architecture looks like:
```python
from google import genai
from google.genai import types
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

# Single client for all modalities
client = genai.Client(api_key="YOUR_API_KEY")
qdrant = QdrantClient(":memory:")

# One collection for everything - same dimension for all modalities
qdrant.create_collection(
    collection_name="multimodal_content",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)

# Request 768-dimensional vectors so they match the collection size
config = types.EmbedContentConfig(output_dimensionality=768)

# Embed different modalities into the SAME space
text_embedding = client.models.embed_content(
    model="gemini-embedding-exp-03-07",
    contents="RAG combines retrieval with generation",
    config=config,
)

# Non-text inputs are passed as Parts with a MIME type. This assumes the
# model accepts image and audio parts; check the model documentation.
with open("diagram.png", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")
image_embedding = client.models.embed_content(
    model="gemini-embedding-exp-03-07", contents=image_part, config=config
)

with open("lecture.mp3", "rb") as f:
    audio_part = types.Part.from_bytes(data=f.read(), mime_type="audio/mp3")
audio_embedding = client.models.embed_content(
    model="gemini-embedding-exp-03-07", contents=audio_part, config=config
)

# Store all in ONE collection (each response holds a list of embeddings)
points = [
    PointStruct(id=1, vector=text_embedding.embeddings[0].values, payload={"type": "text", "content": "..."}),
    PointStruct(id=2, vector=image_embedding.embeddings[0].values, payload={"type": "image", "path": "diagram.png"}),
    PointStruct(id=3, vector=audio_embedding.embeddings[0].values, payload={"type": "audio", "path": "lecture.mp3"}),
]
qdrant.upsert(collection_name="multimodal_content", points=points)
```
Cross-Modal Retrieval: The Game Changer
The real magic happens at query time. I can now query with text and retrieve images, or query with an image and retrieve documents:
```python
from google.genai import types

# Cross-modal retrieval: Query with text, retrieve images!
query_result = client.models.embed_content(
    model="gemini-embedding-exp-03-07",
    contents="Show me architecture diagrams",
    config=types.EmbedContentConfig(output_dimensionality=768),
)

results = qdrant.search(
    collection_name="multimodal_content",
    query_vector=query_result.embeddings[0].values,
    limit=5,
)

# Results include BOTH text documents AND images
for result in results:
    payload = result.payload
    # Text points carry "content"; image and audio points carry "path"
    detail = f"Path: {payload['path']}" if "path" in payload else f"Content: {payload['content']}"
    print(f"Score: {result.score:.2f}, Type: {payload['type']}, {detail}")
```
Output:
```
Score: 0.89, Type: image, Path: system-architecture.png
Score: 0.85, Type: text, Content: Our system architecture consists of...
Score: 0.82, Type: image, Path: data-flow-diagram.png
Score: 0.78, Type: audio, Path: architecture-deep-dive.mp3
```
The query “show me architecture diagrams” returned images, text, and audio - all semantically related to architecture.
Why This Is Different from CLIP
I initially thought this was just CLIP with more modalities. It’s not.
CLIP aligns text and images into a shared space, but:
- CLIP only handles two modalities (text + images)
- CLIP requires loading both text and image encoders separately
- CLIP's text embeddings are generally weaker for text-to-text retrieval than those from dedicated text embedding models
Gemini Embedding puts text, images, audio, and video on the same “chart” using a single model. As one Reddit user put it:
“I thought Gemini push it a little bit, by putting images, music/voice, text on a single ‘chart’, which all the other models, they always put them on separate. For my Qdrant database this sound like a real change.”
The key insight: it’s not about having multiple models with compatible outputs, it’s about one model that natively understands all modalities.
Pinecone Implementation
If you’re using Pinecone, the pattern is similar:
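```python
from google import genai
from google.genai import types
from pinecone import Pinecone

client = genai.Client(api_key="YOUR_API_KEY")
pc = Pinecone(api_key="YOUR_PINECONE_KEY")
# Assumes the index already exists and its dimension matches the embeddings
index = pc.Index("multimodal-rag")

MODEL = "gemini-embedding-exp-03-07"

# Upsert multimodal content
text_emb = client.models.embed_content(model=MODEL, contents="...text...")

# Same image file as in the Qdrant example; assumes the model accepts image parts
with open("diagram.png", "rb") as f:
    image_bytes = f.read()
image_emb = client.models.embed_content(
    model=MODEL,
    contents=types.Part.from_bytes(data=image_bytes, mime_type="image/png"),
)

index.upsert(vectors=[
    ("doc-1", text_emb.embeddings[0].values, {"modality": "text", "title": "..."}),
    ("img-1", image_emb.embeddings[0].values, {"modality": "image", "description": "..."}),
])

# Unified search across all modalities
query_emb = client.models.embed_content(model=MODEL, contents="find relevant content")
results = index.query(vector=query_emb.embeddings[0].values, top_k=10, include_metadata=True)
```
Lessons Learned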
After rebuilding my RAG system with multimodal embeddings, here’s what I learned:
Distance thresholds need recalibration. The optimal similarity threshold for cross-modal retrieval differs from text-only. I had to re-tune my threshold from 0.75 to around 0.82.
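One way to re-tune a threshold is to score a small hand-labeled set of query/result pairs and pick the cutoff with the best F1. The sketch below assumes you have such labeled scores. The `best_threshold` helper, the candidate grid, and the sample numbers are all hypothetical illustrations, not measurements from a real system.

```python
def best_threshold(scored_pairs, candidates):
    """Return (threshold, f1) for the candidate with the highest F1.

    scored_pairs: list of (similarity_score, is_relevant) tuples.
    candidates: iterable of thresholds to try.
    """
    best, best_f1 = None, -1.0
    for t in candidates:
        tp = sum(1 for s, rel in scored_pairs if s >= t and rel)
        fp = sum(1 for s, rel in scored_pairs if s >= t and not rel)
        fn = sum(1 for s, rel in scored_pairs if s < t and rel)
        if tp == 0:
            continue  # F1 is undefined (or zero) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        f1 = 2 * precision * recall / (precision + recall)
        if f1 > best_f1:
            best, best_f1 = t, f1
    return best, best_f1

# Made-up labeled scores: (cosine similarity, was the hit actually relevant?)
labeled = [
    (0.93, True), (0.89, True), (0.86, True), (0.84, False), (0.83, True),
    (0.81, False), (0.80, True), (0.79, False), (0.77, False), (0.72, False),
]
candidates = [0.70, 0.74, 0.78, 0.82, 0.86, 0.90]

threshold, f1 = best_threshold(labeled, candidates)
print(threshold, round(f1, 2))
```

F1 weighs precision and recall equally. If false positives are more costly in your application, a precision-weighted metric may be a better fit.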
Metadata still matters. Even though everything is in one collection, I still track modality type in the payload for filtering and result ranking.
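As a rough illustration of using the modality field after retrieval, the sketch below filters hits by type and caps how many results each modality may contribute. The `filter_and_cap` helper and the dict shape of the hits are assumptions made for this example. The sample hits reuse the output shown earlier.

```python
from collections import defaultdict

def filter_and_cap(hits, allowed_types=None, max_per_type=2):
    """Keep allowed modalities, with at most max_per_type hits per modality.

    hits: list of dicts like {"score": float, "payload": {"type": str, ...}},
          assumed to be sorted by descending score already.
    """
    kept = []
    counts = defaultdict(int)
    for hit in hits:
        modality = hit["payload"]["type"]
        if allowed_types is not None and modality not in allowed_types:
            continue
        if counts[modality] >= max_per_type:
            continue
        counts[modality] += 1
        kept.append(hit)
    return kept

hits = [
    {"score": 0.89, "payload": {"type": "image", "path": "system-architecture.png"}},
    {"score": 0.85, "payload": {"type": "text", "content": "Our system architecture consists of..."}},
    {"score": 0.82, "payload": {"type": "image", "path": "data-flow-diagram.png"}},
    {"score": 0.78, "payload": {"type": "audio", "path": "architecture-deep-dive.mp3"}},
]

images_only = filter_and_cap(hits, allowed_types={"image"})
one_per_modality = filter_and_cap(hits, max_per_type=1)
print([h["payload"]["type"] for h in one_per_modality])
```

Qdrant can also apply payload filters on the server side at query time, which avoids fetching hits you will discard. Client-side capping like this is mainly useful for keeping the final result list diverse.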
Latency is higher but acceptable. Embedding an image takes longer than embedding text, but the architectural simplicity more than compensates.
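To check whether the extra latency is acceptable for your workload, you can time each modality separately. This is a minimal sketch: `time_embedding` is a hypothetical helper, and `fake_embed` is a stand-in that only sleeps in proportion to payload size. Swap in your real embedding call to get meaningful numbers.

```python
import statistics
import time

def time_embedding(embed_fn, payload, runs=5):
    """Return the median wall-clock seconds of embed_fn(payload) over several runs."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        embed_fn(payload)
        durations.append(time.perf_counter() - start)
    return statistics.median(durations)

def fake_embed(payload):
    # Hypothetical stand-in for a real embedding call:
    # sleeps one microsecond per byte, then returns a dummy vector.
    time.sleep(len(payload) / 1_000_000)
    return [0.0] * 768

text_latency = time_embedding(fake_embed, b"x" * 1_000)     # small text payload
image_latency = time_embedding(fake_embed, b"x" * 20_000)   # larger image payload
print(f"text: {text_latency * 1000:.1f} ms, image: {image_latency * 1000:.1f} ms")
```

The median is used rather than the mean so that a single slow network round trip does not dominate the result.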
Cost is different, not necessarily higher. I went from paying for three separate embedding APIs to one, but Gemini Embedding has its own pricing model. Do the math for your use case.
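The arithmetic itself is simple once you know your monthly volumes. In the sketch below every price is a placeholder invented for illustration, not a real rate from any provider. Replace them with the current numbers from each pricing page before drawing conclusions.

```python
def monthly_cost(volumes, price_per_unit):
    """Sum volume * unit price over all modalities (both dicts share keys)."""
    return sum(volumes[m] * price_per_unit[m] for m in volumes)

# Items embedded per month, by modality (example workload)
volumes = {"text": 200_000, "image": 10_000, "audio": 1_000}

# PLACEHOLDER prices in dollars per item. These are NOT real provider rates.
siloed_prices = {"text": 0.00002, "image": 0.0003, "audio": 0.002}
unified_prices = {"text": 0.00003, "image": 0.0002, "audio": 0.0015}

siloed_total = monthly_cost(volumes, siloed_prices)
unified_total = monthly_cost(volumes, unified_prices)
print(f"siloed: ${siloed_total:.2f}, unified: ${unified_total:.2f}")
```

With these made-up numbers the unified setup comes out slightly more expensive because text dominates the volume. A workload heavy on images or audio would tip the other way, which is why the result depends on your own content mix.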
Comparison Summary
| Aspect | Traditional RAG | Multimodal RAG |
|---|---|---|
| Pipeline complexity | Multiple pipelines per modality | Single unified pipeline |
| Cross-modal search | Not possible | Native support |
| Model management | Multiple embedding models | One embedding model |
| Vector DB overhead | Multiple collections | Single collection |
| Query flexibility | Text-only queries | Any modality as query |
When to Use Multimodal Embeddings
This approach shines when:
- You need cross-modal retrieval (text-to-image, image-to-document)
- Your content mix is diverse (documents, screenshots, recordings)
- You want to simplify your architecture
- You’re building a new system and don’t have legacy constraints
Stick with traditional text embeddings when:
- You only have text content
- You’re already invested in a specific text embedding model
- Latency is critical and you don’t need cross-modal search
Final Words + More Resources
My intention with this article was to share my knowledge and experience to help others. If you want to get in touch, you can contact me by email.