What is Gemini Embedding 2 and How Does It Enable Multimodal Search?
Problem
I was building a customer support knowledge base. The data looked like this:
- 5000 support tickets (text)
- 2000 screenshot attachments (images)
- 800 call recordings (audio)
My search worked fine for text queries. But users kept asking:
- “Can I search by uploading a screenshot?”
- “Can I find that call where the customer described this error?”
I tried combining separate embedding models:
    from openai import OpenAI
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Text embeddings (OpenAI)
    client = OpenAI()
    text_embedding = client.embeddings.create(
        input="Error code 500 on login",
        model="text-embedding-3-small",
    ).data[0].embedding

    # Image embeddings (CLIP)
    clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
    image = Image.open("error_screenshot.png")
    inputs = clip_processor(images=image, return_tensors="pt")
    image_embedding = clip_model.get_image_features(**inputs)

    # Problem: These vectors live in DIFFERENT spaces!
    # Can't compare text_embedding with image_embedding

This approach failed. Text embeddings from OpenAI and image embeddings from CLIP live in completely different vector spaces. I could search text with text, or images with images, but never cross the boundary.
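To make the mismatch concrete, here is a minimal sketch with zero-filled stand-in vectors that only mimic the output sizes of the two models (1536 values for text-embedding-3-small, 512 for the CLIP ViT-B/32 image features). The variable names are illustrative. Even when two models happen to share a dimension, their axes mean different things, so a similarity score between them would still be meaningless.

```python
import numpy as np

# Stand-in vectors with the output sizes of the two models (values are placeholders)
text_vec = np.zeros(1536)   # size of a text-embedding-3-small vector
image_vec = np.zeros(512)   # size of a CLIP ViT-B/32 image feature vector

try:
    np.dot(text_vec, image_vec)
    comparable = True
except ValueError as err:
    # NumPy refuses to take a dot product of differently sized vectors
    comparable = False
    print(f"Cannot compare: {err}")
```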
Environment
- Python 3.12
- Google Generative AI SDK
- Pinecone for vector storage
- PostgreSQL for metadata
Solution
Gemini Embedding 2 creates a shared embedding space for text, images, and audio. A product photo and its description text have similar vectors. An audio clip explaining a bug is close to related documentation.
Traditional Approach (Fragmented):

    +------------------+     +------------------+
    | Text Embeddings  |     | Image Embeddings |
    | (text-embedding) |     | (CLIP)           |
    +------------------+     +------------------+
             |                        |
             v                        v
        Vector DB 1              Vector DB 2
        (text only)             (images only)

Gemini Embedding 2 (Unified):

    +-----------------------------------------+
    |           Shared Vector Space           |
    | Text, Images, Audio -> Same Dimensions |
    +-----------------------------------------+
                         |
                         v
                  Single Vector DB
             (all modalities together)

Basic Text Embedding
    from google.generativeai import embedding

    result = embedding.embed_content(
        model="models/gemini-embedding-2",
        content="Customer reported login timeout after 30 seconds",
        task_type="retrieval_document",
    )

    vector = result['embedding']
    print(f"Vector dimension: {len(vector)}")  # Fixed dimension for all modalities

Image Embedding
    import base64
    from google.generativeai import embedding

    # Load and encode image
    with open("error_screenshot.png", "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    result = embedding.embed_content(
        model="models/gemini-embedding-2",
        content={
            "mime_type": "image/png",
            "data": image_data,
        },
        task_type="retrieval_document",
    )

    image_vector = result['embedding']
    print(f"Vector dimension: {len(image_vector)}")  # Same dimension as text!

The key insight: the text vector from the previous example and image_vector have the same dimension and exist in the same vector space. I can now compute cosine similarity between them.
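What a vector index does with such vectors can be sketched in a few lines: score every stored vector against the query by cosine similarity, then sort. The 3-dimensional vectors and ids below are made up for illustration and are not real model output; `cosine_similarity` is a helper defined here, not an SDK function.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors of equal length."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "index": (id, modality, vector), all in the same made-up 3-D space
stored = [
    ("ticket-001", "text", [0.9, 0.1, 0.0]),
    ("screenshot-001", "image", [0.8, 0.2, 0.1]),
    ("audio-001", "audio", [0.0, 0.1, 0.9]),
]

query_vector = [1.0, 0.0, 0.0]  # stand-in for an embedded query of any modality

# Brute-force search: score everything, highest similarity first
ranked = sorted(
    ((cosine_similarity(query_vector, vec), item_id, kind) for item_id, kind, vec in stored),
    reverse=True,
)

for score, item_id, kind in ranked:
    print(f"{score:.3f}  {kind:<5}  {item_id}")
```

Because all three items share one space, the modality of each item never enters the scoring step; it is only carried along as a label.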
Cross-Modal Search
    import base64
    import mimetypes
    from pinecone import Pinecone
    from google.generativeai import embedding

    pc = Pinecone(api_key="your-key")
    index = pc.Index("multimodal-support")

    def search_with_image(image_path: str, top_k: int = 5):
        """Query with an image, retrieve matches from every modality"""
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode()

        # Derive the MIME type from the file extension (PNG, JPEG, ...)
        mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

        # Embed the image
        query_vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content={"mime_type": mime_type, "data": image_data},
            task_type="retrieval_query",
        )['embedding']

        # Search across ALL modalities
        results = index.query(
            vector=query_vector,
            top_k=top_k,
            include_metadata=True,
        )
        return results

    # Usage: Customer uploads screenshot of error
    results = search_with_image("customer_screenshot.png")

    for match in results['matches']:
        print(f"Score: {match['score']:.3f}")
        print(f"Type: {match['metadata']['content_type']}")  # text, image, or audio
        print(f"Content: {match['metadata']['preview']}")

When I tested this with a screenshot of a database connection error:
    Score: 0.89
    Type: text
    Content: "How to fix database connection timeout errors..."
    Score: 0.85
    Type: text
    Content: "Troubleshooting guide for connection pool issues..."
    Score: 0.82
    Type: image
    Content: "screenshot_db_error_2024.png"
    Score: 0.78
    Type: audio
    Content: "call_transcript_4721.mp3"

The image query returned relevant text documentation, similar screenshots, and even related call recordings.
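Since one query returns a mix of modalities, it can help to split the matches by their content_type metadata before displaying them. The helper below is a hypothetical sketch: the name `filter_by_type` and the plain-dict result shape are assumptions modeled on the output above, not part of the Pinecone SDK.

```python
def filter_by_type(results: dict, content_type: str) -> list:
    """Return only the matches whose metadata has the given content_type."""
    return [
        match for match in results["matches"]
        if match["metadata"].get("content_type") == content_type
    ]

# Sample data mirroring the output shown above
sample_results = {
    "matches": [
        {"score": 0.89, "metadata": {"content_type": "text",
                                     "preview": "How to fix database connection timeout errors..."}},
        {"score": 0.85, "metadata": {"content_type": "text",
                                     "preview": "Troubleshooting guide for connection pool issues..."}},
        {"score": 0.82, "metadata": {"content_type": "image",
                                     "preview": "screenshot_db_error_2024.png"}},
        {"score": 0.78, "metadata": {"content_type": "audio",
                                     "preview": "call_transcript_4721.mp3"}},
    ]
}

text_matches = filter_by_type(sample_results, "text")
audio_matches = filter_by_type(sample_results, "audio")
print(len(text_matches), len(audio_matches))
```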
Ingesting Multimodal Data
    import base64
    import mimetypes
    from pathlib import Path

    from google.generativeai import embedding
    from pinecone import Pinecone

    class MultimodalIngester:
        def __init__(self, index_name: str):
            self.pc = Pinecone(api_key="your-key")
            self.index = self.pc.Index(index_name)

        def ingest_text(self, text: str, metadata: dict):
            vector = embedding.embed_content(
                model="models/gemini-embedding-2",
                content=text,
                task_type="retrieval_document",
            )['embedding']

            self.index.upsert([(
                metadata['id'],
                vector,
                {**metadata, 'content_type': 'text'},
            )])

        def ingest_image(self, image_path: str, metadata: dict):
            with open(image_path, "rb") as f:
                image_data = base64.b64encode(f.read()).decode()

            # Derive the MIME type from the extension so PNG files are not sent as JPEG
            mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

            vector = embedding.embed_content(
                model="models/gemini-embedding-2",
                content={"mime_type": mime_type, "data": image_data},
                task_type="retrieval_document",
            )['embedding']

            self.index.upsert([(
                metadata['id'],
                vector,
                {**metadata, 'content_type': 'image'},
            )])

        def ingest_audio(self, audio_path: str, metadata: dict):
            with open(audio_path, "rb") as f:
                audio_data = base64.b64encode(f.read()).decode()

            # Determine mime type from extension
            ext = Path(audio_path).suffix.lower()
            mime_types = {'.mp3': 'audio/mpeg', '.wav': 'audio/wav', '.ogg': 'audio/ogg'}
            mime_type = mime_types.get(ext, 'audio/mpeg')

            vector = embedding.embed_content(
                model="models/gemini-embedding-2",
                content={"mime_type": mime_type, "data": audio_data},
                task_type="retrieval_document",
            )['embedding']

            self.index.upsert([(
                metadata['id'],
                vector,
                {**metadata, 'content_type': 'audio'},
            )])

    # Ingest support data
    ingester = MultimodalIngester("support-kb")

    # Text tickets
    ingester.ingest_text(
        "Customer unable to login after password reset",
        {'id': 'ticket-001', 'category': 'auth', 'created': '2024-01-15'},
    )

    # Screenshot attachments
    ingester.ingest_image(
        "screenshots/error_500.png",
        {'id': 'screenshot-001', 'ticket_id': 'ticket-045', 'created': '2024-01-16'},
    )

    # Call recordings
    ingester.ingest_audio(
        "recordings/call_7821.mp3",
        {'id': 'audio-001', 'agent': 'john', 'duration_sec': 342, 'created': '2024-01-17'},
    )

What I Learned
1. Normalization Matters More
When mixing modalities, normalize embeddings before indexing:
    import numpy as np

    def normalize_embedding(vector: list) -> list:
        """Normalize vector to unit length"""
        vec = np.array(vector)
        norm = np.linalg.norm(vec)
        if norm > 0:
            return (vec / norm).tolist()
        return vector

    # Always normalize before upsert
    normalized_vector = normalize_embedding(raw_vector)

Without normalization, some modalities may dominate search results due to different magnitude distributions.
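The effect is easiest to see with a dot-product score, where vector length counts directly (cosine similarity divides the length out by definition). In this small sketch the 2-D vectors and their names are invented for illustration: a long but poorly aligned vector outranks a short, well-aligned one until both are scaled to unit length.

```python
import numpy as np

def normalize(vector) -> np.ndarray:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    vec = np.asarray(vector, dtype=float)
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec

def rank_by_dot_product(query, docs: dict) -> list:
    """Return document names ordered by dot-product score, best first."""
    scores = {name: float(np.dot(query, vec)) for name, vec in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

query = np.array([1.0, 0.0])
docs = {
    "aligned_small": np.array([0.9, 0.1]),     # points almost the same way as the query
    "misaligned_large": np.array([3.0, 4.0]),  # points elsewhere, but has length 5
}

raw_ranking = rank_by_dot_product(query, docs)
unit_ranking = rank_by_dot_product(
    normalize(query), {name: normalize(vec) for name, vec in docs.items()}
)

print("raw: ", raw_ranking)
print("unit:", unit_ranking)
```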
2. Input Size Limits
Images and audio have limits. I hit this error with a large screenshot:
    InvalidArgument: Image size exceeds maximum allowed size of 20MB

Solution - resize images before embedding:
    from PIL import Image
    import io

    def resize_for_embedding(image_path: str, max_size: int = 1024) -> bytes:
        """Resize image to fit embedding model limits"""
        img = Image.open(image_path)

        # JPEG has no alpha channel, so convert RGBA/palette screenshots first
        if img.mode != "RGB":
            img = img.convert("RGB")

        # Resize while maintaining aspect ratio
        img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)

        # Convert to bytes
        buffer = io.BytesIO()
        img.save(buffer, format='JPEG', quality=85)
        return buffer.getvalue()

3. Task Types Affect Results
The task_type parameter changes how embeddings are optimized:
    # For documents being indexed
    embedding.embed_content(
        model="models/gemini-embedding-2",
        content=text,
        task_type="retrieval_document",  # Optimized for being retrieved
    )

    # For queries
    embedding.embed_content(
        model="models/gemini-embedding-2",
        content=query,
        task_type="retrieval_query",  # Optimized for finding documents
    )

Using the wrong task type degrades search quality. Index with retrieval_document, query with retrieval_query.
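One way to avoid mixing the two up is to route every call through a small wrapper that maps a role to its task type. This is only a sketch under assumptions: `embed_with_role` and `TASK_TYPES` are hypothetical names, and `fake_embed` stands in for the real embedding call (for example `embed_content` with the model argument already bound) so the snippet runs without an API key.

```python
from typing import Callable

# Role -> task_type, using the two values from this article
TASK_TYPES = {
    "document": "retrieval_document",
    "query": "retrieval_query",
}

def embed_with_role(embed_fn: Callable[..., dict], content, role: str) -> dict:
    """Call embed_fn with the task_type that matches the role."""
    if role not in TASK_TYPES:
        raise ValueError(f"unknown role {role!r}; expected one of {sorted(TASK_TYPES)}")
    return embed_fn(content=content, task_type=TASK_TYPES[role])

def fake_embed(content, task_type: str) -> dict:
    """Placeholder for the real embedding call; echoes the task type it received."""
    return {"embedding": [0.0, 0.0, 0.0], "task_type": task_type}

doc_result = embed_with_role(fake_embed, "Reset your password from the login page", "document")
query_result = embed_with_role(fake_embed, "how do I reset my password", "query")
print(doc_result["task_type"], query_result["task_type"])
```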
4. Cross-Modal Alignment Varies
Not all modality pairs align equally well:
- Text <-> Image: Strong alignment (trained extensively)
- Text <-> Audio: Good alignment (speech-text pairs)
- Image <-> Audio: Weaker alignment (less training data)

For image-to-audio search, consider using text as a bridge:
    # Instead of: image -> direct audio search
    # Use: image -> text query -> audio search

    def image_to_audio_search(image_path: str):
        # 1. Get image embedding
        image_vec = get_image_embedding(image_path)

        # 2. Find closest text documents first
        #    (filter on 'content_type', the metadata key written at ingest time)
        text_results = index.query(
            vector=image_vec,
            top_k=1,
            filter={'content_type': {'$eq': 'text'}},
            include_metadata=True,
        )

        # 3. Use top text as query for audio
        #    (assumes the document text was stored under metadata['text'])
        top_text = text_results['matches'][0]['metadata']['text']
        text_vec = get_text_embedding(top_text, task_type="retrieval_query")

        # 4. Search audio with text query
        return index.query(
            vector=text_vec,
            top_k=5,
            filter={'content_type': {'$eq': 'audio'}},
            include_metadata=True,
        )

Real-World Use Cases
E-commerce Product Search
Users upload a photo of a product they want:
    def find_similar_products(user_photo: str) -> list:
        results = search_with_image(user_photo)

        products = []
        for match in results['matches']:
            if match['metadata']['type'] == 'product':
                products.append({
                    'name': match['metadata']['product_name'],
                    'price': match['metadata']['price'],
                    'image_url': match['metadata']['image_url'],
                })

        return products

Customer Support System
Query with any modality, retrieve all relevant context:
    def get_support_context(query, modality='text'):
        if modality == 'image':
            vec = get_image_embedding(query)
        elif modality == 'audio':
            vec = get_audio_embedding(query)
        else:
            vec = get_text_embedding(query)

        results = index.query(vector=vec, top_k=10)

        return {
            'related_tickets': filter_by_type(results, 'text'),
            'similar_screenshots': filter_by_type(results, 'image'),
            'relevant_calls': filter_by_type(results, 'audio'),
        }

Knowledge Management
Search across mixed documentation:
    def search_knowledge_base(query: str):
        vec = get_text_embedding(query, task_type="retrieval_query")

        results = index.query(
            vector=vec,
            top_k=15,
            include_metadata=True,
        )

        # Group by content type
        grouped = {
            'docs': [],
            'diagrams': [],
            'presentations': [],
        }

        for match in results['matches']:
            content_type = match['metadata'].get('content_type', 'docs')
            # setdefault avoids a KeyError for content types not listed above
            grouped.setdefault(content_type, []).append(match)

        return grouped

Summary
Gemini Embedding 2 solves the fragmented data search problem by creating a unified vector space for text, images, and audio. Instead of maintaining separate embedding models and indexes for each data type, I now use a single model that enables cross-modal retrieval.
The key changes in my architecture:
- Single embedding model instead of text-only + CLIP + Whisper
- Single vector index instead of separate databases per modality
- Cross-modal queries - search with image, get text results
The trade-off: I had to be more careful about normalization and input size limits. Cross-modal alignment isn’t perfect for all pairs, so I sometimes use text as a bridge between image and audio.