
What is Gemini Embedding 2 and How Does It Enable Multimodal Search?

Problem

I was building a customer support knowledge base. The data looked like this:

  • 5000 support tickets (text)
  • 2000 screenshot attachments (images)
  • 800 call recordings (audio)

My search worked fine for text queries. But users kept asking:

  • “Can I search by uploading a screenshot?”
  • “Can I find that call where the customer described this error?”

I tried combining separate embedding models:

failed-approach.py
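from openai import OpenAI
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
# Text embeddings (the client reads OPENAI_API_KEY from the environment)
client = OpenAI()
text_embedding = client.embeddings.create(
    input="Error code 500 on login",
    model="text-embedding-3-small"
).data[0].embedding
# Image embeddings
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("error_screenshot.png")
inputs = clip_processor(images=image, return_tensors="pt")
image_embedding = clip_model.get_image_features(**inputs)
# Problem: These vectors live in DIFFERENT spaces!
# Can't compare text_embedding with image_embedding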

This approach failed. Text embeddings from OpenAI and image embeddings from CLIP live in completely different vector spaces. I could search text with text, or images with images, but never cross the boundary.

Environment

  • Python 3.12
  • Google Generative AI SDK
  • Pinecone for vector storage
  • PostgreSQL for metadata

Solution

Gemini Embedding 2 creates a shared embedding space for text, images, and audio. A product photo and its description text have similar vectors. An audio clip explaining a bug is close to related documentation.

concept-diagram.txt
Traditional Approach (Fragmented):
+------------------+     +------------------+
| Text Embeddings  |     | Image Embeddings |
| (text-embedding) |     | (CLIP)           |
+------------------+     +------------------+
         |                        |
         v                        v
    Vector DB 1              Vector DB 2
    (text only)             (images only)
Gemini Embedding 2 (Unified):
+-----------------------------------------+
|           Shared Vector Space           |
| Text, Images, Audio -> Same Dimensions  |
+-----------------------------------------+
                    |
                    v
             Single Vector DB
        (all modalities together)

Basic Text Embedding

text-embedding.py
from google.generativeai import embedding
result = embedding.embed_content(
    model="models/gemini-embedding-2",
    content="Customer reported login timeout after 30 seconds",
    task_type="retrieval_document"
)
text_vector = result['embedding']
print(f"Vector dimension: {len(text_vector)}")  # Fixed dimension for all modalities

Image Embedding

image-embedding.py
import base64
from google.generativeai import embedding
# Load and encode image
with open("error_screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()
result = embedding.embed_content(
    model="models/gemini-embedding-2",
    content={
        "mime_type": "image/png",
        "data": image_data
    },
    task_type="retrieval_document"
)
image_vector = result['embedding']
print(f"Vector dimension: {len(image_vector)}")  # Same dimension as text!

The key insight: text_vector and image_vector have the same dimension and exist in the same vector space. I can now compute cosine similarity between them.
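As a rough illustration of what "compute cosine similarity between them" means, here is a minimal sketch that ranks items of different modalities against one query vector. The three-dimensional vectors and the item labels are made-up placeholders, not real embeddings, and cosine_similarity is a helper written for this example.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError(f"Dimension mismatch: {len(a)} vs {len(b)}")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("Cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Placeholder vectors standing in for real embeddings (values are invented)
query_image_vector = [0.9, 0.1, 0.2]
candidates = {
    "text: database timeout guide": [0.8, 0.2, 0.1],
    "image: login page screenshot": [0.1, 0.9, 0.3],
    "audio: call about billing": [0.0, 0.2, 0.9],
}

# Rank every candidate against the same query, regardless of modality
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query_image_vector, candidates[name]),
    reverse=True,
)
for name in ranked:
    score = cosine_similarity(query_image_vector, candidates[name])
    print(f"{score:.3f}  {name}")
```

A vector database does the same ranking at scale; the only requirement is that all vectors come from the same model so the comparison is meaningful.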

cross-modal-search.py
import base64
import mimetypes
from pinecone import Pinecone
from google.generativeai import embedding
pc = Pinecone(api_key="your-key")
index = pc.Index("multimodal-support")
def search_with_image(image_path: str, top_k: int = 5):
    """Query with an image, retrieve matches from every modality"""
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()
    # Derive the MIME type from the file extension instead of assuming JPEG
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
    # Embed the image
    query_vector = embedding.embed_content(
        model="models/gemini-embedding-2",
        content={"mime_type": mime_type, "data": image_data},
        task_type="retrieval_query"
    )['embedding']
    # Search across ALL modalities
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )
    return results
# Usage: Customer uploads screenshot of error
results = search_with_image("customer_screenshot.png")
for match in results['matches']:
    print(f"Score: {match['score']:.3f}")
    print(f"Type: {match['metadata']['content_type']}")  # text, image, or audio
    print(f"Content: {match['metadata']['preview']}")

When I tested this with a screenshot of a database connection error:

search-results.txt
Score: 0.89 Type: text Content: "How to fix database connection timeout errors..."
Score: 0.85 Type: text Content: "Troubleshooting guide for connection pool issues..."
Score: 0.82 Type: image Content: "screenshot_db_error_2024.png"
Score: 0.78 Type: audio Content: "call_transcript_4721.mp3"

The image query returned relevant text documentation, similar screenshots, and even related call recordings.

Ingesting Multimodal Data

ingest-pipeline.py
from google.generativeai import embedding
from pinecone import Pinecone
import base64
import mimetypes
from pathlib import Path
class MultimodalIngester:
    def __init__(self, index_name: str):
        self.pc = Pinecone(api_key="your-key")
        self.index = self.pc.Index(index_name)
    def ingest_text(self, text: str, metadata: dict):
        vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content=text,
            task_type="retrieval_document"
        )['embedding']
        self.index.upsert([(
            metadata['id'],
            vector,
            {**metadata, 'content_type': 'text'}
        )])
    def ingest_image(self, image_path: str, metadata: dict):
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode()
        # Derive the MIME type from the extension (the sample file below is a PNG)
        mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"
        vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content={"mime_type": mime_type, "data": image_data},
            task_type="retrieval_document"
        )['embedding']
        self.index.upsert([(
            metadata['id'],
            vector,
            {**metadata, 'content_type': 'image'}
        )])
    def ingest_audio(self, audio_path: str, metadata: dict):
        with open(audio_path, "rb") as f:
            audio_data = base64.b64encode(f.read()).decode()
        # Determine mime type from extension
        ext = Path(audio_path).suffix.lower()
        mime_types = {'.mp3': 'audio/mpeg', '.wav': 'audio/wav', '.ogg': 'audio/ogg'}
        mime_type = mime_types.get(ext, 'audio/mpeg')
        vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content={"mime_type": mime_type, "data": audio_data},
            task_type="retrieval_document"
        )['embedding']
        self.index.upsert([(
            metadata['id'],
            vector,
            {**metadata, 'content_type': 'audio'}
        )])
# Ingest support data into the same index that cross-modal-search.py queries
ingester = MultimodalIngester("multimodal-support")
# Text tickets
ingester.ingest_text(
    "Customer unable to login after password reset",
    {'id': 'ticket-001', 'category': 'auth', 'created': '2024-01-15'}
)
# Screenshot attachments
ingester.ingest_image(
    "screenshots/error_500.png",
    {'id': 'screenshot-001', 'ticket_id': 'ticket-045', 'created': '2024-01-16'}
)
# Call recordings
ingester.ingest_audio(
    "recordings/call_7821.mp3",
    {'id': 'audio-001', 'agent': 'john', 'duration_sec': 342, 'created': '2024-01-17'}
)
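The ingester above issues one upsert call per record. With a few thousand tickets, screenshots, and recordings, grouping records before each upsert usually reduces round trips. The following is a minimal sketch under that assumption; the chunk_records helper and the batch size of 100 are illustrative choices, not part of the pipeline above.

```python
from typing import Iterable, Iterator

def chunk_records(records: Iterable[tuple], batch_size: int = 100) -> Iterator[list]:
    """Yield lists containing at most batch_size records each."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

# Placeholder (id, vector, metadata) tuples in the shape the ingester upserts
records = [
    (f"ticket-{i:03d}", [0.0, 0.0, 0.0], {"content_type": "text"})
    for i in range(250)
]

batch_sizes = []
for batch in chunk_records(records, batch_size=100):
    # With a real index, this is where index.upsert(batch) would go
    batch_sizes.append(len(batch))
print(batch_sizes)
```

The embedding calls still happen once per item in this sketch; only the writes to the index are grouped.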

What I Learned

1. Normalization Matters More

When mixing modalities, normalize embeddings before indexing:

normalize.py
import numpy as np
def normalize_embedding(vector: list) -> list:
    """Normalize vector to unit length"""
    vec = np.array(vector)
    norm = np.linalg.norm(vec)
    if norm > 0:
        return (vec / norm).tolist()
    return vector
# Always normalize before upsert
# (raw_vector is the 'embedding' list returned by embed_content)
normalized_vector = normalize_embedding(raw_vector)

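Without normalization, some modalities may dominate search results due to different magnitude distributions. This matters when the index ranks by dot product or Euclidean distance; cosine similarity already ignores vector length.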
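To see why magnitude can skew ranking, here is a small sketch using a plain dot product as the score. The two-dimensional vectors are invented for illustration, and normalize and dot are hypothetical helpers, not functions from any SDK.

```python
import math

def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

query = [1.0, 0.0]
well_aligned = [0.9, 0.1]      # points almost the same way as the query, short
large_magnitude = [5.0, 5.0]   # 45 degrees away from the query, but much longer

# Raw dot product: the longer vector scores higher despite the worse direction
raw_prefers_long = dot(query, large_magnitude) > dot(query, well_aligned)

# After normalization only direction counts, so the aligned vector scores higher
unit_prefers_aligned = (
    dot(query, normalize(well_aligned)) > dot(query, normalize(large_magnitude))
)

print(raw_prefers_long, unit_prefers_aligned)  # True True
```

If one modality tends to produce longer vectors than another, the first comparison is the failure mode to watch for.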

2. Input Size Limits

Images and audio have limits. I hit this error with a large screenshot:

error.txt
InvalidArgument: Image size exceeds maximum allowed size of 20MB

Solution - resize images before embedding:

resize-image.py
from PIL import Image
import io
def resize_for_embedding(image_path: str, max_size: int = 1024) -> bytes:
    """Resize image to fit embedding model limits; returns JPEG bytes"""
    img = Image.open(image_path)
    # JPEG has no alpha channel, so convert RGBA/palette images (common for PNG screenshots)
    if img.mode != "RGB":
        img = img.convert("RGB")
    # Resize while maintaining aspect ratio
    img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)
    # Convert to bytes
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    return buffer.getvalue()
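Resizing re-encodes the image, which is unnecessary for files that are already under the limit. A small sketch of a size check that could run first: the 20MB figure comes from the error message above, while the exact byte count (20 * 1024 * 1024) and the helper name needs_resize are assumptions.

```python
import os
import tempfile

# Assumed interpretation of the "20MB" in the error message
MAX_IMAGE_BYTES = 20 * 1024 * 1024

def needs_resize(path: str, limit: int = MAX_IMAGE_BYTES) -> bool:
    """Return True when the file on disk is larger than the given byte limit."""
    return os.path.getsize(path) > limit

# Demo with a throwaway 1 KiB file instead of a real screenshot
with tempfile.NamedTemporaryFile(delete=False, suffix=".png") as tmp:
    tmp.write(b"\0" * 1024)
    demo_path = tmp.name
try:
    print(needs_resize(demo_path))             # False: far below the limit
    print(needs_resize(demo_path, limit=512))  # True: larger than 512 bytes
finally:
    os.remove(demo_path)
```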

3. Task Types Affect Results

The task_type parameter changes how embeddings are optimized:

task-types.py
# For documents being indexed
embedding.embed_content(
    model="models/gemini-embedding-2",
    content=text,
    task_type="retrieval_document"  # Optimized for being retrieved
)
# For queries
embedding.embed_content(
    model="models/gemini-embedding-2",
    content=query,
    task_type="retrieval_query"  # Optimized for finding documents
)

Using the wrong task type degrades search quality. Index with retrieval_document, query with retrieval_query.
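One way to make the pairing harder to get wrong is to derive the task type from the role a piece of content plays, rather than typing the string at each call site. This is only a sketch: task_type_for and embed_request are hypothetical helpers, and the model name is copied from the examples above.

```python
TASK_TYPES = {
    "document": "retrieval_document",  # content being indexed
    "query": "retrieval_query",        # content used to search
}

def task_type_for(role: str) -> str:
    """Map a role ('document' or 'query') to its task_type string."""
    try:
        return TASK_TYPES[role]
    except KeyError:
        raise ValueError(
            f"Unknown role {role!r}; expected one of {sorted(TASK_TYPES)}"
        ) from None

def embed_request(content, role: str, model: str = "models/gemini-embedding-2") -> dict:
    """Build the keyword arguments for an embedding call with a matching task type."""
    return {"model": model, "content": content, "task_type": task_type_for(role)}

# The resulting dict could be unpacked into the call used above, e.g.
# embedding.embed_content(**embed_request(text, "document"))
print(embed_request("login timeout", "query")["task_type"])  # retrieval_query
```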

4. Cross-Modal Alignment Varies

Not all modality pairs align equally well:

alignment-quality.txt
Text <-> Image: Strong alignment (trained extensively)
Text <-> Audio: Good alignment (speech-text pairs)
Image <-> Audio: Weaker alignment (less training data)

For image-to-audio search, consider using text as a bridge:

bridge-search.py
# Instead of: image -> direct audio search
# Use: image -> text query -> audio search
def image_to_audio_search(image_path: str, top_k: int = 5):
    # 1. Get image embedding
    image_vec = get_image_embedding(image_path)
    # 2. Find closest text documents first (metadata key matches the ingester)
    text_results = index.query(
        vector=image_vec, top_k=1,
        filter={'content_type': 'text'}, include_metadata=True
    )
    if not text_results['matches']:
        return None  # no text document to bridge through
    # 3. Use top text as query for audio (assumes the text was stored under 'text')
    top_text = text_results['matches'][0]['metadata']['text']
    text_vec = get_text_embedding(top_text, task_type="retrieval_query")
    # 4. Search audio with text query
    return index.query(
        vector=text_vec, top_k=top_k,
        filter={'content_type': 'audio'}, include_metadata=True
    )

Real-World Use Cases

Product Search

Users upload a photo of a product they want:

product-search.py
def find_similar_products(user_photo: str) -> list:
    results = search_with_image(user_photo)
    products = []
    for match in results['matches']:
        # .get() skips matches that carry no 'type' field instead of raising KeyError
        if match['metadata'].get('type') == 'product':
            products.append({
                'name': match['metadata']['product_name'],
                'price': match['metadata']['price'],
                'image_url': match['metadata']['image_url']
            })
    return products

Customer Support System

Query with any modality, retrieve all relevant context:

support-search.py
def get_support_context(query, modality='text'):
    # The get_*_embedding functions and filter_by_type are helpers not shown here
    if modality == 'image':
        vec = get_image_embedding(query)
    elif modality == 'audio':
        vec = get_audio_embedding(query)
    else:
        vec = get_text_embedding(query)
    # include_metadata is needed so results can be split by content_type
    results = index.query(vector=vec, top_k=10, include_metadata=True)
    return {
        'related_tickets': filter_by_type(results, 'text'),
        'similar_screenshots': filter_by_type(results, 'image'),
        'relevant_calls': filter_by_type(results, 'audio')
    }
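The snippet above calls filter_by_type without defining it. One possible implementation is sketched below. It is written against plain dictionaries shaped like the search results shown earlier (a 'matches' list whose items carry a 'metadata' dict with a 'content_type' key); a real client response object may need attribute access instead, so treat the access pattern as an assumption.

```python
def filter_by_type(results: dict, content_type: str) -> list[dict]:
    """Return the matches whose metadata content_type equals the requested one."""
    return [
        match
        for match in results.get("matches", [])
        if match.get("metadata", {}).get("content_type") == content_type
    ]

# Sample results using the IDs from the ingestion example (scores are placeholders)
sample_results = {
    "matches": [
        {"id": "ticket-001", "score": 0.89, "metadata": {"content_type": "text"}},
        {"id": "screenshot-001", "score": 0.82, "metadata": {"content_type": "image"}},
        {"id": "audio-001", "score": 0.78, "metadata": {"content_type": "audio"}},
    ]
}

image_ids = [m["id"] for m in filter_by_type(sample_results, "image")]
print(image_ids)  # ['screenshot-001']
```

Because the list comprehension preserves order, each group stays sorted by score as long as the incoming matches were.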

Knowledge Management

Search across mixed documentation:

knowledge-search.py
def search_knowledge_base(query: str):
    vec = get_text_embedding(query, task_type="retrieval_query")
    results = index.query(
        vector=vec,
        top_k=15,
        include_metadata=True
    )
    # Group by content type
    grouped = {
        'docs': [], 'diagrams': [], 'presentations': []
    }
    for match in results['matches']:
        content_type = match['metadata'].get('content_type', 'docs')
        # setdefault avoids a KeyError for types outside the three above (e.g. 'text', 'audio')
        grouped.setdefault(content_type, []).append(match)
    return grouped

Summary

Gemini Embedding 2 solves the fragmented data search problem by creating a unified vector space for text, images, and audio. Instead of maintaining separate embedding models and indexes for each data type, I now use a single model that enables cross-modal retrieval.

The key changes in my architecture:

  1. Single embedding model instead of text-only + CLIP + Whisper
  2. Single vector index instead of separate databases per modality
  3. Cross-modal queries - search with image, get text results

The trade-off: I had to be more careful about normalization and input size limits. Cross-modal alignment isn’t perfect for all pairs, so I sometimes use text as a bridge between image and audio.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to contact me, you can reach me by email.
