What is Gemini Embedding 2 and How Does It Enable Multimodal Search?

Mar 26, 2026

Cowrie

Dev @ Bswen

Problem

I was building a customer support knowledge base. The data looked like this:

5000 support tickets (text)
2000 screenshot attachments (images)
800 call recordings (audio)

My search worked fine for text queries. But users kept asking:

“Can I search by uploading a screenshot?”
“Can I find that call where the customer described this error?”

I tried combining separate embedding models:

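from openai import OpenAI
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Text embeddings (the client reads OPENAI_API_KEY from the environment)
client = OpenAI()
text_embedding = client.embeddings.create(
    input="Error code 500 on login",
    model="text-embedding-3-small"
).data[0].embedding

# Image embeddings
clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("error_screenshot.png")
inputs = clip_processor(images=image, return_tensors="pt")
image_embedding = clip_model.get_image_features(**inputs)

# Problem: These vectors live in DIFFERENT spaces!
# Can't compare text_embedding with image_embedding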

This approach failed. Text embeddings from OpenAI and image embeddings from CLIP live in completely different vector spaces. I could search text with text, or images with images, but never cross the boundary.
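
To make the mismatch concrete, here is a minimal offline sketch. It uses random stand-in vectors rather than real model output, and it assumes the default output sizes of the two models (1536 for text-embedding-3-small, 512 for the CLIP ViT-B/32 projection). The `dot` helper is illustrative, not part of either library.

```python
import random

TEXT_DIM = 1536   # assumed default output size of text-embedding-3-small
IMAGE_DIM = 512   # assumed projection size of openai/clip-vit-base-patch32

def dot(a, b):
    """Dot product that refuses vectors of different lengths."""
    if len(a) != len(b):
        raise ValueError(f"dimension mismatch: {len(a)} vs {len(b)}")
    return sum(x * y for x, y in zip(a, b))

# Random stand-ins for the two embeddings (not real model output)
rng = random.Random(0)
text_vec = [rng.gauss(0, 1) for _ in range(TEXT_DIM)]
image_vec = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]

try:
    dot(text_vec, image_vec)
except ValueError as err:
    message = str(err)
    print(message)  # dimension mismatch: 1536 vs 512
```

Even if the two sizes happened to match, the models were trained independently, so a similarity score between their outputs would not carry any meaning.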

Environment

Python 3.12
Google Generative AI SDK
Pinecone for vector storage
PostgreSQL for metadata

Solution

Gemini Embedding 2 creates a shared embedding space for text, images, and audio. A product photo and its description text have similar vectors. An audio clip explaining a bug is close to related documentation.

Traditional Approach (Fragmented):
+------------------+     +------------------+
| Text Embeddings  |     | Image Embeddings |
| (text-embedding) |     | (CLIP)           |
+------------------+     +------------------+
        |                        |
        v                        v
   Vector DB 1              Vector DB 2
   (text only)              (images only)

Gemini Embedding 2 (Unified):
+-----------------------------------------+
|           Shared Vector Space            |
|  Text, Images, Audio -> Same Dimensions |
+-----------------------------------------+
                    |
                    v
             Single Vector DB
         (all modalities together)

Basic Text Embedding

from google.generativeai import embedding

result = embedding.embed_content(
    model="models/gemini-embedding-2",
    content="Customer reported login timeout after 30 seconds",
    task_type="retrieval_document"
)

vector = result['embedding']
print(f"Vector dimension: {len(vector)}")  # Fixed dimension for all modalities

Image Embedding

import base64
from google.generativeai import embedding

# Load and encode image
with open("error_screenshot.png", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

result = embedding.embed_content(
    model="models/gemini-embedding-2",
    content={
        "mime_type": "image/png",
        "data": image_data
    },
    task_type="retrieval_document"
)

image_vector = result['embedding']
print(f"Vector dimension: {len(image_vector)}")  # Same dimension as text!

The key insight: the text vector from the first example and image_vector have the same dimension and live in the same vector space, so I can compute cosine similarity between them directly.
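
For reference, cosine similarity itself is only a few lines. The sketch below is plain Python with toy three-dimensional vectors standing in for real embeddings; the function and variable names are my own for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as no match
    return dot / (norm_a * norm_b)

# Toy stand-ins for a text vector and an image vector (not real embeddings)
toy_text_vector = [1.0, 0.0, 1.0]
toy_image_vector = [1.0, 0.0, 0.0]
print(round(cosine_similarity(toy_text_vector, toy_image_vector), 4))  # 0.7071
```

In practice the vector database computes this for you; the function is mainly useful for spot-checking a handful of pairs by hand.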

import base64
import mimetypes
from pinecone import Pinecone
from google.generativeai import embedding

pc = Pinecone(api_key="your-key")
index = pc.Index("multimodal-support")

def search_with_image(image_path: str, top_k: int = 5):
    """Query with an image, retrieve matches of any modality"""

    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode()

    # Guess the MIME type from the file extension (PNG, JPEG, ...)
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    # Embed the image
    query_vector = embedding.embed_content(
        model="models/gemini-embedding-2",
        content={"mime_type": mime_type, "data": image_data},
        task_type="retrieval_query"
    )['embedding']

    # Search across ALL modalities
    results = index.query(
        vector=query_vector,
        top_k=top_k,
        include_metadata=True
    )

    return results

# Usage: Customer uploads screenshot of error
results = search_with_image("customer_screenshot.png")

for match in results['matches']:
    print(f"Score: {match['score']:.3f}")
    print(f"Type: {match['metadata']['content_type']}")  # text, image, or audio
    print(f"Content: {match['metadata']['preview']}")

When I tested this with a screenshot of a database connection error:

Score: 0.89  Type: text    Content: "How to fix database connection timeout errors..."
Score: 0.85  Type: text    Content: "Troubleshooting guide for connection pool issues..."
Score: 0.82  Type: image   Content: "screenshot_db_error_2024.png"
Score: 0.78  Type: audio   Content: "call_transcript_4721.mp3"

The image query returned relevant text documentation, similar screenshots, and even related call recordings.

Ingesting Multimodal Data

from google.generativeai import embedding
from pinecone import Pinecone
import base64
from pathlib import Path

class MultimodalIngester:
    def __init__(self, index_name: str):
        self.pc = Pinecone(api_key="your-key")
        self.index = self.pc.Index(index_name)

    def ingest_text(self, text: str, metadata: dict):
        vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content=text,
            task_type="retrieval_document"
        )['embedding']

        self.index.upsert([(
            metadata['id'],
            vector,
            {**metadata, 'content_type': 'text'}
        )])

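    def ingest_image(self, image_path: str, metadata: dict):
        with open(image_path, "rb") as f:
            image_data = base64.b64encode(f.read()).decode()

        # Determine mime type from extension (the example screenshots are PNGs)
        ext = Path(image_path).suffix.lower()
        mime_types = {'.png': 'image/png', '.jpg': 'image/jpeg', '.jpeg': 'image/jpeg'}
        mime_type = mime_types.get(ext, 'image/jpeg')

        vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content={"mime_type": mime_type, "data": image_data},
            task_type="retrieval_document"
        )['embedding']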

        self.index.upsert([(
            metadata['id'],
            vector,
            {**metadata, 'content_type': 'image'}
        )])

    def ingest_audio(self, audio_path: str, metadata: dict):
        with open(audio_path, "rb") as f:
            audio_data = base64.b64encode(f.read()).decode()

        # Determine mime type from extension
        ext = Path(audio_path).suffix.lower()
        mime_types = {'.mp3': 'audio/mpeg', '.wav': 'audio/wav', '.ogg': 'audio/ogg'}
        mime_type = mime_types.get(ext, 'audio/mpeg')

        vector = embedding.embed_content(
            model="models/gemini-embedding-2",
            content={"mime_type": mime_type, "data": audio_data},
            task_type="retrieval_document"
        )['embedding']

        self.index.upsert([(
            metadata['id'],
            vector,
            {**metadata, 'content_type': 'audio'}
        )])

# Ingest support data
ingester = MultimodalIngester("multimodal-support")  # same index the search code queries

# Text tickets
ingester.ingest_text(
    "Customer unable to login after password reset",
    {'id': 'ticket-001', 'category': 'auth', 'created': '2024-01-15'}
)

# Screenshot attachments
ingester.ingest_image(
    "screenshots/error_500.png",
    {'id': 'screenshot-001', 'ticket_id': 'ticket-045', 'created': '2024-01-16'}
)

# Call recordings
ingester.ingest_audio(
    "recordings/call_7821.mp3",
    {'id': 'audio-001', 'agent': 'john', 'duration_sec': 342, 'created': '2024-01-17'}
)
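
The ingester above sends one upsert per item. With a few thousand items it may be worth grouping records into batches first. The sketch below shows only the batching step and runs offline; the `batched` helper and the batch size of 100 are my assumptions, not something the SDKs prescribe.

```python
from itertools import islice

def batched(records, batch_size=100):
    """Yield successive lists of at most batch_size records."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

# Hypothetical (id, vector, metadata) tuples standing in for embedded tickets
records = [
    (f"ticket-{i:03d}", [0.0, 1.0], {"content_type": "text"})
    for i in range(250)
]

batch_sizes = [len(batch) for batch in batched(records)]
print(batch_sizes)  # [100, 100, 50]
```

Each batch has the same list-of-tuples shape the ingester already passes to upsert, so a loop such as `for batch in batched(records): index.upsert(batch)` should slot in without other changes.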

What I Learned

1. Normalization Matters More

When mixing modalities, normalize embeddings before indexing:

import numpy as np

def normalize_embedding(vector: list) -> list:
    """Normalize vector to unit length"""
    vec = np.array(vector)
    norm = np.linalg.norm(vec)
    if norm > 0:
        return (vec / norm).tolist()
    return vector

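# Always normalize before upsert (raw_vector is the list returned by embed_content)
normalized_vector = normalize_embedding(raw_vector)

Without normalization, some modalities may dominate search results due to different magnitude distributions. This matters when the index ranks by dot product; a cosine metric divides by vector length anyway, so there the step is harmless but redundant.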
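
A toy example shows how magnitude alone can flip a ranking under a dot-product score. The two-dimensional vectors below are made up for illustration and are not real embeddings.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

query = [1.0, 0.0]
aligned_small = [0.9, 0.1]      # points almost the same way as the query
misaligned_large = [5.0, 5.0]   # 45 degrees off, but much longer

# Raw dot product: the longer vector wins despite pointing the wrong way
raw_scores = (dot(query, aligned_small), dot(query, misaligned_large))

# After normalization the score depends on direction only
unit_scores = (
    dot(normalize(query), normalize(aligned_small)),
    dot(normalize(query), normalize(misaligned_large)),
)

print(raw_scores)   # (0.9, 5.0)
print(unit_scores)  # first value is now the larger one
```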

2. Input Size Limits

Images and audio have limits. I hit this error with a large screenshot:

InvalidArgument: Image size exceeds maximum allowed size of 20MB

Solution - resize images before embedding:

from PIL import Image
import io

def resize_for_embedding(image_path: str, max_size: int = 1024) -> bytes:
    """Resize image to fit embedding model limits"""
    img = Image.open(image_path)

    # JPEG has no alpha channel, so convert RGBA/palette PNGs to RGB first
    if img.mode != "RGB":
        img = img.convert("RGB")

    # Resize while maintaining aspect ratio
    img.thumbnail((max_size, max_size), Image.Resampling.LANCZOS)

    # Convert to JPEG bytes (embed the result with mime_type "image/jpeg")
    buffer = io.BytesIO()
    img.save(buffer, format='JPEG', quality=85)
    return buffer.getvalue()
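
A cheap pre-check can tell you whether a file needs resizing at all. The sketch below takes the 20MB figure from the error message above and assumes it means 20 * 1024 * 1024 bytes; the constant and the `exceeds_limit` helper are hypothetical names of mine.

```python
import os
import tempfile

# Assumption: "20MB" in the error message means 20 * 1024 * 1024 bytes
MAX_IMAGE_BYTES = 20 * 1024 * 1024

def exceeds_limit(path, limit_bytes=MAX_IMAGE_BYTES):
    """Return True if the file on disk is larger than limit_bytes."""
    return os.path.getsize(path) > limit_bytes

# Demo with a 1 KiB temporary file so the sketch runs without real screenshots
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
    tmp.write(b"\x89PNG" + b"\0" * 1020)
    small_path = tmp.name

small_file_too_big = exceeds_limit(small_path)
tiny_limit_exceeded = exceeds_limit(small_path, limit_bytes=512)
os.remove(small_path)

print(small_file_too_big, tiny_limit_exceeded)  # False True
```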

3. Task Types Affect Results

The task_type parameter changes how embeddings are optimized:

# For documents being indexed
embedding.embed_content(
    model="models/gemini-embedding-2",
    content=text,
    task_type="retrieval_document"  # Optimized for being retrieved
)

# For queries
embedding.embed_content(
    model="models/gemini-embedding-2",
    content=query,
    task_type="retrieval_query"  # Optimized for finding documents
)

Using the wrong task type degrades search quality. Index with retrieval_document, query with retrieval_query.
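
One way to avoid mixing the two up is to bind each task type to its own function once. This is a minimal sketch with a fake embedding function so it runs offline; `make_embedders` and `fake_embed` are hypothetical names, and in real use you would pass the SDK's embed_content call instead.

```python
def make_embedders(embed_fn, model):
    """Return (embed_document, embed_query) with the task type fixed for each."""
    def embed_document(content):
        return embed_fn(model=model, content=content,
                        task_type="retrieval_document")

    def embed_query(content):
        return embed_fn(model=model, content=content,
                        task_type="retrieval_query")

    return embed_document, embed_query

# Stand-in for the real embedding call: records the task type it was given
calls = []

def fake_embed(model, content, task_type):
    calls.append(task_type)
    return {"embedding": [0.0]}

embed_document, embed_query = make_embedders(fake_embed, "models/gemini-embedding-2")
embed_document("Customer reported login timeout after 30 seconds")
embed_query("login timeout")

print(calls)  # ['retrieval_document', 'retrieval_query']
```

Ingestion code then only ever sees embed_document and search code only ever sees embed_query.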

4. Cross-Modal Alignment Varies

Not all modality pairs align equally well:

Text <-> Image:   Strong alignment (trained extensively)
Text <-> Audio:   Good alignment (speech-text pairs)
Image <-> Audio:  Weaker alignment (less training data)

For image-to-audio search, consider using text as a bridge:

# Instead of: image -> direct audio search
# Use: image -> text query -> audio search

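def image_to_audio_search(image_path: str, top_k: int = 5):
    # 1. Get image embedding
    image_vec = get_image_embedding(image_path)

    # 2. Find closest text document first (metadata key is 'content_type')
    text_results = index.query(vector=image_vec, top_k=1,
                               filter={'content_type': 'text'}, include_metadata=True)
    if not text_results['matches']:
        return None

    # 3. Use top text as query for audio
    #    (assumes the ticket text was stored under metadata['text'] at ingest)
    top_text = text_results['matches'][0]['metadata']['text']
    text_vec = get_text_embedding(top_text, task_type="retrieval_query")

    # 4. Search audio with text query
    return index.query(vector=text_vec, top_k=top_k,
                       filter={'content_type': 'audio'}, include_metadata=True)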

Real-World Use Cases

E-commerce Product Search

Users upload a photo of a product they want:

def find_similar_products(user_photo: str) -> list:
    results = search_with_image(user_photo)

    products = []
    for match in results['matches']:
        if match['metadata']['type'] == 'product':
            products.append({
                'name': match['metadata']['product_name'],
                'price': match['metadata']['price'],
                'image_url': match['metadata']['image_url']
            })

    return products

Customer Support System

Query with any modality, retrieve all relevant context:

def get_support_context(query, modality='text'):
    if modality == 'image':
        vec = get_image_embedding(query)
    elif modality == 'audio':
        vec = get_audio_embedding(query)
    else:
        vec = get_text_embedding(query)

    results = index.query(vector=vec, top_k=10, include_metadata=True)

    return {
        'related_tickets': filter_by_type(results, 'text'),
        'similar_screenshots': filter_by_type(results, 'image'),
        'relevant_calls': filter_by_type(results, 'audio')
    }
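
The function above relies on a filter_by_type helper that isn't shown. A minimal sketch of what it could look like, assuming each match carries the content_type metadata written at ingest time (the sample results dict is made up for the demo):

```python
def filter_by_type(results, content_type):
    """Keep only the matches whose metadata content_type equals content_type."""
    return [
        match for match in results["matches"]
        if match["metadata"].get("content_type") == content_type
    ]

# Made-up query results with the same shape the search code reads
sample_results = {
    "matches": [
        {"id": "ticket-001", "score": 0.89, "metadata": {"content_type": "text"}},
        {"id": "screenshot-001", "score": 0.82, "metadata": {"content_type": "image"}},
        {"id": "audio-001", "score": 0.78, "metadata": {"content_type": "audio"}},
        {"id": "ticket-045", "score": 0.75, "metadata": {"content_type": "text"}},
    ]
}

text_ids = [m["id"] for m in filter_by_type(sample_results, "text")]
print(text_ids)  # ['ticket-001', 'ticket-045']
```

Filtering after the query keeps it to one round trip, at the cost that a type with few strong matches may end up with an empty list.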

Knowledge Management

Search across mixed documentation:

def search_knowledge_base(query: str):
    vec = get_text_embedding(query, task_type="retrieval_query")

    results = index.query(
        vector=vec,
        top_k=15,
        include_metadata=True
    )

    # Group by content type (unknown types get their own bucket)
    grouped = {
        'docs': [], 'diagrams': [], 'presentations': []
    }

    for match in results['matches']:
        content_type = match['metadata'].get('content_type', 'docs')
        grouped.setdefault(content_type, []).append(match)

    return grouped

Summary

Gemini Embedding 2 solves the fragmented data search problem by creating a unified vector space for text, images, and audio. Instead of maintaining separate embedding models and indexes for each data type, I now use a single model that enables cross-modal retrieval.

The key changes in my architecture:

Single embedding model instead of text-only + CLIP + Whisper
Single vector index instead of separate databases per modality
Cross-modal queries - search with image, get text results

The trade-off: I had to be more careful about normalization and input size limits. Cross-modal alignment isn’t perfect for all pairs, so I sometimes use text as a bridge between image and audio.

Final Words + More Resources

My intention with this article was to help others by sharing my knowledge and experience. If you want to get in touch, you can contact me by email.

Here are also the most important links from this article along with some further resources that will help you in this scope:

👨‍💻 Google Gemini Embedding Documentation
👨‍💻 Reddit: Gemini Embedding 2 Discussion
👨‍💻 Pinecone Vector Database
