Multimodal Embedding API + Model Routing

Every data type.
One vector space.

Embed text, images, audio, video, and PDFs into a single canonical space. Search across modalities. Route any provider's vectors — OpenAI, Gemini, Cohere — into one unified space. Starting at $0.10/1M tokens.

from schift import Schift
s = Schift(api_key="sk-...")

# Text embedding
text_vec = s.embed("quarterly revenue report")

# Image embedding — same canonical space
img_vec = s.embed(image="product_photo.jpg")

# Cross-modal search: find images with text query
results = s.search("product demo", collection="media")
# → {"file": "demo.mp4", "timestamp": "03:22", "score": 0.94}

# Model Routing: bring existing OpenAI vectors
s.project(vectors=[...], source="openai/text-embedding-3-large")

6 modalities (text, image, audio, video, PDF, code)

99.7% retrieval recovery

$0.10 starting price / 1M tokens

50% cheaper than Gemini

Proof

Tested across 11 model pairs and 5 modalities. Zero failures.

We measured Recall@10 on real retrieval benchmarks across text, image, and audio encoders. Without projection, switching models drops retrieval to absolute zero. With Schift, text-to-text recovery ranges from 92% to 104% — and cross-modal projection (image→text, audio→text) achieves up to 94.5% recovery.

Common misconception

"Same vendor, same dimensions — should be compatible, right?"

Wrong. OpenAI's ada-002 and text-embedding-3-small are both 1536-dimensional, but querying one with the other returns zero relevant results. Same story for Google's Gemini models. Every model version creates an entirely different vector space. Dimensions match — semantics don't.
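A toy sketch of why matching dimensions don't imply a shared space. The two "models" below are random rotations of the same latent vectors (synthetic data, not real API embeddings): each space is internally consistent, but mixing them destroys retrieval.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_docs = 64, 500

# Shared latent "meanings" that both hypothetical models encode
latent = rng.normal(size=(n_docs, dim))

def random_rotation(rng, dim):
    # A random orthonormal basis: preserves geometry, scrambles axes
    q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
    return q

emb_a = latent @ random_rotation(rng, dim)  # "model A" embeddings
emb_b = latent @ random_rotation(rng, dim)  # "model B" embeddings, same dim

def top1_accuracy(queries, index):
    # Does each query retrieve its own document by dot-product similarity?
    sims = queries @ index.T
    return float((sims.argmax(axis=1) == np.arange(len(queries))).mean())

print(top1_accuracy(emb_a, emb_a))  # same model: ≈ 1.0
print(top1_accuracy(emb_b, emb_a))  # mixed models: chance level, near 0.0
```

Both matrices are 500 × 64 and both spaces retrieve perfectly on their own; only the cross-space query fails.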

OpenAI → OpenAI

ada-002 + ada-002: 0.851 R@10
ada-002 + 3-small: 0.000 R@10

Same vendor. Same 1536 dims. Completely broken.

Google → Google

gem-001 + gem-001: 0.978 R@10
gem-001 + gem-2: 0.000 R@10

Same vendor. Same 3072 dims. Completely broken.

With Schift

Gemini-001 baseline: 0.978 R@10
Model mismatch: 0.000 R@10
Schift projected: 0.970 R@10 (99.7% recovered)

Source model | Target model | Dimensions | Recovery | Verdict
ada-002 | text-embedding-3-small | 1536 → 1536 | 97.7% | SAFE
ada-002 | text-embedding-3-large | 1536 → 3072 | 97.9% | SAFE
text-embedding-3-small | text-embedding-3-large | 1536 → 3072 | 97.1% | SAFE
gemini-embedding-001 | gemini-embedding-2 | 3072 → 3072 | 99.7% | SAFE
ada-002 | gemini-embedding-001 | 1536 → 3072 | 95.8% | SAFE
gemini-embedding-001 | text-embedding-3-large | 3072 → 3072 | 103.5% | SAFE
gemini-embedding-001 | text-embedding-3-small | 3072 → 1536 | 99.8% | SAFE

Cross-Modal Projection — Image, Audio → Text Search Space

Can an image encoder's vectors be searched with text queries? We projected image and audio embeddings into our text retrieval space using the same Ridge Regression technique.

Source encoder | Type | Dimensions | R@10 recovery | R@50
CLIP ViT-L/14 | Image | 768 → 1024 | 93.5% | 99.0%
CLIP ViT-B/32 | Image | 512 → 1024 | 94.5% | 99.5%
DINOv2-base (no text training) | Image | 768 → 1024 | 86.0% | 96.0%
CLAP | Audio | 512 → 1024 | 90.9% | 98.4%
Whisper encoder | Audio | 512 → 1024 | 67.8% | 85.0%
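For intuition, the projection step can be sketched as plain ridge regression on a paired "anchor" set: the same items embedded by both the source and target model. This is an illustrative reconstruction with synthetic data and made-up dimensions, not Schift's actual pipeline:

```python
import numpy as np

rng = np.random.default_rng(1)
d_src, d_tgt, n_anchor = 512, 1024, 2000

# Hypothetical anchor set: the same items embedded by both models.
# Here the "true" relation is a linear map plus a little noise.
true_map = rng.normal(size=(d_src, d_tgt)) / np.sqrt(d_src)
X = rng.normal(size=(n_anchor, d_src))                          # source embeddings
Y = X @ true_map + 0.01 * rng.normal(size=(n_anchor, d_tgt))    # target embeddings

def fit_ridge_projection(X, Y, lam=1.0):
    # Closed-form ridge regression: W = (XᵀX + λI)⁻¹ XᵀY
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

W = fit_ridge_projection(X, Y)

# Project held-out source vectors into the target space
X_new = rng.normal(size=(5, d_src))
projected = X_new @ W
print(projected.shape)  # (5, 1024)
```

Once fitted, `W` projects any source-space vector into the target space with a single matrix multiply, which is why routing can be priced per vector rather than per token.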

Key insight

DINOv2: Never trained on text. Still 86% recovery.

→ Proves the canonical space is truly modality-agnostic.
→ Any encoder can be projected into the Schift space.

Text benchmarks: SciFact (5,183 documents). Image benchmarks: COCO Karpathy (1,000 images). Audio benchmarks: ESC-50 (1,600 clips). Recovery = projected R@10 / gold R@10 × 100.
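The recovery metric defined in the note above can be made concrete in a few lines (toy similarity matrices, not the benchmark data):

```python
import numpy as np

def recall_at_k(sims, relevant, k=10):
    # sims: (n_queries, n_docs) similarities; relevant: gold doc index per query
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (topk == np.asarray(relevant)[:, None]).any(axis=1)
    return float(hits.mean())

def recovery(projected_r10, gold_r10):
    # Recovery = projected R@10 / gold R@10 × 100
    return 100.0 * projected_r10 / gold_r10

# Toy setup: 4 queries over 100 docs; the gold encoder ranks the right doc first
rng = np.random.default_rng(0)
sims_gold = rng.normal(size=(4, 100))
relevant = [3, 42, 7, 99]
for q, doc in enumerate(relevant):
    sims_gold[q, doc] += 10.0

# "Projected" similarities: a mildly perturbed ranking
sims_proj = sims_gold + 0.5 * rng.normal(size=sims_gold.shape)

r_gold = recall_at_k(sims_gold, relevant)
r_proj = recall_at_k(sims_proj, relevant)
print(recovery(r_proj, r_gold))  # 100.0 (the perturbation keeps gold docs in the top 10)
```

Note that recovery can exceed 100% when the projected vectors happen to rank a few gold documents higher than the source model did, as in the 103.5% row above.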

How it works

Embed. Search. Pinpoint. Route.

One API for every data type. Cross-modal search out of the box. Bring existing vectors from any provider.

01 Embed Anything

Text, images, audio, video, PDF — one endpoint, one vector space. Upload raw files or pass text. Every modality maps to the same 1024-dimensional canonical space.

from schift import Schift
s = Schift(api_key="sk-...")

# All return the same 1024d canonical vector
text = s.embed("quarterly revenue report")
img  = s.embed(image="chart.png")
doc  = s.embed(document="report.pdf")

02 Search Across Modalities

Query with text, get back images. Query with audio, get back documents. Cross-modal search works because everything lives in the same space.

# Text query → finds matching images, audio, video, PDFs
results = s.search("product demo screenshot")
# → [
#   {"file": "demo.mp4", "at": "03:22", "score": 0.94},
#   {"file": "slides.pdf", "page": 7, "score": 0.91},
# ]

03 Pinpoint Location

Don't just find the document — find the exact page, timestamp, or frame. Chunk-level embedding with location metadata. Enterprise search that actually tells you where to look.

# PDF → per-page embedding
doc = s.embed(document="contract.pdf", chunking=True)
# → [{"page": 1, "embedding": [...]},
#    {"page": 2, "embedding": [...]}, ...]

# Audio → per-second chunks
audio = s.embed(audio="meeting.mp3", chunking=True)
# → [{"at": "12:35", "embedding": [...]}, ...]

04 Route Any Provider

Already using OpenAI, Gemini, or Cohere? Bring your existing vectors. We project them into our canonical space — 99.7% retrieval recovery. Switch providers without re-embedding.

# Bring existing vectors from any provider
s.project(
  vectors=existing_openai_vectors,
  source="openai/text-embedding-3-large"
)
# → 99.7% recovery, no re-embedding needed
# → Now searchable alongside images, audio, video

Pricing

Cheaper than the competition. More modalities.

Text, image, audio, video, PDF — all in one API. Every modality cheaper than the closest alternative. Model routing included.

Modality | Price
Text | $0.10 / 1M tokens
Image | $0.20 / 1M tok-equiv
Audio | $5.00 / 1M tok-equiv
Video | $10.00 / 1M tok-equiv
Document / PDF | $0.15 / 1M tok-equiv
Projection (routing) | $0.005 / 1M vectors
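A back-of-envelope cost check using the prices listed above (the usage figures in the example are made up):

```python
# Prices from the table above, in dollars per 1M token-equivalents
PRICE_PER_M = {
    "text": 0.10,
    "image": 0.20,
    "audio": 5.00,
    "video": 10.00,
    "document": 0.15,
}
PROJECTION_PER_M_VECTORS = 0.005

def monthly_cost(usage_tokens, projected_vectors=0):
    # usage_tokens: dict of modality -> token-equivalents per month
    cost = sum(PRICE_PER_M[m] * t / 1e6 for m, t in usage_tokens.items())
    cost += PROJECTION_PER_M_VECTORS * projected_vectors / 1e6
    return round(cost, 2)

# Example: 50M text tokens, 2M image tok-equiv, 10M routed vectors per month
print(monthly_cost({"text": 50e6, "image": 2e6}, projected_vectors=10e6))  # 5.45
```

Routing is three orders of magnitude cheaper per unit than embedding, so migrating an existing index is a rounding error next to re-embedding it.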

vs Gemini Embedding 2

Text: $0.10 vs $0.20 (50% cheaper)

No input limits (Gemini: 6 images, 80s audio, 6 pages)

Model Routing (Gemini: not available)

Chunk localization with location metadata

What's included

All Modalities

Text, image, audio, video, and PDF in one endpoint. One API key, one SDK, every input type.

Model Routing

Bring existing OpenAI, Gemini, or Cohere vectors and project them to canonical space. Switch models without re-embedding.

Chunk Localization

Every match returns location metadata — page number, timestamp, or frame index. Know exactly where the result came from.

Supported providers for routing

OpenAI

text-embedding-3-small, 3-large

Google

gemini-embedding-2, text-embedding-004

Cohere

embed-v4

More providers added regularly. Request a provider →

Startup Program

Under 50 employees and less than $5M raised? Get $500 in free credits. We believe early-stage teams shouldn't pay for vendor lock-in they didn't choose. Apply →