Multimodal Embedding API + Model Routing
Every data type.
One vector space.
Embed text, images, audio, video, and PDFs into a single canonical space. Search across modalities. Route any provider's vectors — OpenAI, Gemini, Cohere — into one unified space. Starting at $0.10/1M tokens.
from schift import Schift

s = Schift(api_key="sk-...")

# Text embedding
text_vec = s.embed("quarterly revenue report")

# Image embedding — same canonical space
img_vec = s.embed(image="product_photo.jpg")

# Cross-modal search: find images with text query
results = s.search("product demo", collection="media")
# → {"file": "demo.mp4", "timestamp": "03:22", "score": 0.94}

# Model Routing: bring existing OpenAI vectors
s.project(vectors=[...], source="openai/text-embedding-3-large")
6 modalities (text, image, audio, video, PDF, code)
99.7% retrieval recovery
$0.10 starting price /1M tokens
50% cheaper than Gemini
Proof
Tested across 11 model pairs and 5 modalities. Zero failures.
We measured Recall@10 on real retrieval benchmarks across text, image, and audio encoders. Without projection, switching models drops retrieval to absolute zero. With Schift, text-to-text recovery ranges from 92% to 104% — and cross-modal projection (image→text, audio→text) achieves up to 94.5% recovery.
Common misconception
"Same vendor, same dimensions — should be compatible, right?"
Wrong. OpenAI's ada-002 and text-embedding-3-small are both 1536-dimensional, but querying one with the other returns zero relevant results. Same story for Google's Gemini models. Every model version creates an entirely different vector space. Dimensions match — semantics don't.
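You can see why dimension-matching fails with a toy simulation (not real model weights — just two random encoders standing in for two model versions). Each "model" embeds the same underlying documents into the same output dimension, yet querying one model's index with the other model's vectors collapses to chance-level retrieval:

```python
import numpy as np

# Toy illustration: two embedding "models" with identical output
# dimension but independently random bases. Not Schift's pipeline —
# just a demonstration of why matching dims don't imply matching spaces.
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 64))    # 100 documents, shared latent features
W_a = rng.normal(size=(64, 1536))    # "model A" encoder
W_b = rng.normal(size=(64, 1536))    # "model B" encoder, same 1536 dims

index = data @ W_a                   # corpus embedded with model A

def recall_at_1(queries, index):
    # A query "hits" if its nearest neighbor (by dot product)
    # is its own document.
    sims = queries @ index.T
    return float(np.mean(sims.argmax(axis=1) == np.arange(len(queries))))

same_model = recall_at_1(data[:10] @ W_a, index)   # A queries vs A index
mismatched = recall_at_1(data[:10] @ W_b, index)   # B queries vs A index
print(same_model)   # → 1.0
print(mismatched)   # near zero — chance level
```

Same documents, same 1536 dimensions, entirely different geometry: the mismatched case is no better than guessing.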
OpenAI → OpenAI
ada-002 + ada-002: 0.851
ada-002 + 3-small: 0.000
Same vendor. Same 1536 dims. Completely broken.

Google → Google
gem-001 + gem-001: 0.978
gem-001 + gem-2: 0.000
Same vendor. Same 3072 dims. Completely broken.
With Schift
Gemini-001 baseline: 0.978 R@10
Model mismatch: 0.000 R@10
Schift projected: 0.970 R@10 — 99.7% recovered
| Source model | Target model | Dimensions | Recovery | Verdict |
|---|---|---|---|---|
| ada-002 | text-embedding-3-small | 1536 → 1536 | 97.7% | SAFE |
| ada-002 | text-embedding-3-large | 1536 → 3072 | 97.9% | SAFE |
| text-embedding-3-small | text-embedding-3-large | 1536 → 3072 | 97.1% | SAFE |
| gemini-embedding-001 | gemini-embedding-2 | 3072 → 3072 | 99.7% | SAFE |
| ada-002 | gemini-embedding-001 | 1536 → 3072 | 95.8% | SAFE |
| gemini-embedding-001 | text-embedding-3-large | 3072 → 3072 | 103.5% | SAFE |
| gemini-embedding-001 | text-embedding-3-small | 3072 → 1536 | 99.8% | SAFE |
Cross-Modal Projection — Image, Audio → Text Search Space
Can an image encoder's vectors be searched with text queries? We projected image and audio embeddings into our text retrieval space using the same Ridge Regression technique.
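The Ridge Regression technique named above can be sketched in a few lines. This is a minimal illustration with simulated paired embeddings, not Schift's actual training data or projection weights: it assumes you have the same items embedded in both the source space and the target space, fits one linear map between them, and projects held-out vectors.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Sketch of a Ridge Regression projection between embedding spaces.
# Paired embeddings are simulated here; in practice they would be the
# same items encoded by the source model and the target model.
rng = np.random.default_rng(1)
latent = rng.normal(size=(500, 64))           # shared semantic content
src = latent @ rng.normal(size=(64, 768))     # e.g. a 768d image encoder
tgt = latent @ rng.normal(size=(64, 1024))    # 1024d canonical text space

# Fit one linear map src -> tgt on paired anchor items.
proj = Ridge(alpha=1.0).fit(src[:400], tgt[:400])
pred = proj.predict(src[400:])                # project held-out vectors

# Cosine similarity between projected and true target vectors.
cos = np.sum(pred * tgt[400:], axis=1) / (
    np.linalg.norm(pred, axis=1) * np.linalg.norm(tgt[400:], axis=1))
print(float(cos.mean()))                      # close to 1.0 on this toy data
```

A single linear map is enough here because both toy spaces are linear images of the same latent content; real encoders are messier, which is why recovery lands below 100% for most pairs in the tables.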
| Source Encoder | Type | Dimensions | R@10 Recovery | R@50 |
|---|---|---|---|---|
| CLIP ViT-L/14 | Image | 768 → 1024 | 93.5% | 99.0% |
| CLIP ViT-B/32 | Image | 512 → 1024 | 94.5% | 99.5% |
| DINOv2-base (No text training) | Image | 768 → 1024 | 86.0% | 96.0% |
| CLAP (audio) | Audio | 512 → 1024 | 90.9% | 98.4% |
| Whisper encoder | Audio | 512 → 1024 | 67.8% | 85.0% |
Key insight
DINOv2: Never trained on text. Still 86% recovery.
→ Proves the canonical space is truly modality-agnostic.
→ Any encoder can be projected into the Schift space.
Text benchmarks: SciFact (5,183 documents). Image benchmarks: COCO Karpathy (1,000 images). Audio benchmarks: ESC-50 (1,600 clips). Recovery = projected R@10 / gold R@10 × 100.
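The recovery formula above is straightforward to compute. A minimal sketch, using toy ranked lists rather than the actual benchmark runs:

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    # Fraction of queries whose relevant doc appears in the top-k results.
    hits = sum(1 for ranked, rel in zip(ranked_ids, relevant_ids)
               if rel in ranked[:k])
    return hits / len(ranked_ids)

# Recovery = projected R@10 / gold R@10 × 100, per the formula above.
# Two toy queries with relevant docs 1 and 6, evaluated at k=3:
gold = recall_at_k([[1, 2, 3], [4, 5, 6]], [1, 6], k=3)        # both hit
projected = recall_at_k([[2, 1, 9], [7, 8, 9]], [1, 6], k=3)   # one miss
recovery = projected / gold * 100
print(recovery)  # → 50.0
```

Note that recovery can exceed 100% (as in the gemini-001 → 3-large row) whenever the projected vectors happen to retrieve slightly better in the target space than the gold baseline does.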
How it works
Embed. Search. Pinpoint. Route.
One API for every data type. Cross-modal search out of the box. Bring existing vectors from any provider.
Embed Anything
Text, images, audio, video, PDF — one endpoint, one vector space. Upload raw files or pass text. Every modality maps to the same 1024-dimensional canonical space.
from schift import Schift
s = Schift(api_key="sk-...")
# All return the same 1024d canonical vector
text = s.embed("quarterly revenue report")
img = s.embed(image="chart.png")
doc = s.embed(document="report.pdf")

Search Across Modalities
Query with text, get back images. Query with audio, get back documents. Cross-modal search works because everything lives in the same space.
# Text query → finds matching images, audio, video, PDFs
results = s.search("product demo screenshot")
# → [
# {"file": "demo.mp4", "at": "03:22", "score": 0.94},
# {"file": "slides.pdf", "page": 7, "score": 0.91},
# ]

Pinpoint Location
Don't just find the document — find the exact page, timestamp, or frame. Chunk-level embedding with location metadata. Enterprise search that actually tells you where to look.
# PDF → per-page embedding
doc = s.embed(document="contract.pdf", chunking=True)
# → [{"page": 1, "embedding": [...]},
# {"page": 2, "embedding": [...]}, ...]
# Audio → per-second chunks
audio = s.embed(audio="meeting.mp3", chunking=True)
# → [{"at": "12:35", "embedding": [...]}, ...]

Route Any Provider
Already using OpenAI, Gemini, or Cohere? Bring your existing vectors. We project them into our canonical space — 99.7% retrieval recovery. Switch providers without re-embedding.
# Bring existing vectors from any provider
s.project(
    vectors=existing_openai_vectors,
    source="openai/text-embedding-3-large"
)
# → 99.7% recovery, no re-embedding needed
# → Now searchable alongside images, audio, video
Pricing
Cheaper than the competition. More modalities.
Text, image, audio, video, PDF — all in one API. Every modality cheaper than the closest alternative. Model routing included.
Per-modality pricing table (vs Gemini Embedding 2):
✓ No input limits (Gemini: 6 images, 80s audio, 6 pages)
✓ Model Routing (Gemini: not available)
✓ Chunk localization with location metadata
What's included
All Modalities
Text, image, audio, video, and PDF in one endpoint. One API key, one SDK, every input type.
Model Routing
Bring existing OpenAI, Gemini, or Cohere vectors and project them to canonical space. Switch models without re-embedding.
Chunk Localization
Every match returns location metadata — page number, timestamp, or frame index. Know exactly where the result came from.
Supported providers for routing
OpenAI
text-embedding-3-small, 3-large
Google
gemini-embedding-2, text-embedding-004
Cohere
embed-v4
More providers added regularly. Request a provider →
Startup Program
Under 50 employees and less than $5M raised? Get $500 in free credits. We believe early-stage teams shouldn't pay for vendor lock-in they didn't choose. Apply →