SigLIP 2 has no native video. peepshow bridges.
SigLIP 2's vision encoder is image-only. peepshow turns video and animated formats into the frame timeline SigLIP 2 already accepts.
- Embedding step, not a chat model. SigLIP 2 turns a frame into a 768-D vector — no generation, no tokens out. peepshow gives it the frame timeline; you push vectors straight to the vector sink.
- Pre-index so query-time is cheap. Embed once at ingest with SigLIP — Chroma / Qdrant / Pinecone hits at query time skip the re-embed pass entirely. Vector search latency drops to a single similarity scan.
- Better text-image alignment than CLIP. SigLIP 2 trains with a sigmoid loss and adds NaFlex variable-resolution variants, and it outperforms OpenAI CLIP on zero-shot retrieval. It takes peepshow's JPEG frames as input without conversion.
- Open weights, fully local. Apache 2.0 / MIT on HuggingFace. Pair peepshow + SigLIP + a local Chroma for an entirely on-prem RAG pipeline — no cloud embeddings API.
- Cheap to run. Encoder-only forward pass — ~80ms / frame on a 4090, ~200ms on M3 Max via MLX. peepshow's default 20-frame budget = a few seconds of GPU per video.
- EVA-CLIP swap-in. Same input shape as EVA-CLIP, BiomedCLIP, MetaCLIP. peepshow's frame bundle is portable across encoders — pick the one tuned for your domain.
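The pre-index bullet above comes down to this: at query time the text query is embedded once and compared against frame vectors that were stored at ingest. Below is a minimal sketch of that similarity scan, assuming toy 3-D vectors in place of real 768-D SigLIP 2 embeddings. The `top_k` and `normalize` helpers and all vector values are hypothetical, and a real vector DB such as Chroma would typically use an index rather than this linear scan.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def top_k(query, frames, k=2):
    """Return the k (frame_id, score) pairs most similar to the query vector."""
    q = normalize(query)
    scored = [
        (frame_id, sum(a * b for a, b in zip(q, normalize(vec))))
        for frame_id, vec in frames.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy stand-ins for frame embeddings computed once at ingest (hypothetical values).
frames = {
    "run1:0": [1.0, 0.0, 0.0],
    "run1:1": [0.0, 1.0, 0.0],
    "run1:2": [0.7, 0.7, 0.0],
}

# Toy stand-in for a text-query embedding produced by the same encoder.
hits = top_k([0.9, 0.1, 0.0], frames, k=2)
print(hits)
```

The frames are never re-embedded here; only the query vector is new, which is why the ingest-time GPU cost does not recur at query time.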
Token-cost math (worked examples)
| Workload | Native upload | peepshow + SigLIP 2 |
|---|---|---|
| 1 frame at 768×768 (SigLIP 2 base) | 1 forward pass | ~80ms on 4090 · ~200ms on M3 Max |
| 1 frame embedding (768-D × float32) | — | 3 KB on disk per frame · 0.75 KB at int8 |
| 20-frame peepshow run (default) | — | ~60 KB vectors total · ~1.5s GPU on 4090 |
| 1M frames across a video corpus | — | ~3 GB vector storage · ~25 GPU-hours one-shot ingest |
SigLIP 2 emits no tokens, only vectors. Storage math is for 768-D float32; the So400m variants emit 1152-D vectors at ~4.6 KB / frame (ViT-Large is 1024-D, ~4 KB / frame). Pre-indexing trades upfront GPU for permanently-cheap query latency.
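The figures in the table can be reproduced with a few lines of arithmetic. The sketch below assumes the numbers stated above (768-D vectors, 4 bytes per float32 value, ~80 ms per frame on a 4090). The function names are hypothetical, and decimal units are used (1 GB = 10^9 bytes).

```python
def vector_bytes(dims, bytes_per_value):
    """Size of one embedding on disk, ignoring index and metadata overhead."""
    return dims * bytes_per_value

def corpus_estimate(frames, dims=768, bytes_per_value=4, seconds_per_frame=0.080):
    """Return (storage in GB, GPU-hours) for embedding `frames` frames once."""
    storage_gb = frames * vector_bytes(dims, bytes_per_value) / 1e9
    gpu_hours = frames * seconds_per_frame / 3600
    return storage_gb, gpu_hours

per_frame_f32 = vector_bytes(768, 4)    # float32: 4 bytes per dimension
per_frame_int8 = vector_bytes(768, 1)   # int8: 1 byte per dimension
run_bytes = 20 * per_frame_f32          # default 20-frame peepshow run
run_seconds = 20 * 0.080                # same run at ~80 ms per frame

storage_gb, gpu_hours = corpus_estimate(1_000_000)

print(f"per frame: {per_frame_f32} B float32, {per_frame_int8} B int8")
print(f"20-frame run: {run_bytes / 1000:.1f} KB, {run_seconds:.1f} s GPU")
print(f"1M frames: {storage_gb:.2f} GB, {gpu_hours:.1f} GPU-hours")
```

At a flat 80 ms per frame the 1M-frame ingest works out to roughly 22 GPU-hours, so the table's ~25 presumably leaves headroom for decoding and I/O; that reading is an assumption, not something the table states.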
Install (CLI)
npm install -g peepshow
# Self-host SigLIP 2 via transformers (open weights):
pip install transformers torch
# Plus a vector sink — peepshow ships sinks for Chroma / Qdrant / Pinecone:
peepshow ./demo.mp4 --emit json > run.json
Install (SigLIP 2 API directly, no CLI)
Calling SigLIP 2 from your own code? Run peepshow first, then load the frames listed in the JSON manifest and embed them:
# Pipeline: peepshow → SigLIP 2 → Chroma
# Step 1: extract frames
# peepshow ./lecture.mp4 --emit json > run.json
# Step 2: embed each frame with SigLIP 2, then push to Chroma
python - <<'PY'
import json
from transformers import AutoProcessor, AutoModel
from PIL import Image
import torch, chromadb

# Load the NaFlex SigLIP 2 base checkpoint (open weights on HuggingFace).
processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")
model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex").eval()

# Read the peepshow manifest and open every extracted frame.
with open("run.json") as fh:
    run = json.load(fh)
images = [Image.open(f["path"]).convert("RGB") for f in run["frames"]]

# Embed all frames in one batch; inference needs no gradients.
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    vectors = model.get_image_features(**inputs).cpu().numpy()

# Upsert one vector per frame, keyed by run ID and frame index.
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("peepshow")
col.upsert(
    ids=[f"{run['runId']}:{i}" for i in range(len(images))],
    embeddings=vectors.tolist(),
    metadatas=[{"path": f["path"], "ts": f["timestamp"]} for f in run["frames"]],
)
PY
# Step 3: query Chroma with a text embedding (also from SigLIP 2); frames are not re-embedded at scan time.
Animated GIF / APNG / WebP: peepshow's killer move on SigLIP 2
SigLIP 2's image encoder reads JPEG / PNG. Animated GIF / APNG / WebP need flattening before they reach the embedder — peepshow does it automatically, so an animated product tour can be pre-indexed for vector search alongside ordinary video.
peepshow ./meme.gif # animated GIF → frame timeline
peepshow ./tutorial.apng # animated PNG → frames
peepshow ./loop.webp # animated WebP → frames
Frame strategy presets
peepshow picks scene-change frames by default. For SigLIP 2 specifically, these presets are worth knowing:
--strategy scene --max 20
Default. Scene-change frames give the densest semantic coverage per embedding GPU-second.
--strategy scene --max 12 --dedup perceptual
Static / talking-head footage. Dedup before embedding so the vector DB doesn't store near-identical points.
--strategy fps --fps 0.5 --max 60
Steady-cadence content (sport, gameplay). Predictable embedding density for similarity search.
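How the `--dedup perceptual` preset works internally is not described here. As an illustration of the general idea, the sketch below uses an average hash (aHash) with a Hamming-distance threshold to drop near-identical frames before they reach the embedder. This is one common perceptual-hashing technique and not necessarily what peepshow implements; the function names, the 4×4 grayscale grids, and the threshold are all hypothetical. Real frames would first be downscaled to a small grayscale grid (often 8×8).

```python
def average_hash(pixels):
    """Hash a small grayscale grid: one bit per pixel, set when pixel >= mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p >= mean else 0 for p in flat)

def hamming(a, b):
    """Count the bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def dedup(frames, max_distance=1):
    """Keep a frame only if its hash is more than max_distance bits away
    from the hash of every frame kept so far."""
    kept = []
    for name, pixels in frames:
        h = average_hash(pixels)
        if all(hamming(h, kept_hash) > max_distance for _, kept_hash in kept):
            kept.append((name, h))
    return [name for name, _ in kept]

# Toy 4x4 grayscale frames (hypothetical values, 0-255).
frames = [
    ("f0", [[10, 10, 200, 200]] * 4),                # dark left, bright right
    ("f1", [[12, 11, 198, 201], [9, 10, 202, 199],
            [11, 12, 200, 198], [10, 9, 201, 200]]),  # f0 plus sensor noise
    ("f2", [[200, 200, 10, 10]] * 4),                # mirrored layout
]

unique = dedup(frames)
print(unique)
```

Only the frames in `unique` would then be embedded, so the vector DB stores one point per visually distinct frame instead of several near-identical ones.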
All 95 sinks still fire
Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached — all from one SigLIP 2 run. Browse the full sink catalogue →.
Report + LLM analysis loop
Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When SigLIP 2 consumes the frames, you can write the results back into the report, so whoever opens it next sees them without re-running the model.
echo '{"summary":"<SigLIP 2 summary>","provider":"siglip2-base-patch16-naflex"}' \
| peepshow report annotate "<outputDir>"
When to skip peepshow + use SigLIP 2 direct
- Need a chat completion, not an embedding — call Granite / Claude / GPT instead.
- Corpus is already embedded with a different encoder (don't mix vector spaces).
- Source is a single image — embed it directly without peepshow.
For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → SigLIP 2 reads them as images + text.