Reel #M-15 SigLIP 2 · EVA-CLIP · embedding only

peepshow for models / siglip

SigLIP 2

SigLIP 2 is an image embedder, not a chat model. peepshow turns a video into the frame batch that SigLIP 2 forward-passes; the resulting vectors go straight to Chroma / Qdrant / Pinecone.

SigLIP 2 (Sigmoid Loss for Language Image Pre-Training, Google Research) and the EVA-CLIP family are image-text embedding models: open-weight, encoder-only, no chat surface. They turn a frame into a vector (768-D for the SigLIP 2 base checkpoint), not a sentence. peepshow extracts a clean frame timeline so SigLIP 2 can pre-index every clip before the vectors land in Chroma / Qdrant / Pinecone. Frames are embedded once at ingest, so a query needs only its own text embedding and a similarity search.

SigLIP 2 has no native video. peepshow bridges.

SigLIP 2's vision is image-only. peepshow turns video + animated formats into the frame timeline SigLIP 2 already accepts.

  • Embedding step, not a chat model. SigLIP 2 turns a frame into a 768-D vector — no generation, no tokens out. peepshow gives it the frame timeline; you push vectors straight to the vector sink.
  • Pre-index so query-time is cheap. Embed once at ingest with SigLIP 2; lookups in Chroma / Qdrant / Pinecone then skip the frame re-embed pass entirely. Query cost drops to one text embedding plus a single similarity scan.
  • Better text-image alignment than CLIP. SigLIP 2 combines a sigmoid loss with NaFlex variable resolution and outperforms OpenAI CLIP on zero-shot retrieval. It takes peepshow's JPEG frames as input with no conversion step.
  • Open weights, fully local. Apache 2.0 / MIT on Hugging Face. Pair peepshow + SigLIP 2 + a local Chroma for an entirely on-prem RAG pipeline, with no cloud embeddings API.
  • Cheap to run. Encoder-only forward pass — ~80ms / frame on a 4090, ~200ms on M3 Max via MLX. peepshow's default 20-frame budget = a few seconds of GPU per video.
  • EVA-CLIP swap-in. EVA-CLIP, BiomedCLIP, and MetaCLIP take the same image input as SigLIP 2, so peepshow's frame bundle is portable across encoders. Pick the one tuned for your domain.

Token-cost math (worked examples)

Clip | Native upload | peepshow + SigLIP 2
1 frame at 768×768 (SigLIP 2 base) | 1 forward pass | ~80ms on 4090 · ~200ms on M3 Max
1 frame embedding (768-D × float32) | 3 KB on disk per frame · 0.75 KB at int8
20-frame peepshow run (default) | ~60 KB vectors total · ~1.5s GPU on 4090
1M frames across a video corpus | ~3 GB vector storage · ~25 GPU-hours one-shot ingest

SigLIP 2 emits no tokens, only vectors. The storage math is for 768-D float32; the larger So400m variants output 1152-D vectors at ~4.6 KB / frame. Pre-indexing trades upfront GPU time for permanently cheap query latency.
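The figures in the table can be reproduced with a few lines of arithmetic. The sketch below is a back-of-envelope calculator, not part of peepshow; the function names are made up for illustration, and the 80 ms per frame default is simply the 4090 figure quoted above. It counts raw vector bytes only (no index overhead) and raw forward-pass time only, which is why it lands slightly under the ~25 GPU-hours quoted for a full ingest.

```python
# Back-of-envelope sizing for a frame-embedding corpus.
# Defaults mirror the figures quoted above; all results are rough estimates.

def vector_bytes(dim: int = 768, bytes_per_value: int = 4) -> int:
    """Raw size of one embedding (float32 = 4 bytes, int8 = 1 byte)."""
    return dim * bytes_per_value

def corpus_storage_gb(n_frames: int, dim: int = 768, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal gigabytes, excluding index overhead."""
    return n_frames * vector_bytes(dim, bytes_per_value) / 1e9

def ingest_gpu_hours(n_frames: int, ms_per_frame: float = 80.0) -> float:
    """Forward-pass time for a one-shot ingest, assuming a fixed per-frame latency."""
    return n_frames * ms_per_frame / 1000 / 3600

print(vector_bytes())                          # 3072 bytes, i.e. 3 KB per frame
print(vector_bytes(bytes_per_value=1))         # 768 bytes at int8
print(20 * vector_bytes())                     # 61440 bytes for a 20-frame run
print(round(corpus_storage_gb(1_000_000), 2))  # 3.07
print(round(ingest_gpu_hours(1_000_000), 1))   # 22.2
```

Swap in `dim=1152` to size a corpus for the larger checkpoints.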

Install (CLI)

npm install -g peepshow

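# Self-host SigLIP 2 via transformers (open weights), plus Pillow and the Chroma client used by the pipeline script:
pip install transformers torch pillow chromadb

# peepshow ships sinks for Chroma / Qdrant / Pinecone; this run writes the JSON manifest that the embedding step reads:
peepshow ./demo.mp4 --emit json > run.json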

Install (SigLIP 2 directly, from your own code)

Running SigLIP 2 from your own code? Run peepshow first, then load the frames listed in the JSON manifest and embed them as one batch:

# Pipeline: peepshow → SigLIP 2 → Chroma
# Step 1: extract frames and write the manifest
peepshow ./lecture.mp4 --emit json > run.json

# Step 2: embed each frame with SigLIP 2, then push to Chroma
python - <<'PY'
import json

import chromadb
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

MODEL_ID = "google/siglip2-base-patch16-naflex"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID).eval()

# Load the manifest and open every extracted frame as RGB.
with open("run.json") as fh:
    run = json.load(fh)
frames = run["frames"]
images = [Image.open(f["path"]).convert("RGB") for f in frames]

# One batched forward pass. Unit-length vectors make distance rankings match cosine similarity.
with torch.no_grad():
    inputs = processor(images=images, return_tensors="pt")
    features = model.get_image_features(**inputs)
vectors = torch.nn.functional.normalize(features, dim=-1).cpu().numpy()

# Upsert one point per frame, keyed by run ID and frame index.
client = chromadb.HttpClient(host="localhost", port=8000)
col = client.get_or_create_collection("peepshow")
col.upsert(
    ids=[f"{run['runId']}:{i}" for i in range(len(frames))],
    embeddings=vectors.tolist(),
    metadatas=[{"path": f["path"], "ts": f["timestamp"]} for f in frames],
)
PY

# Step 3: query Chroma with a text embedding (also SigLIP) — no re-embed at scan time.
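
Step 3 might look something like the sketch below. It assumes the same checkpoint as the ingest step, a Chroma server on localhost:8000, and a collection named "peepshow"; `embed_text`, `search_frames`, and `l2_normalize` are illustrative names, not peepshow functions. The fixed-length padding (`max_length=64`) follows the usage shown in the SigLIP 2 model documentation. The distance ranking only matches cosine similarity if the stored frame vectors are unit-length too.

```python
import math

MODEL_ID = "google/siglip2-base-patch16-naflex"  # must match the ingest encoder

def l2_normalize(vec):
    """Scale a vector to unit length so distance rankings follow cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def embed_text(query: str):
    """Embed one text query with the SigLIP 2 text tower (heavy imports kept local)."""
    import torch
    from transformers import AutoModel, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModel.from_pretrained(MODEL_ID).eval()
    # Pad to a fixed 64 tokens, as in the SigLIP 2 documentation examples.
    inputs = processor(text=[query], padding="max_length", max_length=64,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        features = model.get_text_features(**inputs)
    return l2_normalize(features[0].tolist())

def search_frames(query: str, k: int = 5):
    """Return the k nearest stored frames for a text query."""
    import chromadb

    client = chromadb.HttpClient(host="localhost", port=8000)
    col = client.get_or_create_collection("peepshow")
    return col.query(query_embeddings=[embed_text(query)], n_results=k)
```

Calling `search_frames("whiteboard diagram")["metadatas"][0]` would then list the path and timestamp stored for each matching frame. In a long-running service, load the processor and model once rather than per query.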

Animated GIF / APNG / WebP — peepshow's killer move on SigLIP 2

SigLIP 2's image encoder reads JPEG / PNG. Animated GIF / APNG / WebP need flattening before they reach the embedder — peepshow does it automatically, so an animated product tour can be pre-indexed for vector search alongside ordinary video.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For SigLIP 2 specifically, these presets are worth knowing:

  • --strategy scene --max 20 · Default. Scene-change frames give the densest semantic coverage per embedding GPU-second.
  • --strategy scene --max 12 --dedup perceptual · Static / talking-head footage. Dedup before embedding so the vector DB doesn't store near-identical points.
  • --strategy fps --fps 0.5 --max 60 · Steady-cadence content (sport, gameplay). Predictable embedding density for similarity search.
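
To see why deduplicating before embedding helps, here is a minimal sketch of one common perceptual-dedup technique, an average hash compared by Hamming distance. This is an illustration of the general idea only: how peepshow's --dedup perceptual actually works is not described here, and the function names and the threshold are assumptions.

```python
def average_hash(gray):
    """Hash a small grayscale grid (rows of 0-255 ints): one bit per pixel, set if above the mean."""
    pixels = [p for row in gray for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)

def hamming(a, b):
    """Count the bit positions where two hashes differ."""
    return sum(x != y for x, y in zip(a, b))

def dedup(frames, max_distance=1):
    """Keep indices of frames whose hash differs from the last kept frame by more than max_distance bits."""
    kept, last = [], None
    for i, gray in enumerate(frames):
        h = average_hash(gray)
        if last is None or hamming(h, last) > max_distance:
            kept.append(i)
            last = h
    return kept

dark_left = [[10, 200], [10, 200]]
dark_left_again = [[12, 198], [11, 201]]  # near-identical to the first frame
dark_top = [[10, 10], [200, 200]]         # a different layout
print(dedup([dark_left, dark_left_again, dark_top]))  # [0, 2]
```

In practice each frame would first be downscaled to a small grayscale grid (8×8 is typical) before hashing; the 2×2 grids here just keep the example readable.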

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached, all from one SigLIP 2 run. See the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). SigLIP 2 produces vectors rather than text, so what flows back into the report is a short note about the embedding run. Whoever opens the report next sees what was indexed without re-running the embed pass.

echo '{"summary":"<note on the SigLIP 2 embedding run>","provider":"siglip2-base-patch16-naflex"}' \
  | peepshow report annotate "<outputDir>"
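
Inline JSON inside single quotes breaks as soon as the summary text contains an apostrophe. One way around that, sketched below, is to build the payload with json.dumps and pipe it to the same command on stdin. The `summary` and `provider` fields and the `peepshow report annotate` command come from the example above; the helper names and the sample summary text are made up for illustration.

```python
import json
import subprocess

def build_annotation(summary: str, provider: str) -> str:
    """Serialize the annotation; json.dumps handles quotes and apostrophes safely."""
    return json.dumps({"summary": summary, "provider": provider})

def annotate(output_dir: str, summary: str, provider: str) -> None:
    """Pipe the payload to `peepshow report annotate` on stdin, bypassing shell quoting."""
    subprocess.run(
        ["peepshow", "report", "annotate", output_dir],
        input=build_annotation(summary, provider),
        text=True,
        check=True,
    )

# Hypothetical summary text, including an apostrophe that would break the shell one-liner.
payload = build_annotation("SigLIP 2's embeddings: 20 frames indexed",
                           "siglip2-base-patch16-naflex")
print(payload)
```

`annotate("<outputDir>", ...)` then does the same job as the shell pipeline, with the output directory passed as a plain argument.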

When to skip peepshow + use SigLIP 2 directly

  • Need a chat completion, not an embedding — call Granite / Claude / GPT instead.
  • Corpus is already embedded with a different encoder (don't mix vector spaces).
  • Source is a single image — embed it directly without peepshow.

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → SigLIP 2 reads them as images + text.
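
The "don't mix vector spaces" case above is easy to hit by accident when an encoder is swapped. A small guard like the sketch below can catch it before any upsert. It assumes you record the encoder name and vector dimension in the collection's metadata at creation time; the `encoder` and `dim` keys and the function name are hypothetical conventions, not something peepshow or Chroma defines.

```python
def check_compatible(collection_meta: dict, encoder_id: str, dim: int) -> None:
    """Raise ValueError if the collection was built with a different encoder or dimension."""
    stored_id = collection_meta.get("encoder")
    stored_dim = collection_meta.get("dim")
    if stored_id is not None and stored_id != encoder_id:
        raise ValueError(f"collection was embedded with {stored_id}, not {encoder_id}")
    if stored_dim is not None and stored_dim != dim:
        raise ValueError(f"collection holds {stored_dim}-D vectors, got {dim}-D")

# A collection tagged at creation time with the encoder that filled it.
meta = {"encoder": "google/siglip2-base-patch16-naflex", "dim": 768}

check_compatible(meta, "google/siglip2-base-patch16-naflex", 768)  # passes silently

try:
    check_compatible(meta, "some-other-encoder", 768)
except ValueError as err:
    print("refused:", err)
```

A matching dimension alone is not enough: two different encoders can both emit 768-D vectors that live in unrelated spaces, which is why the encoder name is checked as well.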