peepshow/ how-to/ frame-embeddings-vector-search

Reel #H-19 Per-frame CLIP/SigLIP vectors → vector DB

Index video frames into a vector DB for semantic search

Embed every frame at extraction time, push the vectors to your vector DB, and run semantic queries against them from then on. The `--embed-frames` flag makes peepshow shell out to an embedding CLI on PATH (CLIP, SigLIP, EVA-CLIP), attach the resulting vector to each frame, and let the vector sinks consume it directly, so no second embedding pass is needed.

Steps

  1. Install peepshow + an embedding CLI

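    Install `open-clip-torch` with pip and put a thin CLI wrapper around it, or use a ready-made embedding CLI such as `clip-cli` / `siglip-cli` if one is available to you (these names are placeholders; see Pitfalls).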

    npm install -g peepshow
    pip install open-clip-torch  # plus a clip-cli script of your choice

  2. Run with --embed-frames

    peepshow looks for `embed-cli`, then `clip-cli`, then `siglip-cli` on PATH and uses the first one it finds.

    peepshow ./demo.mp4 --embed-frames

  3. Pick a model

    Pass any `open_clip` model name or a HuggingFace SigLIP model id to `--embed-model`.

    peepshow ./demo.mp4 --embed-frames --embed-model siglip-base-patch16-naflex

  4. Push to a vector sink

    Add `--sink` and the vectors are written to the sink in the same run.

    peepshow ./demo.mp4 --embed-frames --sink chroma
    # query later:
    chroma-client query --collection peepshow --text 'person walking dog'
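Because the embedding CLIs named in the steps above are not standardised, one way to satisfy step 1 is a small wrapper around `open_clip`. The sketch below is only an illustration: the argument layout, the JSON-lines output, and the field names (`path`, `model`, `dim`, `vector`) are assumptions, not the contract peepshow expects, so adapt them to whatever your peepshow version actually reads. `open_clip`, `torch`, and Pillow are imported lazily, so the rest of the script can be exercised without them.

```python
#!/usr/bin/env python3
"""Hypothetical clip-cli: print one JSON line per image with its embedding."""
import argparse
import json
from typing import Callable, Iterable, Iterator, List, Optional

Embedder = Callable[[str], List[float]]


def load_open_clip_embedder(model_name: str, pretrained: str) -> Embedder:
    """Build an image embedder backed by open_clip (imported lazily)."""
    import open_clip
    import torch
    from PIL import Image

    # Returns (model, train_preprocess, eval_preprocess); use the eval one.
    model, _, preprocess = open_clip.create_model_and_transforms(
        model_name, pretrained=pretrained
    )
    model.eval()

    def embed(path: str) -> List[float]:
        image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        with torch.no_grad():
            features = model.encode_image(image)
        # L2-normalise so cosine similarity reduces to a dot product.
        features = features / features.norm(dim=-1, keepdim=True)
        return features[0].tolist()

    return embed


def embed_paths(
    paths: Iterable[str], embed: Embedder, model_name: str
) -> Iterator[str]:
    """Yield one JSON line per frame. The field names are assumptions."""
    for path in paths:
        vector = embed(path)
        yield json.dumps(
            {"path": path, "model": model_name, "dim": len(vector), "vector": vector}
        )


def main(argv: Optional[List[str]] = None) -> int:
    parser = argparse.ArgumentParser(description="Embed image files with open_clip.")
    parser.add_argument("images", nargs="*", help="frame image paths")
    parser.add_argument("--model", default="ViT-B-32")
    parser.add_argument("--pretrained", default="laion2b_s34b_b79k")
    args = parser.parse_args(argv)
    if not args.images:
        return 0  # nothing to embed, so skip loading the model
    embed = load_open_clip_embedder(args.model, args.pretrained)
    for line in embed_paths(args.images, embed, args.model):
        print(line)
    return 0


if __name__ == "__main__":
    main()
```

Saved as an executable named `clip-cli` on PATH, it could be called as `clip-cli frame_0001.jpg frame_0002.jpg --model ViT-B-32` (a hypothetical invocation). Keeping the model loader separate from `embed_paths` makes it straightforward to swap in a HuggingFace `transformers` SigLIP model instead.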

Why it works

Vector sinks need an embedding vector per frame. Without `--embed-frames`, sinks like Chroma have to compute embeddings themselves (per frame, at query time, or via a second batch pass). With `--embed-frames`, peepshow computes them once at extraction time, and the vectors flow through the same JSON manifest to the sink. The top-level `EmbeddingInfo` reports the model, the vector dimension, the number of frames embedded, and the total vector bytes.
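To make the `EmbeddingInfo` fields concrete, the sketch below recomputes the same four figures from per-frame vectors. The manifest shape (`embedding_model`, `frames`, `embedding`) and the float32 storage size are assumptions made for illustration; peepshow's real manifest schema may differ.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

BYTES_PER_FLOAT32 = 4  # assumption: vectors are stored as float32


@dataclass
class EmbeddingSummary:
    model: Optional[str]
    dim: int
    frames_embedded: int
    total_vector_bytes: int


def summarise_embeddings(manifest: Dict[str, Any]) -> EmbeddingSummary:
    """Recompute model, dim, frame count, and byte total from a manifest.

    Assumes a hypothetical manifest shape:
    {"embedding_model": str,
     "frames": [{"path": str, "embedding": [float, ...]}, ...]}
    """
    # Frames without an embedding are skipped rather than counted.
    vectors = [f["embedding"] for f in manifest.get("frames", []) if f.get("embedding")]
    dims = {len(v) for v in vectors}
    if len(dims) > 1:
        raise ValueError(f"mixed embedding dimensions: {sorted(dims)}")
    dim = dims.pop() if dims else 0
    return EmbeddingSummary(
        model=manifest.get("embedding_model"),
        dim=dim,
        frames_embedded=len(vectors),
        total_vector_bytes=len(vectors) * dim * BYTES_PER_FLOAT32,
    )
```

A summary like this is a cheap sanity check before a large ingest: if the frame count or dimension is not what the chosen model should produce, the run can be stopped before anything reaches the sink.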

When it helps

  • Long-term video archives — semantic search over years of footage.
  • Multimodal RAG — pair frame embeddings with transcript text in the same vector DB.
  • Surveillance / security forensics — 'find every frame that looks like this query image' across thousands of clips.
  • Creative / asset workflows — search a footage library by description ('sunset over water', 'busy street').
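All of the use cases above reduce to the same operation: embed the query (text or image) with the same model, then rank stored frame vectors by similarity. A vector DB does this with an index; the brute-force sketch below shows the idea in plain Python. The function names and the dictionary of frame ids are illustrative only, not part of peepshow or any sink.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], frames: Dict[str, Sequence[float]], k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k frame ids most similar to the query, best first."""
    scored = [(frame_id, cosine_similarity(query, vec)) for frame_id, vec in frames.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]
```

The query vector must come from the same model as the stored frame vectors; similarities between vectors from different models are not meaningful.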

Pitfalls

  • Embedding CLIs aren't standardised: `embed-cli`, `clip-cli`, and `siglip-cli` are placeholder names. Wire up your own thin wrapper around `open_clip` (the `open-clip-torch` package) or HuggingFace `transformers`.
  • Vector dim varies by model (CLIP ViT-B/32 = 512, ViT-L/14 = 768, SigLIP-Large = 1024). Sinks may need schema updates per model.
  • Every frame needs its own model forward pass, which is best run on a GPU; on CPU, long videos add minutes to extraction.
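The dimension pitfall can be caught before any vectors reach the sink. The sketch below is a hypothetical pre-flight check: `KNOWN_DIMS` repeats two of the figures quoted in the list above using `open_clip`-style model names, and `check_dims` is an illustrative helper, not a peepshow or sink API.

```python
from typing import Dict, Iterable, Sequence

# Output dimensions for two of the models quoted in the pitfalls list.
KNOWN_DIMS: Dict[str, int] = {"ViT-B-32": 512, "ViT-L-14": 768}


def check_dims(vectors: Iterable[Sequence[float]], expected_dim: int) -> int:
    """Return how many vectors were checked; raise on the first mismatch."""
    count = 0
    for index, vector in enumerate(vectors):
        if len(vector) != expected_dim:
            raise ValueError(
                f"vector {index} has dim {len(vector)}, "
                f"but the collection expects {expected_dim}"
            )
        count += 1
    return count
```

Running a check like this when switching `--embed-model` makes the failure explicit and early; a separate collection per model is usually the simplest way to avoid mixing dimensions.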
