peepshow + Local LLMs — Local multimodal

Local LLMs have no native video support. peepshow bridges the gap.

Local LLM vision is image-only. peepshow turns video and animated formats into the frame timeline that local vision models already accept.

Fully offline. peepshow + whisper.cpp + Ollama = video → answer without any network hop. No cloud, no API key, no upload.
Cap visual VRAM. Local vision models choke on long video. peepshow trims the input to at most N frames so that a 7B or 11B multimodal model can fit in 12GB of VRAM.
Animated GIF / APNG / WebP. Most local vision adapters only accept JPEG/PNG. peepshow flattens animated formats.
Audio transcript as cheap text. whisper.cpp ships standalone and peepshow auto-detects it. Frames and transcript both reach the model as inexpensive inputs.
Vendor-agnostic. Same peepshow bundle feeds Ollama, LM Studio, llama.cpp, Jan, OpenWebUI, GPT4All. Pick or switch models freely.
No telemetry leakage. Run with `PEEPSHOW_TELEMETRY=0` for a hard offline pipeline.

Token-cost math (worked examples)

Clip	Native upload	peepshow + Local LLMs
30s clip + Llama 3.2 11B Vision	—	~3GB VRAM peak, ~2s/frame on M3 Max
10-min clip + Qwen2.5-VL 7B	—	~8GB VRAM, ~80s total for 20 frames
1-hour CCTV + Pixtral 12B	—	~14GB VRAM, ~3min for 30 frames
3-hour conference + LLaVA 13B	—	Chunk into 5-min segments; ~18GB VRAM peak

Numbers measured on Apple Silicon (M3 Max, MPS backend). x86 CUDA performance varies by GPU. peepshow itself adds under 2s of overhead per clip; the rest of the preprocessing time goes to ffmpeg and whisper.cpp.
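The rows above imply a per-frame latency and, for the conference row, a segment count. A minimal sketch of that arithmetic follows; the function names are illustrative, and "~3min" is taken as 180 seconds.

```javascript
// Rough per-frame latency from a total wall-clock time and a frame count.
function secondsPerFrame(totalSeconds, frameCount) {
  if (frameCount <= 0) throw new RangeError("frameCount must be positive");
  return totalSeconds / frameCount;
}

// Number of fixed-length segments needed to cover a recording.
function segmentCount(durationMinutes, segmentMinutes) {
  if (segmentMinutes <= 0) throw new RangeError("segmentMinutes must be positive");
  return Math.ceil(durationMinutes / segmentMinutes);
}

const qwenPerFrame = secondsPerFrame(80, 20);     // 10-min clip row: 80s for 20 frames
const pixtralPerFrame = secondsPerFrame(180, 30); // 1-hour CCTV row: ~3min for 30 frames
const conferenceSegments = segmentCount(180, 5);  // 3-hour conference in 5-min segments

console.log({ qwenPerFrame, pixtralPerFrame, conferenceSegments });
```

These figures only restate the table; real throughput depends on the model, quantisation, and hardware.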

Install (CLI)

# 1. Install peepshow + whisper.cpp + Ollama
npm install -g peepshow
brew install whisper-cpp ollama
ollama pull llama3.2-vision

# 2. Extract video → frames + transcript
peepshow ./demo.mp4 --emit json > run.json
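Before sending anything to a model, it can help to sanity-check the manifest. The sketch below assumes only the two fields that the Ollama snippet on this page reads, `frames[].path` and `transcript.text`; the sample object and the `summariseRun` helper are hypothetical, and the real run.json may carry more fields.

```javascript
// Hypothetical manifest: only frames[].path and transcript.text are taken
// from this page; everything else about the real run.json is unknown here.
const sampleRun = {
  frames: [{ path: "out/frame-0001.jpg" }, { path: "out/frame-0002.jpg" }],
  transcript: { text: "Hello and welcome to the demo." },
};

// Validate the fields a downstream request depends on and return a summary.
function summariseRun(run) {
  if (!Array.isArray(run.frames) || run.frames.length === 0) {
    throw new Error("manifest has no frames");
  }
  const missing = run.frames.filter((f) => typeof f.path !== "string");
  if (missing.length > 0) {
    throw new Error(missing.length + " frame(s) lack a path");
  }
  // The transcript is optional: a silent clip may not have one.
  const transcript = run.transcript?.text ?? "";
  return {
    frameCount: run.frames.length,
    hasTranscript: transcript.length > 0,
    transcriptChars: transcript.length,
  };
}

const summary = summariseRun(sampleRun);
console.log(summary);
```

To check a real run, replace `sampleRun` with the parsed contents of run.json.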

Install (Local LLMs API directly, no CLI)

Calling the Local LLMs API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

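# Send frames + transcript to Ollama directly.
# --input-type=module makes node treat the -e string as an ES module,
# which the import statement and top-level await below require.
node --input-type=module -e '
  import { readFileSync } from "node:fs";
  // run.json is the manifest written by the extract step
  const run = JSON.parse(readFileSync("run.json", "utf8"));
  const r = await fetch("http://127.0.0.1:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "llama3.2-vision",
      messages: [{
        role: "user",
        content: "Summarise this clip. Transcript: " + (run.transcript?.text ?? ""),
        // Ollama expects each image as a base64 string
        images: run.frames.map(f => readFileSync(f.path).toString("base64"))
      }],
      stream: false
    })
  });
  if (!r.ok) throw new Error("Ollama returned HTTP " + r.status);
  console.log((await r.json()).message.content);
'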
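Servers that expose an OpenAI-style chat completions endpoint take images as `image_url` content parts rather than Ollama's `images` array. The sketch below builds such a request body; `buildOpenAIChatBody` and the model name are hypothetical, the placeholder strings stand in for real frames, and it assumes the frames are JPEG.

```javascript
// Build an OpenAI-style chat completion body from base64-encoded JPEG frames.
function buildOpenAIChatBody(model, transcriptText, framesBase64) {
  // Each frame becomes a data URL inside an image_url content part.
  const imageParts = framesBase64.map((b64) => ({
    type: "image_url",
    image_url: { url: "data:image/jpeg;base64," + b64 },
  }));
  return {
    model,
    messages: [{
      role: "user",
      content: [
        { type: "text", text: "Summarise this clip. Transcript: " + transcriptText },
        ...imageParts,
      ],
    }],
    stream: false,
  };
}

// Tiny placeholder payloads instead of real frames, to keep the sketch runnable.
const fakeFrames = [
  Buffer.from("frame-1").toString("base64"),
  Buffer.from("frame-2").toString("base64"),
];
const body = buildOpenAIChatBody("my-local-vision-model", "hello", fakeFrames);
console.log(JSON.stringify(body, null, 2));
```

POST the result to the server's `/v1/chat/completions` route. Host, port, and model identifier depend on the server you run, so check its documentation rather than relying on the values here.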

Animated GIF / APNG / WebP — peepshow's killer move on Local LLMs

Most local vision adapters reject animated images outright. peepshow normalises GIF / APNG / WebP into JPEG sequences so any local multimodal model reads them.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames
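To decide which inputs need flattening at all, a script can sniff the container format from its leading bytes. This is a heuristic sketch based on the published GIF, PNG/APNG, and WebP signatures, not a description of how peepshow detects formats; `sniffImageFormat` is a hypothetical helper.

```javascript
// Identify the container format from its leading magic bytes.
function sniffImageFormat(buf) {
  const ascii = (start, end) => buf.subarray(start, end).toString("latin1");
  if (ascii(0, 6) === "GIF87a" || ascii(0, 6) === "GIF89a") return "gif";
  if (ascii(0, 4) === "RIFF" && ascii(8, 12) === "WEBP") return "webp";
  const pngSig = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
  if (buf.subarray(0, 8).equals(pngSig)) {
    // APNG marks itself with an acTL chunk that precedes the first IDAT chunk.
    // Heuristic: a plain byte search, not a full chunk parser.
    const actl = buf.indexOf("acTL", 8, "latin1");
    const idat = buf.indexOf("IDAT", 8, "latin1");
    return actl !== -1 && (idat === -1 || actl < idat) ? "apng" : "png";
  }
  return "unknown";
}

// Synthetic headers, just enough bytes to exercise each branch.
const sig = Buffer.from([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a]);
const samples = {
  gif: Buffer.from("GIF89a", "latin1"),
  webp: Buffer.concat([Buffer.from("RIFF"), Buffer.alloc(4), Buffer.from("WEBP")]),
  png: Buffer.concat([sig, Buffer.from("IHDRIDAT")]),
  apng: Buffer.concat([sig, Buffer.from("IHDRacTLIDAT")]),
};
for (const [name, bytes] of Object.entries(samples)) {
  console.log(name, "->", sniffImageFormat(bytes));
}
```

A GIF or WebP file may still hold a single frame, so this only narrows down candidates; it does not prove that a file is animated.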

Frame strategy presets

peepshow picks scene-change frames by default. For Local LLMs specifically, these presets are worth knowing:

`--strategy scene --max 12`: Lean preset for 8GB VRAM cards. Keeps Llama 3.2 11B Vision comfortable.
`--strategy scene --max 6 --resize 512`: Tight VRAM budget, with half-res frames and a 6-frame max. Works on 6GB cards.
`--strategy fps --fps 0.5 --max 30`: Steady-cadence sampling for long static content. Pairs with chunked inference.
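How a fixed cadence interacts with a frame cap is easiest to see with numbers. The sketch below assumes one plausible policy: when the cadence would exceed the cap, spread the allowed frames evenly over the clip. That policy and the `sampleTimestamps` helper are assumptions for illustration, not a description of peepshow's internals.

```javascript
// Assumption: when fps * duration exceeds `max`, spread `max` frames evenly
// over the clip instead of truncating the tail.
function sampleTimestamps(durationSec, fps, max) {
  if (durationSec <= 0 || fps <= 0 || max <= 0) return [];
  const wanted = Math.floor(durationSec * fps);
  const count = Math.max(1, Math.min(wanted, max));
  const step = durationSec / count;
  // Take the midpoint of each slot so no sample lands exactly on the clip edge.
  return Array.from({ length: count }, (_, i) => (i + 0.5) * step);
}

const short = sampleTimestamps(20, 0.5, 30);  // under the cap: cadence is kept
const long = sampleTimestamps(3600, 0.5, 30); // over the cap: spacing widens
console.log(short.length, long.length, long[1] - long[0]);
```

Under this policy a 20-second clip at 0.5 fps keeps its 2-second cadence, while a one-hour recording capped at 30 frames ends up with one frame every 120 seconds, which is why long footage pairs better with chunked inference.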

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached, all from one Local LLMs run. Browse the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html and manifest.json next to the frames (see the Report page). After a local model has analysed the frames, pipe its output back into the report so that whoever opens it next sees the model's understanding without re-running the prompt:

echo '{"summary":"<Local LLMs's summary>","provider":"llama3.2-vision"}' \
  | peepshow report annotate "<outputDir>"
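Model output often contains quotes and newlines that break a hand-written `echo` string. A minimal sketch of building the same payload in Node follows; `buildAnnotation` is a hypothetical helper, and it uses only the `summary` and `provider` keys shown in the shell example above.

```javascript
// Build the JSON document piped into `peepshow report annotate`.
// JSON.stringify handles escaping of quotes and newlines in the summary.
function buildAnnotation(summary, provider) {
  if (typeof summary !== "string" || summary.trim() === "") {
    throw new Error("summary must be a non-empty string");
  }
  return JSON.stringify({ summary: summary.trim(), provider });
}

const payload = buildAnnotation(
  '  The clip shows a "hello world" demo.\nNo speech detected.  ',
  "llama3.2-vision"
);
console.log(payload);
```

The resulting string can be written to the command's stdin, for example through the `input` option of `child_process.spawnSync`, with the output directory passed as the final argument as in the shell example.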

When to skip peepshow + use Local LLMs direct

Footage is already a small set of stills.
Running a non-vision model — text-only LLMs don't read frames.
Need realtime streaming inference — peepshow is one-shot, not streaming.

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → Local LLMs reads them as images + text.

