
Reel #M-03 OpenAI vision models

peepshow for models / gpt

GPT vision is image input only. peepshow turns video + animated formats into the frame timeline GPT can ingest.

OpenAI's vision API is image-only. No video container goes in directly. peepshow extracts scene-change frames and a transcript so GPT-4o / GPT-5 reads the video as a sequence of images — same shape GPT already accepts.

GPT-4o / GPT-5 has no native video. peepshow bridges.

GPT-4o / GPT-5's vision is image-only. peepshow turns video + animated formats into the frame timeline GPT-4o / GPT-5 already accepts.

  • Only path to video on GPT. OpenAI's Files + Vision APIs accept images. peepshow turns video into the input shape GPT already understands.
  • Pair with Whisper API. peepshow's transcription provider chain includes OpenAI Whisper — run extraction + transcription on the same vendor in one call.
  • Files API sink ships in-tree. `peepshow-sink-openai-files` uploads frames directly to OpenAI Files; reference them by `file_id` in subsequent Responses calls.
  • Animated GIF / APNG / WebP. GPT vision treats these as static images. peepshow flattens them into a JPEG sequence.
  • Token cost is roughly N frames × 85-170 tokens per image, so the budget is predictable. Native video would arrive uncapped; peepshow caps the frame count.
  • Reasoning models work too. OpenAI's o-series reasoning models (for example o3 and o4-mini) accept image content. peepshow's frame bundle feeds reasoning runs identically to chat runs.

Token-cost math (worked examples)

Clip | Native upload | peepshow + GPT-4o / GPT-5
30s product demo | (peepshow) | ~3K (6 frames at ~170 tok each + transcript)
10-minute lecture | (peepshow) | ~6K (20 scene frames + transcript)
1-hour CCTV reel | (peepshow) | ~10K (30 motion frames + sparse transcript)
3-hour conference | (peepshow + chunked) | ~28K (60 scene frames + chaptered transcript)

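Per-image token cost uses OpenAI's high-detail vision pricing for GPT-4o: ~170 tokens per 512×512 tile, plus a flat 85-token base per image, so the totals above are approximate. `detail: low` is cheaper at a flat ~85 tokens per image. Other models may use different constants.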
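The per-image figures can be reproduced with a short estimator. This is a sketch, assuming the tiling rules OpenAI documents for GPT-4o (fit within 2048×2048, shrink the shortest side to 768 px, then 85 base tokens plus 170 per 512×512 tile). `estimateImageTokens` is a hypothetical helper, not part of peepshow, and the "downscale only" behaviour for small images is an assumption.

```javascript
// Hypothetical helper: estimate vision tokens for one frame under the
// GPT-4o tiling rules (assumed; check the current OpenAI docs for your model).
function estimateImageTokens(width, height, detail = "high") {
  if (detail === "low") return 85; // flat cost regardless of size

  // Step 1: fit inside a 2048 x 2048 square (downscale only).
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768 px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w *= shrink;
  h *= shrink;

  // Step 3: count 512 px tiles; 170 tokens per tile plus an 85-token base.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(estimateImageTokens(1024, 1024));        // 765
console.log(estimateImageTokens(1920, 1080, "low")); // 85
```

Under these rules a single 512×512 frame comes out at 255 tokens (one tile plus the base), so the ~170 figure is best read as a per-tile cost.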

Install (CLI)

npm install -g peepshow

# Run peepshow with the OpenAI Files sink:
export OPENAI_API_KEY=sk-...
peepshow ./demo.mp4 --sink openai-files
# → frames uploaded to OpenAI Files, file-ids returned in the manifest.
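Once the sink has returned file IDs, a Responses request can reference them instead of inlining base64 data. A minimal sketch, assuming the IDs have already been pulled out of the manifest into an array. The manifest's field names are not shown on this page, so `fileIds` and the sample IDs are placeholders, and `buildContentFromFileIds` is a hypothetical helper.

```javascript
// Hypothetical helper: build Responses API content parts from uploaded frames.
// `fileIds` is assumed to be an array of OpenAI file IDs taken from the
// peepshow manifest; how they are stored there is not specified here.
function buildContentFromFileIds(fileIds, prompt) {
  return [
    { type: "input_text", text: prompt },
    // Each uploaded frame is referenced by ID rather than embedded as data.
    ...fileIds.map((id) => ({ type: "input_image", file_id: id })),
  ];
}

const content = buildContentFromFileIds(
  ["file-abc123", "file-def456"], // placeholder IDs
  "Summarise this clip."
);
console.log(JSON.stringify(content, null, 2));

// Pass it to the SDK the same way as inline images:
//   await client.responses.create({ model: "gpt-4o", input: [{ role: "user", content }] });
```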

Install (GPT-4o / GPT-5 API directly, no CLI)

Calling the GPT-4o / GPT-5 API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

# 1. Extract
peepshow ./demo.mp4 --emit json > run.json

# 2. Hand the frames + transcript to GPT
node --input-type=module -e '
  // --input-type=module lets node -e use import and top-level await.
  import OpenAI from "openai";
  import { readFileSync } from "node:fs";

  // Manifest written by step 1.
  const run = JSON.parse(readFileSync("run.json", "utf8"));

  // One prompt, then every frame as a base64 data URL, then the transcript.
  const content = [
    { type: "input_text", text: "Summarise this clip." },
    ...run.frames.map(f => ({
      type: "input_image",
      image_url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64"),
      detail: "high"
    })),
    { type: "input_text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
  ];

  const client = new OpenAI(); // reads OPENAI_API_KEY from the environment
  const r = await client.responses.create({
    model: "gpt-4o",
    input: [{ role: "user", content }]
  });
  console.log(r.output_text);
'

Animated GIF / APNG / WebP — peepshow's killer move on GPT-4o / GPT-5

GPT's vision endpoint reads animated images as still frame one. peepshow extracts every motion frame from animated GIF / APNG / WebP, so GPT sees the whole loop.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For GPT-4o / GPT-5 specifically, these presets are worth knowing:

  • `--strategy scene --max 24`: Default. 24 frames at high detail ≈ 4K tokens. Good for narrative video.
  • `--strategy scene --max 60 --detail low`: Long content with low-detail vision (flat 85 tok/image). 60 frames ≈ 5.1K tokens, so wide coverage still fits comfortably in the context window.
  • `--strategy fps --fps 1 --max 30 --sink openai-files`: Cache frames to OpenAI Files for reuse across multiple Responses calls (RAG-style).
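The arithmetic behind these presets is frames × per-image cost. A small sketch using the flat figures quoted on this page (~170 tokens per high-detail frame, 85 per low-detail frame). `frameBudget` and `maxFramesFor` are hypothetical planning helpers for choosing a `--max` value, not peepshow features.

```javascript
// Per-frame token figures as quoted on this page (approximations).
const TOKENS_PER_FRAME = { high: 170, low: 85 };

// Tokens spent on frames alone (transcript not included).
function frameBudget(frames, detail) {
  return frames * TOKENS_PER_FRAME[detail];
}

// Largest frame count that fits in a token budget after reserving
// room for the transcript.
function maxFramesFor(budgetTokens, transcriptTokens, detail) {
  const left = Math.max(0, budgetTokens - transcriptTokens);
  return Math.floor(left / TOKENS_PER_FRAME[detail]);
}

console.log(frameBudget(24, "high"));          // 4080
console.log(frameBudget(60, "low"));           // 5100
console.log(maxFramesFor(8000, 2000, "high")); // 35
```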

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached, all from one GPT-4o / GPT-5 run. Browse the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When GPT-4o / GPT-5 consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.

echo '{"summary":"<summary from GPT-4o / GPT-5>","provider":"gpt-4o"}' \
  | peepshow report annotate "<outputDir>"
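Model output often contains quotes and newlines that are awkward to embed in a shell string. One way around that is to build the payload with `JSON.stringify` and pipe it in. A sketch, assuming the `summary` and `provider` fields shown above are all the annotate command needs; `buildAnnotation` is a hypothetical helper.

```javascript
// Hypothetical helper: serialise a model summary for `peepshow report annotate`.
// JSON.stringify escapes quotes and newlines, so no shell quoting is needed.
function buildAnnotation(summary, provider) {
  return JSON.stringify({ summary, provider });
}

const payload = buildAnnotation(
  'The presenter says "hello",\nthen demos the app.',
  "gpt-4o"
);
process.stdout.write(payload + "\n");
```

Saved under a name of your choosing (say `annotate.js`, a placeholder), this could be run as `node annotate.js | peepshow report annotate "<outputDir>"`.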

When to skip peepshow + use GPT-4o / GPT-5 direct

  • OpenAI ships native video in the future and your clip is short.
  • You only need a single frame at a known timestamp.
  • Audio-only — use Whisper API directly, no frames needed.

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → GPT-4o / GPT-5 reads them as images + text.