
Reel #M-08 DeepSeek-OCR


DeepSeek-OCR

DeepSeek-OCR reads images with OCR precision. peepshow extracts the frame timeline so the model can read video as a sequence of stills.

DeepSeek-OCR is DeepSeek's latest open-weight vision-language model: OCR-focused, document-grade, and a replacement for DeepSeek-VL2 in most use cases. It accepts images, not video containers. peepshow extracts scene-change frames plus a transcript, so DeepSeek-OCR handles video as the image sequence it already understands, with the text in every frame intact.

DeepSeek has no native video. peepshow bridges.

DeepSeek's vision is image-only. peepshow turns video + animated formats into the frame timeline DeepSeek already accepts.

  • Only path to video on DeepSeek-OCR. DeepSeek-OCR accepts image inputs only. peepshow turns video into that shape, and the same applies to an existing DeepSeek-VL2 deployment.
  • Built for text-in-frame. DeepSeek-OCR is tuned for documents, screen captures, slide decks. peepshow's scene-change extractor lands exactly on the frames where text changes.
  • Open weights. DeepSeek-OCR ships with open weights on HuggingFace. Run on-prem with vLLM or transformers. peepshow pipeline unchanged.
  • Animated GIF / APNG / WebP. DeepSeek-OCR's image adapter expects JPEG / PNG. peepshow flattens animated formats into still frames, which is useful for animated UI walkthroughs.
  • Cost-bounded. Open-weights inference = compute cost. peepshow caps frame count so VRAM use stays predictable.
  • Reuse the bundle on DeepSeek text models. Once extracted, feed the transcript to DeepSeek-V3 / R1 for downstream reasoning.
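
The last bullet, reusing the transcript on a text-only model, can be sketched as below. This is a minimal illustration, not peepshow's own API: buildTextOnlyMessages is a hypothetical helper, and the only manifest field it assumes is run.transcript.text, the same field the Node example further down reads. The resulting array has the usual chat-completions message shape; which endpoint and model name you send it to is up to your deployment.

```javascript
// Hypothetical helper: turn a peepshow run manifest into text-only chat messages.
// Assumes the manifest exposes the transcript at run.transcript.text.
function buildTextOnlyMessages(run, question) {
  // Fall back to an empty transcript if the clip had no audio track.
  const transcript = run.transcript?.text ?? "";
  return [
    { role: "system", content: "Answer using only the supplied video transcript." },
    { role: "user", content: `Transcript:\n${transcript}\n\nQuestion: ${question}` },
  ];
}

// Example with an inline manifest stub instead of a real run.json.
const messages = buildTextOnlyMessages(
  { transcript: { text: "Welcome to the demo." } },
  "What is the clip about?"
);
console.log(messages[1].content);
```

No frames are sent here, so the request costs transcript tokens only.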

Token-cost math (worked examples)

Clip               | Native upload        | peepshow + DeepSeek
30s product demo   | (peepshow)           | ~2.5K (6 frames + transcript)
10-minute lecture  | (peepshow)           | ~6K (20 scene frames + transcript)
1-hour CCTV reel   | (peepshow)           | ~10K (30 motion frames + sparse transcript)
3-hour conference  | (peepshow + chunked) | ~28K (60 scene frames + chaptered transcript)

DeepSeek-OCR is open-weight — billed in VRAM-seconds rather than $/token if you self-host. Numbers above are token-equivalent for context-budget planning.
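
For planning a context budget of your own, the arithmetic behind figures like these is frame tokens plus transcript tokens. The sketch below is a rough estimator under stated assumptions: estimateContextTokens is a hypothetical helper, and tokensPerFrame and charsPerToken are illustrative placeholders, not measured DeepSeek-OCR values. Calibrate both against token counts from your own runs.

```javascript
// Rough context-budget estimate: frames * tokens-per-frame + transcript tokens.
// tokensPerFrame and charsPerToken are placeholder assumptions, not published
// DeepSeek-OCR figures.
function estimateContextTokens({ frames, tokensPerFrame, transcriptChars, charsPerToken = 4 }) {
  const frameTokens = frames * tokensPerFrame;
  // Round up so a partial token still counts against the budget.
  const transcriptTokens = Math.ceil(transcriptChars / charsPerToken);
  return { frameTokens, transcriptTokens, total: frameTokens + transcriptTokens };
}

// Illustrative inputs only: 6 frames at an assumed 300 tokens each,
// plus a 2,800-character transcript.
const budget = estimateContextTokens({ frames: 6, tokensPerFrame: 300, transcriptChars: 2800 });
console.log(budget); // { frameTokens: 1800, transcriptTokens: 700, total: 2500 }
```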

Install (CLI)

npm install -g peepshow

# Self-host DeepSeek-OCR via vLLM:
pip install vllm
python -m vllm.entrypoints.openai.api_server --model deepseek-ai/DeepSeek-OCR

peepshow ./demo.mp4 --emit json > run.json

Install (DeepSeek API directly, no CLI)

Calling the DeepSeek API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

# Frames + transcript → DeepSeek-OCR via OpenAI-compatible endpoint
node --input-type=module -e '
  import OpenAI from "openai";
  import { readFileSync } from "node:fs";
  const run = JSON.parse(readFileSync("run.json", "utf8"));
  const content = [
    { type: "text", text: "Transcribe and summarise this clip." },
    ...run.frames.map(f => ({
      type: "image_url",
      image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
    })),
    { type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
  ];
  const client = new OpenAI({ baseURL: "http://127.0.0.1:8000/v1", apiKey: "none" });
  const r = await client.chat.completions.create({
    model: "deepseek-ai/DeepSeek-OCR",
    messages: [{ role: "user", content }]
  });
  console.log(r.choices[0].message.content);
'
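
For long clips, such as the chunked 3-hour conference row in the table above, sending every frame in one request may not fit the context window. One option is to split the frame list into fixed-size batches and issue one request per batch. The sketch below assumes only that run.frames is an array, as in the script above; chunkFrames and the batch size are hypothetical, not part of peepshow.

```javascript
// Hypothetical helper: split a frame list into batches of at most batchSize.
function chunkFrames(frames, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < frames.length; i += batchSize) {
    // slice() clamps at the end of the array, so the last batch may be shorter.
    batches.push(frames.slice(i, i + batchSize));
  }
  return batches;
}

// Example: 5 stub frames in batches of 2 gives batch sizes 2, 2, 1.
const stubFrames = ["f1.jpg", "f2.jpg", "f3.jpg", "f4.jpg", "f5.jpg"].map(path => ({ path }));
console.log(chunkFrames(stubFrames, 2).map(batch => batch.length)); // [ 2, 2, 1 ]
```

Each batch can then be mapped to image_url parts exactly as in the script above, with the per-batch answers merged in a final text-only pass.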

Animated GIF / APNG / WebP — peepshow's killer move on DeepSeek

DeepSeek-OCR's image adapter reads JPEG / PNG. peepshow normalises animated GIF / APNG / WebP into a flat JPEG sequence — handy for screen-recording GIFs where the text changes per frame.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For DeepSeek specifically, these presets are worth knowing:

  • --strategy scene --max 16: Slide decks / screen recordings. Scene detection lands on slide transitions.
  • --strategy scene --max 30: Document review / OCR-heavy footage. More frames, higher recall on text changes.
  • --strategy fps --fps 1 --max 30: Steady-cadence sampling for sport, broadcast, gameplay (non-OCR use).
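
If you drive peepshow from a script, the three presets above can be kept in a small lookup table. This is a sketch only: the preset names (slides, documents, motion) and the peepshowArgs helper are hypothetical, and it assumes the strategy flags can be combined with the --emit json flag shown in the install section. The flag values themselves are copied from the list above.

```javascript
// Preset names are hypothetical labels; the flag values come from the list above.
const PRESETS = {
  slides: ["--strategy", "scene", "--max", "16"],
  documents: ["--strategy", "scene", "--max", "30"],
  motion: ["--strategy", "fps", "--fps", "1", "--max", "30"],
};

// Build an argv array for a peepshow invocation (assumes flags combine with --emit json).
function peepshowArgs(input, preset) {
  if (!Object.hasOwn(PRESETS, preset)) {
    throw new Error(`Unknown preset: ${preset}`);
  }
  return [input, ...PRESETS[preset], "--emit", "json"];
}

console.log(["peepshow", ...peepshowArgs("./demo.mp4", "slides")].join(" "));
// peepshow ./demo.mp4 --strategy scene --max 16 --emit json
```

Returning an array rather than a string means the arguments can be passed to a process spawner without shell quoting.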

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached, all from one DeepSeek run. Browse the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When DeepSeek consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.

echo '{"summary":"<DeepSeek summary>","provider":"deepseek-ocr"}' \
  | peepshow report annotate "<outputDir>"
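
Real model output often contains quotes, apostrophes, and newlines, which are awkward to embed in a hand-written shell string. A safer route is to build the payload with JSON.stringify and pipe the result to the command's stdin. In the sketch below, annotationPayload is a hypothetical helper; the summary and provider fields are the ones from the command above, and nothing else about the annotate input format is assumed.

```javascript
// Hypothetical helper: serialise a model summary for `peepshow report annotate`.
// JSON.stringify handles quotes, apostrophes, and newlines in the summary text.
function annotationPayload(summary, provider = "deepseek-ocr") {
  return JSON.stringify({ summary, provider });
}

// Example with characters that would break a naive single-quoted echo.
const payload = annotationPayload('The slide\'s title reads "Q3 results".\nRevenue is up.');
console.log(payload);
```

Write the payload to the stdin of peepshow report annotate "<outputDir>", for example through a child process, instead of pasting it into an echo.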

When to skip peepshow + use DeepSeek direct

  • Source is already a small set of images / a single document scan.
  • Running DeepSeek-V3 / R1, which are text-only and have no vision capability.
  • Need streaming-frame OCR on a live feed (peepshow is one-shot).

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → DeepSeek reads them as images + text.