Reel #M-12 Microsoft Phi-4 Multimodal

Phi-4-multimodal 5.6B

Phi-4-multimodal runs on a laptop. peepshow caps frame count so even a 5.6B vision model handles arbitrary video.

Microsoft Phi-4-multimodal (5.6B) and Phi-4-reasoning-vision (15B) are small, edge-friendly multimodal models. Their vision input is still images, not video containers. peepshow extracts scene frames and a transcript so that any Phi-4 vision deployment, from Azure AI Foundry to a Jetson / NIM container, can handle arbitrary video with predictable VRAM use.

Phi-4 Multimodal has no native video. peepshow bridges.

Phi-4 Multimodal's vision is image-only. peepshow turns video + animated formats into the frame timeline Phi-4 Multimodal already accepts.

  • Only path to video on Phi-4. Phi-4-multimodal / Phi-4-reasoning-vision accept images only. peepshow turns video into the input shape they already understand.
  • Tiny VRAM footprint. Phi-4-multimodal is 5.6B — fits in 8GB VRAM at 4-bit. peepshow trims frame count so the visual budget stays inside that envelope.
  • Edge + laptop friendly. Run Phi-4 on a Mac M-series, a Jetson Orin, an Intel iGPU, or a single consumer GPU. peepshow + whisper.cpp = fully offline video understanding.
  • Azure AI Foundry compatible. Phi-4 ships on Azure AI Foundry, NIM, and HuggingFace. peepshow's manifest feeds all three identically.
  • Animated GIF / APNG / WebP. Phi-4's vision adapter reads JPEG / PNG. peepshow flattens animated formats into still frames.
  • Reasoning variant included. Phi-4-reasoning-vision (15B) uses extended thinking on the same image shape — same peepshow bundle, deeper analysis.

Token-cost math (worked examples)

Clip | Native upload | peepshow + Phi-4 Multimodal
30s demo + Phi-4-multimodal 5.6B | | ~1.5K tokens, ~6GB VRAM peak
10-min lecture + Phi-4-multimodal 5.6B | | ~6K tokens, ~7GB VRAM, ~40s on M3 Max
1-hour CCTV + Phi-4-reasoning-vision 15B | | ~10K tokens, ~14GB VRAM
Edge / Jetson Orin Nano + Phi-4-multimodal Q4 | | ~6K tokens, ~5GB VRAM peak

Numbers were measured on Apple Silicon (M3 Max, MLX backend) and Jetson Orin Nano. CUDA performance varies by GPU. peepshow itself adds <2s overhead per clip.
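
The budget arithmetic behind these rows can be sketched as a small helper. This is an illustration only: maxFramesForBudget is a hypothetical function, not part of peepshow, and every number it receives (token budget, tokens per frame, transcript size) is an assumption the caller supplies. The source does not state a per-frame token cost for Phi-4, so measure it on your own deployment before relying on the result.

```javascript
// Hypothetical helper, not part of peepshow.
// Given a total token budget and an assumed cost per frame, return how many
// frames fit once the transcript and prompt have been accounted for.
function maxFramesForBudget({ tokenBudget, tokensPerFrame, transcriptTokens = 0, promptTokens = 0 }) {
  if (!(tokensPerFrame > 0)) {
    throw new RangeError("tokensPerFrame must be a positive number");
  }
  const remaining = tokenBudget - transcriptTokens - promptTokens;
  // Never return a negative frame count when the text alone exceeds the budget.
  return Math.max(0, Math.floor(remaining / tokensPerFrame));
}

// Illustrative inputs only: 300 tokens per frame is a placeholder, not a measured Phi-4 figure.
const frameCap = maxFramesForBudget({ tokenBudget: 6000, tokensPerFrame: 300, transcriptTokens: 1200 });
console.log(frameCap);
```

The returned value is the kind of number you would pass to --max when choosing a frame budget.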

Install (CLI)

npm install -g peepshow

# Local via transformers / MLX:
pip install transformers
# OR Azure AI Foundry (hosted):
export AZURE_OPENAI_API_KEY=...
export AZURE_OPENAI_ENDPOINT=https://your-foundry.openai.azure.com

peepshow ./demo.mp4 --emit json > run.json

Install (Phi-4 Multimodal API directly, no CLI)

Calling the Phi-4 Multimodal API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

# Hand frames + transcript to Phi-4-multimodal via Azure AI Foundry
node --input-type=module -e '
  import OpenAI from "openai";
  import { readFileSync } from "node:fs";
  const run = JSON.parse(readFileSync("run.json", "utf8"));
  const content = [
    { type: "text", text: "Summarise this clip." },
    ...run.frames.map(f => ({
      type: "image_url",
      image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
    })),
    { type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
  ];
  const client = new OpenAI({
    apiKey: process.env.AZURE_OPENAI_API_KEY,
    baseURL: process.env.AZURE_OPENAI_ENDPOINT + "/openai/v1",
  });
  const r = await client.chat.completions.create({
    model: "Phi-4-multimodal-instruct",
    messages: [{ role: "user", content }]
  });
  console.log(r.choices[0].message.content);
'
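
The inline script above sends every frame in run.frames. If the manifest holds more frames than your deployment can afford, a second cap can be applied in your own code before the request is built. The sketch below assumes the manifest shape used above (run.frames[].path and run.transcript.text); sampleEvenly and buildContent are hypothetical helper names, and the file reader is passed in as a function so the logic stays free of I/O.

```javascript
// Hypothetical helpers, not part of peepshow or the OpenAI SDK.

// Keep at most `max` items, evenly spaced, always including the first and last.
function sampleEvenly(items, max) {
  if (max <= 0) return [];
  if (items.length <= max) return items.slice();
  if (max === 1) return [items[0]];
  const step = (items.length - 1) / (max - 1);
  return Array.from({ length: max }, (_, i) => items[Math.round(i * step)]);
}

// Build the chat `content` array: prompt, capped frames as data URLs, then the transcript.
// `toBase64` is supplied by the caller and maps a frame path to a base64 string.
function buildContent(run, toBase64, { prompt = "Summarise this clip.", maxFrames = 16 } = {}) {
  const frames = sampleEvenly(run.frames ?? [], maxFrames);
  return [
    { type: "text", text: prompt },
    ...frames.map((f) => ({
      type: "image_url",
      image_url: { url: "data:image/jpeg;base64," + toBase64(f.path) },
    })),
    { type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") },
  ];
}
```

In Node, the reader could be (p) => readFileSync(p).toString("base64"), which matches what the inline script does for each frame.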

Animated GIF / APNG / WebP — peepshow's killer move on Phi-4 Multimodal

Phi-4's vision adapter expects JPEG / PNG. Animated GIF / APNG / WebP need flattening — peepshow does it automatically, so even a tiny edge deployment reads animated tutorials and UI walkthroughs.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For Phi-4 Multimodal specifically, these presets are worth knowing:

  • --strategy scene --max 8 --resize 512: Phi-4-multimodal 5.6B on Jetson / iGPU; minimal frames, half-res.
  • --strategy scene --max 16: Phi-4-multimodal 5.6B on a 12GB GPU; standard frame budget.
  • --strategy scene --max 24 --resize 1024: Phi-4-reasoning-vision 15B; full-res frames, reasoning runs longer.
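
If you drive peepshow from a script, the three presets above can be kept in a lookup table. This is a minimal sketch: the flag values are copied from the list, while the target names ("edge", "gpu-12gb", "reasoning-15b") and the peepshowArgs function are hypothetical. It also assumes the preset flags can be combined with --emit json, as in the install example.

```javascript
// Hypothetical preset table. Flag values come from the list above;
// the keys are made-up labels for this sketch.
const PHI4_PRESETS = {
  "edge": ["--strategy", "scene", "--max", "8", "--resize", "512"],
  "gpu-12gb": ["--strategy", "scene", "--max", "16"],
  "reasoning-15b": ["--strategy", "scene", "--max", "24", "--resize", "1024"],
};

// Return the argument vector for one peepshow run against the given target.
function peepshowArgs(input, target) {
  if (!Object.hasOwn(PHI4_PRESETS, target)) {
    throw new Error("Unknown target: " + target);
  }
  return [input, ...PHI4_PRESETS[target], "--emit", "json"];
}

console.log(["peepshow", ...peepshowArgs("./demo.mp4", "edge")].join(" "));
```

Returning an array rather than a string means the arguments can be handed to execFileSync from node:child_process without shell quoting.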

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, or file a GitHub issue with the offending frame attached, all from one Phi-4 Multimodal run. The full list is in the sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When Phi-4 Multimodal consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.

echo '{"summary":"<summary from Phi-4 Multimodal>","provider":"Phi-4-multimodal-instruct"}' \
  | peepshow report annotate "<outputDir>"
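
Model summaries often contain quotes, apostrophes, and newlines, which are awkward to embed in a shell string by hand. One way to avoid that is to build the payload in code and let JSON.stringify do the escaping. The sketch below uses only the two fields shown above (summary and provider); annotationPayload is a hypothetical helper name.

```javascript
// Hypothetical helper: serialise a model summary into the JSON shape shown above.
function annotationPayload(summary, provider = "Phi-4-multimodal-instruct") {
  if (typeof summary !== "string" || summary.trim() === "") {
    throw new TypeError("summary must be a non-empty string");
  }
  // JSON.stringify escapes quotes and newlines, so the result is a single safe line.
  return JSON.stringify({ summary: summary.trim(), provider });
}

console.log(annotationPayload('The presenter says "it\'s done"\nand closes the laptop.'));
```

The resulting string can be written to the stdin of peepshow report annotate, for example through the input option of execFileSync from node:child_process.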

When to skip peepshow + use Phi-4 Multimodal direct

  • Source is already a single still — call Phi-4 directly.
  • Running Phi-4 text-only (no -multimodal / -vision suffix) — no vision capability.
  • Need streaming-frame edge inference (peepshow is one-shot).

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → Phi-4 Multimodal reads them as images + text.