Phi-4 Multimodal has no native video. peepshow bridges.
Phi-4 Multimodal's vision is image-only. peepshow turns video + animated formats into the frame timeline Phi-4 Multimodal already accepts.
- Only path to video on Phi-4. Phi-4-multimodal and Phi-4-reasoning-vision take still images as their visual input, not video. peepshow turns video into the input shape they already understand.
- Tiny VRAM footprint. Phi-4-multimodal is 5.6B — fits in 8GB VRAM at 4-bit. peepshow trims frame count so the visual budget stays inside that envelope.
- Edge + laptop friendly. Run Phi-4 on a Mac M-series, a Jetson Orin, an Intel iGPU, or a single consumer GPU. peepshow + whisper.cpp = fully offline video understanding.
- Azure AI Foundry compatible. Phi-4 ships on Azure AI Foundry, NVIDIA NIM, and Hugging Face. peepshow's manifest feeds all three identically.
- Animated GIF / APNG / WebP. Phi-4's vision adapter reads JPEG / PNG. peepshow flattens animated formats into still frames.
- Reasoning variant included. Phi-4-reasoning-vision (15B) uses extended thinking on the same image shape — same peepshow bundle, deeper analysis.
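The 8GB figure for the 5.6B model can be sanity-checked with back-of-envelope arithmetic. The sketch below counts quantized weights only; activations, KV cache, and runtime overhead come on top, so treat the result as a lower bound. `weightMemoryGB` is a hypothetical helper, not part of peepshow.

```javascript
// Back-of-envelope estimate of weight memory for a quantized model.
// Counts weights only: activations, KV cache, and runtime overhead are extra.
function weightMemoryGB(paramCount, bitsPerWeight) {
  const bytes = paramCount * (bitsPerWeight / 8);
  return bytes / 1e9; // decimal gigabytes
}

// 5.6B parameters at 4-bit and at 16-bit precision.
const q4WeightsGB = weightMemoryGB(5.6e9, 4);
const fp16WeightsGB = weightMemoryGB(5.6e9, 16);
console.log(q4WeightsGB.toFixed(1), fp16WeightsGB.toFixed(1)); // 2.8 11.2
```

At 4-bit the weights take roughly 2.8GB, which leaves the rest of an 8GB card for frames and context. At 16-bit the same weights need about 11.2GB and no longer fit.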
Token-cost math (worked examples)
| Clip | Native upload | peepshow + Phi-4 Multimodal |
|---|---|---|
| 30s demo + Phi-4-multimodal 5.6B | — | ~1.5K tokens, ~6GB VRAM peak |
| 10-min lecture + Phi-4-multimodal 5.6B | — | ~6K tokens, ~7GB VRAM, ~40s on M3 Max |
| 1-hour CCTV + Phi-4-reasoning-vision 15B | — | ~10K tokens, ~14GB VRAM |
| Edge / Jetson Orin Nano + Phi-4-multimodal Q4 | — | ~6K tokens, ~5GB VRAM peak |
Numbers measured on Apple Silicon (M3 Max, MLX backend) and Jetson Orin Nano. CUDA performance varies by GPU. peepshow itself adds under 2s of overhead per clip.
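The token totals in the table depend mostly on how many frames reach the model. As a rough planning aid, the sketch below works out how many frames fit in a token budget once the transcript and prompt are accounted for. The per-frame token cost is an input you would need to measure for your own model and resolution; the numbers in the example call are placeholders, not Phi-4 measurements, and `maxFramesForBudget` is a hypothetical helper, not part of peepshow.

```javascript
// Hypothetical planning helper: how many frames fit in a token budget?
// tokensPerFrame is an assumption supplied by the caller (measure it for your model).
function maxFramesForBudget({ tokenBudget, tokensPerFrame, transcriptTokens = 0, promptTokens = 0 }) {
  if (tokensPerFrame <= 0) throw new RangeError("tokensPerFrame must be positive");
  const available = tokenBudget - transcriptTokens - promptTokens;
  // Never return a negative frame count when text alone exceeds the budget.
  return Math.max(0, Math.floor(available / tokensPerFrame));
}

// Placeholder numbers for illustration only.
const frameCap = maxFramesForBudget({
  tokenBudget: 6000,
  tokensPerFrame: 300,
  transcriptTokens: 1200,
  promptTokens: 50,
});
console.log(frameCap); // 15
```

The resulting cap is the kind of value you would pass to `--max`.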
Install (CLI)
npm install -g peepshow
# Local via transformers / MLX:
pip install transformers
# OR Azure AI Foundry (hosted):
export AZURE_OPENAI_API_KEY=...
export AZURE_OPENAI_ENDPOINT=https://your-foundry.openai.azure.com
peepshow ./demo.mp4 --emit json > run.json
Install (Phi-4 Multimodal API directly, no CLI)
Calling the Phi-4 Multimodal API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:
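# Hand frames + transcript to Phi-4-multimodal via Azure AI Foundry
# Requires the openai package (npm install openai).
# --input-type=module lets node -e accept import statements and top-level await.
node --input-type=module -e '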
import OpenAI from "openai";
import { readFileSync } from "node:fs";
const run = JSON.parse(readFileSync("run.json", "utf8"));
const content = [
{ type: "text", text: "Summarise this clip." },
...run.frames.map(f => ({
type: "image_url",
image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
})),
{ type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
];
const client = new OpenAI({
apiKey: process.env.AZURE_OPENAI_API_KEY,
baseURL: process.env.AZURE_OPENAI_ENDPOINT + "/openai/v1",
});
const r = await client.chat.completions.create({
model: "Phi-4-multimodal-instruct",
messages: [{ role: "user", content }]
});
console.log(r.choices[0].message.content);
'
Animated GIF / APNG / WebP: peepshow's killer move on Phi-4 Multimodal
Phi-4's vision adapter expects JPEG / PNG. Animated GIF / APNG / WebP need flattening — peepshow does it automatically, so even a tiny edge deployment reads animated tutorials and UI walkthroughs.
peepshow ./meme.gif # animated GIF → frame timeline
peepshow ./tutorial.apng # animated PNG → frames
peepshow ./loop.webp # animated WebP → frames
Frame strategy presets
peepshow picks scene-change frames by default. For Phi-4 Multimodal specifically, these presets are worth knowing:
- --strategy scene --max 8 --resize 512: Phi-4-multimodal 5.6B on Jetson / iGPU. Minimal frames, half-res.
- --strategy scene --max 16: Phi-4-multimodal 5.6B on a 12GB GPU. Standard frame budget.
- --strategy scene --max 24 --resize 1024: Phi-4-reasoning-vision 15B. Full-res frames, reasoning runs longer.
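If you drive peepshow from a script, the three presets can be kept in one lookup table. The sketch below is illustrative only: the target labels (`edge`, `gpu12`, `reasoning`) and the `peepshowArgs` helper are hypothetical, and it assumes the preset flags can be combined with `--emit json` in a single invocation.

```javascript
// Maps a deployment target to the preset flags listed above.
// Target labels are hypothetical names chosen for this sketch.
const PRESETS = new Map([
  ["edge", ["--strategy", "scene", "--max", "8", "--resize", "512"]],
  ["gpu12", ["--strategy", "scene", "--max", "16"]],
  ["reasoning", ["--strategy", "scene", "--max", "24", "--resize", "1024"]],
]);

function peepshowArgs(videoPath, target) {
  const flags = PRESETS.get(target);
  if (!flags) throw new Error(`Unknown target: ${target}`);
  // Input path first, then flags, mirroring the CLI examples.
  // Assumption: the preset flags and --emit json work together.
  return [videoPath, ...flags, "--emit", "json"];
}

console.log(["peepshow", ...peepshowArgs("./demo.mp4", "edge")].join(" "));
// peepshow ./demo.mp4 --strategy scene --max 8 --resize 512 --emit json
```

Returning an argument array rather than a string means the result can go straight to `child_process.spawn("peepshow", args)` without shell quoting.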
All 95 sinks still fire
Same CLI, same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, or file a GitHub issue with the offending frame attached, all from one Phi-4 Multimodal run. The full list is in the sink catalogue.
Report + LLM analysis loop
Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When Phi-4 Multimodal consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.
echo '{"summary":"<summary from Phi-4 Multimodal>","provider":"Phi-4-multimodal-instruct"}' \
| peepshow report annotate "<outputDir>"
When to skip peepshow + use Phi-4 Multimodal direct
- Source is already a single still — call Phi-4 directly.
- Running a text-only Phi-4 (no -multimodal / -vision suffix), which has no vision capability.
- Need streaming-frame edge inference (peepshow is one-shot).
For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → Phi-4 Multimodal reads them as images + text.