Reel #M-11 NVIDIA Nemotron 3 Nano Omni

Nemotron 3 Nano Omni

Nemotron 3 Nano Omni reads vision, audio, and text in a single 30B MoE. peepshow keeps it predictable across long clips.

NVIDIA Nemotron 3 Nano Omni is a 30B mixture-of-experts that takes images, audio, and text natively in one call — released April 2026 with open weights. It doesn't decode video containers end-to-end. peepshow extracts scene-change frames + audio so the Omni endpoint sees the timeline as the multimodal bundle it was trained on.

NVIDIA Nemotron 3 Nano Omni has no native video. peepshow bridges.

Its vision input is image-only, so peepshow turns video and animated formats into the frame timeline the model already accepts.

  • Omni input shape, fed properly. Nemotron 3 Nano Omni accepts vision + audio + text together. peepshow's manifest carries frames and the source audio path — feed both in one call instead of running modalities separately.
  • Open weights via HuggingFace. Run Nemotron 3 Nano Omni on-prem under NVIDIA NIM or transformers. peepshow's pipeline is identical to the hosted route.
  • OpenRouter ready. OpenRouter exposes Nemotron 3 Nano Omni with an OpenAI-compatible shape — peepshow's JSON drops in with the standard `image_url` parts.
  • 30B MoE = light per-token compute. Only ~6B params are active per token, so each step is cheap, but all 30B weights still have to be resident in memory. peepshow trims frame count so an L40S / 4090 keeps up.
  • Animated GIF / APNG / WebP. Adapter expects still images. peepshow flattens animated formats.
  • Audio split-out optional. Send the raw audio to Nemotron's omni input, or run peepshow's whisper.cpp pass for a text transcript and save the audio tokens.
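
The raw-audio versus transcript choice in the last bullet can be sketched as a small helper. This is a minimal sketch, assuming the manifest exposes `audio.path` and `transcript.text` the way the API example on this page reads them; `audioParts`, `preferTranscript`, and `readBase64` are hypothetical names, not part of peepshow.

```javascript
// Hypothetical helper: decide how a clip's audio reaches the model.
// Assumes the manifest fields used elsewhere on this page:
// run.audio.path and run.transcript.text.
function audioParts(run, { preferTranscript = false, readBase64 } = {}) {
  const transcript = run.transcript?.text ?? "";
  const transcriptPart = { type: "text", text: "Transcript:\n" + transcript };

  // Transcript route: spends text tokens only, no audio tokens.
  if (preferTranscript && transcript) {
    return [transcriptPart];
  }
  // Raw-audio route: needs an audio path and a caller-supplied reader
  // that returns the file as base64 (format "wav" is an assumption).
  if (run.audio?.path && readBase64) {
    return [{
      type: "input_audio",
      input_audio: { data: readBase64(run.audio.path), format: "wav" },
    }];
  }
  // Neither route available: fall back to any transcript that exists.
  return transcript ? [transcriptPart] : [];
}
```

The returned array can be spread into the `content` list of a chat request next to the `image_url` parts.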

Token-cost math (worked examples)

Clip | Native upload | peepshow + NVIDIA Nemotron 3 Nano Omni
30s product demo (peepshow + audio) | n/a (no native video input) | ~5K (6 frames + raw audio + transcript)
10-minute lecture (peepshow + audio) | n/a (no native video input) | ~16K (20 scene frames + audio + transcript)
1-hour CCTV reel (peepshow, video only) | n/a (no native video input) | ~10K (30 motion frames + sparse transcript)
3-hour conference (peepshow + chunked audio) | n/a (no native video input) | ~40K (60 scene frames + chaptered audio + transcript)

Nemotron 3 Nano Omni is open-weight — self-hosted cost is VRAM-seconds, hosted cost is per-token on NIM / OpenRouter. Audio adds tokens but typically less than equivalent frame coverage of the same content.
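
The arithmetic behind a budget like the ones above is a sum of three terms: frames, audio, and transcript. The sketch below makes that explicit. The per-frame and per-second rates are hypothetical placeholders chosen only to show the calculation; they are not the rates behind the table and are not published figures for this model.

```javascript
// Rough token budget for one omni call.
// All rates are caller-supplied assumptions, not measured values.
function estimateTokens({ frames, audioSeconds = 0, transcriptTokens = 0 }, rates) {
  const frameTokens = frames * rates.tokensPerFrame;
  const audioTokens = audioSeconds * rates.tokensPerAudioSecond;
  return Math.round(frameTokens + audioTokens + transcriptTokens);
}

// Hypothetical rates, for illustration only.
const exampleRates = { tokensPerFrame: 500, tokensPerAudioSecond: 10 };

const total = estimateTokens(
  { frames: 6, audioSeconds: 30, transcriptTokens: 200 },
  exampleRates
);
console.log(total); // 6*500 + 30*10 + 200 = 3500
```

Swap in rates measured against your own endpoint before relying on the result.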

Install (CLI)

npm install -g peepshow

# Self-host via NVIDIA NIM (Docker):
docker run --gpus all -p 8000:8000 \
  nvcr.io/nim/nvidia/nemotron-3-nano-omni:latest

# OR OpenRouter:
export OPENROUTER_API_KEY=sk-or-...

peepshow ./demo.mp4 --emit json > run.json

Install (NVIDIA Nemotron 3 Nano Omni API directly, no CLI)

Calling the NVIDIA Nemotron 3 Nano Omni API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

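# Hand frames + audio to Nemotron 3 Nano Omni (OpenAI-compatible via NIM / OpenRouter).
# Requires the openai package in the working directory: npm install openai
# --input-type=module is needed so node -e accepts the import statements and top-level await.
node --input-type=module -e '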
  import OpenAI from "openai";
  import { readFileSync } from "node:fs";
  const run = JSON.parse(readFileSync("run.json", "utf8"));
  const content = [
    { type: "text", text: "Summarise this clip using both the frames and the audio." },
    ...run.frames.map(f => ({
      type: "image_url",
      image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
    })),
    ...(run.audio ? [{
      type: "input_audio",
      input_audio: { data: readFileSync(run.audio.path).toString("base64"), format: "wav" }
    }] : []),
    { type: "text", text: "Transcript fallback:\n" + (run.transcript?.text ?? "") }
  ];
  const client = new OpenAI({
    apiKey: process.env.OPENROUTER_API_KEY,
    baseURL: "https://openrouter.ai/api/v1",
  });
  const r = await client.chat.completions.create({
    model: "nvidia/nemotron-3-nano-omni",
    messages: [{ role: "user", content }]
  });
  console.log(r.choices[0].message.content);
'
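
The example above is wired to OpenRouter. For the self-hosted route, only the client settings should need to change. The helper below is a minimal sketch of that switch: `resolveEndpoint`, `NIM_BASE_URL`, and `NIM_API_KEY` are hypothetical names, and the default URL assumes the NIM container from the install step serves an OpenAI-compatible API under `/v1` on port 8000.

```javascript
// Pick OpenAI SDK client settings from environment variables.
// NIM_BASE_URL and NIM_API_KEY are hypothetical variable names.
function resolveEndpoint(env) {
  // Hosted route: use OpenRouter whenever its key is present.
  if (env.OPENROUTER_API_KEY) {
    return {
      baseURL: "https://openrouter.ai/api/v1",
      apiKey: env.OPENROUTER_API_KEY,
    };
  }
  // Self-hosted route: assumed local NIM endpoint on port 8000.
  return {
    baseURL: env.NIM_BASE_URL ?? "http://localhost:8000/v1",
    // The OpenAI SDK rejects a missing key; a local server may ignore the value.
    apiKey: env.NIM_API_KEY ?? "not-used",
  };
}
```

With this in place the client can be built as `new OpenAI(resolveEndpoint(process.env))`, leaving the rest of the request unchanged.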

Animated GIF / APNG / WebP — peepshow's killer move on NVIDIA Nemotron 3 Nano Omni

Nemotron 3 Nano Omni's vision adapter reads JPEG / PNG. peepshow flattens animated GIF / APNG / WebP into a frame sequence so animated content reaches the omni stack alongside the audio track.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For NVIDIA Nemotron 3 Nano Omni specifically, these presets are worth knowing:

  • --strategy scene --max 20: Default. Leaves headroom for the audio modality in the same call.
  • --strategy scene --max 30 --emit json: Long static content (CCTV / lecture). Combine with the audio path for an omni call that still fits VRAM.
  • --strategy fps --fps 1 --max 24: Steady-motion sport / gameplay with synced audio.
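
If you drive peepshow from a script, the three presets above can be kept in one lookup table. This is a minimal sketch: the keys `default`, `static`, and `motion` and the function `peepshowArgs` are hypothetical labels, while the flags are copied from the list above.

```javascript
// Map a content type to the CLI flags listed above.
// The preset keys are hypothetical labels; the flags come from this page.
const PRESETS = {
  default: ["--strategy", "scene", "--max", "20"],
  static: ["--strategy", "scene", "--max", "30", "--emit", "json"],
  motion: ["--strategy", "fps", "--fps", "1", "--max", "24"],
};

function peepshowArgs(file, kind = "default") {
  // Reject unknown keys (including inherited ones such as "toString").
  if (!Object.hasOwn(PRESETS, kind)) {
    throw new Error(`unknown preset: ${kind}`);
  }
  return [file, ...PRESETS[kind]];
}

console.log(["peepshow", ...peepshowArgs("./match.mp4", "motion")].join(" "));
// peepshow ./match.mp4 --strategy fps --fps 1 --max 24
```

Returning an argument array rather than a string keeps file names with spaces intact when the array is passed to a process spawner.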

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, or file a GitHub issue with the offending frame attached, all from one NVIDIA Nemotron 3 Nano Omni run. Browse the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When NVIDIA Nemotron 3 Nano Omni consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.

echo '{"summary":"<summary from NVIDIA Nemotron 3 Nano Omni>","provider":"nvidia/nemotron-3-nano-omni"}' \
  | peepshow report annotate "<outputDir>"
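
Model summaries often contain quotes, apostrophes, and newlines, which are easy to break when pasted into a shell string. A minimal sketch of building the payload in Node instead, assuming the annotate command accepts only the `summary` and `provider` fields shown above; `buildAnnotation` is a hypothetical helper name.

```javascript
// Build the JSON payload for the report annotate command.
// JSON.stringify escapes quotes and newlines in the model's summary.
// Only the two fields shown on this page are assumed to exist.
function buildAnnotation(summary, provider = "nvidia/nemotron-3-nano-omni") {
  if (typeof summary !== "string" || summary.length === 0) {
    throw new Error("summary must be a non-empty string");
  }
  return JSON.stringify({ summary, provider });
}

// Usage sketch (not executed here): pipe the payload to the CLI on stdin.
//   const { spawnSync } = require("node:child_process");
//   spawnSync("peepshow", ["report", "annotate", outputDir], {
//     input: buildAnnotation(modelSummary),
//   });
```

Passing the payload through `input` avoids shell quoting entirely, since no shell is involved.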

When to skip peepshow + use NVIDIA Nemotron 3 Nano Omni direct

  • Source is already an image + audio pair — call Nemotron directly.
  • Running Nemotron 3 Nano text-only (no Omni suffix) — no vision capability.
  • Need realtime streaming omni inference (peepshow is one-shot).

For everything beyond those edge cases, peepshow is the bridge: it turns video, animated formats, and the source audio into the images, audio, and text that NVIDIA Nemotron 3 Nano Omni already reads.