Reel #M-09 Cohere Command A Vision

Command A Vision

Command A Vision is enterprise-focused, OCR-grade, image-only. peepshow turns video into the 20-image batch it natively accepts.

Cohere's Command A Vision is an enterprise vision model — OCR-strong, document-grade, 128K context, up to 20 images per request. The API is OpenAI-compatible. peepshow extracts scene-change frames + transcript so any clip arrives as the 20-image bundle Command A Vision was designed for.

Cohere Command A Vision has no native video. peepshow bridges.

Cohere Command A Vision accepts images only. peepshow turns video and animated formats into the frame timeline the model already accepts.

  • 20-image cap matches peepshow's defaults. Command A Vision accepts up to 20 images per request — peepshow's default frame budget lands inside that limit with room for prompt + transcript.
  • OCR-grade vision needs OCR-grade frames. Scene-change extraction lands exactly on the frames where text or layout changes — slide decks, screen captures, document walkthroughs.
  • OpenAI-compatible shape. Cohere's vision endpoint takes `image_url` parts, so peepshow's JSON manifest drops in with a few lines of glue code.
  • 128K context, predictable spend. The large context goes to transcript and reasoning, not to per-second native-video billing. peepshow keeps the visual budget bounded.
  • Animated GIF / APNG / WebP. Cohere reads still images. peepshow flattens animated formats into a JPEG sequence.
  • Enterprise audit trail. Same auditable frame bundle peepshow gives every model — useful when compliance wants to see exactly what the LLM saw.

Token-cost math (worked examples)

| Clip | Native upload | peepshow + Cohere Command A Vision |
|---|---|---|
| 30s product demo | n/a (peepshow) | ~3K (6 frames + transcript) |
| 10-minute lecture | n/a (peepshow) | ~7K (20 frames, at the per-request cap, + transcript) |
| 1-hour CCTV reel | n/a (peepshow, chunked 3×) | ~22K (3 × 20-frame batches + sparse transcript) |
| 3-hour conference | n/a (peepshow + chunked) | ~64K (9 batches of 20 frames + chaptered transcript); fits 128K context |

Cohere bills per image (high-detail) plus context. The numbers are approximate, based on published Command A Vision pricing tiers. The 20-image cap forces chunking on longer clips, and peepshow does that chunking deterministically.
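To make the chunking arithmetic concrete, here is a minimal sketch of splitting a frame list into batches that respect the 20-image cap. It is not peepshow's own implementation: `chunkFrames` and the placeholder frame objects are hypothetical, and only the cap value comes from the text above.

```javascript
// Per-request image limit stated above for Command A Vision.
const IMAGE_CAP = 20;

// Hypothetical helper: split frames into consecutive batches of at most `cap`.
// Order is preserved, so the same input always yields the same batches.
function chunkFrames(frames, cap = IMAGE_CAP) {
  if (!Number.isInteger(cap) || cap < 1) {
    throw new RangeError("cap must be a positive integer");
  }
  const batches = [];
  for (let i = 0; i < frames.length; i += cap) {
    batches.push(frames.slice(i, i + cap));
  }
  return batches;
}

// 60 placeholder frames stand in for the 1-hour reel in the table.
const frames = Array.from({ length: 60 }, (_, i) => ({ path: `frame-${i}.jpg` }));
const batches = chunkFrames(frames);

// 60 frames at a cap of 20 gives 3 batches of 20 frames each.
console.log(batches.length, batches.map(b => b.length));
```

The same function applied to 180 frames yields the 9 batches quoted for the 3-hour conference row.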

Install (CLI)

npm install -g peepshow

# Set Cohere credentials:
export COHERE_API_KEY=co-...

# Run extraction (defaults sit nicely inside the 20-image cap):
peepshow ./demo.mp4 --emit json > run.json

Install (Cohere Command A Vision API directly, no CLI)

Calling the Cohere Command A Vision API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

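# Hand frames + transcript to Cohere Command A Vision (OpenAI-compatible).
# Requires the openai package in the current directory: npm install openai
# --input-type=module lets node -e accept the import statements and top-level await below.
node --input-type=module -e '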
  import OpenAI from "openai";
  import { readFileSync } from "node:fs";
  const run = JSON.parse(readFileSync("run.json", "utf8"));
  const content = [
    { type: "text", text: "Summarise this clip; extract any visible text." },
    ...run.frames.slice(0, 20).map(f => ({
      type: "image_url",
      image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
    })),
    { type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
  ];
  const client = new OpenAI({
    apiKey: process.env.COHERE_API_KEY,
    baseURL: "https://api.cohere.com/compatibility/v1",
  });
  const r = await client.chat.completions.create({
    model: "command-a-vision-07-2025",
    messages: [{ role: "user", content }]
  });
  console.log(r.choices[0].message.content);
'
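For clips that need more than one request, the same message shape can be built once per batch. The sketch below is one possible way to do that, not a documented peepshow or Cohere feature: `buildBatchMessages`, the injected `toDataUrl` encoder, and the "Part i of n" prompt wording are all hypothetical, and repeating the full transcript in every batch is a simplifying assumption.

```javascript
// Per-request image limit stated earlier on this page.
const MAX_IMAGES_PER_REQUEST = 20;

// Hypothetical helper: one OpenAI-style user message per batch of frames.
// `toDataUrl` is injected so the sketch stays free of file I/O.
function buildBatchMessages(batches, transcriptText, toDataUrl) {
  return batches.map((batch, i) => {
    if (batch.length > MAX_IMAGES_PER_REQUEST) {
      throw new RangeError(`batch ${i} exceeds ${MAX_IMAGES_PER_REQUEST} images`);
    }
    return {
      role: "user",
      content: [
        { type: "text", text: `Part ${i + 1} of ${batches.length}. Summarise these frames; extract any visible text.` },
        ...batch.map(f => ({ type: "image_url", image_url: { url: toDataUrl(f) } })),
        // Assumption: the whole transcript is repeated in every batch.
        { type: "text", text: "Transcript:\n" + transcriptText },
      ],
    };
  });
}

// Demo with placeholder frames and a stub encoder (no real image data).
const demoBatches = [
  Array.from({ length: 20 }, (_, i) => ({ path: `a-${i}.jpg` })),
  Array.from({ length: 5 }, (_, i) => ({ path: `b-${i}.jpg` })),
];
const stubDataUrl = f => "data:image/jpeg;base64,<" + f.path + ">";
const messages = buildBatchMessages(demoBatches, "hello", stubDataUrl);

// Each message carries its images plus two text parts: 22 and 7 parts here.
console.log(messages.map(m => m.content.length));

// Real use, assuming the `client` and model name from the snippet above:
// for (const msg of messages) {
//   const r = await client.chat.completions.create({
//     model: "command-a-vision-07-2025",
//     messages: [msg],
//   });
// }
```

Injecting the encoder keeps the message-building logic testable without touching the filesystem; in real use it would read `f.path` and base64-encode it as the snippet above does.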

Animated GIF / APNG / WebP — peepshow's killer move on Cohere Command A Vision

Command A Vision reads still images only. peepshow normalises animated GIF / APNG / WebP into a JPEG sequence, capped at 20 frames so the per-request limit is respected without manual pruning.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For Cohere Command A Vision specifically, these presets are worth knowing:

  • `--strategy scene --max 20`: Default. Sits exactly at the Cohere per-request image cap, so one call covers the whole clip.
  • `--strategy scene --max 12 --dedup perceptual`: Static / talking-head footage. Drops near-duplicates so the frame budget covers more meaningful change.
  • `--strategy fps --fps 0.5 --max 20`: Steady-motion content. Predictable cadence, still inside the cap.
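As a rough guide to the fixed-cadence preset, the arithmetic below estimates how many frames a clip yields and how much footage fits before the cap is reached. It assumes the fps strategy samples evenly from the start of the clip and that `--max` simply stops at the limit; peepshow's actual behaviour may differ, and both function names are hypothetical.

```javascript
// Hypothetical estimate: frames produced by even sampling, bounded by `max`.
function estimateFrameCount(durationSec, fps, max) {
  return Math.min(max, Math.floor(durationSec * fps));
}

// Hypothetical estimate: seconds of footage that fit before `max` is reached.
function secondsCoveredAtCap(fps, max) {
  return max / fps;
}

// Values from the preset above: --fps 0.5 --max 20.
console.log(estimateFrameCount(30, 0.5, 20));  // 30 s clip: 15 frames
console.log(estimateFrameCount(600, 0.5, 20)); // 10 min clip: capped at 20
console.log(secondsCoveredAtCap(0.5, 20));     // 40 seconds reach the cap
```

Under these assumptions, clips longer than about 40 seconds hit the cap at 0.5 fps, which is one reason to lower the rate or chunk longer footage.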

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached, all from one Cohere Command A Vision run. Browse the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When Cohere Command A Vision consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.

echo '{"summary":"<summary from Cohere Command A Vision>","provider":"command-a-vision-07-2025"}' \
  | peepshow report annotate "<outputDir>"
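Model summaries often contain quotes, apostrophes, or newlines that are awkward to embed in a shell string by hand. A minimal sketch of building the annotation payload with `JSON.stringify` instead is shown below; the `summary` and `provider` fields come from the command above, while `buildAnnotation` and the sample summary text are hypothetical.

```javascript
// Hypothetical helper: serialise the annotation so quoting is handled by JSON,
// not by the shell.
function buildAnnotation(summary, provider = "command-a-vision-07-2025") {
  if (typeof summary !== "string" || summary.trim() === "") {
    throw new TypeError("summary must be a non-empty string");
  }
  return JSON.stringify({ summary, provider });
}

// Sample text with characters that would break a hand-written shell string.
const payload = buildAnnotation(`The model's summary, with "quotes"\nand a second line.`);
console.log(payload);
```

Saved to a file (say `annotate.js`, a hypothetical name), the output can be piped straight into the command above: `node annotate.js | peepshow report annotate "<outputDir>"`.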

When to skip peepshow + use Cohere Command A Vision direct

  • Source is already a single document scan — call Cohere with the one image.
  • You are running text-only Command R+, which has no vision capability.
  • Need to exceed 20 images in one request (peepshow chunks instead).

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → Cohere Command A Vision reads them as images + text.