Reel #M-10 Cohere Aya Vision

Aya Vision

Aya Vision is Cohere's multilingual research VLM: 23+ languages, OCR-strong, image-only. peepshow turns video into the frame batch it natively reads.

Cohere's Aya Vision is a multilingual research vision-language model — 23+ languages out of the box, strong on non-English document / UI screenshots, released under a research-friendly licence on HuggingFace. It reads still images, not video. peepshow extracts scene-change frames + transcript so any clip arrives in the shape Aya Vision was trained on.

Cohere Aya Vision has no native video. peepshow bridges.

Cohere Aya Vision accepts images only. peepshow turns video and animated formats into the frame timeline the model already accepts.

  • Only path to video on Aya Vision. Aya Vision accepts images only. peepshow turns video into the JPEG sequence the model already understands — same pipeline whether you hit Cohere's API or self-host the open weights.
  • Tuned for non-English content. Aya Vision excels on Arabic, Hindi, Japanese, Korean, Mandarin, and 18+ other languages. Scene-change extraction lands on the frames where text or UI changes — exactly where multilingual OCR / VLM signal lives.
  • Drop-in for Command A Vision pipelines. Aya Vision shares Cohere's OpenAI-compatible vision endpoint shape — peepshow's JSON manifest drops in with the same one-line glue you'd use for Command A Vision.
  • Open weights, research licence. Aya Vision ships on HuggingFace under a research-friendly licence — the same peepshow pipeline runs against a self-hosted vLLM endpoint with zero data egress.
  • Strong on non-English UI screenshots. Mobile apps and SaaS dashboards in non-English locales render labels Aya Vision was trained to read. peepshow lands the right frames at scene changes for max signal-per-frame.
  • Animated GIF / APNG / WebP. Aya Vision reads still images. peepshow normalises animated formats into a JPEG sequence so animated demos still reach the model.

Token-cost math (worked examples)

Clip | Native upload | peepshow + Cohere Aya Vision
30s product demo | (peepshow) | ~3K (6 frames + transcript)
10-min multilingual training | (peepshow) | ~8K (20 scene frames + transcript)
1-hour i18n UX review | (peepshow) | ~15K (30 motion frames + sparse transcript)
3-hour international conference | (peepshow + chunked) | ~42K (60 scene frames + chaptered transcript)

Aya Vision bills per image (high-detail) plus context. Self-host (vLLM on the HuggingFace weights) costs VRAM-seconds. Multilingual OCR runs cleaner on scene-change frames than naïve fps sampling because text typically changes on cuts, not within them.
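
As a quick sanity check on the table, each total can be divided by its frame count. The sketch below uses only the rounded figures quoted above; the helper name `tokensPerFrame` is illustrative, and because every total includes the transcript, the result is an upper bound on what a single frame costs rather than a billing figure.

```javascript
// Rounded figures copied from the table above (totals include the transcript).
const rows = [
  { clip: "30s product demo", frames: 6, totalTokens: 3000 },
  { clip: "10-min multilingual training", frames: 20, totalTokens: 8000 },
  { clip: "1-hour i18n UX review", frames: 30, totalTokens: 15000 },
  { clip: "3-hour international conference", frames: 60, totalTokens: 42000 },
];

// Hypothetical helper: average token budget per frame, transcript share included.
function tokensPerFrame(row) {
  return Math.round(row.totalTokens / row.frames);
}

for (const row of rows) {
  console.log(`${row.clip}: ~${tokensPerFrame(row)} tokens per frame`);
}
```

The per-frame average stays in the hundreds of tokens across all four rows, so total cost is driven mainly by how many frames the chosen strategy keeps.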

Install (CLI)

npm install -g peepshow

# Cohere hosted (OpenAI-compatible):
export COHERE_API_KEY=co-...

# OR self-host the open weights via vLLM:
pip install vllm
python -m vllm.entrypoints.openai.api_server --model CohereForAI/aya-vision-8b

peepshow ./demo.mp4 --emit json > run.json

Install (Cohere Aya Vision API directly, no CLI)

Calling the Cohere Aya Vision API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:

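# Hand frames + transcript to Aya Vision (OpenAI-compatible).
# --input-type=module lets node -e parse the import statements and top-level await.
node --input-type=module -e '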
  import OpenAI from "openai";
  import { readFileSync } from "node:fs";
  const run = JSON.parse(readFileSync("run.json", "utf8"));
  const content = [
    { type: "text", text: "What is happening in this clip? Note any non-English text." },
    ...run.frames.map(f => ({
      type: "image_url",
      image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
    })),
    { type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
  ];
  const client = new OpenAI({
    apiKey: process.env.COHERE_API_KEY,
    baseURL: "https://api.cohere.com/compatibility/v1",
  });
  const r = await client.chat.completions.create({
    model: "aya-vision-8b",
    messages: [{ role: "user", content }]
  });
  console.log(r.choices[0].message.content);
'
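
The same request can be pointed at a self-hosted vLLM server instead of Cohere's hosted endpoint. The sketch below is one way to switch between the two. It assumes vLLM is listening on its default port 8000 and serves the model under the name passed to `--model`; the helper `resolveEndpoint` and the client-side `VLLM_API_KEY` variable are hypothetical names, not part of peepshow.

```javascript
// Hypothetical helper: choose OpenAI-client settings for hosted Cohere or local vLLM.
// Assumes vLLM runs on its default port 8000; adjust if you started it with --port.
function resolveEndpoint(target, env = process.env) {
  if (target === "vllm") {
    return {
      baseURL: "http://localhost:8000/v1",
      // The key is only checked if the vLLM server was started with an API key.
      apiKey: env.VLLM_API_KEY ?? "EMPTY",
      // By default vLLM serves the model under the name passed to --model.
      model: "CohereForAI/aya-vision-8b",
    };
  }
  return {
    baseURL: "https://api.cohere.com/compatibility/v1",
    apiKey: env.COHERE_API_KEY,
    model: "aya-vision-8b", // model name as used in the script above
  };
}

// Usage: const { baseURL, apiKey, model } = resolveEndpoint("vllm");
//        const client = new OpenAI({ apiKey, baseURL });
console.log(resolveEndpoint("vllm").baseURL);
```

Keeping the endpoint choice in one place means the frame and transcript handling stays identical for both targets.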

Animated GIF / APNG / WebP — peepshow's killer move on Cohere Aya Vision

Aya Vision reads still images. peepshow normalises animated GIF / APNG / WebP into a JPEG sequence so multilingual product tours and animated UI walkthroughs reach Aya Vision without losing motion context.

peepshow ./meme.gif         # animated GIF → frame timeline
peepshow ./tutorial.apng    # animated PNG → frames
peepshow ./loop.webp        # animated WebP → frames

Frame strategy presets

peepshow picks scene-change frames by default. For Cohere Aya Vision specifically, these presets are worth knowing:

  • --strategy scene --max 16: Default for multilingual app demos. Scene detection lands on locale switches and screen transitions.
  • --strategy scene --max 24 --resize 1024: Non-English document / UI review. Higher-resolution frames so Aya Vision's multilingual OCR keeps small text legible.
  • --strategy scene --max 8 --resize 768: Aya Vision 8B on edge hardware. Lean frame budget, native input resolution.
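
If you script peepshow from Node, the presets above can live in a small lookup table. This is a minimal sketch: the preset names and the `peepshowArgs` helper are illustrative labels, and it assumes the strategy flags can be combined with the `--emit json` flag from the install example.

```javascript
// Preset flags copied from the list above; the preset names are illustrative.
const PRESETS = {
  "app-demo": ["--strategy", "scene", "--max", "16"],
  "document-review": ["--strategy", "scene", "--max", "24", "--resize", "1024"],
  "edge-8b": ["--strategy", "scene", "--max", "8", "--resize", "768"],
};

// Build the argument list for one peepshow run.
// Assumption: strategy flags and --emit json can be used together.
function peepshowArgs(input, preset) {
  const flags = PRESETS[preset];
  if (!flags) throw new Error(`Unknown preset: ${preset}`);
  return [input, ...flags, "--emit", "json"];
}

console.log(["peepshow", ...peepshowArgs("./demo.mp4", "document-review")].join(" "));
```

Returning an array rather than a single string lets you hand the arguments to `child_process.spawn` without shell quoting issues.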

All 95 sinks still fire

Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, or file a GitHub issue with the offending frame attached, all from one Cohere Aya Vision run. Browse the full sink catalogue.

Report + LLM analysis loop

Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When Cohere Aya Vision consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.

echo '{"summary":"<summary from Cohere Aya Vision>","provider":"aya-vision-8b"}' \
  | peepshow report annotate "<outputDir>"
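
Model summaries often contain quotes, apostrophes, or newlines that are awkward to escape in a shell one-liner. A minimal sketch of the same annotate step from Node follows. The helper names `annotationPayload` and `annotateReport` are hypothetical, and only the `summary` and `provider` fields shown above are assumed.

```javascript
const { spawnSync } = require("node:child_process");

// Serialise the annotation; JSON.stringify escapes quotes and newlines in the summary.
function annotationPayload(summary, provider = "aya-vision-8b") {
  return JSON.stringify({ summary, provider });
}

// Pipe the payload to `peepshow report annotate <outputDir>` on stdin.
function annotateReport(outputDir, summary) {
  return spawnSync("peepshow", ["report", "annotate", outputDir], {
    input: annotationPayload(summary),
    encoding: "utf8",
  });
}

// Usage: annotateReport("<outputDir>", modelReply);
console.log(annotationPayload(`The presenter's slide reads "ようこそ".`));
```

Passing the payload on stdin mirrors the pipe in the shell example while bypassing shell quoting altogether.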

When to skip peepshow + use Cohere Aya Vision direct

  • Source is a single multilingual screenshot — call Aya Vision directly with the image.
  • Running Aya 23 / Aya Expanse text-only — no vision capability.
  • Workload is English-only and a smaller VLM (Granite Mini) is cheaper to run.

For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → Cohere Aya Vision reads them as images + text.