Cohere Aya Vision has no native video. peepshow bridges.
Cohere Aya Vision's vision is image-only. peepshow turns video + animated formats into the frame timeline Cohere Aya Vision already accepts.
- Only path to video on Aya Vision. Aya Vision accepts images only. peepshow turns video into the JPEG sequence the model already understands — same pipeline whether you hit Cohere's API or self-host the open weights.
- Tuned for non-English content. Aya Vision excels on Arabic, Hindi, Japanese, Korean, Mandarin, and 18+ other languages. Scene-change extraction lands on the frames where text or UI changes — exactly where multilingual OCR / VLM signal lives.
- Drop-in for Command A Vision pipelines. Aya Vision shares Cohere's OpenAI-compatible vision endpoint shape — peepshow's JSON manifest drops in with the same one-line glue you'd use for Command A Vision.
- Open weights, research licence. Aya Vision ships on HuggingFace under a research-friendly licence — the same peepshow pipeline runs against a self-hosted vLLM endpoint with zero data egress.
- Strong on non-English UI screenshots. Mobile apps and SaaS dashboards in non-English locales render labels Aya Vision was trained to read. peepshow lands the right frames at scene changes for max signal-per-frame.
- Animated GIF / APNG / WebP. Aya Vision reads still images. peepshow normalises animated formats into a JPEG sequence so animated demos still reach the model.
Token-cost math (worked examples)
| Clip | Native upload | peepshow + Cohere Aya Vision |
|---|---|---|
| 30s product demo (peepshow) | — | ~3K (6 frames + transcript) |
| 10-min multilingual training (peepshow) | — | ~8K (20 scene frames + transcript) |
| 1-hour i18n UX review (peepshow) | — | ~15K (30 motion frames + sparse transcript) |
| 3-hour international conference (peepshow + chunked) | — | ~42K (60 scene frames + chaptered transcript) |
Aya Vision bills per image (high-detail) plus context. Self-host (vLLM on the HuggingFace weights) costs VRAM-seconds. Multilingual OCR runs cleaner on scene-change frames than naïve fps sampling because text typically changes on cuts, not within them.
Install (CLI)
npm install -g peepshow
# Cohere hosted (OpenAI-compatible):
export COHERE_API_KEY=co-...
# OR self-host the open weights via vLLM:
pip install vllm
python -m vllm.entrypoints.openai.api_server --model CohereForAI/aya-vision-8b
peepshow ./demo.mp4 --emit json > run.jsonInstall (Cohere Aya Vision API directly, no CLI)
Calling the Cohere Aya Vision API from your own code? Run peepshow first, then feed the JSON manifest in as multimodal parts:
# Hand frames + transcript to Aya Vision (OpenAI-compatible)
node -e '
import OpenAI from "openai";
import { readFileSync } from "node:fs";
const run = JSON.parse(readFileSync("run.json", "utf8"));
const content = [
{ type: "text", text: "What is happening in this clip? Note any non-English text." },
...run.frames.map(f => ({
type: "image_url",
image_url: { url: "data:image/jpeg;base64," + readFileSync(f.path).toString("base64") }
})),
{ type: "text", text: "Transcript:\n" + (run.transcript?.text ?? "") }
];
const client = new OpenAI({
apiKey: process.env.COHERE_API_KEY,
baseURL: "https://api.cohere.com/compatibility/v1",
});
const r = await client.chat.completions.create({
model: "aya-vision-8b",
messages: [{ role: "user", content }]
});
console.log(r.choices[0].message.content);
'Animated GIF / APNG / WebP — peepshow's killer move on Cohere Aya Vision
Aya Vision reads still images. peepshow normalises animated GIF / APNG / WebP into a JPEG sequence so multilingual product tours and animated UI walkthroughs reach Aya Vision without losing motion context.
peepshow ./meme.gif # animated GIF → frame timeline
peepshow ./tutorial.apng # animated PNG → frames
peepshow ./loop.webp # animated WebP → framesFrame strategy presets
peepshow picks scene-change frames by default. For Cohere Aya Vision specifically, these presets are worth knowing:
--strategy scene --max 16Default for multilingual app demos — scene detection lands on locale switches and screen transitions.--strategy scene --max 24 --resize 1024Non-English document / UI review — full-res frames so Aya Vision's multilingual OCR keeps small text legible.--strategy scene --max 8 --resize 768Aya Vision 8B on edge hardware — lean budget, native input resolution.
All 95 sinks still fire
Same CLI = same sinks. Push frames to SQLite, embed captions into Chroma, mirror to S3, drop a thumbnail in Slack, file a GitHub issue with the offending frame attached — all from one Cohere Aya Vision run. Browse the full sink catalogue →.
Report + LLM analysis loop
Every run also writes a self-contained report.html + manifest.json next to the frames (see the Report page). When Cohere Aya Vision consumes the frames, the analysis flows back into the report — whoever opens it next sees the model's understanding without re-running the prompt.
echo '{"summary":"<Cohere Aya Vision's summary>","provider":"aya-vision-8b"}' \
| peepshow report annotate "<outputDir>"When to skip peepshow + use Cohere Aya Vision direct
- Source is a single multilingual screenshot — call Aya Vision directly with the image.
- Running Aya 23 / Aya Expanse text-only — no vision capability.
- Workload is English-only and a smaller VLM (Granite Mini) is cheaper to run.
For everything beyond those edge cases, peepshow is the bridge: video + animated formats + transcript → Cohere Aya Vision reads them as images + text.