# Audio transcription

*Local by default · six provider options*
peepshow already extracts the audio track from any video input. When transcription is enabled, the spoken words get attached to the same JSON payload — next to video, frames, and audio — so every downstream sink sees the transcript without any extra plumbing. The default provider is whisper.cpp, which runs entirely on your own machine.
## How it works
Every peepshow run follows the same pipeline. Transcription is the final step, fed by the audio that gets extracted alongside the frames:
- **ffmpeg decode** → scene-detect or fps-sample frames.
- **ffmpeg audio pass** → mono 16 kHz AAC `audio.m4a` + loudness peak + silence ratio.
- **Transcribe (if enabled)** → the audio file is handed to the selected provider; segments + full text return on `audio.transcript`.
- **Emit + fan out** → the payload is serialised for `--emit json/caveman/markdown` and streamed into every auto-sink and `--sink`.
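As a rough sketch, the payload those steps assemble might look like the following. Only `audio.transcript` (segments + full text) and the documented audio stats come from the description above; the exact names of the other fields are assumptions for illustration.

```python
# Illustrative sketch of the peepshow JSON payload shape.
# Field names outside audio.transcript are assumptions, not the real schema.
payload = {
    "video": {"source": "./talk.mp4"},                 # assumed field name
    "frames": [{"time": 0.0}, {"time": 4.2}],          # scene-detect / fps-sampled
    "audio": {
        "file": "audio.m4a",                           # mono 16 kHz AAC track
        "loudnessPeak": -3.1,                          # assumed field name
        "silenceRatio": 0.12,                          # assumed field name
        "transcript": {
            "segments": [{"start": 0.0, "end": 2.4, "text": "Hello everyone."}],
            "text": "Hello everyone.",
        },
    },
}

# Every sink reads the same shape, so downstream code only needs this path:
text = payload["audio"]["transcript"]["text"]
```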
Detection runs once at startup. If no provider is configured and whisper.cpp is on `PATH`, peepshow auto-enables `--transcribe whisper-cpp`. If the user explicitly passes `--no-transcribe` or sets `PEEPSHOW_TRANSCRIBE=off`, the transcription step is skipped even when a binary is available.
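That precedence (explicit flag beats environment variable beats auto-detection) can be sketched as a small decision function. This is illustrative, not peepshow's actual internals; the function name and signature are made up.

```python
import shutil


def pick_provider(cli_flag=None, env=None):
    """Return the transcription provider to use, or None to skip.

    Hypothetical sketch of the startup detection described above:
    --transcribe/--no-transcribe > PEEPSHOW_TRANSCRIBE > PATH auto-detect.
    """
    if cli_flag == "off":                 # --no-transcribe
        return None
    if cli_flag:                          # --transcribe <provider>
        return cli_flag
    if env == "off":                      # PEEPSHOW_TRANSCRIBE=off
        return None
    if env:                               # PEEPSHOW_TRANSCRIBE=<provider>
        return env
    if shutil.which("whisper-cli"):       # auto-enable local whisper.cpp
        return "whisper-cpp"
    return None
```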
## Providers
Six providers ship built in. Pick the one that matches your privacy, cost, and latency constraints — they all produce the same `audio.transcript` shape, so downstream code doesn't care which one ran.
| Provider | Cost | Runs on | Setup |
|---|---|---|---|
| `whisper-cpp` (default when on PATH) | Free | Your CPU (Metal on macOS, CUDA / OpenBLAS optional on Linux) | `brew install whisper-cpp` (macOS) · `scoop install whisper-cpp` (Windows) · prebuilt Linux releases. Default model: `base.en`. One-time download to `~/.peepshow/whisper-models/ggml-<model>.bin`. |
| `openai` | Usage-based — see openai.com/api/pricing | OpenAI cloud | `export OPENAI_API_KEY=…`. Default model `whisper-1`; OpenAI also offers `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` (override with `PEEPSHOW_TRANSCRIBE_MODEL`). Single-request API; files capped at 25 MB per upload (peepshow's 16 kHz mono AAC track is small — a ≈ 30-minute clip fits comfortably). |
| `groq` | Usage-based — see groq.com/pricing | Groq LPU cloud (OpenAI-compatible API) | `export GROQ_API_KEY=…`. Default model `whisper-large-v3-turbo` — Groq's pruned / fine-tuned v3 variant, cheaper and faster than `whisper-large-v3` while still multilingual. For error-sensitive work, override with `PEEPSHOW_TRANSCRIBE_MODEL=whisper-large-v3`. Same request shape as the OpenAI endpoint; Groq's LPU is generally an order of magnitude faster per request. |
| `deepgram` | Usage-based — see deepgram.com/pricing | Deepgram cloud | `export DEEPGRAM_API_KEY=…`. Default model `nova-3` (Deepgram's highest-accuracy general model); `nova-2` is still available for unsupported nova-3 languages or filler-word detection. Deepgram supports diarisation, streaming, and language auto-detection; peepshow uses the synchronous REST endpoint. |
| `assemblyai` | Usage-based — see assemblyai.com/pricing | AssemblyAI cloud | `export ASSEMBLYAI_API_KEY=…`. Default `speech_model` is `best`; alternatives include `nano` (cheaper, wider language coverage), `slam-1`, and AssemblyAI's newer `universal` / `universal-2` / `universal-3-pro` tiers. AssemblyAI's API is upload → create job → poll, so peepshow polls until the job finishes (tunable via `PEEPSHOW_ASSEMBLYAI_MAX_POLLS` and `PEEPSHOW_ASSEMBLYAI_POLL_DELAY_MS`). |
| `custom` | Whatever your own pipeline costs | Anywhere — local, remote, whatever your command does | Set `PEEPSHOW_TRANSCRIBE_CMD='…'` to a shell command that reads audio bytes on stdin and writes `{segments:[{start,end,text}], text}` JSON on stdout. Non-zero exit codes are treated as a skip, with the first line of stderr stored as the reason. |
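The `custom` provider's contract (audio bytes in on stdin, `{segments, text}` JSON out on stdout, non-zero exit = skip) can be satisfied by a short script. A hedged sketch in Python — the `transcribe` body is a placeholder where your own ASR engine would go:

```python
#!/usr/bin/env python3
# Minimal sketch of a PEEPSHOW_TRANSCRIBE_CMD wrapper: reads raw audio
# bytes on stdin, writes {segments:[{start,end,text}], text} JSON on
# stdout. The ASR call here is a placeholder, not a real engine.
import json
import sys


def transcribe(audio_bytes):
    # Placeholder: run your own ASR over audio_bytes here and build
    # real (start, end, text) segments from its output.
    segments = [{"start": 0.0, "end": 1.0, "text": "example"}]
    return {"segments": segments,
            "text": " ".join(s["text"] for s in segments)}


if __name__ == "__main__":
    audio = sys.stdin.buffer.read()
    # On a real failure, print a one-line reason to stderr and exit
    # non-zero — peepshow records that as the skip reason.
    json.dump(transcribe(audio), sys.stdout)
```

Wire it in with something like `PEEPSHOW_TRANSCRIBE_CMD='python my-asr.py'` and `--transcribe custom` (the script name here is hypothetical).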
## Environment variables
Everything in the table below is optional. Sensible defaults kick in the moment a provider is picked.
| Variable | Purpose |
|---|---|
| `PEEPSHOW_TRANSCRIBE` | Provider: `whisper-cpp` · `openai` · `groq` · `deepgram` · `assemblyai` · `custom` · `off`. Overridden by `--transcribe` / `--no-transcribe`. |
| `PEEPSHOW_TRANSCRIBE_MODEL` | Force a specific model (e.g. `tiny.en`, `small`, `large-v3` for whisper-cpp; `whisper-large-v3` for Groq; `nova-3` / `nova-2` for Deepgram; `nano` / `universal-3-pro` for AssemblyAI). |
| `PEEPSHOW_TRANSCRIBE_LANGUAGE` | ISO-639 language hint (e.g. `en`, `ja`, `auto`). Improves accuracy for multilingual content. |
| `PEEPSHOW_TRANSCRIBE_ENDPOINT` | Override the HTTP endpoint for cloud providers — useful for Azure OpenAI, self-hosted Groq-compatible proxies, or local Deepgram on-prem. |
| `PEEPSHOW_WHISPER_CPP` | Absolute path to a `whisper-cli` binary (the current whisper.cpp CLI name — legacy `whisper` / `main` names still work). Defaults to whatever `which whisper-cli` resolves to. |
| `PEEPSHOW_WHISPER_MODEL_DIR` | Where `ggml-<model>.bin` lives. Defaults to `~/.peepshow/whisper-models/`; peepshow downloads from huggingface.co/ggerganov/whisper.cpp on first use. |
| `PEEPSHOW_TRANSCRIBE_CMD` | Shell command for the `custom` provider — reads audio on stdin, emits JSON on stdout. |
| `OPENAI_API_KEY` · `GROQ_API_KEY` · `DEEPGRAM_API_KEY` · `ASSEMBLYAI_API_KEY` | Per-provider credentials. Only the one matching the active provider is read. |
## LLM pipeline examples
The transcript lands on the JSON payload as `audio.transcript`, so it flows through every sink with zero extra config. A few common shapes:
```sh
# Default: local whisper.cpp, stream JSON (with transcript) into an LLM CLI
peepshow ./talk.mp4 --emit json | llm -m gpt-4.1 "summarise the call"

# Cloud transcription + archive to Postgres (transcript rides along)
export OPENAI_API_KEY=…
peepshow ./keynote.mp4 --transcribe openai --sink postgres

# Pipe transcript + frames to Obsidian, one markdown note per run
peepshow ./interview.mov --sink obsidian:MyVault

# Frames only — no audio, no transcript
peepshow ./loop.gif --no-audio --no-transcribe

# Plug in your own ASR — e.g. a local whisperx script
export PEEPSHOW_TRANSCRIBE_CMD='python ~/bin/whisperx-wrapper.py'
peepshow ./meeting.mp4 --transcribe custom
```
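On the receiving end of `--emit json`, a consumer can pull the transcript straight off the payload. A minimal sketch — the `audio.transcript` shape is as documented above, but the script itself (and the idea of piping into it) is illustrative:

```python
# Hedged consumer sketch: pipe one peepshow JSON payload into this
# script and it prints the transcript with per-segment timestamps.
# Assumes only the documented audio.transcript shape.
import json
import sys


def format_transcript(payload):
    """Render audio.transcript segments as '[start - end] text' lines."""
    transcript = payload.get("audio", {}).get("transcript") or {}
    lines = []
    for seg in transcript.get("segments", []):
        lines.append(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {seg['text']}")
    return "\n".join(lines)


if __name__ == "__main__":
    raw = sys.stdin.read()
    if raw.strip():                      # nothing piped in → print nothing
        print(format_transcript(json.loads(raw)))
```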
## Privacy
The default is local. When whisper.cpp is on PATH, nothing — audio, frames, metadata, transcript — leaves the machine. The model binary is downloaded once from Hugging Face on first use and then never touches the network again.
If you explicitly pick a cloud provider (`openai`, `groq`, `deepgram`, `assemblyai`), the extracted `audio.m4a` is uploaded over HTTPS to that provider's endpoint and their own retention / privacy policy applies. Peepshow itself never records, caches, or logs your audio — the file is streamed straight from the temp directory into the HTTP client and released when the request finishes.
## Caveats
- **OpenAI's 25 MB cap.** Very long clips can exceed the upload limit. Peepshow's mono 16 kHz AAC re-encode keeps most content well under it, but multi-hour recordings can still hit the cap; split those into parts.
- **AssemblyAI polls.** AssemblyAI is an async API: upload, create job, poll for completion. Default is 60 polls at 2 s each (≈ 2 min ceiling). Tune with `PEEPSHOW_ASSEMBLYAI_MAX_POLLS` + `PEEPSHOW_ASSEMBLYAI_POLL_DELAY_MS`.
- **ffmpeg → AAC → provider chain.** Transcription runs against the re-encoded `audio.m4a`, not the original track. That trade-off buys compact uploads + consistent encoding across providers at the cost of a tiny bit of transcoder loss.
- **No audio = no transcript.** GIF, APNG, and animated WebP inputs never get an audio pass, so transcription is always skipped for them — a one-line note lands on stderr and peepshow moves on.
- **Provider failures are soft.** If a cloud request errors, peepshow records the reason on `audio.transcript.skippedReason` and continues with an empty transcript — the rest of the payload (frames, audio, sinks) is unaffected.
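Because failures are soft, downstream code should check for a skip reason before trusting the text. A small defensive sketch (the helper and its name are illustrative; only the `skippedReason` field comes from the docs above):

```python
# Defensive handling of the soft-failure behaviour described above:
# a failed cloud request leaves an empty transcript plus
# audio.transcript.skippedReason, so branch on that first.
def transcript_or_reason(payload):
    """Return (text, None) on success or (None, reason) on a skip."""
    transcript = payload.get("audio", {}).get("transcript") or {}
    if transcript.get("skippedReason"):
        return None, transcript["skippedReason"]
    return transcript.get("text", ""), None
```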