# Audio transcription

*Local by default · six provider options*
peepshow already extracts the audio track from any video input. When transcription is enabled, the spoken words get attached to the same JSON payload — next to video, frames, and audio — so every downstream sink sees the transcript without any extra plumbing. The default provider is whisper.cpp, which runs entirely on your own machine.
## How it works
Every peepshow run follows the same pipeline. Transcription is the final step, fed by the audio that gets extracted alongside the frames:
- **ffmpeg decode** → scene-detect or fps-sample frames.
- **ffmpeg audio pass** → mono 16 kHz AAC `audio.m4a` + loudness peak + silence ratio.
- **Transcribe (if enabled)** → the audio file is handed to the selected provider; segments + full text return on `audio.transcript`.
- **Emit + fan out** → the payload is serialised for `--emit json/caveman/markdown` and streamed into every auto-sink and `--sink`.
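As a rough sketch, the payload those steps assemble might look like the following. Only `audio.transcript` (segments + full text) and the documented audio stats come from the description above; the exact names of the other fields are assumptions for illustration.

```python
# Illustrative sketch of the peepshow JSON payload shape.
# Field names outside audio.transcript are assumptions, not the real schema.
payload = {
    "video": {"source": "./talk.mp4"},                 # assumed field name
    "frames": [{"time": 0.0}, {"time": 4.2}],          # scene-detect / fps-sampled
    "audio": {
        "file": "audio.m4a",                           # mono 16 kHz AAC track
        "loudnessPeak": -3.1,                          # assumed field name
        "silenceRatio": 0.12,                          # assumed field name
        "transcript": {
            "segments": [{"start": 0.0, "end": 2.4, "text": "Hello everyone."}],
            "text": "Hello everyone.",
        },
    },
}

# Every sink reads the same shape, so downstream code only needs this path:
text = payload["audio"]["transcript"]["text"]
```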
Detection runs once at startup. If no provider is configured and whisper.cpp is on `PATH`, peepshow auto-enables `--transcribe whisper-cpp`. If the user explicitly passes `--no-transcribe` or sets `PEEPSHOW_TRANSCRIBE=off`, the transcription step is skipped even when a binary is available.
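That precedence (explicit flag beats environment variable beats auto-detection) can be sketched as a small decision function. This is illustrative, not peepshow's actual internals; the function name and signature are made up.

```python
import shutil


def pick_provider(cli_flag=None, env=None):
    """Return the transcription provider to use, or None to skip.

    Hypothetical sketch of the startup detection described above:
    --transcribe/--no-transcribe > PEEPSHOW_TRANSCRIBE > PATH auto-detect.
    """
    if cli_flag == "off":                 # --no-transcribe
        return None
    if cli_flag:                          # --transcribe <provider>
        return cli_flag
    if env == "off":                      # PEEPSHOW_TRANSCRIBE=off
        return None
    if env:                               # PEEPSHOW_TRANSCRIBE=<provider>
        return env
    if shutil.which("whisper-cli"):       # auto-enable local whisper.cpp
        return "whisper-cpp"
    return None
```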
## Providers
Six providers ship built in. Pick the one that matches your privacy, cost, and latency constraints — they all produce the same `audio.transcript` shape, so downstream code doesn't care which one ran.
| Provider | Cost | Runs on | Setup |
|---|---|---|---|
| `whisper-cpp` (default when on PATH) | Free | Your CPU (Metal on macOS, CUDA / OpenBLAS optional on Linux) | `brew install whisper-cpp` (macOS) · `scoop install whisper-cpp` (Windows) · prebuilt Linux releases. Default model: `base.en`. One-time download to `~/.peepshow/whisper-models/ggml-<model>.bin`. |
| `openai` | Usage-based — see openai.com/api/pricing | OpenAI cloud | `export OPENAI_API_KEY=…`. Default model `whisper-1`; OpenAI also offers `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` (override with `PEEPSHOW_TRANSCRIBE_MODEL`). Single-request API; files capped at 25 MB per upload (peepshow's 16 kHz mono AAC track is small — a ≈ 30-minute clip fits comfortably). |
| `groq` | Usage-based — see groq.com/pricing | Groq LPU cloud (OpenAI-compatible API) | `export GROQ_API_KEY=…`. Default model `whisper-large-v3-turbo` — Groq's pruned / fine-tuned v3 variant, cheaper and faster than `whisper-large-v3` while still multilingual. For error-sensitive work, override with `PEEPSHOW_TRANSCRIBE_MODEL=whisper-large-v3`. Same request shape as the OpenAI endpoint; Groq's LPU is generally an order of magnitude faster per request. |
| `deepgram` | Usage-based — see deepgram.com/pricing | Deepgram cloud | `export DEEPGRAM_API_KEY=…`. Default model `nova-3` (Deepgram's highest-accuracy general model); `nova-2` is still available for unsupported nova-3 languages or filler-word detection. Deepgram supports diarisation, streaming, and language auto-detection; peepshow uses the synchronous REST endpoint. |
| `assemblyai` | Usage-based — see assemblyai.com/pricing | AssemblyAI cloud | `export ASSEMBLYAI_API_KEY=…`. Default `speech_model` is `best`; alternatives include `nano` (cheaper, wider language coverage), `slam-1`, and AssemblyAI's newer `universal` / `universal-2` / `universal-3-pro` tiers. AssemblyAI's API is upload → create job → poll, so peepshow polls until the job finishes (tunable via `PEEPSHOW_ASSEMBLYAI_MAX_POLLS` and `PEEPSHOW_ASSEMBLYAI_POLL_DELAY_MS`). |
| `custom` | Whatever your own pipeline costs | Anywhere — local, remote, whatever your command does | Set `PEEPSHOW_TRANSCRIBE_CMD='…'` to a shell command that reads audio bytes on stdin and writes `{segments:[{start,end,text}], text}` JSON on stdout. Non-zero exit codes are treated as a skip, with the first line of stderr stored as the reason. |
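The `custom` provider's contract (audio bytes in on stdin, `{segments, text}` JSON out on stdout, non-zero exit = skip) can be satisfied by a short script. A hedged sketch in Python — the `transcribe` body is a placeholder where your own ASR engine would go:

```python
#!/usr/bin/env python3
# Minimal sketch of a PEEPSHOW_TRANSCRIBE_CMD wrapper: reads raw audio
# bytes on stdin, writes {segments:[{start,end,text}], text} JSON on
# stdout. The ASR call here is a placeholder, not a real engine.
import json
import sys


def transcribe(audio_bytes):
    # Placeholder: run your own ASR over audio_bytes here and build
    # real (start, end, text) segments from its output.
    segments = [{"start": 0.0, "end": 1.0, "text": "example"}]
    return {"segments": segments,
            "text": " ".join(s["text"] for s in segments)}


if __name__ == "__main__":
    audio = sys.stdin.buffer.read()
    # On a real failure, print a one-line reason to stderr and exit
    # non-zero — peepshow records that as the skip reason.
    json.dump(transcribe(audio), sys.stdout)
```

Wire it in with something like `PEEPSHOW_TRANSCRIBE_CMD='python my-asr.py'` and `--transcribe custom` (the script name here is hypothetical).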
## Environment variables
Everything in the table below is optional. Sensible defaults kick in the moment a provider is picked.
| Variable | Purpose |
|---|---|
| `PEEPSHOW_TRANSCRIBE` | Provider: `whisper-cpp` · `openai` · `groq` · `deepgram` · `assemblyai` · `custom` · `off`. Overridden by `--transcribe` / `--no-transcribe`. |
| `PEEPSHOW_TRANSCRIBE_MODEL` | Force a specific model (e.g. `tiny.en`, `small`, `large-v3` for whisper-cpp; `whisper-large-v3` for Groq; `nova-3` / `nova-2` for Deepgram; `nano` / `universal-3-pro` for AssemblyAI). |
| `PEEPSHOW_TRANSCRIBE_LANGUAGE` | ISO-639 language hint (e.g. `en`, `ja`, `auto`). Improves accuracy for multilingual content. |
| `PEEPSHOW_TRANSCRIBE_ENDPOINT` | Override the HTTP endpoint for cloud providers — useful for Azure OpenAI, self-hosted Groq-compatible proxies, or local Deepgram on-prem. |
| `PEEPSHOW_WHISPER_CPP` | Absolute path to a `whisper-cli` binary (the current whisper.cpp CLI name — legacy `whisper` / `main` names still work). Defaults to whatever `which whisper-cli` resolves to. |
| `PEEPSHOW_WHISPER_MODEL_DIR` | Where `ggml-<model>.bin` lives. Defaults to `~/.peepshow/whisper-models/`; peepshow downloads from huggingface.co/ggerganov/whisper.cpp on first use. |
| `PEEPSHOW_TRANSCRIBE_CMD` | Shell command for the `custom` provider — reads audio on stdin, emits JSON on stdout. |
| `OPENAI_API_KEY` · `GROQ_API_KEY` · `DEEPGRAM_API_KEY` · `ASSEMBLYAI_API_KEY` | Per-provider credentials. Only the one matching the active provider is read. |
## LLM pipeline examples
The transcript lands on the JSON payload as `audio.transcript`, so it flows through every sink with zero extra config. A few common shapes:
```sh
# Default: local whisper.cpp, stream JSON (with transcript) into an LLM CLI
peepshow ./talk.mp4 --emit json | llm -m gpt-4.1 "summarise the call"

# Cloud transcription + archive to Postgres (transcript rides along)
export OPENAI_API_KEY=…
peepshow ./keynote.mp4 --transcribe openai --sink postgres

# Pipe transcript + frames to Obsidian, one markdown note per run
peepshow ./interview.mov --sink obsidian:MyVault

# Frames only — no audio, no transcript
peepshow ./loop.gif --no-audio --no-transcribe

# Plug in your own ASR — e.g. a local whisperx script
export PEEPSHOW_TRANSCRIBE_CMD='python ~/bin/whisperx-wrapper.py'
peepshow ./meeting.mp4 --transcribe custom
```
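On the receiving end of `--emit json`, a consumer can pull the transcript straight off the payload. A minimal sketch — the `audio.transcript` shape is as documented above, but the script itself (and the idea of piping into it) is illustrative:

```python
# Hedged consumer sketch: pipe one peepshow JSON payload into this
# script and it prints the transcript with per-segment timestamps.
# Assumes only the documented audio.transcript shape.
import json
import sys


def format_transcript(payload):
    """Render audio.transcript segments as '[start - end] text' lines."""
    transcript = payload.get("audio", {}).get("transcript") or {}
    lines = []
    for seg in transcript.get("segments", []):
        lines.append(f"[{seg['start']:7.2f} - {seg['end']:7.2f}] {seg['text']}")
    return "\n".join(lines)


if __name__ == "__main__":
    raw = sys.stdin.read()
    if raw.strip():                      # nothing piped in → print nothing
        print(format_transcript(json.loads(raw)))
```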
## Privacy
The default is local. When whisper.cpp is on PATH, nothing — audio, frames, metadata, transcript — leaves the machine. The model binary is downloaded once from Hugging Face on first use and then never touches the network again.
If you explicitly pick a cloud provider (`openai`, `groq`, `deepgram`, `assemblyai`), the extracted `audio.m4a` is uploaded over HTTPS to that provider's endpoint and their own retention / privacy policy applies. Peepshow itself never records, caches, or logs your audio — the file is streamed straight from the temp directory into the HTTP client and released when the request finishes.
## Caveats
- **OpenAI's 25 MB cap.** Very long clips can exceed the upload limit. Peepshow's mono 16 kHz AAC re-encode keeps most content well under it, but multi-hour recordings can still hit the cap; split those into parts.
- **AssemblyAI polls.** AssemblyAI is an async API: upload, create job, poll for completion. Default is 60 polls at 2 s each (≈ 2 min ceiling). Tune with `PEEPSHOW_ASSEMBLYAI_MAX_POLLS` + `PEEPSHOW_ASSEMBLYAI_POLL_DELAY_MS`.
- **ffmpeg → AAC → provider chain.** Transcription runs against the re-encoded `audio.m4a`, not the original track. That trade-off buys compact uploads + consistent encoding across providers at the cost of a tiny bit of transcoder loss.
- **No audio = no transcript.** GIF, APNG, and animated WebP inputs never get an audio pass, so transcription is always skipped for them — a one-line note lands on stderr and peepshow moves on.
- **Provider failures are soft.** If a cloud request errors, peepshow records the reason on `audio.transcript.skippedReason` and continues with an empty transcript — the rest of the payload (frames, audio, sinks) is unaffected.
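Because failures are soft, downstream code should check for a skip reason before trusting the text. A small defensive sketch (the helper and its name are illustrative; only the `skippedReason` field comes from the docs above):

```python
# Defensive handling of the soft-failure behaviour described above:
# a failed cloud request leaves an empty transcript plus
# audio.transcript.skippedReason, so branch on that first.
def transcript_or_reason(payload):
    """Return (text, None) on success or (None, reason) on a skip."""
    transcript = payload.get("audio", {}).get("transcript") or {}
    if transcript.get("skippedReason"):
        return None, transcript["skippedReason"]
    return transcript.get("text", ""), None
```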