Audio transcription

peepshow already extracts the audio track from any video input. When transcription is enabled, the spoken words get attached to the same JSON payload — next to video, frames, and audio — so every downstream sink sees the transcript without any extra plumbing. The default provider is whisper.cpp, which runs entirely on your own machine.

How it works

Every peepshow run follows the same pipeline. Transcription is the second-to-last step, fed by the audio that gets extracted alongside the frames, just before the payload is emitted:

  1. ffmpeg decode → scene-detect or fps-sample frames.
  2. ffmpeg audio pass → mono 16 kHz AAC audio.m4a + loudness peak + silence ratio.
  3. Transcribe (if enabled) → the audio file is handed to the selected provider; segments + full text return on audio.transcript.
  4. Emit + fan out → the payload is serialised for --emit json/caveman/markdown and streamed into every auto-sink and --sink.
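Put concretely, the payload that step 4 fans out might look like the sketch below. Only audio.transcript (segments plus full text) is documented above; every other field name and value here is an illustrative assumption, not the real schema.

```python
# Hypothetical payload sketch. Field names outside audio.transcript
# are assumptions; only the transcript shape is described in the docs.
payload = {
    "video": {"path": "./talk.mp4"},            # assumed shape
    "frames": [{"index": 0, "time_s": 0.0}],    # assumed shape
    "audio": {
        "file": "audio.m4a",                    # mono 16 kHz AAC, per step 2
        "loudness_peak_db": -3.1,               # illustrative values
        "silence_ratio": 0.12,
        "transcript": {
            "text": "full transcript text",
            "segments": [
                {"start": 0.0, "end": 2.4, "text": "full transcript text"},
            ],
        },
    },
}

# Downstream sinks read the transcript the same way whichever provider ran.
print(payload["audio"]["transcript"]["text"])
```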

Detection runs once at startup. If no provider is configured and whisper.cpp is on PATH, peepshow auto-enables --transcribe whisper-cpp. If the user explicitly passes --no-transcribe or sets PEEPSHOW_TRANSCRIBE=off, the transcription step is skipped even when a binary is available.

Providers

Six providers ship built in: whisper-cpp (local, the default), openai, groq, deepgram, assemblyai (cloud), and custom (your own command via PEEPSHOW_TRANSCRIBE_CMD). Pick the one that matches your privacy, cost, and latency constraints; they all produce the same audio.transcript shape, so downstream code doesn't care which one ran.
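Because the shape is stable across providers, downstream code can stay provider-agnostic. A minimal consumer sketch, assuming each segment carries start, end, and text keys (the key names are an assumption):

```python
import json
import sys

def transcript_lines(payload: dict) -> list[str]:
    """Format transcript segments as '[MM:SS] text' lines.

    Works identically whichever provider produced the transcript,
    since they all emit the same audio.transcript shape.
    """
    out = []
    for seg in payload["audio"]["transcript"]["segments"]:
        m, s = divmod(int(seg["start"]), 60)
        out.append(f"[{m:02d}:{s:02d}] {seg['text']}")
    return out

if __name__ == "__main__":
    # e.g.  peepshow ./talk.mp4 --emit json | python transcript_lines.py
    print("\n".join(transcript_lines(json.load(sys.stdin))))
```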

Environment variables

Everything in the table below is optional. Sensible defaults kick in the moment a provider is picked.

LLM pipeline examples

The transcript lands on the JSON payload as audio.transcript, so it flows through every sink with zero extra config. A few common shapes:

# Default: local whisper.cpp, stream JSON (with transcript) into an LLM CLI
peepshow ./talk.mp4 --emit json | llm -m gpt-4.1 "summarise the call"

# Cloud transcription + archive to Postgres (transcript rides along)
export OPENAI_API_KEY=…
peepshow ./keynote.mp4 --transcribe openai --sink postgres

# Pipe transcript + frames to Obsidian, one markdown note per run
peepshow ./interview.mov --sink obsidian:MyVault

# Frames only — no audio, no transcript
peepshow ./loop.gif --no-audio --no-transcribe

# Plug in your own ASR — e.g. a local whisperx script
export PEEPSHOW_TRANSCRIBE_CMD='python ~/bin/whisperx-wrapper.py'
peepshow ./meeting.mp4 --transcribe custom
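A custom wrapper like the whisperx script above might look like the skeleton below. The contract is entirely assumed: peepshow is presumed to pass the extracted audio path as the first argument and read a JSON transcript (full text plus segments) from stdout; check the custom-provider docs for the real interface before relying on this.

```python
#!/usr/bin/env python3
"""Skeleton for a PEEPSHOW_TRANSCRIBE_CMD wrapper (assumed contract:
audio path in argv[1], JSON transcript on stdout)."""
import json
import sys

def transcribe(audio_path: str) -> dict:
    # Stub: replace with a real ASR call (e.g. whisperx) that returns
    # the full text plus per-segment timings.
    return {
        "text": f"(transcript of {audio_path})",
        "segments": [
            {"start": 0.0, "end": 1.0, "text": f"(transcript of {audio_path})"},
        ],
    }

if __name__ == "__main__":
    json.dump(transcribe(sys.argv[1]), sys.stdout)
```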

Privacy

The default is local. When whisper.cpp is on PATH, nothing — audio, frames, metadata, transcript — leaves the machine. The model file is downloaded once from Hugging Face on first use and then never touches the network again.

If you explicitly pick a cloud provider (openai, groq, deepgram, assemblyai) the extracted audio.m4a is uploaded over HTTPS to that provider's endpoint and their own retention / privacy policy applies. Peepshow itself never records, caches, or logs your audio — the file is streamed straight from the temp directory into the HTTP client and released when the request finishes.

Caveats