I think a lot of people have heard of OpenAI’s local-friendly Whisper model, but I don’t see enough self-hosters talking about WhisperX, so I’ll hop on the soapbox:
Whisper is extremely good when you have lots of audio with one person talking, but fails hard in a conversational setting with people talking over each other. It’s also hard to sync up transcripts with the original audio.
Enter WhisperX: WhisperX is an improved whisper implementation that automatically tags who is talking, and tags each line of speech with a timestamp.
I’ve found it great for DMing TTRPGs — simply record your session with a conference mic, run a transcript with WhisperX, and pass the output to a long-context LLM for easy session summaries. It’s a great way to avoid slowing down the game by taking notes on minor events and NPCs.
I’ve also used it in a hacky script pipeline to bulk download podcast episodes with yt-dlp, create searchable transcripts, and scrub ads by having an LLM sniff out timestamps to cut with ffmpeg.
Privacy-friendly, modest hardware requirements, and good at what it does. WhisperX, apply directly to the forehead.
This is genius. Could you appify this and I’ll pay you in real or pretend currency as you prefer
Okay that’s just crazy. ;)
Probably not that hard to build a simple flask frontend around it.
Automatically processing files in an S3/WebDAV directory would also be useful.