I think a lot of people have heard of OpenAI’s local-friendly Whisper model, but I don’t see enough self-hosters talking about WhisperX, so I’ll hop on the soapbox:

Whisper is extremely good when you have lots of audio with one person talking, but fails hard in a conversational setting with people talking over each other. It’s also hard to sync up transcripts with the original audio.

Enter WhisperX: an improved Whisper implementation that automatically tags who is talking and stamps each line of speech with a timestamp.

I’ve found it great for DMing TTRPGs — simply record your session with a conference mic, run a transcript with WhisperX, and pass the output to a long-context LLM for easy session summaries. It’s a great way to avoid slowing down the game by taking notes on minor events and NPCs.
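For the "pass the output to an LLM" step, here is a minimal sketch of flattening a diarized transcript into speaker-labeled text. It assumes each segment is a dict with `start`, `end`, `text`, and `speaker` keys, which is roughly what WhisperX produces after diarization; check your own output before relying on that. The helper names and the sample data are made up for illustration.

```python
from itertools import groupby


def hms(seconds: float) -> str:
    """Format a number of seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def format_transcript(segments: list[dict]) -> str:
    """Merge consecutive segments from the same speaker into one line each."""
    lines = []
    # groupby only groups *adjacent* items, which is what we want here:
    # a speaker who talks again later gets a new line.
    for speaker, group in groupby(segments, key=lambda s: s.get("speaker", "UNKNOWN")):
        group = list(group)
        text = " ".join(s["text"].strip() for s in group)
        lines.append(f"[{hms(group[0]['start'])}] {speaker}: {text}")
    return "\n".join(lines)


# Hypothetical sample data standing in for real transcript output.
segments = [
    {"start": 0.0, "end": 2.5, "speaker": "SPEAKER_00", "text": " You enter the tavern."},
    {"start": 2.5, "end": 4.0, "speaker": "SPEAKER_00", "text": " It smells of smoke."},
    {"start": 4.2, "end": 6.0, "speaker": "SPEAKER_01", "text": " I ask the barkeep for rumors."},
]
print(format_transcript(segments))
```

Merging same-speaker runs keeps the prompt shorter, and the leading timestamp lets the summary point back into the recording.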

I’ve also used it in a hacky script pipeline to bulk download podcast episodes with yt-dlp, create searchable transcripts, and scrub ads by having an LLM sniff out timestamps to cut with ffmpeg.
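The ad-scrubbing step could look something like the sketch below. This is not the actual script, just one way to do it: given ad ranges (in seconds) reported by an LLM, compute the spans to keep and build an ffmpeg command that uses the `aselect` and `asetpts` audio filters. The function names, file names, and sample ranges are hypothetical, and this approach re-encodes the audio.

```python
def keep_ranges(ad_ranges, duration):
    """Return the (start, end) spans that remain after removing ad_ranges."""
    keep, cursor = [], 0.0
    for start, end in sorted(ad_ranges):
        # Clamp each ad range to the bounds of the file.
        start = min(max(start, 0.0), duration)
        end = min(end, duration)
        if start > cursor:
            keep.append((cursor, start))
        # max() handles overlapping or nested ad ranges.
        cursor = max(cursor, end)
    if cursor < duration:
        keep.append((cursor, duration))
    return keep


def ffmpeg_cut_command(src, dst, ad_ranges, duration):
    """Build an ffmpeg argument list that keeps only the non-ad audio."""
    spans = keep_ranges(ad_ranges, duration)
    if not spans:
        raise ValueError("ad ranges cover the whole file; nothing to keep")
    expr = "+".join(f"between(t,{s:g},{e:g})" for s, e in spans)
    # aselect keeps frames where the expression is non-zero;
    # asetpts regenerates timestamps so the output has no gaps.
    return ["ffmpeg", "-i", src, "-af", f"aselect='{expr}',asetpts=N/SR/TB", dst]


# Hypothetical example: two ad breaks in a 600-second episode.
cmd = ffmpeg_cut_command("episode.mp3", "episode_noads.mp3", [(100, 130), (10, 40)], 600)
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with the filter expression.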

Privacy-friendly, modest hardware requirements, and good at what it does. WhisperX, apply directly to the forehead.

  • irmadlad@lemmy.world
    1 day ago

    What would be some use cases for WhisperX? I’m struggling to envision how I would use that in a selfhosting/homelabbing environment.

    • fatalicus@lemmy.world
      16 hours ago

      I’m personally looking at setting up Whisper or WhisperX with Bazarr, to generate subtitles for movies and series when I can’t find any to download.

    • TheFogan@programming.dev
      1 day ago

      Half sarcastic, but the overall premise is rigging something into a local voice assistant. When an argument starts: “OK Nabu, record this conversation.” Then two weeks later, in another argument… “OK Nabu, search our last argument for the cabinet.” Would be like having a court transcriber on call.

      • irmadlad@lemmy.world
        1 day ago

        I have a lady friend that does quite a good enough job of that. LOL

        ‘You remember back in 1979…it was a Friday at 2:11 PM, and you said…’ ‘Babe, I don’t remember what I had for breakfast yesterday.’

          • irmadlad@lemmy.world
            1 day ago

            What kind of stupid-ass question is that? LOL All kidding aside, she’s a good soul. We’re not married, we’ve just known each other for 45+ years. It just kind of clicked. She lives in her house, and I in mine, and we get together as often as possible.

      • hendrik@palaver.p3x.de
        1 day ago

        Hmm… Would be interesting to find out what kind of effect that has on the average marriage or relationship 😅

        • TheFogan@programming.dev
          23 hours ago

          I mean, I’d imagine probably not a good one :) Somehow I imagine asking the AI to record a conversation is an instant argument escalator… as is asking it to read the facts back, and usually the topic would be switched rather than one side admitting their fault in the conversation.

          Actually, I think there’s a Black Mirror episode on roughly that (not a device for recording audio when asked, but everyone having a chip in their head that automatically records their memories, and a huge fight when a husband discovers his wife deleted a few hours of recordings).

      • irmadlad@lemmy.world
        1 day ago

        I guess that’s why I am having difficulty coming up with a use case. I mean, I walk around the lab talking to myself all day long, but I think it’d be a bad idea to have a record of all those conversations. lol

        • onslaught545@lemmy.zip
          1 day ago

          If you don’t have to sit through a bunch of ‘meetings that could have been emails’ on a daily basis, you likely won’t have a use case for it.

          But in my last job I was a systems engineer for a web development company. I had to be included on all of the dev calls in case an infrastructure question came up that I needed to answer, and so I was vaguely aware of what the devs were doing.

          This software would have been a lifesaver, because my ADHD doesn’t let me listen to stuff like that for a straight hour or two.

      • irmadlad@lemmy.world
        10 hours ago

        Now that’s an interesting angle. I am a mediocre musician on my best day, but sometimes I incorporate phrases and lyric snippets in a piece. I wonder if I could use WhisperX to find those words or phrases in a stack of songs. For instance, I did a piece that used a line from Jimi Hendrix’s ‘If 6 Was 9’ where he says ‘I’m the one who’s gotta die when it’s time for me to die. So let me live my life the way I want to.’ I wonder if WhisperX could pick that out of a stack of Jimi Hendrix songs.
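
Once each song has a transcript, the phrase lookup itself is simple. Here is a rough sketch, assuming the transcripts are stored as a dict mapping a track name to a list of segments with `start` and `text` keys; the function names, track names, and sample segments are all hypothetical. Whether the transcription holds up on sung vocals over music is a separate question that this does not answer.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, straighten apostrophes, and split into bare words."""
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def find_phrase(transcripts: dict[str, list[dict]], phrase: str) -> list[tuple[str, float]]:
    """Return (track, start_seconds) for every segment containing the phrase."""
    target = normalize(phrase)
    if not target:
        return []
    hits = []
    for track, segments in transcripts.items():
        for seg in segments:
            words = normalize(seg["text"])
            # Slide a window over the segment's words and compare to the phrase.
            last = len(words) - len(target) + 1
            if any(words[i:i + len(target)] == target for i in range(last)):
                hits.append((track, seg["start"]))
    return hits


# Hypothetical transcripts for two tracks.
transcripts = {
    "track_a": [
        {"start": 12.0, "text": "Something unrelated here."},
        {"start": 95.5, "text": "So let me live my life the way I want to."},
    ],
    "track_b": [{"start": 3.0, "text": "Nothing to see in this one."}],
}
print(find_phrase(transcripts, "Let me live my life"))
```

A phrase that straddles two segments would be missed by this version; joining each track's segments before searching would fix that at the cost of coarser timestamps.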