Performs local speech-to-text transcription and speaker identification for audio and video files.
Especially useful for transcribing user interviews where confidentiality might be an issue or where data privacy is required. No data leaves your computer; everything runs locally.
Transcription will take considerably longer than you might be used to, since this script does not use any GPU optimization.
On a 2023 Mac M3 Pro this script runs roughly in real time, meaning one minute of audio takes about one minute to process. Identifying speakers accounts for most of this time.
For many or large files it might be best to run this tool overnight.
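For context, WhisperX handles the work in several stages: transcription with a Whisper model, word-level alignment, and speaker diarization. The following is a minimal sketch of that flow using the WhisperX Python API (function names such as DiarizationPipeline and parameters such as use_auth_token may differ between WhisperX versions; the file name, model size, and token are placeholders):
import whisperx

device = "cpu"  # this tool runs without GPU optimization

# Load the audio and transcribe it with the Whisper model configured in config.yml
audio = whisperx.load_audio("interview.mp3")
model = whisperx.load_model("medium", device, compute_type="int8")
result = model.transcribe(audio, language="de")

# Align word timestamps to the audio
align_model, metadata = whisperx.load_align_model(language_code="de", device=device)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# Identify speakers (this diarization step takes most of the processing time)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="hf_...", device=device)
diarize_segments = diarize_model(audio, min_speakers=1, max_speakers=2)
result = whisperx.assign_word_speakers(diarize_segments, result)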
This script has been tested on macOS 14.4.1 (23E224). Your mileage on Windows or another OS may vary.
These instructions are for usage on macOS:
- Download and install Python from https://www.python.org/downloads/
- Open a Terminal and install Homebrew, see https://brew.sh/
- In the open Terminal install FFmpeg with the following command:
brew install ffmpeg
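You can verify the FFmpeg installation afterwards with:
ffmpeg -version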
- If you haven't already, create an account on Hugging Face
- Note: the account is only used to download a model from Hugging Face; no training data or any other data will leave your computer.
- Log in to Hugging Face and accept the following user conditions:
- Download this repository and extract it to a folder of your choice. You should see these files:
config.yml
localwhisperx.py
- Install these necessary Python modules:
pip3 install pyyaml
pip3 install ffmpeg-python
pip3 install git+https://github.com/m-bain/whisperx.git
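As a quick sanity check you can try importing the modules afterwards (this assumes the standard import names yaml, ffmpeg and whisperx):
python3 -c "import yaml, ffmpeg, whisperx"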
- Open config.yml with your favourite editor, e.g. TextEdit, and add your Hugging Face User Access Token in the appropriate line
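The token line in config.yml will look roughly like this (the key name shown here is only an illustration; use whichever line the shipped config.yml already provides):
hf_token: "hf_xxxxxxxxxxxxxxxxxxxxxxxx"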
- Put your audio and/or video files into a directory and open a Terminal.
- In the open Terminal change to the directory where you extracted this repository:
cd path/to/unzipped/localwhisperx
- Transcribe your files with the following command. This will run the script and create a .txt file for the provided file or for all files within the given directory:
python3 localwhisperx.py your/audio/or/video/file/or/folder
On the first run the program will download a local copy of the speech recognition model, which requires a few gigabytes of free disk space.
The script can take a couple of command line arguments to refine your transcription:
--language
Language spoken in all files, e.g. en; default is de
--minspeaker
Minimum number of speakers; default is 1
--maxspeaker
Maximum number of speakers; default is 2
Here's an example, transcribing an English-language audio file with exactly 4 speakers:
python3 localwhisperx.py test.mp3 --language en --minspeaker 4 --maxspeaker 4
The file config.yml contains a line to change the model size. A larger model will generally result in a more accurate transcription, but processing time will increase accordingly.
You can select one of the following options:
tiny
base
small
medium (default; recommended compromise between accuracy and processing time, takes about 1.5 GB of space on your hard disk)
large
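For example, switching to the large model means changing the corresponding line in config.yml to something like the following (the exact key name is whatever the shipped config.yml uses):
model: large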
The tool might show a couple of warnings -- it is safe to ignore these:
UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.3.0. Bad things might happen unless you revert torch to 1.x.
If you're transcribing interviews, try pasting the resulting transcriptions into ChatGPT. You could use the Interview Analyst GPT and see what a quick analysis turns up.