This repository provides fast automatic speech recognition (70x realtime with large-v2) with word-level timestamps and speaker diarization.
- ⚡️ Batched inference for 70x realtime transcription using whisper large-v2
- 🪶 faster-whisper backend, requires <8GB GPU memory for large-v2 with beam_size=5
- 🎯 Accurate word-level timestamps using wav2vec2 alignment
- 👯‍♂️ Multispeaker ASR using speaker diarization from pyannote-audio (speaker ID labels)
- 🗣️ VAD preprocessing, which reduces hallucination and enables batching with no WER degradation
Whisper is an ASR model developed by OpenAI, trained on a large dataset of diverse audio. While it does produce highly accurate transcriptions, the corresponding timestamps are at the utterance level, not per word, and can be inaccurate by several seconds. OpenAI's whisper does not natively support batching.
Phoneme-Based ASR is a suite of models finetuned to recognise the smallest unit of speech distinguishing one word from another, e.g. the element p in "tap". A popular example model is wav2vec2.0.
Forced Alignment refers to the process by which orthographic transcriptions are aligned to audio recordings to automatically generate phone level segmentation.
Voice Activity Detection (VAD) is the detection of the presence or absence of human speech.
Speaker Diarization is the process of partitioning an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
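As a minimal sketch of how these pieces fit together, the snippet below uses the upstream WhisperX Python API (assuming whisperx.load_model, load_audio, load_align_model, align, DiarizationPipeline and assign_word_speakers are available as in the original repository) to chain transcription, alignment and diarization:

```python
import whisperx

device = "cuda"
audio_file = "audio_file.mp3"

# 1. Transcribe with batched whisper (faster-whisper backend)
model = whisperx.load_model("large-v2", device, compute_type="float16")
audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=16)

# 2. Align for word-level timestamps (wav2vec2 phoneme model)
align_model, metadata = whisperx.load_align_model(
    language_code=result["language"], device=device
)
result = whisperx.align(result["segments"], align_model, metadata, audio, device)

# 3. Assign speaker labels (pyannote-audio diarization; needs a Hugging Face token)
diarize_model = whisperx.DiarizationPipeline(use_auth_token="YOUR_HF_TOKEN", device=device)
diarize_segments = diarize_model(audio)
result = whisperx.assign_word_speakers(diarize_segments, result)

print(result["segments"])  # segments with word timings and speaker labels
```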
- 1st place at Ego4d transcription challenge 🏆
- WhisperX accepted at INTERSPEECH 2023
- v3 transcript segment-per-sentence: using nltk sent_tokenize for better subtitling & better diarization
- v3 released, 70x speed-up open-sourced. Using batched whisper with faster-whisper backend!
- v2 released: code cleanup, imports whisper library. VAD filtering is now turned on by default, as in the paper.
- Paper drop 🎓👨‍🏫! Please see our arXiv preprint for benchmarking and details of WhisperX. We also introduce more efficient batch inference, resulting in large-v2 running at 60-70x real-time speed.
Installation with AIME MLC ⚙️
Easy installation within an AIME ML-Container.
mlc create whisper_container Pytorch 2.1.2-aime
mlc open whisper_container
git clone https://github.com/aime-labs/aime-api_whisperx.git
cd aime-api_whisperx
pip install -e .
Install FFmpeg support:
sudo apt update && sudo apt install ffmpeg
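To check that the package is importable and the CLI entry point works (a simple sanity check, assuming the standard --help flag):
python -m whisperx --help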
To run WhisperX as an HTTP/HTTPS API with the AIME API Server, start the following Python script from the command line:
python3 run_whisper_with_api_server.py --api_server <address of api server>
This starts WhisperX as a worker, waiting for job requests through the AIME API Server.
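For example, if the AIME API Server runs locally (the address and port below are placeholders):
python3 run_whisper_with_api_server.py --api_server http://localhost:7777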
Run whisper on an example segment (using default parameters: whisper small). Add --highlight_words True to visualise word timings in the .srt file.
python -m whisperx audio_file.mp3
For increased timestamp accuracy, at the cost of higher GPU memory, use bigger models (a bigger alignment model was not found to be that helpful; see the paper), e.g.
python -m whisperx audio_file.mp3 --model large-v3 --align_model WAV2VEC2_ASR_LARGE_LV60K_960H --batch_size 32
- Specify the language: python -m whisperx audio_file.mp3 --language en
- Choose the device: python -m whisperx audio_file.mp3 --device cuda
- Define the output format: python -m whisperx audio_file.mp3 --output_format srt
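These options can be combined, e.g. (using only the flags shown above):
python -m whisperx audio_file.mp3 --model large-v3 --language en --device cuda --output_format srt --highlight_words True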
For specific details on the batching and alignment, the effect of VAD, as well as the chosen alignment model, see the preprint paper.
To reduce GPU memory requirements, try any of the following (the second and third can affect quality); a combined example is shown after the list:
- Reduce the batch size, e.g. --batch_size 4
- Use a smaller ASR model, e.g. --model base
- Use a lighter compute type, e.g. --compute_type int8
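For example, combining all three:
python -m whisperx audio_file.mp3 --model base --batch_size 4 --compute_type int8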
Transcription differences from OpenAI's whisper:
- Transcription without timestamps. To enable single-pass batching, whisper inference is performed with --without_timestamps True; this ensures one forward pass per sample in the batch. However, this can cause discrepancies with the default whisper output.
- VAD-based segment transcription, unlike the buffered transcription of OpenAI's. In the WhisperX paper we show this reduces WER and enables accurate batched inference.
- --condition_on_prev_text is set to False by default (reduces hallucination).
- Transcript words which do not contain characters in the alignment model's dictionary, e.g. "2014." or "£13.60", cannot be aligned and therefore are not given a timing.
- Overlapping speech is not handled particularly well by whisper or whisperx
- Diarization is far from perfect
- A language-specific wav2vec2 model is needed for alignment
If you are multilingual, a major way you can contribute to this project is to find phoneme models on huggingface (or train your own) and test them on speech for the target language. If the results look good, send a pull request with some examples showing its success.
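For example, a candidate model can be tried with the existing CLI flags (the model ID below is a placeholder for a huggingface phoneme model):
python -m whisperx audio_file.mp3 --language de --align_model <huggingface-phoneme-model-id>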
Bug finding and pull requests are also highly appreciated to keep this project going, since it's already diverging from the original research scope.
This builds on OpenAI's whisper. It borrows important alignment code from the PyTorch tutorial on forced alignment and uses the wonderful pyannote VAD / diarization models: https://github.com/pyannote/pyannote-audio
Valuable VAD & Diarization Models from [pyannote audio](https://github.com/pyannote/pyannote-audio)
Great backend from faster-whisper and CTranslate2
Note
This documentation is tailored specifically for the AIME server environment. For general WhisperX usage, refer to the original repository.