DiCoW (Diarization-Conditioned Whisper) enhances OpenAI’s Whisper ASR model by integrating speaker diarization for multi-speaker transcription. The app uses pyannote/speaker-diarization-3.1 to segment speakers and provides diarization-conditioned transcription for long-form audio inputs.
Training and inference source code can be found here: TS-ASR-Whisper
- Multi-Speaker ASR: Handles multi-speaker audio using diarization-aware transcription.
- Flexible Input Sources:
- Microphone: Record and transcribe live audio.
- Audio File Upload: Upload pre-recorded audio files for transcription.
- Diarization Support: Powered by pyannote/speaker-diarization-3.1 for accurate speaker segmentation.
- Built with 🤗 Transformers: Uses the latest Whisper checkpoints for robust transcription.
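The speaker segmentation step can be sketched as follows. This is a minimal illustration, not the app's actual code: it assumes the pyannote.audio 3.x API (`Pipeline.from_pretrained` with the `use_auth_token` keyword, which may be named differently in other versions), and the helper names `diarize` and `group_by_speaker` are hypothetical. How DiCoW then conditions Whisper on these segments is not shown here.

```python
from collections import defaultdict


def diarize(audio_path, hf_token):
    """Run pyannote diarization and return (start, end, speaker) tuples.

    Assumption: pyannote.audio 3.x is installed and the token has access
    to the pyannote/speaker-diarization-3.1 pipeline.
    """
    # Imported lazily so group_by_speaker() works without pyannote installed.
    from pyannote.audio import Pipeline

    pipeline = Pipeline.from_pretrained(
        "pyannote/speaker-diarization-3.1", use_auth_token=hf_token
    )
    annotation = pipeline(audio_path)
    return [
        (turn.start, turn.end, speaker)
        for turn, _, speaker in annotation.itertracks(yield_label=True)
    ]


def group_by_speaker(turns):
    """Group (start, end, speaker) tuples into {speaker: [(start, end), ...]}."""
    grouped = defaultdict(list)
    for start, end, speaker in sorted(turns):
        grouped[speaker].append((start, end))
    return dict(grouped)


# Hand-written turns in the shape diarize() returns; note the overlap at 2.0-2.5 s.
example_turns = [
    (0.0, 2.5, "SPEAKER_00"),
    (2.0, 4.0, "SPEAKER_01"),
    (4.0, 6.0, "SPEAKER_00"),
]
print(group_by_speaker(example_turns))
```

With a real recording, `group_by_speaker(diarize("meeting.wav", token))` would give the per-speaker time ranges; the file name here is only a placeholder.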
Disclaimer: This version of DiCoW currently supports English only and is still under active development. Expect frequent updates and feature improvements.
Run the app directly in your browser with the Gradio app.
Before running the app, ensure you have the following installed:
- Python 3.11
- FFmpeg: Required for audio processing.
- Python Libraries:
gradio
transformers
pyannote.audio
torch
librosa
soundfile
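To confirm the prerequisites listed above before launching, a small check along these lines may help. It is a sketch only: the function names are hypothetical, it uses only the Python standard library, and it checks that the modules can be found, not that their versions are compatible.

```python
import importlib.util
import shutil
import sys

# Import names of the libraries listed in the prerequisites.
REQUIRED_MODULES = [
    "gradio",
    "transformers",
    "pyannote.audio",
    "torch",
    "librosa",
    "soundfile",
]


def missing_modules(names):
    """Return the subset of module names that cannot be located."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when a parent package (e.g. "pyannote") is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


def environment_report():
    """Summarize Python version, FFmpeg availability, and missing modules."""
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "ffmpeg_on_path": shutil.which("ffmpeg") is not None,
        "missing_modules": missing_modules(REQUIRED_MODULES),
    }


print(environment_report())
```

An empty `missing_modules` list, `ffmpeg_on_path` set to `True`, and a `python` value of `3.11` would match the requirements stated above.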
- Clone the repository:
git clone https://github.com/your-username/DiCoW-v1.git
cd DiCoW-v1
- Setup dependencies:
pip install -r requirements.txt
- Export your Hugging Face API token:
export HF_TOKEN=''
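On the Python side, the exported token can be read from the environment and validated early, so that a missing token produces a clear error instead of a failed model download later. This is a minimal sketch assuming only the variable name `HF_TOKEN` from the step above; the helper `get_hf_token` is hypothetical, and app.py may handle the token differently.

```python
import os


def get_hf_token(env=None):
    """Return the Hugging Face token from the environment or raise."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it before starting the app."
        )
    return token


# Example with an explicit mapping instead of the real environment.
print(get_hf_token({"HF_TOKEN": "example-token"}))
```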
Run the application locally:
python app.py
Once the server is running, access the app in your browser at http://localhost:7860.
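To check from a script whether the server is reachable (for example when it runs in the background), something like the following may be useful. It is a sketch using only the standard library; the function name `server_is_up` is hypothetical, and the default URL assumes the port mentioned above.

```python
import http.client
import urllib.error
import urllib.request


def server_is_up(url="http://localhost:7860", timeout=2.0):
    """Return True if an HTTP server answers at the given URL."""
    # Bypass any configured proxy, since the target is a local address.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, even if with an error status code.
        return True
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed response.
        return False
```

Calling `server_is_up()` with no arguments checks the default address; a return value of `False` usually means the app has not finished starting or is not running.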
If you want to run this demo in the background, consider registering it as a service. Some distros kill background jobs when the user logs out, which would also kill the demo.
To register the demo as a service, first edit ./run_server.sh and ./DiCoW-background.service and set the proper paths and users. It is important to set up conda correctly in ./run_server.sh, because the service is started outside the user's login environment (.profile is not loaded).
Then register and start the service (run as root):
systemctl enable ./DiCoW-background.service   # register the service
systemctl start DiCoW-background.service      # start
systemctl status DiCoW-background.service     # check if it is running
systemctl stop DiCoW-background.service       # stop
systemctl disable DiCoW-background.service    # will no longer start on reboot
- Microphone: Use your device’s microphone for live transcription.
- Audio File Upload: Upload pre-recorded audio files for diarization-conditioned transcription.
We welcome contributions! If you’d like to add features or improve the app, please open an issue or submit a pull request.
If you use our model or code, please cite:
@misc{polok2024dicowdiarizationconditionedwhispertarget,
title={DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition},
author={Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
year={2024},
eprint={2501.00114},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2501.00114},
}
@misc{polok2024targetspeakerasrwhisper,
title={Target Speaker ASR with Whisper},
author={Alexander Polok and Dominik Klement and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
year={2024},
eprint={2409.09543},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2409.09543},
}
For more information, feel free to contact us: [email protected], [email protected].