Overview

This repository includes implementations of speaker verification systems that take raw waveforms as input.

Currently, it includes four systems in Python. Detailed instructions for each system are given in its own README file.

RawNet3 in ESPnet

As part of the open-source ESPnet-SPK project, a RawNet3 model pre-trained with the ESPnet-SPK framework is provided for easy access. Although the architecture is the same, the enhanced framework yields slightly better performance.

Performance
- Vox1-O: EER 0.73%
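
For reference, the equal error rate (EER) is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal NumPy sketch of that definition, not the evaluation script used for the numbers above; the function name `compute_eer` is an illustrative assumption, and tied scores are not treated specially.

```python
import numpy as np


def compute_eer(scores, labels):
    """Approximate the equal error rate from trial scores.

    scores: higher means "more likely the same speaker".
    labels: 1 for target (same-speaker) trials, 0 for non-target trials.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # Sort trials by descending score. Accepting the top-k trials for
    # k = 0..N is equivalent to sweeping the decision threshold.
    order = np.argsort(-scores, kind="stable")
    labels = labels[order]

    n_target = labels.sum()
    n_nontarget = len(labels) - n_target

    # Number of target / non-target trials accepted after the first k trials.
    accepted_targets = np.concatenate(([0], np.cumsum(labels)))
    accepted_nontargets = np.concatenate(([0], np.cumsum(1 - labels)))

    frr = 1.0 - accepted_targets / n_target   # targets wrongly rejected
    far = accepted_nontargets / n_nontarget   # non-targets wrongly accepted

    # The EER lies where the two curves cross; take the closest point
    # and average the two rates there.
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)
```

Multiplying the returned value by 100 gives a percentage comparable to the figures reported in this README.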

Usage

As shown in Figure 3 of the ESPnet-SPK paper, a few lines of code are sufficient to extract RawNet3 embeddings. Refer to the code snippet below and replace np.zeros with your raw waveform.

ESPnet installation is a prerequisite.

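import numpy as np
from espnet2.bin.spk_inference import Speech2Embedding

# Load the pre-trained RawNet3 model by its model tag.
speech2spk_embed = Speech2Embedding.from_pretrained(model_tag="espnet/voxcelebs12_rawnet3")
# Replace np.zeros(16500) with your own raw waveform.
embedding = speech2spk_embed(np.zeros(16500))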

ESPnet-SPK is currently on arXiv.

@article{jung2024espnet,
  title={ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models},
  author={Jung, Jee-weon and Zhang, Wangyou and Shi, Jiatong and Aldeneh, Zakaria and Higuchi, Takuya and Theobald, Barry-John and Abdelaziz, Ahmed Hussen and Watanabe, Shinji},
  journal={arXiv preprint arXiv:2401.17230},
  year={2024}
}

RawNet3

PyTorch implementation
Performance
- supervised learning with AAM-Softmax: EER 0.89%
- self-supervised learning: EER 5.40%
Training recipe
- Will be made available at https://github.com/clovaai/voxceleb_trainer
Inference
- Pre-trained weight parameters are stored on Hugging Face and are included as a submodule.
- The Vox1-O benchmark is available in RawNet3.
- Extracting a speaker embedding from any 16 kHz, 16-bit, mono utterance is supported.
Published as a conference paper in Interspeech 2022.

@article{jung2022pushing,
  title={Pushing the limits of raw waveform speaker recognition},
  author={Jung, Jee-weon and Kim, You Jin and Heo, Hee-Soo and Lee, Bong-Jin and Kwon, Youngki and Chung, Joon Son},
  journal={Proc. Interspeech},
  year={2022}
}
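
Since RawNet3 inference expects a 16 kHz, 16-bit, mono utterance, it can help to check a file before extracting an embedding. The following is a minimal sketch using only the standard-library `wave` module and NumPy; the function name `load_16k_mono_pcm16` is hypothetical, and scaling the samples to [-1, 1) is an assumption, so check the RawNet3 README for the input range the model actually expects.

```python
import wave

import numpy as np


def load_16k_mono_pcm16(path):
    """Read a WAV file and return its samples as float32 in [-1, 1).

    Raises ValueError if the file is not 16 kHz, 16-bit, mono PCM.
    """
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wav.getframerate()} Hz")
        if wav.getsampwidth() != 2:
            raise ValueError(f"expected 16-bit samples, got {8 * wav.getsampwidth()}-bit")
        if wav.getnchannels() != 1:
            raise ValueError(f"expected mono, got {wav.getnchannels()} channels")
        frames = wav.readframes(wav.getnframes())

    # 16-bit PCM in WAV files is stored as little-endian signed integers.
    samples = np.frombuffer(frames, dtype="<i2")
    # Assumption: the model takes floating-point audio scaled to [-1, 1).
    return samples.astype(np.float32) / 32768.0
```

Files with a different sampling rate or channel count would need to be resampled or downmixed first, which this sketch deliberately leaves out.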

RawNet2_modified

Code refactoring
- ResNet-like model implementation in PyTorch
- Deeper architecture
- Improved feature map scaling method
  - α-feature map scaling for raw waveform speaker verification
    - Only the abstract is in English
- Angular loss function adopted
Performance
- EER 1.91%
  - Trained using VoxCeleb2
  - VoxCeleb1 original trial
- Will be used as a baseline system for the authors' future work
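
The angular loss function is not specified further here. As one common example of an angular-margin objective, AAM-Softmax (named in the RawNet3 section) adds a margin to the angle between an embedding and its target class weight before scaling. The sketch below computes only the resulting logits in NumPy; the function name, the margin of 0.3, and the scale of 30 are illustrative assumptions, not the settings used in this repository.

```python
import numpy as np


def aam_softmax_logits(embeddings, class_weights, labels, margin=0.3, scale=30.0):
    """Return additive-angular-margin logits of shape (batch, n_classes).

    embeddings:    (batch, dim) speaker embeddings.
    class_weights: (n_classes, dim) one weight vector per training speaker.
    labels:        (batch,) integer index of the target speaker.
    """
    emb = np.asarray(embeddings, dtype=float)
    w = np.asarray(class_weights, dtype=float)
    labels = np.asarray(labels, dtype=int)

    # L2-normalise both sides so that dot products are cosines of angles.
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    w = w / np.linalg.norm(w, axis=1, keepdims=True)
    cos = np.clip(emb @ w.T, -1.0, 1.0)

    # Add the margin to the target-class angle only, which makes the
    # target harder to satisfy and pushes classes apart on the sphere.
    # Practical implementations also guard the case angle + margin > pi;
    # that handling is omitted in this sketch.
    theta = np.arccos(cos)
    theta[np.arange(len(labels)), labels] += margin

    return scale * np.cos(theta)
```

These logits would then be passed to an ordinary softmax cross-entropy loss during training.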

RawNet2

Improved performance over RawNet
- DNN speaker embedding extraction with raw waveform inputs
- cosine similarity back-end
- EER reduced from 4.8% to 2.56%
  - VoxCeleb1 original trial
Uses a technique named feature map scaling
- scales feature maps in a manner similar to squeeze-excitation
Implemented in PyTorch.
Published as a conference paper in Interspeech 2020.
- Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms

@article{jung2020improved,
  title={Improved RawNet with Feature Map Scaling for Text-independent Speaker Verification using Raw Waveforms},
  author={Jung, Jee-weon and Kim, Seung-bin and Shim, Hye-jin and Kim, Ju-ho and Yu, Ha-Jin},
  journal={Proc. Interspeech},
  pages={3583--3587},
  year={2020}
}
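
To illustrate the feature map scaling idea described above, here is a minimal NumPy sketch. It assumes the scale vector is derived as in squeeze-excitation (global average pooling over time, then a fully connected layer and a sigmoid) and is applied per filter by multiplication, addition, or both. The function name, the mode names, and the weights passed in are hypothetical; the PyTorch code in this repository is the authoritative version.

```python
import numpy as np


def feature_map_scaling(feature_map, fc_weight, fc_bias, mode="mul"):
    """Scale a (channels, time) feature map with one value per channel.

    fc_weight: (channels, channels) fully connected layer weight.
    fc_bias:   (channels,) fully connected layer bias.
    mode:      "mul", "add", or "add_mul" (add first, then multiply).
    """
    feature_map = np.asarray(feature_map, dtype=float)

    # Squeeze: summarise each channel (filter) by its average over time.
    pooled = feature_map.mean(axis=1)

    # Excite: a fully connected layer and a sigmoid give one scale
    # in (0, 1) for each channel.
    scale = 1.0 / (1.0 + np.exp(-(fc_weight @ pooled + fc_bias)))
    scale = scale[:, None]  # broadcast the scale over the time axis

    if mode == "mul":
        return feature_map * scale
    if mode == "add":
        return feature_map + scale
    if mode == "add_mul":
        return (feature_map + scale) * scale
    raise ValueError(f"unknown mode: {mode!r}")
```

In a trained network the fully connected layer is learned jointly with the rest of the model, so the scales emphasise or suppress individual filters depending on the input.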

RawNet

DNN-based speaker embedding extractor used with another DNN-based classifier
- Built on top of the authors' previous work on raw waveform speaker verification
  - ICASSP2018 and Interspeech2018
- EER 4.8% with cosine similarity back-end, 4.0% with proposed concat&mul back-end
  - VoxCeleb1 original trial
Implemented in Keras and PyTorch
Published as a conference paper in Interspeech 2019.
- RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification

@article{jung2019RawNet,
  title={RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification},
  author={Jung, Jee-weon and Heo, Hee-soo and Kim, Ju-ho and Shim, Hye-jin and Yu, Ha-jin},
  journal={Proc. Interspeech},
  pages={1268--1272},
  year={2019}
}
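
The cosine similarity back-end mentioned above scores a trial by the cosine of the angle between the enrollment and test embeddings. A minimal NumPy sketch follows; the function name `cosine_score` is an illustrative assumption, and the proposed concat&mul back-end is not shown here.

```python
import numpy as np


def cosine_score(enroll_embedding, test_embedding):
    """Return the cosine similarity between two speaker embeddings.

    The result lies in [-1, 1]; higher means the two utterances are
    more likely to come from the same speaker.
    """
    a = np.asarray(enroll_embedding, dtype=float).ravel()
    b = np.asarray(test_embedding, dtype=float).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

A verification decision is then made by comparing the score with a threshold tuned on held-out trials.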
