This project uses active learning to train a speech spoofing countermeasure (CM).
arXiv link: https://arxiv.org/abs/2203.14553

Xin Wang and Junichi Yamagishi. Investigating Active-Learning-Based Training Data Selection for Speech Spoofing Countermeasure. In Proc. SLT, accepted, 2023.
```bibtex
@inproceedings{Wang2023,
  author = {Wang, Xin and Yamagishi, Junichi},
  booktitle = {Proc. SLT},
  pages = {accepted},
  title = {{Investigating Active-learning-based Training Data Selection for Speech Spoofing Countermeasure}},
  year = {2023}
}
```
The ideas are not straightforward to implement, and it is complicated to prepare off-the-shelf scripts for all kinds of data sets.
Hence, this project demonstrates the training and scoring process using a toy data set.
The data sets and resources used in the experiments of the paper are on Zenodo. Some data sets cannot be fully re-distributed through this repository, but we provide links to the original repositories.

We also provide the audio files selected by the two best AL systems. Please check `selected_files` in the tar package on Zenodo.

If you need to run it on your own databases, check the steps required below. Apologies if the scripts take you some time to set up :)
No need to set up anything, just run:

```bash
bash 00_demo.sh model-AL-NegE config_AL_train_toyset 01
```
Here,

- `model-AL-NegE` is the model name. It can be any of the other models listed below.
- `config_AL_train_toyset` is the name of the prepared toy data set configuration.
- `01` is a random seed.
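For example, to train the `model-AL-Adv` variant (AL-Adv in the paper) on the same toy set with another seed (the seed value `02` is arbitrary):

```bash
bash 00_demo.sh model-AL-Adv config_AL_train_toyset 02
```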
The script will

- Download a toy data set, an SSL front end, and a seed CM model.
- Build a conda environment if it is not available.
- Train the CM `model-AL-NegE` on the toy set specified in `config_AL_train_toyset` with random seed `01`.
- Score the evaluation data in the toy set (which is specified in `00_demo.sh`).
Notes:

- The demonstration script uses the toy data set as both the seed training set and the pool.
- To run the experiment on your own data, please check the step-by-step instructions below.
- We do not compute EER from the CM scores. See the tutorial on how to compute EER (a minimal sketch is also given after these notes).
- The score file contains three fields. Please use the `CM_SCORE` column to compute EER.

```
TRIAL_NAME   CM_SCORE   CONFIDENCE_SCORE
LA_E_9933162 -9.500082  -4.499880
```
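Since EER is not computed by this repository, here is a minimal sketch of how one might compute it from the `CM_SCORE` column. The score and protocol file names below are placeholders, and the labels are assumed to be the strings `bonafide` and `spoof` in the last column of the protocol:

```python
import numpy as np

def compute_eer(bonafide_scores, spoof_scores):
    """Equal error rate, assuming a higher CM score means more bonafide-like."""
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    # false rejection rate: bonafide trials scored below each threshold
    frr = np.array([np.mean(bonafide_scores < t) for t in thresholds])
    # false acceptance rate: spoof trials scored at/above each threshold
    far = np.array([np.mean(spoof_scores >= t) for t in thresholds])
    idx = np.argmin(np.abs(frr - far))      # crossing point of FRR and FAR
    return (frr[idx] + far[idx]) / 2.0

# parse the score file (TRIAL_NAME CM_SCORE CONFIDENCE_SCORE)
scores = {}
with open('log_eval_cycle_000_score.txt') as f:    # placeholder file name
    for line in f:
        fields = line.split()
        if len(fields) != 3 or fields[0] == 'TRIAL_NAME':
            continue                               # skip a possible header
        scores[fields[0]] = float(fields[1])

# parse the protocol (SPEAKER TRIAL ... LABEL, label in the last column)
labels = {}
with open('protocol.txt') as f:
    for line in f:
        fields = line.split()
        labels[fields[1]] = fields[-1]

bona = np.array([s for t, s in scores.items() if labels[t] == 'bonafide'])
spoof = np.array([s for t, s in scores.items() if labels[t] == 'spoof'])
print('EER: {:.3f}%'.format(compute_eer(bona, spoof) * 100))
```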
```
|- 00_demo.sh: demonstration script
|- 01_download.sh: script to download toy data and seed models
|- 01_train.sh: script to start model training
|- 02_score.sh: script to score waveforms
|
|- main.py: main entry of python code
|- config_AL_train_toyset.py: configuration for the toy data set
|- config_auto.py: general config for scoring
|      (no need to change it)
|
|- model-AL-NegE: AL-NegE CM model in the paper
|  |- model.py: CM definition
|  |- config_AL_train_toyset: working folder when using the toy
|  |      data set
|  |= 01: running with random seed = 01
|  |
|  |= trained_network_al_cycle_xxx.pt: trained network
|  |      after xxx cycles
|  |= epoch_al_cycle_xxx_epoch_yyy.pt: trained network
|  |      with intermediate statistics after xxx cycles
|  |= cache_al_data_log_xxx.txt: cache that shows the
|  |      data selected in each cycle
|  |= log_..._cycle_xxx_NNN: raw output file for test set NNN
|  |      after xxx active learning cycles
|  |= log_..._cycle_xxx_NNN_err: raw code error messages
|  |= log_..._cycle_xxx_NNN_score.txt: scores printed in CSV format
|  |      for EER computation
|  |= log_train: log of model training
|  |= log_train_err: code error messages during training
|  |= NNN.dic: cache of data length (temporary files)
|
|- model-AL-Adv: AL-Adv in the paper
|- model-AL-Pas: AL-Pas in the paper
|- model-AL-PosE: AL-PosE in the paper
|- model-AL-Rem: AL-Rem in the paper
|
|- seed_model: folder to download the CM pre-trained on the seed
|      training set
|- SSL_pretrained: folder to download the SSL model
|- DATA: folder to save the toy data set
```
Files or folders marked with `=` will be produced after running the demo script.
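After the demo finishes, you can inspect the produced files, for example (the cycle index `000` is just an example, assuming the cycle index fills the `xxx` part of the file names):

```bash
# list the networks saved after each active learning cycle
ls model-AL-NegE/config_AL_train_toyset/01/trained_network_al_cycle_*.pt

# show which data were selected in, e.g., the first cycle
cat model-AL-NegE/config_AL_train_toyset/01/cache_al_data_log_000.txt
```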
When running `bash 00_demo.sh model-AL-NegE config_AL_train_toyset 01`, the demonstration script `00_demo.sh` will do the following (a rough manual equivalent is sketched after this list):

- Prepare the working folder
  - Copy `main.py` to `model-AL-NegE/config_AL_train_toyset/01`
  - Copy `config_AL_train_toyset.py` to `model-AL-NegE/config_AL_train_toyset/01`
  - Copy `model-AL-NegE/model.py` to `model-AL-NegE/config_AL_train_toyset/01`
- Call `01_train.sh` to start training in `model-AL-NegE/config_AL_train_toyset/01`
  - The Python command line in `01_train.sh` is called
  - The CM is trained for multiple active learning cycles (Figure 1 above)
  - The trained CM after each cycle is saved as `model-AL-NegE/config_AL_train_toyset/01/trained_network_al_cycle_NNN.pt`
- Call `02_score.sh` to score the toy data set
  - For each cycle's `model-AL-NegE/config_AL_train_toyset/01/trained_network_al_cycle_NNN.pt`, score the test set
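If you want to run these steps manually instead of through `00_demo.sh`, the flow is roughly the sketch below. Only the folder preparation is spelled out; the exact arguments that `01_train.sh` and `02_score.sh` expect should be taken from `00_demo.sh` itself:

```bash
# rough manual equivalent of what 00_demo.sh automates
MODEL=model-AL-NegE
CONFIG=config_AL_train_toyset
SEED=01
WORKDIR=${MODEL}/${CONFIG}/${SEED}

# step 1: prepare the working folder
mkdir -p ${WORKDIR}
cp main.py ${CONFIG}.py ${WORKDIR}
cp ${MODEL}/model.py ${WORKDIR}

# step 2: training -- 01_train.sh wraps the python command line;
# check 00_demo.sh for the exact arguments it passes
# bash 01_train.sh ...

# step 3: scoring -- 02_score.sh scores trained_network_al_cycle_NNN.pt
# for each cycle; again, check 00_demo.sh for the exact invocation
# bash 02_score.sh ...
```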
Follow `DATA/toy_example` and prepare the data for a seed training set and a pool data set. For each set, you need

```
DATA/SEEDSET/
|- train_dev: folder to save the trn. and dev. set waveforms
|- eval: folder to save the eval. set waveforms
|- protocol.txt: protocol file
|      each line contains SPEAKER TRIAL ... LABEL
|- scp
|  |- train.lst: list of file names in the trn. set
|  |- val.lst: list of file names in the dev. set
|  |- test.lst: list of file names in the eval. set
|      (just file names w/o extension)
```
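As a minimal sketch, assuming your waveforms are already converted and the set is called `MYSET` (a placeholder name), the folder layout and the `*.lst` files can be prepared like this:

```bash
# create the expected folder layout (MYSET is a placeholder name)
mkdir -p DATA/MYSET/train_dev DATA/MYSET/eval DATA/MYSET/scp

# after copying waveforms into train_dev/ and eval/, list the file
# names without extension; val.lst is typically a held-out subset
# of train_dev, so split train.lst as you see fit
ls DATA/MYSET/train_dev | sed 's/\.wav$//' > DATA/MYSET/scp/train.lst
ls DATA/MYSET/eval      | sed 's/\.wav$//' > DATA/MYSET/scp/test.lst
```

Remember to also write `protocol.txt`, with one line per trial (SPEAKER TRIAL ... LABEL).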
Note that

- If the pool or seed set contains multiple subsets, just prepare each subset in the same manner as above.
- The names of the `*.lst` files can be anything other than `train`, `val`, or `test`. We will tell the script which lst files to use in the next step.
Create a `config_AL_NNN.py` based on `config_AL_train_toyset.py` (a hypothetical fragment is sketched below).
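The exact variable names to edit are those in `config_AL_train_toyset.py`; the fragment below only illustrates the kind of entries that point at your own seed and pool sets. All names and paths here are hypothetical, so copy and edit the toy config rather than writing one from scratch:

```python
# hypothetical fragment of config_AL_NNN.py (names/paths are placeholders;
# follow config_AL_train_toyset.py for the actual variable names)

# seed training / development sets
trn_set_name = ['myset_train']
val_set_name = ['myset_val']
trn_list = ['DATA/MYSET/scp/train.lst']   # trn. file names (w/o extension)
val_list = ['DATA/MYSET/scp/val.lst']     # dev. file names
input_dirs = [['DATA/MYSET/train_dev']]   # folders holding the waveforms

# the pool set from which active learning selects new data is declared
# in a similar way; see config_AL_train_toyset.py for the exact entries
```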
Run

```bash
bash 00_demo.sh MODEL config_AL_NNN RAND_SEED
```

The trained CM from each active learning cycle will be saved to `MODEL/config_AL_NNN/RAND_SEED`.
That's all