This repository provides the DECRO dataset introduced in the paper "Transferring Audio Deepfake Detection Capability across Languages" (accepted at TheWebConf 2023). The DEepfake CROss-lingual evaluation (DECRO) dataset is constructed to evaluate the influence of language differences on deepfake detection.
The latest DECRO dataset is available at https://zenodo.org/record/7603208.
If you use the DECRO dataset for deepfake detection, please use the following citation:
@inproceedings{ba2023transferring,
title={Transferring Audio Deepfake Detection Capability across Languages},
author={Ba, Zhongjie and Wen, Qing and Cheng, Peng and Wang, Yuwei and Lin, Feng and Lu, Li and Liu, Zhenguang},
booktitle={Proceedings of the ACM Web Conference 2023},
pages={2033--2044},
year={2023}
}
DECRO consists of two subsets: an English subset and a Chinese subset. Both contain bona-fide and spoofed speech samples and have almost the same total audio length. Most importantly, the spoofed speech in the two subsets is generated with the same types of synthesis algorithms, which excludes other interfering factors and allows an accurate measurement of the impact of language differences on detection accuracy.
There are 21218 bona-fide utterances in the Chinese subset and 12484 bona-fide utterances in the English subset.
For the Chinese part, we collect bona-fide data from six open-source recording datasets to guarantee the diversity of Chinese recordings: Aidatatang_200zh, Aishell1, Aishell3, freeST, MagicData, and Aishell2 (free for academic usage).
For the English part, bona-fide audio is collected from the ASVspoof2019 LA dataset and re-partitioned to fit our setting.
There are 41880 and 42800 spoofed utterances in the Chinese and English subsets, respectively. Some of the spoofed utterances come from public datasets; the rest are generated with commercial and open-source algorithms, covering both text-to-speech (TTS) and voice conversion (VC) techniques.
We collect samples from two public deepfake speech datasets: the English dataset WaveFake and the Chinese dataset FAD. The two datasets contain speech samples generated by the same synthesis algorithms, including HiFiGAN, Multiband-MelGAN, and PWG.
In addition, Tacotron, FastSpeech2, VITS, and Starganv2-vc are end-to-end synthesis algorithms that inherently support generating both Chinese and English. We collect the Tacotron English data from A10 in the ASVspoof2019 LA dataset and generate the others with the corresponding pre-trained models. Note that NVC-Net was originally proposed for English voice conversion; we retrain the model to generate Chinese speech. The Chinese samples of Baidu and Xunfei TTS come from the FMFCC-A dataset, and we synthesize the corresponding English samples via their online APIs. In the following sections, we refer to the above spoofing algorithms by their abbreviations.
The table below summarizes DECRO, including the number of bona-fide and spoofed utterances in each split.
|           | English Train | English Dev | English Eval | Chinese Train | Chinese Dev | Chinese Eval |
|-----------|---------------|-------------|--------------|---------------|-------------|--------------|
| Bona-fide | 5129          | 3049        | 4306         | 9000          | 6109        | 6109         |
| Spoofed   | 17412         | 10503       | 14884        | 17850         | 12015       | 12015        |
| Total     | 22541         | 13552       | 19190        | 26850         | 18124       | 18124        |
The distributed audio files are single-channel files in *.wav format. The corpus is split into train, dev, and eval subsets.
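For illustration, here is a minimal loading sketch, assuming the `soundfile` package and a hypothetical file path inside the extracted corpus:

```python
import soundfile as sf

# Hypothetical path; substitute any file from the extracted corpus.
audio, sample_rate = sf.read("DECRO/en/train/001.wav")

# soundfile returns a 1-D NumPy array for single-channel audio.
assert audio.ndim == 1, "expected single-channel audio"
print(f"{len(audio) / sample_rate:.2f} s at {sample_rate} Hz")
```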
All protocol files for the deepfake detection models are in ASCII format. Each line of a protocol file is formatted as follows (a short parsing sketch is given after the field descriptions):
SPEAKER_ID AUDIO_FILE_NAME - SYSTEM_ID KEY
where:
- SPEAKER_ID: ****, the speaker ID
- AUDIO_FILE_NAME: ****, the name of the audio file (no file extension, e.g. "001" for "001.wav")
- -: this column is not used
- SYSTEM_ID: abbreviation of the speech spoofing system; for bona-fide speech, SYSTEM_ID is left blank ('-')
- KEY: 'bonafide' for genuine speech, or 'spoof' for spoofed speech
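Below is a minimal parsing sketch for this layout. The protocol file name and the dictionary fields are illustrative assumptions, not part of the official release:

```python
from pathlib import Path

def load_protocol(path):
    """Parse lines of the form: SPEAKER_ID AUDIO_FILE_NAME - SYSTEM_ID KEY."""
    records = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue  # skip blank lines
        speaker_id, file_name, _, system_id, key = line.split()
        records.append({
            "speaker": speaker_id,
            "file": file_name + ".wav",  # protocol names omit the file extension
            "system": system_id,         # '-' for bona-fide utterances
            "label": key,                # 'bonafide' or 'spoof'
        })
    return records

# Hypothetical usage; the actual protocol file names may differ:
# records = load_protocol("en_train_protocol.txt")
```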
2023.2.10 -- 2024.1.1: Chinese and English data generated by more spoofing algorithms will be included.
[1] Aidatatang_200zh
@online{openslrDatatang,
author = {DataTang},
title = {aidatatang\_200zh, a free Chinese Mandarin speech corpus by Beijing DataTang Technology Co., Ltd (www.datatang.com)},
year = {2020},
howpublished = {\url{https://openslr.org/62/}},
note = {Online; accessed 08-Oct-2022}
}
[2] Aishell1
@inproceedings{bu2017aishell,
title={Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline},
author={Bu, Hui and Du, Jiayu and Na, Xingyu and Wu, Bengu and Zheng, Hao},
booktitle={2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA)},
pages={1--5},
year={2017},
organization={IEEE}
}
[3] Aishell3
@article{AISHELL-3_2020,
title={AISHELL-3: A Multi-speaker Mandarin TTS Corpus and the Baselines},
author={Shi, Yao and Bu, Hui and Xu, Xin and Zhang, Shaoji and Li, Ming},
journal={arXiv preprint arXiv:2010.11567},
year={2020}
}
[4] freeST
@online{openslrFreeST,
author = {{Surfing Technology Beijing Co., Ltd}},
title = {ST-CMDS-20170001\_1, Free ST Chinese Mandarin Corpus},
year = {2018},
howpublished = {\url{http://www.openslr.org/38/}},
note = {Online; accessed 08-Oct-2022}
}
[5] MagicData
@online{openslrMagicdata,
author = {{Magic Data Technology Co., Ltd}},
title = {MAGICDATA Mandarin Chinese Read Speech Corpus},
year = {2019},
howpublished = {\url{http://www.openslr.org/68/}},
note = {Online; accessed 08-Oct-2022}
}
[6] WaveFake
@inproceedings{frank2021wavefake,
title={WaveFake: A Data Set to Facilitate Audio Deepfake Detection},
author={Joel Frank and Lea Sch{\"o}nherr},
booktitle={Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2)},
year={2021},
url={https://openreview.net/forum?id=74TZg9gsO8W}
}
[7] FAD
@article{ma2022fad,
title={FAD: A Chinese Dataset for Fake Audio Detection},
author={Ma, Haoxin and Yi, Jiangyan and Wang, Chenglong and Yan, Xinrui and Tao, Jianhua and Wang, Tao and Wang, Shiming and Xu, Le and Fu, Ruibo},
journal={arXiv preprint arXiv:2207.12308},
year={2022}
}
[8] GST-Tacotron
@online{GST-Tacotron,
author = {KinglittleQ},
title = {GST-Tacotron},
year = {2018},
howpublished = {\url{https://github.com/KinglittleQ/GST-Tacotron}},
note = {Online; accessed 09-Oct-2022}
}
[9] FastSpeech2
@inproceedings{chien2021investigating,
title={Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech},
author={Chien, Chung-Ming and Lin, Jheng-Hao and Huang, Chien-yu and Hsu, Po-chun and Lee, Hung-yi},
booktitle={ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={8588--8592},
year={2021},
doi={10.1109/ICASSP39728.2021.9413880}
}
[10] VITS_chinese
@online{VITSch,
author = {UEhQZXI},
title = {vits\_chinese},
year = {2021},
howpublished = {\url{https://github.com/UEhQZXI/vits_chinese}},
note = {Online; accessed 09-Oct-2022}
}
[11] Starganv2-vc
@article{li2021starganv2,
title={Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion},
author={Li, Yinghao Aaron and Zare, Ali and Mesgarani, Nima},
journal={arXiv preprint arXiv:2107.10394},
year={2021}
}
[12] NVC-Net
@article{nguyen2021nvc,
title={NVC-Net: End-to-End Adversarial Voice Conversion},
author={Nguyen, Bac and Cardinaux, Fabien},
journal={arXiv preprint arXiv:2106.00992},
year={2021}
}
[13] FMFCC-A
@article{zhang2021FMFCCA,
title={{FMFCC-A}: {A} Challenging Mandarin Dataset for Synthetic Speech Detection},
author={Zhang, Zhenyu and Gu, Yewei and Yi, Xiaowei and Zhao, Xianfeng},
journal={arXiv preprint arXiv:2110.09441},
year={2021}
}
[14] ASVspoof2019
@article{todisco2019asvspoof,
title={{ASVspoof} 2019: Future Horizons in Spoofed and Fake Audio Detection},
author={Todisco, Massimiliano and Wang, Xin and Vestman, Ville and Sahidullah, Md and Delgado, Hector and Nautsch, Andreas and Yamagishi, Junichi and Evans, Nicholas and Kinnunen, Tomi and Lee, Kong Aik},
journal={arXiv preprint arXiv:1904.05441},
year={2019}
}
The DECRO dataset is released under the CC-BY-4.0 license.
Contact: Qing Wen, Zhejiang University ([email protected])