This repository collects resources on audio deepfakes, together with a survey report on Audio Deepfake Detection (ADD). The sections on ADD Datasets, Audio Preprocessing, Feature Extraction and Network Training point beginners to carefully selected material for learning the ADD domain. We will endeavour to maintain this repository for a fixed period.
- Audio Large Model
- Datasets
- Audio Preprocessing
- Feature Extraction
- Network Training
- Reference
- Statement
- Contact

Model | Publisher | Years | Achievable Tasks |
---|---|---|---|
AudioLM (Paper, Website, Code) | | 2022.09 | 1. AudioLM maps the input audio to a sequence of discrete tokens and casts audio generation as a language modeling task in this representation space. 2. Speech continuation, acoustic generation, unconditional generation, generation without semantic tokens, and piano continuation. |
VALL-E (Paper, Website) | Microsoft | 2023.01 | 1. Synthesizes high-quality personalised speech from a 3-second enrolled recording of an unseen speaker. 2. VALL-E X: cross-lingual speech synthesis. |
USM (Website) | | 2023.03 | 1. ASR beyond 100 languages. 2. Downstream ASR tasks. 3. Automated Speech Translation (AST). |
SpeechGPT (Website) | Fudan University | 2023.05 | 1. Perceives and generates multi-modal content. 2. Spoken dialogue LLM with strong human instruction-following ability. |
Pengi (Paper, Website) | Microsoft | 2023.05 | 1. An Audio Language Model that leverages transfer learning by framing all audio tasks as text-generation tasks. 2. The unified architecture of Pengi enables open-ended and close-ended tasks without any additional fine-tuning or task-specific extensions. |
VoiceBox (Website) | Meta | 2023.06 | 1. Synthesize speech across six languages. 2. Remove transient noise. 3. Edit content. 4. Transfer audio style within and across languages. 5. Generate diverse speech samples. |
AudioPaLM (Paper, Website) | | 2023.06 | 1. Speech-to-speech translation. 2. Automatic Speech Recognition (ASR). |

Attack Types | Years | Dataset | Number of Audio Clips (Subset: Real/Fake) | Language |
---|---|---|---|---|
TTS | 2021 | WaveFake (Paper, Dataset) | 16283/117985 | English, Japanese |
TTS | 2021 | HAD (Paper) | 53612/107224 | Chinese |
TTS | 2022 | ADD 2022 (Paper) | LF: 5619/46067; PF: 5319/46419; FG-D: 5319/46206 | Chinese |
TTS | 2022 | CMFD (Paper, Dataset) | Chinese: 1800/1000; English: 1800/1000 | English, Chinese |
TTS | 2022 | In-the-Wild (Paper, Dataset) | 19963/11816 | English |
TTS | 2022 | FAD (Paper, Dataset) | 115800/115800 | Chinese |
Replay | 2017 | ASVspoof 2017 (Paper, Dataset) | 3565/14465 | English |
Replay | 2019 | ReMASC (Paper, Dataset) | 9240/45472 | English, Chinese, Hindi |
TTS and VC | 2015 | AVspoof (Paper, Dataset) | LA: 15504/120480; PA: 15504/14465 | English |
TTS and VC | 2015 | ASVspoof 2015 (Paper, Dataset) | 16651/246500 | English |
TTS and VC | 2021 | FMFCC-A (Paper, Dataset) | 10000/40000 | Chinese |
TTS and VC | 2022 | SceneFake (Paper, Dataset) | 19838/64642 | English |
TTS and VC | 2022 | EmoFake (Paper) | 35000/53200 | English, Chinese |
TTS and VC | 2023 | PartialSpoof (Paper, Dataset) | 12483/108978 | English |
TTS and VC | 2023 | ADD 2023 (Paper) | FG-D: 172819/113042; RL: 55468/65449; AR: 14907/95383 | Chinese |
TTS and VC | 2023 | DECRO (Paper, Dataset) | Chinese: 21218/41880; English: 12484/42799 | English, Chinese |
TTS, VC and Replay | 2019 | ASVspoof 2019 (Paper, Dataset) | LA: 12483/108978; PA: 28890/189540 | English |
TTS, VC and Replay | 2021 | ASVspoof 2021 (Paper, Dataset) | LA: 18452/163114; PA: 126630/816480; PF: 14869/519059 | English |

Dataset | Description |
---|---|
MUSAN (Dataset) | A corpus of music, speech and noise. |
RIR (Dataset) | A database of simulated and real room impulse responses, isotropic and point-source noises. All audio files are sampled at 16 kHz with 16-bit precision. |
NOIZEUS (Dataset) | Contains 30 IEEE sentences (produced by three male and three female speakers) corrupted by eight different real-world noises at different SNRs. Noises include suburban train noise, murmur, car, exhibition hall, restaurant, street, airport and train station noise. |
NoiseX-92 (Dataset) | All noises have a duration of 235 seconds and were acquired at a sampling rate of 19.98 kHz with a 16-bit analogue-to-digital converter (A/D), an anti-aliasing filter and no pre-emphasis stage. Fifteen noise types are included. |
DEMAND (Dataset) | Multi-channel acoustic noise database for diverse environments. |
ESC-50 (Dataset) | A labelled collection of 2000 environmental audio recordings obtained from clips on Freesound.org, suitable for environmental sound classification. The dataset consists of 5-second-long recordings organised into 5 broad categories, each with 10 subcategories (40 examples per subcategory). |
ESC (Dataset) | Includes ESC-50, ESC-10 and ESC-US. |
FSD50K (Dataset) | An open dataset of human-labelled sound events containing 51,197 Freesound clips totalling 108.3 hours of multi-labelled audio, unequally distributed across 200 classes from the AudioSet Ontology. |
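
Noise corpora such as those listed above are commonly used by mixing a noise clip into an utterance at a chosen signal-to-noise ratio (SNR). The following is a minimal sketch of that idea in plain Python, not the procedure of any specific paper or toolkit. The function names `rms` and `mix_at_snr`, the looping of short noise clips, and the synthetic tone and Gaussian noise are all assumptions made for illustration; a real pipeline would load waveforms from disk and would typically use an array library.

```python
import math
import random


def rms(signal):
    """Root-mean-square amplitude of a sequence of samples."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech`, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(speech):
        # Assumption: loop the noise clip when it is shorter than the utterance.
        repeats = len(speech) // len(noise) + 1
        noise = noise * repeats
    noise = noise[:len(speech)]
    speech_rms, noise_rms = rms(speech), rms(noise)
    if noise_rms == 0.0:
        return list(speech)
    # SNR_dB = 20 * log10(speech_rms / (gain * noise_rms)), solved for gain.
    gain = speech_rms / (noise_rms * 10 ** (snr_db / 20))
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy data: a 440 Hz tone standing in for speech, plus Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]
noisy = mix_at_snr(speech, noise, snr_db=15)
```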

Method | Description |
---|---|
SpecAugment (Paper, Code) | Augmentation strategies include time warping, frequency masking and time masking. |
WavAugment (Paper, Code) | Augmentation strategies include pitch randomization, reverberation, additive noise, time dropout (temporal masking), band reject and clipping. |
RawBoost (Paper, Code) | Augmentation strategies include linear and non-linear convolutive noise, impulsive signal-dependent additive noise and stationary signal-independent additive noise. |
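
To make the masking strategies above concrete, here is a minimal sketch of frequency masking and time masking on a spectrogram, in the spirit of SpecAugment. It is an illustration under stated assumptions, not the reference implementation: the function name `mask_spectrogram` and its parameters are hypothetical, the spectrogram is a plain list of lists (rows are frequency bins, columns are frames), masked regions are filled with zeros, only one mask of each kind is applied, and time warping is omitted.

```python
import random


def mask_spectrogram(spec, max_freq_width, max_time_width, rng=None):
    """Return a copy of `spec` (rows = frequency bins, columns = frames)
    with one random frequency band and one random time span set to zero."""
    rng = rng or random.Random()
    n_freq, n_time = len(spec), len(spec[0])
    out = [row[:] for row in spec]  # copy so the input stays untouched

    # Frequency mask: zero `f` consecutive bins starting at `f0`.
    f = rng.randint(0, min(max_freq_width, n_freq))
    f0 = rng.randint(0, n_freq - f)
    for i in range(f0, f0 + f):
        out[i] = [0.0] * n_time

    # Time mask: zero `t` consecutive frames starting at `t0`.
    t = rng.randint(0, min(max_time_width, n_time))
    t0 = rng.randint(0, n_time - t)
    for row in out:
        row[t0:t0 + t] = [0.0] * t

    return out


# Toy spectrogram: 10 frequency bins by 20 frames, all ones.
spec = [[1.0] * 20 for _ in range(10)]
augmented = mask_spectrogram(spec, max_freq_width=3, max_time_width=5,
                             rng=random.Random(0))
```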

Paper | Data Augmentation | Feature Extraction | Network Framework | Loss Function | EER (%) | t-DCF |
---|---|---|---|---|---|---|
Detecting spoofing attacks using VGG and SincNet: BUT-Omilia submission to ASVspoof 2019 challenge (Paper, Code) | - | CQT, Power Spectrum | VGG, SincNet | CE | LA: 8.01 (4); PA: 1.51 (2) | LA: 0.208 (4); PA: 0.037 (1) |
Long-term high frequency features for synthetic speech detection (Paper) | Cafe, White and Street Noise | ICQC, ICQCC, ICBC, ICLBC | DNN | CE | LA: 7.78 (3) | LA: 0.187 (3) |
Voice spoofing countermeasure for logical access attacks detection (Paper) | - | ELTP-LFCC | DBiLSTM | - | LA: 0.74 (1) | LA: 0.008 (1) |
Voice spoofing detector: A unified anti-spoofing framework (Paper) | - | ATP-GTCC | SVM | Hamming Distance | LA: 0.75 (2); PA: 1.00 (1) | LA: 0.050 (2); PA: 0.064 (2) |

Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
---|---|---|---|---|---|---|
Light convolutional neural network with feature genuinization for detection of synthetic speech attacks (Paper) | - | CQT-based LPS | LCNN | - | LA: 4.07 (11) | LA: 0.102 (10) |
Siamese convolutional neural network using Gaussian probability feature for spoofing speech detection (Paper) | - | LFCC | Siamese CNN | CE | LA: 3.79 (10); PA: 7.98 (5) | LA: 0.093 (5); PA: 0.195 (2) |
Generalization of audio deepfake detection (Paper) | RIR and MUSAN | LFB | ResNet18 | LCML | LA: 1.81 (4) | LA: 0.052 (4) |
Continual learning for fake audio detection (Paper) | - | LFCC | LCNN, DFWF | Similarity Loss | LA: 7.74 (15); PA: 8.85 (6) | - |
Partially-connected differentiable architecture search for deepfake and spoofing detection (Paper, Code) | Frequency Mask | LFCC | PC-DARTS | WCE | LA: 4.96 (12) | LA: 0.091 (8) |
One-class learning towards synthetic voice spoofing detection (Paper, Code) | - | LFCC | ResNet18 | OC-Softmax | LA: 2.19 (7) | LA: 0.059 (5) |
Replay and synthetic speech detection with Res2Net architecture (Paper, Code) | - | CQT | SE-Res2Net50 | BCE | LA: 2.50 (8); PA: 0.46 (2) | LA: 0.074 (7); PA: 0.012 (2) |
An empirical study on channel effects for synthetic voice spoofing countermeasure systems (Paper, Code) | Telephone Codecs, and Device/Room Impulse Responses (IRs) | LFCC | LCNN, ResNet-OC | OC-Softmax, CE | LA: 3.92 (10) | - |
Efficient attention branch network with combined loss function for automatic speaker verification spoof detection (Paper, Code) | SpecAug, Attention Mask | LFCC | EfficientNet-A0, SE-Res2Net50 | WCE, Triplet Loss | LA: 1.89 (6); PA: 0.86 (4) | LA: 0.507 (11); PA: 0.024 (4) |
Resmax: Detecting voice spoofing attacks with residual network and max feature map (Paper) | - | CQT | ResMax | BCE | LA: 2.19 (7); PA: 0.37 (1) | LA: 0.060 (6); PA: 0.009 (1) |
Synthetic voice detection and audio splicing detection using SE-Res2Net-Conformer architecture (Paper) | Adding noise at a signal-to-noise ratio of 15 dB or 25 dB | CQT | SE-Res2Net34-Conformer | CE | LA: 1.85 (5) | LA: 0.060 (6) |
Fastaudio: A learnable audio front-end for spoof speech detection (Paper, Code) | - | L-VQT | L-DenseNet | NLLLoss | LA: 1.54 (3) | LA: 0.045 (3) |
Learning from yourself: A self-distillation method for fake speech detection (Paper) | - | LPS, F0 | ECANet, SENet | A-Softmax | LA: 1.00 (2); PA: 0.65 (3) | LA: 0.031 (2); PA: 0.017 (3) |
How to boost anti-spoofing with x-vectors (Paper) | - | LFCC, MFCC | TDNN, SENet34 | LCML | LA: 0.83 (1) | LA: 0.024 (1) |

Paper | Data Augmentation | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
---|---|---|---|---|---|---|
A light convolutional GRU-RNN deep feature extractor for ASV spoofing detection (Paper) | - | LC-GRNN | PLDA | - | LA: 6.28 (13); PA: 2.23 | LA: 0.152 (10); PA: 0.061 |
Rw-resnet: A novel speech anti-spoofing model using raw waveform (Paper) | - | 1D Convolution Residual Block | ResNet | CE | LA: 2.98 (11) | LA: 0.082 (9) |
Raw differentiable architecture search for speech deepfake and spoofing detection (Paper, Code) | Masking Filter | Sinc Filter | PC-DARTS | P2SGrad | LA: 1.77 (10) | LA: 0.052 (7) |
Towards end-to-end synthetic speech detection (Paper, Code) | - | DNN | Res-TSSDNet, Inc-TSSDNet | WCE | LA: 1.64 (9) | LA: 0.048 (6) |
End-to-end anti-spoofing with RawNet2 (Paper, Code) | - | Sinc Filter | RawNet2 | CE | LA: 1.12 (5) | LA: 0.033 (3) |
Long-term variable Q transform: A novel time-frequency transform algorithm for synthetic speech detection (Paper) | - | FastAudio filter | X-vector, ECAPA-TDNN | - | LA: 1.54 (7) | LA: 0.045 (5) |
Fully automated end-to-end fake audio detection (Paper) | Sinc Filter | Wav2Vec2 | light-DARTS | Comparative loss | LA: 1.08 (4) | - |
Audio anti-spoofing using a simple attention module and joint optimization based on additive angular margin loss and meta-learning (Paper) | - | Sinc Filter | RawNet2, SimAM | AAM Softmax, MSE | LA: 0.99 (3) | LA: 0.029 (2) |
AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks (Paper, Code) | - | Sinc Filter | RawNet2, MGO, HS-GAL | CE | LA: 0.83 (2) | LA: 0.028 (1) |
AI-synthesized voice detection using neural vocoder artifacts (Paper, Code) | Resampling, Noise Addition | Sinc Filter | RawNet2 | CE, Softmax | LA: 4.54 (12) | - |
To-RawNet: Improving RawNet with TCN and orthogonal regularization for fake audio detection (Paper) | RawBoost | Sinc Filter | RawNet2, TCN | CE, Orthogonal Loss | LA: 1.58 (8) | - |
Speaker-Aware Anti-spoofing (Paper) | - | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.13 (6) | LA: 0.038 (4) |
Spoofing attacker also benefits from self-supervised pretrained model (Paper) | - | HuBERT, WavLM | Residual block, Conv-TasNet | AAM softmax | LA: 0.44 (1) | - |

Paper | Feature Extraction | Network Structure | Loss Function | EER (%) |
---|---|---|---|---|
Voice spoofing countermeasure for synthetic speech detection (Paper) | GTCC, MFCC, Spectral Flux, Spectral Centroid | Bi-LSTM | - | LA: 3.05 (4) |
Combining automatic speaker verification and prosody analysis for synthetic speech detection (Paper) | MFCC, Mel-Spectrogram | ECAPA-TDNN, Prosody Encoder | BCE | LA: 5.39 (5) |
Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation (Paper) | Sinc Filter, Wav2Vec2 | AASIST | Contrastive Loss, WCE | - |
Overlapped frequency-distributed network: Frequency-aware voice spoofing countermeasure (Paper) | Mel-Spectrogram, CQT | LCNN, ResNet | - | LA: 1.35 (2); PA: 0.35 |
Detection of cross-dataset fake audio based on prosodic and pronunciation features (Paper) | Phoneme Feature, Prosody Feature, Wav2Vec2 | LCNN, Bi-LSTM | CTC | LA: 1.58 (3) |
Betray oneself: A novel audio deepfake detection model via mono-to-stereo conversion (Paper, Code) | Sinc Filter | AASIST, M2S Converter | CE | LA: 1.34 (1) |

Paper | Feature Extraction | Network Structure | Loss Function | EER (%) | t-DCF |
---|---|---|---|---|---|
Multi-task learning in utterance-level and segmental-level spoof detection (Paper) | LFCC | SELCNN, Bi-LSTM | P2SGrad | - | - |
SA-SASV: An end-to-end spoof-aggregated spoofing-aware speaker verification system (Paper, Code) | Fbanks, Sinc Filter | ECAPA-TDNN, ARawNet | BCE, AAM Softmax, CE | LA: 4.86 (4) | - |
STATNet: Spectral and temporal features based multi-task network for audio spoofing detection (Paper) | Sinc Filter | RawNet2, TCM, SCM | CE | LA: 2.45 (3) | LA: 0.062 (2) |
A probabilistic fusion framework for spoofing aware speaker verification (Paper, Code) | Mel Filter, Sinc Filter | ECAPA-TDNN, AASIST | BCE | LA: 1.53 (2) | - |
DSVAE: Interpretable disentangled representation for synthetic speech detection (Paper) | Spectrogram | VAE | KL Divergence Loss, BCE | LA: 6.56 (5) | - |
End-to-end dual-branch network towards synthetic speech detection (Paper, Code) | LFCC, CQT | Dual-Branch Network | Classification Loss, Fake Type Classification Loss | LA: 0.80 (1) | LA: 0.021 (1) |
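
The results tables report the equal error rate (EER), the operating point at which the false acceptance rate equals the false rejection rate. The sketch below shows one simple way to approximate it from detector scores. It is an illustration only, not the official ASVspoof scoring tool: the function name `compute_eer`, the convention that higher scores mean "bona fide", and the toy scores are assumptions, and t-DCF is not covered.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the equal error rate, assuming higher scores mean 'bona fide'.

    Sweeps every observed score as a threshold and returns the average of the
    false acceptance and false rejection rates where the two are closest.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for thr in thresholds:
        # Spoofed trials scoring at or above the threshold are wrongly accepted.
        far = sum(s >= thr for s in spoof_scores) / len(spoof_scores)
        # Bona fide trials scoring below the threshold are wrongly rejected.
        frr = sum(s < thr for s in bonafide_scores) / len(bonafide_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer


# Toy scores: one bona fide trial scores low and one spoofed trial scores high.
bonafide = [0.9, 0.8, 0.7, 0.3]
spoof = [0.1, 0.2, 0.4, 0.6]
print(f"EER: {100 * compute_eer(bonafide, spoof):.2f}%")  # EER: 25.00%
```

The quadratic threshold sweep keeps the logic easy to follow. For large evaluation sets a sorted, cumulative-count formulation would be the more usual choice.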
For more details on the above, please refer to the following papers:
The purpose of this project is to build a collection of resources on audio deepfake detection, solely for communication and learning. All content collected here is sourced from journals and the internet, and we sincerely thank the researchers and authors who have published the related work. In the event of a copyright infringement complaint, the content concerned will be removed as appropriate.
We are glad to hear from you. If you have any questions, please feel free to contact [email protected].