This project contains a simple web interface to evaluate the performance of a TTS model on a given dataset and produce the Phoneme Error Rate (PER), Deepfake Detection Confidence (DDC), and Speaker Verification Score (SVS) metrics. The repository also contains several modern dockerized TTS models that can be easily built and run using the provided Dockerfiles.
- Phoneme Error Rate (PER): The PER measures the difference between the phoneme sequence derived from the speech generated by the TTS model and the ground-truth phoneme sequence. Since modern TTS models normally skip an explicit phoneme generation step, the phoneme sequence is obtained by first converting the generated speech to text with a Whisper model and then converting the text to phonemes. The PER is calculated as the Levenshtein distance between the two sequences. We use PER rather than WER because the phoneme sequence is more informative for TTS evaluation and partially mitigates the bias of the ASR model.
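Computing the PER is essentially an edit-distance calculation over phoneme symbols. Below is a minimal sketch, assuming both sequences are already available as lists of phoneme strings (the function names are illustrative, not the repository's actual API):

```python
def levenshtein(ref, hyp):
    """Edit distance between two phoneme sequences (lists of strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,            # deletion
                curr[j - 1] + 1,        # insertion
                prev[j - 1] + (r != h)  # substitution
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref_phonemes, hyp_phonemes):
    """PER = edit distance normalized by the reference length."""
    return levenshtein(ref_phonemes, hyp_phonemes) / max(len(ref_phonemes), 1)


# One substitution and one deletion over a 5-phoneme reference -> PER = 0.4
print(phoneme_error_rate(["n", "i", "h", "a", "o"], ["n", "i", "x", "a"]))
```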
- Speaker Verification Score (SVS): The SVS is calculated using iic/speech_eres2netv2_sv_zh-cn_16k-common. It measures the similarity between the generated speech and the ground-truth speech. Note that the model is trained on Chinese data, so the SVS may not be accurate for other languages.
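A minimal sketch of obtaining such a score, assuming the model loads through ModelScope's speaker-verification pipeline (the exact call signature and return fields may differ between ModelScope versions):

```python
from modelscope.pipelines import pipeline

# Load the ERes2NetV2 speaker verification model (trained on Chinese, 16 kHz audio).
sv = pipeline(task="speaker-verification",
              model="iic/speech_eres2netv2_sv_zh-cn_16k-common")

# Compare a ground-truth recording with a generated one; the result is expected
# to contain a similarity score (higher means more likely the same speaker).
result = sv(["ground_truth.wav", "generated.wav"])
print(result)
```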
- Deepfake Detection Confidence (DDC): The DDC measures the confidence of a deepfake detection model in classifying the generated speech as a deepfake. For open speech synthesis models, a detection model should be trained specifically to ensure distinguishability. Good deepfake detection models are still lacking for languages other than English. Models we experimented with include HyperMoon/wav2vec2-base-960h-finetuned-deepfake, abhishtagatya/wav2vec2-base-960h-itw-deepfake, Hemg/small-deepfake, motheecreator/Deepfake-audio-detection, MelodyMachine/Deepfake-audio-detection-V2, and DavidCombei/wavLM-base-DeepFake_UTCN. DavidCombei/wavLM-base-DeepFake_UTCN is the best model we have found, so it is used as the default model to calculate the DDC for now. Consider fine-tuning a new deepfake detection model on MLAAD.
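A minimal sketch of computing the DDC with the default model, assuming it loads through the standard Hugging Face audio-classification pipeline; the class label names are defined by the model itself and are not guaranteed to match the comment below:

```python
from transformers import pipeline

# Load the default deepfake detector as an audio classifier.
detector = pipeline("audio-classification",
                    model="DavidCombei/wavLM-base-DeepFake_UTCN")

# Returns a list of {"label": ..., "score": ...}; the DDC is the probability
# the model assigns to its "fake"/"spoof" class for the generated sample.
scores = detector("generated.wav")
print(scores)
```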
- Clone the repository.
- Create a conda environment using the provided environment.yml file, or install the required packages manually.

```bash
conda env create -f environment.yml
conda activate eval
```
- Run the following command to start the web interface:

```bash
python webUI.py
```
- Open the web interface in your browser; the UI should look like this:
- Upload a Custom Data Configuration File to the UI. The configuration file should be in JSON format, structured as follows:

```json
[
    {
        "audio_path": "path/to/audio.wav",
        "transcription": "你好,我是一个TTS模型。"
    }
    // Add more samples to the list as needed.
]
```
Ensure that any audio data you have in the format of `(array, sampling_rate)` is converted to `.wav` files before including them in the configuration file. Each entry in the JSON array represents a single sample, with `audio_path` pointing to the location of the `.wav` file and `transcription` containing the corresponding text.
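One way to do the conversion and build the configuration file at the same time, using the soundfile package (the `dataset` variable below is a placeholder for your own `(array, sampling_rate, text)` triples):

```python
import json
import os

import soundfile as sf

os.makedirs("data", exist_ok=True)
samples = []
for i, (audio, sampling_rate, text) in enumerate(dataset):  # dataset: your (array, rate, text) triples
    path = f"data/sample_{i:03d}.wav"
    sf.write(path, audio, sampling_rate)  # write the in-memory audio to a .wav file
    samples.append({"audio_path": path, "transcription": text})

# Dump the list in the format expected by the web interface.
with open("config.json", "w", encoding="utf-8") as f:
    json.dump(samples, f, ensure_ascii=False, indent=4)
```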
- Select the TTS function to generate the audio samples. By selecting `ground truth`, the ground-truth audio samples are used to calculate the metrics. We provide vits-fast-finetuning as an example. To add more TTS models, implement your custom TTS function in the `api.py` file. The function should have the following signature:

```python
def ur_fn(text: str) -> Tuple[np.ndarray, int]:
    """
    Args:
        text: The text to be synthesized.
    Returns:
        audio: The synthesized audio array.
        sampling_rate: The sampling rate of the audio.
    """
```
Then, add the function to the `TTS_FNs` dictionary in the `api.py` file:

```python
TTS_FNs = {call_vits_ft.__name__: call_vits_ft, ur_fn.__name__: ur_fn}
```
Now, you can select the TTS function from the dropdown menu in the UI. Then, click the Generate Audio from JSON button to generate the audio samples from the given text using the selected TTS model.
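For reference, here is a toy implementation that satisfies the expected signature; it only emits a sine tone instead of real speech and is meant purely for checking that your wiring works:

```python
from typing import Tuple

import numpy as np


def dummy_tts(text: str) -> Tuple[np.ndarray, int]:
    """Placeholder TTS function: a 440 Hz tone whose length scales with the text."""
    sampling_rate = 22050
    duration = max(len(text), 1) * 0.1  # 0.1 s per character, purely illustrative
    t = np.linspace(0, duration, int(sampling_rate * duration), endpoint=False)
    audio = (0.1 * np.sin(2 * np.pi * 440 * t)).astype(np.float32)
    return audio, sampling_rate
```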
- Select the Whisper model size to use for the PER calculation in the Enter ASR Whisper Size dropdown menu. The corresponding Whisper model will be downloaded automatically.
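For reference, a minimal sketch of the transcription step that feeds the PER calculation, assuming the openai-whisper package (the repository's actual integration may differ, and the phonemization step is not shown):

```python
import whisper

# "tiny", "base", "small", "medium", "large", ... are the standard size names.
model = whisper.load_model("base")
result = model.transcribe("generated.wav")
print(result["text"])  # the ASR hypothesis, later converted to phonemes for the PER
```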
- Click Calculate Metrics to calculate the PER, SVS, and DDC scores for the generated audio samples. Clicking the button automatically downloads the DavidCombei/wavLM-base-DeepFake_UTCN and iic/speech_eres2netv2_sv_zh-cn_16k-common models.
We provide several dockerized TTS models that can be easily built and run using the provided Dockerfiles. Running a container starts an API server that can be used to generate audio samples from text. By wrapping these API calls as TTS functions in the web interface, we can easily evaluate a TTS model's performance on a given dataset.
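A sketch of such a wrapper, assuming the container exposes an HTTP endpoint that takes text and returns a .wav payload; the endpoint path, request body, and response format below are purely illustrative and must be adapted to the actual server:

```python
from typing import Tuple
import io

import numpy as np
import requests
import soundfile as sf


def call_dockerized_tts(text: str) -> Tuple[np.ndarray, int]:
    """Wrap a dockerized TTS API server as a TTS function for the web interface."""
    # Hypothetical endpoint and payload; adapt to the actual server's API.
    resp = requests.post("http://localhost:6969/synthesize",
                         json={"text": text}, timeout=120)
    resp.raise_for_status()
    # Assumes the server responds with the raw bytes of a .wav file.
    audio, sampling_rate = sf.read(io.BytesIO(resp.content))
    return audio, sampling_rate
```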
- vits-fast-finetuning
  - go to the `model_runner\vits-ft` directory.
  - put your `G_latest.pth` and `modified_finetune_speaker.json` files (as named by the vits-ft project authors) in the `vits-ft` directory.
  - build the docker image: `docker build -t vits-ft .`
  - run the docker container: `docker run -p 6969:6969 vits-ft`
  - go to