We provide a small toolkit to evaluate frame-level representations of MOS predictors. The task is to detect local distortions that were inserted into clean speech recordings. We use the intersection-based detection criterion [1, 2] from the sound event detection community for evaluation.
Install the toolkit directly from GitHub:

```bash
pip install git+https://github.com/fgnt/frame-level-mos.git
```

Alternatively, clone the repository and install it in editable mode:

```bash
git clone https://github.com/fgnt/frame-level-mos.git
cd frame-level-mos
pip install -e .
```
You can download two modified LibriTTS-R datasets with inserted local distortions from Hugging Face:

```python
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="kuhlmannm/is25-local-distortions",
    local_dir="is25_local_distortions",
    repo_type="dataset",
    ignore_patterns=["*.txt"],
)
```
You can download the following checkpoints from Hugging Face:

```python
from huggingface_hub import snapshot_download
```

Model | Repo ID | Command |
---|---|---|
SHEET SSL-MOS: Re+CNN | kuhlmannm/is25-ssl-mos-cnn | `snapshot_download(repo_id="kuhlmannm/is25-ssl-mos-cnn", local_dir="ssl_mos_cnn")` |
ChunkMOS+BLSTM | kuhlmannm/is25-chunk-mos-blstm | `snapshot_download(repo_id="kuhlmannm/is25-chunk-mos-blstm", local_dir="chunk_mos_blstm")` |
The evaluation consists of two steps:

1. For each speech signal to be evaluated, save its frame-level scores to a single file:

   ```bash
   python -m frame_level_mos.eval.write_frame_scores /path/to/downloaded/data/<dataset_name> /path/to/downloaded/model
   ```

   The frame-level scores will be saved under `/path/to/downloaded/model/<dataset_name>/scores`. See `python -m frame_level_mos.eval.write_frame_scores --help` for more options.

2. Compute the detection score by evaluating the frame-level scores against the ground truth:

   ```bash
   python -m frame_level_mos.eval.detect /path/to/downloaded/model/<dataset_name>
   ```

   This will print the intersection-based detection scores. By default, an intersection threshold of 0.5 is used; to use a different threshold, pass the `--dtc-threshold` option. See `python -m frame_level_mos.eval.detect --help` for more options.
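For example, assuming the download directories chosen above and with one of the dataset subdirectories substituted for `<dataset_name>`, the two steps could be scripted like this:

```python
import subprocess
import sys

# Directories as chosen in the download examples above; replace
# <dataset_name> with the actual dataset directory.
dataset = "is25_local_distortions/<dataset_name>"
model = "ssl_mos_cnn"

# Step 1: write the frame-level scores to <model>/<dataset_name>/scores.
subprocess.run(
    [sys.executable, "-m", "frame_level_mos.eval.write_frame_scores", dataset, model],
    check=True,
)

# Step 2: compute the intersection-based detection scores, here with a
# stricter intersection threshold than the default of 0.5.
subprocess.run(
    [
        sys.executable, "-m", "frame_level_mos.eval.detect",
        f"{model}/<dataset_name>", "--dtc-threshold", "0.7",
    ],
    check=True,
)
```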
The following intersection-based detection scores are obtained on the two datasets:

Model | DTC=GTC=0.5 | DTC=GTC=0.7 |
---|---|---|
SSL-MOS [3] | .693 | .555 |
SHEET SSL-MOS: Re+CNN | .748 | .634 |
ChunkMOS+BLSTM | .638 | .337 |

Model | DTC=GTC=0.5 | DTC=GTC=0.7 |
---|---|---|
SSL-MOS [3] | .703 | .562 |
SHEET SSL-MOS: Re+CNN | .748 | .625 |
ChunkMOS+BLSTM | .647 | .341 |
If you would like to use your own data, please use the following structure:

```
/data/root
|-- ground_truth.json
|-- audio_durations.json
|-- audio_files/
|   |-- audio1.wav
|   |-- audio2.wav
|   |-- ...
```
The audio files should be in WAV format. It is also possible to arrange the audio files in subdirectories.
The file `ground_truth.json` contains the start and stop times of each distortion in the following format:
```json
{
  "data": {
    "audio1": [[0.475, 1.105, "class2"], [2.42, 2.83, "class3"]],
    "audio2": [[3.49, 3.89, "class1"], [6.16, 6.56, "class2"], [7.11, 7.51, "class1"]],
    ...,
    "meta": {
      "perturbations": [
        "class1",
        "class2",
        "class3",
        ...
      ]
    }
  }
}
```
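Since JSON has no tuple type, each distortion is stored as a `[start, stop, class]` array. A minimal sketch for writing such a file (the annotations below are made up for illustration):

```python
import json

# Made-up annotations: [start_seconds, stop_seconds, class_label] per distortion.
ground_truth = {
    "data": {
        "audio1": [[0.475, 1.105, "class2"], [2.42, 2.83, "class3"]],
        "audio2": [[3.49, 3.89, "class1"]],
        "meta": {"perturbations": ["class1", "class2", "class3"]},
    }
}

with open("/data/root/ground_truth.json", "w") as fid:
    json.dump(ground_truth, fid, indent=2)
```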
The file `audio_durations.json` contains the duration of each audio file in seconds:
```json
{
  "data": {
    "audio1": 5.04,
    "audio2": 6.10,
    ...
  }
}
```
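One way to generate this file is to read the durations from the WAV headers, e.g. with the `soundfile` package (the data root below is a placeholder):

```python
import json
from pathlib import Path

import soundfile as sf

root = Path("/data/root")  # placeholder: your data root
durations = {
    # Keys are the file stems, matching the keys in ground_truth.json.
    wav_file.stem: sf.info(str(wav_file)).duration
    for wav_file in sorted((root / "audio_files").rglob("*.wav"))
}

with open(root / "audio_durations.json", "w") as fid:
    json.dump({"data": durations}, fid, indent=2)
```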
If you want to evaluate your own model, write a wrapper that inherits from `padertorch.Module`. Its `forward` method should take a batch of audio signals and their sequence lengths as input, and return the utterance-level predictions, the frame-level scores, and the sequence lengths of the frame-level scores.
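A minimal sketch of such a wrapper; the wrapped predictor, the hop size, and the mean pooling below are assumptions for illustration, not a prescribed interface:

```python
import padertorch as pt
import torch


class FrameMOSWrapper(pt.Module):
    """Hypothetical wrapper exposing frame-level scores of a MOS predictor."""

    def __init__(self, predictor, hop_size=320):
        super().__init__()
        self.predictor = predictor  # assumed: maps (batch, samples) to frame scores
        self.hop_size = hop_size  # assumed frame shift in samples

    def forward(self, audio, sequence_lengths):
        # audio: (batch, num_samples); sequence_lengths: valid samples per example.
        frame_scores = self.predictor(audio)  # assumed shape: (batch, num_frames)
        frame_sequence_lengths = [
            int(num_samples) // self.hop_size for num_samples in sequence_lengths
        ]
        # Utterance-level prediction, here simply the mean over the valid frames.
        utterance_scores = torch.stack([
            frame_scores[b, :num_frames].mean()
            for b, num_frames in enumerate(frame_sequence_lengths)
        ])
        return utterance_scores, frame_scores, frame_sequence_lengths
```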
Instead of using `frame_level_mos.eval.write_frame_scores`, you can also save the frame-level scores yourself. See the sed_scores_eval package for how to do this.
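If you go that route, a rough sketch using sed_scores_eval's `create_score_dataframe` and `write_sed_scores` helpers could look as follows; the frame shift, score values, class name, and output path are made up, and you should consult the sed_scores_eval documentation for the exact expected layout:

```python
import numpy as np
from sed_scores_eval.io import write_sed_scores
from sed_scores_eval.utils.scores import create_score_dataframe

frame_scores = np.random.rand(500)  # placeholder frame-level scores
frame_shift = 0.02  # assumed frame shift in seconds

# Timestamps delimit the frames, hence one more entry than there are frames.
timestamps = np.arange(len(frame_scores) + 1) * frame_shift

scores = create_score_dataframe(
    scores=frame_scores[:, None],  # (num_frames, num_classes)
    timestamps=timestamps,
    event_classes=["quality"],  # placeholder "class" name
)
write_sed_scores(scores, "scores/audio1.tsv")  # placeholder output path
```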
[1] C. Bilen, G. Ferroni, F. Tuveri, J. Azcarreta and S. Krstulovic, "A Framework for the Robust Evaluation of Sound Event Detection", in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2020.
[2] J. Ebbers, R. Serizel and R. Haeb-Umbach, "Threshold-Independent Evaluation of Sound Event Detection Scores", in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022.
[3] E. Cooper, W.-C. Huang, T. Toda and J. Yamagishi, "Generalization Ability of MOS Prediction Networks", in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022.
If you use this toolkit, please cite our paper:

```bibtex
@inproceedings{kuhlmann25_interspeech,
  title     = {{Towards Frame-level Quality Predictions of Synthetic Speech}},
  author    = {Michael Kuhlmann and Fritz Seebauer and Petra Wagner and Reinhold Haeb-Umbach},
  year      = {2025},
  booktitle = {Interspeech 2025},
  pages     = {2300--2304},
  doi       = {10.21437/Interspeech.2025-2190},
}
```
This research was funded by the Deutsche Forschungsgemeinschaft (DFG), project 446378607.