This repository hosts the models for the paper "Analysis of XLS-R for Speech Quality Assessment".
Comparison of model performance on each unseen corpus individually (NISQA, IUB) and on both corpora combined (Unseen). The metric is RMSE; lower is better.
V1 Results
Model | NISQA | IUB | Unseen |
---|---|---|---|
XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
MFCC Transformer | 0.8280 | 0.7775 | 0.7924 |
XLS-R 300M Layer5 Transformer | 0.6256 | 0.5049 | 0.5425 |
XLS-R 300M Layer21 Transformer | 0.5694 | 0.5025 | 0.5227 |
XLS-R 300M Layer5+21 Transformer | 0.5683 | 0.4886 | 0.5129 |
XLS-R 1B Layer10 Transformer | 0.5456 | 0.5815 | 0.5713 |
XLS-R 1B Layer41 Transformer | 0.5657 | 0.4656 | 0.4966 |
XLS-R 1B Layer10+41 Transformer | 0.5748 | 0.5288 | 0.5425 |
XLS-R 2B Layer10 Transformer | 0.6277 | 0.4899 | 0.5334 |
XLS-R 2B Layer41 Transformer | 0.5724 | 0.4897 | 0.5150 |
XLS-R 2B Layer10+41 Transformer | 0.6036 | 0.4743 | 0.5150 |
Human | 0.6738 | 0.6573 | 0.6629 |
V2 Results
UPDATE: the code has been updated to use version 2 of the models. Version 1 used the final model checkpoint by mistake; version 2 uses the checkpoint with the minimum validation loss.
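The difference between the two versions comes down to which checkpoint is kept. A minimal sketch of that selection, assuming one validation loss is recorded per epoch; the `history` values and the helper name `select_best_checkpoint` are hypothetical and not taken from this repository:

```python
from typing import Sequence


def select_best_checkpoint(val_losses: Sequence[float]) -> int:
    """Return the index of the epoch with the lowest validation loss."""
    if not val_losses:
        raise ValueError("val_losses must not be empty")
    return min(range(len(val_losses)), key=lambda i: val_losses[i])


# Hypothetical per-epoch validation losses, for illustration only.
history = [0.62, 0.48, 0.41, 0.44, 0.47]

final_epoch = len(history) - 1                # the checkpoint version 1 used
best_epoch = select_best_checkpoint(history)  # the checkpoint version 2 uses
print(f"final epoch: {final_epoch}, best epoch: {best_epoch}")
```

When the validation loss rises again late in training, as in this toy history, the two choices point to different checkpoints.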
Model | NISQA | IUB | Unseen |
---|---|---|---|
XLS-R 300M Layer24 Bi-LSTM [1] | 0.5907 | 0.5067 | 0.5323 |
DNSMOS [2] | 0.8718 | 0.5452 | 0.6565 |
MFCC Transformer | 0.9291 | 0.7415 | 0.8003 |
XLS-R 300M Layer5 Transformer | 0.6494 | 0.5117 | 0.5550 |
XLS-R 300M Layer21 Transformer | 0.5852 | 0.4838 | 0.5152 |
XLS-R 300M Layer5+21 Transformer | 0.5861 | 0.4768 | 0.5108 |
XLS-R 1B Layer10 Transformer | 0.6217 | 0.4763 | 0.5225 |
XLS-R 1B Layer41 Transformer | 0.5615 | 0.4646 | 0.4946 |
XLS-R 1B Layer10+41 Transformer | 0.6024 | 0.4624 | 0.5068 |
XLS-R 2B Layer10 Transformer | 0.5227 | 0.4447 | 0.4686 |
XLS-R 2B Layer41 Transformer | 0.5295 | 0.4926 | 0.5035 |
XLS-R 2B Layer10+41 Transformer | 0.5191 | 0.4573 | 0.4760 |
Human | 0.6738 | 0.6573 | 0.6629 |
[1] Tamm, B., Balabin, H., Vandenberghe, R., Van hamme, H. (2022) Pre-trained Speech Representations as Feature Extractors for Speech Quality Assessment in Online Conferencing Applications. Proc. Interspeech 2022, 4083-4087, doi: 10.21437/Interspeech.2022-10147
[2] C. K. A. Reddy, V. Gopal and R. Cutler, "DNSMOS: A Non-Intrusive Perceptual Objective Speech Quality Metric to Evaluate Noise Suppressors," ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 2021, pp. 6493-6497, doi: 10.1109/ICASSP39728.2021.9414878.
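For reference, the RMSE metric used in the tables can be sketched as follows. The MOS labels and predictions are made-up values, not data from NISQA or IUB, and the sketch assumes that the "Unseen" column pools the samples of both corpora before computing the error, which is one reading of "combined":

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between ground-truth MOS and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical MOS labels and predictions for two corpora.
mos_a, pred_a = [4.2, 3.1, 1.8], [3.9, 3.4, 2.3]
mos_b, pred_b = [2.5, 4.6], [2.7, 4.1]

rmse_a = rmse(mos_a, pred_a)
rmse_b = rmse(mos_b, pred_b)
# Pooled RMSE: concatenate the samples first, then compute the error.
rmse_pooled = rmse(mos_a + mos_b, pred_a + pred_b)

print(f"A: {rmse_a:.4f}  B: {rmse_b:.4f}  pooled: {rmse_pooled:.4f}")
```

Under this pooling assumption the combined value always lies between the two per-corpus values, weighted by the number of samples in each corpus.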
MOS predictions on two unseen datasets: NISQA (top) and IU Bloomington (bottom). Our proposed model based on embeddings extracted from the 10th layer of the pre-trained XLS-R 2B outperforms DNSMOS and the MFCC baseline. The human ACRs are also visualized for the IUB corpus.
🔊 Excellent (MOS = 4.808)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-excellent.mp4 | DNSMOS | 3.699 | -1.109 |
iub-excellent.mp4 | MFCC Transformer | 3.497 | -1.311 |
iub-excellent.mp4 | XLS-R 2B Layer10 Transformer | 3.935 | -0.873 |
🔊 Good (MOS = 4.104)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-good.mp4 | DNSMOS | 3.269 | -0.835 |
iub-good.mp4 | MFCC Transformer | 2.498 | -1.606 |
iub-good.mp4 | XLS-R 2B Layer10 Transformer | 3.793 | -0.311 |
🔊 Fair (MOS = 3.168)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-fair.mp4 | DNSMOS | 3.309 | +0.141 |
iub-fair.mp4 | MFCC Transformer | 3.931 | +0.763 |
iub-fair.mp4 | XLS-R 2B Layer10 Transformer | 3.080 | -0.088 |
🔊 Poor (MOS = 2.240)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-poor.mp4 | DNSMOS | 2.704 | +0.464 |
iub-poor.mp4 | MFCC Transformer | 1.927 | -0.313 |
iub-poor.mp4 | XLS-R 2B Layer10 Transformer | 2.284 | +0.044 |
🔊 Bad (MOS = 1.416)
Audio Sample | Model | Prediction | Error |
---|---|---|---|
iub-bad.mp4 | DNSMOS | 2.553 | +1.137 |
iub-bad.mp4 | MFCC Transformer | 1.806 | +0.390 |
iub-bad.mp4 | XLS-R 2B Layer10 Transformer | 2.312 | +0.896 |
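The Error column in the sample tables is the signed difference between the prediction and the ground-truth MOS. The short sketch below recomputes it from the values listed above and reports which model comes closest on each sample; the variable names are illustrative only:

```python
# MOS labels and predictions copied from the five IUB sample tables.
samples = {
    "excellent": (4.808, {"DNSMOS": 3.699, "MFCC Transformer": 3.497,
                          "XLS-R 2B Layer10 Transformer": 3.935}),
    "good": (4.104, {"DNSMOS": 3.269, "MFCC Transformer": 2.498,
                     "XLS-R 2B Layer10 Transformer": 3.793}),
    "fair": (3.168, {"DNSMOS": 3.309, "MFCC Transformer": 3.931,
                     "XLS-R 2B Layer10 Transformer": 3.080}),
    "poor": (2.240, {"DNSMOS": 2.704, "MFCC Transformer": 1.927,
                     "XLS-R 2B Layer10 Transformer": 2.284}),
    "bad": (1.416, {"DNSMOS": 2.553, "MFCC Transformer": 1.806,
                    "XLS-R 2B Layer10 Transformer": 2.312}),
}

all_errors = {}  # sample label -> {model name -> signed error}
closest = {}     # sample label -> model with the smallest absolute error

for label, (mos, predictions) in samples.items():
    # Signed error: negative means the model underestimates the quality.
    errors = {model: round(pred - mos, 3) for model, pred in predictions.items()}
    all_errors[label] = errors
    closest[label] = min(errors, key=lambda m: abs(errors[m]))
    print(f"{label}: {errors} -> closest: {closest[label]}")
```

A negative error means the model rates the sample lower than the human listeners did, and a positive error means it rates the sample higher.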
First, clone the repository.
git clone https://github.com/lcn-kul/xls-r-analysis-sqa.git
Next, install the requirements into a virtual environment of your choice.
cd xls-r-analysis-sqa/
pip3 install -r requirements.txt
Finally, this code uses truncated XLS-R models. These can be obtained by downloading them from our HuggingFace repositories (recommended, follow [these instructions]) or by downloading the full pre-trained models (follow [these instructions]) and running the script truncate_w2v2.py.
Warning: the truncated XLS-R models total 15 GB, and they take roughly twice that on disk because the .git directory is of a similar size.
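Given that size, it may be worth checking the free disk space before downloading. A minimal sketch using only the standard library; the 30 GB threshold is an assumption derived from the warning above (15 GB of models, doubled for the .git directory), and `has_enough_space` is a hypothetical helper that is not part of this repository:

```python
import shutil

# Assumed requirement: 15 GB of models, doubled for the .git directory.
REQUIRED_GB = 30


def has_enough_space(path: str = ".", required_gb: float = REQUIRED_GB) -> bool:
    """Return True if the filesystem containing `path` has enough free space."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024**3


if has_enough_space("."):
    print(f"At least {REQUIRED_GB} GB free, enough for the truncated models.")
else:
    print(f"Less than {REQUIRED_GB} GB free, consider another download location.")
```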
A working example is provided in test_e2e_sqa.py.
@INPROCEEDINGS{10248049,
author={Tamm, Bastiaan and Vandenberghe, Rik and Van Hamme, Hugo},
booktitle={2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
title={Analysis of XLS-R for Speech Quality Assessment},
year={2023},
volume={},
number={},
pages={1-5},
doi={10.1109/WASPAA58266.2023.10248049}
}