German STT model evaluation

In search of a "good" speech-to-text (STT) model for German, I evaluated all free (as in free beer, and open source) models.

tl;dr As of January 2022, NeMo-ASR's Conformer-Transducer model is the overall leader on GPU (WER 5.77 / CER 1.46), while the Jaco-Assistant/Scribosermo model is still a very good choice for CPU (WER 9.43 / CER 3.66).

Vendor / Architecture	Model	WER (%)	CER (%)	RTF	Comment
Jaco-Assistant / Scribosermo	full / Scorer: D37CV	9.43	3.66	0.078	CPU 8 cores
Jaco-Assistant / Scribosermo	quantized / Scorer: D37CV	9.51	3.70	0.096	CPU 8 cores
Mozilla DeepSpeech	deepspeech-german v0.9.0	27.93	11.36	0.209
Mozilla DeepSpeech	Polyglot	14.45	11.36	0.241
Silero	v4 large	18.98	6.67	0.009	RTF is not a typo
Wav2Vec	jonatasgrosman / wav2vec2-large-xlsr-53-german	10.87	2.68	0.06	Batchsize 1
Vosk	0.21	12.84	4.56	0.292
Nvidia NeMo-ASR	Conformer-CTC 1.5.0	7.39	1.80	0.064	GPU w/Apex-AMP
Nvidia NeMo-ASR	Conformer-Transducer 1.6.0	5.77	1.46	0.127	GPU w/Apex-AMP
Nvidia NeMo-ASR	Conformer-Transducer 1.5.0	6.20	1.62	0.124	GPU w/Apex-AMP
Nvidia NeMo-ASR	Citrinet-1024 1.5.0	8.24	2.32	0.069	GPU w/Apex-AMP
Nvidia NeMo-ASR	Contextnet-1024 1.4.0	6.68	1.77	0.098	GPU w/Apex-AMP
Nvidia NeMo-ASR	Quartznet-15x15 1.0.0rc1	13.23	3.53	0.064	GPU w/Apex-AMP

Conclusion

On GPU, NeMo-ASR's models lead the pack. The Conformer-Transducer model gives the best WER and CER; the Contextnet-1024 and Conformer-CTC models are runners-up, with still very good values and a better RTF than the Transducer model.
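
The accuracy/speed trade-off among the GPU models can be checked directly from the table. The following is a minimal sketch, not part of the original evaluation: it copies the WER and RTF values of the NeMo-ASR rows and lists the models that no other model beats on both metrics at once (a simple Pareto front). The names `GPU_RESULTS` and `pareto_front` are made up for this illustration.

```python
# (model, WER in %, RTF) copied from the NeMo-ASR rows of the results table.
GPU_RESULTS = [
    ("Conformer-CTC 1.5.0", 7.39, 0.064),
    ("Conformer-Transducer 1.6.0", 5.77, 0.127),
    ("Conformer-Transducer 1.5.0", 6.20, 0.124),
    ("Citrinet-1024 1.5.0", 8.24, 0.069),
    ("Contextnet-1024 1.4.0", 6.68, 0.098),
    ("Quartznet-15x15 1.0.0rc1", 13.23, 0.064),
]


def pareto_front(results):
    """Return the models that are not dominated on (WER, RTF); lower is better for both."""
    front = []
    for name, wer, rtf in results:
        # A model is dominated if another one is at least as good on both
        # metrics and strictly better on at least one of them.
        dominated = any(
            o_wer <= wer and o_rtf <= rtf and (o_wer < wer or o_rtf < rtf)
            for _, o_wer, o_rtf in results
        )
        if not dominated:
            front.append(name)
    return front


if __name__ == "__main__":
    for model in pareto_front(GPU_RESULTS):
        print(model)
```

With these numbers, both Conformer-Transducer versions, Contextnet-1024 and Conformer-CTC remain on the front, while Citrinet-1024 and Quartznet-15x15 are matched or beaten by Conformer-CTC on both metrics.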

On CPU, both Jaco-Assistant/Scribosermo models (full and quantized) give good WER/CER values and good performance. (Note: the Jaco website claims a WER of 7.5%, while I got "only" 9.4%.) Silero is blazing fast, but its WER of 19% makes it impractical for daily use.

Notes on methodology

Word error rate (WER) and character error rate (CER) were calculated with the PyPI package jiwer==2.2.0 on the Common Voice test dataset provided by Hugging Face (huggingface/common_voice/de/6.1.0, retrieved with the PyPI package datasets==1.13.3). The real-time factor (RTF) was calculated by running inference on the first 1,000 records of the same dataset. Pre- and post-processing times (loading audio files, sample rate conversion, normalizing results, etc.) were excluded.
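
To make the three metrics concrete, here is a dependency-free sketch of how they are commonly defined. It is not the evaluation code used for the table (that relied on jiwer), and its results may differ from jiwer's depending on text normalization and whitespace handling. It assumes that WER and CER are the total edit distance divided by the total reference length, and that RTF is inference time divided by audio duration. The helper names (`edit_distance`, `error_rate`, `real_time_factor`) and the `transcribe` callable are hypothetical.

```python
import time


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="word"):
    """Corpus-level error rate: total edits / total reference length.

    unit="word" gives WER, unit="char" gives CER (spaces count as characters here).
    """
    edits = 0
    length = 0
    for ref, hyp in zip(references, hypotheses):
        ref_units = ref.split() if unit == "word" else list(ref)
        hyp_units = hyp.split() if unit == "word" else list(hyp)
        edits += edit_distance(ref_units, hyp_units)
        length += len(ref_units)
    return edits / length


def real_time_factor(transcribe, clips):
    """RTF = inference time / audio duration.

    `clips` yields (audio, duration_in_seconds) pairs. Only the call to the
    hypothetical `transcribe` function is timed, so loading and resampling
    stay outside the measurement.
    """
    inference_seconds = 0.0
    audio_seconds = 0.0
    for audio, duration in clips:
        start = time.perf_counter()
        transcribe(audio)
        inference_seconds += time.perf_counter() - start
        audio_seconds += duration
    return inference_seconds / audio_seconds


if __name__ == "__main__":
    refs = ["das ist ein test"]
    hyps = ["das ist test"]
    print("WER:", error_rate(refs, hyps, unit="word"))
    print("CER:", error_rate(refs, hyps, unit="char"))
```

An RTF below 1 means the model transcribes faster than real time; the values in the table are fractions, while WER and CER are reported there as percentages.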

Evaluation was performed on an Nvidia Xavier AGX 32GB with JetPack 4.6, with MAXN mode and jetson-clocks enabled.

Like this page? Then don't be shy and click the star button. Thank you.

domcross/german-stt-evaluation
