Quality Benchmarks Between audiotok / webrtcvad / pyannote-audio / silero-vad #604
Thanks for this comparison, though I do not think these are actually fair.
From https://github.com/snakers4/silero-models/wiki/Model-Adaptation: isn't that what you just did in this benchmark? ;-)
The pyannote.audio pretrained model you used (this is just a wild guess, because I cannot seem to find the benchmark script) is trained to process 2s chunks but still outputs a score every few ms. Or did you use the pipeline (and not just the model)?
pyannote.audio models can run on GPU or CPU: https://github.com/pyannote/pyannote-audio/tree/master/tutorials/pretrained/model
Sorry, I did not understand -- could you provide a link to the actual benchmark script?
This tells me (wild guess again) that you used the pipeline and not the model -- the model outputs VAD scores, while the pipeline outputs hard decisions (see the sketch below). Anyway, I'd recommend you use this model, which gives you a nicer API for processing short chunks.
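To make the scores-vs-decisions distinction concrete, here is a minimal, library-agnostic sketch of the binarization step a pipeline performs on top of raw frame scores; the function, thresholds and frame hop are illustrative, not pyannote.audio's actual API:

```python
def binarize(scores, onset=0.6, offset=0.4, step=0.01):
    """Turn per-frame VAD scores into (start, end) speech segments in seconds.

    Hysteresis thresholding: speech starts when the score rises above
    `onset` and ends when it drops below `offset`. `step` is the frame
    hop in seconds. All values here are illustrative defaults.
    """
    segments, start, active = [], 0.0, False
    for i, score in enumerate(scores):
        t = i * step
        if not active and score > onset:
            start, active = t, True
        elif active and score < offset:
            segments.append((start, t))
            active = False
    if active:  # close a segment still open at the end of the file
        segments.append((start, len(scores) * step))
    return segments

# e.g. binarize([0.1, 0.7, 0.9, 0.8, 0.3, 0.1]) -> [(0.01, 0.04)]
```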
Hi,
There are not that many public voice detectors. In any case, I am not quite sure what the point of these "particular" solutions is anyway (in addition to the fact that they depend on datasets behind the LDC paywall). Compared to STT, voice detection is not such a difficult task that you cannot go general.
We just used this pipeline.
We had some issues running it on CPU.
Sorry for missing those. I assumed that all of the pre-trained models were presented here. This model and this model are similar, but trained on different datasets, right?
The script is not a problem.
FYI, I just made public a preprint in which I use my own (probably unfair) benchmark -- comparing ...
Hi @hbredin,
Many thanks for the shout-out.
I am not sure how precision- and recall-like metrics can be calculated on 5s chunks (I have not found a more in-depth explanation in the paper of how these rates are calculated - please correct me here). Also, given the amount of compute used on these small datasets, it is nice that our models behave decently on some academic benchmarks - quite probably while being applied improperly, since the thresholds do require a bit of tuning beyond the default examples.
In our VAD we are also guilty of this (though we used standard thresholds in the tests), but our model is general, i.e. not tuned on any of these datasets. Also note that we have added adaptive algorithms to eliminate threshold searching, but we have not yet published examples.
This makes the models much less useful in real life / production, given that these "first" models in the pipeline should definitely be applicable in a streaming fashion, i.e. be very fast / small / stateless. We also apply our STT models to "large" (1-7s) chunks of audio by design, but the "first" model (i.e. the VAD) should be able to work on short chunks of audio.
Typically, in most applications, you can either record 2 channels (telephony), or you have a complex scenario with a lot of background noise where you have to do noise suppression, use directional microphones, etc. - which in 95% of cases, given the attitude of business people, makes projects non-viable.
This is my personal opinion, but supporting Hugging Face is a very questionable idea. TLDR - while I understand that most models are published under permissive licenses, they mostly thrive on regurgitating the billions of dollars invested by Google / Amazon into training huge models, without bringing real added value to the table. I understand that re-implementing Google's code in PyTorch is a boring endeavor, but I find it hilarious that they then re-sell the IP "borrowed" from Google and others to some corporations.
Idk, this is kind of ridiculous for such small datasets and models.
By the way, why didn't you add WebRTC to this benchmark? It used to be by far the most popular VAD out there, and Python bindings are also available.
Instruments
We have compared 4 easy-to-use off-the-shelf instruments for voice activity / audio activity detection with off-the-shelf parameters (we did not hack / rebuild / retrain the tools and just used them as-is):
- pyannote - https://github.com/pyannote/pyannote-audio-hub#speech-activity-detection

Caveats
- audiotok provides Audio Activity Detection, which probably may just mean detecting silence in layman's terms;
- silero-vad is geared towards speech detection (as opposed to noise or music);
- audiotok and webrtcvad use 30-50 ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for audiotok); a minimal webrtcvad frame loop is sketched after this list;
- pyannote had a ton of performance caveats, i.e. it required a GPU out of the box and worked very slowly on small files (but quite fast on long files). Also, as far as we dug, streaming / online application was not possible with pyannote and the standard provided pipeline.
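For reference, a minimal sketch of the 30 ms frame loop py-webrtcvad expects; the input file is hypothetical, and the audio must be 16-bit mono PCM at a supported rate (8/16/32/48 kHz):

```python
import webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness mode, 0 (least aggressive) to 3 (most)
sample_rate = 16000
frame_ms = 30           # webrtcvad accepts 10 / 20 / 30 ms frames
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 2 bytes per 16-bit sample

with open("audio.pcm", "rb") as f:  # hypothetical raw 16 kHz 16-bit mono PCM file
    pcm = f.read()

decisions = []
for offset in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    frame = pcm[offset:offset + frame_bytes]
    decisions.append(vad.is_speech(frame, sample_rate))  # hard True/False per frame
```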
Testing Methodology
Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology
Known limitations:
- We could not extract probabilities from py-webrtcvad and pyannote without essentially rebuilding / modifying the C++ code for py-webrtcvad and the pipeline code for pyannote. Therefore we had to interpret the binary decisions made by these tools as probabilities, or just provide one dot on the curve;
- To use silero-vad properly, or to use it with different new domains, you have to set at least some of the params properly. This can be done via the provided plotting tool (a usage sketch follows below);
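For illustration, this is roughly how those params are exposed when loading silero-vad via torch.hub; the exact contents of the utils tuple and the parameter names have changed between releases, so treat this as a sketch based on the README rather than a fixed API:

```python
import torch

# entrypoint per the silero-vad README; downloads the model on first call
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, _, read_audio, *_) = utils

wav = read_audio('test.wav', sampling_rate=16000)  # hypothetical test file

# threshold is the main knob you would tune per domain with the plotting tool
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000, threshold=0.5)
print(speech_timestamps)  # list of {'start': ..., 'end': ...} dicts in samples
```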
Testing Results
Finished tests:
Portability, Speed, Production Limitations
- webrtcvad is written in C++ (around 2016), so theoretically it can be ported to many platforms;
- audiotok is written in plain Python, but I guess the algorithm itself can be ported;
- silero-vad is based on PyTorch and ONNX, so it boasts the same portability options both of these frameworks feature (mobile, different ONNX backends, Java and C++ inference APIs, graph conversion from ONNX) - a minimal ONNX sketch follows after this list;
- pyannote is also built with PyTorch, but it requires a GPU out of the box and is extremely slow on short files.
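To illustrate the ONNX path, a generic onnxruntime sketch; the model file name is an assumption, and the export's actual input / output signature should be inspected before wiring it into a pipeline:

```python
import onnxruntime as ort

# hypothetical path to the exported ONNX model shipped with the repo
session = ort.InferenceSession('silero_vad.onnx')

# inspect the export's actual interface, which varies between releases
for tensor in session.get_inputs():
    print('input: ', tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print('output:', tensor.name, tensor.shape, tensor.type)

# inference then goes through session.run(output_names, input_feed),
# feeding one numpy array per input listed above
```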
Also, we ran a 30-minute audio file through pyannote and silero-vad:
- silero-vad - 20 seconds on CPU;
- pyannote - 12 seconds on a 3090;

This is by no means extensive or complete research on the topic; please point out if anything is lacking.