
Quality Benchmarks Between auditok / webrtcvad / pyannote-audio / silero-vad #604

Closed
snakers4 opened this issue Feb 1, 2021 · 5 comments



snakers4 commented Feb 1, 2021

Instruments

We have compared 4 easy-to-use off-the-shelf instruments for voice activity / audio activity detection (auditok, webrtcvad, pyannote-audio, silero-vad), with off-the-shelf parameters (we did not hack / rebuild / retrain the tools and just used them as-is).

Caveats

  • Full disclaimer: we are mostly interested in portable production voice detection in phone calls, not just silence detection or dataset preparation, where all of these tools will more or less work just fine;
  • In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design); it produces a lot of false positives when detecting speech;
  • auditok provides Audio Activity Detection, which in layman's terms probably just means detecting silence;
  • silero-vad is geared towards speech detection (as opposed to noise or music);
  • A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful, but we prefer 150-250 ms chunks; see the quality comparison here), while auditok and webrtcvad use 30-50 ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for auditok; see the sketch after this list);
  • Maybe we missed something, but using pyannote had a ton of performance caveats, i.e. it required a GPU out of the box and worked very slowly on small files (but quite fast on long files). Also, as far as we dug into it, streaming / online application was not possible with the standard pyannote pipeline;
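For context, this is roughly how py-webrtcvad is driven frame by frame (a minimal sketch; PCM loading and any padding / smoothing logic around it are omitted):

```python
import webrtcvad

vad = webrtcvad.Vad(3)       # aggressiveness mode 0-3, 3 = most aggressive

SAMPLE_RATE = 16000          # webrtcvad accepts 8/16/32/48 kHz
FRAME_MS = 30                # webrtcvad only accepts 10/20/30 ms frames
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 16-bit mono PCM

def speech_flags(pcm: bytes):
    """Yield one binary speech / non-speech decision per 30 ms frame."""
    for offset in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield vad.is_speech(pcm[offset:offset + FRAME_BYTES], SAMPLE_RATE)
```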

Testing Methodology

Please refer to https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology

Known limitations:

  • Usually speech is highly self-correlated (i.e. if one person speaks, they will keep speaking for some time), but our test is extremely hard because it essentially gives an algorithm only one chance. Essentially we test how well each algorithm determines the first / last frame of speech, without the luxury of being "in the middle" of speech;
  • Since we wanted to provide PR curves, this was a bit tricky for py-webrtcvad and pyannote without essentially rebuilding / modifying the C++ code of py-webrtcvad and the pipeline code of pyannote. We therefore had to interpret the binary decisions made by these tools as probabilities, or just provide one dot on the curve (see the sketch after this list);
  • For production usage of silero-vad, or to apply it to different new domains, you have to set at least some of the parameters properly. This can be done via the provided plotting tool;
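One way to get more than a single operating point from a binary detector is to sweep whatever knob it does expose; for py-webrtcvad that is the aggressiveness mode. A rough sketch along those lines (the `frames` / `labels` arrays are hypothetical, and this is not the exact benchmark script):

```python
import webrtcvad
from sklearn.metrics import precision_score, recall_score

SAMPLE_RATE = 16000

def webrtc_pr_points(frames, labels):
    """Return one (precision, recall) point per aggressiveness mode 0-3.

    frames: list of 30 ms, 16-bit mono PCM byte strings (hypothetical input)
    labels: matching list of 0/1 ground-truth speech labels (hypothetical input)
    """
    points = []
    for mode in range(4):
        vad = webrtcvad.Vad(mode)
        preds = [int(vad.is_speech(f, SAMPLE_RATE)) for f in frames]
        points.append((precision_score(labels, preds), recall_score(labels, preds)))
    return points
```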

Testing Results

Finished tests:

[image: benchmark results for the finished tests]

Portability, Speed, Production Limitations

  • Looks like webrtcvad was originally written in C++ around 2016, so theoretically it can be ported to many platforms;
  • I have inquired in the community; the original VAD seems to have matured, and the Python version is based on the 2018 version;
  • Looks like auditok is written in plain Python, but I guess the algorithm itself can be ported;
  • silero-vad is based on PyTorch and ONNX, so it boasts the same portability options as those frameworks (mobile, different ONNX backends, Java and C++ inference APIs, graph conversion from ONNX);
  • Looks like pyannote is also built with PyTorch, but it requires a GPU out of the box and is extremely slow on short files;
| Tool | Speed | Streaming | GPU | Portability |
| --- | --- | --- | --- | --- |
| py-webrtcvad | extremely fast | yes | not required | you can build and port |
| auditok | very fast | yes | not required | Python only, you can try porting |
| silero-vad | very fast | yes | not required | PyTorch and ONNX |
| pyannote | fast for long files, slow for small files | no | required | PyTorch |

We also ran a 30-minute audio file through pyannote and silero-vad (a rough timing sketch is shown after this list):

  • silero-vad - 20 seconds on CPU;
  • pyannote - 12 seconds on a 3090;
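For reference, a minimal CPU timing run with silero-vad can look roughly like this (a sketch based on the torch.hub entry point documented in the silero-vad README; the file name is hypothetical and the exact layout of `utils` differs between releases):

```python
import time
import torch

# Load the pretrained VAD via torch.hub (downloads the repo on first use)
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
get_speech_timestamps, _, read_audio, *_ = utils  # order may differ by version

wav = read_audio('long_call.wav', sampling_rate=16000)  # hypothetical 30-minute file

start = time.time()
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
print(f'{len(speech_timestamps)} speech segments in {time.time() - start:.1f} s')
```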

This is by no means extensive or complete research on the topic; please point out anything that is lacking.

hbredin (Member) commented Feb 5, 2021

Thanks for this comparison, though I do not think it is actually fair.

> We have compared 4 easy-to-use off-the-shelf instruments for voice activity / audio activity detection (auditok, webrtcvad, pyannote-audio, silero-vad), with off-the-shelf parameters (we did not hack / rebuild / retrain the tools and just used them as-is).

From https://github.com/snakers4/silero-models/wiki/Model-Adaptation:
"We often encountered that people wrongfully compare general domain-agnostic solutions with solutions heavily tuned on particular domains."

Isn't that what you just did in this benchmark? ;-)

> A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful, but we prefer 150-250 ms chunks; see the quality comparison here), while auditok and webrtcvad use 30-50 ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for auditok);

The pyannote.audio pretrained model you used (it is just a wild guess because I cannot seem to find the benchmark script) is trained to process 2s chunks but still outputs a score every few ms. Or did you use the pipeline? (and not just the model?)

> Maybe we missed something, but using pyannote had a ton of performance caveats, i.e. it required a GPU out of the box and worked very slowly on small files (but quite fast on long files). Also, as far as we dug into it, streaming / online application was not possible with the standard pyannote pipeline;

pyannote.audio models can run on GPU or CPU: https://github.com/pyannote/pyannote-audio/tree/master/tutorials/pretrained/model

> Usually speech is highly self-correlated (i.e. if one person speaks, they will keep speaking for some time), but our test is extremely hard because it essentially gives an algorithm only one chance. Essentially we test how well each algorithm determines the first / last frame of speech, without the luxury of being "in the middle" of speech;

Sorry, I did not understand -- maybe provide a link to the actual benchmark script?
It would definitely be useful for the community.

> Since we wanted to provide PR curves, this was a bit tricky for py-webrtcvad and pyannote without essentially rebuilding / modifying the C++ code of py-webrtcvad and the pipeline code of pyannote. We therefore had to interpret the binary decisions made by these tools as probabilities, or just provide one dot on the curve;

This tells me (wild guess again) that you used the pipeline and not the model -- the model outputs VAD scores, the pipeline outputs hard decisions.

Anyway, I'd recommend you use this model that gives you a nicer API for processing short chunks.
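To make the score-vs-decision distinction concrete in a library-agnostic way, here is a minimal sketch of turning frame-level VAD scores into hard segments with onset / offset hysteresis (hypothetical thresholds and frame duration; this is not pyannote's actual implementation):

```python
def binarize(scores, frame_dur=0.016, onset=0.6, offset=0.4):
    """Turn per-frame speech scores into (start, end) segments in seconds.

    scores: iterable of floats in [0, 1], one per frame (hypothetical model output)
    onset / offset: hysteresis thresholds of the kind a pipeline would tune per dataset
    """
    segments, start, active = [], None, False
    for i, score in enumerate(scores):
        if not active and score >= onset:        # speech starts
            active, start = True, i * frame_dur
        elif active and score < offset:          # speech ends
            active = False
            segments.append((start, i * frame_dur))
    if active:                                   # close a segment that runs to the end
        segments.append((start, len(scores) * frame_dur))
    return segments
```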

snakers4 (Author) commented Feb 5, 2021

Hi,

"We often encountered that people wrongfully compare general domain-agnostic solutions with solutions heavily tuned on particular domains."
Isn't it what you just did in this benchmark? ;-)

There are not so many public voice detectors.
We simply tried comparing the few that have any semblance of examples / docs.

In any case, I am not quite sure what the point of these "particular" solutions is anyway (besides the fact that they depend on datasets behind the LDC paywall). Compared to STT, voice detection is not such a difficult task that one cannot go general.

> Or did you use the pipeline? (and not just the model?)

We just used this pipeline

> pyannote.audio models can run on GPU or CPU: https://github.com/pyannote/pyannote-audio/tree/master/tutorials/pretrained/model

We had some issues running on CPU.
@adamnsamdle, could you please elaborate?

> Anyway, I'd recommend you use this model that gives you a nicer API for processing short chunks.

Sorry for missing those. I assumed that all of the pre-trained models were presented here.
We will definitely try the "bare" models.

This model and this model are similar, but trained on different datasets, right?

> Sorry, I did not understand -- maybe provide a link to the actual benchmark script?

The script is not a problem.
I will ask the people providing the data whether they are OK with publishing the validation dataset based on their data.

hbredin (Member) commented Apr 14, 2021

FYI, I just made public a preprint in which I use my own (probably unfair) benchmark -- comparing silero-vad with default hyper-parameters and pyannote.audio's new pyannote/segmentation pretrained model.

hbredin closed this as completed Apr 14, 2021
snakers4 (Author) commented Apr 14, 2021

Hi @hbredin,

Many thanks for the shout-out.

> FYI, I just made public a preprint in which I use my own (probably unfair) benchmark -- comparing silero-vad with default hyper-parameters and pyannote.audio's new pyannote/segmentation pretrained model.
> Table 1: Voice activity detection // FA = false alarm rate / Miss. = missed detection rate

I am not sure how precision- and recall-like metrics can be calculated on 5 s chunks (I have not found a more in-depth explanation in the paper of how these rates are calculated - please correct me here).
We basically cut the audio into small chunks (250 ms) and calculated the metrics on those.
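Schematically, the chunk-level evaluation described above looks something like this (hypothetical arrays; not the actual benchmark script):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# One entry per 250 ms chunk: predicted speech probability and ground-truth label
chunk_probs = np.array([0.05, 0.80, 0.95, 0.30, 0.70])  # hypothetical model output
chunk_labels = np.array([0, 1, 1, 0, 1])                 # hypothetical annotation

# Sweeping the decision threshold over the chunk scores yields the full PR curve
precision, recall, thresholds = precision_recall_curve(chunk_labels, chunk_probs)
```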

Also, given the amount of compute used on these small datasets, it is nice that our models behave decently on some academic benchmarks (quite probably while being applied improperly, since the thresholds do require a bit of tuning with the default examples).
We use an order of magnitude less compute for a general model, and on some benchmarks it even performs within a reasonable margin.

Also note that we have added adaptive algorithms to eliminate threshold searching, but we have not yet published examples.

> However, detection thresholds (θon, θoff, δon, and δoff) were tuned specifically for each dataset using their own development set because the manual annotation guides differ from one dataset to another, especially regarding δoff which controls whether to bridge small intra-speaker pauses. For the same reasons, detection thresholds were optimized specifically for each task addressed in the paper.

We are also guilty of this in our VAD (though we used the standard thresholds in these tests), but our model is general, i.e. not tuned on any of these datasets.
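For illustration, this kind of per-dataset tuning can be as simple as a small search on the development set; a simplified single-threshold version (hypothetical arrays; not the paper's actual θon / θoff / δon / δoff procedure):

```python
import numpy as np
from sklearn.metrics import f1_score

def tune_threshold(dev_probs, dev_labels):
    """Pick the decision threshold maximizing chunk-level F1 on a development set.

    dev_probs, dev_labels: per-chunk speech probabilities and 0/1 labels (hypothetical)
    """
    dev_probs = np.asarray(dev_probs)
    best_thr, best_f1 = 0.5, -1.0
    for thr in np.linspace(0.05, 0.95, 19):
        f1 = f1_score(dev_labels, (dev_probs >= thr).astype(int))
        if f1 > best_f1:
            best_thr, best_f1 = thr, f1
    return best_thr, best_f1
```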


> The main difference is that our model operates on short audio chunks (5 seconds) but at a much higher temporal resolution (every 16ms)
> Our segmentation model ingests 5s audio chunks with a sampling rate of 16kHz (i.e. sequences of 80000 samples)

This makes such models much less useful in real life / production, given that these "first models" in the pipeline should be applicable in a streaming fashion, i.e. be very fast / small / stateless.

We also apply our STT models to "large" (1-7 s) chunks of audio by design, but the "first" model (i.e. the VAD) should be able to work on short chunks of audio.

> First, voice activity detection (VAD) removes any region that does not contain speech. Then, speaker change detection (SCD) partitions remaining speech regions into speaker turns, by looking for time instants where a change of speaker occurs

Typically, in most applications you can either record two separate channels (e.g. a phone call).

Or you have a complex scenario with a lot of background noise, where you have to do some noise suppression / use directional microphones etc., which in 95% of cases, given the attitude of business people, makes projects non-viable.

> sharing the pretrained model with the community and integrating it into pyannote open-source library for reproducibility purposes: huggingface.co/pyannote/segmentation

This is my personal opinion, but supporting Hugging Face is a very questionable idea.

TLDR: while I understand that most models are published under permissive licenses, they mostly thrive on regurgitating the billions of dollars invested by Google / Amazon into training huge models, without bringing real added value to the table.

I understand that re-implementing Google's code into PyTorch is a boring endeavor, but I find it hilarious that they then re-sell the IP "borrowed" from Google and others to some corporations.

> It took around 3 days using 4 V100 GPUs to reach peak performance.

Idk, this is kind of ridiculous for such small datasets and models.

> Note, however, that one should not draw hasty conclusions regarding the performance of silero vad model [21] as it is an off-the-shelf model which was not trained specifically for these datasets

By the way, why did you not add WebRTC to this benchmark?

It used to be by far the most popular VAD out there, and Python bindings are also available.

hadware (Contributor) commented Apr 14, 2021

> FYI, I just made public a preprint in which I use my own (probably unfair) benchmark -- comparing silero-vad with default hyper-parameters and pyannote.audio's new pyannote/segmentation pretrained model.

[image]
