Quality benchmarks between audiotok / webrtcvad / silero-vad #68

Open
snakers4 opened this issue Jan 21, 2021 · 5 comments

Comments

@snakers4

Instruments

We have compared 3 easy-to-use, off-the-shelf instruments for voice activity / audio activity detection: audiotok, webrtcvad, and silero-vad.

Caveats

  • Full disclaimer: we are mostly interested in voice detection, not just silence detection;
  • In our extensive experiments we noticed that WebRTC is actually much better at detecting silence than at detecting speech (probably by design). It produces a lot of false positives when detecting speech;
  • audiotok provides Audio Activity Detection, which in layman's terms probably just means detecting silence;
  • silero-vad is geared towards speech detection (as opposed to noise or music);
  • A sensible chunk size for our VAD is at least 75-100 ms (pauses in speech shorter than 100 ms are not very meaningful), but we prefer 150-250 ms chunks (see quality comparison here), while audiotok and webrtcvad use 30-50 ms chunks (we used the default values of 30 ms for webrtcvad and 50 ms for audiotok);
  • We have excluded pyannote-audio for now (https://github.com/pyannote/pyannote-audio), since its pre-trained models are trained only on limited academic datasets and it is mostly a recipe collection / toolkit for building your own tools, not a finished tool per se (also, for such a simple task the amount of code bloat is puzzling from a production standpoint; our internal VAD training code is literally just 5 Python modules).
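To make the chunk sizes above concrete, here is a minimal sketch of how many samples and decisions per second each chunk duration implies. The 16 kHz sample rate and the helper names (`samples_per_chunk`, `split_into_chunks`) are assumptions for illustration; they are not part of any of the three libraries.

```python
from typing import List, Sequence

SAMPLE_RATE = 16_000  # assumed sample rate in Hz (not stated in this post)


def samples_per_chunk(chunk_ms: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Number of samples in one chunk of the given duration."""
    return sample_rate * chunk_ms // 1000


def split_into_chunks(
    samples: Sequence[float], chunk_ms: int, sample_rate: int = SAMPLE_RATE
) -> List[Sequence[float]]:
    """Split audio into non-overlapping chunks, dropping a trailing partial chunk."""
    size = samples_per_chunk(chunk_ms, sample_rate)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


if __name__ == "__main__":
    one_second = [0.0] * SAMPLE_RATE  # one second of placeholder audio
    for ms in (30, 50, 100, 250):
        chunks = split_into_chunks(one_second, ms)
        print(f"{ms} ms -> {samples_per_chunk(ms)} samples per chunk, "
              f"{len(chunks)} full chunks per second")
```

Under these assumptions a 30 ms chunk holds 480 samples and a 250 ms chunk holds 4000, so the larger chunk gives the model roughly eight times more audio per decision but yields far fewer decisions per second.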

Methodology

Please refer here - https://github.com/snakers4/silero-vad#vad-quality-metrics-methodology

Quality Benchmarks

Finished tests:

image

Portability and Speed

  • It looks like webrtcvad was originally written in C++ around 2016, so theoretically it can be ported to many platforms;
  • I have asked in the community: the original VAD seems to have matured, and the Python version is based on the 2018 version;
  • It looks like audiotok is written in plain Python, but I guess the algorithm itself can be ported;
  • silero-vad is based on PyTorch and ONNX, so it has the same portability options as both of these frameworks (mobile, different backends for ONNX, Java and C++ inference APIs, graph conversion from ONNX).

This is by no means exhaustive research on the topic; please point out anything that is lacking.

@matanox

matanox commented Jan 21, 2021

You've sure done some thorough work here. Just as a sanity check, it looks like the deep neural network model is the only one worth using for real-world action, does it not? I wonder in what ways the WebRTC VAD model is even useful for the WebRTC project itself...

@snakers4
Author

Despite appearances, WebRTC is not so bad.
If you just use WebRTC to suppress silence, it works just fine.

False positives and the lack of easy tuning / interpretable parameters / docs / support are the main culprits.

Also, for this reason we just used the standard params. We may be wrong somewhere and it could be tuned better, but 95% of users will not bother.

@sharvil

sharvil commented Jun 9, 2021

It seems that the Silero VAD and WebRTC VAD make different tradeoffs.

WebRTC produces a VAD decision on 10ms to 30ms frames, whereas Silero produces a VAD decision on 150ms to 250ms frames. While it's true that short silences on the order of 30ms aren't particularly meaningful, the resolution of a VAD decision may be. In some applications, it may not be acceptable to discover a transition between speech and silence up to 125ms late. WebRTC is designed to provide decisions in low-latency streaming applications where having a 100+ms buffer is not acceptable.

I'm happy to see implementations explore different tradeoffs in the design space. Looking at a PR-curve alone, though, doesn't tell the full story.

@snakers4
Author

snakers4 commented Jun 9, 2021

> whereas Silero produces a VAD decision on 150ms to 250ms frames

It is true that we cannot really go below 100ms windows; there is just too much noise.
You can ofc use 100ms as well, with some quality degradation - snakers4/silero-vad#2 (comment)
On the other hand, we design around this limitation by simply applying our VAD in a rolling-window fashion, so you can essentially get x4 - x8 resolution (i.e. 250ms // 4 or 250ms // 8).
The only downside is that you essentially have to use more compute.
We also designed around that by providing models with 1M / 100k / 10k parameters.
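A minimal sketch of the rolling-window idea, for illustration only: a 250ms window is moved in steps of a quarter (or an eighth) of its length, so a new decision arrives every 62.5ms (or 31.25ms) at the cost of 4x (or 8x) more model calls. The 16 kHz sample rate, the `rolling_vad` helper, and the toy `energy_score` function are assumptions made up for this example; `energy_score` is a stand-in scorer, not the silero-vad model or its API.

```python
from typing import Callable, List, Sequence, Tuple

SAMPLE_RATE = 16_000  # assumed sample rate in Hz


def rolling_vad(
    samples: Sequence[float],
    score_fn: Callable[[Sequence[float]], float],
    window_ms: int = 250,
    steps_per_window: int = 4,
    sample_rate: int = SAMPLE_RATE,
) -> List[Tuple[float, float]]:
    """Score overlapping windows; returns (window start in ms, score) pairs."""
    window = sample_rate * window_ms // 1000
    hop = window // steps_per_window  # e.g. 250ms // 4
    results = []
    for start in range(0, len(samples) - window + 1, hop):
        score = score_fn(samples[start:start + window])
        results.append((start * 1000 / sample_rate, score))
    return results


def energy_score(window: Sequence[float]) -> float:
    """Toy stand-in scorer (mean absolute amplitude), NOT a real VAD model."""
    return min(1.0, sum(abs(x) for x in window) / len(window))


if __name__ == "__main__":
    # 0.5 s of silence followed by 0.5 s of constant "signal"
    audio = [0.0] * 8000 + [1.0] * 8000
    for start_ms, score in rolling_vad(audio, energy_score):
        print(f"{start_ms:7.1f} ms  score={score:.2f}")
```

With these assumptions, one second of audio yields 13 overlapping decisions at x4 (25 at x8) instead of 4 non-overlapping ones, which is where the extra compute goes.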

@snakers4
Author

snakers4 commented Jun 9, 2021

Also, the community has provided some illustrative comparisons: https://github.com/snakers4/silero-vad#live-demonstration
