How does silero-vad compare to pyannote and NVIDIA NeMo? #152
-
Hi there @snakers4 https://github.com/pyannote/pyannote-audio
-
Hi, As for pyannote, we compared it previously: it was OK for whole audios, but not very fast. I am also not sure that it is an out-of-the-box solution (it was trained on limited academic data), and I believe it does not support streaming (or we did not read all of their code to find the streaming examples). If you find some checkpoints and streaming examples, please send a link.
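For reference, here is a minimal sketch of chunk-by-chunk streaming with silero-vad itself, using the VADIterator helper bundled with the torch.hub model; the file name example.wav and the 16 kHz / 512-sample window are placeholder assumptions:

```python
import torch

# Load the VAD model plus helper utilities from the silero-vad repo via torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 16000
wav = read_audio('example.wav', sampling_rate=SAMPLING_RATE)  # placeholder file name

vad_iterator = VADIterator(model, sampling_rate=SAMPLING_RATE)
window_size_samples = 512  # assumed chunk size for 16 kHz audio

# Feed the audio chunk by chunk, as a live stream would arrive
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i:i + window_size_samples]
    if len(chunk) < window_size_samples:
        break
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict)  # e.g. {'start': 1.2} or {'end': 3.4}

vad_iterator.reset_states()  # reset the internal state between separate streams
```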
As for MarbleNet, I found some streaming examples here with pretrained models (but I am not sure whether they were pre-trained just on Google Speech Commands, which is limited). In any case, I am reluctant to invest time in adding this network to our benchmark for a couple of reasons:
-
Thanks for the detailed reply. The pyannote link that you referenced points to pyannote 1. Also, I am interested in hearing your thoughts about SpeechBrain's VAD.
-
Thank you for your detailed reply. Note that SpeechBrain still doesn't have streaming inference, nor do I see a deployment pipeline, but one of the things that caught my eye is that their model was able to correctly predict the part where both music and speech are active and classify it as speech in this sample. For SpeechBrain, here are the links so you can test and compare:
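For a quick comparison along those lines, SpeechBrain's pretrained VAD can be run roughly like this (a minimal sketch assuming the publicly available vad-crdnn-libriparty checkpoint is the model in question, with example.wav as a placeholder file):

```python
from speechbrain.pretrained import VAD

# Download the pretrained CRDNN VAD (trained on LibriParty) from the SpeechBrain hub
vad = VAD.from_hparams(
    source="speechbrain/vad-crdnn-libriparty",
    savedir="pretrained_models/vad-crdnn-libriparty",
)

# Whole-file (non-streaming) inference: returns speech boundaries in seconds
boundaries = vad.get_speech_segments("example.wav")  # placeholder file name
vad.save_boundaries(boundaries)  # print the detected speech segments
```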
-
Silero's performance is actually great. Note that adding an automatic post-processing step to the pipeline, in order to generate the time per sentence, would make it even better.
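A rough sketch of what such a post-processing step could look like, using the segment timestamps silero-vad already returns; note that the VAD yields speech segments rather than true sentence boundaries, and the file name and sample rate below are placeholder assumptions:

```python
import torch

# Load silero-vad and its utilities via torch.hub
model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad', model='silero_vad')
(get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks) = utils

SAMPLING_RATE = 16000
wav = read_audio('example.wav', sampling_rate=SAMPLING_RATE)  # placeholder file name

# Whole-file inference: a list of {'start': ..., 'end': ...} dicts in samples
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)

# Post-processing: convert sample indices to seconds, one line per detected segment
for seg in speech_timestamps:
    start_s = seg['start'] / SAMPLING_RATE
    end_s = seg['end'] / SAMPLING_RATE
    print(f"{start_s:7.2f}s -> {end_s:7.2f}s  ({end_s - start_s:.2f}s of speech)")
```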
-
Can you explain?