⚠️ Public pre-test of Silero-VAD v5 #448
Comments
"When is the release scheduled for v5?" |
I find that v4 does not work well for the Chinese single word 【bye】, and the Cantonese single words 【喺啊】 and 【喺】 are not handled well either.
I do not have any edge cases, but it would be nice if you could change your benchmark methodology. There are a lot of models out there by now. Adopting some new datasets like DIHARD III and comparing against other SOTA models like pyannote would be dope.
Systematic cases would be:
Hi, it's me again 😄 We've done some experiments on what we called "model expectation" with respect to how frequently the LSTM states are reset. Recall from the previous issue that my interest is mainly in always-on scenarios: a VAD listens all the time to whatever is going on in the environment and triggers only when there is speech, which we assume to be a rare event. As such, the model would be expected to trigger only a few times (a day, say) relative to the effectively infinite audio stream it keeps receiving.

The experiment consists of feeding a long stream of non-speech data to the model and checking how often it hallucinates, i.e., how often it sees speech when there is none (see the sketch after this comment). For that, we used the Cafe, Home, and Car environments from the QUT-NOISE dataset, which contains 30-50-minute-long noise-only recordings. In theory, one is presumably advised to reset the model states only after it has seen speech, but we took the liberty of resetting at regular time intervals, irrespective of whether speech detection was triggered.

The following plots show the scikit-learn error rate (1 - accuracy, which goes up to 100% == 1.00), thereby treating VAD as a frame-wise binary classification problem. The x-axis shows the frequency of model state resets.

[plots omitted: frame-wise error rate vs. state-reset frequency for the Cafe, Home, and Car environments]
I'll formulate my conclusions later when I have time; I just wanted to give a heads-up ASAP since it's been a while since this issue was opened. EDIT: conclusions! First of all, note that the graphs are not on the same scale, so the models make far fewer mistakes in the car environment (4% vs. ~20% otherwise), for example.
A possible takeaway is that this whole speech-expectation behavior reflects the training scheme, since the model has probably never (or only very rarely) seen instances of non-speech-only data after the LSTM states have been initialized. In other words, if the datasets used to train the VAD are the same ones used to train ASR systems, all of the data contains speech, and that is what the model expects to see at the end of the day. Any feedback on these results would be welcome @snakers4 😄
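For reference, here is a minimal sketch of the hallucination test described in the comment above. The `torch.hub` entry point and `reset_states()` are part of the published silero-vad API; the chunk size, reset interval, and decision threshold are illustrative assumptions, and the ground truth is all-zero because the recordings are noise-only.

```python
import torch
from sklearn.metrics import accuracy_score

# Load the published silero-vad model via torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')

SR = 16000          # sample rate (assumed)
CHUNK = 512         # samples per frame fed to the model (assumed chunk size)
RESET_EVERY_S = 60  # illustrative state-reset interval under test

def hallucination_error_rate(wav: torch.Tensor, threshold: float = 0.5) -> float:
    """Frame-wise error rate (1 - accuracy) on a noise-only recording."""
    model.reset_states()
    frames_per_reset = int(RESET_EVERY_S * SR / CHUNK)
    preds = []
    for i, start in enumerate(range(0, len(wav) - CHUNK + 1, CHUNK)):
        if i > 0 and i % frames_per_reset == 0:
            model.reset_states()  # periodic reset, regardless of any trigger
        prob = model(wav[start:start + CHUNK], SR).item()
        preds.append(int(prob >= threshold))
    truth = [0] * len(preds)      # ground truth: no speech in the recording
    return 1.0 - accuracy_score(truth, preds)
```

Sweeping `RESET_EVERY_S` over a range of values would reproduce the x-axis of the plots described above.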
We focused on this scenario when training the new VAD, since we had suitable datasets and had run into our own issues when passing noise-only / "speechless" audio through the VAD. The new VAD version was released just now - #2 (comment). We changed the way it handles context: we now pass part of the previous chunk along with the current chunk. We also made the LSTM component 2x smaller but improved the feature pyramid pooling (we had an improper pooling layer). So in theory, and in our practice, the new VAD should handle this edge case better. Can you please re-run some of your tests, and if the issue persists, open a new issue referencing this one as context? Many thanks!
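For readers curious about the context mechanism mentioned above, here is a hypothetical illustration of the idea: keep the tail of the previous chunk and prepend it to the current one before each forward pass, so the model sees a little history even with a smaller recurrent state. The class name and context length are illustrative assumptions, not the actual v5 internals.

```python
import torch

CONTEXT_SAMPLES = 64  # assumed context length, purely for illustration

class ContextualChunker:
    """Prepends the tail of the previous chunk to the current chunk
    (hypothetical helper, not the actual v5 implementation)."""

    def __init__(self, context_samples: int = CONTEXT_SAMPLES):
        self.context = torch.zeros(context_samples)

    def __call__(self, chunk: torch.Tensor) -> torch.Tensor:
        # Assumes len(chunk) >= context_samples.
        padded = torch.cat([self.context, chunk])  # context + current chunk
        self.context = chunk[-len(self.context):]  # remember the new tail
        return padded
```

Under these assumptions, a 512-sample chunk would become a 576-sample model input while the recurrent state itself stays small.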
Dear members of the community,
Finally, we are nearing the release of the v5 version of the VAD. Can you please send your audio edge cases in this ticket so that we can stress-test the new release of the VAD in advance?
Ideally we need something like this: #369 (which we incorporated into validation when choosing the new models), but any systematic cases where the VAD underperforms would be good as well.
Many thanks!