Changelog - V5 just released! #2

Open
snakers4 opened this issue Dec 15, 2020 · 32 comments

@snakers4
Owner

Just a handy issue to be notified of the latest changes and micro-releases (we will mostly be changing the models)

@snakers4 snakers4 added the documentation label Dec 15, 2020
@snakers4 snakers4 self-assigned this Dec 15, 2020
@snakers4
Owner Author

Initial models, examples, utils for VAD only uploaded (no number detector or language classifier yet)

@snakers4
Owner Author

First readable public release

@snakers4
Owner Author

Added VAD latency and throughput metrics

@snakers4
Owner Author

Updated VAD quality
Before / after (precision / recall)
image
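
For readers unfamiliar with the two metrics, a minimal sketch of frame-level precision and recall for a VAD follows. The labels and predictions are made-up toy data, and `precision_recall` is a hypothetical helper, not something taken from the repository or from how the plots above were produced.

```python
def precision_recall(labels, predictions):
    """Frame-level precision and recall for binary speech / non-speech labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # speech found
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false alarm
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # missed speech
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy data (made up): 1 = speech frame, 0 = non-speech frame.
labels = [1, 1, 1, 0, 0, 1, 0, 0]
predictions = [1, 1, 0, 0, 1, 1, 0, 0]

precision, recall = precision_recall(labels, predictions)
print(precision, recall)  # 0.75 0.75
```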

@adamnsandle
Collaborator

Added < 250ms compatibility
image

@Sontref
Collaborator

Sontref commented Dec 31, 2020

Added number detector

@snakers4
Owner Author

snakers4 commented Jan 11, 2021

Language detector example, readme update + FAQ

@snakers4
Owner Author

Audiotok benchmarks added
Looks like all energy based solutions are kind of similar

@snakers4
Owner Author

snakers4 commented Feb 1, 2021

Added a utility to tune the VAD params properly for a domain

@snakers4
Owner Author

snakers4 commented Feb 3, 2021

Some final benchmarks posted here - pyannote/pyannote-audio#604 (comment)
Probably we are done with benchmarks for now

@snakers4
Owner Author

Added micro (10k params, 100x smaller) VAD models

@snakers4 snakers4 changed the title Changelog Mirror Changelog Feb 17, 2021
@snakers4
Owner Author

Added micro (10k params, 100x smaller) VAD models for 8 kHz audio

@snakers4
Owner Author

  • Added mini (100k params) VAD models for 8 kHz and 16 kHz
  • Added adaptive vad iterator

#54

@snakers4
Owner Author

  • Fixed examples and notebooks
  • Updated README
  • Added adaptive examples

@snakers4
Owner Author

snakers4 commented Jul 9, 2021

  • Added a language classifier for 116 languages
  • It classifies audio into languages and mutually intelligible language groups (e.g. Serbian + Bosnian + Croatian, Russian + Ukrainian + others, Hindi + Urdu); see the full list here and here
  • Some artificial / unspoken languages will probably be excluded, and a large model will be trained

@snakers4
Owner Author

Improved language classifier

  • 95 languages (85% accuracy), 58 language groups (90% accuracy)
  • Mutually intelligible languages are united into language groups (e.g. Serbian + Croatian + Bosnian are very similar)
  • Trained on approx 20k hours of data (10k of which are for 5 most popular languages)
  • 4.7M params

@snakers4 snakers4 pinned this issue Sep 16, 2021
@snakers4
Owner Author

Updated the further reading section

@snakers4
Owner Author

snakers4 commented Dec 7, 2021

New V3 Silero VAD is Already Here

Main changes

  • One VAD to rule them all! New model includes the functionality of the previous ones with improved quality and speed!
  • Flexible sampling rate, 8000 Hz and 16000 Hz are supported;
  • Flexible chunk size, minimum chunk size is just 30 milliseconds!
  • 100k parameters;
  • GPU and batching are supported;
  • Radically simplified examples;
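
As a quick illustration of what a chunk size in milliseconds means in samples, here is a minimal arithmetic sketch. The helper names `ms_to_samples` and `samples_to_ms` are hypothetical and are not part of the repository.

```python
def ms_to_samples(duration_ms, sampling_rate):
    """Number of samples that cover duration_ms at the given sampling rate."""
    return int(round(duration_ms * sampling_rate / 1000))


def samples_to_ms(num_samples, sampling_rate):
    """Duration in milliseconds of num_samples at the given sampling rate."""
    return 1000.0 * num_samples / sampling_rate


# The 30 ms minimum chunk at both supported sampling rates.
print(ms_to_samples(30, 8000))    # 240
print(ms_to_samples(30, 16000))   # 480

# A 1536-sample chunk at 16 kHz is 96 ms long.
print(samples_to_ms(1536, 16000))  # 96.0
```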

Migration

Please see the new examples.

The new get_speech_timestamps is a simplified, unified replacement for the deprecated get_speech_ts and get_speech_ts_adaptive methods.

speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=16000)
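
A minimal sketch of post-processing the result, assuming `speech_timestamps` is a list of dicts with `start` and `end` given in samples. That shape is an assumption here, the sample values are made up, and both helpers are hypothetical.

```python
def timestamps_to_seconds(speech_timestamps, sampling_rate=16000):
    """Convert sample-based start / end indices to seconds."""
    return [
        {"start": ts["start"] / sampling_rate, "end": ts["end"] / sampling_rate}
        for ts in speech_timestamps
    ]


def total_speech_seconds(speech_timestamps, sampling_rate=16000):
    """Total duration of all detected speech segments, in seconds."""
    total_samples = sum(ts["end"] - ts["start"] for ts in speech_timestamps)
    return total_samples / sampling_rate


# Made-up output in the assumed shape (sample indices at 16 kHz).
example = [{"start": 16000, "end": 48000}, {"start": 64000, "end": 80000}]

print(timestamps_to_seconds(example))
print(total_speech_seconds(example))  # 3.0
```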

The new VADIterator class serves as an example for streaming tasks, replacing the deprecated VADiterator and VADiteratorAdaptive.

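vad_iterator = VADIterator(model)
window_size_samples = 1536  # number of samples in a single audio chunk

for i in range(0, len(wav), window_size_samples):
    # Feed one chunk at a time; a non-empty result is printed when it arrives
    speech_dict = vad_iterator(wav[i: i + window_size_samples], return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()  # reset the model states after each audio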
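
A minimal sketch of turning such a stream of per-chunk results into complete segments, assuming each result is either empty, `{'start': t}` or `{'end': t}`. That event shape is an assumption, the events below are simulated rather than produced by the model, and `collect_segments` is a hypothetical helper.

```python
def collect_segments(events):
    """Pair {'start': t} / {'end': t} events into (start, end) tuples.

    Returns the finished segments and the start time of a segment that is
    still open when the stream ends (or None if there is no open segment).
    """
    segments = []
    current_start = None
    for event in events:
        if not event:
            continue  # no speech boundary was reported for this chunk
        if "start" in event:
            current_start = event["start"]
        elif "end" in event and current_start is not None:
            segments.append((current_start, event["end"]))
            current_start = None
    return segments, current_start


# Simulated per-chunk results, in seconds (made up for illustration).
simulated = [None, {"start": 0.5}, None, None, {"end": 2.1}, None, {"start": 3.0}]

segments, open_start = collect_segments(simulated)
print(segments)    # [(0.5, 2.1)]
print(open_start)  # 3.0
```

Keeping the open start separate lets a caller decide what to do with speech that is still ongoing when the audio ends.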

@snakers4
Owner Author

Even Better V3 Silero VAD

  • Models with even higher quality (just see the plots with metrics!);
  • New model ~ large model >> all previous (even large) models;
  • The model now behaves as expected quality-wise, i.e. 100ms > 60ms > 30ms and 16 kHz > 8 kHz;

@snakers4
Owner Author

This summarises new progress well

image

@snakers4
Owner Author

New V3 ONNX VAD Released

We finally were able to port a model to ONNX:

  • Compact model (~100k params);
  • Both PyTorch and ONNX models are not quantized;
  • Same quality model as the latest best PyTorch release;
  • Only 16 kHz is available for now (ONNX has some issues with if-statements and / or tracing vs. scripting, which produce cryptic errors);
  • In our tests, on short audios (chunks) ONNX is 2-3x faster than PyTorch (this is mitigated with larger batches or long audios);
  • Audio examples and non-core models moved out of the repo to save space;

@snakers4
Owner Author

image

image

@adamnsandle
Collaborator

New V4 VAD Released

Changes:

  • Improved quality
  • Improved performance
  • Both 8k and 16k sampling rates are now supported by the ONNX model
  • Batching is now supported by the ONNX model
  • Added an audio_forward method for one-line processing of one or multiple audios without postprocessing

@snakers4
Owner Author

It is worth posting this chart:

image

@snakers4
Owner Author

  • Remove picovoice mentions

@snakers4
Owner Author

  • Deprecate language classifier and number detector models, since they are not maintained anymore.

@snakers4
Owner Author

snakers4 commented Jun 27, 2024

Finally, V5 is here, 3x faster, supporting 6000+ languages!

image

Performance and Model Size

  • 3x faster inference for TorchScript, 10% faster inference for ONNX;
  • Now TorchScript is as fast as ONNX;
  • Model size is 2x larger, 2MB vs. 1MB;

Quality

  • The VAD supports more than 6,000 languages now;
  • Significantly more robust on noisy data;
  • Overall 5-7% quality increase on clean data;
  • Quality difference for 8 kHz and 16 kHz is negligible now;
  • Quality difference for different window sizes is negligible => window size was deprecated;
  • Added benchmarks on 9 unique datasets (2 private) and one holistic multi-domain dataset;

Changes and deprecations

  • ONNX opset 16;
  • window_size_samples is deprecated: the VAD now works only with a fixed-size window;
  • VAD now works with 8 kHz and 16 kHz sample rates, only with fixed 256 and 512 sample windows respectively;
  • Slightly changed internal logic, now some context (part of previous chunk) is passed along with the current chunk;
  • Sample rates that are a multiple of 16 kHz are still supported;
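
A minimal sketch of what the fixed windows and the carried-over context could look like on the caller's side. The window sizes (256 and 512 samples) come from the list above; everything else is an assumption for illustration: the context sizes in `CONTEXT_SIZES`, the naive decimation used for multiples of 16 kHz, the zero-padding of the last chunk, and the helper names themselves. None of this is taken from the model's actual internals.

```python
WINDOW_SIZES = {8000: 256, 16000: 512}  # fixed windows from the changelog

# Assumption: the changelog does not say how much of the previous chunk is
# carried over; these values are placeholders for illustration only.
CONTEXT_SIZES = {8000: 32, 16000: 64}


def normalize_rate(samples, sampling_rate):
    """Bring a multiple of 16 kHz down to 16 kHz by naive decimation.

    Assumption: keeping every n-th sample, with no low-pass filtering.
    """
    if sampling_rate > 16000 and sampling_rate % 16000 == 0:
        step = sampling_rate // 16000
        return samples[::step], 16000
    if sampling_rate not in WINDOW_SIZES:
        raise ValueError(f"unsupported sampling rate: {sampling_rate}")
    return samples, sampling_rate


def fixed_windows(samples, sampling_rate):
    """Yield context + chunk lists of fixed size, zero-padding the last chunk."""
    samples, sampling_rate = normalize_rate(list(samples), sampling_rate)
    window = WINDOW_SIZES[sampling_rate]
    context_size = CONTEXT_SIZES[sampling_rate]
    context = [0.0] * context_size  # nothing precedes the first chunk
    for i in range(0, len(samples), window):
        chunk = samples[i:i + window]
        chunk = chunk + [0.0] * (window - len(chunk))  # pad a short tail
        yield context + chunk
        context = chunk[-context_size:]  # tail of this chunk feeds the next


audio = [0.1] * 1200  # toy 16 kHz signal, 75 ms long
windows = list(fixed_windows(audio, 16000))
print(len(windows), len(windows[0]))  # 3 576
```

Each yielded list is what a model call would receive in this sketch: the tail of the previous chunk followed by the current fixed-size chunk.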

@snakers4
Owner Author

snakers4 commented Jul 9, 2024

V5.1 - Experimental PyPI Package Release

  • Experimental pip-package release;
  • Community PRs to update the examples;

What's Changed

New Contributors

Full Changelog: v5.0...v5.1

@snakers4
Owner Author

snakers4 commented Oct 9, 2024

V5.1.2 - Minor Update

  • Minor fixes;
  • Pip version update;

@MonolithFoundation

Congrats! How do I use V5 out of the box?
