This repository provides a VGGish model, implemented in Keras with tensorflow backend. This repository is developed based on the model for AudioSet. For more details, please visit the slim version.
They should be downloaded into weights
directory
You can check it with a sub-set of the Speech Commands dataset
Download it and move a sub set (i.e. some folders) of the dataset.
The directory should look like this:
VGGish/dataset/data/class_01/*.wav
VGGish/dataset/data/class_02/*.wav
VGGish/dataset/data/class_03/*.wav
where class_01
, class_02
, and class_03
are classes from the Speech Commands dataset,
and VGGish
is the root of this project. You can check that the dataset is correctly placed, and see a summary
by running python dataset/dataset_utils.py
which, for yes/no categories, should output something like that:
Dataset summary:
training -> 6358 examples
no -> 3130 examples
yes -> 3228 examples
testing -> 824 examples
no -> 405 examples
yes -> 419 examples
validation -> 803 examples
no -> 406 examples
yes -> 397 examples
Then you can run it with python evaluation.py
.
Here you can see an output example for yes/no categories
loading training data...
loading test data...
loading validation data...
Training...
Report for training
precision recall f1-score support
no 0.94 0.99 0.96 2850
yes 0.99 0.94 0.96 2990
accuracy 0.96 5840
macro avg 0.96 0.96 0.96 5840
weighted avg 0.96 0.96 0.96 5840
Report for validation
precision recall f1-score support
no 0.92 0.98 0.95 363
yes 0.98 0.92 0.95 356
accuracy 0.95 719
macro avg 0.95 0.95 0.95 719
weighted avg 0.95 0.95 0.95 719
Report for testing
precision recall f1-score support
no 0.92 0.98 0.95 386
yes 0.98 0.92 0.95 397
accuracy 0.95 783
macro avg 0.95 0.95 0.95 783
weighted avg 0.95 0.95 0.95 783
Notice that I removed some logging info from TF which is not relevant at this point.
-
Gemmeke, J. et. al., AudioSet: An ontology and human-labelled dataset for audio events, ICASSP 2017
-
Hershey, S. et. al., CNN Architectures for Large-Scale Audio Classification, ICASSP 2017