Numpy AudioSet Embeddings Dataset
This is a version of the AudioSet dataset formatted using only Python lists and NumPy matrices. The original dataset (formatted as tfrecords) is released here: https://research.google.com/audioset/download.html
We found pervasive errors in the test set of this dataset and released corrected test sets at https://github.com/cgnorthcutt/label-errors (see our paper).
Dataset Details
This dataset provides three things for each of the balanced train set, the unbalanced train set, and the eval/test set:
- the features (as a list of numpy matrices)
  - each 10-second audio clip is represented by a 128-length, 8-bit quantized embedding for every 1 second of audio, resulting in a 128x10 matrix for the full 10 seconds of audio
- the labels (as a list of multi-label lists)
  - there are 527 unique labels, denoted as 0, 1, ..., 526
- the video ids of each example (as a list of lists). Use these to map to the corrected test sets and label errors released at https://github.com/cgnorthcutt/label-errors.
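Because each example carries a variable-length list of label indices, training code usually needs a fixed-width target. The sketch below shows one common way to turn the label lists into a binary multi-hot matrix with 527 columns. The function name `to_multi_hot` and the toy labels are illustrative assumptions, not part of the released dataset or its scripts.

```python
import numpy as np

NUM_CLASSES = 527  # labels are the integers 0..526, as described above


def to_multi_hot(labels, num_classes=NUM_CLASSES):
    """Convert a list of multi-label lists to a binary (n_examples, num_classes) matrix."""
    multi_hot = np.zeros((len(labels), num_classes), dtype=np.uint8)
    for row, example_labels in enumerate(labels):
        # Explicit integer dtype keeps empty label lists valid as an index.
        idx = np.asarray(example_labels, dtype=np.intp)
        multi_hot[row, idx] = 1
    return multi_hot


# Toy labels standing in for the real label lists (hypothetical values).
toy_labels = [[0, 5], [526], [3, 3, 7]]
targets = to_multi_hot(toy_labels)
print(targets.shape)  # (3, 527)
```

Repeated indices within one example (as in the third toy example) simply set the same column once, so each row stays strictly binary.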
Download the dataset
Make sure pigz and wget are installed:
# on Mac OS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz
Download the AudioSet files:
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partab
Decompress the tar.gz file parts into the final dataset:
cat audioset_preprocessed.tar.gz-part?? | unpigz | tar -xvC .
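If pigz is not available, the same join-then-extract step can be approximated with the Python standard library. This is a minimal sketch, not the authors' method: the helper name `join_and_extract` and the intermediate file name are assumptions, and unlike the streaming shell pipeline it writes a joined copy of the archive to disk first, so it needs extra free space.

```python
import glob
import os
import shutil
import tarfile


def join_and_extract(part_pattern, joined_path, out_dir="."):
    """Concatenate split archive parts in sorted order, then extract the joined tar.gz."""
    parts = sorted(glob.glob(part_pattern))  # -partaa sorts before -partab
    if not parts:
        raise FileNotFoundError(f"no files match {part_pattern!r}")
    # Rebuild the original archive by appending each part's bytes in order.
    with open(joined_path, "wb") as joined:
        for part in parts:
            with open(part, "rb") as chunk:
                shutil.copyfileobj(chunk, joined)
    os.makedirs(out_dir, exist_ok=True)
    with tarfile.open(joined_path, mode="r:gz") as archive:
        archive.extractall(out_dir)
    return parts
```

With the two downloaded parts in the current directory, a call such as `join_and_extract("audioset_preprocessed.tar.gz-part??", "audioset_preprocessed.tar.gz")` would play the role of the shell pipeline above.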
Once decompressed, the preprocessed data should look like this:
preprocessed/
├── bal_train_features.p
├── bal_train_labels.p
├── bal_train_video_ids.p
├── eval_features.p
├── eval_labels.p
├── eval_video_ids.p
├── unbal_train_features.p
├── unbal_train_labels.p
└── unbal_train_video_ids.p
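The files above can then be read back into Python. The sketch below assumes the `.p` files are standard pickle files (suggested by the extension, but not stated here) and that they follow the `<split>_<kind>.p` naming shown in the listing. The helper `load_split` is a hypothetical name introduced for illustration.

```python
import os
import pickle


def load_split(data_dir, split):
    """Load features, labels, and video ids for 'bal_train', 'unbal_train', or 'eval'."""
    loaded = {}
    for kind in ("features", "labels", "video_ids"):
        # File naming follows the directory listing above, e.g. eval_labels.p
        path = os.path.join(data_dir, f"{split}_{kind}.p")
        with open(path, "rb") as f:
            # Assumption: each file is a pickle. Only unpickle files you trust.
            loaded[kind] = pickle.load(f)
    return loaded


# Example usage once the data is extracted:
# eval_data = load_split("preprocessed", "eval")
# print(len(eval_data["features"]), len(eval_data["labels"]))
# print(eval_data["features"][0].shape)
```

Checking that the three lists have the same length, and inspecting the shape of the first feature matrix, is a quick sanity check before training.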
Recreating this preprocessed dataset from scratch
The original dataset is provided in tfrecord format. To reformat the data as Python lists of numpy matrices (for correcting test sets, viewing errors, and training), you need to run this script: https://github.com/cgnorthcutt/label-errors/blob/main/examples/audioset_preprocessing.py
For example, using this script, you'd run:
mkdir preprocessed
cd preprocessed
python audioset_preprocessing.py --audioset-dir '/path/to/audioset/audioset_v1_embeddings/'
License
This preprocessed dataset is made available (Copyright (c) Curtis G. Northcutt) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
The original AudioSet dataset is made available (Copyright (c) Google Inc.) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.