Numpy AudioSet Embeddings Dataset

@cgnorthcutt cgnorthcutt released this 05 May 18:43
· 66 commits to main since this release
2405c37

This is a version of the AudioSet dataset formatted using only Python lists and NumPy matrices. The original dataset (formatted as tfrecords) is released here: https://research.google.com/audioset/download.html

We found pervasive errors in the test set of this dataset and released corrected test sets at https://github.com/cgnorthcutt/label-errors (see our paper).

Dataset Details

This dataset provides three things for each of the balanced train set, the unbalanced train set, and the eval/test set:

  • the features (as a list of numpy matrices)
    • each 10-second audio clip is represented by a 128-length, 8-bit quantized embedding for every 1 second of audio, giving a 128x10 matrix for the full 10 seconds
  • the labels (as a list of multi-label lists)
    • there are 527 unique labels, denoted as 0, 1, ..., 526
  • the video ids of each example (as a list of lists). Use these to map examples to the corrected test sets and label errors released at https://github.com/cgnorthcutt/label-errors.
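
Because each example carries a variable-length list of label ids, many training setups first convert the labels into a fixed-width multi-hot matrix. The following is a minimal sketch of that conversion, not part of the released code. The helper name `to_multi_hot` is hypothetical, and the sketch assumes each entry of the labels list is a sequence of integers in the range 0 to 526, as described above.

```python
import numpy as np

NUM_CLASSES = 527  # labels are denoted 0, 1, ..., 526


def to_multi_hot(labels, num_classes=NUM_CLASSES):
    """Convert a list of multi-label lists into an (n_examples, num_classes) 0/1 matrix.

    Hypothetical helper: assumes every label id is an integer in [0, num_classes).
    """
    multi_hot = np.zeros((len(labels), num_classes), dtype=np.uint8)
    for row, label_ids in enumerate(labels):
        # An explicit integer dtype keeps empty label lists valid as an index.
        columns = np.asarray(label_ids, dtype=np.intp)
        multi_hot[row, columns] = 1
    return multi_hot
```

For example, `to_multi_hot([[0, 2], [526]])` returns a 2x527 matrix with ones in columns 0 and 2 of the first row and in column 526 of the second row.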

Download the dataset

Make sure pigz and wget are installed:

# on macOS
brew install wget pigz
# on Ubuntu
sudo apt-get install wget pigz

Download the AudioSet files:

wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partaa
wget --continue https://github.com/cgnorthcutt/label-errors/releases/download/numpy-audioset-dataset/audioset_preprocessed.tar.gz-partab

Decompress the tar.gz file parts into the final dataset:

cat audioset_preprocessed.tar.gz-part?? | unpigz | tar -xvC .
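
If pigz is not available, the same step can be approximated in Python with only the standard library. This is a sketch under assumptions rather than the documented procedure: the names `ChainedReader` and `extract_parts` are hypothetical, and it assumes the parts are a plain byte-wise split of a single gzip-compressed tar archive, which is what the `cat ... | unpigz | tar` pipeline implies. It streams the parts in sorted order, so the full archive is never held in memory.

```python
import glob
import tarfile


class ChainedReader:
    """Read several files back to back as one binary stream (hypothetical helper)."""

    def __init__(self, paths):
        self._paths = list(paths)
        self._current = None

    def read(self, size=-1):
        chunks = []
        remaining = size
        while size < 0 or remaining > 0:
            if self._current is None:
                if not self._paths:
                    break  # every part has been consumed
                self._current = open(self._paths.pop(0), "rb")
            data = self._current.read(remaining if size >= 0 else -1)
            if not data:
                # Current part is exhausted; move on to the next one.
                self._current.close()
                self._current = None
                continue
            chunks.append(data)
            if size >= 0:
                remaining -= len(data)
        return b"".join(chunks)


def extract_parts(pattern, dest="."):
    """Extract a split .tar.gz whose parts match `pattern` into `dest`."""
    paths = sorted(glob.glob(pattern))
    if not paths:
        raise FileNotFoundError(f"no archive parts match {pattern!r}")
    # "r|gz" reads the gzip stream sequentially, which suits a chained reader.
    with tarfile.open(fileobj=ChainedReader(paths), mode="r|gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Safer extraction filter on Python versions that provide it.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return paths


# Usage: extract_parts("audioset_preprocessed.tar.gz-part??", ".")
```

Single-threaded gzip decompression in Python will likely be slower than unpigz, so the shell pipeline remains the better choice when pigz is installed.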

Once decompressed, the preprocessed data should look like this:

preprocessed/
├── bal_train_features.p
├── bal_train_labels.p
├── bal_train_video_ids.p
├── eval_features.p
├── eval_labels.p
├── eval_video_ids.p
├── unbal_train_features.p
├── unbal_train_labels.p
└── unbal_train_video_ids.p
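
To load one split from these files, something like the following sketch may work. It is not taken from the release itself: it assumes the `.p` files are standard Python pickles (suggested by the extension but not stated here), and `load_split` is a hypothetical helper name. Unpickling requires NumPy to be installed when the features are NumPy matrices, and pickle files should only be loaded from sources you trust.

```python
import pickle
from pathlib import Path

SPLITS = ("bal_train", "unbal_train", "eval")
KINDS = ("features", "labels", "video_ids")


def load_split(data_dir, split):
    """Return (features, labels, video_ids) for one split.

    Hypothetical helper: assumes files named <split>_<kind>.p that are
    readable with pickle.load.
    """
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}, got {split!r}")
    loaded = []
    for kind in KINDS:
        with open(Path(data_dir) / f"{split}_{kind}.p", "rb") as f:
            loaded.append(pickle.load(f))
    features, labels, video_ids = loaded
    # The three lists are parallel, so their lengths should agree.
    if not (len(features) == len(labels) == len(video_ids)):
        raise ValueError(f"length mismatch in split {split!r}")
    return features, labels, video_ids


# Usage: features, labels, video_ids = load_split("preprocessed", "eval")
```

Since the three lists are parallel, index `i` refers to the same example in each of them, which is what makes it possible to look up a video id for a given feature matrix.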

Recreating this preprocessed dataset from scratch

The original dataset is provided using tfrecord formatting. To reformat the data to python lists of numpy matrices (for correcting test sets, viewing errors, and for training), you need to run this script: https://github.com/cgnorthcutt/label-errors/blob/main/examples/audioset_preprocessing.py

For example, using this script, you'd run:

mkdir preprocessed
cd preprocessed
python audioset_preprocessing.py --audioset-dir '/path/to/audioset/audioset_v1_embeddings/'

License

This preprocessed dataset is made available (Copyright (c) Curtis G. Northcutt) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

The original AudioSet dataset is made available (Copyright (c) Google Inc.) under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.