
Investigating Audio Encoding Models' Effectiveness on Downstream Tasks

Authors

Caroline Cahilly, Caleb Liu, Thomas Yim
Department of Computer Science, Stanford University
Contact: [email protected], [email protected], [email protected]


Overview

This project investigates the effectiveness of audio embedding models in downstream tasks like genre classification and music captioning. We compare three state-of-the-art audio embedding methods:

  • CLAP (Contrastive Language-Audio Pretraining)
  • Wav2Vec 2.0
  • MERT (Acoustic Music Understanding Model)

We evaluate these embeddings' ability to capture rich, musically relevant information for both classification and text generation tasks.


Motivation

Music evokes emotional and cognitive responses in people. In applications ranging from music recommendation systems to music therapy, robust audio embeddings can help machine learning systems understand and use musical features effectively.

Key challenges include:

  • Capturing musical semantics such as instrumentation, rhythm, and emotional tone.
  • Generating embeddings that generalize across different musical styles.
  • Producing embeddings that transfer well to different downstream applications.

Datasets

1. GTZAN Dataset (Genre Classification)

  • 1000 audio clips, each 30 seconds long, spread across 10 genres.
  • Train-test split: 800/200.

2. MusicCaps Dataset (Music Captioning)

  • 5,360 examples: each pairs a 10-second song clip with a text description.
  • Train/val/test split: 70/10/20.
  • Descriptions include instrument details, emotional tone, and sequencing.

Preprocessing Steps:

  • Resampling audio to match embedding model requirements:
    • CLAP: 48 kHz
    • Wav2Vec 2.0: 24 kHz
    • MERT: 24 kHz
  • Converting all audio to mono and normalizing amplitudes to [-1, 1].
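
The snippet below is a minimal sketch of these preprocessing steps using librosa; the sampling-rate map, function name, and peak normalization are illustrative assumptions rather than the repository's exact scripts.

```python
import numpy as np
import librosa

# Target sampling rates for each embedding model (from the list above).
TARGET_SR = {"clap": 48_000, "wav2vec2": 24_000, "mert": 24_000}

def preprocess_clip(path: str, model: str) -> np.ndarray:
    """Load an audio clip, resample it to the model's expected rate,
    downmix to mono, and peak-normalize amplitudes to [-1, 1]."""
    audio, _ = librosa.load(path, sr=TARGET_SR[model], mono=True)
    peak = np.max(np.abs(audio))
    return audio / peak if peak > 0 else audio
```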

Methods

Embedding Models

  1. CLAP: Contrastive learning aligns audio embeddings with text descriptions. It optimizes embeddings using a contrastive loss that pulls matched audio-text pairs together and pushes mismatched pairs apart.
  2. Wav2Vec 2.0: Learns audio representations through self-supervised training and fine-tunes them with labeled data.
  3. MERT: Combines acoustic and pitch-based embeddings using transformer layers optimized with classification and reconstruction losses.
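
For concreteness, the sketch below shows one way to pull a clip-level embedding from a pretrained Wav2Vec 2.0 checkpoint with Hugging Face transformers; the checkpoint name and the mean-pooling over frames are assumptions, not the repository's exact extraction code, and CLAP and MERT embeddings can be obtained analogously through their respective transformers classes.

```python
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2Model

# Hypothetical checkpoint; the project may use a different one.
CKPT = "facebook/wav2vec2-base"

feature_extractor = AutoFeatureExtractor.from_pretrained(CKPT)
model = Wav2Vec2Model.from_pretrained(CKPT).eval()

def embed_clip(waveform, sampling_rate):
    """Return a single clip-level embedding by mean-pooling the
    frame-level hidden states of the final transformer layer.
    `sampling_rate` must match what the checkpoint expects."""
    inputs = feature_extractor(waveform, sampling_rate=sampling_rate,
                               return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_frames, dim)
    return hidden.mean(dim=1).squeeze(0)            # (dim,)
```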

Genre Classification Pipeline

  1. Extract embeddings.
  2. Feed embeddings into an MLP classifier.
  3. Optimize with cross-entropy loss.
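
A minimal PyTorch sketch of steps 2-3, assuming pre-extracted fixed-size clip embeddings; the embedding size, hidden width, dropout, and learning rate are illustrative choices, not values from the report.

```python
import torch
import torch.nn as nn

class GenreMLP(nn.Module):
    """MLP head mapping a clip embedding to logits over the 10 GTZAN genres."""
    def __init__(self, embed_dim: int, num_genres: int = 10, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, num_genres),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

classifier = GenreMLP(embed_dim=768)  # 768 is an assumed embedding size
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

def train_step(embeddings: torch.Tensor, labels: torch.Tensor) -> float:
    """One optimization step on a batch of embeddings and integer genre labels."""
    optimizer.zero_grad()
    loss = criterion(classifier(embeddings), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```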

Music Captioning Pipeline

  1. Extract embeddings.
  2. Use T5 transformer to generate text captions.
  3. Optimize with cross-entropy loss.
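
One plausible way to condition T5 on audio is to project the frame-level embeddings to T5's hidden size and feed them to the encoder via inputs_embeds; the sketch below (the linear projection, the t5-small checkpoint, and the assumed embedding dimension) is an illustration under those assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer

audio_dim = 768  # assumed size of the frame-level audio embeddings

t5 = T5ForConditionalGeneration.from_pretrained("t5-small")
tokenizer = T5Tokenizer.from_pretrained("t5-small")
project = nn.Linear(audio_dim, t5.config.d_model)  # map audio dim -> T5 hidden size

def caption_loss(audio_embeds: torch.Tensor, captions: list[str]) -> torch.Tensor:
    """Cross-entropy loss for generating the reference captions from audio embeddings.
    audio_embeds: (batch, n_frames, audio_dim) frame-level embeddings."""
    labels = tokenizer(captions, return_tensors="pt", padding=True).input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    out = t5(inputs_embeds=project(audio_embeds), labels=labels)
    return out.loss
```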

Results

Genre Classification Performance

| Embedding Model         | Train Accuracy (%) | Test Accuracy (%) |
|-------------------------|--------------------|-------------------|
| Wav2Vec 2.0 (Frozen)    | 44.25              | 43.50             |
| CLAP (Frozen)           | 37.00              | 36.63             |
| MERT (Frozen)           | 72.12              | 66.50             |
| Wav2Vec 2.0 (Unfrozen)  | 92.75              | 67.50             |
| CLAP (Unfrozen)         | 63.75              | 43.00             |
| MERT (Unfrozen)         | 100.00             | 84.00             |

  • Key Observations:
    • MERT outperforms others, especially when fine-tuned (84% test accuracy).
    • CLAP struggles due to its general-purpose training, whereas MERT is music-specific.
    • Unfreezing the embedding weights improves test accuracy for all three models.

PCA Analysis

PCA visualizations show clearer cluster separation in embeddings after fine-tuning, especially for MERT. This improved separation aligns with better classification performance.
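
The helper below shows one way to produce this kind of visualization with scikit-learn and matplotlib; the variable names and plotting choices are placeholders, not the repository's plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_embedding_pca(embeddings: np.ndarray, genre_labels: list[str]) -> None:
    """Project clip embeddings to 2-D with PCA and color the points by genre."""
    coords = PCA(n_components=2).fit_transform(embeddings)
    for genre in sorted(set(genre_labels)):
        mask = np.array([g == genre for g in genre_labels])
        plt.scatter(coords[mask, 0], coords[mask, 1], s=8, label=genre)
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.legend(fontsize=7)
    plt.title("PCA of clip embeddings, colored by genre")
    plt.show()
```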

Caption Generation Task

We evaluated the captioning pipeline's ability to generate textual descriptions of music from audio input with each embedding model. Key results include:

| Embedding Model         | Train  | Val    | Test   |
|-------------------------|--------|--------|--------|
| Wav2Vec 2.0 (Frozen)    | 0.6158 | 0.6124 | 0.6137 |
| CLAP (Frozen)           | 0.6318 | 0.6349 | 0.6282 |
| MERT (Frozen)           | 0.6193 | 0.6306 | 0.6149 |
| Wav2Vec 2.0 (Unfrozen)  | 0.6121 | 0.6205 | 0.6107 |
| CLAP (Unfrozen)         | 0.5940 | 0.5895 | 0.5892 |
| MERT (Unfrozen)         | 0.5544 | 0.5600 | 0.5536 |

  • Key Observations:
    • All embedding models perform poorly on this task.
    • The unfrozen embedding models perform worse than the frozen ones.
    • The poor performance likely stems from the T5 captioning component rather than from the embedding models themselves.

Conclusion

This project highlights the importance of model selection and fine-tuning for audio-related ML tasks. MERT emerges as the strongest performer on genre classification, leveraging its music-specific pretraining.

Future Work:

  • Investigating hybrid approaches combining multiple embeddings.
  • Extending evaluation to tasks like emotion recognition and music synthesis.

Repository Structure

.
├── caption_generation/       # Code for caption generation task
│   ├── models/               # Models used in caption generation
│   ├── scripts/              # To be run from the caption_generation/ directory
│   ├── dataset/              # Dataset processing
│   └── checkpoints/          # Store model weights here
├── preprocessing/            # Extract wav files given YouTube ids (ytids) from metadata file
├── speecht5/                 # Caption generation attempt with SpeechT5 model (excluded from results)
├── plots/                    # Loss plots from training
├── dpo/                      # Progress towards using direct preference optimization on MusicGen
├── dataset_analysis/         # Investigating metadata in dataset
├── music_samples/            # Sample wav files
├── docs/                     # Contains full final report
├── README.md                 # Project description and setup instructions
└── requirements.txt          # Required Python packages

References

For more details, see the full report in docs/.
