
πŸŽ™οΈπŸ˜œ SarcEmotiq

📌 DOI: 10.1121/2.0001918

SarcEmotiq is a deep learning-based tool for recognizing sarcasm in English audio. It ships with a model pre-trained on the open-source MUStARD++ dataset, and it also allows users to retrain the model with their own data.

Model Architecture

SarcEmotiq integrates multiple modalities (acoustic, textual, emotional, and sentiment cues) into a unified attention-based fusion model. A summary of the system follows.

Modalities Used:

| Modality  | Feature source |
|-----------|----------------|
| Audio     | openSMILE (ComParE_2016) |
| Text      | BERT-base-uncased |
| Emotion   | wav2vec2-large-xlsr → speech emotion classifier |
| Sentiment | RoBERTa (sentiment-roberta-large-english) → text sentiment classifier |
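
For reference, here is a minimal sketch of extracting the audio and text features named in the table with the `opensmile` and `transformers` packages. It is illustrative only; the repo's own extraction code (generate_embeddings.py) may differ in detail, and the file name and sentence are placeholders:

```python
import opensmile
import torch
from transformers import BertModel, BertTokenizer

# Audio: ComParE_2016 low-level descriptors, one row per frame
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,
    feature_level=opensmile.FeatureLevel.LowLevelDescriptors,
)
audio_feats = smile.process_file("samples/audio1.wav")  # pandas DataFrame

# Text: BERT-base-uncased token embeddings
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
tokens = tokenizer("This is a test sentence.", return_tensors="pt")
with torch.no_grad():
    text_feats = bert(**tokens).last_hidden_state  # (1, seq_len, 768)
```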

Fusion Mechanisms:

  1. Contrastive attention
    Aligns emotions (as query) with sentiments (as key-value) to emphasize conflicting affective states, which are indicative of sarcasm.

  2. Cross attention
    Aligns textual content (as query) with audio features (as key-value) to capture prosody-semantic mismatches (e.g., a flat tone paired with exaggerated words).

  3. Masked average pooling
    Pools each embedding sequence over time to handle variable-length inputs.

  4. Multimodal concatenation + MLP
    Pooled outputs (text, audio, sentiment, emotion, cross, contrastive) are concatenated and passed through an MLP for final classification (see the sketch after this list).
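
A minimal PyTorch sketch of how these four mechanisms can fit together is shown below. The embedding dimension, head count, and layer sizes are illustrative assumptions rather than the configuration from the paper, and the sketch assumes all four modality embeddings have already been projected to a shared dimension `d`:

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Illustrative fusion module; not the repo's exact architecture."""

    def __init__(self, d=256, n_heads=4, n_classes=2):
        super().__init__()
        # 1. Contrastive attention: emotion queries attend over sentiment
        self.contrastive = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # 2. Cross attention: text queries attend over audio
        self.cross = nn.MultiheadAttention(d, n_heads, batch_first=True)
        # 4. MLP over the six concatenated pooled embeddings
        self.mlp = nn.Sequential(
            nn.Linear(6 * d, d), nn.ReLU(), nn.Linear(d, n_classes)
        )

    @staticmethod
    def masked_mean(x, mask):
        # 3. Masked average pooling: average only the valid time steps
        mask = mask.unsqueeze(-1).float()                 # (B, T, 1)
        return (x * mask).sum(1) / mask.sum(1).clamp(min=1.0)

    def forward(self, text, audio, emo, sent,
                text_mask, audio_mask, emo_mask, sent_mask):
        contra, _ = self.contrastive(emo, sent, sent)     # affective conflict
        cross, _ = self.cross(text, audio, audio)         # prosody vs. semantics
        pooled = torch.cat([
            self.masked_mean(text, text_mask),
            self.masked_mean(audio, audio_mask),
            self.masked_mean(sent, sent_mask),
            self.masked_mean(emo, emo_mask),
            self.masked_mean(cross, text_mask),           # query length = text
            self.masked_mean(contra, emo_mask),           # query length = emotion
        ], dim=-1)
        return self.mlp(pooled)                           # sarcasm logits

# Smoke test with random inputs (emotion frames follow the audio length,
# sentiment tokens follow the text length)
B, Tt, Ta, d = 2, 12, 40, 256
m = FusionSketch(d)
logits = m(torch.randn(B, Tt, d), torch.randn(B, Ta, d),
           torch.randn(B, Ta, d), torch.randn(B, Tt, d),
           torch.ones(B, Tt), torch.ones(B, Ta),
           torch.ones(B, Ta), torch.ones(B, Tt))
print(logits.shape)  # torch.Size([2, 2])
```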

➡️ For more information about the model, please see our [published paper](https://doi.org/10.1121/2.0001918).

Installation

  1. Clone the repository:
    git clone https://github.com/x-y-g/SarcEmotiq.git
    cd SarcEmotiq
  2. Set up the environment:
    pip install -r requirements.txt
  3. Download the pretrained model and place it in the models/ folder as model.pth. The link below points to a Google Drive page, so wget alone will not retrieve the file; download it in a browser (or with a Drive-aware tool such as gdown):
    https://drive.google.com/file/d/1xqyofUELCl2oBlA6151Vvd-f8pIG2biF/view?usp=drive_link
  4. Run inference
    python src/inference.py --input path/to/data --model path/to/model
    ⚠️ Note: The audio file should be in .wav format and between 1 and 20 seconds long. There is no need to include the contextual sentence. See the example under /samples/.

Input Requirements

The input audio and associated text should meet the following criteria:

Audio

  • Format: WAV (.wav)
  • Channels: Mono preferred (if stereo, the script averages the channels)
  • Sample Rate: 16 kHz recommended; resampling is not automatic, so convert beforehand (see the snippet after this list)
  • Duration: 1 to 20 seconds
  • Environment: Clean audio (avoid overlapping speech or heavy background noise)
  • Recommended tool to enhance speech: Adobe Podcast
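
If your recordings are stereo or not at 16 kHz, a one-off conversion along these lines avoids surprises. This snippet assumes `librosa` and `soundfile` are installed; the file names are placeholders:

```python
import librosa
import soundfile as sf

# Resample to 16 kHz and downmix to mono in one step
y, sr = librosa.load("input.wav", sr=16000, mono=True)
sf.write("input_16k_mono.wav", y, sr)
```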

Transcription

  • For inference, the system will automatically transcribe audio using Whisper.
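
inference.py runs this step for you; if you want to inspect a transcription manually, a standalone sketch with the openai-whisper package looks like this (the "base" model size is an assumption):

```python
import whisper

model = whisper.load_model("base")   # "base" is one of several available sizes
result = model.transcribe("samples/audio1.wav")
print(result["text"])
```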

Retrain the model

You can retrain the model with your own dataset (audio + text). Preprocessing and training are organized as separate steps to maximize flexibility and control for research purposes.

Step 1: Prepare the input dataset

You need:

  • A folder of .wav audio files
  • A transcript CSV file

CSV format:

KEY,SENTENCE
audio1,This is a test sentence.
audio2,Another example of sarcasm.

The text_csv file is expected to contain at least two columns:

  • KEY: This should be a unique identifier for each audio file, corresponding to the file name (without the .wav extension).
  • SENTENCE: This column contains the transcriptions (or textual representations) of the audio files. Each row should correspond to the text spoken in the associated audio file.

Example directory:

/dataset/
├── audio/
│   ├── audio1.wav
│   └── audio2.wav
└── transcript.csv
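
Before generating embeddings, it can help to confirm that every KEY in the CSV has a matching audio file. This check is illustrative and not part of the repo; the paths match the example directory above:

```python
from pathlib import Path
import pandas as pd

df = pd.read_csv("dataset/transcript.csv")
wavs = {p.stem for p in Path("dataset/audio").glob("*.wav")}
missing = set(df["KEY"]) - wavs
if missing:
    raise ValueError(f"No .wav file for keys: {sorted(missing)}")
```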

Step 2: Generate feature embeddings

# --text_csv: CSV file containing the audio file keys and text
#   (sample: "/data/mustard++_onlyU.csv")
# --output_directory: where the temporary extracted audio feature files (LLDs)
#   are stored; they are removed after processing
python generate_embeddings.py \
  --audio_directory /path/to/audios \
  --text_csv /path/to/text.csv \
  --output_directory /path/to/store/tmp_audio_features

Step 3: Normalize feature embeddings

Once the embeddings (audio, text, sentiment, and emotion) are extracted, you can normalize them using the following command:

python normalize_embeddings.py --embeddings path/to/embeddings.h5 --output_dir path/to/output_directory

This will generate four normalized files:

  • normalized_audio.h5
  • normalized_text.h5
  • normalized_sentiment.h5
  • normalized_emotion.h5

Additionally, a separate label.h5 file will be generated containing the unmodified labels.
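
To sanity-check the outputs, you can list the top-level datasets in each file with h5py. The internal layout depends on how the repo writes these files, so treat this as a generic inspection snippet (the output directory name is a placeholder):

```python
import h5py

for name in ["normalized_audio", "normalized_text",
             "normalized_sentiment", "normalized_emotion", "label"]:
    with h5py.File(f"output_directory/{name}.h5", "r") as f:
        print(name, list(f.keys())[:5])   # first few dataset keys
```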

Step 4: Train the model

Once the normalized embeddings are generated, you can train the model with the following command. The default path for the normalized embedding files is data/.

python train.py --data path/to/data --epochs 20 --batch_size 32 --model_path ./models/model.pth --patience 5 --lr 0.001

You can adjust the following parameters to suit your needs.

- data: Path to the folder containing preprocessed embeddings.
- epochs: Number of epochs for training (default is 20).
- batch_size: Number of samples per batch (default is 32).
- model_path: Path to save the trained model. Make sure to provide a full file name like ./models/model.pth.
- patience: Number of epochs to wait for improvement before early stopping.
- lr: Learning rate for the optimizer (default is 0.001).
  • The script outputs training progress, including the loss on the training and validation sets. The best model is saved at the specified model path.

  • Training uses early stopping based on validation loss: if the model does not improve for patience epochs, training stops early (sketched below).
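
For reference, the early-stopping behaviour described above boils down to a loop like the following sketch. train_epoch and evaluate are hypothetical stand-ins for the repo's per-epoch training and validation code, and train.py may implement the details differently:

```python
import torch

def fit(model, train_epoch, evaluate, epochs=20, patience=5,
        model_path="./models/model.pth"):
    best_val, no_improve = float("inf"), 0
    for epoch in range(epochs):
        train_epoch(model)                  # one pass over the training set
        val_loss = evaluate(model)          # loss on the validation set
        if val_loss < best_val:
            best_val, no_improve = val_loss, 0
            torch.save(model.state_dict(), model_path)  # keep the best model
        else:
            no_improve += 1
            if no_improve >= patience:      # no improvement for `patience` epochs
                print(f"Early stopping at epoch {epoch}")
                break
    return best_val
```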

Limitations

While SarcEmotiq performs well on benchmark data (74% F1-score for binary sarcasm detection), users should be aware of its current limitations:

  • Domain generalization: Trained primarily on MUStARD++ (scripted, American-accented data). Accuracy may drop for spontaneous speech or unfamiliar accents.
  • Transcript dependency: Model performance degrades with inaccurate transcriptions. Whisper is robust but may misinterpret noisy speech.
  • Emotion/sentiment models: Emotion and sentiment embeddings come from general-purpose models (not sarcasm-specific), which may add noise in some cases.
  • Real-time limitations: This is a batch inference system; not optimized for real-time streaming or large-scale deployment without adaptation.
  • Context-free inference: Each audio clip is processed independently. No dialog or speaker context is used.

(Future work may include context-aware transformers for broader deployment.)

License

This project is licensed under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.

You are free to:

  • Share β€” copy and redistribute the material in any medium or format
  • Adapt β€” remix, transform, and build upon the material

Under the following terms:

  • Attribution β€” You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the authors endorse you or your use.
  • NonCommercial β€” You may not use the material for commercial purposes.

📄 Full license text: https://creativecommons.org/licenses/by-nc/4.0/

Citation

If you use SarcEmotiq in your research, publications, or derived work, please cite the following paper:

Xiyuan Gao, Shekhar Nayak, Matt Coler, Improving sarcasm detection from speech and text through attention-based fusion exploiting the interplay of emotions and sentiments. Proc. Mtgs. Acoust. 13 May 2024; 54 (1): 060002. https://doi.org/10.1121/2.0001918

@article{gao2024sarcasm,
  title={Improving sarcasm detection from speech and text through attention-based fusion exploiting the interplay of emotions and sentiments},
  author={Gao, Xiyuan and Nayak, Shekhar and Coler, Matt},
  journal={Proceedings of Meetings on Acoustics},
  volume={54},
  number={1},
  pages={060002},
  year={2024},
  publisher={Acoustical Society of America},
  doi={10.1121/2.0001918}
}
