This repository contains the code for SpectroGAN, the final project for USC EE599 Deep Learning for Engineers, Spring 2020.
Github link: https://github.com/hegde95/GAN-for-speech-spectrogram
The main objective is to apply style transfer on speech spectrograms in order to change the emotions conveyed in said speech.
Recent studies have shown how style transfer can be applied to images from one domain to another. In this project we attempt to use this technique to embed emotions in spectrogram images. The end goal of the project is to show that speech audio recorded with the connotation of one emotion can be converted to another emotion without changing the content/information conveyed in the speech.
For this project we chose the RAVDESS data set. The data set contains lexically-matched statements in a neutral North American accent, each spoken with one of eight emotions: anger, calm, disgust, fearful, happy, neutral, sad and surprised. The cleaned and re-arranged data can be found here. For this project, we chose to convert audio from "calm" to other emotions. The entire set of npz files can be found at these links:
calm2surprised - https://drive.google.com/uc?id=15HlogMsEX9juzL1j7HqweDQv9F5tJFuG
calm2sad - https://drive.google.com/uc?id=15HlO9YvZjMtbcEiXajfE9uqmrVS0Unep
calm2happy - https://drive.google.com/uc?id=153PIrQEk_agKiUOP5cujrVyGjnxDqKhd
calm2fearful - https://drive.google.com/uc?id=14scuVs2nlNH29DIWecrNrCcwNAVR0orG
calm2disgust - https://drive.google.com/uc?id=14s7kWrDQP61X9QXYDV-W3W4YIukJs_55
calm2anger - https://drive.google.com/uc?id=14q4aZseMCQO_xbbmX-JRsbRSlGX9bB3E
fearful2surprised - https://drive.google.com/uc?id=167zknyKgV5r8qO_fLbbFLT1A76WwTWiL
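The names of the arrays stored inside each npz archive are not documented here, so it can help to inspect a downloaded file before using it. The following is a minimal sketch using only NumPy; `describe_npz` and the file name in the usage note are hypothetical and not part of the project code.

```python
import numpy as np

def describe_npz(path):
    """Return {array_name: (shape, dtype)} for every array in an .npz archive."""
    # np.load on an .npz file returns a lazy archive; `files` lists the stored names.
    with np.load(path) as archive:
        return {
            name: (archive[name].shape, str(archive[name].dtype))
            for name in archive.files
        }
```

For example, `describe_npz("calm2sad.npz")` (assuming the archive was saved under that name) lists whatever arrays the file actually contains, along with their shapes.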
The source and target data in this project are .wav files, but our GANs work on images.
- Audio to Image: To convert the audio to spectrograms, we sampled the audio at 16000 Hz and performed an STFT of length 512 with a hop length of 256. The source audio files were also trimmed so as to obtain a spectrogram of size 257 X 257. This image was padded with 0's to get a 260 X 260 array, which is the input and output of our GAN.
- Image to Audio: To convert the generated spectrograms to audio, we used the Griffin-Lim algorithm on the clipped image. We made sure that the FFT length and the hop length used in the ISTFT were the same as before.
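The two conversions above can be sketched in plain NumPy as follows. This is an illustration under stated assumptions, not the project's actual code: the function names are hypothetical, the STFT is centered with a Hann window, the raw magnitude is used without any log scaling or normalization, the zeros are appended at the bottom and right of the image, and the number of Griffin-Lim iterations is arbitrary. Only the sampling rate, FFT length, hop length and image sizes come from the text.

```python
import numpy as np

SR, N_FFT, HOP = 16000, 512, 256       # values stated in the text
N_FRAMES, PAD_TO = 257, 260            # spectrogram width and padded image side
WINDOW = np.hanning(N_FFT + 1)[:-1]    # periodic Hann window (assumption)

def stft(y):
    """Centered STFT; returns a complex array of shape (N_FFT // 2 + 1, frames)."""
    y = np.pad(y, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(y) - N_FFT) // HOP
    frames = np.stack(
        [y[i * HOP:i * HOP + N_FFT] * WINDOW for i in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)

def istft(spec):
    """Inverse of stft() using windowed overlap-add with the same N_FFT and HOP."""
    n_frames = spec.shape[1]
    length = N_FFT + HOP * (n_frames - 1)
    y, norm = np.zeros(length), np.zeros(length)
    frames = np.fft.irfft(spec, n=N_FFT, axis=0)
    for i in range(n_frames):
        y[i * HOP:i * HOP + N_FFT] += frames[:, i] * WINDOW
        norm[i * HOP:i * HOP + N_FFT] += WINDOW ** 2
    y /= np.maximum(norm, 1e-8)
    return y[N_FFT // 2:length - N_FFT // 2]   # drop the centering padding

def to_image(y):
    """Trim audio to 257 frames, take the magnitude STFT, zero-pad to 260 X 260."""
    n_samples = HOP * (N_FRAMES - 1)           # 65536 samples -> 257 frames
    if len(y) < n_samples:
        raise ValueError("audio too short for a 257-frame spectrogram")
    mag = np.abs(stft(y[:n_samples]))          # shape (257, 257)
    return np.pad(mag, ((0, PAD_TO - mag.shape[0]), (0, PAD_TO - mag.shape[1])))

def griffin_lim(mag, n_iter=32, seed=0):
    """Estimate a phase for `mag` by alternating ISTFT and STFT projections."""
    rng = np.random.default_rng(seed)
    angles = np.exp(2j * np.pi * rng.random(mag.shape))
    for _ in range(n_iter):
        rebuilt = stft(istft(mag * angles))
        angles = np.exp(1j * np.angle(rebuilt))
    return istft(mag * angles)

def to_audio(image, n_iter=32):
    """Clip the 260 X 260 image back to 257 X 257 and run Griffin-Lim."""
    return griffin_lim(image[:N_FFT // 2 + 1, :N_FRAMES], n_iter=n_iter)
```

With these settings, 65536 samples (about 4.1 seconds at 16000 Hz) produce exactly 257 frames of 257 frequency bins each, which is where the 257 X 257 size comes from.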
For our project we attempted to implement a CycleGAN, as this has been shown to perform well on style transfer tasks. Also, to be independent of image size (and therefore of FFT length), we use a PatchGAN model for our discriminator network. This code was based on this link
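To illustrate why a PatchGAN discriminator does not depend on the input size: it is fully convolutional, so a larger spectrogram simply yields a larger grid of per-patch real/fake scores. The sketch below computes that grid size and the receptive field of each score. The layer list is the commonly used 70 X 70 PatchGAN layout and is an assumption here; the layers actually used in this project may differ.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

# Assumed layout (kernel, stride, padding) of a typical 70 X 70 PatchGAN.
PATCHGAN_LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]

def patch_grid(size, layers=PATCHGAN_LAYERS):
    """Side length of the grid of patch scores for a square input."""
    for kernel, stride, padding in layers:
        size = conv_out(size, kernel, stride, padding)
    return size

def receptive_field(layers=PATCHGAN_LAYERS):
    """Side length of the input patch that influences a single output score."""
    field, jump = 1, 1
    for kernel, stride, _ in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field

print(patch_grid(260))      # 30
print(patch_grid(520))      # 63
print(receptive_field())    # 70
```

Under these assumed layers, a 260 X 260 input gives a 30 X 30 grid of scores and a 520 X 520 input gives a 63 X 63 grid, while each score always judges a 70 X 70 patch.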
Here is the link to our presentation
Here is a link to our report
Here is a video showing a demo:
Below are a couple of input and output audio files with the corresponding spectrograms, tested with different models. (Click on the image to hear the audio.)
The following are results for 3, 6 and 9 ResNet blocks in the transformer, trained for 100 epochs:
Emotion | "Dogs are sitting by the door" (3) | "Dogs are sitting by the door" (6) | "Dogs are sitting by the door" (9) |
---|---|---|---|
Neutral (Original) | |||
Angry |
We found that 3 ResNet blocks performed poorly, with neither noticeable emotion transfer nor clean reconstruction of the original audio. The model with 6 ResNet blocks performed better, with satisfactory emotion transfer and reconstruction. Although 9 ResNet blocks gave very good results in terms of emotion transfer, the reconstructed audio suffered from noise, and this model was also computationally expensive. Hence, we decided to proceed with 6 ResNet blocks, which offer a good compromise between style transfer, denoising and computational efficiency.
The following are results for 260 X 260 and 520 X 520 spectrograms, trained for 100 epochs:
Emotion | "Dogs are sitting by the door" (260 X 260) | "Dogs are sitting by the door" (520 X 520) |
---|---|---|
Calm (Original) | ||
Angry |
It is evident that the model which generated a 260 X 260 spectrogram had a better reconstruction than the one which generated a 520 X 520 spectrogram. This finding also helped us reduce computation for further experiments.
After the above experimentation, we found that the models showed peak performance at the epochs listed under each emotion. These are the results after implementing emotion transfer on the same audio file:
From the above table we see that two conversions, calm to fearful and calm to surprised, give the best emotion transfer. There were also noticeable characteristic changes in the harmonic structure of the input speech.
A few audio files from the dataset were held back for testing and optimizing our model. The spectrograms shown below are generated for these unseen input audio files.
Emotion | "Kids are talking by the door" | "Dogs are sitting by the door" |
---|---|---|
Calm (Original) | ||
Angry | ||
Fearful |
The spectrograms below show the performance of the model on an audio file of the same script, but spoken by an actor not in the dataset. The model shows good performance even on unseen data.
Emotion | "Dogs are sitting by the door" |
---|---|
Calm (Original) | |
Angry | |
Fearful |
The following results demonstrate the ability of our model to transfer emotions on audio clips of unseen actors speaking lexically similar sentences.
Emotion | "This project is fun" | "Three plus one equals four" |
---|---|---|
Calm (Original) | ||
Angry | ||
Fearful |
We also experimented on audio clips of unseen actors speaking in different languages (Hindi and Kannada). The model did not produce results with sufficient style transfer. However, it was still able to reconstruct the audio clips in an unseen language without much noise.
Emotion | "Gaadi waala aya ghar se kachra nikal" | "Konegu project mugithu" |
---|---|---|
Calm (Original) | ||
Angry | ||
Fearful |