Project by Sukesh Shenoy and Megha Ghosh. The project was done in Python 3 in a Google Colab notebook. Due to the sensitivity of the data, the dataset is not included; it can be downloaded here.
Since this is an old project, the code has not been actively updated for newer dependency versions, so downgrade the following packages:
- scikit-learn==0.22.2
- tensorflow<2.0
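In a Colab notebook these pins can be applied with a cell like the following (a minimal sketch; the runtime usually needs a restart after downgrading):

```python
# Colab cell: pin the package versions noted above (restart the runtime afterwards).
!pip install scikit-learn==0.22.2 "tensorflow<2.0"
```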
We have used the DAIC-WOZ Depression Database to train our model. This database is part of a larger corpus, the Distress Analysis Interview Corpus (DAIC) (Gratch et al., 2014), that contains clinical interviews designed to support the diagnosis of psychological distress conditions such as anxiety, depression, and post-traumatic stress disorder. The dataset consists of 189 sessions, averaging 16 minutes, between a participant and a virtual interviewer called Ellie, controlled by a human interviewer in another room via a "Wizard of Oz" approach. Prior to the interview, each participant completed a psychiatric questionnaire (PHQ-8), from which a binary "truth" classification (depressed, not depressed) was derived.
We have used Convolutional Neural Networks (CNNs) to learn useful characteristics of depression from speech. Our CNN model is designed to extract features from a spectrogram, which are in turn used to classify the audio spectrogram into two classes: depressed and non-depressed.
These interviews were collected as part of a larger effort to create a computer agent that interviews people and identifies verbal and nonverbal indicators of mental illness (DeVault et al., 2014). The data collected include audio and video recordings and extensive questionnaire responses; this part of the corpus comprises the Wizard-of-Oz interviews, conducted by an animated virtual interviewer called Ellie, controlled by a human interviewer in another room. The data have been transcribed and annotated for a variety of verbal and non-verbal features.
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time. When applied to an audio signal, spectrograms are sometimes called sonographs, voiceprints, or voicegrams; when the data is represented in a 3D plot, they may be called waterfalls. Spectrograms are used extensively in music, sonar, radar, speech processing, seismology, and other fields. Spectrograms of audio can be used to identify spoken words phonetically, and to analyse the various calls of animals. A spectrogram can be generated by an optical spectrometer, a bank of band-pass filters, a Fourier transform, or a wavelet transform (in which case it is also known as a scaleogram). A spectrogram is usually depicted as a heat map, i.e., as an image with the intensity shown by varying the colour or brightness. Creating a spectrogram using the FFT is a digital process: digitally sampled data in the time domain is broken up into chunks, which usually overlap, and each chunk is Fourier transformed to calculate the magnitude of the frequency spectrum for that chunk. Each chunk then corresponds to a vertical line in the image, a measurement of magnitude versus frequency for a specific moment in time (the midpoint of the chunk). These spectra are then laid side by side to form the image or a three-dimensional surface, or slightly overlapped in various ways (windowing). This process essentially corresponds to computing the squared magnitude of the short-time Fourier transform (STFT) of the signal s(t); that is, for a window width ω, spectrogram(t, ω) = |STFT(t, ω)|².
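As an illustration of this STFT-based process, here is a minimal sketch that computes a spectrogram for one audio segment and saves it as an image; the file names and STFT parameters are assumptions, not the project's exact settings:

```python
# Sketch: compute and save a spectrogram for one (mono) audio segment via the STFT.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy import signal

sample_rate, samples = wavfile.read("participant_segment.wav")  # hypothetical file

# Sxx is the power spectrogram, proportional to |STFT(t, w)|^2.
freqs, times, Sxx = signal.spectrogram(samples, fs=sample_rate,
                                       window="hann", nperseg=512, noverlap=256)

# Plot on a log (dB) scale as a heat map and save without axes.
plt.pcolormesh(times, freqs, 10 * np.log10(Sxx + 1e-10), shading="gouraud")
plt.axis("off")
plt.savefig("participant_segment.png", bbox_inches="tight", pad_inches=0)
```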
Convolutional Neural Networks (CNNs) are analogous to traditional ANNs in that they are comprised of neurons that self-optimise through learning. Each neuron still receives an input and performs an operation (such as a scalar product followed by a non-linear function), the basis of countless ANNs. From the input raw image vectors to the final output of the class score, the entire network still expresses a single perceptive score function (the weights). The last layer contains loss functions associated with the classes, and all of the regular tips and tricks developed for traditional ANNs still apply [1]. The only notable difference between CNNs and traditional ANNs is that CNNs are primarily used in the field of pattern recognition within images. This allows us to encode image-specific features into the architecture, making the network better suited to image-focused tasks while further reducing the parameters required to set up the model.
CNNs are built on the premise that the input will be comprised of images. This allows the architecture to be set up in a way that best suits this specific type of data. The layers within a CNN are comprised of neurons organised into three dimensions: the spatial dimensionality of the input (height and width) and the depth. The depth does not refer to the total number of layers within the ANN, but to the third dimension of an activation volume. Unlike standard ANNs, the neurons within any given layer connect only to a small region of the layer preceding it. CNNs are comprised of three types of layers: convolutional layers, pooling layers, and fully-connected layers. When these layers are stacked, a CNN architecture has been formed. The basic functionality of a CNN can be broken down into four key areas.
- As found in other forms of ANN, the input layer will hold the pixel values of the image.
- The convolutional layer will determine the output of neurons which are connected to local regions of the input, through the calculation of the scalar product between their weights and the region connected to the input volume. The rectified linear unit (commonly shortened to ReLU) aims to apply an 'element-wise' activation function such as sigmoid to the output of the activation produced by the previous layer.
- The pooling layer will then simply perform down-sampling along the spatial dimensionality of the given input, further reducing the number of parameters within that activation (illustrated in the toy example after this list).
- The fully-connected layers will then perform the same duties found in standard ANNs and attempt to produce class scores from the activations, to be used for classification. It is also suggested that ReLU may be used between these layers to improve performance. Through this simple method of transformation, CNNs are able to transform the original input layer by layer, using convolutional and down-sampling techniques to produce class scores for classification and regression purposes.
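To make the element-wise activation and the pooling step above concrete, here is a toy numpy illustration (not part of the project code):

```python
# Toy illustration of an element-wise ReLU followed by 2x2 max pooling with stride 2.
import numpy as np

feature_map = np.array([[ 1., -2.,  3., -4.],
                        [-1.,  2., -3.,  4.],
                        [ 5., -6.,  7., -8.],
                        [-5.,  6., -7.,  8.]])

relu = np.maximum(feature_map, 0)            # element-wise activation

# Keep the largest value in each non-overlapping 2x2 block: 4x4 -> 2x2.
pooled = relu.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)                                # [[2. 4.] [6. 8.]]
```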
The dataset contains 192 audio sessions between an animated virtual interviewer called Ellie and the participant. The features of the participants' audio segments are useful for classification; the segments are obtained by silence removal and then separated by speaker diarization (a sketch of the silence-based splitting follows below). These audio segments are of varying lengths, so there is spread in the data.
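A minimal sketch of the silence-based splitting step, using librosa's energy-threshold split as a stand-in for the project's exact method; the file name and the 30 dB threshold are assumptions, and the diarization step that keeps only the participant's turns is not shown:

```python
# Sketch: split one recording into non-silent segments and save each to a file.
import librosa
import soundfile as sf

y, sr = librosa.load("participant_audio.wav", sr=None)   # hypothetical file
intervals = librosa.effects.split(y, top_db=30)           # [start, end) sample indices

for i, (start, end) in enumerate(intervals):
    sf.write(f"segment_{i:03d}.wav", y[start:end], sr)
```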
In the dataset, the number of non-depressed subjects is about four times larger than that of depressed ones, which can introduce a "non-depressed" classification bias. Additional bias can occur due to the considerable range of interview durations (7-33 minutes), because a larger volume of signal from an individual may emphasise characteristics that are person-specific. To rectify this imbalance, audio segments are randomly sampled in equal numbers from each class (see the sketch below).
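A minimal sketch of such class-balanced random sampling, with hypothetical placeholder file lists standing in for the segments collected per class:

```python
# Sketch: undersample so both classes contribute equally many segments.
import random

# Illustrative placeholders; in the project these would be the full per-class
# lists of segment file paths produced by the splitting step above.
depressed_segs = ["dep_001.wav", "dep_002.wav"]
non_depressed_segs = ["non_001.wav", "non_002.wav", "non_003.wav"]

random.seed(42)
n = min(len(depressed_segs), len(non_depressed_segs))
balanced = random.sample(depressed_segs, n) + random.sample(non_depressed_segs, n)
```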
The sampled audio segments are then converted to spectrogram images of size 512 x 512 pixels. These images are put into folders corresponding to the two classes, and the folders are then split into training and validation data in an 8:2 ratio.
These images are converted to TensorFlow tensors using the flow_from_directory method of the Keras ImageDataGenerator. The image tensors are normalised and fed into the Convolutional Neural Network.
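A sketch of how the generators might be set up; the directory names and batch size are assumptions rather than the project's exact values:

```python
# Sketch: read the class folders and normalise pixel values to [0, 1].
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

train_gen = datagen.flow_from_directory(
    "data/train",                 # hypothetical layout: train/<class_name>/*.png
    target_size=(512, 512),
    class_mode="binary",          # two classes: depressed / non-depressed
    batch_size=16,
)
val_gen = datagen.flow_from_directory(
    "data/validation",
    target_size=(512, 512),
    class_mode="binary",
    batch_size=16,
)
```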
The Convolutional Neural Network consists of the following layers:
2D Convolutional layer 1:
- Input = 512 x 512 x 3 image tensor
- Activation = ReLU
- No. of filters = 32
- Filter size = 5 x 5
- Strides = 1
- Padding = 0

MaxPooling layer:
- Pooling size = 4 x 4
- Strides = 4

2D Convolutional layer 2:
- Activation = ReLU
- No. of filters = 32
- Filter size = 3 x 3
- L2 regulariser (0.01)

MaxPooling layer:
- Pooling size = 1 x 1
- Strides = 1

Flatten layer

Dense layer 1:
- Nodes = 128
- Activation = linear
- L2 regulariser (0.01)

Dropout layer:
- Dropout = 60%

Dense layer 2:
- Nodes = 256
- Activation = ReLU
- L2 regulariser (0.001)

Dropout layer:
- Dropout = 80%

Dense layer 3 (output):
- Nodes = 1
- Activation = Sigmoid
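A Keras (TF 1.x) sketch of the layer stack listed above; values not stated in the list, such as padding behaviour beyond the first layer, follow Keras defaults and are assumptions:

```python
# Sketch of the layer stack described above.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.regularizers import l2

model = Sequential([
    Conv2D(32, (5, 5), activation="relu", input_shape=(512, 512, 3)),
    MaxPooling2D(pool_size=(4, 4), strides=4),
    Conv2D(32, (3, 3), activation="relu", kernel_regularizer=l2(0.01)),
    MaxPooling2D(pool_size=(1, 1), strides=1),
    Flatten(),
    Dense(128, activation="linear", kernel_regularizer=l2(0.01)),
    Dropout(0.6),
    Dense(256, activation="relu", kernel_regularizer=l2(0.001)),
    Dropout(0.8),
    Dense(1, activation="sigmoid"),   # binary output: depressed / non-depressed
])
model.summary()
```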
Though there are some differences, the actual architecture employed in this effort was largely inspired by a paper on Environmental Sound Classification with CNNs.
The model was trained on approximately 6000 images belonging to the two classes, using the stochastic gradient descent (SGD) optimizer with binary cross-entropy loss. The model was trained until the validation accuracy saturated (see the training sketch below).
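A sketch of the compile/train step under these choices; `model`, `train_gen`, and `val_gen` refer to the earlier sketches, and the learning rate and epoch count are assumptions:

```python
# Sketch: SGD + binary cross-entropy training (TF 1.x Keras API).
from tensorflow.keras.optimizers import SGD

model.compile(optimizer=SGD(lr=0.01),
              loss="binary_crossentropy",
              metrics=["accuracy"])

history = model.fit_generator(train_gen,
                              validation_data=val_gen,
                              epochs=50)   # in practice, train until val accuracy saturates
```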
The network gave an accuracy of about 75% on testing. We ultimately envision the model being implemented in a wearable device (Apple Watch, Garmin) or home device (Amazon Echo). The device could prompt you to answer a simple question in the morning and a simple question before bed on a daily basis. The model stores your predicted depression score and tracks it over time, such that the model can learn from your baseline. If a threshold is crossed, it notifies you to seek help, or in extreme cases, notifies an emergency contact to help you help yourself. This initial model provides a solid foundation and promising directions for detecting depression with spectrograms. Depression moves across a spectrum, so deriving a binary classification (depressed, not depressed) from a single test (PHQ-8) is somewhat naïve and perhaps unrealistic. The threshold for a depression classification was a score of 10, but how much difference in depression-related speech prosody exists between a score of 9 (classified as not depressed) and a 10 (classified as depressed)? For this reason, the problem may be better approached by using regression techniques to predict participants' PHQ-8 scores and scoring the model based on RMSE. We would prioritize future efforts as follows:
- Implementing the model for Indian languages.
- Sampling methods to increase training size without introducing class or speaker bias.
[1] K. O'Shea and R. Nash, An Introduction to Convolutional Neural Networks.