
@vimalmanohar
Contributor

Objective

Create a general-purpose Speech Activity Detector (SAD) that can be used for a wide range of tasks, primarily focusing on Automatic Speech Recognition (ASR) and iVector extraction. We would also like the system to be usable in an online setting.

SNR Predictor

  • A time-delay neural network (TDNN) is trained on artificially corrupted MFCC
    features to predict sub-band SNRs. The network is currently configured with
    6 hidden layers and a temporal context of about 200 ms.
  • A squared-error objective is used as the training criterion. The targets are
    the sub-band SNRs in the log domain.
  • Initially, we selected WSJ as the base speech corpus because it was recorded
    in clean and controlled conditions. However, we found that it is better to
    train the SNR predictor on a conversational speech corpus like Fisher, since
    most of the target applications come from that domain.
  • Room impulse responses, collected from various sources such as Reverb2014,
    RWCP and AIRD, are used to add reverberation to the clean speech waveform.
    As of now, we treat this reverberated speech signal as the ‘clean’ speech
    signal.
  • Many different kinds of noises, both foreground and background, were collected
    and randomly added to the reverberated speech signals to create corrupted
    speech signals. Some stationary noises were part of the RIR source corpora;
    many non-stationary noises were taken from the MUSAN corpus.
  • The noise energy was scaled randomly to create corrupted speech signals at
    various SNR levels ranging from 20 dB to -5 dB (see the sketch after this
    list). Additionally, we perturbed the volumes of the corrupted speech signals
    in order to make the SNR predictor invariant to the energy of the input
    speech.
  • Initially, the noise was added randomly without considering whether it was
    background or foreground noise. It is more sensible, however, to separate the
    two kinds of noise and create corrupted speech with foreground noise on top
    of background noise. Thus there are two SNR levels, one for background noise
    and one for foreground noise, that are randomly chosen to perturb the speech
    data. The background noise is added to the entire conversation at a fixed SNR
    level that is randomly chosen for each conversation. Several different
    foreground noises are selected randomly, along with a randomly chosen SNR
    level for each instance of foreground noise that is added.
  • This method was further improved to prepare the training data more carefully.
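
A minimal sketch of the corruption and target-generation steps above (Python/NumPy; the helper names, the edge cases, and the exact definition of the sub-band SNR targets are assumptions for illustration, not the actual Kaldi scripts):

```python
import numpy as np

def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise energy ratio equals snr_db."""
    speech_energy = np.sum(speech ** 2) + 1e-10
    noise_energy = np.sum(noise ** 2) + 1e-10
    # Required noise energy for a target SNR in dB: E_n = E_s / 10^(SNR/10)
    target_noise_energy = speech_energy / (10.0 ** (snr_db / 10.0))
    return noise * np.sqrt(target_noise_energy / noise_energy)

def log_subband_snr_targets(speech_fbank, noise_fbank, floor=1e-10):
    """Per-frame, per-band log-SNR regression targets computed from filterbank
    energies of the (reverberated) 'clean' speech and the added noise."""
    return np.log(np.maximum(speech_fbank, floor)) - np.log(np.maximum(noise_fbank, floor))

# Hypothetical usage: one background SNR chosen per conversation in [-5, 20] dB;
# foreground noise instances would each get their own randomly chosen SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)      # stand-in for a reverberated waveform
bg_noise = rng.standard_normal(16000)    # stand-in for a background noise waveform
bg_snr_db = rng.uniform(-5.0, 20.0)
corrupted = speech + scale_noise_to_snr(speech, bg_noise, bg_snr_db)
```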

Speech Activity Detector

  • The Speech Activity Detector is a single-hidden-layer neural network that is
    trained on the sub-band SNRs predicted by the SNR predictor to classify each
    frame as speech / non-speech.
  • A ReLU neural network with 100 hidden units is trained on input features
    (sub-band SNRs) spliced over frames -6 to +2. The frame-level SNR, obtained by
    aggregating the sub-band SNRs, is added as an additional feature (see the
    sketch after this list).
  • The training data is again the same corrupted Fisher data. The VAD labels for
    training are obtained from a standard LDA+MLLT HMM-GMM system. Some
    post-processing was needed to account for the regions of the Fisher corpus
    that are not covered by the segments provided with the corpus. All of these
    steps are completely automatic (no manual processing).
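
A minimal sketch of the SAD input features and forward pass described above (Python/NumPy; it assumes the predicted sub-band SNRs arrive as a frames-by-bands matrix and that the frame-level SNR is a simple mean over bands, which may differ from the actual aggregation; the function names are hypothetical):

```python
import numpy as np

def splice_frames(feats, left=6, right=2):
    """Splice each frame with left/right context (here -6 to +2); edges are
    padded by repeating the first/last frame. `feats` is (frames, bands)."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(left + right + 1)])

def sad_input(subband_snr):
    """Spliced sub-band SNRs plus an aggregated per-frame SNR as an extra feature."""
    frame_snr = subband_snr.mean(axis=1, keepdims=True)
    return np.hstack([splice_frames(subband_snr), frame_snr])

def sad_forward(x, w1, b1, w2, b2):
    """Single hidden layer of 100 ReLU units followed by a 2-class softmax
    giving per-frame non-speech / speech probabilities."""
    h = np.maximum(0.0, x @ w1 + b1)
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```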

Segmentation on test data

  • The sub-band SNRs are first obtained by passing the test recording's features
    through the SNR predictor.
  • These are then given to the SAD network to get pseudo-likelihoods.
  • The likelihoods are converted to segments using two approaches (see the sketch
    after this list):
    • Ad-hoc smoothing: Frames with speech probability > 0.5 are labeled speech.
      Nearby speech segments separated by a small amount of silence are merged
      into a single speech segment. Speech segments shorter than 10 frames are
      removed. Speech segments longer than 10 s are split into overlapping
      segments with an overlap of 1 s.
    • Viterbi smoothing: The log-likelihoods are fed to an HMM decoder with a
      minimum duration constraint of 0.3 s on both speech and silence. As before,
      nearby speech segments separated by a small amount of silence are merged
      into a single speech segment, and speech longer than 10 s is split into
      overlapping segments with an overlap of 1 s.
  • The speech likelihoods can also be converted into posterior probabilities
    using an appropriate prior for various uses. For example, a speech prior of
    0.2 was used to identify speech frames for iVector extraction on the Aspire
    dataset.
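
A rough sketch of the ad-hoc smoothing and the prior-based posterior conversion (Python; the frame shift, the merge gap, and the function names are assumptions; the 0.5 threshold, 10-frame minimum, 10 s maximum, 1 s overlap, and 0.2 prior come from the description above):

```python
import numpy as np

FRAME_SHIFT = 0.01        # assumed seconds per frame
MIN_SPEECH_FRAMES = 10    # drop speech runs shorter than this
MAX_SEGMENT_SEC = 10.0    # split longer speech into overlapping pieces
OVERLAP_SEC = 1.0
MERGE_GAP_SEC = 0.3       # assumed "small amount of silence" to merge across

def probs_to_segments(speech_prob, threshold=0.5):
    """Ad-hoc smoothing: threshold per-frame speech probabilities, merge nearby
    segments, drop very short ones, and split overly long ones with overlap."""
    is_speech = speech_prob > threshold
    # Collect contiguous runs of speech frames as (start, end) frame indices.
    segs, start = [], None
    for t, s in enumerate(is_speech):
        if s and start is None:
            start = t
        elif not s and start is not None:
            segs.append((start, t)); start = None
    if start is not None:
        segs.append((start, len(is_speech)))
    # Merge segments separated by a short silence gap.
    merged = []
    for seg in segs:
        if merged and (seg[0] - merged[-1][1]) * FRAME_SHIFT <= MERGE_GAP_SEC:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    # Remove very short speech, then split long segments with 1 s overlap.
    out = []
    max_len = int(MAX_SEGMENT_SEC / FRAME_SHIFT)
    shift = int((MAX_SEGMENT_SEC - OVERLAP_SEC) / FRAME_SHIFT)
    for b, e in merged:
        if e - b < MIN_SPEECH_FRAMES:
            continue
        t = b
        while t < e:
            out.append((t * FRAME_SHIFT, min(t + max_len, e) * FRAME_SHIFT))
            if t + max_len >= e:
                break
            t += shift
    return out  # list of (start_sec, end_sec)

def likelihood_to_posterior(speech_score, nonspeech_score, speech_prior=0.2):
    """Convert per-frame scores into a speech posterior using a chosen prior,
    e.g. a speech prior of 0.2 as mentioned above for iVector extraction."""
    num = speech_score * speech_prior
    return num / (num + nonspeech_score * (1.0 - speech_prior))
```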

Results on Aspire

The segments generated automatically with the above method, together with the
iVector extraction, gave a WER (30.9%) matching the two-pass decoding used in the
Aspire challenge.

… several conversion scripts to work with rttm and vad added
@vimalmanohar
Contributor Author

A cleaner PR has been made at
vimalmanohar#2.
