
@vimalmanohar
Contributor

Objective

Create a general-purpose Speech Activity Detector (SAD) that can be used for a wide range of tasks, primarily focusing on Automatic Speech Recognition (ASR) and iVector extraction. We would also like the system to be usable in an online setting.

SNR Predictor

  • A time-delay neural network (TDNN) is trained on artificially corrupted MFCC
    features to predict sub-band SNRs. The network is currently configured with
    6 hidden layers and a temporal context of about 200 ms.
  • A squared-error objective is used as the training criterion. The targets are
    the sub-band SNRs in the log domain.
  • Initially, we selected WSJ as the base speech corpus because it was recorded
    in clean and controlled conditions. However, we found that it is better to
    train the SNR predictor on a conversational speech corpus like Fisher, since
    most of the target applications come from that domain.
  • Room impulse responses, collected from various sources such as Reverb2014,
    RWCP and AIRD, are used to add reverberation to the clean speech waveform.
    As of now, we treat this reverberated speech signal as the ‘clean’ speech
    signal.
  • Many different kinds of noises, both foreground and background, were collected
    and randomly added to the reverberated speech signals to create corrupted
    speech signals. Some stationary noises were part of the RIR source corpora;
    many non-stationary noises were taken from the MUSAN corpus.
  • The noise energy was scaled randomly to create corrupted speech signals at
    various SNR levels ranging from 20 dB to -5 dB (see the sketch after this
    list). Additionally, we perturbed the volumes of the corrupted speech signals
    in order to make the SNR predictor invariant to the energy of the input
    speech.
  • Initially, the noise was added randomly without considering whether it was
    background or foreground noise. It is more sensible, however, to separate the
    two kinds of noise and create corrupted speech with foreground noise on top
    of background noise. Thus there are two SNR levels, one for background noise
    and one for foreground noise, that are randomly chosen to perturb the speech
    data. The background noise is added to the entire conversation at a fixed SNR
    level that is randomly chosen for each conversation. Several different
    foreground noises are selected randomly, along with a randomly chosen SNR
    level for each instance of foreground noise that is added.
  • This method was further improved to prepare the training data more carefully.
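
A minimal sketch of the corruption and target-generation steps above (Python/NumPy; the helper names, the edge cases, and the exact definition of the sub-band SNR targets are assumptions for illustration, not the actual Kaldi scripts):

```python
import numpy as np

def scale_noise_to_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise energy ratio equals snr_db."""
    speech_energy = np.sum(speech ** 2) + 1e-10
    noise_energy = np.sum(noise ** 2) + 1e-10
    # Required noise energy for a target SNR in dB: E_n = E_s / 10^(SNR/10)
    target_noise_energy = speech_energy / (10.0 ** (snr_db / 10.0))
    return noise * np.sqrt(target_noise_energy / noise_energy)

def log_subband_snr_targets(speech_fbank, noise_fbank, floor=1e-10):
    """Per-frame, per-band log-SNR regression targets computed from filterbank
    energies of the (reverberated) 'clean' speech and the added noise."""
    return np.log(np.maximum(speech_fbank, floor)) - np.log(np.maximum(noise_fbank, floor))

# Hypothetical usage: one background SNR chosen per conversation in [-5, 20] dB;
# foreground noise instances would each get their own randomly chosen SNR.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)      # stand-in for a reverberated waveform
bg_noise = rng.standard_normal(16000)    # stand-in for a background noise waveform
bg_snr_db = rng.uniform(-5.0, 20.0)
corrupted = speech + scale_noise_to_snr(speech, bg_noise, bg_snr_db)
```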

Speech Activity Detector

  • The Speech Activity Detector is a single-hidden-layer neural network that is
    trained on the sub-band SNRs predicted by the SNR predictor to classify each
    frame as speech / non-speech.
  • A ReLU neural network with 100 hidden units is trained on input features
    (sub-band SNRs) spliced over frames -6 to +2. The frame-level SNR, obtained by
    aggregating the sub-band SNRs, is added as an additional feature (see the
    sketch after this list).
  • The training data is again the same corrupted Fisher data. The VAD labels for
    training are obtained from a standard LDA+MLLT HMM-GMM system. Some
    post-processing was needed to account for the regions of the Fisher corpus
    that are not covered by the segments provided with the corpus. All of these
    steps are completely automatic (no manual processing).
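
A minimal sketch of the SAD input features and forward pass described above (Python/NumPy; it assumes the predicted sub-band SNRs arrive as a frames-by-bands matrix and that the frame-level SNR is a simple mean over bands, which may differ from the actual aggregation; the function names are hypothetical):

```python
import numpy as np

def splice_frames(feats, left=6, right=2):
    """Splice each frame with left/right context (here -6 to +2); edges are
    padded by repeating the first/last frame. `feats` is (frames, bands)."""
    padded = np.pad(feats, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(feats)] for i in range(left + right + 1)])

def sad_input(subband_snr):
    """Spliced sub-band SNRs plus an aggregated per-frame SNR as an extra feature."""
    frame_snr = subband_snr.mean(axis=1, keepdims=True)
    return np.hstack([splice_frames(subband_snr), frame_snr])

def sad_forward(x, w1, b1, w2, b2):
    """Single hidden layer of 100 ReLU units followed by a 2-class softmax
    giving per-frame non-speech / speech probabilities."""
    h = np.maximum(0.0, x @ w1 + b1)
    logits = h @ w2 + b2
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)
```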

Segmentation on test data

  • The sub-band SNRs are first obtained by passing the test recording's features
    through the SNR predictor.
  • These are then given to the SAD network to get pseudo-likelihoods.
  • The likelihoods are converted to segments using two approaches (see the sketch
    after this list):
    • Ad-hoc smoothing: Frames with speech probability > 0.5 are labeled speech.
      Nearby speech segments separated by a small amount of silence are merged
      into a single speech segment. Speech segments shorter than 10 frames are
      removed. Speech segments longer than 10 s are split into overlapping
      segments with an overlap of 1 s.
    • Viterbi smoothing: The log-likelihoods are fed to an HMM decoder with a
      minimum duration constraint of 0.3 s on both speech and silence. As before,
      nearby speech segments separated by a small amount of silence are merged
      into a single speech segment, and speech longer than 10 s is split into
      overlapping segments with an overlap of 1 s.
  • The speech likelihoods can also be converted into posterior probabilities
    using an appropriate prior for various uses. For example, a speech prior of
    0.2 was used to identify speech frames for iVector extraction on the Aspire
    dataset.
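
A rough sketch of the ad-hoc smoothing and the prior-based posterior conversion (Python; the frame shift, the merge gap, and the function names are assumptions; the 0.5 threshold, 10-frame minimum, 10 s maximum, 1 s overlap, and 0.2 prior come from the description above):

```python
import numpy as np

FRAME_SHIFT = 0.01        # assumed seconds per frame
MIN_SPEECH_FRAMES = 10    # drop speech runs shorter than this
MAX_SEGMENT_SEC = 10.0    # split longer speech into overlapping pieces
OVERLAP_SEC = 1.0
MERGE_GAP_SEC = 0.3       # assumed "small amount of silence" to merge across

def probs_to_segments(speech_prob, threshold=0.5):
    """Ad-hoc smoothing: threshold per-frame speech probabilities, merge nearby
    segments, drop very short ones, and split overly long ones with overlap."""
    is_speech = speech_prob > threshold
    # Collect contiguous runs of speech frames as (start, end) frame indices.
    segs, start = [], None
    for t, s in enumerate(is_speech):
        if s and start is None:
            start = t
        elif not s and start is not None:
            segs.append((start, t)); start = None
    if start is not None:
        segs.append((start, len(is_speech)))
    # Merge segments separated by a short silence gap.
    merged = []
    for seg in segs:
        if merged and (seg[0] - merged[-1][1]) * FRAME_SHIFT <= MERGE_GAP_SEC:
            merged[-1] = (merged[-1][0], seg[1])
        else:
            merged.append(seg)
    # Remove very short speech, then split long segments with 1 s overlap.
    out = []
    max_len = int(MAX_SEGMENT_SEC / FRAME_SHIFT)
    shift = int((MAX_SEGMENT_SEC - OVERLAP_SEC) / FRAME_SHIFT)
    for b, e in merged:
        if e - b < MIN_SPEECH_FRAMES:
            continue
        t = b
        while t < e:
            out.append((t * FRAME_SHIFT, min(t + max_len, e) * FRAME_SHIFT))
            if t + max_len >= e:
                break
            t += shift
    return out  # list of (start_sec, end_sec)

def likelihood_to_posterior(speech_score, nonspeech_score, speech_prior=0.2):
    """Convert per-frame scores into a speech posterior using a chosen prior,
    e.g. a speech prior of 0.2 as mentioned above for iVector extraction."""
    num = speech_score * speech_prior
    return num / (num + nonspeech_score * (1.0 - speech_prior))
```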

Results on Aspire

The segments generated automatically with the above method, together with the
iVector extraction, gave a WER (30.9%) matching the two-pass decoding used in the
Aspire challenge.

… several conversion scripts to work with rttm and vad added
@vimalmanohar
Contributor Author

A cleaner PR has been made at
vimalmanohar#2.
