WIP Speech Activity Detection using SNR Prediction #353
A cleaner PR has been made at |
Objective
Create a general-purpose Speech Activity Detector that can be used for many different tasks, primarily focusing on Automatic Speech Recognition (ASR) and iVector extraction. We would like to create a system that can also be used in an online setting.
SNR Predictor
A neural network is trained on acoustic features to predict sub-band SNRs. The network is currently set to have 6 hidden layers with a context of about 200ms. The prediction targets are the sub-band SNRs in the log domain.
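As a minimal sketch, assuming the targets are computed from parallel clean and noise signals with a simple FFT framing and an equal-width band grouping (both assumptions, not this PR's exact recipe), the log-domain sub-band SNR targets could look like:

```python
import numpy as np

def subband_log_snr(clean, noise, n_fft=512, hop=160, n_bands=8, eps=1e-10):
    """Hypothetical target computation: log-domain sub-band SNRs from
    parallel clean and noise waveforms of equal length."""
    def band_energies(x):
        n_frames = 1 + (len(x) - n_fft) // hop
        frames = np.stack([x[i * hop:i * hop + n_fft] for i in range(n_frames)])
        spec = np.abs(np.fft.rfft(frames * np.hanning(n_fft), axis=1)) ** 2
        # Group FFT bins into equal-width sub-bands (an assumption; a mel
        # filterbank would be an equally plausible choice).
        return np.stack([b.sum(axis=1)
                         for b in np.array_split(spec, n_bands, axis=1)], axis=1)
    s, n = band_energies(clean), band_energies(noise)
    return np.log(s + eps) - np.log(n + eps)  # shape: (frames, n_bands)
```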
Ideally, the clean speech used for training would be recorded in clean and controlled conditions. However, we found that it is better to train the SNR predictor on a conversational speech corpus like Fisher, since most of the applications would be from that domain.
Room impulse responses (RIRs) are used to add reverberation to the clean speech waveform. As of now, we use this reverberated speech signal as the ‘clean’ speech signal. Noises are then randomly added to the reverberated speech signals to create corrupted speech signals. Some stationary noises were part of the RIR source corpora; many non-stationary noises were added from the MUSAN corpus.
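For illustration only, reverberating a waveform with an RIR is a convolution; a sketch (the actual corruption scripts are part of the recipes in this PR):

```python
import numpy as np
from scipy.signal import fftconvolve

def reverberate(speech, rir):
    # Convolve the clean waveform with the room impulse response and
    # rescale back to the original peak level.
    rev = fftconvolve(speech, rir)[:len(speech)]
    return rev * (np.max(np.abs(speech)) / (np.max(np.abs(rev)) + 1e-10))
```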
The noises are added at various SNR levels ranging from 20dB to -5dB. Additionally, we perturbed the volumes of the corrupted speech signals in order to make the SNR predictor invariant to the energy of the input speech.
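A sketch of adding a noise at a target SNR and then applying a random volume perturbation; the gain range below is an assumption:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng=None):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`,
    then apply a random gain (assumes `noise` is at least as long as
    `speech`)."""
    rng = rng or np.random.default_rng()
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise[:len(speech)] ** 2) + 1e-10
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    corrupted = speech + scale * noise[:len(speech)]
    # Random gain so the predictor cannot rely on absolute energy.
    return corrupted * rng.uniform(0.125, 2.0)
```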
Each noise can be treated as either a background noise or a foreground noise. It is more sensible to separate the two kinds of noises and create corrupted speech with foreground noise on top of background noise. Thus there are two SNR levels, one for background noise and one for foreground noise, that are randomly chosen to perturb the speech data. The background noise is added to the entire conversation at a fixed SNR level that is randomly chosen for each conversation. Several different foreground noises are selected randomly, along with a randomly chosen SNR level for each instance of foreground noise that is added.
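Putting the two noise levels together, a hypothetical corruption routine might look like the following; the SNR ranges and the foreground placement logic are assumptions:

```python
import numpy as np

def corrupt_conversation(speech, bg_noise, fg_noises, rng=None):
    """Two-level corruption: one background noise over the whole
    conversation at a fixed random SNR, plus several foreground noises,
    each at an independently drawn SNR and position. Assumes `bg_noise`
    is at least as long as `speech` and each foreground noise is shorter."""
    rng = rng or np.random.default_rng()

    def scaled(noise, target, snr_db):
        p_t, p_n = np.mean(target ** 2), np.mean(noise ** 2) + 1e-10
        return noise * np.sqrt(p_t / (p_n * 10 ** (snr_db / 10.0)))

    out = speech + scaled(bg_noise[:len(speech)], speech, rng.uniform(5, 20))
    for fg in fg_noises:
        start = rng.integers(0, len(speech) - len(fg))
        seg = out[start:start + len(fg)]  # a view; += edits `out` in place
        seg += scaled(fg, seg, rng.uniform(-5, 20))
    return out
```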
Speech Activity Detector
A classifier network is trained on the sub-band SNRs predicted by the SNR predictor to classify each frame as speech / non-speech. The input features are the predicted sub-band SNRs spliced over frames -6 to +2. The frame SNR, which is obtained by aggregating the sub-band SNRs, is added as an additional feature.
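A sketch of the input feature assembly, assuming the frame SNR is a simple mean over sub-bands (the exact aggregation may differ):

```python
import numpy as np

def splice_features(subband_snrs, left=6, right=2):
    """Splice per-frame sub-band SNRs over frames -6..+2 and append an
    aggregated frame-level SNR as an extra dimension."""
    T, D = subband_snrs.shape
    padded = np.pad(subband_snrs, ((left, right), (0, 0)), mode="edge")
    spliced = np.concatenate(
        [padded[t:t + T] for t in range(left + right + 1)], axis=1)
    frame_snr = subband_snrs.mean(axis=1, keepdims=True)  # assumed aggregation
    return np.concatenate([spliced, frame_snr], axis=1)   # (T, 9*D + 1)
```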
The speech / non-speech targets for training are obtained from a standard LDA+MLLT HMM-GMM system. Some post-processing was needed to account for the regions of the Fisher corpus that are not in the segments provided with the corpus. All these steps are completely automatic (no manual processing).
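Conceptually, deriving the targets amounts to mapping each aligned frame's phone to speech or non-speech; a toy sketch with a hypothetical silence-phone set:

```python
def align_to_vad_targets(frame_phones, silence_phones=frozenset({1, 2, 3})):
    """Map per-frame phone IDs from an HMM-GMM alignment to speech (1) /
    non-speech (0) targets. The silence-phone IDs here are hypothetical;
    in a real setup they come from the lexicon (e.g. SIL, NSN, SPN)."""
    return [0 if p in silence_phones else 1 for p in frame_phones]
```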
Segmentation on test data
The frame-level speech / non-speech decisions are post-processed as follows: nearby segments of speech separated by a small amount of silence are merged into a single speech segment; speech shorter than 10 frames is removed; and speech longer than 10s is split into overlapping segments with an overlap of 1s.
A second variant of the segmentation enforces a minimum duration constraint of 0.3s on both speech and silence. Further, nearby segments of speech separated by a small amount of silence are merged into a single speech segment, and speech longer than 10s is split into overlapping segments with an overlap of 1s.
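The post-processing described above could be sketched as follows; the silence-gap threshold used for merging is an assumed value, while the 10-frame, 10s and 1s values come from the text:

```python
def postprocess(labels, frame_shift=0.01, max_sil_gap=0.5,
                min_frames=10, max_len=10.0, overlap=1.0):
    """Turn per-frame 0/1 speech labels (a list) into (start, end) segments
    in seconds: merge segments separated by short silence, drop very short
    speech, and split long segments with overlap."""
    # Collect raw frame ranges of consecutive speech labels.
    segs, start = [], None
    for i, lab in enumerate(labels + [0]):
        if lab and start is None:
            start = i
        elif not lab and start is not None:
            segs.append([start, i])
            start = None
    # Merge segments separated by a small amount of silence.
    merged = []
    for s in segs:
        if merged and (s[0] - merged[-1][1]) * frame_shift < max_sil_gap:
            merged[-1][1] = s[1]
        else:
            merged.append(s)
    # Remove very short speech, then split long segments with overlap.
    out = []
    step, length = int((max_len - overlap) / frame_shift), int(max_len / frame_shift)
    for s, e in merged:
        if e - s < min_frames:
            continue
        while e - s > length:
            out.append((s * frame_shift, (s + length) * frame_shift))
            s += step
        out.append((s * frame_shift, e * frame_shift))
    return out
```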
The speech / non-speech decisions can be tuned using an appropriate prior for various uses. For example, a speech prior of 0.2 was used to identify speech frames for iVector extraction on the Aspire dataset.
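One plausible way to apply such a prior is to divide out the prior implied by the training targets and multiply in the task-specific one; a sketch (the exact mechanism in this PR may differ):

```python
import numpy as np

def apply_speech_prior(posteriors, train_prior=0.5, new_prior=0.2):
    """Re-prior the network's per-frame speech posteriors. `train_prior`
    is a hypothetical prior implied by the training targets."""
    speech = (posteriors / train_prior) * new_prior
    nonspeech = ((1.0 - posteriors) / (1.0 - train_prior)) * (1.0 - new_prior)
    return speech / (speech + nonspeech)
```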
Results on Aspire
The segments generated automatically using the above method, along with the iVector extraction, gave a WER (30.9%) matching the two-pass decoding used in the Aspire challenge.