Cleanup and segment long utterances using nnet3 #2581
|
Thanks. Is it related to, or can it be used for, cleaning corrupt training data? |
|
Yes, it would be great if you could test it.
On Fri, Jul 27, 2018 at 1:33 PM Ashish Arora ***@***.***> wrote:
Thanks, Is it related to or can it be used for cleaning corrupt training
data.
--
Vimal Manohar
PhD Student
Electrical & Computer Engineering
Johns Hopkins University
|
|
Ok, thank you. I will test it with Yomdle datasets. |
|
Did this perform better than the existing data-cleanup script based on GMMs?
I can see that, regardless, it could be useful in situations where GMMs do
not work, like OCR.
…On Fri, Jul 27, 2018 at 10:40 AM, Ashish Arora ***@***.***> wrote:
Ok, thank you. I will test it with Yomdle datasets.
|
# Copyright 2017 Vimal Manohar
# Apache 2.0
@vimalmanohar, this VAD and its use when extracting i-vectors... what scenario was this helpful for?
It might be useful for segmenting long utterances when the recordings contain a lot of silence. I have not tried running without this VAD, so that still needs to be tested.
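To make the idea concrete, here is a minimal sketch of an energy-based VAD whose per-frame decisions are used as weights when accumulating statistics. It is not the code from compute_vad_decision.sh or ivector-extract-online2.cc: the function names, the threshold values, and the use of a weighted mean as a stand-in for weighted i-vector statistics are all assumptions made for illustration.

```python
def energy_vad_weights(log_energies, threshold=5.0, mean_scale=0.5):
    """Return one weight per frame: 1.0 if the frame looks like speech, else 0.0.

    The cutoff is a fixed threshold plus a fraction of the mean log-energy.
    Both parameter values are illustrative assumptions, not tuned settings.
    """
    if not log_energies:
        return []
    mean = sum(log_energies) / len(log_energies)
    cutoff = threshold + mean_scale * mean
    return [1.0 if e > cutoff else 0.0 for e in log_energies]


def weighted_mean(frames, weights):
    """Weighted average of feature vectors (a stand-in for weighted stats).

    Frames with weight 0.0 (judged silence) do not contribute at all.
    Returns None when every frame was rejected.
    """
    total = sum(weights)
    if total == 0.0:
        return None
    dim = len(frames[0])
    return [
        sum(w * f[d] for f, w in zip(frames, weights)) / total
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Hypothetical per-frame log-energies: two loud frames among quiet ones.
    energies = [2.0, 3.0, 20.0, 22.0, 1.0]
    features = [[0.0], [0.0], [1.0], [3.0], [0.0]]
    weights = energy_vad_weights(energies)
    print(weights)
    print(weighted_mean(features, weights))
```

With weights like these, long stretches of silence stop dominating the accumulated statistics, which is the mismatch the energy VAD is meant to reduce.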
|
This only affects the i-vectors, right?
…On Mon, Aug 20, 2018 at 11:51 PM Vimal Manohar ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In egs/wsj/s5/steps/compute_vad_decision.sh
<#2581 (comment)>:
> @@ -0,0 +1,86 @@
+#!/bin/bash
+
+# Copyright 2017 Vimal Manohar
+# Apache 2.0
+
It might be useful for segmenting long utterances when there is a lot of
silence in the recordings. I have not tried using without this VAD. But it
needs to be tested.
|
|
Yes, only i-vectors.
On Tue, Aug 21, 2018 at 2:23 PM Daniel Povey wrote:
This only affects the i-vectors, right?
|
|
@david-ryan-snyder, what do you think about the VAD-related part of this? |
|
If I understand this PR correctly: a) a SAD system is used to segment long recordings, b) i-vectors are included in the input, and c) because of a concern about the effect of nonspeech on the i-vectors, a frame-level weighting of the i-vector stats is introduced. (@vimalmanohar, please correct me if I misunderstood what's happening here...)
In my view, SAD should be a lightweight system. If it works adequately, I'd rather see a simple SAD model that just uses MFCCs and nothing else. Adding a component (i-vectors) which in turn requires an energy SAD seems a bit heavyweight to me.
Since @danpovey asked my opinion on this, I'd be interested to know the following:
|
fi

echo "Created VAD output for $name"
Does this script do the same thing that https://github.com/kaldi-asr/kaldi/blob/master/egs/sre08/v1/sid/compute_vad_decision.sh does? Why not use that one?
|
I think there is a misunderstanding. This PR is for segmenting recordings
using the transcript, based on decoding with a chain model, which already
uses i-vectors (trained without silence by default). The recordings might
contain long silences or noise, and there is no SAD system for
pre-processing. I am using 30s uniform segmentation similar to the one in
Aspire. In Aspire, decoding was done in two stages: the first-stage
decoding provided the frame weights for i-vector extraction, and this was
very important for good performance. Here, I am just adding the energy VAD
to alleviate the mismatch problem.
However, it would be better to add a separate SAD system as a pre-processor
to segmenting based on transcripts. This is an extra top-level addition
that can be made both to the recipe for segmenting long utterances using
GMMs and to the one in this PR that uses a chain model. It would involve
modifying a run-level script so that it first creates segments using SAD
and then skips the uniform segmentation stage. I will add a run-level
script that does this.
This script might be the same as
https://github.com/kaldi-asr/kaldi/blob/master/egs/sre08/v1/sid/compute_vad_decision.sh.
I can move the script in sid to steps and create a soft link there.
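As an illustration of the 30s uniform segmentation step mentioned above, the following is a minimal sketch that cuts a recording into consecutive windows and prints lines in the usual Kaldi `segments` layout (segment ID, recording ID, start, end). The function name, the segment-ID format, and the absence of any overlap between windows are assumptions for this example; it is not the script used in the PR.

```python
def uniform_segments(rec_id, duration, window=30.0):
    """Split [0, duration) into consecutive windows of at most `window` seconds.

    Returns a list of (segment_id, recording_id, start, end) tuples.
    The last window is shorter when duration is not a multiple of `window`.
    """
    segments = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)
        # Hypothetical ID scheme: recording ID plus start/end in centiseconds.
        seg_id = f"{rec_id}-{round(start * 100):07d}-{round(end * 100):07d}"
        segments.append((seg_id, rec_id, start, end))
        start = end
    return segments


if __name__ == "__main__":
    # A hypothetical 65-second recording yields windows of 30s, 30s and 5s.
    for seg_id, rec, start, end in uniform_segments("rec1", 65.0):
        print(f"{seg_id} {rec} {start:.2f} {end:.2f}")
```

If a separate SAD system were added as a pre-processor, its output segments would replace the list produced here and this uniform step would be skipped.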
On Wed, Aug 22, 2018 at 1:25 PM David Snyder ***@***.***> wrote:
If I understand this PR correctly, a) a SAD system is used to segment long
recordings, b) I-vectors are included in the input and c) because of a
concern about the effect of nonspeech in the i-vectors, a frame-level
weighting of the i-vector stats is introduced.
In my view, SAD should be a lightweight system. If it works adequately,
I'd rather see a simple SAD model that just uses MFCCs and nothing
else. Adding a component (i-vectors) which in turn requires an energy
SAD seems a bit heavyweight to me.
Since @danpovey <https://github.com/danpovey> asked my opinion on this,
I'd be interested to know the following:
- Has anyone looked at the performance of the SAD system without
  i-vectors (or simple alternatives like pooling layer in the DNN)? I think
  it makes the most sense to measure the downstream performance on an
  application like ASR (rather than something like DER). If you've determined
  that there's not much difference, you can do away with the i-vector
  subsystem, and eliminate the need for so many code changes.
- If you've determined that the benefit from including i-vectors in the
  SAD system is large enough to outweigh the added complexity and code
  changes, have you performed any experiment to determine if the frame-level
  weights are necessary? If there's not much difference in performance, I
  would again prefer the simpler system. It also removes some lines of code
  from this PR.
|
|
@vimalmanohar, thanks for the explanation. Yes, I misunderstood what the PR is doing, but I think I see now. |
|
Vimal, please put comments at the top of any example scripts stating
clearly what the scenario is, so it's clear to others as well.
…On Wed, Aug 22, 2018 at 11:14 AM David Snyder ***@***.***> wrote:
@vimalmanohar <https://github.com/vimalmanohar>, thanks for the
explanation. Yes, I misunderstood what the PR is doing, but I think I see
now.
|
|
In that case, I think using the energy SAD for this purpose is reasonable, and the corresponding changes to ivector-extract-online2.cc are also reasonable. |
…asr#2581) This came from Vimal's work on the MGB-3 challenge. Interface is similar to the existing GMM-based cleanup/segmentation scripts.
Adding the scripts used during MGB-3 challenge. This could be tested by someone needing this for e2e training.