The processed file sounds correct after the VAD, but the model still seems to know where the big silence gap was and stops at that place. #312

Answered by ghost
ghost asked this question in Q&A

I'm not sure I understand. Could you post a minimal example?

The idea is as follows: apply VAD to the original file audio.wav and use the collect_chunks() function to get a processed file, only_speech.wav, with the non-speech fragments removed. Then pass this only_speech.wav file as the input to the wav2vec model.
The problem is that if the original audio file contains longer pauses (about 2 s), VAD removes them successfully and the result sounds correct, but the wav2vec model still seems to know where the pause was and stops at that place, as if identifying it as the end of the file.
I can't figure out how the model could identify this pause in the only_speech.wav file. It is as if there were some marker...
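To illustrate the pipeline described above: conceptually, collect_chunks() just concatenates the sample ranges that the VAD marked as speech, so no samples from the silent gaps (and no hidden "marker") can survive in the output. The sketch below is a simplified stand-in for that step, not the silero-vad implementation; the timestamp dict format ({"start": ..., "end": ...} in samples) follows the silero-vad utilities, and the toy waveform is purely illustrative.

```python
import numpy as np

def collect_chunks_sketch(timestamps, wav):
    # Keep only the regions marked as speech and concatenate them;
    # every sample from the silent gaps is discarded outright.
    return np.concatenate([wav[ts["start"]:ts["end"]] for ts in timestamps])

# Toy waveform: "speech" (ones), a long "silence" (zeros), "speech" again.
wav = np.concatenate([np.ones(100), np.zeros(200), np.ones(100)])
timestamps = [{"start": 0, "end": 100}, {"start": 300, "end": 400}]

out = collect_chunks_sketch(timestamps, wav)
# The output holds exactly the two speech regions and no zero samples,
# so any pause-like behavior downstream must come from the speech
# samples themselves (e.g. trailing low-energy audio inside the chunks),
# not from leftover silence.
```

If the model still "hears" the gap, one thing worth checking is whether the speech timestamps include padding around each chunk that retains part of the pause.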

This discussion was converted from issue #311 on March 21, 2023 03:05.