This vad algorithm does not work well on Chinese data sets #449

Closed
Coconut059 opened this issue Apr 30, 2024 · 4 comments

@Coconut059

I tried two Chinese speaker diarization datasets, but the results are poor. In particular, human speech is often discarded as noise. Can the model be fine-tuned?

The code I used:

    import torch
    from pprint import pprint

    SAMPLING_RATE = 16000  # Hz; not defined in the original snippet, 16 kHz assumed

    USE_ONNX = False  # change this to True if you want to test onnx model
    if USE_ONNX:
        !pip install -q onnxruntime  # notebook (IPython) shell command

    model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                                  model='silero_vad',
                                  force_reload=True,
                                  onnx=USE_ONNX)

    (get_speech_timestamps,
     save_audio,
     read_audio,
     VADIterator,
     collect_chunks) = utils
    wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

    # get speech timestamps from full audio file
    speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
    pprint(speech_timestamps)

    # using VADIterator class
    vad_iterator = VADIterator(model)
    wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

    window_size_samples = 1536  # number of samples in a single audio chunk
    for i in range(0, len(wav), window_size_samples):
        chunk = wav[i: i + window_size_samples]
        if len(chunk) < window_size_samples:
            break
        speech_dict = vad_iterator(chunk, return_seconds=True)
        if speech_dict:
            print(speech_dict, end=' ')
    vad_iterator.reset_states()  # reset model states after each audio

Results on the AliMeeting test set (MS = missed speech, FA = false alarm, SER = speaker error, DER = diarization error rate):
MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
MS: 31.277793, FA: 2.150170, SER: 1.933873, DER: 35.361836
MS: 31.944428, FA: 0.511342, SER: 2.276318, DER: 34.732088
MS: 47.038586, FA: 0.163343, SER: 9.470302, DER: 56.672231
MS: 74.286394, FA: 0.007934, SER: 3.434961, DER: 77.729289
MS: 30.688677, FA: 0.704153, SER: 2.770183, DER: 34.163013
MS: 59.316559, FA: 0.324209, SER: 8.123554, DER: 67.764322
MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
MS: 99.417771, FA: 0.000000, SER: 0.058597, DER: 99.476368
MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412
MS: 99.493029, FA: 0.000000, SER: 0.120111, DER: 99.613140
MS: 61.856814, FA: 0.623673, SER: 0.184956, DER: 62.665443
MS: 19.090301, FA: 4.226608, SER: 3.039757, DER: 26.356666
MS: 33.685372, FA: 0.338829, SER: 0.267496, DER: 34.291696
MS: 15.374482, FA: 4.018866, SER: 0.518013, DER: 19.911360
MS: 42.467802, FA: 1.968425, SER: 0.268384, DER: 44.704612
MS: 17.370355, FA: 0.626849, SER: 0.326430, DER: 18.323634
MS: 67.082939, FA: 0.626243, SER: 0.180605, DER: 67.889787
MS: 72.216975, FA: 0.557994, SER: 0.130966, DER: 72.905935
MS: 14.936698, FA: 1.236910, SER: 0.225926, DER: 16.399534

Results on AISHELL-4:
MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
MS: 67.227370, FA: 0.132288, SER: 1.020209, DER: 68.379866
MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
MS: 54.602609, FA: 0.152443, SER: 2.483539, DER: 57.238590
MS: 67.082935, FA: 0.078205, SER: 2.599719, DER: 69.760859
MS: 51.416720, FA: 0.204723, SER: 1.379586, DER: 53.001029
MS: 56.959476, FA: 0.203365, SER: 7.326404, DER: 64.489246
MS: 36.057926, FA: 0.157853, SER: 1.157691, DER: 37.373470
MS: 79.330646, FA: 0.097513, SER: 0.407194, DER: 79.835354
MS: 81.295235, FA: 0.062895, SER: 1.192822, DER: 82.550952
MS: 60.887943, FA: 0.599634, SER: 2.776542, DER: 64.264119
MS: 70.418660, FA: 0.084877, SER: 3.336644, DER: 73.840181
MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268
MS: 21.339103, FA: 0.351577, SER: 0.758447, DER: 22.449127
MS: 22.068026, FA: 0.588110, SER: 6.252810, DER: 28.908947
MS: 21.507885, FA: 0.162660, SER: 1.766586, DER: 23.437131
MS: 28.836928, FA: 0.203312, SER: 0.167732, DER: 29.207972
MS: 18.727860, FA: 0.238973, SER: 1.228832, DER: 20.195666
MS: 17.108661, FA: 0.269604, SER: 0.083678, DER: 17.461943
MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
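
A quick way to read these numbers: in each row, DER is the sum of MS, FA and SER, so a high DER paired with a high MS means the error comes almost entirely from speech that the VAD dropped. The sketch below parses lines in the format shown above and checks that relationship on three rows copied from the AISHELL-4 list. The helper name `parse_results` is hypothetical and only for illustration.

```python
import re

# Matches one result line in the format used in this issue.
LINE_RE = re.compile(r"MS: ([\d.]+), FA: ([\d.]+), SER: ([\d.]+), DER: ([\d.]+)")


def parse_results(text):
    """Return a list of dicts with MS, FA, SER and DER as floats."""
    rows = []
    for match in LINE_RE.finditer(text):
        ms, fa, ser, der = map(float, match.groups())
        rows.append({"MS": ms, "FA": fa, "SER": ser, "DER": der})
    return rows


# Three rows copied verbatim from the AISHELL-4 results above.
sample = """
MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
"""

rows = parse_results(sample)
for row in rows:
    # DER should equal the sum of its three components (up to rounding).
    assert abs(row["MS"] + row["FA"] + row["SER"] - row["DER"]) < 1e-4
    share = row["MS"] / row["DER"]
    print(f"DER {row['DER']:6.2f}  missed-speech share of DER: {share:.1%}")
```

Pasting the full lists into `sample` gives the same breakdown for every recording.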

Coconut059 added the "help wanted" (Extra attention is needed) label on Apr 30, 2024
@adamnsandle
Collaborator

Thanks for your comment!
We will add these datasets to our validation for more stable future models.

@Coconut059
Author

Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?

@yuGAN6
Contributor

yuGAN6 commented Jun 3, 2024

> Hi! Can you tell me what is the reason why the voice activity detection module is so poor? Do the effects of this module depend heavily on the data set?

Tuning the parameters to your dataset is necessary. If your audio is quiet overall, try a lower threshold and a longer min_silence_samples; otherwise, try a higher threshold and a shorter min_silence_samples.
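
To illustrate why those two knobs matter, here is a minimal, self-contained sketch of threshold-plus-minimum-silence segmentation over per-frame speech probabilities. It is a simplified stand-in, not silero-vad's actual implementation; the function `segments_from_probs`, its parameter `min_silence_frames`, and the probability values are all made up for the example.

```python
from typing import List, Tuple


def segments_from_probs(
    probs: List[float],
    threshold: float = 0.5,
    min_silence_frames: int = 3,
) -> List[Tuple[int, int]]:
    """Turn per-frame speech probabilities into (start, end) frame ranges.

    A segment opens at the first frame whose probability reaches `threshold`
    and closes once `min_silence_frames` consecutive frames fall below it.
    `end` is exclusive. Hypothetical helper, for illustration only.
    """
    segments = []
    start = None   # frame index where the current segment began
    silence = 0    # consecutive below-threshold frames inside a segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the segment at the first frame of the silent run.
                segments.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, len(probs) - silence))
    return segments


# Made-up probabilities for quiet speech that never reaches 0.5.
quiet = [0.1, 0.4, 0.45, 0.1, 0.1, 0.42, 0.4, 0.1, 0.1, 0.1]

default_segments = segments_from_probs(quiet, threshold=0.5)
lowered = segments_from_probs(quiet, threshold=0.3, min_silence_frames=2)
lowered_long_silence = segments_from_probs(quiet, threshold=0.3, min_silence_frames=3)

print(default_segments)       # all speech is missed
print(lowered)                # two short segments
print(lowered_long_silence)   # the short pause is bridged into one segment
```

In silero-vad itself, `get_speech_timestamps` takes `threshold` and, as far as I know, `min_silence_duration_ms` as keyword arguments; check the signature of the version you have installed before relying on those names.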

@snakers4
Owner

The new VAD version was released just now - #2 (comment).

It is now trained on more than 6,000 languages.

Can you please test it on your data again?

If the issue persists, please open a new issue referencing this one.

Many thanks!
