This vad algorithm does not work well on Chinese data sets #449

Coconut059 · 2024-04-30T09:27:35Z

I have tried two Chinese speaker diarization datasets, but the results are not good; in particular, human speech is often removed as noise. Can the model be fine-tuned?

The code I used:

import torch
from pprint import pprint

SAMPLING_RATE = 16000  # assumed value; the original post did not show this definition
USE_ONNX = False  # change this to True if you want to test the ONNX model
if USE_ONNX:
    !pip install -q onnxruntime

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)

(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

# get speech timestamps from full audio file
wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
pprint(speech_timestamps)

# using VADIterator class
vad_iterator = VADIterator(model)
wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

window_size_samples = 1536  # number of samples in a single audio chunk
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break  # skip the incomplete final chunk
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()  # reset model states after each audio
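
As a quick sanity check for the missed-speech problem, the share of a recording that the VAD marks as speech can be computed from the timestamps. This is a minimal sketch that assumes `get_speech_timestamps` returns a list of dicts with `start` and `end` in samples (its default when `return_seconds` is not set); `speech_ratio` is a hypothetical helper, not part of silero-vad, and the example values are made up.

```python
def speech_ratio(timestamps, total_samples):
    """Return the fraction of the recording covered by detected speech.

    `timestamps` is assumed to be a list of {'start': int, 'end': int}
    dicts measured in samples, with non-overlapping segments.
    """
    if total_samples <= 0:
        return 0.0
    speech_samples = sum(seg['end'] - seg['start'] for seg in timestamps)
    return speech_samples / total_samples


# Illustrative values only, not taken from the recordings in this issue.
example = [{'start': 0, 'end': 16000}, {'start': 32000, 'end': 48000}]
print(speech_ratio(example, 64000))  # 0.5
```

With the code above this would be called as `speech_ratio(speech_timestamps, len(wav))`. A value far below the share of speech in the reference annotation points to missed speech rather than false alarms.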

Results on Alimeeting-Test:
MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
MS: 31.277793, FA: 2.150170, SER: 1.933873, DER: 35.361836
MS: 31.944428, FA: 0.511342, SER: 2.276318, DER: 34.732088
MS: 47.038586, FA: 0.163343, SER: 9.470302, DER: 56.672231
MS: 74.286394, FA: 0.007934, SER: 3.434961, DER: 77.729289
MS: 30.688677, FA: 0.704153, SER: 2.770183, DER: 34.163013
MS: 59.316559, FA: 0.324209, SER: 8.123554, DER: 67.764322
MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
MS: 99.417771, FA: 0.000000, SER: 0.058597, DER: 99.476368
MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412
MS: 99.493029, FA: 0.000000, SER: 0.120111, DER: 99.613140
MS: 61.856814, FA: 0.623673, SER: 0.184956, DER: 62.665443
MS: 19.090301, FA: 4.226608, SER: 3.039757, DER: 26.356666
MS: 33.685372, FA: 0.338829, SER: 0.267496, DER: 34.291696
MS: 15.374482, FA: 4.018866, SER: 0.518013, DER: 19.911360
MS: 42.467802, FA: 1.968425, SER: 0.268384, DER: 44.704612
MS: 17.370355, FA: 0.626849, SER: 0.326430, DER: 18.323634
MS: 67.082939, FA: 0.626243, SER: 0.180605, DER: 67.889787
MS: 72.216975, FA: 0.557994, SER: 0.130966, DER: 72.905935
MS: 14.936698, FA: 1.236910, SER: 0.225926, DER: 16.399534

Results on Aishell-4:
MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
MS: 67.227370, FA: 0.132288, SER: 1.020209, DER: 68.379866
MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
MS: 54.602609, FA: 0.152443, SER: 2.483539, DER: 57.238590
MS: 67.082935, FA: 0.078205, SER: 2.599719, DER: 69.760859
MS: 51.416720, FA: 0.204723, SER: 1.379586, DER: 53.001029
MS: 56.959476, FA: 0.203365, SER: 7.326404, DER: 64.489246
MS: 36.057926, FA: 0.157853, SER: 1.157691, DER: 37.373470
MS: 79.330646, FA: 0.097513, SER: 0.407194, DER: 79.835354
MS: 81.295235, FA: 0.062895, SER: 1.192822, DER: 82.550952
MS: 60.887943, FA: 0.599634, SER: 2.776542, DER: 64.264119
MS: 70.418660, FA: 0.084877, SER: 3.336644, DER: 73.840181
MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268
MS: 21.339103, FA: 0.351577, SER: 0.758447, DER: 22.449127
MS: 22.068026, FA: 0.588110, SER: 6.252810, DER: 28.908947
MS: 21.507885, FA: 0.162660, SER: 1.766586, DER: 23.437131
MS: 28.836928, FA: 0.203312, SER: 0.167732, DER: 29.207972
MS: 18.727860, FA: 0.238973, SER: 1.228832, DER: 20.195666
MS: 17.108661, FA: 0.269604, SER: 0.083678, DER: 17.461943
MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
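
In each row the three error components add up to DER. Assuming the usual diarization-scoring meaning of the abbreviations (MS = missed speech, FA = false alarm, SER = speaker error), a small sketch like the following parses rows in this format and checks the sum. `parse_row` is a hypothetical helper, and the three rows are copied from the results above.

```python
import re

# Matches "NAME: value" pairs such as "MS: 20.299598".
ROW_PATTERN = re.compile(r'(MS|FA|SER|DER):\s*([0-9.]+)')


def parse_row(line):
    """Parse 'MS: x, FA: y, SER: z, DER: w' into a dict of floats."""
    return {name: float(value) for name, value in ROW_PATTERN.findall(line)}


rows = [
    'MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403',
    'MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217',
    'MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268',
]

for row in map(parse_row, rows):
    total = row['MS'] + row['FA'] + row['SER']
    # Values are rounded to six decimals, so allow a small tolerance.
    assert abs(total - row['DER']) < 1e-5
    print(f"MS share of DER: {row['MS'] / row['DER']:.1%}")
```

In all three sample rows missed speech makes up most of the DER, which matches the observation that speech is being discarded.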

adamnsandle · 2024-05-03T09:36:02Z

Thanks for your comment!
We will add these datasets to our validation for more stable future models.

Coconut059 · 2024-05-03T14:03:58Z

Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?

yuGAN6 · 2024-06-03T01:14:14Z

> Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?

Tuning the parameters for your dataset is necessary. If your audio is quiet overall, try a lower threshold and a longer min_silence_samples; otherwise, try a higher threshold and a shorter min_silence_samples.
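
To illustrate why these two parameters matter, here is a simplified sketch of turning per-frame speech probabilities into segments. It is not silero-vad's actual post-processing: `probs_to_segments` and `min_silence_frames` are hypothetical names, and the probabilities are made up. In silero-vad itself, the corresponding arguments of `get_speech_timestamps` are, as far as I know, `threshold` and `min_silence_duration_ms`.

```python
def probs_to_segments(probs, threshold=0.5, min_silence_frames=2):
    """Group frames with prob >= threshold into (start, end) frame ranges.

    A segment is closed only after `min_silence_frames` consecutive
    frames fall below the threshold; `end` is exclusive.
    """
    segments = []
    start = None   # first frame of the currently open segment, if any
    silence = 0    # consecutive below-threshold frames inside the segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                # Close the segment at the first silent frame.
                segments.append((start, i - silence + 1))
                start = None
                silence = 0
    if start is not None:
        # Audio ended while a segment was open; drop trailing silence.
        segments.append((start, len(probs) - silence))
    return segments


# Made-up probabilities for a quiet speaker with a short pause.
probs = [0.1, 0.4, 0.45, 0.2, 0.42, 0.6, 0.1, 0.1, 0.1]

print(probs_to_segments(probs, threshold=0.5, min_silence_frames=1))   # [(5, 6)]
print(probs_to_segments(probs, threshold=0.35, min_silence_frames=2))  # [(1, 6)]
```

With the stricter settings only the single loud frame survives. The lower threshold plus a longer silence tolerance keeps the quiet frames and bridges the one-frame pause, which is the effect described above for quiet recordings.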

snakers4 · 2024-06-27T19:00:03Z

The new VAD version was released just now - #2 (comment).

It is now trained on more than 6,000 languages.

Can you please test it on your data again?

If the issue persists, please open a new issue referencing this one.

Many thanks!

Coconut059 added the "help wanted" label Apr 30, 2024

Coconut059 assigned snakers4 Apr 30, 2024

snakers4 closed this as completed Jun 27, 2024
