Hi! Can you tell me why the voice activity detection module performs so poorly? Does its performance depend heavily on the dataset?
Tuning the parameters for your dataset is necessary. If your audio is quiet overall, try a lower threshold and a longer min_silence_samples; otherwise try a higher threshold and a shorter min_silence_samples.
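To make the effect of these two knobs concrete, here is a minimal sketch of how a threshold and a minimum-silence length turn per-frame speech probabilities into segments. This is not silero-vad's actual implementation: the function name `probs_to_segments`, the `min_silence_frames` parameter, and the probability values are made up for illustration.

```python
def probs_to_segments(probs, threshold=0.5, min_silence_frames=3):
    """Group frames into (start, end) speech segments, end exclusive.

    A frame counts as speech if its probability is >= threshold.
    A segment is closed only after `min_silence_frames` consecutive
    non-speech frames, so short pauses do not split it.
    """
    segments = []
    start = None        # first frame of the currently open segment
    silence_run = 0     # consecutive non-speech frames inside the open segment
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
            silence_run = 0
        elif start is not None:
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Close the segment at the last speech frame.
                segments.append((start, i - silence_run + 1))
                start = None
                silence_run = 0
    if start is not None:
        # Audio ended while a segment was still open.
        segments.append((start, len(probs) - silence_run))
    return segments


# Hypothetical probabilities for quiet speech (values hover around 0.4-0.6).
probs = [0.1, 0.4, 0.6, 0.4, 0.6, 0.1, 0.1, 0.1, 0.4]

print(probs_to_segments(probs, threshold=0.5, min_silence_frames=3))
print(probs_to_segments(probs, threshold=0.3, min_silence_frames=3))
print(probs_to_segments(probs, threshold=0.5, min_silence_frames=1))
```

With the lower threshold, the quiet frames at the start and end are kept as speech; with the shorter silence requirement, the one-frame dip splits a single segment in two. On quiet recordings, a threshold that is too high shows up as missed speech.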
I tried the VAD on two Chinese speaker diarization datasets, but the results are poor; in particular, a lot of human speech is discarded as noise. Can the model be fine-tuned?
The code I used:
import torch
from pprint import pprint

SAMPLING_RATE = 16000  # assumed 16 kHz; this definition was missing from the snippet
USE_ONNX = False  # change this to True if you want to test the ONNX model
if USE_ONNX:
    !pip install -q onnxruntime

model, utils = torch.hub.load(repo_or_dir='snakers4/silero-vad',
                              model='silero_vad',
                              force_reload=True,
                              onnx=USE_ONNX)
(get_speech_timestamps,
 save_audio,
 read_audio,
 VADIterator,
 collect_chunks) = utils

wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)

# get speech timestamps from the full audio file
speech_timestamps = get_speech_timestamps(wav, model, sampling_rate=SAMPLING_RATE)
pprint(speech_timestamps)

# using the VADIterator class
vad_iterator = VADIterator(model)
wav = read_audio('S_R004S04C01.wav', sampling_rate=SAMPLING_RATE)
window_size_samples = 1536  # number of samples in a single audio chunk
for i in range(0, len(wav), window_size_samples):
    chunk = wav[i: i + window_size_samples]
    if len(chunk) < window_size_samples:
        break  # skip the incomplete final chunk
    speech_dict = vad_iterator(chunk, return_seconds=True)
    if speech_dict:
        print(speech_dict, end=' ')
vad_iterator.reset_states()  # reset model states after each audio
The results on AliMeeting-Test (MS: missed speech, FA: false alarm, SER: speaker error, DER: diarization error rate):
MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403
MS: 31.277793, FA: 2.150170, SER: 1.933873, DER: 35.361836
MS: 31.944428, FA: 0.511342, SER: 2.276318, DER: 34.732088
MS: 47.038586, FA: 0.163343, SER: 9.470302, DER: 56.672231
MS: 74.286394, FA: 0.007934, SER: 3.434961, DER: 77.729289
MS: 30.688677, FA: 0.704153, SER: 2.770183, DER: 34.163013
MS: 59.316559, FA: 0.324209, SER: 8.123554, DER: 67.764322
MS: 98.369565, FA: 0.000000, SER: 0.562652, DER: 98.932217
MS: 99.417771, FA: 0.000000, SER: 0.058597, DER: 99.476368
MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412
MS: 99.493029, FA: 0.000000, SER: 0.120111, DER: 99.613140
MS: 61.856814, FA: 0.623673, SER: 0.184956, DER: 62.665443
MS: 19.090301, FA: 4.226608, SER: 3.039757, DER: 26.356666
MS: 33.685372, FA: 0.338829, SER: 0.267496, DER: 34.291696
MS: 15.374482, FA: 4.018866, SER: 0.518013, DER: 19.911360
MS: 42.467802, FA: 1.968425, SER: 0.268384, DER: 44.704612
MS: 17.370355, FA: 0.626849, SER: 0.326430, DER: 18.323634
MS: 67.082939, FA: 0.626243, SER: 0.180605, DER: 67.889787
MS: 72.216975, FA: 0.557994, SER: 0.130966, DER: 72.905935
MS: 14.936698, FA: 1.236910, SER: 0.225926, DER: 16.399534
The results on AISHELL-4:
MS: 79.665430, FA: 0.012366, SER: 5.601830, DER: 85.279626
MS: 67.227370, FA: 0.132288, SER: 1.020209, DER: 68.379866
MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934
MS: 54.602609, FA: 0.152443, SER: 2.483539, DER: 57.238590
MS: 67.082935, FA: 0.078205, SER: 2.599719, DER: 69.760859
MS: 51.416720, FA: 0.204723, SER: 1.379586, DER: 53.001029
MS: 56.959476, FA: 0.203365, SER: 7.326404, DER: 64.489246
MS: 36.057926, FA: 0.157853, SER: 1.157691, DER: 37.373470
MS: 79.330646, FA: 0.097513, SER: 0.407194, DER: 79.835354
MS: 81.295235, FA: 0.062895, SER: 1.192822, DER: 82.550952
MS: 60.887943, FA: 0.599634, SER: 2.776542, DER: 64.264119
MS: 70.418660, FA: 0.084877, SER: 3.336644, DER: 73.840181
MS: 11.451400, FA: 0.658543, SER: 3.846325, DER: 15.956268
MS: 21.339103, FA: 0.351577, SER: 0.758447, DER: 22.449127
MS: 22.068026, FA: 0.588110, SER: 6.252810, DER: 28.908947
MS: 21.507885, FA: 0.162660, SER: 1.766586, DER: 23.437131
MS: 28.836928, FA: 0.203312, SER: 0.167732, DER: 29.207972
MS: 18.727860, FA: 0.238973, SER: 1.228832, DER: 20.195666
MS: 17.108661, FA: 0.269604, SER: 0.083678, DER: 17.461943
MS: 13.953794, FA: 0.308104, SER: 1.880523, DER: 16.142421
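In the numbers above, DER appears to be the sum of MS, FA, and SER, and MS accounts for most of it in nearly every line. A minimal sketch that checks this on three of the reported lines, assuming MS, FA, and SER are missed speech, false alarm, and speaker error; the helper names `parse_result` and `der_from_components` are hypothetical:

```python
import re


def parse_result(line):
    """Parse 'MS: x, FA: y, SER: z, DER: w' into a dict of floats."""
    return {key: float(val) for key, val in re.findall(r"(\w+):\s*([\d.]+)", line)}


def der_from_components(result):
    """Recompute DER as missed speech + false alarm + speaker error."""
    return result["MS"] + result["FA"] + result["SER"]


# Three lines copied from the results reported above.
lines = [
    "MS: 20.299598, FA: 1.372215, SER: 1.088590, DER: 22.760403",
    "MS: 99.910412, FA: 0.000000, SER: 0.000000, DER: 99.910412",
    "MS: 61.530820, FA: 18.205761, SER: 5.297353, DER: 85.033934",
]

for line in lines:
    result = parse_result(line)
    # The reported DER matches the sum of its components.
    assert abs(der_from_components(result) - result["DER"]) < 1e-4
    share = result["MS"] / result["DER"]
    print(f"DER {result['DER']:.2f}: missed speech is {share:.0%} of the error")
```

Since missed speech dominates, the error comes mostly from speech the VAD rejects rather than from speaker confusion, which is consistent with the suggestion to lower the threshold.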