MSDD Model `call` method throws error #7975

benhoff · 2023-12-05T18:38:48Z

Describe the bug

If you use the __call__ method from NeuralDiarizer instead of the diarize method, you get the below stack trace.

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:1514, in NeuralDiarizer.__call__(self, audio_filepath, batch_size, num_workers, max_speakers, num_speakers, out_dir, verbose)
   1503     self._initialize_configs(
   1504         manifest_path=manifest_path,
   1505         max_speakers=max_speakers,
   (...)
   1510         verbose=verbose,
   1511     )
   1513     self.msdd_model.cfg.test_ds.manifest_filepath = manifest_path
-> 1514     self.diarize()
   1516     pred_labels_clus = rttm_to_labels(f'{tmpdir}/pred_rttms/{Path(audio_filepath).stem}.rttm')
   1517 return labels_to_pyannote_object(pred_labels_clus)

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:1180, in NeuralDiarizer.diarize(self)
   1173 @torch.no_grad()
   1174 def diarize(self) -> Optional[List[Optional[List[Tuple[DiarizationErrorRate, Dict]]]]]:
   1175     """
   1176     Launch diarization pipeline which starts from VAD (or a oracle VAD stamp generation), initialization clustering and multiscale diarization decoder (MSDD).
   1177     Note that the result of MSDD can include multiple speakers at the same time. Therefore, RTTM output of MSDD needs to be based on `make_rttm_with_overlap()`
   1178     function that can generate overlapping timestamps. `self.run_overlap_aware_eval()` function performs DER evaluation.
   1179     """
-> 1180     self.clustering_embedding.prepare_cluster_embs_infer()
   1181     self.msdd_model.pairwise_infer = True
   1182     self.get_emb_clus_infer(self.clustering_embedding)

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:699, in ClusterEmbedding.prepare_cluster_embs_infer(self)
    695 """
    696 Launch clustering diarizer to prepare embedding vectors and clustering results.
    697 """
    698 self.max_num_speakers = self.cfg_diar_infer.diarizer.clustering.parameters.max_num_speakers
--> 699 self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
    700     self._cfg_msdd.test_ds.manifest_filepath, self._cfg_msdd.test_ds.emb_dir
    701 )

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:879, in ClusterEmbedding.run_clustering_diarizer(self, manifest_filepath, emb_dir)
    875 self._embs_and_timestamps = get_embs_and_timestamps(
    876     self.clus_diar_model.multiscale_embeddings_and_timestamps, self.clus_diar_model.multiscale_args_dict
    877 )
    878 session_scale_mapping_dict = self.get_scale_map(self._embs_and_timestamps)
--> 879 emb_scale_seq_dict = self.load_emb_scale_seq_dict(emb_dir)
    880 clus_labels = self.load_clustering_labels(emb_dir)
    881 emb_sess_avg_dict, base_clus_label_dict = self.get_cluster_avg_embs(
    882     emb_scale_seq_dict, clus_labels, speaker_mapping_dict, session_scale_mapping_dict
    883 )

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:961, in ClusterEmbedding.load_emb_scale_seq_dict(self, out_dir)
    957 pickle_path = os.path.join(
    958     out_dir, 'speaker_outputs', 'embeddings', f'subsegments_scale{scale_index}_embeddings.pkl'
    959 )
    960 logging.info(f"Loading embedding pickle file of scale:{scale_index} at {pickle_path}")
--> 961 with open(pickle_path, "rb") as input_file:
    962     emb_dict = pkl.load(input_file)
    963 for key, val in emb_dict.items():

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpil5i2a_2/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl'

Steps/Code to reproduce bug

Open the Speaker_Diarization_Inference.ipynb notebook and run every cell, then change the final call to system_vad_msdd_model('data/an4_diarize_test.wav'). The error above is thrown.

tango4j · 2023-12-06T05:47:16Z

Hi, @benhoff.
The __call__() function is meant to be used on a model loaded with
diarize = NeuralDiarizer.from_pretrained(model_name='diar_msdd_telephonic')
so that diarization runs in its simplest form, e.g., diarize('audio.wav').

This call function is designed for the Hugging Face API, to minimize the preparation steps.
PR #5945 explains the use case.
If you have to load the model with a cfg, as in the tutorial, please use the .diarize() function instead.

We will add error handling for this unsupported call.
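
For readers following along, the two lines above can be wrapped as in the sketch below. The import path and the from_pretrained arguments are taken from the traceback and from this comment; the helper names load_pretrained_diarizer and diarize_file are hypothetical, and the NeMo import is deferred so the wrapper can be exercised without NeMo installed.

```python
from typing import Any, Callable, Optional


def load_pretrained_diarizer(model_name: str = "diar_msdd_telephonic") -> Any:
    """Load NeuralDiarizer via from_pretrained, as described in this thread.

    Requires NeMo. The import is done lazily so that importing this module
    does not fail on machines without NeMo.
    """
    from nemo.collections.asr.models.msdd_models import NeuralDiarizer

    return NeuralDiarizer.from_pretrained(model_name=model_name)


def diarize_file(audio_path: str, diarizer: Optional[Callable[[str], Any]] = None) -> Any:
    """Run the one-call path: diarizer(audio_path).

    `diarizer` is any callable that takes an audio path. When omitted, the
    pretrained model is loaded (hypothetical convenience, not NeMo API).
    """
    if diarizer is None:
        diarizer = load_pretrained_diarizer()
    return diarizer(audio_path)
```

On a server, loading the model once and passing it in as diarizer avoids reloading it for every request.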

benhoff · 2023-12-06T14:54:16Z

Hey @tango4j! I've been following your work recently as I chunk through some of this stuff, and I'm looking forward to seeing the online diarizer PR land, as it's closer to my use case.

I'm looking to diarize short chunks of audio (<30 seconds) on a server. I've wrapped this code base with an API, and I'm rewriting/updating the manifest.json repeatedly to use the diarize method.

Is that the recommended way to do that?

Or would it be better to use the from_pretrained and then the __call__ method?
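
A rough sketch of the manifest-per-request approach described above, assuming the single-line JSON manifest layout used in the diarization tutorial (the field names below are recalled from that notebook and should be checked against your NeMo version). The helper name write_single_file_manifest is hypothetical. Writing each manifest to its own temporary file avoids concurrent requests overwriting one shared manifest.json.

```python
import json
import os
import tempfile
from typing import Optional


def write_single_file_manifest(
    audio_path: str, manifest_dir: str, num_speakers: Optional[int] = None
) -> str:
    """Write a one-entry manifest for `audio_path` and return its path."""
    # Assumed manifest fields; verify against the tutorial for your release.
    entry = {
        "audio_filepath": os.path.abspath(audio_path),
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "num_speakers": num_speakers,
        "rttm_filepath": None,
        "uem_filepath": None,
    }
    # mkstemp gives every request a unique file name inside manifest_dir.
    fd, manifest_path = tempfile.mkstemp(
        suffix=".json", prefix="manifest_", dir=manifest_dir
    )
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(entry) + "\n")
    return manifest_path
```

The returned path would then be assigned to the config's manifest_filepath before calling diarize().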

tango4j · 2023-12-06T18:07:14Z

@benhoff
We are preparing the online diarization PR #7896.
I apologize that I have not had the bandwidth to finalize it.

An online diarization system is more sophisticated: it needs a history buffer mechanism to memorize the past speaker profiles.

We are not designing the online diarizer around such a long (30 sec) buffer; it will take frame inputs of 1-2 seconds. However, you will be able to tweak the system to serve your purpose.

The Part-2 PR will include a tutorial, an example, and a yaml file.

For the time being, I think running offline diarization on all of the accumulated audio is the only way. You might want to match the speakers across the multiple sequential outputs.
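
One way the speaker matching could look, as a sketch only: it assumes you can obtain one averaged embedding vector per speaker for each chunk (how to extract these is not covered in this thread), and it uses a greedy cosine-similarity match rather than anything NeMo provides. The function names, the "global_N" labels, and the 0.7 threshold are all hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def match_speakers(
    known: Dict[str, Sequence[float]],
    new: Dict[str, Sequence[float]],
    threshold: float = 0.7,
) -> Dict[str, str]:
    """Map each chunk-local label in `new` to a global label.

    Pairs are taken greedily from most to least similar, one-to-one.
    Local speakers with no match above `threshold` get a fresh label.
    """
    pairs = sorted(
        (
            (cosine(new_emb, known_emb), new_label, known_label)
            for new_label, new_emb in new.items()
            for known_label, known_emb in known.items()
        ),
        reverse=True,
    )
    mapping: Dict[str, str] = {}
    used = set()
    for similarity, new_label, known_label in pairs:
        if similarity < threshold:
            break
        if new_label in mapping or known_label in used:
            continue
        mapping[new_label] = known_label
        used.add(known_label)

    next_id = len(known)
    for new_label in new:
        if new_label not in mapping:
            # Skip any label that is already taken.
            while f"global_{next_id}" in known or f"global_{next_id}" in mapping.values():
                next_id += 1
            mapping[new_label] = f"global_{next_id}"
    return mapping
```

Greedy matching is the simplest choice here; an optimal assignment (for example the Hungarian algorithm) would be a reasonable replacement when there are many speakers.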

benhoff · 2023-12-07T15:31:46Z

@tango4j, no worries on the PR. It's good to know that the use case is targeting 1-2 seconds; I won't wait for it to land and will take the offline approach you suggested instead.

I was thinking about implementing the historical buffer myself, but I was surprised that most of the approaches don't prune some of the more spurious data or smaller data (for example, everything less than 0.5 seconds) automatically.

Maybe the algorithmic clustering approaches do this via math, but in a meeting use case, my gut instinct is to throw away some of the noise and only keep longer, more robust phrases so that you can get a clean fingerprint of someone's voice.
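
The pruning I have in mind is roughly the sketch below. It assumes segments are already available as (start, end, speaker) records in seconds; the Segment type and the function name are hypothetical, and 0.5 seconds is just the example cutoff mentioned above.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    start: float  # seconds
    end: float  # seconds
    speaker: str


def prune_short_segments(
    segments: Iterable[Segment], min_duration: float = 0.5
) -> List[Segment]:
    """Keep only segments lasting at least `min_duration` seconds."""
    return [s for s in segments if (s.end - s.start) >= min_duration]
```

Only the surviving segments would then be used to build each speaker's voice profile.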

github-actions · 2024-01-07T01:48:50Z

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

benhoff added the bug Something isn't working label Dec 5, 2023

github-actions bot added the stale label Jan 7, 2024

benhoff closed this as completed Jan 8, 2024
