MSDD Model __call__ method throws error #7975

Closed
benhoff opened this issue Dec 5, 2023 · 5 comments
Labels
bug Something isn't working stale

Comments


benhoff commented Dec 5, 2023

Describe the bug

If you call a NeuralDiarizer instance through its __call__ method instead of the diarize method, you get the stack trace below.

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:1514, in NeuralDiarizer.__call__(self, audio_filepath, batch_size, num_workers, max_speakers, num_speakers, out_dir, verbose)
   1503     self._initialize_configs(
   1504         manifest_path=manifest_path,
   1505         max_speakers=max_speakers,
   (...)
   1510         verbose=verbose,
   1511     )
   1513     self.msdd_model.cfg.test_ds.manifest_filepath = manifest_path
-> 1514     self.diarize()
   1516     pred_labels_clus = rttm_to_labels(f'{tmpdir}/pred_rttms/{Path(audio_filepath).stem}.rttm')
   1517 return labels_to_pyannote_object(pred_labels_clus)

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:1180, in NeuralDiarizer.diarize(self)
   1173 @torch.no_grad()
   1174 def diarize(self) -> Optional[List[Optional[List[Tuple[DiarizationErrorRate, Dict]]]]]:
   1175     """
   1176     Launch diarization pipeline which starts from VAD (or a oracle VAD stamp generation), initialization clustering and multiscale diarization decoder (MSDD).
   1177     Note that the result of MSDD can include multiple speakers at the same time. Therefore, RTTM output of MSDD needs to be based on `make_rttm_with_overlap()`
   1178     function that can generate overlapping timestamps. `self.run_overlap_aware_eval()` function performs DER evaluation.
   1179     """
-> 1180     self.clustering_embedding.prepare_cluster_embs_infer()
   1181     self.msdd_model.pairwise_infer = True
   1182     self.get_emb_clus_infer(self.clustering_embedding)

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:699, in ClusterEmbedding.prepare_cluster_embs_infer(self)
    695 """
    696 Launch clustering diarizer to prepare embedding vectors and clustering results.
    697 """
    698 self.max_num_speakers = self.cfg_diar_infer.diarizer.clustering.parameters.max_num_speakers
--> 699 self.emb_sess_test_dict, self.emb_seq_test, self.clus_test_label_dict, _ = self.run_clustering_diarizer(
    700     self._cfg_msdd.test_ds.manifest_filepath, self._cfg_msdd.test_ds.emb_dir
    701 )

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:879, in ClusterEmbedding.run_clustering_diarizer(self, manifest_filepath, emb_dir)
    875 self._embs_and_timestamps = get_embs_and_timestamps(
    876     self.clus_diar_model.multiscale_embeddings_and_timestamps, self.clus_diar_model.multiscale_args_dict
    877 )
    878 session_scale_mapping_dict = self.get_scale_map(self._embs_and_timestamps)
--> 879 emb_scale_seq_dict = self.load_emb_scale_seq_dict(emb_dir)
    880 clus_labels = self.load_clustering_labels(emb_dir)
    881 emb_sess_avg_dict, base_clus_label_dict = self.get_cluster_avg_embs(
    882     emb_scale_seq_dict, clus_labels, speaker_mapping_dict, session_scale_mapping_dict
    883 )

File ~/swdev/speech_processing/venvNemo/lib/python3.11/site-packages/nemo/collections/asr/models/msdd_models.py:961, in ClusterEmbedding.load_emb_scale_seq_dict(self, out_dir)
    957 pickle_path = os.path.join(
    958     out_dir, 'speaker_outputs', 'embeddings', f'subsegments_scale{scale_index}_embeddings.pkl'
    959 )
    960 logging.info(f"Loading embedding pickle file of scale:{scale_index} at {pickle_path}")
--> 961 with open(pickle_path, "rb") as input_file:
    962     emb_dict = pkl.load(input_file)
    963 for key, val in emb_dict.items():

FileNotFoundError: [Errno 2] No such file or directory: '/tmp/tmpil5i2a_2/speaker_outputs/embeddings/subsegments_scale0_embeddings.pkl'

Steps/Code to reproduce bug

Run everything in the Speaker_Diarization_Inference.ipynb notebook, then change the code to call system_vad_msdd_model('data/an4_diarize_test.wav'); the error is thrown.

@benhoff benhoff added the bug Something isn't working label Dec 5, 2023
Collaborator

tango4j commented Dec 6, 2023

Hi, @benhoff.
The __call__() function is meant to be used on a model loaded with
diarize = NeuralDiarizer.from_pretrained(model_name='diar_msdd_telephonic')
which lets you run diarization in its simplest form, e.g., diarize('audio.wav').

This call function is designed for the Hugging Face API, to minimize the preparation steps.
PR #5945 explains this use case.
If you have to load the model with a cfg, as in the tutorial, please use the .diarize() function instead.

We will add error handling for this illegal-call case.
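
The error handling mentioned above is not shown in this thread. As a stand-in on the caller's side, here is a minimal sketch that checks whether the per-scale embedding pickles exist before anything tries to load them. The path layout is copied from the traceback in the issue; the helper name `missing_embedding_pickles` and its `num_scales` parameter are hypothetical and not part of NeMo.

```python
import os
from typing import List


def missing_embedding_pickles(out_dir: str, num_scales: int) -> List[str]:
    """Return the expected per-scale embedding pickle paths that do not exist.

    Hypothetical helper: the directory layout mirrors the path built in
    load_emb_scale_seq_dict as shown in the traceback; num_scales is an
    assumption about how many scales the caller has configured.
    """
    missing = []
    for scale_index in range(num_scales):
        pickle_path = os.path.join(
            out_dir,
            "speaker_outputs",
            "embeddings",
            f"subsegments_scale{scale_index}_embeddings.pkl",
        )
        if not os.path.isfile(pickle_path):
            missing.append(pickle_path)
    return missing
```

A non-empty result would suggest that the clustering stage did not write its embeddings to `out_dir`, which is the symptom reported in the FileNotFoundError.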

Author

benhoff commented Dec 6, 2023

Hey @tango4j! I've been following your work recently as I chunk through some of this stuff, and I'm looking forward to seeing the online diarizer PR land, as it's closer to my use case.

I'm looking to diarize short chunks of audio (<30 seconds) on a server. I've wrapped this code base with an API, and I'm repeatedly rewriting/updating the manifest.json so that I can use the diarize method.

Is that the recommended way to do that?

Or would it be better to use the from_pretrained and then the __call__ method?
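
For the manifest-rewriting approach described above, a minimal sketch of a helper that writes a fresh one-line manifest per request might look like the following. The field names follow the manifest entry used in NeMo's diarization tutorial as best I recall it, so treat them as an assumption and check them against your NeMo version. The function name `write_manifest` is hypothetical.

```python
import json
import tempfile
from pathlib import Path


def write_manifest(audio_path: str) -> str:
    """Write a one-line JSON manifest for a single audio file; return its path.

    Each call uses its own temporary directory so that concurrent requests
    on a server do not overwrite each other's manifest.
    """
    # Assumed field set, modelled on the diarization tutorial's manifest entry.
    entry = {
        "audio_filepath": str(audio_path),
        "offset": 0,
        "duration": None,
        "label": "infer",
        "text": "-",
        "num_speakers": None,
        "rttm_filepath": None,
        "uem_filepath": None,
    }
    manifest_dir = Path(tempfile.mkdtemp(prefix="diar_manifest_"))
    manifest_path = manifest_dir / "manifest.json"
    with open(manifest_path, "w", encoding="utf-8") as fp:
        fp.write(json.dumps(entry) + "\n")
    return str(manifest_path)
```

The returned path would then be assigned to whichever manifest field your diarizer config reads before calling the diarize method. The caller is responsible for deleting the temporary directory afterwards.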

Collaborator

tango4j commented Dec 6, 2023

@benhoff
We are preparing the online diarization PR #7896.
I apologize that I have not had the bandwidth to finalize it.

An online diarization system is more sophisticated: it needs a history buffer mechanism to memorize past speaker profiles.

We won't be designing the online diarizer around such a long (30 s) buffer; it will take frame inputs of 1~2 seconds. However, you will be able to tweak the system to serve your purpose.

The Part-2 PR will include a tutorial, an example, and a YAML file.

For the time being, I think running offline diarization on all of the accumulated audio is the only way. You might want to match the speakers across the multiple sequential outputs.
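
One way to match speakers across sequential outputs is sketched below. It assumes you can obtain one embedding vector per speaker per chunk (for example, an average of that speaker's segment embeddings), which this thread does not show how to extract. The function names, the greedy matching strategy, and the 0.7 cosine threshold are all illustrative assumptions, not NeMo behavior.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speakers(
    known: Dict[str, Sequence[float]],
    chunk: Dict[str, Sequence[float]],
    threshold: float = 0.7,
) -> Dict[str, str]:
    """Map each chunk-local speaker label to a global speaker ID.

    known maps global IDs to reference embeddings and is updated in place
    when a chunk speaker matches no existing profile. Matching is greedy:
    each global ID is assigned to at most one local speaker per chunk.
    """
    mapping: Dict[str, str] = {}
    taken = set()
    for local_label, emb in chunk.items():
        best_id, best_sim = None, threshold
        for global_id, ref in known.items():
            if global_id in taken:
                continue
            sim = cosine_similarity(emb, ref)
            if sim >= best_sim:
                best_id, best_sim = global_id, sim
        if best_id is None:
            # No profile is similar enough: register a new global speaker.
            best_id = f"global_{len(known)}"
            known[best_id] = list(emb)
        taken.add(best_id)
        mapping[local_label] = best_id
    return mapping
```

Calling `match_speakers` once per chunk with the same `known` dictionary keeps labels consistent even when the diarizer numbers the speakers differently in each chunk. A greedy match is the simplest choice here; an optimal assignment (e.g. the Hungarian algorithm) would be more robust when several speakers sound alike.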

Author

benhoff commented Dec 7, 2023

@tango4j, no worries on the PR. It's good to know that the use case is targeting 1-2 seconds; I won't wait for it to land and will take the offline approach you suggested instead.

I was thinking about implementing the historical buffer myself, but I was surprised that most of the approaches don't prune some of the more spurious data or smaller data (for example, everything less than 0.5 seconds) automatically.

Maybe the algorithmic clustering approaches do this via math, but in a meeting use case, my gut instinct is to throw away some of the noise and only keep longer, more robust phrases so that you can get a clean fingerprint of someone's voice.
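
The pruning idea can be sketched as a simple duration filter applied before building a speaker profile. The `Segment` type and `prune_short_segments` function are hypothetical, and the 0.5 second default is just the example value from the comment above.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """A diarized span: start and end in seconds, plus a speaker label."""
    start: float
    end: float
    speaker: str


def prune_short_segments(
    segments: List[Segment], min_duration: float = 0.5
) -> List[Segment]:
    """Keep only segments that last at least min_duration seconds."""
    return [s for s in segments if (s.end - s.start) >= min_duration]
```

For example, given segments of 0.2 s and 2.0 s, only the 2.0 s segment survives with the default threshold, so very short interjections would not contribute to a speaker's voice profile.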

Contributor

github-actions bot commented Jan 7, 2024

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Jan 7, 2024
@benhoff benhoff closed this as completed Jan 8, 2024