
How to use speaker diarization without writing input to manifest and output to .rttm file ? #6271

Closed
dungnguyen98 opened this issue Mar 22, 2023 · 4 comments

@dungnguyen98

Thanks for your great repository. I have a question about speaker diarization.
I ran the inference steps from this tutorial: https://github.com/NVIDIA/NeMo/blob/main/tutorials/speaker_tasks/Speaker_Diarization_Inference.ipynb

According to this tutorial, the input audio is written to a manifest file and the output is saved in .rttm format under out_dir (the manifest file and out_dir are defined in diar_infer_telephonic.yaml). Is there a direct way to pass the input as audio metadata (path, duration, ...) and get the results back without writing the manifest and .rttm files?
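For context, the manifest the tutorial has you write is a JSON-lines file along these lines (the keys follow the tutorial notebook; the values here are only illustrative placeholders):

import json

# One JSON object per line, one line per audio file. Keys follow the
# Speaker_Diarization_Inference tutorial; most values are "unknown" placeholders.
meta = {
    "audio_filepath": "/path/to/audio.wav",
    "offset": 0,
    "duration": None,        # None: use the whole file
    "label": "infer",
    "text": "-",
    "num_speakers": None,    # None: let the diarizer estimate it
    "rttm_filepath": None,   # optional reference RTTM, used for scoring only
    "uem_filepath": None,
}
with open("input_manifest.json", "w") as f:
    f.write(json.dumps(meta) + "\n")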

@tango4j
Collaborator

tango4j commented Mar 23, 2023

You can perform speaker diarization without writing those manifest/RTTM files yourself by using the following feature:

#5945

Try following the example in the above PR.

@dungnguyen98
Author

Has this feature not been integrated into the NeMo library released via pip yet? I installed the newest version (1.16 at this point) and found that I cannot import NeuralDiarizer with "from nemo.collections.asr.models import NeuralDiarizer", although the main branch on GitHub can.

@dungnguyen98
Author

dungnguyen98 commented Mar 24, 2023

I have read the code of the NeuralDiarizer class, and the audio files still have to be written into a manifest file. When I run multiple processes, they overwrite this file, which leads to conflicts. I think I will need to customize the code for my use case with monkey patching :(
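One way to avoid the collision without monkey patching might be to give each process its own manifest and out_dir via the config-driven entry point used by NeMo's offline diarization scripts. A rough sketch, assuming the diarizer.manifest_filepath and diarizer.out_dir keys from diar_infer_telephonic.yaml (the model paths in that YAML still need to be filled in, and details may differ across NeMo versions):

import json
import os
import tempfile

from omegaconf import OmegaConf
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

def diarize_in_private_dir(audio_filepath, cfg_path="diar_infer_telephonic.yaml"):
    # Hypothetical helper: each worker process gets its own manifest and out_dir,
    # so parallel runs no longer overwrite each other's files.
    work_dir = tempfile.mkdtemp(prefix="diar_")
    manifest_path = os.path.join(work_dir, "manifest.json")
    entry = {"audio_filepath": audio_filepath, "offset": 0, "duration": None,
             "label": "infer", "text": "-", "num_speakers": None,
             "rttm_filepath": None, "uem_filepath": None}
    with open(manifest_path, "w") as f:
        f.write(json.dumps(entry) + "\n")

    cfg = OmegaConf.load(cfg_path)
    cfg.diarizer.manifest_filepath = manifest_path  # per-process manifest
    cfg.diarizer.out_dir = work_dir                 # per-process output directory
    NeuralDiarizer(cfg=cfg).diarize()
    return work_dir  # predicted RTTMs end up under <work_dir>/pred_rttms/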

@magicse

magicse commented Nov 21, 2023

@dungnguyen98

import os
import torch
import pandas as pd
from nemo.collections.asr.models.msdd_models import NeuralDiarizer

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

# Cache the pretrained model files in a local directory
model_path = "diarization_model"
os.environ["NEMO_CACHE_DIR"] = model_path

# Load the telephonic MSDD model and diarize a single file directly,
# without preparing a manifest or reading back an .rttm file yourself
test_model = NeuralDiarizer.from_pretrained("diar_msdd_telephonic").to(device)
annotation = test_model("denoised_vocals.wav", num_workers=0, batch_size=16)
rttm = annotation.to_rttm()

# Parse the RTTM string into a dataframe, merging consecutive segments of the
# same speaker (RTTM fields: start time at index 3, duration at 4, speaker at 7)
df = pd.DataFrame(columns=['start_time', 'end_time', 'speaker', 'text'])
lines = rttm.splitlines()
if len(lines) == 0:
    df.loc[0] = 0, 0, 'No speaker found', ''
else:
    split = lines[0].split()
    start_time, duration, prev_speaker = float(split[3]), float(split[4]), split[7]
    end_time = start_time + duration
    df.loc[0] = start_time, end_time, prev_speaker, ''

    for line in lines[1:]:
        split = line.split()
        start_time, duration, cur_speaker = float(split[3]), float(split[4]), split[7]
        end_time = start_time + duration
        if cur_speaker == prev_speaker:
            # Same speaker continues: extend the last row
            df.loc[df.index[-1], 'end_time'] = end_time
        else:
            # Speaker changed: start a new row
            df.loc[len(df)] = start_time, end_time, cur_speaker, ''
        prev_speaker = cur_speaker

print(df.to_string(index=False))
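If the returned annotation is a pyannote.core.Annotation (which to_rttm() suggests), you can also skip the RTTM string parsing and iterate the segments directly. A minimal sketch under that assumption:

import pandas as pd

# Assumes `annotation` above is a pyannote.core.Annotation, as to_rttm() suggests
rows = []
for segment, _, speaker in annotation.itertracks(yield_label=True):
    if rows and rows[-1]["speaker"] == speaker:
        # Same speaker continues: extend the previous segment
        rows[-1]["end_time"] = segment.end
    else:
        rows.append({"start_time": segment.start, "end_time": segment.end,
                     "speaker": speaker, "text": ""})

df2 = pd.DataFrame(rows, columns=["start_time", "end_time", "speaker", "text"])
print(df2.to_string(index=False))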
