Speaker diarization with ASR outputs #3708

demsarjure · 2022-02-18T11:08:08Z

Is your feature request related to a problem? Please describe.

I am developing an API for a speaker diarization task with ASR (/examples/speaker_tasks/diarization/offline_diarization_with_asr.py). For my use case, the script generates two useful outputs. The first is a .json file that looks something like this:

{
	"status": "Success",
	"session_id": "example",
	"transcription": "thank you sunny day",
	"speaker_count": 4,
	"words": [
		{
			"word": "thank",
			"start_time": 0.0,
			"end_time": 0.6,
			"speaker_label": "speaker_1"
		},
		{
			"word": "you",
			"start_time": 0.7,
			"end_time": 1.1,
			"speaker_label": "speaker_1"
		},
		{
			"word": "sunny",
			"start_time": 1.5,
			"end_time": 2.1,
			"speaker_label": "speaker_2"
		},
		{
			"word": "day",
			"start_time": 2.2,
			"end_time": 2.3,
			"speaker_label": "speaker_2"
		}
	]
}

So we have the whole transcription, which is very useful, along with a speaker label for each spoken word. For API purposes this JSON is very handy; however, its word-level contents are not very useful on their own for practical applications. For diarization purposes and practical applications, the script's second output, a .txt file, is much more convenient:

[00:00.00 - 00:01.17] speaker_1: thank you
[00:01.54 - 00:02.33] speaker_2: sunny day

Describe the solution you'd like

Would it be possible to add the information from the .txt output to the JSON? E.g., something like:

	"diarization": [
		{
			"transcription": "thank you",
			"start_time": 0.0,
			"end_time": 1.17,
			"speaker_label": "speaker_1"
		},
		{
			"transcription": "sunny day",
			"start_time": 1.54,
			"end_time": 2.33,
			"speaker_label": "speaker_2"
		}
	]
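
One way such entries could be derived from the existing "words" list is to merge consecutive words that share a speaker label. The sketch below is only an illustration under that assumption: the function name `group_words_into_sentences` is hypothetical, and this is not necessarily how the script builds its .txt output (whose timestamps may differ from the word-level ones).

```python
def group_words_into_sentences(words):
    """Merge consecutive words with the same speaker label into one entry.

    `words` is a list of dicts shaped like the "words" entries above.
    This grouping rule (one entry per speaker turn) is an assumption.
    """
    sentences = []
    for w in words:
        last = sentences[-1] if sentences else None
        if last is not None and last["speaker_label"] == w["speaker_label"]:
            # Same speaker keeps talking: extend the current entry.
            last["transcription"] += " " + w["word"]
            last["end_time"] = w["end_time"]
        else:
            # Speaker changed (or first word): start a new entry.
            sentences.append({
                "transcription": w["word"],
                "start_time": w["start_time"],
                "end_time": w["end_time"],
                "speaker_label": w["speaker_label"],
            })
    return sentences
```

Applied to the four words in the example JSON, this yields one entry for speaker_1 ("thank you") and one for speaker_2 ("sunny day"), each spanning from the first word's start time to the last word's end time.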

Describe alternatives you've considered

I took a look at the code and I believe I could code this myself. Would you be interested in a pull request that modifies the output JSON?

nithinraok · 2022-02-18T21:35:57Z

Thanks for the suggestion.

The purpose of the <uniq_name>.json file is to provide word-level assignment of speaker labels, and the purpose of the <uniq_name>.txt file is to provide sentence-level assignment of speaker labels. This is generally the convention followed in the diarization domain.

Is there any reason you would prefer the sentence-level assignments in JSON rather than in txt format?

If you feel the need, I would suggest sending a PR that adds sentence-level transcriptions to the same <uniq_name>.json, under a "sentences" key alongside the "words" key.

demsarjure · 2022-02-19T10:41:25Z

Hi! Thanks for the reply.

I am interested in this functionality because of a practical application. As I mentioned, I am developing an API for speaker diarization, and the API returns its result as a JSON file. One of our use cases is changing the color of automatically generated subtitles depending on who is speaking. For this we need the number of speakers (provided in the JSON), the transcript (provided in the JSON), and sentence-level diarization (provided in the .txt file). So we need to prepare a new JSON on the API side that includes everything. Since our use case is not uncommon for speaker diarization, I was wondering whether it would make sense to do this on the NeMo side and save other NeMo users time when they need this as well.
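
For illustration, the API-side merge described above could look roughly like the sketch below. It assumes the .txt lines follow exactly the layout shown in the example earlier in this thread (`[MM:SS.ss - MM:SS.ss] speaker: text`); the helper names `parse_txt_line` and `merge_outputs` are hypothetical, and longer recordings may use a timestamp format this pattern does not cover.

```python
import re

# Assumed line layout, copied from the example .txt output in this thread:
# [MM:SS.ss - MM:SS.ss] speaker_N: text
LINE_RE = re.compile(r"^\[(\d+):(\d+\.\d+) - (\d+):(\d+\.\d+)\] (\S+): (.*)$")


def parse_txt_line(line):
    """Turn one .txt line into a dict, or return None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    start_min, start_sec, end_min, end_sec, speaker, text = m.groups()
    return {
        "transcription": text,
        "start_time": round(int(start_min) * 60 + float(start_sec), 2),
        "end_time": round(int(end_min) * 60 + float(end_sec), 2),
        "speaker_label": speaker,
    }


def merge_outputs(json_result, txt_lines):
    """Return a copy of the word-level JSON with a "sentences" list added."""
    merged = dict(json_result)
    parsed = (parse_txt_line(line) for line in txt_lines)
    merged["sentences"] = [s for s in parsed if s is not None]
    return merged
```

Every API consumer that wants sentence-level output would need a parser like this, which is the duplication the request aims to avoid.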

nithinraok · 2022-02-22T16:45:24Z

Yes, please feel free to send a PR that adds sentence-level transcriptions to the same <uniq_name>.json, under a "sentences" key alongside the "words" key.
The final JSON will have the following keys:

status
session_id
transcription
speaker_count
words
sentences

demsarjure · 2022-03-03T11:53:48Z

Hi!

The PR is at #3791. Let me know if it needs any changes.

Cheers, Jure

okuchaiev · 2022-04-06T00:09:06Z

Looks like the relevant PR, #3897, was merged.

demsarjure added the feature request/PR for a new feature label Feb 18, 2022

demsarjure assigned okuchaiev Feb 18, 2022

nithinraok assigned nithinraok and unassigned okuchaiev Feb 22, 2022

demsarjure mentioned this issue Mar 3, 2022

Diarization with ASR JSON output now also includes sentences. #3791

Closed

8 tasks

demsarjure mentioned this issue Mar 29, 2022

JSON output from diarization now includes sentences. Optimized senten… #3897

Merged

8 tasks

okuchaiev closed this as completed Apr 6, 2022
