Speaker diarization with ASR outputs #3708

Closed
demsarjure opened this issue Feb 18, 2022 · 5 comments
Labels: feature request/PR for a new feature

Comments

@demsarjure
Contributor

Is your feature request related to a problem? Please describe.

I am developing an API for a speaker diarization task with ASR (/examples/speaker_tasks/diarization/offline_diarization_with_asr.py). For my use case, the script generates two useful outputs. The first is a .json file that looks something like this:

{
	"status": "Success",
	"session_id": "example",
	"transcription": "thank you sunny day",
	"speaker_count": 4,
	"words": [
		{
			"word": "thank",
			"start_time": 0.0,
			"end_time": 0.6,
			"speaker_label": "speaker_1"
		},
		{
			"word": "you",
			"start_time": 0.7,
			"end_time": 1.1,
			"speaker_label": "speaker_1"
		},
		{
			"word": "sunny",
			"start_time": 1.5,
			"end_time": 2.1,
			"speaker_label": "speaker_2"
		},
		{
			"word": "day",
			"start_time": 2.2,
			"end_time": 2.3,
			"speaker_label": "speaker_2"
		}
	]
}

So we have the whole transcription, which is very useful, along with a speaker label for each spoken word. For API purposes this JSON is very handy; however, its word-level contents are not very useful for practical applications. For diarization purposes and practical applications, the script's second output, the .txt file, is much more convenient:

[00:00.00 - 00:01.17] speaker_1: thank you
[00:01.54 - 00:02.33] speaker_2: sunny day
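
For illustration only, here is a minimal sketch of how a line of the .txt output could be parsed into a dictionary. It assumes each line has the shape `[MM:SS.ss - MM:SS.ss] speaker_label: text`, as in the two lines above; `parse_txt_line` and the output field names are hypothetical and are not part of NeMo.

```python
import re

# Assumed line shape (taken from the two example lines above, not from NeMo docs):
#   [MM:SS.ss - MM:SS.ss] speaker_label: text
LINE_RE = re.compile(
    r"^\[(\d+):(\d+(?:\.\d+)?) - (\d+):(\d+(?:\.\d+)?)\]\s+(\S+):\s*(.*)$"
)


def parse_txt_line(line):
    """Parse one sentence-level line into a dict (hypothetical helper)."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"Unrecognized line format: {line!r}")
    start_min, start_sec, end_min, end_sec, speaker, text = match.groups()
    return {
        "transcription": text,
        # Convert minutes and seconds to seconds, rounded to two decimals.
        "start_time": round(int(start_min) * 60 + float(start_sec), 2),
        "end_time": round(int(end_min) * 60 + float(end_sec), 2),
        "speaker_label": speaker,
    }
```

Calling `parse_txt_line` on each non-empty line of the file yields a list that can be serialized with `json.dumps`.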

Describe the solution you'd like

Would it be possible to add the information from the .txt output to the JSON? E.g., something like:

	"diarization": [
		{
			"transcription": "thank you",
			"start_time": 0.0,
			"end_time": 1.17,
			"speaker_label": "speaker_1"
		},
		{
			"transcription": "sunny day",
			"start_time": 1.54,
			"end_time": 2.33,
			"speaker_label": "speaker_2"
		}
	]
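
A rough sketch of how such entries could be derived from the existing "words" list, assuming a new segment starts whenever the speaker label changes. This is not how NeMo builds its .txt output; `group_words_by_speaker` is a hypothetical helper, and its timestamps come from the word boundaries, so they can differ slightly from those in the .txt file.

```python
def group_words_by_speaker(words):
    """Merge consecutive words that share a speaker label into one segment.

    Hypothetical helper: `words` is a list of dicts with the keys
    "word", "start_time", "end_time" and "speaker_label".
    """
    segments = []
    for word in words:
        last = segments[-1] if segments else None
        if last is not None and last["speaker_label"] == word["speaker_label"]:
            # Same speaker as the previous word: extend the current segment.
            last["transcription"] += " " + word["word"]
            last["end_time"] = word["end_time"]
        else:
            # Speaker changed (or first word): start a new segment.
            segments.append(
                {
                    "transcription": word["word"],
                    "start_time": word["start_time"],
                    "end_time": word["end_time"],
                    "speaker_label": word["speaker_label"],
                }
            )
    return segments
```

Passing the "words" list from the JSON above returns two segments, one per speaker, in the same shape as the proposed entries.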

Describe alternatives you've considered

I took a look at the code and I believe I could code this myself. Would you be interested in a pull request that modifies the output JSON?

@demsarjure demsarjure added the feature request/PR for a new feature label Feb 18, 2022
@nithinraok
Collaborator

nithinraok commented Feb 18, 2022

Thanks for the suggestion.

The purpose of the <uniq_name>.json file is to provide word-level assignment of speaker labels, and the purpose of the <uniq_name>.txt file is to provide sentence-level assignment of speaker labels. This is generally the convention followed in the diarization domain.

Is there any reason you would prefer the sentence-level assignments in JSON rather than in txt format?

If you feel the need, I would suggest sending a PR that adds sentence-level transcriptions to the same <uniq_name>.json under a new "sentences" key, alongside the "words" key.

@demsarjure
Contributor Author

Hi! Thanks for the reply.

I am interested in this functionality because of a practical application. As I mentioned, I am developing an API for speaker diarization, and the API returns its result as a JSON file. One of our use cases is changing the color of automatically generated subtitles depending on who is speaking. For this we need the number of speakers (provided in the JSON), the transcript (provided in the JSON), and the sentence-level diarization (provided in the .txt file). So we have to prepare a new JSON on the API side that includes everything. Since our use case is not uncommon for speaker diarization, I was wondering whether it would make sense to do this on the NeMo side and save other NeMo users time when they need it as well.
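
To make the subtitle use case concrete, here is a minimal sketch of assigning one color per speaker to sentence-level entries. The entry fields follow the shape proposed earlier in this thread; `assign_speaker_colors` and the palette values are hypothetical and only illustrate the idea.

```python
from itertools import cycle

# Hypothetical palette; any list of color strings would do.
DEFAULT_PALETTE = ["#ffffff", "#ffd54f", "#4fc3f7", "#a5d6a7"]


def assign_speaker_colors(sentences, palette=DEFAULT_PALETTE):
    """Return copies of the sentence entries with a "color" field added.

    Colors are handed out in order of each speaker's first appearance and
    are reused cyclically if there are more speakers than palette entries.
    """
    colors = {}
    pool = cycle(palette)
    for sentence in sentences:
        label = sentence["speaker_label"]
        if label not in colors:
            colors[label] = next(pool)
    # Build new dicts so the input entries are left unmodified.
    return [dict(s, color=colors[s["speaker_label"]]) for s in sentences]
```

A subtitle renderer could then read `start_time`, `end_time`, `transcription`, and `color` from each returned entry.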

@nithinraok
Collaborator

nithinraok commented Feb 22, 2022

Yes, please feel free to send a PR that adds sentence-level transcriptions to the same <uniq_name>.json under a new "sentences" key, alongside the "words" key.
The final JSON will have the following keys:

  • status
  • session_id
  • transcription
  • speaker_count
  • words
  • sentences

@demsarjure
Contributor Author

Hi!

The PR is at #3791. Let me know if it needs any changes.

Cheers, Jure

@okuchaiev
Member

Looks like the relevant PR, #3897, was merged.
