JSON output from diarization now includes sentences. Optimized senten… #3897

demsarjure · 2022-03-29T10:29:47Z

…ce generation code

Signed-off-by: Jure Demsar [email protected]

What does this PR do ?

Diarization with ASR JSON output now also includes sentences. This is a continuation of #3791, which I accidentally closed when cleaning up the repo.

Collection: [Note which collection this PR will affect]

ASR

Changelog

Diarization with ASR JSON output now also includes sentences.

Usage

You can use the examples/speaker_tasks/diarization/offline_diarization.py script and check the output JSON file (.../pred_rttms/uniq_name.json).

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines.
Did you write any new necessary tests? No.
Did you add or update any necessary documentation? No.
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)? No.
Reviewer: Does the PR have correct import guards for all optional libraries? N/A.

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

@titu1994, @redoctopus, @jbalam-nv, or @okuchaiev

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to Speaker diarization with ASR ouputs #3708 (Speaker diarization with ASR ouputs #3708)

When implementing I used an approach that changes the original codebase as little as possible. In other words, I tried to be non-invasive. After reviewing the existing code for sentences (string_out) preparation in make_json_output() of /nemo/collections/asr/parts/utils/diarization_utils.py I feel like the whole process could be done much more efficiently and elegantly. Currently, the sentence preparation is appending words to a string where the beginning of the string are the time stamps. When the ending time stamp needs to be corrected the code then uses string splitting to extract and replace/correct the end time value. This is quite clunky and inefficient. It would be much more elegant if time stamps and the sentences were stored and built within a dictionary and then printed out at the end.

…ce generation code Signed-off-by: Jure Demsar <[email protected]>

tango4j · 2022-03-31T01:18:43Z

nemo/collections/asr/parts/utils/diarization_utils.py

+            start_point_str = datetime.fromtimestamp(start_point - datetime_offset).strftime(time_str)[:-4]
+            end_point_str = datetime.fromtimestamp(end_point - datetime_offset).strftime(time_str)[:-4]
+
+            # print time?


The variable "replace_time" is necessary for streaming diarization case.
In streaming diarization scenario, the end time timestamp should be updated everytime the diarization system gets another output.
Please do the following:

revive the argument "replace_time"

revive the if statement with the argument "replace_time"

Except this part, I checked that all the other functions are working well.

Hi tango4j!

Time replacement is now handled inside the make_json_output function (line 529). If the speaker is the same as the previous one we do not have a new sentence but the end time gets tweaked. The change is then also reflected in the printed out time (string_out variable that gets printed out to the .txt file) automatically. There is no need to have the replace_time argument there.

I might be wrong since I do not know how exactly the streaming pipeline will work. I can revive the replace_time argument, but in the current codebase it would not make any sense. I am writing this so we are on the same page. Please advise about how to proceed here.

Thanks, Jure.

Another remark is that the current implementation was executed as proposed by Nithin Rao in #3791.

@demsarjure
it's my bad that I didn't get to check make_json_output.
It seems now, in revised code, "sentences" item takes over start-end time generation for transcript and json item "sentences" at the same time, which reduces the redundancy.

Aside from this, could you do a minor cosmetic change by removing the one word comment?
For example, "sentences", "prepare", "color?", "print time?"
Other comments are helpful, but these one word comments seem somewhat unnecessary (since the variable names are self-explanatory).

nithinraok

Thanks, @demsarjure, ran and checked the output, looks good.

Just a minor change requested, please reflect, and should be good to go.
Also as taejin mentioned you may remove the comments added that look obvious.

nemo/collections/asr/parts/utils/diarization_utils.py

Signed-off-by: Jure Demsar <[email protected]>

demsarjure · 2022-04-01T08:00:33Z

@tango4j @nithinraok I revised everything in accordance with your comments. Thanks!

nithinraok · 2022-04-01T13:58:14Z

@demsarjure I think you missed above comment #3897 (comment)

Signed-off-by: Jure Demsar <[email protected]>

demsarjure · 2022-04-01T14:22:10Z

Yes, I did not see that comment, sorry about that :).

nithinraok

Thanks @demsarjure great work. LGTM

tango4j · 2022-04-01T18:41:32Z

Thank you @demsarjure . This is a thoughtful and useful contribution.

demsarjure added 2 commits March 29, 2022 12:25

JSON output from diarization now includes sentences. Optimized senten…

450a257

…ce generation code Signed-off-by: Jure Demsar <[email protected]>

Merge branch 'main' into feature/sentences

2afee4e

demsarjure mentioned this pull request Mar 30, 2022

Diarization with ASR JSON output now also includes sentences. #3791

Closed

8 tasks

Merge branch 'main' into feature/sentences

10424b5

nithinraok self-requested a review March 30, 2022 17:37

Merge branch 'main' into feature/sentences

23e2148

tango4j requested changes Mar 31, 2022

View reviewed changes

nithinraok requested changes Apr 1, 2022

View reviewed changes

nemo/collections/asr/parts/utils/diarization_utils.py Outdated Show resolved Hide resolved

nemo/collections/asr/parts/utils/diarization_utils.py Outdated Show resolved Hide resolved

Removed unnecessary comments and fixed a typo.

eb49978

Signed-off-by: Jure Demsar <[email protected]>

Moved all writing to logs and dicts to the write_and_log functions

9bb575b

Signed-off-by: Jure Demsar <[email protected]>

Merge branch 'main' into feature/sentences

4dd4d36

nithinraok approved these changes Apr 1, 2022

View reviewed changes

tango4j approved these changes Apr 1, 2022

View reviewed changes

tango4j merged commit b0e250f into NVIDIA:main Apr 1, 2022

okuchaiev mentioned this pull request Apr 6, 2022

Speaker diarization with ASR ouputs #3708

Closed

tango4j mentioned this pull request Apr 7, 2022

Torch conversion for VAD-Diarization pipeline #3930

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

JSON output from diarization now includes sentences. Optimized senten… #3897

JSON output from diarization now includes sentences. Optimized senten… #3897

demsarjure commented Mar 29, 2022

tango4j Mar 31, 2022

demsarjure Mar 31, 2022

demsarjure Mar 31, 2022

tango4j Mar 31, 2022

nithinraok left a comment

demsarjure commented Apr 1, 2022

nithinraok commented Apr 1, 2022

demsarjure commented Apr 1, 2022

nithinraok left a comment

tango4j commented Apr 1, 2022

JSON output from diarization now includes sentences. Optimized senten… #3897

JSON output from diarization now includes sentences. Optimized senten… #3897

Conversation

demsarjure commented Mar 29, 2022

What does this PR do ?

Changelog

Usage

Before your PR is "Ready for review"

Who can review?

Additional Information

tango4j Mar 31, 2022

Choose a reason for hiding this comment

demsarjure Mar 31, 2022

Choose a reason for hiding this comment

demsarjure Mar 31, 2022

Choose a reason for hiding this comment

tango4j Mar 31, 2022

Choose a reason for hiding this comment

nithinraok left a comment

Choose a reason for hiding this comment

demsarjure commented Apr 1, 2022

nithinraok commented Apr 1, 2022

demsarjure commented Apr 1, 2022

nithinraok left a comment

Choose a reason for hiding this comment

tango4j commented Apr 1, 2022