Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON output from diarization now includes sentences. Optimized senten… #3897

Merged
merged 7 commits into from
Apr 1, 2022

Conversation

demsarjure
Copy link
Contributor

…ce generation code

Signed-off-by: Jure Demsar [email protected]

What does this PR do ?

Diarization with ASR JSON output now also includes sentences. This is a continuation of #3791, which I accidentally closed when cleaning up the repo.

Collection: [Note which collection this PR will affect]

ASR

Changelog

Diarization with ASR JSON output now also includes sentences.

Usage

You can use the examples/speaker_tasks/diarization/offline_diarization.py script and check the output JSON file (.../pred_rttms/uniq_name.json).

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines.
  • Did you write any new necessary tests? No.
  • Did you add or update any necessary documentation? No.
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)? No.
  • Reviewer: Does the PR have correct import guards for all optional libraries? N/A.

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

@titu1994, @redoctopus, @jbalam-nv, or @okuchaiev

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

When implementing I used an approach that changes the original codebase as little as possible. In other words, I tried to be non-invasive. After reviewing the existing code for sentences (string_out) preparation in make_json_output() of /nemo/collections/asr/parts/utils/diarization_utils.py I feel like the whole process could be done much more efficiently and elegantly. Currently, the sentence preparation is appending words to a string where the beginning of the string are the time stamps. When the ending time stamp needs to be corrected the code then uses string splitting to extract and replace/correct the end time value. This is quite clunky and inefficient. It would be much more elegant if time stamps and the sentences were stored and built within a dictionary and then printed out at the end.

@nithinraok nithinraok self-requested a review March 30, 2022 17:37
start_point_str = datetime.fromtimestamp(start_point - datetime_offset).strftime(time_str)[:-4]
end_point_str = datetime.fromtimestamp(end_point - datetime_offset).strftime(time_str)[:-4]

# print time?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The variable "replace_time" is necessary for streaming diarization case.
In streaming diarization scenario, the end time timestamp should be updated everytime the diarization system gets another output.
Please do the following:

  • revive the argument "replace_time"
  • revive the if statement with the argument "replace_time"

Except this part, I checked that all the other functions are working well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi tango4j!

Time replacement is now handled inside the make_json_output function (line 529). If the speaker is the same as the previous one we do not have a new sentence but the end time gets tweaked. The change is then also reflected in the printed out time (string_out variable that gets printed out to the .txt file) automatically. There is no need to have the replace_time argument there.

I might be wrong since I do not know how exactly the streaming pipeline will work. I can revive the replace_time argument, but in the current codebase it would not make any sense. I am writing this so we are on the same page. Please advise about how to proceed here.

Thanks, Jure.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another remark is that the current implementation was executed as proposed by Nithin Rao in #3791.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@demsarjure
it's my bad that I didn't get to check make_json_output.
It seems now, in revised code, "sentences" item takes over start-end time generation for transcript and json item "sentences" at the same time, which reduces the redundancy.

Aside from this, could you do a minor cosmetic change by removing the one word comment?
For example, "sentences", "prepare", "color?", "print time?"
Other comments are helpful, but these one word comments seem somewhat unnecessary (since the variable names are self-explanatory).

Copy link
Collaborator

@nithinraok nithinraok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @demsarjure, ran and checked the output, looks good.

Just a minor change requested, please reflect, and should be good to go.
Also as taejin mentioned you may remove the comments added that look obvious.

nemo/collections/asr/parts/utils/diarization_utils.py Outdated Show resolved Hide resolved
nemo/collections/asr/parts/utils/diarization_utils.py Outdated Show resolved Hide resolved
@demsarjure
Copy link
Contributor Author

@tango4j @nithinraok I revised everything in accordance with your comments. Thanks!

@nithinraok
Copy link
Collaborator

@demsarjure I think you missed above comment #3897 (comment)

@demsarjure
Copy link
Contributor Author

Yes, I did not see that comment, sorry about that :).

Copy link
Collaborator

@nithinraok nithinraok left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @demsarjure great work. LGTM

@tango4j
Copy link
Collaborator

tango4j commented Apr 1, 2022

Thank you @demsarjure . This is a thoughtful and useful contribution.

@tango4j tango4j merged commit b0e250f into NVIDIA:main Apr 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants