Add support for Time Stamp calculation using transcribe_speech.py #5568
Signed-off-by: smajumdar <[email protected]>
compute_langs: Bool to request language ID information (if the model supports it)

(Optionally: You can limit the type of timestamp computations using below overrides)
ctc_decoding.ctc_timestamp_type="all"  # (default all, can be [all, char, word])
It may not be related to this PR, but can we use the same parameter name for both CTC and RNNT, something like timestamp_type?
Possible. I wanted it to be explicit because technically hybrid models might overwrite each other's value, but it turns out that's not the case.
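For readers following this thread, here is a minimal sketch of how the override above might be assembled and validated before being passed to the script. The helper `build_overrides` is hypothetical and not part of the PR, and the spelling of the `compute_timestamps=True` flag on the command line is an assumption based on the variable name in the diff. Only the `ctc_decoding.ctc_timestamp_type` key and its allowed values come from the docstring quoted above.

```python
from typing import List

# Allowed values as listed in the docstring: [all, char, word]
ALLOWED_TIMESTAMP_TYPES = ("all", "char", "word")


def build_overrides(timestamp_type: str = "all") -> List[str]:
    """Build command-line override strings (hypothetical helper, not in the PR)."""
    if timestamp_type not in ALLOWED_TIMESTAMP_TYPES:
        raise ValueError(
            f"timestamp_type must be one of {ALLOWED_TIMESTAMP_TYPES}, got {timestamp_type!r}"
        )
    return [
        "compute_timestamps=True",  # assumed flag spelling
        f'ctc_decoding.ctc_timestamp_type="{timestamp_type}"',
    ]


overrides = build_overrides("word")
print(" ".join(overrides))
```

The resulting strings would be appended to the transcribe_speech.py invocation alongside the usual model and dataset arguments.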
if compute_timestamps:
    timestamps = transcriptions[idx].timestep
    if timestamps is not None and isinstance(timestamps, dict):
        timestamps.pop(
Is the popping done to save CPU or GPU memory?
The item there is a torch tensor holding just the integer IDs of the locations in the audio stream where each text token was emitted. It has very low utility and is only a building block for the char- and word-based dicts.
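The pruning step discussed here can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: the key name `"timestep"` for the raw token-index entry, the helper `prune_timestamps`, and the offset field names in the sample data are all hypothetical, and plain lists stand in for the torch tensor so the snippet runs without dependencies.

```python
from typing import Any, Dict, Optional


def prune_timestamps(timestamps: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return a copy of the timestamp dict without the raw token-index entry."""
    if timestamps is None or not isinstance(timestamps, dict):
        return {}
    pruned = dict(timestamps)  # shallow copy so the caller's dict is untouched
    pruned.pop("timestep", None)  # assumed key name for the raw index tensor
    return pruned


# Hypothetical payload: a list stands in for the tensor of emission indices.
example = {
    "timestep": [3, 7, 12],
    "char": [{"char": "h", "start_offset": 3, "end_offset": 4}],
    "word": [{"word": "hi", "start_offset": 3, "end_offset": 8}],
}

pruned = prune_timestamps(example)
print(sorted(pruned))
```

Only the char and word entries remain, which are the ones a downstream consumer of the transcription output would typically read.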
Signed-off-by: smajumdar <[email protected]> Signed-off-by: smajumdar <[email protected]> Co-authored-by: Vahid Noroozi <[email protected]> Signed-off-by: andrusenkoau <[email protected]>
What does this PR do?
Adds a flag to compute word- and char-level timestamps using the transcribe_speech.py script.
Collection: [ASR]