Improve RNN-T streaming decoding #3295
Conversation
Force-pushed 629ea1e to 5f8bb02.
Thanks for the PR. Do you have some information about the improvement before and after? (Note: perhaps we should update the demo script to compute WER so that this kind of comparison is easy.)
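For context on what a quick WER check could look like, below is a minimal, self-contained helper that could be dropped into the demo script. The function name and the example strings are illustrative only; this is not existing torchaudio or demo-script code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("stew for dinner", "stew for the dinner"))  # ~0.33
```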
Force-pushed 5f8bb02 to ff23b28.
The transcripts match the non-streaming inference more closely with the change, since the predictor sees the full context of previous text predictions. The differences are even more pronounced on a custom dataset with conversational audio, as opposed to speakers reading audiobooks at a fixed pace, as in LibriSpeech.

Output before:

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to beled out in thickard fat and sauce stuff it you his belly counselled him after early nightfall the yellow lamps would light up here and there the squal of thephals hello bertie any good in your mind number ten fresh nelly is waiting on you good night husband the music came nearer and he recalled the words the words of shelley's fragment upon the moon wanderingless pale for weariness the dull light fell more faintly upon the page whereon equation began to unfold itself slowly and to spread abroad its widening tale a cold indifference reigned in his soul theos in which his ardor exting itself was a cold indifferent knowledge of himself at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace

Output after:

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fat and sauce stuff it into you his belly counselled him after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels hello bertie any good in your mind number ten fresh nelly is waiting on you good night husband the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tale a cold lucid indifference reigned in his soul the chaos in which his ardor extinguished itself was a cold indifferent knowledge of himself at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace
Hi @lakshmi-speak, thanks for the PR. One observation here is that, although we're still processing the input waveform segment by segment, we're now storing token sequences in the hypotheses that grow unboundedly, and rather than outputting text in a streaming manner as before, we now output text only after all input segments have been processed. This behavior runs counter to what we want with streaming inference. Do you have any proposals for how to address these issues?
Actually, every time we process an input segment we do have the transcription up until that time step. I only made the changes to the demo script to print at the end so as to not have the output repeat itself a bunch of times. So intermediate results are available.

Regarding the hypotheses growing, this is correct. I am not sure how big of an issue that is; perhaps there could be a way to reset the hypotheses via a user-defined flag?

The hypotheses will grow to reflect the input audio, similar to the non-streaming use case. But it's also just text tokens, so perhaps not that big of an issue?

Happy to hear your feedback on this.
@lakshmi-speak got it, thanks. I suppose the situations in which the hypotheses would grow beyond reason are fairly niche. For now, then, I think we can go with what you have here and make it the caller's responsibility to reset the hypotheses. As for printing the transcripts at the very end in the demo and tutorials, perhaps we can go back to printing after each iteration but with a carriage return appended.
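As a rough illustration of the carriage-return idea together with carrying the decoder state across segments, here is a minimal sketch of the streaming loop. It assumes the `state` and `hypothesis` keyword arguments of `RNNTBeamSearch.infer`; `streamer`, `feature_extractor`, `decoder`, and `token_processor` are placeholders for pipeline objects constructed elsewhere in the demo script.

```python
import torch

state, hypotheses = None, None  # carried across segments; set both to None to reset

with torch.inference_mode():
    for segment in streamer:  # placeholder: yields chunks of audio samples
        features, length = feature_extractor(segment)
        # Passing the previous hypotheses lets the predictor see the full token history.
        hypos, state = decoder.infer(features, length, 10, state=state, hypothesis=hypotheses)
        hypotheses = hypos
        transcript = token_processor(hypos[0][0], lstrip=False)
        # "\r" returns the cursor to the start of the line so the next print
        # overwrites the previous (single-line) transcript.
        print("\r" + transcript, end="", flush=True)
```

As discussed further below, this only refreshes cleanly while the transcript fits on a single line.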
Force-pushed f655156 to 0eba154.
Made the changes to the pipeline_demo.py script (it will only refresh the current line, but I think that's OK for LibriSpeech?).
Is the hypothesis the result of inference, or is it something fed to the models? Alternatively, does the growing hypothesis impose a performance penalty on inference at later stages?
```python
print(transcript, end="", flush=True)
hypothesis = hypos
transcript = token_processor(hypos[0][0], lstrip=False)
os.system('cls' if os.name == 'nt' else 'clear')
```
is this necessary?
This is because the transcript is long and a single-line refresh won't work. Is there any other workaround for this besides clearing the screen?
Can you show a screenshot of the issue that happens with long lines? I don't think a system call belongs here.
This example transcribes a very long audio clip, so the transcript spans multiple lines on the screen. Since the transcript now updates at every step but keeps all the prior history, if we didn't clear the screen we would end up with repetitions of the same line. Using a carriage return on print works if we are refreshing the same line; here we want to refresh several previously printed lines.
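One possible workaround that avoids spawning a shell, offered purely as a sketch and not as part of this PR: write ANSI escape codes to clear the screen and move the cursor home before reprinting the transcript. This assumes the terminal honors ANSI escapes, which most modern terminals (including recent Windows terminals) do; the `refresh` helper name is illustrative.

```python
import sys

CLEAR_AND_HOME = "\x1b[2J\x1b[H"  # ANSI: erase the display, then move the cursor to the top-left

def refresh(transcript: str) -> None:
    # Redraws the whole multi-line transcript without an os.system call.
    sys.stdout.write(CLEAR_AND_HOME + transcript)
    sys.stdout.flush()
```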
@lakshmi-speak it seems to work fine without the system call. Can you remove it? The rest of the PR looks good; we can merge it after you make the change.
@hwangjeff ready to merge!
There isn't any additional performance penalty because of the hypotheses growing in length, since the prediction and …
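For anyone who wants to check this empirically, one rough approach is to time each streaming step over a long recording and confirm the per-segment latency stays flat as the hypotheses grow. The `segments` iterable and `run_segment` callable below are hypothetical scaffolding, not code from this PR.

```python
import time

def measure_per_segment_latency(segments, run_segment):
    """Time each streaming decode step; the values should stay roughly constant
    if the growing hypotheses add no per-step cost."""
    latencies = []
    for segment in segments:
        start = time.perf_counter()
        run_segment(segment)  # e.g. feature extraction + one decoder.infer call
        latencies.append(time.perf_counter() - start)
    return latencies
```

Plotting the returned list against segment index makes any growth easy to spot.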
Force-pushed 34f1bf2 to d5e1210.
@hwangjeff has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@hwangjeff merged this pull request in 9fc0dca.
Summary: Pull Request resolved: pytorch#3379. Fixes `RNNTBeamSearch.infer`'s docstring and removes unused import from tutorial.
Reviewed By: mthrok
Differential Revision: D46227174
fbshipit-source-id: 2630295257c43acb14414b700b36939dfe6a8994
@lakshmi-speak note that we've merged your PR — thanks for contributing to the library!
This commit fixes the following issues affecting streaming decoding quality: the init_b hypothesis is only regenerated from the blank token if no initial hypotheses are provided. This also means that the resulting output is the entire transcript up until that time step, instead of just the incremental change in the transcript.
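To make that last point concrete, here is a minimal sketch of how a caller could recover just the incremental text from the full transcript returned at each step, and reset the hypotheses to start fresh. The `segments` iterable, the `decode_segment` helper, and the variable names are illustrative placeholders, not part of the library or the demo script.

```python
prev_len = 0
hypotheses = None  # set back to None whenever the caller wants to reset the decoder's text context

for segment in segments:  # placeholder: incoming audio chunks
    # decode_segment stands in for feature extraction plus RNNTBeamSearch.infer;
    # with this change it returns the full transcript decoded so far.
    hypotheses, transcript = decode_segment(segment, hypotheses)
    new_text = transcript[prev_len:]  # only the text added by this segment
    prev_len = len(transcript)
    print(new_text, end="", flush=True)
```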