Improve RNN-T streaming decoding #3295
Conversation
Force-pushed 629ea1e to 5f8bb02.
Thanks for the PR. Do you have some information about the improvement before and after? (Note: perhaps we should update the demo script to compute WER so that this kind of comparison is easy.)
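For context on what a quick WER check could look like, below is a minimal, self-contained helper that could be dropped into the demo script. The function name and the example strings are illustrative only; this is not existing torchaudio or demo-script code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if words match)
            )
        prev = curr
    return prev[-1] / max(len(ref), 1)


print(word_error_rate("stew for dinner", "stew for the dinner"))  # ~0.33
```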
Force-pushed 5f8bb02 to ff23b28.
The transcripts match the non-streaming inference more closely with the change, since the predictor sees the full context of previous text predictions. The differences are even more pronounced on a custom dataset with conversational audio, as opposed to speakers reading audiobooks at a fixed pace, as in LibriSpeech.

Output before:

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to beled out in thickard fat and sauce stuff it you his belly counselled him after early nightfall the yellow lamps would light up here and there the squal of thephals hello bertie any good in your mind number ten fresh nelly is waiting on you good night husband the music came nearer and he recalled the words the words of shelley's fragment upon the moon wanderingless pale for weariness the dull light fell more faintly upon the page whereon equation began to unfold itself slowly and to spread abroad its widening tale a cold indifference reigned in his soul theos in which his ardor exting itself was a cold indifferent knowledge of himself at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace

Output after:

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fat and sauce stuff it into you his belly counselled him after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels hello bertie any good in your mind number ten fresh nelly is waiting on you good night husband the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tale a cold lucid indifference reigned in his soul the chaos in which his ardor extinguished itself was a cold indifferent knowledge of himself at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace
Hi @lakshmi-speak, thanks for the PR. One observation here is that, although we're still processing the input waveform segment by segment, we're now storing token sequences in the hypotheses that grow unboundedly, and rather than outputting text in a streaming manner as before, we now output text only after all input segments have been processed. This behavior runs counter to what we want with streaming inference. Do you have any proposals for how to address these issues?
Actually, every time we process an input segment we do have the transcription up until that time step. I only made the changes to the demo script to print at the end so as to not have the output repeat itself a bunch of times. So intermediate results are available.

Regarding the hypotheses growing, this is correct. I am not sure how big of an issue that is; perhaps there could be a way to reset the hypotheses via a user-defined flag?

The hypotheses will grow to reflect the input audio, similar to the non-streaming use case. But it's also just text tokens, so perhaps not that big of an issue?

Happy to hear your feedback on this.
@lakshmi-speak got it, thanks. I suppose the situations in which the hypotheses would grow beyond reason are fairly niche. For now, then, I think we can go with what you have here and make it the caller's responsibility to reset the hypotheses. As for printing the transcripts at the very end in the demo and tutorials, perhaps we can go back to printing after each iteration but with a carriage return appended.
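As a rough illustration of the carriage-return idea together with carrying the decoder state across segments, here is a minimal sketch of the streaming loop. It assumes the `state` and `hypothesis` keyword arguments of `RNNTBeamSearch.infer`; `streamer`, `feature_extractor`, `decoder`, and `token_processor` are placeholders for pipeline objects constructed elsewhere in the demo script.

```python
import torch

state, hypotheses = None, None  # carried across segments; set both to None to reset

with torch.inference_mode():
    for segment in streamer:  # placeholder: yields chunks of audio samples
        features, length = feature_extractor(segment)
        # Passing the previous hypotheses lets the predictor see the full token history.
        hypos, state = decoder.infer(features, length, 10, state=state, hypothesis=hypotheses)
        hypotheses = hypos
        transcript = token_processor(hypos[0][0], lstrip=False)
        # "\r" returns the cursor to the start of the line so the next print
        # overwrites the previous (single-line) transcript.
        print("\r" + transcript, end="", flush=True)
```

As discussed further below, this only refreshes cleanly while the transcript fits on a single line.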
Force-pushed f655156 to 0eba154.
Made the changes to the pipeline_demo.py script (it will only refresh the current line, but I think that's OK for LibriSpeech?).
Is the hypothesis the result of inference, or is it something fed to the models? Alternatively, does the growing hypothesis impose a performance penalty on inference at later stages?
```python
print(transcript, end="", flush=True)
hypothesis = hypos
transcript = token_processor(hypos[0][0], lstrip=False)
os.system('cls' if os.name == 'nt' else 'clear')
```
is this necessary?
This is because the transcript is long and a single-line refresh won't work. Is there any other workaround for this besides clearing the screen?
Can you show a screenshot of the issue that happens with long lines? I don't think a system call belongs here.
This example transcribes a very long audio clip, so the transcript spans multiple lines on the screen. Since the transcript now updates at every step but keeps all the prior history, if we didn't clear the screen we would end up with repetitions of the same line. Using a carriage return on print works if we are refreshing the same line; here we want to refresh several previously printed lines.
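One possible workaround that avoids spawning a shell, offered purely as a sketch and not as part of this PR: write ANSI escape codes to clear the screen and move the cursor home before reprinting the transcript. This assumes the terminal honors ANSI escapes, which most modern terminals (including recent Windows terminals) do; the `refresh` helper name is illustrative.

```python
import sys

CLEAR_AND_HOME = "\x1b[2J\x1b[H"  # ANSI: erase the display, then move the cursor to the top-left

def refresh(transcript: str) -> None:
    # Redraws the whole multi-line transcript without an os.system call.
    sys.stdout.write(CLEAR_AND_HOME + transcript)
    sys.stdout.flush()
```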
@lakshmi-speak it seems to work fine without the system call. Can you remove it? The rest of the PR looks good; we can merge it after you make the change.
@hwangjeff ready to merge!
There isn't any additional performance penalty because of the hypotheses growing in length, since the prediction and …
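For anyone who wants to check this empirically, one rough approach is to time each streaming step over a long recording and confirm the per-segment latency stays flat as the hypotheses grow. The `segments` iterable and `run_segment` callable below are hypothetical scaffolding, not code from this PR.

```python
import time

def measure_per_segment_latency(segments, run_segment):
    """Time each streaming decode step; the values should stay roughly constant
    if the growing hypotheses add no per-step cost."""
    latencies = []
    for segment in segments:
        start = time.perf_counter()
        run_segment(segment)  # e.g. feature extraction + one decoder.infer call
        latencies.append(time.perf_counter() - start)
    return latencies
```

Plotting the returned list against segment index makes any growth easy to spot.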
Force-pushed 34f1bf2 to d5e1210.
@hwangjeff has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
@hwangjeff merged this pull request in 9fc0dca.
Summary: Pull Request resolved: pytorch#3379. Fixes `RNNTBeamSearch.infer`'s docstring and removes unused import from tutorial.
Reviewed By: mthrok
Differential Revision: D46227174
fbshipit-source-id: 2630295257c43acb14414b700b36939dfe6a8994
@lakshmi-speak note that we've merged your PR — thanks for contributing to the library!
This commit fixes the following issues affecting streaming decoding quality: the init_b hypothesis is only regenerated from the blank token if no initial hypotheses are provided. This also means that the resulting output is the entire transcript up until that time step, instead of just the incremental change in the transcript.
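To make that last point concrete, here is a minimal sketch of how a caller could recover just the incremental text from the full transcript returned at each step, and reset the hypotheses to start fresh. The `segments` iterable, the `decode_segment` helper, and the variable names are illustrative placeholders, not part of the library or the demo script.

```python
prev_len = 0
hypotheses = None  # set back to None whenever the caller wants to reset the decoder's text context

for segment in segments:  # placeholder: incoming audio chunks
    # decode_segment stands in for feature extraction plus RNNTBeamSearch.infer;
    # with this change it returns the full transcript decoded so far.
    hypotheses, transcript = decode_segment(segment, hypotheses)
    new_text = transcript[prev_len:]  # only the text added by this segment
    prev_len = len(transcript)
    print(new_text, end="", flush=True)
```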