
Improve RNN-T streaming decoding #3295

Closed

Conversation

@lakshmi-speak (Contributor) commented May 2, 2023

This commit fixes the following issues affecting streaming decoding quality:

  1. Regenerates the init_b hypothesis from the blank token only when no initial hypotheses are provided.
  2. Allows the decoder to receive the top-K hypotheses to continue decoding from, instead of using just the top hypothesis at each decoding step. This dramatically improves decoding quality, especially for speech with long pauses and disfluencies.
  3. Fixes some minor errors in shape checking for lengths.

This also means that the resulting output is the entire transcript up until that time step, instead of just the incremental change in the transcript.
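For context, here is a minimal sketch of the intended streaming usage after this change, assuming torchaudio's RNNTBeamSearch API; the model setup and the `segments` iterable are illustrative, not part of this PR:

```python
from torchaudio.models import emformer_rnnt_base, RNNTBeamSearch

# Illustrative setup; a real pipeline would load pretrained weights instead.
model = emformer_rnnt_base(num_symbols=4097)
decoder = RNNTBeamSearch(model, blank=4096)

state, hypotheses = None, None
for features, length in segments:  # assumed iterable of (features, length) tensors
    # With this change, the full list of top-K hypotheses is passed back in,
    # so the predictor keeps the complete context of previous predictions.
    hypos, state = decoder.infer(
        features, length, beam_width=10, state=state, hypothesis=hypotheses
    )
    hypotheses = hypos
    # Each returned hypothesis now holds the entire transcript so far,
    # not just the increment for this segment.
```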

@pytorch-bot (bot) commented May 2, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/audio/3295

Note: Links to docs will display an error until the docs builds have been completed.

❌ 2 New Failures, 4 Unrelated Failures

As of commit 768307f:

NEW FAILURES - The following jobs have failed:

BROKEN TRUNK - The following jobs failed but were present on the merge base 5a6f4eb:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@lakshmi-speak lakshmi-speak force-pushed the rnnt_decoder_fix_streaming branch 2 times, most recently from 629ea1e to 5f8bb02 Compare May 3, 2023 16:44
@lakshmi-speak lakshmi-speak marked this pull request as draft May 3, 2023 18:49
@mthrok (Collaborator) commented May 3, 2023

Hi @lakshmi-speak

Thanks for the PR. Do you have some information about the improvement before and after?
For example, does running this script give better results?
https://github.com/pytorch/audio/blob/main/examples/asr/emformer_rnnt/pipeline_demo.py

(note: Perhaps we should update the demo script to compute WER so that this kind of comparison is easy)
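For reference, such a WER computation could be a few lines. A minimal sketch using torchaudio.functional.edit_distance; the helper name and example are illustrative:

```python
import torchaudio.functional as F

def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance normalized by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return F.edit_distance(ref_words, hyp_words) / len(ref_words)

# e.g. word_error_rate("a cold lucid indifference", "a cold indifference") == 0.25
```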

@lakshmi-speak lakshmi-speak force-pushed the rnnt_decoder_fix_streaming branch from 5f8bb02 to ff23b28 Compare May 4, 2023 17:02
@lakshmi-speak (Contributor, Author) commented May 4, 2023

> Hi @lakshmi-speak
>
> Thanks for the PR. Do you have some information about the improvement before and after? For example, does running this script give better results? https://github.com/pytorch/audio/blob/main/examples/asr/emformer_rnnt/pipeline_demo.py
>
> (note: Perhaps we should update the demo script to compute WER so that this kind of comparison is easy)

The transcripts match the non-streaming inference much more closely with this change, since the predictor sees the full context of previous text predictions. The differences are even more pronounced on a custom dataset with conversational audio, as opposed to speakers reading audiobooks at a fixed pace, as in LibriSpeech.

Output of https://github.com/pytorch/audio/blob/main/examples/asr/emformer_rnnt/pipeline_demo.py (in each pair of transcripts, the first is the streaming result and the second is the non-streaming result):

Before

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to beled out in thickard fat and sauce
he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fat and sauce

stuff it you his belly counselled him
stuff it into you his belly counselled him

after early nightfall the yellow lamps would light up here and there the squal of thephals
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels

hello bertie any good in your mind
hello bertie any good in your mind

number ten fresh nelly is waiting on you good night husband
number ten fresh nelly is waiting on you good night husband

the music came nearer and he recalled the words the words of shelley's fragment upon the moon wanderingless pale for weariness
the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness

the dull light fell more faintly upon the page whereon equation began to unfold itself slowly and to spread abroad its widening tale
the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tale

a cold indifference reigned in his soul
a cold lucid indifference reigned in his soul

theos in which his ardor exting itself was a cold indifferent knowledge of himself
the chaos in which his ardor extinguished itself was a cold indifferent knowledge of himself

at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace
at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace

After

he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fat and sauce
he hoped there would be stew for dinner turnips and carrots and bruised potatoes and fat mutton pieces to be ladled out in thick peppered flour fat and sauce

stuff it into you his belly counselled him
stuff it into you his belly counselled him

after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels
after early nightfall the yellow lamps would light up here and there the squalid quarter of the brothels

hello bertie any good in your mind
hello bertie any good in your mind

number ten fresh nelly is waiting on you good night husband
number ten fresh nelly is waiting on you good night husband

the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness
the music came nearer and he recalled the words the words of shelley's fragment upon the moon wandering companionless pale for weariness

the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tale
the dull light fell more faintly upon the page whereon another equation began to unfold itself slowly and to spread abroad its widening tale

a cold lucid indifference reigned in his soul
a cold lucid indifference reigned in his soul

the chaos in which his ardor extinguished itself was a cold indifferent knowledge of himself
the chaos in which his ardor extinguished itself was a cold indifferent knowledge of himself

at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace
at most by an alms given to a beggar whose blessing he fled from he might hope wearily to win for himself some measure of actual grace


@lakshmi-speak lakshmi-speak changed the title Fix RNN-T streaming decoding Improve RNN-T streaming decoding May 4, 2023
@lakshmi-speak lakshmi-speak marked this pull request as ready for review May 4, 2023 17:25
@hwangjeff (Contributor)

Hi @lakshmi-speak, thanks for the PR. One observation: although we're still processing the input waveform segment by segment, we're now storing token sequences in the hypotheses that grow unboundedly, and rather than outputting text in a streaming manner as before, we now output text only after all input segments have been processed. This behavior runs counter to what we want from streaming inference. Do you have any proposals for how to address these issues?

@lakshmi-speak (Contributor, Author) commented May 5, 2023 via email

@hwangjeff (Contributor)

@lakshmi-speak got it, thanks. I suppose the situations in which the hypotheses would grow beyond reason are fairly niche. For now, then, I think we can go with what you have here and make it the caller's responsibility to reset the hypotheses. As for printing the transcripts at the very end in the demo and tutorials, perhaps we can go back to printing after each iteration but with a carriage return appended, e.g. replace print(transcript, end="", flush=True) with print(transcript, end="\r", flush=True), so that each subsequent transcript overwrites the previous transcript. This way, users can see the updates in real time.
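In the demo loop, that suggestion is a one-line change (a sketch; the surrounding loop is elided):

```python
# Before: transcripts were incremental, so each piece was appended on one line.
print(transcript, end="", flush=True)

# After: each transcript is now the full text so far, so overwrite the line in
# place; the carriage return moves the cursor back to the start of the line.
print(transcript, end="\r", flush=True)
```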

@lakshmi-speak lakshmi-speak force-pushed the rnnt_decoder_fix_streaming branch from f655156 to 0eba154 Compare May 5, 2023 21:30
@lakshmi-speak (Contributor, Author)

Made the changes to the pipeline_demo.py script (it will only refresh the current line, but I think that's OK for LibriSpeech?)

@mthrok (Collaborator) commented May 6, 2023

Regarding the hypotheses growing, this is correct. I am not sure how big of an issue that is - perhaps there could be a way to reset the hypotheses via a user-defined flag? The hypotheses will grow to reflect the input audio, similar to the non-streaming use case. But it's also just text tokens, so perhaps not that big of an issue?

Is the hypothesis the result of inference, or is it something fed to the models? Alternatively, does the growing hypothesis incur a performance penalty on inference at later stages?

Review thread on the pipeline_demo.py changes:

```python
print(transcript, end="", flush=True)
hypothesis = hypos                                        # carry all top-K hypotheses forward
transcript = token_processor(hypos[0][0], lstrip=False)   # full transcript so far
os.system('cls' if os.name == 'nt' else 'clear')          # clear the terminal before reprinting
```
Contributor:

is this necessary?

@lakshmi-speak (Contributor, Author), May 12, 2023:

This is because the transcript is long, so a single-line refresh won't work. Is there any workaround for this other than clearing the screen?

Collaborator:

Can you show a screenshot of the issue that happens with long lines?
I don't think a system call belongs here.

@lakshmi-speak (Contributor, Author):

This example is a transcription of a very long audio file, which spans multiple lines on the screen. Since the transcript now updates every step but keeps all the prior history, if we didn't clear the screen we would end up with repetitions of the same line. Using a carriage return on print works if we are refreshing the same line; here we want to refresh several previously printed lines.
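For what it's worth, one alternative to a full screen clear would be to move the cursor back up with ANSI escape codes. This is only a sketch, not what the PR does; the `reprint` helper and its row bookkeeping are illustrative, and it assumes the transcript only grows:

```python
import shutil

prev_rows = 0  # rows the previous print occupied

def reprint(transcript: str) -> None:
    """Redraw a growing multi-line transcript in place using ANSI escapes."""
    global prev_rows
    cols = shutil.get_terminal_size().columns
    if prev_rows > 1:
        print(f"\033[{prev_rows - 1}F", end="")  # cursor to start of block, n rows up
    elif prev_rows == 1:
        print("\r", end="")                      # same row: return to column 0
    print(transcript, end="", flush=True)
    prev_rows = -(-len(transcript) // cols) or 1  # ceil(len / cols)
```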

@hwangjeff (Contributor):

@lakshmi-speak it seems to work fine without the system call. Can you remove it? The rest of the PR looks good; we can merge it after you make the change.

@lakshmi-speak (Contributor, Author):

@hwangjeff ready to merge!

@lakshmi-speak (Contributor, Author) commented May 12, 2023

> Regarding the hypotheses growing, this is correct. I am not sure how big of an issue that is - perhaps there could be a way to reset the hypotheses via a user-defined flag? The hypotheses will grow to reflect the input audio, similar to the non-streaming use case. But it's also just text tokens, so perhaps not that big of an issue?
>
> Is the hypothesis the result of inference, or is it something fed to the models? Alternatively, does the growing hypothesis incur a performance penalty on inference at later stages?

There isn't any additional performance penalty from the hypotheses growing in length, since the prediction and hypo_predictor_out are one-step. However, since we continue decoding at each step from the top-k hypotheses, we do process k batches of predictions in parallel at the first time-step per input frame.
This is to be expected and is the reason we see superior performance with beam search decoding. Previously, we would use just the top-1 hypothesis at every step, even though the top-k were being predicted.
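In code terms, the difference amounts to carrying the whole beam across segments instead of collapsing it (a conceptual sketch, using the hypos list returned by the decoder):

```python
# Previously: only the single best hypothesis survived between segments,
# discarding the rest of the beam that had already been computed.
hypothesis = hypos[0]

# Now: the full list of top-K hypotheses is carried forward, so a candidate
# that scores worse at this step can still win once more audio arrives.
hypothesis = hypos
```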

@lakshmi-speak lakshmi-speak force-pushed the rnnt_decoder_fix_streaming branch from 34f1bf2 to d5e1210 Compare May 23, 2023 21:07
@facebook-github-bot (Contributor)

@hwangjeff has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

1 similar comment

@github-actions (bot)

Hey @hwangjeff.
You merged this PR, but labels were not properly added. Please add a primary and secondary label (See https://github.com/pytorch/audio/blob/main/.github/process_commit.py).


Some guidance:

Use 'module: ops' for operations under 'torchaudio/{transforms, functional}', and ML-related components under 'torchaudio/csrc' (e.g. RNN-T loss).

Things in "examples" directory:

  • 'recipe' is applicable to training recipes under the 'examples' folder,
  • 'tutorial' is applicable to tutorials under the “examples/tutorials” folder
  • 'example' is applicable to everything else (e.g. C++ examples)
  • 'module: docs' is applicable to code documentations (not to tutorials).
    Regarding examples in code documentations, please also use 'module: docs'.

Please use 'other' tag only when you’re sure the changes are not much relevant to users, or when all other tags are not applicable. Try not to use it often, in order to minimize efforts required when we prepare release notes.


When preparing release notes, please make sure 'documentation' and 'tutorials' occur as the last sub-categories under each primary category like 'new feature', 'improvements' or 'prototype'.

Things related to build are by default excluded from the release note, except when it impacts users. For example:
* Drop support of Python 3.7.
* Add support of Python 3.X.
* Change the way a third party library is bound (so that user needs to install it separately).

@facebook-github-bot (Contributor)

@hwangjeff merged this pull request in 9fc0dca.

hwangjeff added a commit to hwangjeff/audio that referenced this pull request May 26, 2023
Summary: Fixes `RNNTBeamSearch.infer`'s docstring and removes unused import from tutorial.

Differential Revision: D46227174

fbshipit-source-id: 9013b4add6d1c8e3300c3f8cfe4e695429158e8c
hwangjeff added a commit to hwangjeff/audio that referenced this pull request May 31, 2023
Summary:
Pull Request resolved: pytorch#3379

Fixes `RNNTBeamSearch.infer`'s docstring and removes unused import from tutorial.

Reviewed By: mthrok

Differential Revision: D46227174

fbshipit-source-id: 2630295257c43acb14414b700b36939dfe6a8994
hwangjeff added a commit to hwangjeff/audio that referenced this pull request May 31, 2023
Summary:
Pull Request resolved: pytorch#3379

Fixes `RNNTBeamSearch.infer`'s docstring and removes unused import from tutorial.

Reviewed By: mthrok

Differential Revision: D46227174

fbshipit-source-id: 0df50d354c080a26d76274233e78987c8d28d5a5
facebook-github-bot pushed a commit that referenced this pull request May 31, 2023
Summary:
Pull Request resolved: #3379

Fixes `RNNTBeamSearch.infer`'s docstring and removes unused import from tutorial.

Reviewed By: mthrok

Differential Revision: D46227174

fbshipit-source-id: 7c1c3f05a6476cb0437622dea6f3ae6cb3ea9468
@hwangjeff (Contributor)

@lakshmi-speak note that we've merged your PR — thanks for contributing to the library!

facebook-github-bot pushed a commit that referenced this pull request Sep 4, 2023
Summary:
Fixes decoder calls and related code in Device ASR/AVSR tutorials to account for changes to RNN-T decoder introduced in #3295.

Pull Request resolved: #3572

Reviewed By: mthrok

Differential Revision: D48629428

Pulled By: hwangjeff

fbshipit-source-id: 63ede307fb4412aa28f88972d56dca8405607b7a