Add LRS3 data preparation #3421
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/audio/3421
Note: Links to docs will display an error until the docs builds have been completed.
❌ 3 new failures as of commit 6231cb1.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
The code looks fine, but IIUC, the preparation uses four different FFmpeg-related packages: torchvision, opencv, ffmpeg, and torchaudio.
This might cause subtle differences in the data processing.
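One rough way to gauge whether this matters in practice is to decode the same clip with two of the backends and compare the raw frames. A minimal sketch, assuming a local test clip at "sample.mp4" (a placeholder path, not a file from this PR):

```python
import cv2
import numpy as np
import torchvision

video_path = "sample.mp4"  # placeholder clip, not part of this PR

# torchvision: returns a uint8 tensor of shape (T, H, W, C) in RGB order.
tv_frames, _, _ = torchvision.io.read_video(video_path, pts_unit="sec")
tv_frames = tv_frames.numpy()

# OpenCV: decodes frame by frame in BGR order, so convert to RGB for comparison.
cap = cv2.VideoCapture(video_path)
cv_frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    cv_frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
cap.release()
cv_frames = np.stack(cv_frames)

n = min(len(tv_frames), len(cv_frames))
print("frame counts:", len(tv_frames), "vs", len(cv_frames))
print("max abs pixel diff:",
      np.abs(tv_frames[:n].astype(np.int16) - cv_frames[:n].astype(np.int16)).max())
```

Differences in frame counts or pixel values would indicate that the backends disagree on seeking, color conversion, or frame timestamps.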
@mthrok has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
Summary: This PR adds a data preparation recipe that uses the ultra face detector to extract full-face video. The resulting video output is then used as input for training and evaluating RNNT-based models for automatic speech recognition (ASR), visual speech recognition (VSR), and audio-visual ASR (AV-ASR) on the LRS3 dataset. This PR also updates the word error rate (WER) for AV-ASR LRS3 models and improves code readability.
Pull Request resolved: pytorch#3421
Reviewed By: mpc001
Differential Revision: D46799748
Pulled By: mthrok
fbshipit-source-id: 28a4fc1251700c739411db216d156e75db5db4fa
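To make the described preparation step concrete, here is a rough sketch of a face-crop loop. It is illustrative only, not the code in this PR: the `detector` callable, the 96x96 output size, and the OpenCV-based I/O are all assumptions standing in for the actual ultra face detector interface and the video I/O used by the recipe.

```python
import cv2

def crop_faces(video_path, output_path, detector, size=96):
    """Crop every frame of a video to a detected face box and write the result.

    `detector` is assumed to map a BGR frame to an (x1, y1, x2, y2) bounding
    box; the real recipe uses an ultra face detector whose interface may differ.
    """
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    writer = cv2.VideoWriter(
        output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (size, size)
    )
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        x1, y1, x2, y2 = detector(frame)
        face = cv2.resize(frame[y1:y2, x1:x2], (size, size))
        writer.write(face)
    cap.release()
    writer.release()
```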
This pull request was exported from Phabricator. Differential Revision: D46799748
Hey @mthrok. Some guidance: Use 'module: ops' for operations under 'torchaudio/{transforms, functional}', and for ML-related components under 'torchaudio/csrc' (e.g. RNN-T loss). Things in the "examples" directory:
Regarding examples in code documentation, please also use 'module: docs'. Please use the 'other' tag only when you are sure the changes are not really relevant to users, or when no other tag applies. Try not to use it often, to minimize the effort required when we prepare release notes. When preparing release notes, please make sure 'documentation' and 'tutorials' occur as the last sub-categories under each primary category, such as 'new feature', 'improvements', or 'prototype'. Things related to the build are excluded from the release notes by default, except when they impact users. For example:
This PR adds a data preparation recipe that uses the ultra face detector to extract full-face video. The resulting video output is then used as input for training and evaluating RNNT-based models for automatic speech recognition (ASR), visual speech recognition (VSR), and audio-visual ASR (AV-ASR) on the LRS3 dataset.
This PR also updates the word error rate (WER) for AV-ASR LRS3 models and improves code readability.
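For context on the WER numbers being updated: WER is the word-level edit distance (substitutions, deletions, insertions) between hypothesis and reference, normalized by the number of reference words. A minimal sketch using torchaudio's edit_distance utility (the example strings are made up):

```python
import torchaudio.functional as F

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    return F.edit_distance(ref_words, hyp_words) / len(ref_words)

print(wer("the cat sat on the mat", "the cat sat on a mat"))  # 1/6 ≈ 0.167
```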