NeMo + Lhotse integration #7880

Merged
merged 8 commits into from
Jan 13, 2024

pzelasko
Collaborator

@pzelasko pzelasko commented Nov 13, 2023

What does this PR do?

This PR adds an option to leverage Lhotse for dataloading in NeMo.

Collection: Currently ASR only, with a clear path to extend to other collections.

Changelog

  • Support creating a Lhotse dataloader from NeMo configuration
  • Support most Lhotse-specific dataloading features, notably dynamic bucketing and dataset multiplexing.
  • Support initializing the Lhotse dataloader from Lhotse manifests, Lhotse Shar (tarred) manifests, NeMo manifests, or NeMo tarred manifests
  • ASR dataloading for the hybrid CTC transducer model as the first example

Usage

  • See the unit tests
  • NeMo's model.{train,validation,test}_ds sections can be extended with additional fields use_lhotse: True/False and lhotse: {...} to enable this.
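
As a rough illustration of the second bullet, the sketch below writes a train_ds section as a plain Python dict with the two new fields. Only use_lhotse and lhotse come from the description above; the keys nested under lhotse (batch_duration, num_buckets) and the manifest path are hypothetical placeholders, so the unit tests and docs remain the reference for the options actually accepted.

```python
# Sketch of a NeMo dataset config section with Lhotse dataloading switched on.
# Only `use_lhotse` and `lhotse` are named in the PR description; every key
# nested under `lhotse` and the manifest path are hypothetical placeholders.
train_ds = {
    "manifest_filepath": "/path/to/train_manifest.json",  # hypothetical path
    "shuffle": True,
    "use_lhotse": True,  # set to False to keep the existing NeMo dataloader
    "lhotse": {
        "batch_duration": 600,  # hypothetical option name and value
        "num_buckets": 30,      # hypothetical option name and value
    },
}


def uses_lhotse(ds_config: dict) -> bool:
    """Mirror a `config.get("use_lhotse")` check: a missing key means False."""
    return bool(ds_config.get("use_lhotse"))


print(uses_lhotse(train_ds))            # True
print(uses_lhotse({"shuffle": False}))  # False
```

Treating a missing use_lhotse key as False keeps existing configs on the current dataloader unless they opt in.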

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

  • New Feature
  • Bugfix
  • Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

  • Related to # (issue)

@github-actions github-actions bot added the ASR label Nov 13, 2023
@nithinraok
Collaborator

jenkins

@pzelasko pzelasko marked this pull request as ready for review December 13, 2023 22:48
@github-actions github-actions bot added the CI label Dec 15, 2023
Contributor

This PR is stale because it has been open for 14 days with no activity. Remove stale label or comment or update or this will be closed in 7 days.

@github-actions github-actions bot added the stale label Dec 30, 2023
@pzelasko pzelasko removed the stale label Jan 2, 2024
Comment on lines 282 to 297
from nemo.collections.asr.data.audio_to_text_lhotse import LhotseSpeechToTextBpeDataset
from nemo.collections.common.data.lhotse import get_lhotse_dataloader_from_config

return get_lhotse_dataloader_from_config(

Check notice

Code scanning / CodeQL

Commented-out code Note

This comment appears to contain commented-out code.
Signed-off-by: Piotr Żelasko <[email protected]>
Collaborator

@titu1994 titu1994 left a comment

Overall it looks quite good to me. A few minor changes are required:

  1. Change all copyright headers in your PR to 2024.
  2. Move all hard-dependency imports to the top of the file, not inside functions or classes.
  3. Move all soft-dependency imports under the try/except pattern; see numba and apex for reference in NeMo.

Final review @VahidooX - from my side this is OK to merge with the above fixes.

@@ -82,13 +82,15 @@ RUN INSTALL_MSG=$(/bin/bash /tmp/torchaudio_build/scripts/installers/install_tor

# install nemo dependencies
WORKDIR /tmp/nemo
ENV LHOTSE_REQUIRE_TORCHAUDIO=0
Collaborator

Is this needed?

Collaborator Author

Yes, it prevents Lhotse from pulling/checking torchaudio.

Collaborator

Does the user need to set it when not using the Docker images?

Collaborator Author

@pzelasko pzelasko Jan 13, 2024

Only if they can't have torchaudio installed for some reason. IIUC the issue with torchaudio mostly stems from upstream docker images not having it pre-built and the need to build from source when building NeMo containers (which may fail for a number of reasons). But outside these docker images it's straightforward to pip/conda install torchaudio alongside torch. WDYT?

Dockerfile Outdated
COPY requirements .
RUN for f in $(ls requirements*.txt); do pip3 install --disable-pip-version-check --no-cache-dir -r $f; done

# install flash attention
RUN pip install flash-attn
# install numba for latest containers
RUN pip install numba>=0.57.1
RUN pip install pyloudnorm
Collaborator

Remove this from here and put it in the ASR requirements file (requirements_asr.txt).

model.train_ds.shuffle=true # optional

# Lhotse dataloading related arguments
+model.train_ds.use_lhotse=True
Collaborator

Use ++ to be safer (in case the key already exists in the config).
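
For readers unfamiliar with the prefixes: in Hydra's override grammar, a bare key=value overrides an existing key, +key=value appends a new key and fails if it already exists, and ++key=value adds the key or overrides it. The sketch below mimics those three cases on a flat dict. It is an illustrative stand-in, not Hydra's implementation: apply_override is a hypothetical helper, keys are flat dotted strings, and values stay raw strings.

```python
def apply_override(cfg: dict, override: str) -> dict:
    """Return a copy of `cfg` with one Hydra-style override applied.

    Illustrative stand-in, not Hydra's implementation: keys are flat dotted
    strings and values are kept as raw strings.
    """
    lhs, value = override.split("=", 1)
    if lhs.startswith("++"):
        # add-or-override: never fails on key presence
        key, must_exist, must_be_new = lhs[2:], False, False
    elif lhs.startswith("+"):
        # append: the key must not be present yet
        key, must_exist, must_be_new = lhs[1:], False, True
    else:
        # plain override: the key must already be present
        key, must_exist, must_be_new = lhs, True, False

    if must_be_new and key in cfg:
        raise KeyError(f"'+{key}' cannot append: key already exists")
    if must_exist and key not in cfg:
        raise KeyError(f"'{key}' cannot override: key is not in the config")

    updated = dict(cfg)
    updated[key] = value
    return updated


cfg = {"model.train_ds.shuffle": "true"}

# `++` works whether or not the key is already present.
cfg = apply_override(cfg, "++model.train_ds.use_lhotse=True")
cfg = apply_override(cfg, "++model.train_ds.use_lhotse=True")
print(cfg["model.train_ds.use_lhotse"])  # True
```

With a single + the second call would raise, which is the failure mode the suggestion guards against.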

}

    def __init__(self, tokenizer):
        from lhotse.dataset import AudioSamples
Collaborator

Put these imports in a try/except at the top, after the NeMo imports. See other examples of import guards.

IMO it's better to see all of a file's imports at the top rather than hunt them down throughout the file.

"""

    def __init__(self, tokenizer):
        from nemo.collections.common.tokenizers.aggregate_tokenizer import AggregateTokenizer
Collaborator

Why are these imported here? They aren't problematic.

@@ -90,6 +90,17 @@ def __init__(self, cfg: DictConfig, trainer=None):
        )

    def _setup_dataloader_from_config(self, config: Optional[Dict]):
        if config.get("use_lhotse"):
            from nemo.collections.asr.data.audio_to_text_lhotse import LhotseSpeechToTextBpeDataset
Collaborator

Put these imports at the top of the file. If they are guarded properly there's no issue in importing the file.

# See the License for the specific language governing permissions and
# limitations under the License.

from .cutset import read_cutset_from_config
Collaborator

No relative imports; please use absolute imports throughout the NeMo codebase.

"""
    logging.info("We will be using a Lhotse DataLoader.")

    from lhotse import CutSet
Collaborator

Import at the top with try/except, then check HAVE_LHOTSE inside the function.
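
A minimal sketch of the pattern being requested, assuming a module-level HAVE_LHOTSE flag as named in the comment. The load_cuts helper and its error message are hypothetical and only show where the flag is checked; this is not the code that was merged.

```python
# Optional ("soft") dependency guard at module top level.
try:
    from lhotse import CutSet

    HAVE_LHOTSE = True
except (ImportError, ModuleNotFoundError):
    CutSet = None
    HAVE_LHOTSE = False


def load_cuts(path: str):
    """Hypothetical helper: load a Lhotse CutSet, failing clearly if Lhotse is absent."""
    if not HAVE_LHOTSE:
        # Fail at call time with an actionable message instead of at import time.
        raise ModuleNotFoundError(
            "Lhotse is required for this dataloading path; install it with `pip install lhotse`."
        )
    return CutSet.from_file(path)
```

With this layout the module imports cleanly without Lhotse installed, and only the code paths that need it raise.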


try:
    from lhotse.lazy import ImitatesDict
except ImportError:
Collaborator

Catch both ImportError and ModuleNotFoundError (some Python versions raise the latter).
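
For context on the two exception types: in Python 3.6 and later, ModuleNotFoundError is a subclass of ImportError, so `except ImportError` already catches it, and listing both is harmless and makes the intent explicit. A quick check, using a deliberately nonexistent (hypothetical) module name:

```python
import importlib

# ModuleNotFoundError derives from ImportError (Python 3.6+).
print(issubclass(ModuleNotFoundError, ImportError))  # True

caught = None
try:
    importlib.import_module("a_module_name_that_should_not_exist_12345")
except ImportError as err:  # also catches ModuleNotFoundError
    caught = type(err).__name__

print(caught)  # ModuleNotFoundError
```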

Signed-off-by: Piotr Żelasko <[email protected]>
@pzelasko
Collaborator Author

pzelasko commented Jan 9, 2024

@titu1994 all suggestions implemented

Collaborator

@titu1994 titu1994 left a comment

Overall looks great, just need @VahidooX's final signoff

@titu1994
Collaborator

titu1994 commented Jan 9, 2024

Needs to pass Jenkins, though.

@pzelasko
Collaborator Author

jenkins

Collaborator

@VahidooX VahidooX left a comment

LGTM generally, just some minor comments!

@@ -82,13 +82,15 @@ RUN INSTALL_MSG=$(/bin/bash /tmp/torchaudio_build/scripts/installers/install_tor

# install nemo dependencies
WORKDIR /tmp/nemo
ENV LHOTSE_REQUIRE_TORCHAUDIO=0
Collaborator

Does the user need to set it when not using the Docker images?

return self._tokenizer(text)


def _identity(x):
Collaborator

Where do you use this function?

Collaborator Author

Good catch, not needed anymore. I removed it.

@pzelasko
Collaborator Author

jenkins

Collaborator

@titu1994 titu1994 left a comment

LGTM, thanks for the awesome work!

@pzelasko pzelasko merged commit 199a8ba into NVIDIA:main Jan 13, 2024
11 checks passed
@pzelasko
Collaborator Author

Thanks guys! Such a joy to click merge after all this work 😀

minitu pushed a commit to minitu/NeMo that referenced this pull request Jan 19, 2024
* Lhotse integration squashed PR

Signed-off-by: Piotr Żelasko <[email protected]>

* Code review - Som

Signed-off-by: Piotr Żelasko <[email protected]>

* Update copyright headers to 2024

Signed-off-by: Piotr Żelasko <[email protected]>

* Fix NLP imports

Signed-off-by: Piotr Żelasko <[email protected]>

* Code review - Vahid

Signed-off-by: Piotr Żelasko <[email protected]>

---------

Signed-off-by: Piotr Żelasko <[email protected]>
ssh-meister pushed a commit to ssh-meister/NeMo that referenced this pull request Feb 15, 2024
Signed-off-by: Sasha Meister <[email protected]>
pablo-garay pushed a commit that referenced this pull request Mar 19, 2024
Signed-off-by: Pablo Garay <[email protected]>
rohitrango pushed a commit to rohitrango/NeMo that referenced this pull request Jun 25, 2024