Relative paths in tarred dataset #2776

Merged
okuchaiev merged 8 commits into NVIDIA:main from nmt-tarred_dataset_relative_paths
Sep 11, 2021

michalivne
Collaborator

This PR adds on-the-fly validation of tarred dataset files.
Changes:

  1. If a file is missing, look for it in the same directory as the metadata file.
  2. If the file still cannot be found, raise an exception.
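
For readers who want a concrete picture of the two changes, here is a minimal sketch of the fallback logic. It is not the code merged in this PR: the function name `resolve_tar_files`, its parameters, and the log message are hypothetical and chosen only for illustration.

```python
import logging
import os
from typing import List

logger = logging.getLogger(__name__)


def resolve_tar_files(tar_files: List[str], metadata_path: str) -> List[str]:
    """Validate tar paths, falling back to the metadata file's directory.

    Hypothetical helper: the name and signature are assumptions, not NeMo API.
    """
    metadata_dir = os.path.dirname(os.path.abspath(metadata_path))
    resolved = []
    updated = 0
    for path in tar_files:
        if not os.path.exists(path):
            # Change 1: look for the same file name next to the metadata file.
            candidate = os.path.join(metadata_dir, os.path.basename(path))
            if not os.path.exists(candidate):
                # Change 2: fail loudly if the file cannot be found at all.
                raise FileNotFoundError(
                    f"Tar file not found at {path} or {candidate}"
                )
            path = candidate
            updated += 1
        resolved.append(path)
    # Report how many paths were rewritten to the metadata directory.
    logger.info("Updated the path of %d tarred dataset file(s)", updated)
    return resolved
```

Under these assumptions, a dataset directory that has been moved together with its metadata file keeps working, because stale absolute paths are re-resolved relative to the metadata file's location.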

Contributor

MaximumEntropy left a comment

LGTM. Thanks!

@okuchaiev okuchaiev merged commit 209a61e into NVIDIA:main Sep 11, 2021
@michalivne michalivne deleted the nmt-tarred_dataset_relative_paths branch September 13, 2021 12:12
paarthneekhara pushed a commit to paarthneekhara/NeMo that referenced this pull request Sep 17, 2021
* 1. Added on-the-fly validation of tarred dataset files. If missing, trying to look for file in same path as metadata file.

Signed-off-by: Micha Livne <[email protected]>

* 1. Fixed style.

Signed-off-by: Micha Livne <[email protected]>

* 1. Debugging.

Signed-off-by: Micha Livne <[email protected]>

* 1. Added logging of the number of updated files.

Signed-off-by: Micha Livne <[email protected]>

Co-authored-by: Micha Livne <[email protected]>
Signed-off-by: Paarth Neekhara <[email protected]>
fayejf pushed a commit that referenced this pull request Sep 22, 2021
jfsantos pushed a commit to jfsantos/NeMo that referenced this pull request Nov 19, 2021
3 participants