
Allow multiple filenames per partition when separating by metadata #99

Merged
ayushdg merged 7 commits into NVIDIA:main from bug-seperate-by-metadata-fpp
Jun 28, 2024

Collaborator

@ayushdg ayushdg commented Jun 5, 2024

Description

Closes #89

Adds an option to handle partitions that contain more than one filename when writing with write_to_filename=True.
This typically happens when files are read with files_per_partition > 1.

Usage

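from nemo_curator.datasets import DocumentDataset
from nemo_curator.utils.distributed_utils import write_to_disk
dataset = DocumentDataset.read_json(path, files_per_partition=5, add_filename=True)
write_to_disk(dataset.df, output_file_dir=path, write_to_filename=True)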
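
For reviewers who want the idea in isolation, here is a minimal standard-library sketch of writing one output file per distinct filename within a single partition. It is not the code in this PR: the helper name `write_partition_by_filename`, the list-of-dicts partition, and the choice to drop the filename column before writing are all assumptions made for illustration.

```python
import json
import os
import tempfile
from collections import defaultdict


def write_partition_by_filename(records, output_dir, filename_key="filename"):
    """Hypothetical helper: write one JSONL file per distinct filename in a partition."""
    # Group the partition's records by the file they originally came from.
    groups = defaultdict(list)
    for record in records:
        groups[record[filename_key]].append(record)

    written = []
    for filename, rows in groups.items():
        out_path = os.path.join(output_dir, os.path.basename(filename))
        with open(out_path, "w", encoding="utf-8") as f:
            for row in rows:
                # Assumption: the filename column is bookkeeping only, so it is
                # dropped from the written record.
                payload = {k: v for k, v in row.items() if k != filename_key}
                f.write(json.dumps(payload) + "\n")
        written.append(out_path)
    return sorted(written)


# A single partition holding rows from two source files, as can happen
# when files_per_partition > 1.
partition = [
    {"text": "a", "filename": "part_0.jsonl"},
    {"text": "b", "filename": "part_1.jsonl"},
    {"text": "c", "filename": "part_0.jsonl"},
]
output_dir = tempfile.mkdtemp()
written = write_partition_by_filename(partition, output_dir)
```

Grouping first means a partition with several filenames no longer has to be treated as an error; each source file gets its own output file.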

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

@ayushdg ayushdg marked this pull request as ready for review June 6, 2024 18:17
@ayushdg ayushdg requested review from ryantwolf and VibhuJawa June 6, 2024 18:17
@ayushdg ayushdg self-assigned this Jun 6, 2024
Collaborator

@VibhuJawa VibhuJawa left a comment


Thanks for working on this, @ayushdg. This should be helpful across the various modules. I've requested some changes around supporting Parquet.

nemo_curator/utils/distributed_utils.py (outdated, resolved)
nemo_curator/utils/distributed_utils.py (resolved)
@ayushdg ayushdg force-pushed the bug-seperate-by-metadata-fpp branch from 2f72e96 to d2bb9c7 Compare June 26, 2024 23:59
@ayushdg ayushdg requested review from VibhuJawa and removed request for ryantwolf June 28, 2024 17:11
Collaborator

@VibhuJawa VibhuJawa left a comment


LGTM to merge, thanks for working on this @ayushdg

@ayushdg ayushdg merged commit 9e25631 into NVIDIA:main Jun 28, 2024
3 checks passed
@ayushdg ayushdg deleted the bug-seperate-by-metadata-fpp branch June 28, 2024 22:39
Development

Successfully merging this pull request may close these issues.

ValueError: More than one filename found in partition
2 participants