Parquet uploads off-by-one naming scheme #6303

Open
ZachNagengast opened this issue Oct 14, 2023 · 4 comments

@ZachNagengast (Contributor)

Describe the bug

I noticed this numbering scheme not matching up in a different project and wanted to raise it for discussion: what is the proper way to store these files?

[Screenshot of the uploaded parquet filenames showing the numbering mismatch]

The `-SSSSS-of-NNNNN` suffix seems to be used widely across the codebase. The section that creates the filename in my screenshot is here: https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5287
There are also some edits to this section in the single-commit branch.
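
For reference, the filename boils down to something like the following minimal sketch (`shard_name` is a hypothetical helper for illustration; the exact format string in `arrow_dataset.py` may differ):

```python
# Minimal sketch of the current zero-based scheme (illustrative only; the
# real format string in arrow_dataset.py may differ, and the content-hash
# segment of the filename is omitted here).
def shard_name(split: str, index: int, num_shards: int) -> str:
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"

print(shard_name("train", 0, 2))  # train-00000-of-00002.parquet
print(shard_name("train", 1, 2))  # train-00001-of-00002.parquet
```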

Steps to reproduce the bug

  1. Upload a dataset that requires at least two parquet files in it
  2. Observe the naming scheme

Expected behavior

There are a couple of options here:

1. Keeping it as is.

2. Starting the index at 1:
   train-00001-of-00002-{hash}.parquet
   train-00002-of-00002-{hash}.parquet

3. My preferred option (which would solve my specific issue): dropping the total entirely:
   train-00000-{hash}.parquet
   train-00001-{hash}.parquet

Option 3 also solves an issue that will occur with an `append` option for `push_to_hub` (see #6290): as you add a new parquet file, every existing file in the repo has to be renamed as well (see the sketch below).
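
To make that rename cascade concrete, here is a minimal sketch (the helper is hypothetical, not the library's actual code):

```python
# Hypothetical helper showing why the embedded total forces renames on
# append; not the actual code in datasets.
def shard_names(total: int) -> list[str]:
    return [f"train-{i:05d}-of-{total:05d}.parquet" for i in range(total)]

before = shard_names(2)          # train-00000-of-00002.parquet, train-00001-of-00002.parquet
after = shard_names(3)           # appending one shard changes the embedded total...
print(set(before) & set(after))  # set() -- no old filename survives, so every file must be renamed

# Without the "-of-NNNNN" part (train-00000.parquet, ...), the existing
# filenames would stay valid and only the new shard would need uploading.
```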

However, I know there are parts of the repo that use 0 as the starting index or may require the total, so I'm raising the question for discussion.

Environment info

  • datasets version: 2.14.6.dev0
  • Platform: macOS-14.0-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.18.0
  • PyArrow version: 12.0.1
  • Pandas version: 1.5.3
@mariosasko (Collaborator)

You can find the reasoning behind this naming scheme here.

This point has been raised several times, so I'd be okay with starting with 00001- (also to be consistent with the transformers sharding), but I'm not sure @lhoestq agrees.

@lhoestq (Member) commented Oct 16, 2023

We start at 0 in datasets for consistency with Apache Spark, Apache Beam, Dask and others.

Also note that transformers isn't a good reference on this topic. I talked with the maintainers when they added shards, but it had already been released this way. We did find a backward-compatible way for transformers to start at 0, but AFAIK there have been no requests from transformers users to change this.

@julien-c (Member)

not sure it would be a good idea to break the consistency now, IMO

@ZachNagengast (Contributor, Author) commented Oct 16, 2023

Makes sense to start at 0 for plenty of good reasons, so I'm on board.

What about the second part, `-of-0000X`? With the single-commit PR #6269 just merged, there was a note about issues with editing 100+ files (#6269 (comment)).

That would be my last remaining concern in the context of the `push_to_hub(..., append=True)` work to be done, where appending a single file to the full dataset would require renaming every other existing file in the dataset. If that doesn't seem like a big issue for this work, then all the better 👍
