Parquet uploads off-by-one naming scheme #6303
Comments
We start at 0 in Also note
Not sure it would be a good idea to break the consistency now, IMO.
Makes sense to start at 0 for plenty of good reasons, so I'm on board. What about the second part That would be my last remaining concern in the context of the
Describe the bug
I noticed this numbering scheme not matching up in a different project and wanted to raise it for discussion: what is the proper way to name these stored files?
The -SSSSS-of-NNNNN pattern seems to be used widely across the codebase. The section that creates the part in my screenshot is here: https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5287
There are also some edits to this section in the single commit branch.
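To make the pattern concrete, here is a minimal sketch of how a file name in the -SSSSS-of-NNNNN style could be split into its shard index and total. The regular expression and the helper name `parse_shard_name` are my own assumptions for illustration, not code from the datasets library, and the trailing hash segment is treated as optional.

```python
import re

# Hypothetical pattern for names such as "train-00000-of-00002-abc123.parquet".
# The hash segment after the total is assumed to be optional.
SHARD_RE = re.compile(
    r"^(?P<split>.+)-(?P<index>\d{5})-of-(?P<total>\d{5})"
    r"(?:-(?P<hash>[^.]+))?\.parquet$"
)


def parse_shard_name(name):
    """Return (split, index, total) for a matching name, or None otherwise."""
    match = SHARD_RE.match(name)
    if match is None:
        return None
    return match["split"], int(match["index"]), int(match["total"])
```

With 0-based numbering, the last shard of a two-file split parses to index 1 and total 2, which is the mismatch this issue is about.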
Steps to reproduce the bug
Expected behavior
There are a few options here:
1. Keeping it as is
2. Starting the index at 1:
train-00001-of-00002-{hash}.parquet
train-00002-of-00002-{hash}.parquet
3. My preferred option (which would solve my specific issue), dropping the total entirely:
train-00000-{hash}.parquet
train-00001-{hash}.parquet
This also solves an issue that will occur with an append variable for push_to_hub (see #6290): as you add a new parquet file, you need to rename everything in the repo as well.
However, I know there are parts of the repo that use 0 as the starting index or may require the total, so I am raising the question for discussion.
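A rough sketch of the naming options and of the rename cost when appending. The helpers `shard_names` and `renames_on_append` are hypothetical and only illustrate the file names listed above; they are not how datasets builds names internally, and the hash is left as the literal placeholder {hash}.

```python
def shard_names(split, num_shards, start=0, include_total=True, file_hash="{hash}"):
    """Build shard file names for one split under a given naming scheme."""
    names = []
    for i in range(num_shards):
        index = f"{i + start:05d}"
        if include_total:
            names.append(f"{split}-{index}-of-{num_shards:05d}-{file_hash}.parquet")
        else:
            names.append(f"{split}-{index}-{file_hash}.parquet")
    return names


def renames_on_append(old_count, include_total=True):
    """Count existing files whose names change when one shard is appended."""
    old = shard_names("train", old_count, include_total=include_total)
    new = shard_names("train", old_count + 1, include_total=include_total)
    # zip stops at the shorter list, so only pre-existing files are compared.
    return sum(1 for before, after in zip(old, new) if before != after)
```

Under these assumptions, every existing file has to be renamed on append when the total is part of the name, and none when it is dropped. The sketch keeps the hash fixed, so it only measures renames caused by the total.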
Environment info
datasets version: 2.14.6.dev0