Parquet uploads off-by-one naming scheme #6303

Open
ZachNagengast opened this issue Oct 14, 2023 · 4 comments

@ZachNagengast (Contributor)

Describe the bug

I noticed this numbering scheme not matching up in a different project and wanted to raise it for discussion: what is the proper way to store these files?

[Screenshot of the uploaded parquet filenames showing the numbering mismatch]

The `-SSSSS-of-NNNNN` suffix seems to be used widely across the codebase. The section that creates the filename in my screenshot is here: https://github.com/huggingface/datasets/blob/main/src/datasets/arrow_dataset.py#L5287
There are also some edits to this section in the single-commit branch.
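
For reference, the filename boils down to something like the following minimal sketch (`shard_name` is a hypothetical helper for illustration; the exact format string in `arrow_dataset.py` may differ):

```python
# Minimal sketch of the current zero-based scheme (illustrative only; the
# real format string in arrow_dataset.py may differ, and the content-hash
# segment of the filename is omitted here).
def shard_name(split: str, index: int, num_shards: int) -> str:
    return f"{split}-{index:05d}-of-{num_shards:05d}.parquet"

print(shard_name("train", 0, 2))  # train-00000-of-00002.parquet
print(shard_name("train", 1, 2))  # train-00001-of-00002.parquet
```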

Steps to reproduce the bug

  1. Upload a dataset that requires at least two parquet files in it
  2. Observe the naming scheme

Expected behavior

There are a couple of options here:

1. Keeping it as is.

2. Starting the index at 1:
   train-00001-of-00002-{hash}.parquet
   train-00002-of-00002-{hash}.parquet

3. My preferred option (which would solve my specific issue): dropping the total entirely:
   train-00000-{hash}.parquet
   train-00001-{hash}.parquet

Option 3 also solves an issue that will occur with an `append` option for `push_to_hub` (see #6290): as you add a new parquet file, every existing file in the repo has to be renamed as well (see the sketch below).
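
To make that rename cascade concrete, here is a minimal sketch (the helper is hypothetical, not the library's actual code):

```python
# Hypothetical helper showing why the embedded total forces renames on
# append; not the actual code in datasets.
def shard_names(total: int) -> list[str]:
    return [f"train-{i:05d}-of-{total:05d}.parquet" for i in range(total)]

before = shard_names(2)          # train-00000-of-00002.parquet, train-00001-of-00002.parquet
after = shard_names(3)           # appending one shard changes the embedded total...
print(set(before) & set(after))  # set() -- no old filename survives, so every file must be renamed

# Without the "-of-NNNNN" part (train-00000.parquet, ...), the existing
# filenames would stay valid and only the new shard would need uploading.
```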

However, I know there are parts of the repo that use 0 as the starting index or may require the total, so I'm raising the question for discussion.

Environment info

  • datasets version: 2.14.6.dev0
  • Platform: macOS-14.0-arm64-arm-64bit
  • Python version: 3.10.12
  • Huggingface_hub version: 0.18.0
  • PyArrow version: 12.0.1
  • Pandas version: 1.5.3
@mariosasko (Collaborator)

You can find the reasoning behind this naming scheme here.

This point has been raised several times, so I'd be okay with starting with 00001- (also to be consistent with the transformers sharding), but I'm not sure @lhoestq agrees.

@lhoestq (Member) commented Oct 16, 2023

We start at 0 in datasets for consistency with Apache Spark, Apache Beam, Dask and others.

Also note that transformers isn't a good reference on this topic. I talked with the maintainers when they added shards, but it had already been released this way. We did find a backward-compatible way for transformers to start at 0, but AFAIK there have been no requests from transformers users to change this.

@julien-c (Member)

not sure it would be a good idea to break the consistency now, IMO

@ZachNagengast (Contributor, Author) commented Oct 16, 2023

Makes sense to start at 0 for plenty of good reasons, so I'm on board.

What about the second part, `-of-0000X`? With the single-commit PR #6269 just merged, there was a note about issues with editing 100+ files (#6269 (comment)).

That would be my last remaining concern in the context of the `push_to_hub(..., append=True)` work to be done, where appending a single file to the full dataset would require renaming every other existing file in the dataset. If that doesn't seem like a big issue for this work, then all the better 👍
