
Automatically revert to last successful commit to hub when a push_to_hub is interrupted #5045

Closed
jorahn opened this issue Sep 29, 2022 · 5 comments · Fixed by #6269
Labels
enhancement New feature or request

Comments

@jorahn

jorahn commented Sep 29, 2022

Is your feature request related to a problem? Please describe.
I pushed a modification of a large dataset (removing a column) to the hub. The push was interrupted after some files had been committed to the repo. This left the dataset raising an error on load_dataset() (ValueError: couldn’t cast … because column names don’t match). Only by specifying the previous (complete) commit as revision=commit_hash in load_dataset() was I able to repair this, and after a successful, complete push, the dataset loads without error again.

Describe the solution you'd like
Would it make sense to detect an incomplete push_to_hub() and automatically revert to the previous commit/revision?

Describe alternatives you've considered
Leave everything as is; the revision parameter in load_dataset() makes it possible to fix this problem manually, as sketched below.
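For reference, a minimal sketch of that manual workaround (the repo id and commit hash below are placeholders, not the real ones):

```python
from datasets import load_dataset

# Pin the load to the last complete commit so the schema is consistent.
ds = load_dataset(
    "username/my-dataset",          # placeholder repo id
    revision="abc123def456",        # placeholder hash of the last successful push
    use_auth_token=True,
)
```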

Additional context
Provide useful defaults

@jorahn jorahn added the enhancement New feature or request label Sep 29, 2022
@lhoestq
Member

lhoestq commented Sep 30, 2022

Could you share the error you got, please? Maybe the full stack trace if you have it?

Maybe push_to_hub could be implemented as a single commit @Wauplin? That way, if it fails, the repo is still at the previous (valid) state instead of ending up in an invalid/incomplete state.

@Wauplin
Contributor

Wauplin commented Sep 30, 2022

Maybe push_to_hub could be implemented as a single commit?

I think that would definitely be the way to go. Do you know the reason why it wasn't implemented like this in the first place? I guess it is because huggingface_hub couldn't upload everything at once, but if there was another reason, please let me know.
About pushing all at once, it seems to be a more and more requested feature. I recently created huggingface/huggingface_hub#1085 about it, but other discussions already happened in the past. The moon-landing team is working on it (cc @coyotte508). The huggingface_hub integration will come afterwards.

For now, maybe it's best to wait for a proper implementation instead of creating a temporary workaround :)

@lhoestq
Member

lhoestq commented Sep 30, 2022

I think that would definitely be the way to go. Do you know the reason why it wasn't implemented like this in the first place? I guess it is because huggingface_hub couldn't upload everything at once, but if there was another reason, please let me know.

Ideally we would want to upload the files iteratively - and then once everything is uploaded we proceed to commit. When we implemented push_to_hub, using upload_file for each shard was the only option.

For more context: for each shard to upload we do:

  1. load the arrow shard in memory
  2. convert to parquet
  3. upload

So to avoid OOM we need to upload the files iteratively.
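A rough sketch of that per-shard loop as it stands today, using huggingface_hub's upload_file (shard_paths, repo_id and the path layout are placeholders, and the exact way datasets reads an arrow shard differs from this illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

api = HfApi()

def push_shards(shard_paths, repo_id, token=None):
    # Only one shard is in memory at a time, which avoids OOM on large datasets.
    for i, path in enumerate(shard_paths):
        table = pa.ipc.open_file(path).read_all()    # 1. load the arrow shard in memory
        parquet_path = f"/tmp/shard-{i:05d}.parquet"
        pq.write_table(table, parquet_path)          # 2. convert to parquet
        api.upload_file(                             # 3. upload (one commit per shard)
            path_or_fileobj=parquet_path,
            path_in_repo=f"data/train-{i:05d}.parquet",
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
        )
```

The drawback discussed above is that each upload_file call creates its own commit, so an interruption leaves the repo with only some of the shards.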

For now, maybe it's best to wait for a proper implementation instead of creating a temporary workaround :)

Let us know if we can help !

@Wauplin
Contributor

Wauplin commented Sep 30, 2022

Ideally we would want to upload the files iteratively - and then once everything is uploaded we proceed to commit.

Oh, I see. So maybe this has to be done in an implementation specific to datasets, as it is not a very common case (uploading a bunch of files on the fly).

You can maybe have a look at how huggingface_hub handles LFS files (arrow shards are LFS anyway, right?).
In upload_lfs_files, LFS files are uploaded one by one (multithreaded) and then the commit is pushed to the Hub once all files have been uploaded. This is pretty much what you need, right?
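For illustration, a rough sketch of that upload-then-single-commit pattern with huggingface_hub's create_commit (the repo id, file paths and num_shards are placeholders; whether this fits the per-shard memory constraints described above is exactly the open question):

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()

# One operation per already-converted parquet shard on disk.
operations = [
    CommitOperationAdd(
        path_in_repo=f"data/train-{i:05d}.parquet",
        path_or_fileobj=f"/tmp/shard-{i:05d}.parquet",
    )
    for i in range(num_shards)  # num_shards is a placeholder
]

# LFS payloads are uploaded first; the commit is created only once every
# file is on the Hub, so an interruption leaves the repo at its previous
# (valid) state.
api.create_commit(
    repo_id="username/my-dataset",
    repo_type="dataset",
    operations=operations,
    commit_message="Upload dataset shards in a single commit",
)
```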

I can help if you have questions about how to do it in datasets. If that makes sense, we could then move the implementation from datasets to huggingface_hub once it's mature. Next week I'm on holidays, but feel free to start without my input.

(also cc @coyotte508 and @SBrandeis who implemented LFS upload in hfh)

@jorahn
Author

jorahn commented Sep 30, 2022

Could you share the error you got, please? Maybe the full stack trace if you have it?

Here’s the part of the stack trace that I can reproduce at the moment, from a photo I took (potential typos from OCR):

```
ValueError                                Traceback (most recent call last)
<ipython-input-4-274613b7d3f5> in <module>
      from datasets import load_dataset
      ds = load_dataset('jrahn/chessv6', use_auth_token=True)

/usr/local/lib/python3.7/dist-packages/datasets/table.py in cast_table_to_schema(table, schema)
   2005     raise ValueError(...)

ValueError: Couldn't cast
fen: string
move: string
res: string
eco: string
move_id: int64
res_num: int64
to
{'fen': Value(dtype='string', id=None),
 'move': Value(dtype='string', id=None),
 'res': Value(dtype='string', id=None),
 'eco': Value(dtype='string', id=None),
 'hc': Value(dtype='string', id=None),
 'move_id': Value(dtype='int64', id=None),
 'res_num': Value(dtype='int64', id=None)}
because column names don't match
```

The column 'hc' was removed before the interrupted push_to_hub(). It appears in the expected schema (the list in curly brackets) but not in the actual column list above.

Let me know, if I can be of any help.
