
Automatically revert to last successful commit to hub when a push_to_hub is interrupted #5045

Closed
jorahn opened this issue Sep 29, 2022 · 5 comments · Fixed by #6269
Labels
enhancement New feature or request

Comments

@jorahn

jorahn commented Sep 29, 2022

Is your feature request related to a problem? Please describe.
I pushed a modification of a large dataset (removing a column) to the hub. The push was interrupted after some files had been committed to the repo. This left the dataset raising an error on load_dataset() (ValueError: couldn’t cast … because column names don’t match). Only by specifying the previous (complete) commit as revision=commit_hash in load_dataset() was I able to repair this, and after a successful, complete push, the dataset loads without error again.

Describe the solution you'd like
Would it make sense to detect an incomplete push_to_hub() and automatically revert to the previous commit/revision?

Describe alternatives you've considered
Leave everything as is; the revision parameter in load_dataset() makes it possible to fix this problem manually, as sketched below.
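For reference, a minimal sketch of that manual workaround (the repo id and commit hash below are placeholders, not the real ones):

```python
from datasets import load_dataset

# Pin the load to the last complete commit so the schema is consistent.
ds = load_dataset(
    "username/my-dataset",          # placeholder repo id
    revision="abc123def456",        # placeholder hash of the last successful push
    use_auth_token=True,
)
```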

Additional context
Provide useful defaults

@jorahn jorahn added the enhancement New feature or request label Sep 29, 2022
@lhoestq
Member

lhoestq commented Sep 30, 2022

Could you share the error you got, please? Maybe the full stack trace if you have it?

Maybe push_to_hub could be implemented as a single commit @Wauplin? That way, if it fails, the repo is still at the previous (valid) state instead of ending up in an invalid/incomplete state.

@Wauplin
Contributor

Wauplin commented Sep 30, 2022

Maybe push_to_hub could be implemented as a single commit?

I think that would definitely be the way to go. Do you know the reason why it wasn't implemented like this in the first place? I guess it is because huggingface_hub couldn't upload everything at once, but if there was another reason, please let me know.
About pushing all at once, it seems to be a more and more requested feature. I recently created huggingface/huggingface_hub#1085 about it, but other discussions already happened in the past. The moon-landing team is working on it (cc @coyotte508). The huggingface_hub integration will come afterwards.

For now, maybe it's best to wait for a proper implementation instead of creating a temporary workaround :)

@lhoestq
Member

lhoestq commented Sep 30, 2022

I think that would definitely be the way to go. Do you know the reason why it wasn't implemented like this in the first place? I guess it is because huggingface_hub couldn't upload everything at once, but if there was another reason, please let me know.

Ideally we would want to upload the files iteratively - and then once everything is uploaded we proceed to commit. When we implemented push_to_hub, using upload_file for each shard was the only option.

For more context: for each shard to upload we do:

  1. load the arrow shard in memory
  2. convert to parquet
  3. upload

So to avoid OOM we need to upload the files iteratively.
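A rough sketch of that per-shard loop as it stands today, using huggingface_hub's upload_file (shard_paths, repo_id and the path layout are placeholders, and the exact way datasets reads an arrow shard differs from this illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq
from huggingface_hub import HfApi

api = HfApi()

def push_shards(shard_paths, repo_id, token=None):
    # Only one shard is in memory at a time, which avoids OOM on large datasets.
    for i, path in enumerate(shard_paths):
        table = pa.ipc.open_file(path).read_all()    # 1. load the arrow shard in memory
        parquet_path = f"/tmp/shard-{i:05d}.parquet"
        pq.write_table(table, parquet_path)          # 2. convert to parquet
        api.upload_file(                             # 3. upload (one commit per shard)
            path_or_fileobj=parquet_path,
            path_in_repo=f"data/train-{i:05d}.parquet",
            repo_id=repo_id,
            repo_type="dataset",
            token=token,
        )
```

The drawback discussed above is that each upload_file call creates its own commit, so an interruption leaves the repo with only some of the shards.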

For now, maybe it's best to wait for a proper implementation instead of creating a temporary workaround :)

Let us know if we can help !

@Wauplin
Contributor

Wauplin commented Sep 30, 2022

Ideally we would want to upload the files iteratively - and then once everything is uploaded we proceed to commit.

Oh, I see. So maybe this has to be done in an implementation specific to datasets, as it is not a very common case (uploading a bunch of files on the fly).

You can maybe have a look at how huggingface_hub handles LFS files (arrow shards are LFS anyway, right?).
In upload_lfs_files, LFS files are uploaded one by one (multithreaded) and then the commit is pushed to the Hub once all files have been uploaded. This is pretty much what you need, right?
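For illustration, a rough sketch of that upload-then-single-commit pattern with huggingface_hub's create_commit (the repo id, file paths and num_shards are placeholders; whether this fits the per-shard memory constraints described above is exactly the open question):

```python
from huggingface_hub import CommitOperationAdd, HfApi

api = HfApi()

# One operation per already-converted parquet shard on disk.
operations = [
    CommitOperationAdd(
        path_in_repo=f"data/train-{i:05d}.parquet",
        path_or_fileobj=f"/tmp/shard-{i:05d}.parquet",
    )
    for i in range(num_shards)  # num_shards is a placeholder
]

# LFS payloads are uploaded first; the commit is created only once every
# file is on the Hub, so an interruption leaves the repo at its previous
# (valid) state.
api.create_commit(
    repo_id="username/my-dataset",
    repo_type="dataset",
    operations=operations,
    commit_message="Upload dataset shards in a single commit",
)
```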

I can help if you have questions about how to do it in datasets. If that makes sense, we could then move the implementation from datasets to huggingface_hub once it's mature. Next week I'm on holidays, but feel free to start without my input.

(also cc @coyotte508 and @SBrandeis who implemented LFS upload in hfh)

@jorahn
Author

jorahn commented Sep 30, 2022

Could you share the error you got, please? Maybe the full stack trace if you have it?

Here’s the part of the stack trace that I can reproduce at the moment, from a photo I took (potential typos from OCR):

```
ValueError                                Traceback (most recent call last)
<ipython-input-4-274613b7d3f5> in <module>
      from datasets import load_dataset
      ds = load_dataset('jrahn/chessv6', use_auth_token=True)

/usr/local/lib/python3.7/dist-packages/datasets/table.py in cast_table_to_schema(table, schema)
   2005     raise ValueError(...)

ValueError: Couldn't cast
fen: string
move: string
res: string
eco: string
move_id: int64
res_num: int64
to
{'fen': Value(dtype='string', id=None),
 'move': Value(dtype='string', id=None),
 'res': Value(dtype='string', id=None),
 'eco': Value(dtype='string', id=None),
 'hc': Value(dtype='string', id=None),
 'move_id': Value(dtype='int64', id=None),
 'res_num': Value(dtype='int64', id=None)}
because column names don't match
```

The column 'hc' was removed before the interrupted push_to_hub(). It appears in the expected schema (the list in curly brackets) but not in the actual column list above.

Let me know, if I can be of any help.
