
Enable upload_folder to upload content in chunks #1085

Closed
merveenoyan opened this issue Sep 27, 2022 · 11 comments · Fixed by #1117
Labels: bug Something isn't working

@merveenoyan
Contributor

Describe the bug

When trying to convert and upload this dataset using the dataset converter tool, I get the following error in upload_folder (see logs).

Most datasets on Kaggle are quite large and oddly structured, so if we want more datasets uploaded with the tool, the library should handle them (maybe by uploading in chunks).

Reproduction

See this notebook and try to convert the above dataset if you decide to run it again (it already contains the logs as of now).

Logs

/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942 

HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/datasets/merve/bird-species/preupload/main

The above exception was the direct cause of the following exception:

HfHubHTTPError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/ipywidgets/widgets/widget_output.py in inner(*args, **kwargs)
    101                     self.clear_output(*clear_args, **clear_kwargs)
    102                 with self:
--> 103                     return func(*args, **kwargs)
    104             return inner
    105         return capture_decorator

/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in login_token_event(t)
    279         print(f"\t- Kaggle ID: {kaggle_id}")
    280         print(f"\t- Repo ID: {repo_id}")
--> 281         url = kaggle_to_hf(kaggle_id, repo_id)
    282         output.clear_output()
    283         print(f"You can view your dataset here: {url}")

/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in kaggle_to_hf(kaggle_id, repo_id, token, unzip, path_in_repo)
    215         upload_file(path_or_fileobj=gitattributes_file.as_posix(), path_in_repo=".gitattributes", repo_id=repo_id, token=token, repo_type='dataset')
    216 
--> 217         upload_folder(folder_path=temp_dir, path_in_repo="", repo_id=repo_id, token=None, repo_type='dataset')
    218     # Try to make dataset card as well!
    219     card = DatasetCard.from_template(

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in upload_folder(self, repo_id, folder_path, path_in_repo, commit_message, commit_description, token, repo_type, revision, create_pr, parent_commit, allow_patterns, ignore_patterns)
   2391             revision=revision,
   2392             create_pr=create_pr,
-> 2393             parent_commit=parent_commit,
   2394         )
   2395 

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in create_commit(self, repo_id, operations, commit_message, commit_description, token, repo_type, revision, create_pr, num_threads, parent_commit)
   2035                 revision=revision,
   2036                 endpoint=self.endpoint,
-> 2037                 create_pr=create_pr,
   2038             )
   2039         except RepositoryNotFoundError as e:

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/_commit_api.py in fetch_upload_modes(additions, repo_type, repo_id, token, revision, endpoint, create_pr)
    375         params={"create_pr": "1"} if create_pr else None,
    376     )
--> 377     hf_raise_for_status(resp, endpoint_name="preupload")
    378 
    379     preupload_info = validate_preupload_info(resp.json())

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    252         # Convert `HTTPError` into a `HfHubHTTPError` to display request information
    253         # as well (request id and/or server error message)
--> 254         raise HfHubHTTPError(str(HTTPError), response=response) from e
    255 
    256 

HfHubHTTPError: <class 'requests.exceptions.HTTPError'> (Request ID: IdC2Rq6MbaM7tuOR-Q0Kr)

request entity too large

System Info

Using this branch of the converter tool: https://github.com/merveenoyan/huggingface-datasets-converter; the only change is the Hub version: https://github.com/huggingface/huggingface_hub.git@fix-auth-in-lfs-upload
@Wauplin
Contributor

Wauplin commented Sep 27, 2022

I see 2 things here:

  1. the request entity too large happens because the payload with only metadata is too big for the server. This happens if we try to create a commit that adds too many files (thousands?). I think we should definitely catch the HTTP 413 error to make it clearer for the user. Something like "you are trying to upload too many files at the same time, please upload in chunks". I would catch the error in fetch_upload_modes directly (a minimal sketch of what that could look like follows the list below).

  2. As for fixing the issue itself, I see multiple options:

    i. Either we update the upload_folder implementation so that files are uploaded in chunks of X files (X=500?). I am not a big fan of doing this as it raises the question of what to do if the process is interrupted in the middle. Should we delete the first X commits? How do we efficiently keep track of which files have been uploaded and which ones haven't, for retrying?
    ii. Update only fetch_upload_modes to fetch the upload modes in chunks. If that works, it would be perfect as only 1 commit would be created. However, I am quite pessimistic about it as the commit payload itself would be too big as well.
    iii. Or we document how to manually upload a folder with a snippet of code and let the users deal with it (suggesting ways of mitigating the upload issues). For example, we could make _prepare_upload_folder_commit public (with documentation) and suggest a workaround like below:
    from huggingface_hub.hf_api import _prepare_upload_folder_commit, create_commit
    
    def chunker(seq, size):
        # ugly but taken from https://stackoverflow.com/a/434328
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))
    
    operations = _prepare_upload_folder_commit(folder_path, path_in_repo="")
    for chunk in chunker(operations, size=500):
        create_commit(
            repo_type=repo_type,
            repo_id=repo_id,
            operations=chunk,
            commit_message="Uploading 500 files",
        )

    The snippet would have to be reworked to handle exceptions.

    iv. Increase the request payload limit in moon-landing 🙄

    v. Be able to stream the payload to moon-landing? Or have a way to upload in chunks and make only 1 commit once everything is uploaded. Not sure how that would work...
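
As a rough illustration of point 1, here is a minimal sketch (an assumption on my side, not the actual library change; check_preupload_response is a hypothetical helper) of catching the HTTP 413 around the hf_raise_for_status call visible in the traceback above and re-raising it with a clearer message:

from huggingface_hub.utils._errors import HfHubHTTPError, hf_raise_for_status

def check_preupload_response(resp):
    # Hypothetical helper: surface a clearer error when the preupload payload is too large.
    try:
        hf_raise_for_status(resp, endpoint_name="preupload")
    except HfHubHTTPError as e:
        if resp.status_code == 413:
            raise HfHubHTTPError(
                "You are trying to upload too many files at the same time. "
                "Please split the upload into several smaller chunks/commits.",
                response=resp,
            ) from e
        raise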

In general I don't find any satisfying solution. If a moon-landing solution is possible, that's good. But if we really have to make several commits, I think documenting how to upload in chunks is the way to go as we let the user deal with the uneasy part instead of trying to guess what would be best.

(another solution is to advise using the Repository object instead, but is this really something we wanna do? 😕)

WDYT @julien-c @Pierrci @SBrandeis ?

@osanseviero
Contributor

Discussion seems similar to #918?

@Wauplin
Contributor

Wauplin commented Sep 27, 2022

Ah yes thanks @osanseviero and sorry for the duplicate.

Also related to #920 and nateraw/huggingface-datasets-converter#13.

@merveenoyan
Contributor Author

@osanseviero the error here is different, I guess that's why I missed it.

@julien-c
Member

pinging @coyotte508 on this (@SBrandeis currently has low bandwidth as he's working on Spaces and billing)

@Wauplin
Contributor

Wauplin commented Sep 27, 2022

From #918 (comment):

Then likely there is so many files that the 250kB limit is overcome just with the preupload call.
Either the hub library should batch the preupload calls (in chunks of 250 files for example) or we should allow a bigger body on the hub side

So yes, related even though hidden.

@coyotte508
Member

preupload needs to be chunked.

Regarding the /commit call, https://github.com/huggingface/moon-landing/pull/3874 uses the new content type (and modifies the backend a bit)

After it's merged, I'll update the /commit doc, and it should be possible to do very large commits with the hub library.

@Wauplin
Contributor

Wauplin commented Sep 29, 2022

(also mentioned in this thread)

@Wauplin
Contributor

Wauplin commented Sep 29, 2022

Pasting a snippet of code, originally from @thomasw21 on Slack (internal link), to upload a large folder (ping @merveenoyan @nateraw in case this snippet is of interest as a temporary workaround for your issues):

from pathlib import Path
from huggingface_hub import HfApi, CommitOperationAdd

def get_all_files(root: Path):
    dirs = [root]
    while len(dirs) > 0:
        dir = dirs.pop()
        for candidate in dir.iterdir():
            if candidate.is_file():
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)

def get_groups_of_n(n: int, iterator):
    assert n > 1
    buffer = []
    for elt in iterator:
        if len(buffer) == n:
            yield buffer
            buffer = []
        buffer.append(elt)
    if len(buffer) != 0:
        yield buffer

def main():
    api = HfApi()

    root = Path("checkpoint_1007000")
    n = 100

    for i, file_paths in enumerate(get_groups_of_n(n, get_all_files(root))):
        print(f"Committing {file_paths}")
        operations = [
            CommitOperationAdd(path_in_repo=str(file_path), path_or_fileobj=str(file_path))
            for file_path in file_paths
        ]
        api.create_commit(
            repo_id="bigscience/mt0-t5x",
            operations=operations,
            commit_message=f"Upload part {i}",
        )

if __name__ == "__main__":
    main()

@coyotte508
Member

huggingface/hub-docs#348 - the docs for the commit endpoints, with the "application/x-ndjson" content type.

Basically, the content looks like this:

{key: "header", value: {"summary": string, "description"?: string, parentCommit?: string}}
{key: "file", value: { content: string; path: string; encoding?: "utf-8" | "base64"; }}
{key: "deletedFile", value: { path: string }}
{key: "lfsFile", value: { path: string; algo: "sha256"; oid: string; size?: number; }}

There can be multiple files, LFS files, and deleted files, one line for each. Each line is a JSON object. If we add other features to the commit API (e.g. to rename files or delete folders), it will follow the same pattern: a plural word for the application/json content type with an array of objects, and a singular word in the key field for the application/x-ndjson content type with an object in the value field.

There's a maximum of 25k LFS files and a 1GB payload.
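
For illustration, here is a minimal sketch (an assumption, not an official client; build_ndjson_commit is a hypothetical helper) of serializing a commit as application/x-ndjson following the line shapes above:

import base64
import json

def build_ndjson_commit(summary, files, lfs_files=(), deleted_files=()):
    # files: iterable of (path_in_repo, raw bytes)
    # lfs_files: iterable of (path_in_repo, sha256 hex digest, size in bytes)
    lines = [{"key": "header", "value": {"summary": summary}}]
    for path, content in files:
        lines.append({
            "key": "file",
            "value": {
                "path": path,
                "content": base64.b64encode(content).decode(),
                "encoding": "base64",
            },
        })
    for path, oid, size in lfs_files:
        lines.append({
            "key": "lfsFile",
            "value": {"path": path, "algo": "sha256", "oid": oid, "size": size},
        })
    for path in deleted_files:
        lines.append({"key": "deletedFile", "value": {"path": path}})
    # One JSON document per line, as required by application/x-ndjson
    return "\n".join(json.dumps(line) for line in lines)

The resulting string would then presumably be sent to the commit endpoint with a Content-Type: application/x-ndjson header, as described in the docs linked above.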

@Wauplin
Contributor

Wauplin commented Oct 10, 2022

Thanks a lot @coyotte508 for both the implementation and the documentation around it! I'll start looking into integrating that in huggingface_hub :)
