
Enable upload_folder to upload content in chunks #1085

Closed
merveenoyan opened this issue Sep 27, 2022 · 11 comments · Fixed by #1117
Labels: bug Something isn't working

@merveenoyan
Contributor

Describe the bug

When trying to convert and upload this dataset using the dataset converter tool, I get the following error in upload_folder (see logs).

Most datasets on Kaggle are quite large and oddly structured, so if we want more datasets uploaded with the tool, the library should handle them (maybe by uploading in chunks).

Reproduction

See this notebook and try to convert the above dataset if you decide to run it again (it already contains the logs as of now).

Logs

/usr/local/lib/python3.7/dist-packages/requests/models.py in raise_for_status(self)
    940         if http_error_msg:
--> 941             raise HTTPError(http_error_msg, response=self)
    942 

HTTPError: 413 Client Error: Payload Too Large for url: https://huggingface.co/api/datasets/merve/bird-species/preupload/main

The above exception was the direct cause of the following exception:

HfHubHTTPError                            Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/ipywidgets/widgets/widget_output.py in inner(*args, **kwargs)
    101                     self.clear_output(*clear_args, **clear_kwargs)
    102                 with self:
--> 103                     return func(*args, **kwargs)
    104             return inner
    105         return capture_decorator

/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in login_token_event(t)
    279         print(f"\t- Kaggle ID: {kaggle_id}")
    280         print(f"\t- Repo ID: {repo_id}")
--> 281         url = kaggle_to_hf(kaggle_id, repo_id)
    282         output.clear_output()
    283         print(f"You can view your dataset here: {url}")

/content/huggingface-datasets-converter/huggingface_datasets_converter/convert.py in kaggle_to_hf(kaggle_id, repo_id, token, unzip, path_in_repo)
    215         upload_file(path_or_fileobj=gitattributes_file.as_posix(), path_in_repo=".gitattributes", repo_id=repo_id, token=token, repo_type='dataset')
    216 
--> 217         upload_folder(folder_path=temp_dir, path_in_repo="", repo_id=repo_id, token=None, repo_type='dataset')
    218     # Try to make dataset card as well!
    219     card = DatasetCard.from_template(

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in upload_folder(self, repo_id, folder_path, path_in_repo, commit_message, commit_description, token, repo_type, revision, create_pr, parent_commit, allow_patterns, ignore_patterns)
   2391             revision=revision,
   2392             create_pr=create_pr,
-> 2393             parent_commit=parent_commit,
   2394         )
   2395 

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/hf_api.py in create_commit(self, repo_id, operations, commit_message, commit_description, token, repo_type, revision, create_pr, num_threads, parent_commit)
   2035                 revision=revision,
   2036                 endpoint=self.endpoint,
-> 2037                 create_pr=create_pr,
   2038             )
   2039         except RepositoryNotFoundError as e:

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_validators.py in _inner_fn(*args, **kwargs)
     92                 validate_repo_id(arg_value)
     93 
---> 94         return fn(*args, **kwargs)
     95 
     96     return _inner_fn

/usr/local/lib/python3.7/dist-packages/huggingface_hub/_commit_api.py in fetch_upload_modes(additions, repo_type, repo_id, token, revision, endpoint, create_pr)
    375         params={"create_pr": "1"} if create_pr else None,
    376     )
--> 377     hf_raise_for_status(resp, endpoint_name="preupload")
    378 
    379     preupload_info = validate_preupload_info(resp.json())

/usr/local/lib/python3.7/dist-packages/huggingface_hub/utils/_errors.py in hf_raise_for_status(response, endpoint_name)
    252         # Convert `HTTPError` into a `HfHubHTTPError` to display request information
    253         # as well (request id and/or server error message)
--> 254         raise HfHubHTTPError(str(HTTPError), response=response) from e
    255 
    256 

HfHubHTTPError: <class 'requests.exceptions.HTTPError'> (Request ID: IdC2Rq6MbaM7tuOR-Q0Kr)

request entity too large

System Info

Using this branch of the converter tool: https://github.com/merveenoyan/huggingface-datasets-converter; the only change is the Hub version: https://github.com/huggingface/huggingface_hub.git@fix-auth-in-lfs-upload
@Wauplin
Contributor

Wauplin commented Sep 27, 2022

I see 2 things here:

  1. the request entity too large happens because the payload with only metadata is too big for the server. This happens if we try to create a commit that adds too many files (thousands?). I think we should definitely catch the HTTP 413 error to make it clearer for the user. Something like "you are trying to upload too many files at the same time, please upload in chunks". I would catch the error in fetch_upload_modes directly (a minimal sketch of what that could look like follows the list below).

  2. As for fixing the issue itself, I see multiple options:

    i. Either we update the upload_folder implementation so that files are uploaded in chunks of X files (X=500?). I am not a big fan of doing this as it raises the question of what to do if the process is interrupted in the middle. Should we delete the first X commits? How do we efficiently keep track of which files have been uploaded and which ones haven't, for retrying?
    ii. Update only fetch_upload_modes to fetch the upload modes in chunks. If that works, it would be perfect as only 1 commit would be created. However, I am quite pessimistic about it as the commit payload itself would be too big as well.
    iii. Or we document how to manually upload a folder with a snippet of code and let the users deal with it (suggesting ways of mitigating the upload issues). For example, we could make _prepare_upload_folder_commit public (with documentation) and suggest a workaround like below:
    from huggingface_hub.hf_api import _prepare_upload_folder_commit, create_commit
    
    def chunker(seq, size):
        # ugly but taken from https://stackoverflow.com/a/434328
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))
    
    operations = _prepare_upload_folder_commit(folder_path, path_in_repo="")
    for chunk in chunker(operations, size=500):
        create_commit(
            repo_type=repo_type,
            repo_id=repo_id,
            operations=chunk,
            commit_message="Uploading 500 files",
        )

    The snippet would have to be reworked to handle exceptions.

    iv. Increase the request payload limit in moon-landing 🙄

    v. Be able to stream the payload to moon-landing? Or have a way to upload in chunks and make only 1 commit once everything is uploaded. Not sure how that would work...
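
As a rough illustration of point 1, here is a minimal sketch (an assumption on my side, not the actual library change; check_preupload_response is a hypothetical helper) of catching the HTTP 413 around the hf_raise_for_status call visible in the traceback above and re-raising it with a clearer message:

from huggingface_hub.utils._errors import HfHubHTTPError, hf_raise_for_status

def check_preupload_response(resp):
    # Hypothetical helper: surface a clearer error when the preupload payload is too large.
    try:
        hf_raise_for_status(resp, endpoint_name="preupload")
    except HfHubHTTPError as e:
        if resp.status_code == 413:
            raise HfHubHTTPError(
                "You are trying to upload too many files at the same time. "
                "Please split the upload into several smaller chunks/commits.",
                response=resp,
            ) from e
        raise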

In general I don't find any satisfying solution. If a moon-landing solution is possible, that's good. But if we really have to make several commits, I think documenting how to upload in chunks is the way to go as we let the user deal with the uneasy part instead of trying to guess what would be best.

(another solution is to advise using the Repository object instead, but is this really something we wanna do? 😕)

WDYT @julien-c @Pierrci @SBrandeis ?

@osanseviero
Contributor

Discussion seems similar to #918?

@Wauplin
Contributor

Wauplin commented Sep 27, 2022

Ah yes thanks @osanseviero and sorry for the duplicate.

Also related to #920 and nateraw/huggingface-datasets-converter#13.

@merveenoyan
Contributor Author

@osanseviero the error here is different, I guess that's why I missed it.

@julien-c
Member

pinging @coyotte508 on this (@SBrandeis currently has low bandwidth as he's working on Spaces and billing)

@Wauplin
Contributor

Wauplin commented Sep 27, 2022

From #918 (comment):

Then likely there is so many files that the 250kB limit is overcome just with the preupload call.
Either the hub library should batch the preupload calls (in chunks of 250 files for example) or we should allow a bigger body on the hub side

So yes, related even though hidden.

@coyotte508
Member

preupload needs to be chunked.

Regarding the /commit call, https://github.com/huggingface/moon-landing/pull/3874 uses the new content type (and modifies the backend a bit)

After it's merged, I'll update the /commit doc, and it should be possible to do very large commits with the hub library.

@Wauplin
Contributor

Wauplin commented Sep 29, 2022

(also mentioned in this thread)

@Wauplin
Contributor

Wauplin commented Sep 29, 2022

Pasting a snippet of code, originally from @thomasw21 on Slack (internal link), to upload a large folder (ping @merveenoyan @nateraw in case this snippet is of interest as a temporary workaround for your issues):

from pathlib import Path
from huggingface_hub import HfApi, CommitOperationAdd

def get_all_files(root: Path):
    dirs = [root]
    while len(dirs) > 0:
        dir = dirs.pop()
        for candidate in dir.iterdir():
            if candidate.is_file():
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)

def get_groups_of_n(n: int, iterator):
    assert n > 1
    buffer = []
    for elt in iterator:
        if len(buffer) == n:
            yield buffer
            buffer = []
        buffer.append(elt)
    if len(buffer) != 0:
        yield buffer

def main():
    api = HfApi()

    root = Path("checkpoint_1007000")
    n = 100

    for i, file_paths in enumerate(get_groups_of_n(n, get_all_files(root))):
        print(f"Committing {file_paths}")
        operations = [
            CommitOperationAdd(path_in_repo=str(file_path), path_or_fileobj=str(file_path))
            for file_path in file_paths
        ]
        api.create_commit(
            repo_id="bigscience/mt0-t5x",
            operations=operations,
            commit_message=f"Upload part {i}",
        )

if __name__ == "__main__":
    main()

@coyotte508
Member

huggingface/hub-docs#348 - the docs for the commit endpoints, with the "application/x-ndjson" content type.

Basically, the content looks like this:

{key: "header", value: {"summary": string, "description"?: string, parentCommit?: string}}
{key: "file", value: { content: string; path: string; encoding?: "utf-8" | "base64"; }}
{key: "deletedFile", value: { path: string }}
{key: "lfsFile", value: { path: string; algo: "sha256"; oid: string; size?: number; }}

There can be multiple files, LFS files, and deleted files, one line for each. Each line is a JSON object. If we add other features to the commit API (e.g. to rename files or delete folders), it will follow the same pattern: a plural word for the application/json content type with an array of objects, and a singular word in the key field for the application/x-ndjson content type with an object in the value field.

There's a maximum of 25k LFS files and a 1GB payload.
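
For illustration, here is a minimal sketch (an assumption, not an official client; build_ndjson_commit is a hypothetical helper) of serializing a commit as application/x-ndjson following the line shapes above:

import base64
import json

def build_ndjson_commit(summary, files, lfs_files=(), deleted_files=()):
    # files: iterable of (path_in_repo, raw bytes)
    # lfs_files: iterable of (path_in_repo, sha256 hex digest, size in bytes)
    lines = [{"key": "header", "value": {"summary": summary}}]
    for path, content in files:
        lines.append({
            "key": "file",
            "value": {
                "path": path,
                "content": base64.b64encode(content).decode(),
                "encoding": "base64",
            },
        })
    for path, oid, size in lfs_files:
        lines.append({
            "key": "lfsFile",
            "value": {"path": path, "algo": "sha256", "oid": oid, "size": size},
        })
    for path in deleted_files:
        lines.append({"key": "deletedFile", "value": {"path": path}})
    # One JSON document per line, as required by application/x-ndjson
    return "\n".join(json.dumps(line) for line in lines)

The resulting string would then presumably be sent to the commit endpoint with a Content-Type: application/x-ndjson header, as described in the docs linked above.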

@Wauplin
Contributor

Wauplin commented Oct 10, 2022

Thanks a lot @coyotte508 for both the implementation and the documentation around it! I'll start looking into integrating that in huggingface_hub :)
