Enable upload_folder to upload content in chunks #1085
I see 2 things here:
In general I don't find any satisfying solution. If a moon-landing solution is possible, that's good. But if we really have to make several commits, I think documenting how to upload in chunks is the way to go, as we let the user deal with the uneasy part instead of trying to guess what would be best. (Another solution is to advise using the …) WDYT @julien-c @Pierrci @SBrandeis?
Discussion seems similar to #918?
Ah yes, thanks @osanseviero, and sorry for the duplicate. Also related to #920 and nateraw/huggingface-datasets-converter#13.
@osanseviero the error here is different, I guess that's why I missed it.
Pinging @coyotte508 on this (@SBrandeis currently has low bandwidth as he's working on Spaces and billing).
From #918 (comment):
So yes, related even though hidden.
Regarding the … After it's merged, I'll update the …
(also mentioned in this thread)
Pasting a snippet of code to upload a large folder, originally from @thomasw21 on Slack (internal link) (ping @merveenoyan @nateraw if this snippet can be of interest as a temporary workaround for your issues):

```python
from pathlib import Path

from huggingface_hub import HfApi, CommitOperationAdd


def get_all_files(root: Path):
    dirs = [root]
    while len(dirs) > 0:
        dir = dirs.pop()
        for candidate in dir.iterdir():
            if candidate.is_file():
                yield candidate
            if candidate.is_dir():
                dirs.append(candidate)


def get_groups_of_n(n: int, iterator):
    assert n > 1
    buffer = []
    for elt in iterator:
        if len(buffer) == n:
            yield buffer
            buffer = []
        buffer.append(elt)
    if len(buffer) != 0:
        yield buffer


def main():
    api = HfApi()
    root = Path("checkpoint_1007000")
    n = 100
    for i, file_paths in enumerate(get_groups_of_n(n, get_all_files(root))):
        print(f"Committing {file_paths}")
        operations = [
            CommitOperationAdd(path_in_repo=str(file_path), path_or_fileobj=str(file_path))
            for file_path in file_paths
        ]
        api.create_commit(
            repo_id="bigscience/mt0-t5x",
            operations=operations,
            commit_message=f"Upload part {i}",
        )


if __name__ == "__main__":
    main()
```
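A couple of remarks on the snippet above: `path_in_repo=str(file_path)` keeps the local `checkpoint_1007000/` prefix in the repo paths; if the files should instead land at the repo root, passing `str(file_path.relative_to(root))` would presumably be the adjustment to make (not tested here). Also, `n = 100` files per commit is an arbitrary batch size; the point is simply to keep each `create_commit` call well below the payload limits mentioned further down.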
huggingface/hub-docs#348 - the docs for the commit endpoints, with the "application/x-ndjson" content type. Basically, the payload is structured as follows:
There can be multiple files, LFS files, and deleted files, one line for each. Each line is a JSON object. If we add other features to the commit API (e.g. to rename a file, or delete folders), they will follow the same pattern: a plural word for the … There's a maximum of 25k LFS files, and a 1 GB payload.
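For illustration, here is a rough sketch (in Python) of what such an ndjson payload might look like. The field names and value schemas used below (`header`, `file`, `lfsFile`, `deletedFile`, `oid`, etc.) are assumptions made for this example only; hub-docs#348 is the authoritative spec.

```python
import base64
import json

# One JSON object per line; the commit header comes first.
header = {"key": "header", "value": {"summary": "Upload part 0"}}

# A small file uploaded inline (base64-encoded content).
regular_file = {
    "key": "file",
    "value": {
        "path": "config.json",
        "encoding": "base64",
        "content": base64.b64encode(b'{"hidden_size": 768}').decode(),
    },
}

# A large file previously uploaded through the LFS protocol, referenced by its hash.
lfs_file = {
    "key": "lfsFile",
    "value": {"path": "pytorch_model.bin", "algo": "sha256", "oid": "0" * 64, "size": 1234},
}

# A file deletion.
deleted_file = {"key": "deletedFile", "value": {"path": "old_weights.bin"}}

ndjson_payload = "\n".join(
    json.dumps(line) for line in (header, regular_file, lfs_file, deleted_file)
)
print(ndjson_payload)  # would be sent with Content-Type: application/x-ndjson
```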
Thanks a lot @coyotte508 for both the implementation and the documentation around it! I'll start to look into it to integrate that in …
Describe the bug
When trying to convert and upload this dataset using the dataset converter tool, I get the following error in `upload_folder` (see logs). Most datasets on Kaggle are quite large and weirdly structured, so if we want more datasets uploaded with the tool, the library should handle it (maybe by uploading in chunks).
Reproduction
See this notebook and try to convert the above dataset if you decide to run it again (it already has logs as of now).
Logs
System Info