DDUF parser v0.1 #2692
Conversation
clean v0 implementation!
Should work well enough to prototype+validate the format with partners. cc @SunMarc @LysandreJik
```python
    return entries


def write_dduf_file(dduf_path: Union[str, Path], diffuser_path: Union[str, Path]) -> None:
```
I'm considering replacing this with a more low-level helper:

```python
def export_as_dduf(dduf_path: Union[str, Path], entries: Dict[str, Union[str, Path, bytes, BinaryIO]]) -> None:  # maybe more types?
```

Advantages would be:
- being able to serialize on the fly (no need to save to disk and then re-save as DDUF)
- being able to select which files to export. An entire folder might contain extra files, or files for different quantizations, etc. Passing files explicitly allows for more flexibility.

We could still have another helper `export_folder_as_dduf` for convenience, but it would simply be a wrapper around `export_as_dduf`.
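For instance (hypothetical usage of the proposed helper; name and signature are not final):

```python
from huggingface_hub import export_as_dduf  # proposed name, not final

# Keys are the filenames inside the archive; values point to the content
# (a path on disk, raw bytes, or an open binary file).
export_as_dduf(
    "stable-diffusion.dduf",
    entries={
        "model_index.json": "path/to/model_index.json",  # file already on disk
        "vae/config.json": b'{"_class_name": "AutoencoderKL"}',  # bytes serialized on the fly
    },
)
```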
Or `entries` could even be an `Iterable[Tuple[str, Union[str, Path, bytes, BinaryIO]]]`. This way you could pass an iterator that serializes things on the fly (no need to have everything in memory at once).
```python
write_dduf_file("FLUX.1-dev.dduf", diffuser_path="path/to/FLUX.1-dev")
```

Could `diffuser_path` here be renamed to something more general? We want DDUF to be independent of `diffusers`.
It would be nice to see how we can blend this with the current logic of diffusers: https://github.com/huggingface/diffusers/pull/10037/files#:~:text=**save_kwargs)-,if%20dduf_filename%3A,-SunMarc%20marked%20this

Right now, I'm saving the files to the archive right after each model gets saved.
Done in 1b11f0b

Now you can pass a list of files to export:

```python
# Export specific files from the local disk.
>>> from huggingface_hub import export_entries_as_dduf
>>> export_entries_as_dduf(
...     "stable-diffusion-v1-4-FP16.dduf",
...     entries=[  # List entries to add to the DDUF file (here, only FP16 weights)
...         ("model_index.json", "path/to/model_index.json"),
...         ("vae/config.json", "path/to/vae/config.json"),
...         ("vae/diffusion_pytorch_model.fp16.safetensors", "path/to/vae/diffusion_pytorch_model.fp16.safetensors"),
...         ("text_encoder/config.json", "path/to/text_encoder/config.json"),
...         ("text_encoder/model.fp16.safetensors", "path/to/text_encoder/model.fp16.safetensors"),
...         # ... add more entries here
...     ]
... )
```
Or you can even iterate over entries to lazily dump state dicts without saving to disk:

```python
# Export state_dicts one by one from a loaded pipeline
>>> from typing import Generator, Tuple
>>> import safetensors.torch
>>> from diffusers import DiffusionPipeline
>>> from huggingface_hub import export_entries_as_dduf

>>> pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
... # ... do some work with the pipeline

>>> def as_entries(pipe: DiffusionPipeline) -> Generator[Tuple[str, bytes], None, None]:
...     # Build a generator that yields the entries to add to the DDUF file.
...     # The first element of the tuple is the filename in the DDUF archive (must use UNIX separators!).
...     # The second element is the content of the file.
...     # Entries are evaluated lazily when the DDUF file is created (only one entry is loaded in memory at a time).
...     yield "vae/config.json", pipe.vae.to_json_string().encode()
...     yield "vae/diffusion_pytorch_model.safetensors", safetensors.torch.save(pipe.vae.state_dict())
...     yield "text_encoder/config.json", pipe.text_encoder.config.to_json_string().encode()
...     yield "text_encoder/model.safetensors", safetensors.torch.save(pipe.text_encoder.state_dict())
...     # ... add more entries here

>>> export_entries_as_dduf("stable-diffusion-v1-4.dduf", entries=as_entries(pipe))
```
And I've kept exporting an entire folder at once for ease of use:

```python
# Export a folder as a DDUF file
>>> from huggingface_hub import export_folder_as_dduf
>>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev")
```
Thanks for the neat examples and starting this!
Neat packaging idea... not sure what the highest priority is, but a consideration re: zip vs tar. Tar would let you start reading weights into memory without seeking to the end to access the zip central directory (in the footer). To mitigate the lack of direct access to offsets (without reading forward) in tar, you can write a TOC metadata file at the start of the tar. Like an uncompressed zip, a tar file gives you direct access to bytes.
cc @ylow maybe for context on why we opted to go with zip vs. tar ^
@rwightman zip or tar, either way an index structure will be necessary, and zip comes with one as part of the format, so there's no need to invent another. Indeed one additional "seek" might seem necessary to obtain the index structure, but with an HTTP range read you can do `Range: bytes=-100000`, for instance, to fetch the last 100k, so in that sense there is no major difference between reading from the start or the end from a streaming perspective. On the other hand, it's kinda nice to be able to rely on a common container format so that minimal effort is needed to implement readers in other languages.
@ylow FWIW that extra range request is a seek and an extra network RTT (or a few), which is one of the reasons why reading from parquet files when 'streaming' via fsspec is quite a bit slower at scale than just streaming tar (forward streaming only). The server also pays a penalty in extra IOPS. Probably not as big an issue with model weights vs dataset files...
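For illustration, a minimal sketch of the suffix range read described above (the URL is hypothetical, `requests` is an assumed extra dependency, and the server is assumed to honor `Range` headers):

```python
import struct

import requests  # assumed dependency for this sketch

url = "https://example.com/model.dduf"  # hypothetical location

# One suffix range request fetches the archive footer, where the ZIP
# central directory (the index mentioned above) lives.
resp = requests.get(url, headers={"Range": "bytes=-100000"})
resp.raise_for_status()
tail = resp.content

# Locate the end-of-central-directory record (signature PK\x05\x06) to read
# the entry count, central directory size, and its offset in the archive.
# (For ZIP64 archives, as DDUF mandates, these fields hold 0xFFFF/0xFFFFFFFF
# sentinels and the real values live in the ZIP64 EOCD record just before.)
eocd = tail.rfind(b"PK\x05\x06")
n_entries, cd_size, cd_offset = struct.unpack("<HII", tail[eocd + 10 : eocd + 20])
print(f"{n_entries} entries, central directory at byte {cd_offset} ({cd_size} bytes)")
```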
Looks good and clean to me, thanks @Wauplin! Left some comments, nothing major.
```python
...     # Build a generator that yields the entries to add to the DDUF file.
...     # The first element of the tuple is the filename in the DDUF archive (must use UNIX separators!).
...     # The second element is the content of the file.
...     # Entries are evaluated lazily when the DDUF file is created (only one entry is loaded in memory at a time).
...     yield "vae/config.json", pipe.vae.to_json_string().encode()
```
I was confused that this is `pipe.vae.to_json_string()` but then `pipe.text_encoder.config.to_json_string()` below; I think it's correct though, as the vae config is a FrozenDict.
I'd be up for another solution tbh. I just wanted to show that you need to serialize things as bytes by yourself.
Thanks, Lucain! My comments are mostly minor in nature. I think this is ready to 🚢
LGTM! Just a nit.
Thanks everyone for the feedback!
Related to https://github.com/huggingface/huggingface.js/tree/main/packages/dduf/src and huggingface/diffusers#10037.
Goal is to centralize the common bricks to save and load a DDUF file, thereby ensuring the specs are followed (e.g. force ZIP64, no compression, only txt/json/gguf/safetensors files). For now the implementation is based on the built-in `zipfile` module for practical reasons (see the sketch after the TODO list).

TODO:
- ~~update `write_dduf_file` signature to infer the DDUF filename based on model name, quantization, etc. See [FEAT] DDUF format diffusers#10037 (comment)~~ => let's not do it in the end: [FEAT] DDUF format diffusers#10037 (comment)
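A minimal sketch of those spec constraints using the stdlib (not the PR's exact code; filename and content are placeholders):

```python
import zipfile

# DDUF spec constraints: entries are stored uncompressed (ZIP_STORED)
# and ZIP64 is forced so archives and entries can exceed 4GB.
with zipfile.ZipFile("model.dduf", "w", compression=zipfile.ZIP_STORED) as archive:
    with archive.open("model_index.json", "w", force_zip64=True) as fh:
        fh.write(b"{}")  # placeholder content
```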
Short docs: https://moon-ci-docs.huggingface.co/docs/huggingface_hub/pr_2692/en/package_reference/serialization#dduf-file-format
Example: https://huggingface.co/Wauplin/stable-diffusion-v1-4-DDUF
How to write a DDUF file?
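For example, with the `export_folder_as_dduf` helper shown earlier in the thread (paths are placeholders):

```python
>>> from huggingface_hub import export_folder_as_dduf

# Export a full diffusers-style folder as a single DDUF archive.
>>> export_folder_as_dduf("stable-diffusion-v1-4.dduf", folder_path="path/to/stable-diffusion-v1-4")
```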
How to read a DDUF file?
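A sketch of the reading side, assuming the parser exposes a `read_dduf_file` helper returning a mapping of filename to `DDUFEntry` objects (offset/length into the archive, with `read_text()` and `as_mmap()` accessors):

```python
>>> import json
>>> from huggingface_hub import read_dduf_file

# Parse the archive index; each DDUFEntry records the offset and length of a file.
>>> entries = read_dduf_file("stable-diffusion-v1-4.dduf")
>>> json.loads(entries["model_index.json"].read_text())

# Since entries are uncompressed, weights can be memory-mapped directly.
>>> import safetensors.torch
>>> with entries["vae/diffusion_pytorch_model.safetensors"].as_mmap() as mm:
...     state_dict = safetensors.torch.load(mm)
```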