DDUF parser v0.1 #2692

Merged: 35 commits merged into main from dduf-parser-v0.1 on Dec 13, 2024
Conversation

@Wauplin (Contributor) commented Dec 4, 2024

Related to https://github.com/huggingface/huggingface.js/tree/main/packages/dduf/src and huggingface/diffusers#10037.

The goal is to centralize the common building blocks for saving and loading a DDUF file, thereby ensuring the spec is followed (e.g. force ZIP64, no compression, only txt/json/gguf/safetensors files). For now the implementation is based on the built-in zipfile module for practical reasons.
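
For illustration, here is a minimal sketch of how those constraints could be enforced with the built-in zipfile module (the helper name and the allow-list constant are assumptions for this sketch, not the PR's actual implementation):

import zipfile
from pathlib import Path
from typing import Iterable, Tuple

ALLOWED_SUFFIXES = (".txt", ".json", ".gguf", ".safetensors")  # per the spec above

def _write_dduf_sketch(dduf_path: str, entries: Iterable[Tuple[str, Path]]) -> None:
    # ZIP_STORED disables compression, as the spec requires.
    with zipfile.ZipFile(dduf_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for archive_name, src_path in entries:
            if not archive_name.endswith(ALLOWED_SUFFIXES):
                raise ValueError(f"File type not allowed in a DDUF archive: {archive_name}")
            # force_zip64=True writes ZIP64 extensions even for small files.
            with zf.open(archive_name, "w", force_zip64=True) as dst:
                dst.write(src_path.read_bytes())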

TODO:

  • Short docs: https://moon-ci-docs.huggingface.co/docs/huggingface_hub/pr_2692/en/package_reference/serialization#dduf-file-format
  • Example: https://huggingface.co/Wauplin/stable-diffusion-v1-4-DDUF

How to write a DDUF file?

>>> from huggingface_hub import export_folder_as_dduf
>>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev")

How to read a DDUF file?

>>> import json
>>> import safetensors.torch
>>> from huggingface_hub import read_dduf_file

# Read DDUF metadata
>>> dduf_entries = read_dduf_file("FLUX.1-dev.dduf")

# Returns a mapping filename <> DDUFEntry
>>> dduf_entries["model_index.json"]
DDUFEntry(filename='model_index.json', offset=66, length=587)

# Load model index as JSON
>>> json.loads(dduf_entries["model_index.json"].read_text())
{'_class_name': 'FluxPipeline', '_diffusers_version': '0.32.0.dev0', '_name_or_path': 'black-forest-labs/FLUX.1-dev', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}

# Load VAE weights using safetensors
>>> with dduf_entries["vae/diffusion_pytorch_model.safetensors"].as_mmap() as mm:
...     state_dict = safetensors.torch.load(mm)

@Wauplin Wauplin marked this pull request as draft December 4, 2024 10:47
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin Wauplin requested a review from sayakpaul December 4, 2024 11:22
@julien-c (Member) left a comment:
clean v0 implementation!

Should work well enough to prototype+validate the format with partners. cc @SunMarc @LysandreJik

return entries


def write_dduf_file(dduf_path: Union[str, Path], diffuser_path: Union[str, Path]) -> None:
@Wauplin (author) commented:

I'm considering replacing this with a more low-level helper:

def export_as_dduf(dduf_path: Union[str, Path], entries: Dict[str, Union[str, Path, bytes, BinaryIO, ... (maybe more?) ]]) -> None:

Advantages would be:

  • the ability to serialize on the fly (no need to save to disk and then re-save as DDUF)
  • the ability to select which files to export. An entire folder might contain extra files, or files for different quantizations, etc. Passing files explicitly allows for more flexibility.

We could still have another helper export_folder_as_dduf for convenience, but it would simply be a wrapper around export_as_dduf.

@Wauplin (author) commented:

Or entries could even be an Iterable[Tuple[str, Union[str, Path, bytes, BinaryIO]]]. This way you could pass an iterator that serializes things on the fly (no need to have everything in memory at once).
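
For illustration, the two helpers discussed above might fit together as follows (a sketch only: the bodies are assumptions, not the PR's implementation):

from pathlib import Path
from typing import BinaryIO, Iterable, Tuple, Union

def export_as_dduf(
    dduf_path: Union[str, Path],
    entries: Iterable[Tuple[str, Union[str, Path, bytes, BinaryIO]]],
) -> None:
    # Consume the iterable lazily so only one entry is materialized at a time.
    ...

def export_folder_as_dduf(dduf_path: Union[str, Path], folder_path: Union[str, Path]) -> None:
    # Convenience wrapper: archive names are UNIX-style paths relative to the folder root.
    folder = Path(folder_path)
    entries = (
        (p.relative_to(folder).as_posix(), p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    )
    export_as_dduf(dduf_path, entries)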

@sayakpaul (Member) commented:

write_dduf_file("FLUX.1-dev.dduf", diffuser_path="path/to/FLUX.1-dev")

Could diffuser_path here be renamed to something more general, as we want DDUF to be independent of diffusers?

@SunMarc (Member) commented:

It would be nice to see how we can blend this with the current logic in diffusers: https://github.com/huggingface/diffusers/pull/10037/files#:~:text=**save_kwargs)-,if%20dduf_filename%3A,-SunMarc%20marked%20this

Right now, I'm saving the files to the archive right after each model gets saved.

@Wauplin (author) commented:

Done in 1b11f0b

Now you can pass a list of files to export:

# Export specific files from the local disk.
>>> from huggingface_hub import export_entries_as_dduf
>>> export_entries_as_dduf(
...     "stable-diffusion-v1-4-FP16.dduf",
...     entries=[ # List entries to add to the DDUF file (here, only FP16 weights)
...         ("model_index.json", "path/to/model_index.json"),
...         ("vae/config.json", "path/to/vae/config.json"),
...         ("vae/diffusion_pytorch_model.fp16.safetensors", "path/to/vae/diffusion_pytorch_model.fp16.safetensors"),
...         ("text_encoder/config.json", "path/to/text_encoder/config.json"),
...         ("text_encoder/model.fp16.safetensors", "path/to/text_encoder/model.fp16.safetensors"),
...         # ... add more entries here
...     ]
... )

Or you can even pass an iterator of entries to lazily dump state dicts without saving them to disk first:

# Export state_dicts one by one from a loaded pipeline 
>>> from diffusers import DiffusionPipeline
>>> from typing import Generator, Tuple
>>> import safetensors.torch
>>> from huggingface_hub import export_entries_as_dduf
>>> pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
... # ... do some work with the pipeline

>>> def as_entries(pipe: DiffusionPipeline) -> Generator[Tuple[str, bytes], None, None]:
...     # Build a generator that yields the entries to add to the DDUF file.
...     # The first element of the tuple is the filename in the DDUF archive (it must use UNIX separators!). The second element is the content of the file.
...     # Entries are evaluated lazily when the DDUF file is created (only one entry is loaded in memory at a time).
...     yield "vae/config.json", pipe.vae.to_json_string().encode()
...     yield "vae/diffusion_pytorch_model.safetensors", safetensors.torch.save(pipe.vae.state_dict())
...     yield "text_encoder/config.json", pipe.text_encoder.config.to_json_string().encode()
...     yield "text_encoder/model.safetensors", safetensors.torch.save(pipe.text_encoder.state_dict())
...     # ... add more entries here

>>> export_entries_as_dduf("stable-diffusion-v1-4.dduf", as_entries=as_entries(pipe))

And I've kept exporting an entire folder at once for ease of use:

# Export a folder as a DDUF file
>>> from huggingface_hub import export_folder_as_dduf
>>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev")

@sayakpaul (Member) left a comment:

Thanks for the neat examples and starting this!

@Wauplin Wauplin changed the title First draft for a DDUF parser DDUF parser v0.1 Dec 4, 2024
@rwightman commented:
Neat packaging idea... not sure where this sits priority-wise, but one consideration re zip vs. tar: tar would let you start reading weights into memory without seeking to the end to access the zip directory (in the footer). To mitigate tar's lack of direct access to offsets (without reading forward), you can write a TOC metadata file at the start of the tar. Like an uncompressed zip, a tar file gives you direct access to bytes.
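
For context, a sketch of the forward-only streaming read the comment refers to (the URL is a placeholder): Python's tarfile can consume a non-seekable stream in "r|" mode, so members are read as they arrive, with no seek to a footer.

import tarfile
import urllib.request

# Hypothetical archive URL, for illustration only.
with urllib.request.urlopen("https://example.com/model.tar") as resp:
    # "r|" = forward-only streaming mode: no seeks, members come in file order.
    with tarfile.open(fileobj=resp, mode="r|") as tar:
        for member in tar:
            if member.name.endswith(".safetensors"):
                data = tar.extractfile(member).read()  # raw bytes of this member
                # ... hand the bytes to a weights loader here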

@julien-c (Member) commented Dec 4, 2024:

cc @ylow maybe for context on why we opted to go with zip vs. tar^

@ylow commented Dec 4, 2024:

@rwightman zip or tar, either way an index structure is necessary, and zip comes with one as part of the format, so there is no need to invent another. It might seem that one additional "seek" is needed to obtain the index structure, but with an HTTP range read you can do "Range: bytes=-100000", for instance, to fetch the last 100 kB, so from a streaming perspective there is no major difference between reading from the start or the end. On the other hand, it's quite nice to be able to rely on a common container format, so that minimal effort is needed to implement readers in other languages.
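
As a concrete illustration of that suffix range read (a sketch: the URL is a placeholder and 100000 bytes is an arbitrary guess at the central directory size):

import urllib.request

# Fetch only the tail of the archive, where zip stores its central directory.
req = urllib.request.Request(
    "https://example.com/model.dduf",       # hypothetical URL
    headers={"Range": "bytes=-100000"},     # suffix range: the last 100,000 bytes
)
with urllib.request.urlopen(req) as resp:
    tail = resp.read()
# The zip end-of-central-directory record can now be located inside `tail`
# and per-file offsets parsed from it, without downloading the whole archive.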

@rwightman commented:

@ylow FWIW, that extra range request is a seek plus an extra network RTT (or a few); it's one of the reasons why reading from parquet files when 'streaming' via fsspec is quite a bit slower at scale than just streaming tar (forward-only). The server also pays a penalty in extra IOPS.

Probably not as big an issue with model weights as with dataset files, though...

@hanouticelina (Contributor) left a comment:

looks good and clean to me, thanks @Wauplin! left some comments, nothing major

... # Build a generator that yields the entries to add to the DDUF file.
... # The first element of the tuple is the filename in the DDUF archive (it must use UNIX separators!). The second element is the content of the file.
... # Entries are evaluated lazily when the DDUF file is created (only one entry is loaded in memory at a time).
... yield "vae/config.json", pipe.vae.to_json_string().encode()
A member commented:
I was confused that this is pipe.vae.to_json_string() but then it's pipe.text_encoder.config.to_json_string() below; I think it's correct, though, as the vae config is a FrozenDict.

@Wauplin (author) replied:
I'd be up for another solution, tbh. I just wanted to show that you need to serialize things to bytes yourself.
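
For instance, one hypothetical way to make both components symmetric, as a replacement line inside the as_entries generator above (a sketch; it assumes pipe.vae.config is a FrozenDict whose values are JSON-serializable):

import json

# Inside as_entries(): serialize the vae config via plain json instead of
# to_json_string(), so the vae and the text encoder are handled the same way.
yield "vae/config.json", json.dumps(dict(pipe.vae.config)).encode()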

@sayakpaul (Member) left a comment:

Thanks, Lucain! My comments are mostly minor in nature. I think this is ready to 🚢

@SunMarc (Member) left a comment:

LGTM! Just a nit.

@hanouticelina hanouticelina self-requested a review December 13, 2024 11:36
@Wauplin (author) commented Dec 13, 2024:

Thanks everyone for the feedback!

@Wauplin Wauplin merged commit 4b0b179 into main Dec 13, 2024
17 checks passed
@Wauplin Wauplin deleted the dduf-parser-v0.1 branch December 13, 2024 11:39