DDUF parser v0.1 #2692

Merged: 35 commits merged into main from dduf-parser-v0.1 on Dec 13, 2024
Conversation

@Wauplin (Contributor) commented Dec 4, 2024

Related to https://github.com/huggingface/huggingface.js/tree/main/packages/dduf/src and huggingface/diffusers#10037.

The goal is to centralize the common building blocks for saving and loading a DDUF file, thereby ensuring the spec is followed (e.g. force ZIP64, no compression, only txt/json/gguf/safetensors files). For now the implementation is based on the built-in zipfile module for practical reasons.
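
For illustration, here is a minimal sketch of how those constraints could be enforced with the built-in zipfile module (the helper name and the allow-list constant are assumptions for this sketch, not the PR's actual implementation):

import zipfile
from pathlib import Path
from typing import Iterable, Tuple

ALLOWED_SUFFIXES = (".txt", ".json", ".gguf", ".safetensors")  # per the spec above

def _write_dduf_sketch(dduf_path: str, entries: Iterable[Tuple[str, Path]]) -> None:
    # ZIP_STORED disables compression, as the spec requires.
    with zipfile.ZipFile(dduf_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for archive_name, src_path in entries:
            if not archive_name.endswith(ALLOWED_SUFFIXES):
                raise ValueError(f"File type not allowed in a DDUF archive: {archive_name}")
            # force_zip64=True writes ZIP64 extensions even for small files.
            with zf.open(archive_name, "w", force_zip64=True) as dst:
                dst.write(src_path.read_bytes())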

TODO:

  • Short docs: https://moon-ci-docs.huggingface.co/docs/huggingface_hub/pr_2692/en/package_reference/serialization#dduf-file-format
  • Example: https://huggingface.co/Wauplin/stable-diffusion-v1-4-DDUF

How to write a DDUF file?

>>> from huggingface_hub import export_folder_as_dduf
>>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev")

How to read a DDUF file?

>>> import json
>>> import safetensors.torch
>>> from huggingface_hub import read_dduf_file

# Read DDUF metadata
>>> dduf_entries = read_dduf_file("FLUX.1-dev.dduf")

# Returns a mapping filename <> DDUFEntry
>>> dduf_entries["model_index.json"]
DDUFEntry(filename='model_index.json', offset=66, length=587)

# Load model index as JSON
>>> json.loads(dduf_entries["model_index.json"].read_text())
{'_class_name': 'FluxPipeline', '_diffusers_version': '0.32.0.dev0', '_name_or_path': 'black-forest-labs/FLUX.1-dev', 'scheduler': ['diffusers', 'FlowMatchEulerDiscreteScheduler'], 'text_encoder': ['transformers', 'CLIPTextModel'], 'text_encoder_2': ['transformers', 'T5EncoderModel'], 'tokenizer': ['transformers', 'CLIPTokenizer'], 'tokenizer_2': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'FluxTransformer2DModel'], 'vae': ['diffusers', 'AutoencoderKL']}

# Load VAE weights using safetensors
>>> with dduf_entries["vae/diffusion_pytorch_model.safetensors"].as_mmap() as mm:
...     state_dict = safetensors.torch.load(mm)

@Wauplin Wauplin marked this pull request as draft December 4, 2024 10:47
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin Wauplin requested a review from sayakpaul December 4, 2024 11:22
@julien-c (Member) left a comment:
clean v0 implementation!

Should work well enough to prototype+validate the format with partners. cc @SunMarc @LysandreJik

return entries


def write_dduf_file(dduf_path: Union[str, Path], diffuser_path: Union[str, Path]) -> None:
@Wauplin (author) commented:

I'm considering replacing this with a more low-level helper:

def export_as_dduf(dduf_path: Union[str, Path], entries: Dict[str, Union[str, Path, bytes, BinaryIO, ... (maybe more?) ]]) -> None:

Advantages would be:

  • the ability to serialize on the fly (no need to save to disk and then re-save as DDUF)
  • the ability to select which files to export. An entire folder might contain extra files, or files for different quantizations, etc. Passing files explicitly allows for more flexibility.

We could still have another helper export_folder_as_dduf for convenience, but it would simply be a wrapper around export_as_dduf.

@Wauplin (author) commented:

Or entries could even be an Iterable[Tuple[str, Union[str, Path, bytes, BinaryIO]]]. This way you could pass an iterator that serializes things on the fly (no need to have everything in memory at once).
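
For illustration, the two helpers discussed above might fit together as follows (a sketch only: the bodies are assumptions, not the PR's implementation):

from pathlib import Path
from typing import BinaryIO, Iterable, Tuple, Union

def export_as_dduf(
    dduf_path: Union[str, Path],
    entries: Iterable[Tuple[str, Union[str, Path, bytes, BinaryIO]]],
) -> None:
    # Consume the iterable lazily so only one entry is materialized at a time.
    ...

def export_folder_as_dduf(dduf_path: Union[str, Path], folder_path: Union[str, Path]) -> None:
    # Convenience wrapper: archive names are UNIX-style paths relative to the folder root.
    folder = Path(folder_path)
    entries = (
        (p.relative_to(folder).as_posix(), p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    )
    export_as_dduf(dduf_path, entries)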

@sayakpaul (Member) commented:

write_dduf_file("FLUX.1-dev.dduf", diffuser_path="path/to/FLUX.1-dev")

Could diffuser_path here be renamed to something more general, as we want DDUF to be independent of diffusers?

@SunMarc (Member) commented:

It would be nice to see how we can blend this with the current logic in diffusers: https://github.com/huggingface/diffusers/pull/10037/files#:~:text=**save_kwargs)-,if%20dduf_filename%3A,-SunMarc%20marked%20this

Right now, I'm saving the files to the archive right after each model gets saved.

@Wauplin (author) commented:

Done in 1b11f0b

Now you can pass a list of files to export:

# Export specific files from the local disk.
>>> from huggingface_hub import export_entries_as_dduf
>>> export_entries_as_dduf(
...     "stable-diffusion-v1-4-FP16.dduf",
...     entries=[ # List entries to add to the DDUF file (here, only FP16 weights)
...         ("model_index.json", "path/to/model_index.json"),
...         ("vae/config.json", "path/to/vae/config.json"),
...         ("vae/diffusion_pytorch_model.fp16.safetensors", "path/to/vae/diffusion_pytorch_model.fp16.safetensors"),
...         ("text_encoder/config.json", "path/to/text_encoder/config.json"),
...         ("text_encoder/model.fp16.safetensors", "path/to/text_encoder/model.fp16.safetensors"),
...         # ... add more entries here
...     ]
... )

Or you can even pass an iterator of entries to lazily dump state dicts without saving them to disk first:

# Export state_dicts one by one from a loaded pipeline 
>>> from diffusers import DiffusionPipeline
>>> from typing import Generator, Tuple
>>> import safetensors.torch
>>> from huggingface_hub import export_entries_as_dduf
>>> pipe = DiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
... # ... do some work with the pipeline

>>> def as_entries(pipe: DiffusionPipeline) -> Generator[Tuple[str, bytes], None, None]:
...     # Build a generator that yields the entries to add to the DDUF file.
...     # The first element of the tuple is the filename in the DDUF archive (it must use UNIX separators!). The second element is the content of the file.
...     # Entries are evaluated lazily when the DDUF file is created (only one entry is loaded in memory at a time).
...     yield "vae/config.json", pipe.vae.to_json_string().encode()
...     yield "vae/diffusion_pytorch_model.safetensors", safetensors.torch.save(pipe.vae.state_dict())
...     yield "text_encoder/config.json", pipe.text_encoder.config.to_json_string().encode()
...     yield "text_encoder/model.safetensors", safetensors.torch.save(pipe.text_encoder.state_dict())
...     # ... add more entries here

>>> export_entries_as_dduf("stable-diffusion-v1-4.dduf", as_entries=as_entries(pipe))

And I've kept exporting an entire folder at once for ease of use:

# Export a folder as a DDUF file
>>> from huggingface_hub import export_folder_as_dduf
>>> export_folder_as_dduf("FLUX.1-dev.dduf", folder_path="path/to/FLUX.1-dev")

@sayakpaul (Member) left a comment:

Thanks for the neat examples and starting this!

@Wauplin Wauplin changed the title First draft for a DDUF parser DDUF parser v0.1 Dec 4, 2024
@rwightman commented:
Neat packaging idea... not sure where this sits priority-wise, but one consideration re zip vs. tar: tar would let you start reading weights into memory without seeking to the end to access the zip directory (in the footer). To mitigate tar's lack of direct access to offsets (without reading forward), you can write a TOC metadata file at the start of the tar. Like an uncompressed zip, a tar file gives you direct access to bytes.
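
For context, a sketch of the forward-only streaming read the comment refers to (the URL is a placeholder): Python's tarfile can consume a non-seekable stream in "r|" mode, so members are read as they arrive, with no seek to a footer.

import tarfile
import urllib.request

# Hypothetical archive URL, for illustration only.
with urllib.request.urlopen("https://example.com/model.tar") as resp:
    # "r|" = forward-only streaming mode: no seeks, members come in file order.
    with tarfile.open(fileobj=resp, mode="r|") as tar:
        for member in tar:
            if member.name.endswith(".safetensors"):
                data = tar.extractfile(member).read()  # raw bytes of this member
                # ... hand the bytes to a weights loader here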

@julien-c (Member) commented Dec 4, 2024:

cc @ylow maybe for context on why we opted to go with zip vs. tar^

@ylow commented Dec 4, 2024:

@rwightman zip or tar, either way an index structure is necessary, and zip comes with one as part of the format, so there is no need to invent another. It might seem that one additional "seek" is needed to obtain the index structure, but with an HTTP range read you can do "Range: bytes=-100000", for instance, to fetch the last 100 kB, so from a streaming perspective there is no major difference between reading from the start or the end. On the other hand, it's quite nice to be able to rely on a common container format, so that minimal effort is needed to implement readers in other languages.
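
As a concrete illustration of that suffix range read (a sketch: the URL is a placeholder and 100000 bytes is an arbitrary guess at the central directory size):

import urllib.request

# Fetch only the tail of the archive, where zip stores its central directory.
req = urllib.request.Request(
    "https://example.com/model.dduf",       # hypothetical URL
    headers={"Range": "bytes=-100000"},     # suffix range: the last 100,000 bytes
)
with urllib.request.urlopen(req) as resp:
    tail = resp.read()
# The zip end-of-central-directory record can now be located inside `tail`
# and per-file offsets parsed from it, without downloading the whole archive.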

@rwightman commented:

@ylow FWIW, that extra range request is a seek plus an extra network RTT (or a few); it's one of the reasons why reading from parquet files when 'streaming' via fsspec is quite a bit slower at scale than just streaming tar (forward-only). The server also pays a penalty in extra IOPS.

Probably not as big an issue with model weights as with dataset files, though...

@hanouticelina (Contributor) left a comment:

looks good and clean to me, thanks @Wauplin! left some comments, nothing major

... # Build a generator that yields the entries to add to the DDUF file.
... # The first element of the tuple is the filename in the DDUF archive (it must use UNIX separators!). The second element is the content of the file.
... # Entries are evaluated lazily when the DDUF file is created (only one entry is loaded in memory at a time).
... yield "vae/config.json", pipe.vae.to_json_string().encode()
A member commented:
I was confused that this is pipe.vae.to_json_string() but then it's pipe.text_encoder.config.to_json_string() below; I think it's correct, though, as the vae config is a FrozenDict.

@Wauplin (author) replied:
I'd be up for another solution, tbh. I just wanted to show that you need to serialize things to bytes yourself.
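
For instance, one hypothetical way to make both components symmetric, as a replacement line inside the as_entries generator above (a sketch; it assumes pipe.vae.config is a FrozenDict whose values are JSON-serializable):

import json

# Inside as_entries(): serialize the vae config via plain json instead of
# to_json_string(), so the vae and the text encoder are handled the same way.
yield "vae/config.json", json.dumps(dict(pipe.vae.config)).encode()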

@sayakpaul (Member) left a comment:

Thanks, Lucain! My comments are mostly minor in nature. I think this is ready to 🚢

@SunMarc (Member) left a comment:

LGTM! Just a nit.

@hanouticelina hanouticelina self-requested a review December 13, 2024 11:36
@Wauplin (author) commented Dec 13, 2024:

Thanks everyone for the feedback!

@Wauplin Wauplin merged commit 4b0b179 into main Dec 13, 2024
17 checks passed
@Wauplin Wauplin deleted the dduf-parser-v0.1 branch December 13, 2024 11:39