[diffusion] Enable Cache‑DiT config for diffusers backend #16662
mickqian merged 7 commits into sgl-project:main from
Conversation
Summary of Changes
Hello @qimcis, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces support for Cache-DiT acceleration within SGLang's Diffusers backend. It allows users to leverage Cache-DiT for optimizing any Diffusers pipeline by providing a configuration file. The changes involve adding a new command-line argument to specify the Cache-DiT configuration, integrating the Cache-DiT enablement logic directly into the Diffusers pipeline loading process, and updating relevant documentation to guide users on this new feature.
Highlights
Code Review
This pull request introduces support for Cache-DiT acceleration within the diffusers backend. It adds a new CLI argument --cache-dit-config to specify a Cache-DiT configuration file, updates the ServerArgs dataclass to include this configuration, and integrates the Cache-DiT enabling logic into the DiffusersPipeline.
The changes also include minor documentation updates to reflect the new CLI argument and clarify limitations regarding distributed support for Cache-DiT.
```python
        except Exception:
            logger.exception("Failed to enable cache-dit for diffusers pipeline")
            raise
```
The try...except Exception block is very broad. While logger.exception is helpful for debugging, catching a generic Exception can mask specific issues that might arise from cache_dit.enable_cache. Consider catching more specific exceptions if known, or at least adding a comment explaining why a broad exception is necessary here (e.g., due to the external nature of the cache_dit library and its potential to raise various exceptions).
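A minimal sketch of the narrower handling suggested here. The `enable_cache` callable is injected so the sketch runs without the real `cache_dit` package, and the specific exception types are assumptions for illustration, not cache-dit's documented errors:

```python
import logging

logger = logging.getLogger(__name__)


def enable_cache_dit(pipe, cache_kwargs, enable_cache):
    """Enable Cache-DiT on a diffusers pipeline, surfacing known failures.

    `enable_cache` stands in for cache_dit.enable_cache; catching only
    the failure modes we expect (e.g. a malformed config) avoids masking
    unrelated bugs behind a generic `except Exception`.
    """
    try:
        enable_cache(pipe, **cache_kwargs)
    except (ValueError, TypeError):
        # Log with traceback, then re-raise so the server fails fast
        # instead of silently running without caching.
        logger.exception("Failed to enable cache-dit for diffusers pipeline")
        raise
```

If the broad catch must stay (because the external library can raise anything), a comment saying exactly that would at least document the intent.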
```python
        except Exception:
            pass
```
Catching a broad Exception and then passing can hide critical errors, making debugging difficult. If get_text_encoder_from_pipe can fail in expected ways, consider catching those specific exceptions. Otherwise, it's better to log the exception at a debug level or re-raise it if it indicates a serious problem, rather than silently ignoring it.
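One possible shape for that suggestion. `get_text_encoder` here is a stand-in for the PR's helper, and `AttributeError` is an assumed failure mode for pipelines that simply lack a text encoder:

```python
import logging

logger = logging.getLogger(__name__)


def try_get_text_encoder(pipe, get_text_encoder):
    """Return the pipeline's text encoder, or None when lookup fails.

    Instead of `except Exception: pass`, catch only the expected error
    and log it at debug level, so the failure stays visible when
    troubleshooting without spamming normal runs.
    """
    try:
        return get_text_encoder(pipe)
    except AttributeError:
        logger.debug("No text encoder found on pipeline", exc_info=True)
        return None
```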
@DefTruth could you take a look when you're available?

cool~ I'll review it this afternoon
@qimcis @mickqian Thanks for this great work! I have some suggestions as follows:
```python
# Quantization will be treated as a third-party tool for cache-dit, so we do not plan to
# support quant_config for this backend. For quantization-related use cases, we
# recommend directly using sglang (instead of the cache-dit backend) as it delivers
# better performance.
cache_dit.enable_cache(pipe, **cache_dit.load_configs("cache_dit_config.yml"))
```
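For readers unfamiliar with the pattern: `load_configs` expands a YAML file into keyword arguments for `enable_cache`. The fragment below is purely illustrative; the option names are placeholders rather than cache-dit's actual schema, so consult the cache-dit documentation for the real keys:

```yaml
# cache_dit_config.yml: hypothetical option names, for illustration only
cache_type: DBCache
Fn_compute_blocks: 8
Bn_compute_blocks: 0
residual_diff_threshold: 0.08
```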
@qimcis The docs for
```python
    help="Attention backend for diffusers pipelines (e.g., flash, _flash_3_hub, sage, xformers). "
    "See: https://huggingface.co/docs/diffusers/main/en/optimization/attention_backends",
)
parser.add_argument(
```
This might function well, but we use env vars for the non-diffusers backends; introducing such a new arg could cause confusion.
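To make the alternative concrete, here is a sketch of an env-var fallback; the variable name `SGLANG_CACHE_DIT_CONFIG` is made up for illustration (the PR ultimately kept the CLI flag):

```python
import os


def resolve_cache_dit_config(cli_value=None):
    """Resolve the cache-dit config path.

    Prefer an explicit CLI value; otherwise fall back to a hypothetical
    SGLANG_CACHE_DIT_CONFIG environment variable, mirroring how the
    non-diffusers backends are configured via env vars.
    """
    return cli_value or os.environ.get("SGLANG_CACHE_DIT_CONFIG")
```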
🤗 Cache-DiT v1.2.0 Major Release is ready! I recommend that we use this version.
Apologies for the delay, I've had a busy week. I think I've implemented all the requested changes; lmk if I missed anything. After rerunning the tests on 1.2.0 I'm seeing good improvements! Let me know if these are approximately the expected numbers @DefTruth
Hi~ thank you for providing the test data. The overall result LGTM. Could you please also provide the complete test logs at the same time? I can help analyze what causes the difference in the acceleration ratio.
Yup! Attached below: Test 1 Baseline, Test 1 Cache-DiT, Test 2 Baseline, Test 2 Cache-DiT

can you also share the test logs based on cache-dit v1.1.8?
I don't have the logs from when I originally ran the tests, so I just reran them on the previous commit; logs below: Test 1 Baseline, Test 1 Cache-DiT, Test 2 Baseline, Test 2 Cache-DiT
@qimcis @mickqian From the comparison of the time consumption in the Transformers, I believe the results are as expected, because Cache-DiT only affects the Transformer modules. What do you think?

cache-dit v1.1.8

```
# Test 1 Cache-DiT, Transformer ~ 17s
100%|█████████████████████████████████████████████████████████| 50/50 [00:17<00:00, 2.93it/s]
100%|█████████████████████████████████████████████████████████| 50/50 [00:17<00:00, 2.93it/s]
# Test 2 Cache-DiT, Transformer ~ 16s
100%|█████████████████████████████████████████████████████████| 50/50 [00:16<00:00, 2.99it/s]
100%|█████████████████████████████████████████████████████████| 50/50 [00:16<00:00, 2.99it/s]
```

cache-dit v1.2.0

```
# Test 1 Cache-DiT, Transformer ~ 17s
100%|█████████████████████████████████████████████████████████| 50/50 [00:17<00:00, 2.94it/s]
100%|█████████████████████████████████████████████████████████| 50/50 [00:17<00:00, 2.94it/s]
# Test 2 Cache-DiT, Transformer ~ 16s
100%|█████████████████████████████████████████████████████████| 50/50 [00:16<00:00, 2.99it/s]
100%|█████████████████████████████████████████████████████████| 50/50 [00:16<00:00, 2.99it/s]
```
this looks like it could be the case!

@mickqian do you have time to take a look? thanks~
@qimcis hi, can you also test the configs w/o parallel text encoder and parallel vae for this latest commit? Just remove

/tag-and-rerun-ci

Tested the most recent commit on cache-dit 1.2.0 w/o parallel text encoder and parallel vae, with the added --warmup True flag:
e2bec4d to 073f328
Should I add tests in this PR, or make a separate PR for them?

@qimcis you could do it later

Diffusion tests and lint all passed; the failed srt tests are irrelevant.
…nd (sgl-project#16662)
Signed-off-by: Chi <chixie.mcisaac@gmail.com>
Signed-off-by: qimcis <chixie.mcisaac@gmail.com>
Co-authored-by: Mick <mickjagger19@icloud.com>
Motivation
Addressing #16642, enabling Cache‑DiT acceleration for any diffusers pipeline in SGLang by allowing a cache‑dit config file to be passed through the diffusers backend.
Modifications
- `--cache-dit-config` server arg in `python/sglang/multimodal_gen/runtime/server_args.py`
- Cache-DiT enablement wired into pipeline loading in `python/sglang/multimodal_gen/runtime/pipelines/diffusers_pipeline.py`.
- `--diffusers-kwargs` path fixed to avoid `SamplingParams` init errors: `python/sglang/multimodal_gen/configs/sample/sampling_params.py`.

Accuracy Tests
Benchmarking and Profiling
2x RTX Pro 6000 WS
Test 1
Test 2
Checklist