[feat] Add cache-dit support for helios #2473
JasonJ2021 wants to merge 6 commits into vllm-project:main
Conversation
Signed-off-by: JasonJ2021 <jasonj@zju.edu.cn>
Pull request overview
Adds cache-dit acceleration support to the Helios diffusion pipeline, and exposes a minimal CLI surface in the Helios offline example to enable it.
Changes:
- Introduce a Helios-specific cache-dit enabler (enable_cache_for_helios) using BlockAdapter and register it in CUSTOM_DIT_ENABLERS.
- Extend examples/offline_inference/helios/end2end.py with --cache-backend cache_dit and --enable-cache-dit-summary, wiring them into Omni(...).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm_omni/diffusion/cache/cache_dit_backend.py | Adds and registers a Helios cache-dit enabler + refresh logic. |
| examples/offline_inference/helios/end2end.py | Adds CLI flags and passes cache-dit configuration into the Omni diffusion run. |
```python
def enable_cache_for_helios(pipeline: Any, cache_config: Any) -> Callable[[int], None]:
    """Enable cache-dit for Helios pipeline.

    Args:
        pipeline: The Helios pipeline instance.
        cache_config: DiffusionCacheConfig instance with cache configuration.

    Returns:
        A refresh function that can be called to update cache context with new num_inference_steps.
    """
```
The return type annotation and docstring for enable_cache_for_helios don’t match what is actually returned/used. The returned refresh_cache_context expects (pipeline, num_inference_steps, verbose) (as required by CacheDiTBackend.refresh), but the function is annotated as Callable[[int], None] and the doc says it can be called with just num_inference_steps. Update the annotation/docstring (or wrap the function) so the signature is accurate and type-checkable.
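A minimal sketch of one way to make the signature type-checkable, assuming the refresh callback is consumed as (pipeline, num_inference_steps, verbose); the parameter types and the elided bodies are illustrative, not copied from the diff:

```python
from typing import Any, Callable

# Sketch only: the refresh callback shape follows the signature described in this
# comment; Callable[..., None] is a simpler alternative if exact typing is not needed.
RefreshFn = Callable[[Any, int, bool], None]


def enable_cache_for_helios(pipeline: Any, cache_config: Any) -> RefreshFn:
    """Enable cache-dit for the Helios pipeline and return a refresh callback."""

    def refresh_cache_context(pipeline: Any, num_inference_steps: int, verbose: bool) -> None:
        # Invoked by CacheDiTBackend.refresh with (pipeline, num_inference_steps, verbose).
        ...

    return refresh_cache_context
```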
| "BagelPipeline": enable_cache_for_bagel, | ||
| "GlmImagePipeline": enable_cache_for_glm_image, | ||
| "Flux2Pipeline": enable_cache_for_flux2, | ||
| "HeliosPipeline": enable_cache_for_helios, | ||
| } | ||
| ) |
HeliosPipeline was added to CUSTOM_DIT_ENABLERS, but there’s no unit test covering this new custom enabler (unlike the existing HunyuanImage3Pipeline registration test). Add a test in tests/diffusion/cache/test_cache_backends.py to assert the registry entry exists and that enabling on a mocked HeliosPipeline calls cache_dit.enable_cache with a BlockAdapter configured against pipeline.transformer/pipeline.transformer.blocks and that backend.refresh() targets pipeline.transformer.
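A minimal test sketch along these lines, assuming cache_dit is imported as a module object inside cache_dit_backend.py (the patch target and import paths are assumptions; adjust them to however BlockAdapter/enable_cache are actually imported):

```python
from unittest import mock

from vllm_omni.diffusion.cache import cache_dit_backend
from vllm_omni.diffusion.cache.cache_dit_backend import (
    CUSTOM_DIT_ENABLERS,
    enable_cache_for_helios,
)


def test_helios_enabler_is_registered():
    # The registry should route HeliosPipeline to the Helios-specific enabler.
    assert CUSTOM_DIT_ENABLERS["HeliosPipeline"] is enable_cache_for_helios


def test_helios_enabler_calls_cache_dit_enable_cache():
    pipeline = mock.MagicMock()  # stands in for a HeliosPipeline with .transformer.blocks
    cache_config = mock.MagicMock()
    # Patch the cache_dit module as seen by the backend so no real caching is installed.
    with mock.patch.object(cache_dit_backend, "cache_dit") as cache_dit_mock:
        refresh = enable_cache_for_helios(pipeline, cache_config)
    cache_dit_mock.enable_cache.assert_called_once()
    assert callable(refresh)
```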
@wtomin @hsliuustc0106 PTAL, thx~
Can you compare with the original paper and check how the performance can reach up to 20 fps on an H100 device? Is there still any gap between vllm-omni and the recommended inference setup?
Sadly, I don't have access to an H100 machine. However, I can compare the performance of vllm-omni with the official inference framework on an H20 later.
SamitHuang left a comment
Missing test for this new feature. Please add tests in tests/diffusion/cache/test_cache_backends.py for the new HeliosPipeline custom enabler, similar to the existing HunyuanImage3Pipeline tests, to verify the registry and adapter configuration.
```python
def enable_cache_for_helios(pipeline: Any, cache_config: Any) -> Callable[[int], None]:
    """Enable cache-dit for Helios pipeline.
```
The return type annotation Callable[[int], None] does not match the actual returned function signature (pipeline, num_inference_steps, verbose). Please update the annotation to Callable[..., None] or correct the type hint to match.
Signed-off-by: Jiahui Sun <jhsun2020@gmail.com>
Hi, I want to know whether vllm-omni can currently achieve better inference speed compared with the official repo. Can you help the community make the comparison and check the profiling details?
Yeah, I will work on this.
Here is the official repo: https://github.com/PKU-YuanGroup/Helios
Performance Comparison: Helios Base T2V Model

I ran performance tests comparing the official Helios repo with vllm-omni (both with and without cache-dit enabled) for generating a time-lapse video on H20. The results seem to be significantly slower than the official data, even when considering that the H20 has only 1/13th the FP16 TFLOPS of the H100 (148 TFLOPS vs 1,979 TFLOPS).

Test Setup:
Results
Commands:
lishunyang12 left a comment
Review: [feat] Add cache-dit support for helios
Overall this is a clean addition that follows the existing Flux2 enabler pattern well. The 2.57x speedup result is impressive. A few items to address:
Issues
1. Tautological assertion in test (test_cache_backends.py, line ~206)
```python
assert adapter_kwargs["forward_pattern"] == adapter_kwargs["forward_pattern"].__class__.Pattern_2
```

This compares adapter_kwargs["forward_pattern"] with a value derived from itself -- it will always pass regardless of what the value actually is. It should be:

```python
assert adapter_kwargs["forward_pattern"] == ForwardPattern.Pattern_2
```

You need to import ForwardPattern in the test (or reference it from the mock) and assert against the expected enum value directly.
2. Unrelated test change (test_cache_backends.py, line 60)
The change from Fn_compute_blocks: 2 to Fn_compute_blocks: 1 in test_enable_single_transformer is unrelated to Helios support. If this was needed to fix a pre-existing test failure, it should be called out in the PR description. If not, please revert it to keep the PR focused.
Suggestions (non-blocking)
3. Hardcoded cache config in end2end.py
The cache config dict in end2end.py (lines 219-227) hardcodes values like residual_diff_threshold: 0.24, max_continuous_cached_steps: 3, etc. Consider either:
- Exposing key tunables (at least residual_diff_threshold) as CLI arguments (a rough sketch follows below), or
- Adding a comment noting these are tuned defaults for Helios-Base so future users know they may need adjustment for other Helios variants.
This is fine for an initial PR but worth a follow-up.
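For example, a possible follow-up could expose the threshold roughly like this; the flag name and wiring are hypothetical and not part of this PR:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cache-residual-diff-threshold",
    type=float,
    default=0.24,  # the tuned Helios-Base default currently hardcoded in end2end.py
    help="cache-dit residual_diff_threshold; other Helios variants may need a different value",
)
args = parser.parse_args()

# Merge the CLI value into the otherwise-hardcoded cache config dict.
cache_config = {
    "residual_diff_threshold": args.cache_residual_diff_threshold,
    "max_continuous_cached_steps": 3,
}
```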
4. Trailing comma in log string
In enable_cache_for_helios (around line 1251 of cache_dit_backend.py):
f"W={db_cache_config.max_warmup_steps}, "The trailing comma+space at the end of the f-string is a minor cosmetic issue (also present in some existing enablers). Not blocking.
Looks Good
- The enabler correctly uses BlockAdapter with transformer.blocks, ForwardPattern.Pattern_2, has_separate_cfg=True, and check_forward_pattern=True -- appropriate for a Helios-style architecture (see the sketch after this list).
- The refresh function properly handles both plain refresh and SCM mask policy refresh, consistent with other enablers.
- The pipeline.transformer is None guard in both enable and refresh paths is good defensive coding.
- Tests cover the happy path, missing-transformer-on-enable, and missing-transformer-on-refresh scenarios -- good coverage.
- Registration in CUSTOM_DIT_ENABLERS is correct.
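A rough sketch of that enabler shape, assuming cache_dit exposes BlockAdapter/ForwardPattern at the top level and that BlockAdapter accepts the keyword arguments named above; the None-transformer handling and the refresh body are illustrative, not copied from the diff:

```python
from typing import Any, Callable

import cache_dit
from cache_dit import BlockAdapter, ForwardPattern


def enable_cache_for_helios_sketch(pipeline: Any, cache_config: Any) -> Callable[..., None]:
    if pipeline.transformer is None:
        # Guard: nothing to cache without a transformer (actual handling may differ).
        raise ValueError("Helios pipeline has no transformer; cannot enable cache-dit")

    # cache_config values (residual_diff_threshold, warmup steps, ...) would be
    # translated into cache-dit settings here; omitted in this sketch.
    cache_dit.enable_cache(
        BlockAdapter(
            pipe=pipeline,
            transformer=pipeline.transformer,
            blocks=pipeline.transformer.blocks,
            forward_pattern=ForwardPattern.Pattern_2,
            has_separate_cfg=True,
            check_forward_pattern=True,
        )
    )

    def refresh_cache_context(pipeline: Any, num_inference_steps: int, verbose: bool) -> None:
        # Re-arm the cache context on pipeline.transformer for the new step count.
        ...

    return refresh_cache_context
```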
Please fix issue #1 (tautological assertion). Issue #2 needs clarification or revert.
Blocking issues fixed. cc @lishunyang12
Signed-off-by: JasonJ2021 <jasonj@zju.edu.cn>
Can you help add a recipe for this model?
bb0f2d6 to e659b06
Signed-off-by: Jason <72191212+JasonJ2021@users.noreply.github.com>
Yes. Should I add the recipe to the vllm-project/recipes repo?
Please refer to #2645 and submit your PRs.
Purpose
Accelerate Helios model with cache-dit
Test Plan
```bash
python end2end.py \
    --cache-backend cache_dit \
    --enable-cache-dit-summary \
    --model BestWishYsh/Helios-Base \
    --sample-type t2v \
    --prompt "A dynamic time-lapse video showing the rapidly moving scenery from the window of a speeding train." \
    --guidance-scale 5.0 \
    --output helios_t2v_base.mp4
```

Test Result
On a 1xH20 server, E2E time is reduced from 801s to 311s, a 2.57x acceleration.
Final output video:
helios_t2v_base.mp4