[Enhancement] Add force_refresh support for GLM-Image for cache-dit 1.3.0 upgrade#1858
Conversation
Signed-off-by: samithuang <285365963@qq.com>
perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)

Two optimizations that eliminate ~6.5s of IPC serialization overhead for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online serving mode:

Phase 1 – Inline diffusion (eliminate Hop3): When there is exactly one diffusion stage in async mode, initialize OmniDiffusion directly in the orchestrator process instead of spawning a stage worker subprocess. This removes the entire Hop3 serialization path (pickle + mp.Queue/SHM) between the stage worker and orchestrator. GPU workers for tensor parallelism are still spawned by DiffusionExecutor.

Phase 2 – SHM tensor transfer (optimize Hop1): Replace pickle-based serialization of large tensors through MessageQueue with POSIX shared memory. The worker copies tensor data into a named SHM segment and enqueues only lightweight metadata; the scheduler reconstructs the tensor from SHM. This reduces Hop1 overhead from ~3.4s to ~1.5s.

Measured on Wan2.2-I2V-A14B (TP=2, 1280x720, 5s@16fps, 1 step):
- Before: e2e = 37.5s
- Phase 1: e2e = 33.1s (−4.4s)
- Phase 2: e2e = 31.0s (−2.1s)
- Total: e2e = 31.0s (−6.5s, −17.5%)

Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>
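The Phase 2 hand-off described above (copy the tensor into a named SHM segment, enqueue only metadata, reconstruct on the other side) can be sketched with Python's standard `multiprocessing.shared_memory`. This is a minimal illustrative sketch, not the PR's actual implementation; the function and segment names are invented for the example.

```python
import numpy as np
from multiprocessing import shared_memory

def send_tensor_via_shm(arr: np.ndarray, name: str) -> dict:
    # Copy the tensor bytes into a named SHM segment and return only
    # lightweight metadata, which is cheap to pickle through a queue.
    shm = shared_memory.SharedMemory(create=True, name=name, size=arr.nbytes)
    view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
    view[:] = arr  # one memcpy into shared memory
    shm.close()  # the segment stays alive by name until unlinked
    return {"shm_name": name, "shape": arr.shape, "dtype": str(arr.dtype)}

def recv_tensor_via_shm(meta: dict) -> np.ndarray:
    # Reconstruct the tensor from the metadata, copy it out, and free
    # the segment once it has been consumed.
    shm = shared_memory.SharedMemory(name=meta["shm_name"])
    view = np.ndarray(meta["shape"], dtype=np.dtype(meta["dtype"]), buffer=shm.buf)
    out = view.copy()
    shm.close()
    shm.unlink()
    return out

latents = np.random.rand(4, 16, 90, 160).astype(np.float32)
meta = send_tensor_via_shm(latents, "omni_demo_latents")
restored = recv_tensor_via_shm(meta)
assert np.array_equal(latents, restored)
```

Only `meta` crosses the queue; the large latent tensor never passes through pickle, which is where the ~3.4s → ~1.5s Hop1 reduction comes from.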
Signed-off-by: Samit <285365963@qq.com>
Upgrade cache-dit dependency to the latest release (1.3.0). All existing imports and APIs remain compatible. Verified with Qwen-Image offline inference showing ~2x speedup with cache-dit acceleration.

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>
…Image

Add force_refresh_step_hint and force_refresh_step_policy to DiffusionCacheConfig and wire them through to DBCacheConfig. Register custom cache-dit enablers for HeliosPipeline, HeliosPyramidPipeline, and GlmImagePipeline.

- Helios: the multi-chunk denoise loop requires a cache reset between chunks, so force_refresh_step_hint defaults to num_inference_steps and force_refresh_step_policy defaults to "repeat".
- GLM-Image: editing mode preprocesses the input image in one extra transformer call; force_refresh_step_hint=1 discards the stale state.

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>
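One way to read the two refresh modes described above is as a per-call predicate. The option names come from the PR; the decision logic below is an illustrative interpretation, not cache-dit's actual implementation.

```python
def should_force_refresh(call_index: int, hint, policy: str = "once") -> bool:
    # Illustrative interpretation of the semantics described in the PR:
    # hint=None      -> never force-refresh (GLM-Image text-to-image mode).
    # policy="once"  -> refresh only at transformer call `hint`; hint=1
    #                   discards the state left by GLM-Image's editing-mode
    #                   preprocessing call (call 0).
    # policy="repeat"-> refresh every `hint` calls; hint=num_inference_steps
    #                   resets the cache at each Helios chunk boundary.
    if hint is None:
        return False
    if policy == "repeat":
        return call_index > 0 and call_index % hint == 0
    return call_index == hint

# GLM-Image editing: refresh only on the call right after preprocessing.
assert should_force_refresh(1, 1, "once")
assert not should_force_refresh(2, 1, "once")
# Helios with 25-step chunks: refresh at each chunk boundary.
assert should_force_refresh(25, 25, "repeat")
assert not should_force_refresh(26, 25, "repeat")
```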
Add --cache-backend and --enable-cache-dit-summary CLI arguments to the GLM-Image offline inference example, enabling cache-dit acceleration for the diffusion stage.

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor
Signed-off-by: samithuang <285365963@qq.com>
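A minimal sketch of how the two flags could be wired into the example script with argparse. The flag names are from the PR; the accepted values, defaults, and help strings are assumptions for illustration.

```python
import argparse

parser = argparse.ArgumentParser(description="GLM-Image offline inference")
parser.add_argument(
    "--cache-backend",
    type=str,
    default=None,
    help="Cache backend for the diffusion stage, e.g. 'cache_dit' (assumed value).",
)
parser.add_argument(
    "--enable-cache-dit-summary",
    action="store_true",
    help="Print cache-dit's cache-hit summary after the run.",
)

args = parser.parse_args(["--cache-backend", "cache_dit", "--enable-cache-dit-summary"])
assert args.cache_backend == "cache_dit"
assert args.enable_cache_dit_summary is True
```

With no flags passed, `cache_backend` stays `None` and caching is left disabled, so the example behaves as before the change.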
How is this PR different from #1399? Is the cache-dit version different?
Add two args:
Yes, this adds support for cache-dit v1.3.0.
@SamitHuang Can you cooperate with the author of #1399? I hope we can encourage more outside developers to participate in our project 😊!
Sure, happy to do that. I didn't notice #1399 previously.
Do other models need this feature? |
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Reference docs: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#mcc-multiple-cache-contexts-within-a-single-denoising-loop
….3.0 upgrade (vllm-project#1858)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Purpose
Add force_refresh_step_hint and force_refresh_step_policy support from cache-dit 1.3.0 for the GLM-Image model, aligning with the cache-dit example usage. Also adds --cache-backend CLI support to the GLM-Image end2end example script.

Why GLM-Image needs special handling
GLM-Image (GlmImagePipeline): In editing mode, the transformer is called once to process the input image before the denoising loop begins. Setting force_refresh_step_hint = 1 ensures the cache is force-refreshed after this preprocessing call, discarding stale hidden states before actual denoising. For text-to-image mode, force_refresh_step_hint = None (no force refresh needed). This can be configured in cache-dit 1.3.0.

Changes
- vllm_omni/diffusion/data.py: Added force_refresh_step_hint and force_refresh_step_policy fields to DiffusionCacheConfig
- vllm_omni/diffusion/cache/cache_dit_backend.py: Wired the new fields through _build_db_cache_config() to DBCacheConfig; added the enable_cache_for_glm_image() custom enabler to CUSTOM_DIT_ENABLERS
- examples/offline_inference/glm_image/end2end.py: Added --cache-backend and --enable-cache-dit-summary CLI arguments
- requirements/common.txt: Upgraded cache-dit from 1.2.0 to 1.3.0

Dependency
This PR depends on vllm-project/vllm-omni#1834 (the cache-dit 1.3.0 upgrade) being merged first; otherwise the upgrade is included in this PR.
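The editing-mode rule from the Purpose section can be sketched as a small config helper. The field names come from the PR; the dataclass shape, defaults, and helper function are illustrative assumptions, not the actual vllm_omni code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiffusionCacheConfig:
    # Sketch of the config described in the PR; real defaults may differ.
    cache_backend: Optional[str] = None
    force_refresh_step_hint: Optional[int] = None
    force_refresh_step_policy: str = "once"

def glm_image_cache_config(editing: bool) -> DiffusionCacheConfig:
    # Editing mode: one extra transformer call preprocesses the input image,
    # so force-refresh after call 1 to discard its stale hidden states.
    # Text-to-image: no preprocessing call, so no force refresh is needed.
    hint = 1 if editing else None
    return DiffusionCacheConfig(cache_backend="cache_dit",
                                force_refresh_step_hint=hint)

assert glm_image_cache_config(editing=True).force_refresh_step_hint == 1
assert glm_image_cache_config(editing=False).force_refresh_step_hint is None
```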
Test Plan
Test Commands (GLM-Image)
Test Result
Benchmark on dual NVIDIA H800 GPUs (AR on GPU 1, Diffusion on GPU 6), GLM-Image T2I, 1024x1024, 50 steps:
Cache-dit DBCache config:
F1B0_W4_threshold=0.24_MC3

Note: The total generation time is dominated by the AR stage (~27s), so the diffusion-stage speedup (~3x) translates to a more modest ~1.28x end-to-end speedup. For workloads with more diffusion steps or batch processing, the speedup would be more pronounced.
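The ~1.28x figure follows from Amdahl's law: only the diffusion stage benefits, while the ~27s AR stage is unchanged. The diffusion-stage time below is an assumed value chosen to be consistent with the reported numbers, not a measurement from the PR.

```python
# End-to-end speedup when only the diffusion stage is accelerated.
ar_s = 27.0          # AR stage, unaffected by cache-dit
diffusion_s = 13.2   # assumed baseline diffusion-stage time
diffusion_speedup = 3.0

before = ar_s + diffusion_s
after = ar_s + diffusion_s / diffusion_speedup
e2e_speedup = before / after
print(f"{e2e_speedup:.2f}x")  # prints "1.28x": the AR stage dominates
```

As the diffusion share of the total grows (more steps, larger batches), the end-to-end speedup approaches the 3x stage speedup.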
Cache-dit summary (from the accelerated run):
w/o cache-dit:
w/ cache-dit: