[Enhancement] Add force_refresh support for GLM-Image for cache-dit 1.3.0 upgrade#1858

Merged
wtomin merged 24 commits into vllm-project:main from SamitHuang:feat/cache-dit-helios-glm-image
Mar 24, 2026

Conversation


@SamitHuang (Collaborator) commented on Mar 12, 2026

Purpose

Add force_refresh_step_hint and force_refresh_step_policy support from cache-dit 1.3.0 for GLM-Image model, aligning with the cache-dit example usage. Also adds --cache-backend CLI support to the GLM-Image end2end example script.

Why GLM-Image needs special handling

GLM-Image (GlmImagePipeline): in editing mode, the transformer is called once to process the input image before the denoising loop begins. Setting force_refresh_step_hint = 1 ensures the cache is force-refreshed after this preprocessing call, discarding stale hidden states before actual denoising. In text-to-image mode, force_refresh_step_hint = None (no force refresh is needed). This behavior is configurable as of cache-dit 1.3.0.
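
For illustration, a minimal sketch of setting these fields (the helper function is hypothetical, and it assumes DiffusionCacheConfig's other fields have defaults):

```python
from vllm_omni.diffusion.data import DiffusionCacheConfig

def glm_image_cache_config(editing_mode: bool) -> DiffusionCacheConfig:
    # Editing mode: the transformer runs once on the input image before
    # the denoising loop, so force-refresh the cache after that call.
    # Text-to-image mode: no preprocessing call, so no force refresh.
    return DiffusionCacheConfig(
        force_refresh_step_hint=1 if editing_mode else None,
    )
```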

Changes

  1. vllm_omni/diffusion/data.py: Added force_refresh_step_hint and force_refresh_step_policy fields to DiffusionCacheConfig
  2. vllm_omni/diffusion/cache/cache_dit_backend.py:
    • Pass new fields through _build_db_cache_config() to DBCacheConfig
    • Added enable_cache_for_glm_image() custom enabler (see the sketch after this list)
    • Registered in CUSTOM_DIT_ENABLERS
  3. examples/offline_inference/glm_image/end2end.py: Added --cache-backend and --enable-cache-dit-summary CLI arguments
  4. requirements/common.txt: Upgraded cache-dit from 1.2.0 to 1.3.0
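
A sketch of the enabler wiring from item 2 (illustrative only; the import paths and the cache_dit.enable_cache call are assumptions based on the cache-dit 1.3.0 API, not the PR diff):

```python
import cache_dit
from cache_dit import DBCacheConfig  # assumed import path

def enable_cache_for_glm_image(pipe, cache_config):
    # Forward the new force_refresh fields into the cache-dit config.
    db_config = DBCacheConfig(
        force_refresh_step_hint=cache_config.force_refresh_step_hint,
        force_refresh_step_policy=cache_config.force_refresh_step_policy,
    )
    return cache_dit.enable_cache(pipe, cache_config=db_config)

# Dispatch table consulted when enabling caching for a pipeline class.
CUSTOM_DIT_ENABLERS = {
    "GlmImagePipeline": enable_cache_for_glm_image,
}
```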

Dependency

This PR depends on vllm-project/vllm-omni#1834 (the cache-dit 1.3.0 upgrade) being merged first; until then, the upgrade is carried in this PR (see the requirements/common.txt change above).

Test Plan

  • Pre-commit check passes (ruff check, ruff format, typos)
  • Verify GLM-Image T2I inference with cache-dit acceleration

Test Commands (GLM-Image)

# Baseline (no cache-dit)
CUDA_VISIBLE_DEVICES=1,6 python examples/offline_inference/glm_image/end2end.py \
  --model-path zai-org/GLM-Image \
  --config-path examples/offline_inference/glm_image/glm_image.yaml \
  --prompt "A photo of an astronaut riding a horse on mars" \
  --height 1024 --width 1024 --num-inference-steps 50 --seed 42 \
  --output glm_image_baseline.png --verbose

# With cache-dit
CUDA_VISIBLE_DEVICES=1,6 python examples/offline_inference/glm_image/end2end.py \
  --model-path zai-org/GLM-Image \
  --config-path examples/offline_inference/glm_image/glm_image.yaml \
  --prompt "A photo of an astronaut riding a horse on mars" \
  --height 1024 --width 1024 --num-inference-steps 50 --seed 42 \
  --cache-backend cache_dit --enable-cache-dit-summary \
  --output glm_image_cachedit.png --verbose

Test Result

Benchmark on dual NVIDIA H800 GPUs (AR on GPU 1, Diffusion on GPU 6), GLM-Image T2I, 1024x1024, 50 steps:

| Metric | Without Cache-DiT | With Cache-DiT 1.3.0 | Speedup |
|--------|-------------------|----------------------|---------|
| Total generation time (AR + Diffusion) | 45.86s | 35.74s | 1.28x |
| Diffusion stage time | ~15s | ~5s | ~3x |
| Cached steps / total steps | 0/50 | 35/50 (70%) | n/a |

Cache-dit DBCache config: F1B0_W4_threshold=0.24_MC3

Note: The total generation time is dominated by the AR stage (~27s), so the diffusion-stage speedup (~3x) translates to a more modest ~1.28x end-to-end speedup. For workloads with longer diffusion steps or batch processing, the speedup would be more pronounced.
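
The end-to-end figure is consistent with Amdahl's law: the diffusion stage is roughly 15s of the 45.86s total, and only that fraction speeds up. A quick check:

```python
p = 15 / 45.86                   # fraction of e2e time in the diffusion stage
s = 3.0                          # diffusion-stage speedup
speedup = 1 / ((1 - p) + p / s)  # Amdahl's law
print(f"{speedup:.2f}x")         # ~1.28x, matching the measured e2e gain
```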

Cache-dit summary (from the accelerated run):

[Cache-DiT] ⚡️Cache Steps and Residual Diffs Statistics: GlmImageTransformerBlock
| Cache Steps | Diffs P50 | Diffs P95 | Diffs Max |
|-------------|-----------|-----------|-----------|
| 35          | 0.057     | 0.13      | 0.157     |

w/o cache-dit: [image: glm_image_baseline]

w/ cache-dit: [image: glm_image_cachedit]

Commits

Signed-off-by: samithuang <285365963@qq.com>
Two optimizations that eliminate ~6.5s of IPC serialization overhead
for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online
serving mode:

Phase 1 – Inline diffusion (eliminate Hop3):
When there is exactly one diffusion stage in async mode, initialize
OmniDiffusion directly in the orchestrator process instead of spawning
a stage worker subprocess. This removes the entire Hop3 serialization
path (pickle + mp.Queue/SHM) between the stage worker and orchestrator.
GPU workers for tensor parallelism are still spawned by DiffusionExecutor.

Phase 2 – SHM tensor transfer (optimize Hop1):
Replace pickle-based serialization of large tensors through MessageQueue
with POSIX shared memory. The worker copies tensor data into a named SHM
segment and enqueues only lightweight metadata; the scheduler reconstructs
the tensor from SHM. This reduces Hop1 overhead from ~3.4s to ~1.5s.
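
An illustrative sketch of that SHM hand-off (hypothetical helper names, not the PR's code; assumes a numpy-compatible dtype):

```python
from multiprocessing import shared_memory

import numpy as np
import torch

def send_via_shm(tensor: torch.Tensor) -> dict:
    """Copy tensor bytes into a named SHM segment; return light metadata."""
    arr = tensor.detach().cpu().numpy()
    shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
    np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)[...] = arr
    meta = {"name": shm.name, "shape": arr.shape, "dtype": str(arr.dtype)}
    shm.close()  # segment stays alive until the receiver unlinks it
    return meta  # enqueue this instead of pickling the whole tensor

def recv_via_shm(meta: dict) -> torch.Tensor:
    """Reconstruct the tensor from SHM, then release the segment."""
    shm = shared_memory.SharedMemory(name=meta["name"])
    arr = np.ndarray(meta["shape"], dtype=np.dtype(meta["dtype"]), buffer=shm.buf)
    tensor = torch.from_numpy(arr).clone()  # own the memory before unlink
    shm.close()
    shm.unlink()
    return tensor
```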

Measured on Wan2.2-I2V-A14B (TP=2, 1280x720, 5s@16fps, 1 step):
  Before:  e2e = 37.5s
  Phase 1: e2e = 33.1s  (−4.4s)
  Phase 2: e2e = 31.0s  (−2.1s)
  Total:   e2e = 31.0s  (−6.5s, −17.5%)

Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>

perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
Upgrade cache-dit dependency to the latest release (1.3.0). All existing
imports and APIs remain compatible. Verified with Qwen-Image offline
inference showing ~2x speedup with cache-dit acceleration.

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>

Add force_refresh_step_hint and force_refresh_step_policy to
DiffusionCacheConfig and wire them through to DBCacheConfig. Register
custom cache-dit enablers for HeliosPipeline, HeliosPyramidPipeline,
and GlmImagePipeline.

- Helios: multi-chunk denoise loop requires cache reset between chunks,
  so force_refresh_step_hint defaults to num_inference_steps and
  force_refresh_step_policy defaults to "repeat".
- GLM-Image: editing mode preprocesses input image in one extra
  transformer call; force_refresh_step_hint=1 discards stale state.
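
An illustrative summary of those defaults (hypothetical helper, not part of the commit):

```python
def default_force_refresh(pipeline_cls: str, num_inference_steps: int):
    # Helios: multi-chunk denoise loop, reset the cache between chunks.
    if pipeline_cls in ("HeliosPipeline", "HeliosPyramidPipeline"):
        return num_inference_steps, "repeat"
    # GLM-Image: discard stale state after the editing-mode preprocess call.
    if pipeline_cls == "GlmImagePipeline":
        return 1, None
    return None, None
```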

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>
Add --cache-backend and --enable-cache-dit-summary CLI arguments to the
GLM-Image offline inference example, enabling cache-dit acceleration for
the diffusion stage.

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>
Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>
@SamitHuang changed the title from "[Enhancement] Add cache-dit force_refresh support for GLM-Image and upgrade to 1.3.0" to "[Enhancement] Add cache-dit force_refresh support for GLM-Image based on cache-dit 1.3.0" on Mar 12, 2026
Signed-off-by: Samit <285365963@qq.com>
@SamitHuang changed the title from "[Enhancement] Add cache-dit force_refresh support for GLM-Image based on cache-dit 1.3.0" to "[Enhancement] Add cache-dit support for GLM-Image based on force_refresh in cache-dit 1.3.0" on Mar 12, 2026
@wtomin (Collaborator) commented on Mar 13, 2026

How is this PR different from #1399? Is the cache-dit version different?

@princepride (Collaborator) left a comment:

LGTM

@princepride (Collaborator) commented:

> How is this PR different from #1399? Is the cache-dit version different?

Add two args: force_refresh_step_hint and force_refresh_step_policy?

@SamitHuang (Collaborator, Author) commented:

> How is this PR different from #1399? Is the cache-dit version different?
>
> Add two args: force_refresh_step_hint and force_refresh_step_policy?

Yes, this adds support for cache-dit v1.3.0.

@princepride (Collaborator) commented:

@SamitHuang Could you coordinate with the author of #1399? I hope we can encourage more community developers to participate in our project 😊!

@SamitHuang (Collaborator, Author) commented:

> @SamitHuang Could you coordinate with the author of #1399? I hope we can encourage more community developers to participate in our project 😊!

Sure, happy to do that. I hadn't noticed #1399 previously.

@SamitHuang changed the title from "[Enhancement] Add cache-dit support for GLM-Image based on force_refresh in cache-dit 1.3.0" to "[Enhancement] Add force_refresh support for GLM-Image for cache-dit 1.3.0 upgrade" on Mar 13, 2026
@Gaohan123 added this to the v0.18.0 milestone on Mar 21, 2026
@Gaohan123 (Collaborator) commented:

Do other models need this feature?

@wtomin (Collaborator) left a comment:

LGTM.

Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
@wtomin added the `ready` label (label to trigger Buildkite CI) on Mar 23, 2026
@DefTruth (Contributor) commented:

> Do other models need this feature?

See the reference docs: https://cache-dit.readthedocs.io/en/latest/user_guide/CACHE_API/#mcc-multiple-cache-contexts-within-a-single-denoising-loop

@wtomin merged commit 9ead0d8 into vllm-project:main on Mar 24, 2026
7 of 8 checks passed
zhangj1an pushed a commit to zhangj1an/vllm-omni that referenced this pull request Mar 26, 2026
….3.0 upgrade (vllm-project#1858)

Signed-off-by: samithuang <285365963@qq.com>
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>