[Enhancement] Upgrade cache-dit from 1.2.0 to 1.3.0 #1834

Merged
SamitHuang merged 19 commits into vllm-project:main from SamitHuang:upgrade/cache-dit-1.3.0 on Mar 12, 2026

Conversation

SamitHuang (Collaborator) commented Mar 12, 2026

Purpose

Upgrade the cache-dit dependency from 1.2.0 to 1.3.0 (latest release). This is a version bump with full API backward compatibility — all existing imports, enable_cache(), refresh_context(), BlockAdapter, DBCacheConfig, etc., remain unchanged.
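The backward-compatibility claim can be smoke-checked mechanically. Below is a minimal, hedged sketch of such a check: the symbol list mirrors the names the PR mentions (`enable_cache`, `refresh_context`, `BlockAdapter`, `DBCacheConfig`), but the helper itself (`check_symbols`) is illustrative, not part of vllm-omni or cache-dit, and it degrades gracefully when the package is not installed.

```python
"""Smoke-check that the cache_dit symbols used by the backend still
resolve after the upgrade. The symbol list below comes from this PR's
description; the helper itself is an illustrative sketch."""
import importlib

REQUIRED = [
    ("cache_dit", ["enable_cache", "refresh_context", "BlockAdapter", "DBCacheConfig"]),
]

def check_symbols(module_name, symbols):
    """Return the symbols missing from module_name, or None if the
    module itself is not importable."""
    try:
        mod = importlib.import_module(module_name)
    except ImportError:
        return None
    return [s for s in symbols if not hasattr(mod, s)]

if __name__ == "__main__":
    for name, syms in REQUIRED:
        missing = check_symbols(name, syms)
        if missing is None:
            print(f"{name}: not installed")
        elif missing:
            print(f"{name}: missing {missing}")
        else:
            print(f"{name}: OK")
```

Running this under both 1.2.0 and 1.3.0 and diffing the output is a cheap way to confirm nothing in the public surface moved.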

Test Plan

  • Verified all cache_dit imports used in vllm_omni/diffusion/cache/cache_dit_backend.py pass with 1.3.0
  • Ran offline inference benchmark on Qwen/Qwen-Image (text-to-image, 1024x1024, 50 steps) comparing with and without cache-dit acceleration
  • Pre-commit passes on the changed file
```shell
# Without cache-dit (baseline)
CUDA_VISIBLE_DEVICES=1 python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "a cup of coffee on the table" \
  --seed 142 --num-inference-steps 50 --height 1024 --width 1024

# With cache-dit 1.3.0
CUDA_VISIBLE_DEVICES=1 python examples/offline_inference/text_to_image/text_to_image.py \
  --model Qwen/Qwen-Image --prompt "a cup of coffee on the table" \
  --seed 142 --num-inference-steps 50 --height 1024 --width 1024 \
  --cache-backend cache_dit --enable-cache-dit-summary
```

Test Result

Benchmark on a single NVIDIA H800 GPU:

| Metric | Without Cache-DiT | With Cache-DiT 1.3.0 | Speedup |
| --- | --- | --- | --- |
| Total generation time | 7.551 s | 3.761 s | 2.01x |
| Diffusion engine exec time | 7,436 ms | 3,644 ms | 2.04x |

Cache-dit 1.3.0 delivers ~2x acceleration on Qwen-Image with default DBCache config (Fn=1, Bn=0, W=4, threshold=0.24), consistent with 1.2.0 behavior.
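The speedup figures in the table follow directly from the raw timings; a one-liner sanity check (values copied from the table above):

```python
# Recompute the speedup column from the raw benchmark timings.
baseline_s, cached_s = 7.551, 3.761          # total generation time
engine_baseline_ms, engine_cached_ms = 7436, 3644  # engine exec time

total_speedup = round(baseline_s / cached_s, 2)              # 2.01
engine_speedup = round(engine_baseline_ms / engine_cached_ms, 2)  # 2.04
print(total_speedup, engine_speedup)
```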


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands.
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update — N/A, no doc changes needed.
  • (Optional) Release notes update — N/A, minor dependency bump.

Signed-off-by: samithuang <285365963@qq.com>
Two optimizations that eliminate ~6.5s of IPC serialization overhead
for single-stage diffusion pipelines (e.g. Wan2.2 I2V/T2V) in online
serving mode:

Phase 1 – Inline diffusion (eliminate Hop3):
When there is exactly one diffusion stage in async mode, initialize
OmniDiffusion directly in the orchestrator process instead of spawning
a stage worker subprocess. This removes the entire Hop3 serialization
path (pickle + mp.Queue/SHM) between the stage worker and orchestrator.
GPU workers for tensor parallelism are still spawned by DiffusionExecutor.
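The Phase 1 decision can be sketched as a small dispatch function. This is a hedged illustration only: `OmniDiffusion` is stubbed, and `init_diffusion`/`run_stage` are invented names standing in for the orchestrator's actual initialization path.

```python
"""Sketch of the Phase 1 dispatch: with exactly one diffusion stage in
async mode, build the engine in the orchestrator process instead of
spawning a stage-worker subprocess. All names here are illustrative."""
import multiprocessing as mp

class OmniDiffusion:
    """Stand-in for the real diffusion engine class in vllm-omni."""
    def __init__(self, stage):
        self.stage = stage

def run_stage(stage):
    # Stand-in for the stage-worker subprocess entry point.
    OmniDiffusion(stage)

def init_diffusion(stages, async_mode):
    """Return (inline_engine_or_None, worker_processes)."""
    if async_mode and len(stages) == 1:
        # Inline path: the Hop3 serialization layer (pickle +
        # mp.Queue/SHM between worker and orchestrator) disappears.
        return OmniDiffusion(stages[0]), []
    # Multi-stage path: keep one worker subprocess per stage.
    procs = [mp.Process(target=run_stage, args=(s,)) for s in stages]
    return None, procs
```

Tensor-parallel GPU workers are unaffected by this choice; only the stage-worker hop is folded into the orchestrator.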

Phase 2 – SHM tensor transfer (optimize Hop1):
Replace pickle-based serialization of large tensors through MessageQueue
with POSIX shared memory. The worker copies tensor data into a named SHM
segment and enqueues only lightweight metadata; the scheduler reconstructs
the tensor from SHM. This reduces Hop1 overhead from ~3.4s to ~1.5s.
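The Phase 2 protocol (bulk bytes through named SHM, only metadata through the queue) can be demonstrated with the stdlib alone. This is a single-process sketch using `array` in place of a torch tensor; the function names and metadata dict are illustrative, not the PR's actual wire format.

```python
"""Sketch of the SHM tensor-transfer protocol: the worker copies the
tensor's bytes into a named POSIX shared-memory segment and enqueues
only lightweight metadata; the scheduler reattaches by name and
rebuilds the tensor. Names and metadata fields are illustrative."""
from array import array
from multiprocessing import shared_memory

def send_via_shm(tensor: array) -> dict:
    """Worker side: copy the buffer into SHM, return queue-sized metadata."""
    raw = tensor.tobytes()
    shm = shared_memory.SharedMemory(create=True, size=len(raw))
    shm.buf[:len(raw)] = raw
    meta = {"shm_name": shm.name, "typecode": tensor.typecode, "nbytes": len(raw)}
    shm.close()  # drop our handle; the named segment persists until unlinked
    return meta

def recv_via_shm(meta: dict) -> array:
    """Scheduler side: reattach by name, rebuild the tensor, free the segment."""
    shm = shared_memory.SharedMemory(name=meta["shm_name"])
    out = array(meta["typecode"])
    out.frombytes(bytes(shm.buf[:meta["nbytes"]]))  # slice: SHM may be page-rounded
    shm.close()
    shm.unlink()
    return out

if __name__ == "__main__":
    latents = array("f", [0.5, 0.25, 1.0])
    rebuilt = recv_via_shm(send_via_shm(latents))
    print(rebuilt.tolist())
```

The win comes from `meta` being a few dozen bytes regardless of tensor size, so the MessageQueue never pickles the multi-hundred-MB latent itself.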

Measured on Wan2.2-I2V-A14B (TP=2, 1280x720, 5s@16fps, 1 step):
  Before:  e2e = 37.5s
  Phase 1: e2e = 33.1s  (−4.4s)
  Phase 2: e2e = 31.0s  (−2.1s)
  Total:   e2e = 31.0s  (−6.5s, −17.5%)

Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>

perf: reduce IPC overhead for single-stage diffusion serving (~6.5s, 17.5%)
Signed-off-by: Samit <285365963@qq.com>
Signed-off-by: samithuang <285365963@qq.com>
@DefTruth DefTruth mentioned this pull request Mar 12, 2026
5 tasks
@SamitHuang SamitHuang force-pushed the upgrade/cache-dit-1.3.0 branch from 997ba86 to 30a6201 Compare March 12, 2026 06:20
Upgrade cache-dit dependency to the latest release (1.3.0). All existing
imports and APIs remain compatible. Verified with Qwen-Image offline
inference showing ~2x speedup with cache-dit acceleration.

Signed-off-by: yx <yx@users.noreply.github.com>
Made-with: Cursor

Signed-off-by: samithuang <285365963@qq.com>
@SamitHuang SamitHuang force-pushed the upgrade/cache-dit-1.3.0 branch from 30a6201 to 5fcf302 Compare March 12, 2026 06:49
@SamitHuang SamitHuang added the ready label to trigger buildkite CI label Mar 12, 2026
@SamitHuang SamitHuang merged commit 4dbaa74 into vllm-project:main Mar 12, 2026
7 checks passed
yiliu30 pushed a commit to yiliu30/vllm-omni-fork that referenced this pull request Mar 20, 2026
Signed-off-by: samithuang <285365963@qq.com>

Signed-off-by: yiliu30 <yi4.liu@intel.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026