Skip to content

[Bugfix] Fix layer-wise offload incompatibility with HSDP#2021

Merged
wtomin merged 5 commits into
vllm-project:mainfrom
RuixiangMa:fixlayeroffloadhsdp
Apr 1, 2026
Merged

[Bugfix] Fix layer-wise offload incompatibility with HSDP#2021
wtomin merged 5 commits into
vllm-project:mainfrom
RuixiangMa:fixlayeroffloadhsdp

Conversation

@RuixiangMa
Copy link
Copy Markdown
Collaborator

@RuixiangMa RuixiangMa commented Mar 19, 2026

Purpose

vllm serve Wan-AI/Wan2.2-TI2V-5B-Diffusers --omni --use-hsdp --hsdp-shard-size 4 --enable-layerwise-offload --port 8004

File "/home/ruixiang/omni/vllm_omni/diffusion/offloader/layerwise_backend.py", line 115, in _to_cpu
cpu_tensor[current_offset : current_offset + numel].copy_(param_or_buf.flatten())
RuntimeError: aten.copy_.default got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!

Fix layer-wise offload incompatibility with HSDP by properly handling DTensor to local tensor conversion

Test Plan

Test Result

(APIServer pid=199924) INFO: 127.0.0.1:52912 - "POST /v1/videos HTTP/1.1" 200 OK
(APIServer pid=199924) INFO 03-20 01:33:30 [serving_video.py:118] Boundary ratio parse: request=0.875 gen_params=0.875
(APIServer pid=199924) INFO 03-20 01:33:30 [serving_video.py:128] Video sampling params: steps=10 guidance=4.0 guidance_2=4.0 seed=42
(APIServer pid=199924) INFO 03-20 01:33:30 [serving_video.py:204] Video generation routing: stage_configs=present, has_stage_list=True, engine_type=AsyncOmni
(APIServer pid=199924) INFO 03-20 01:33:30 [async_omni.py:511] [AsyncOrchestrator] Inline diffusion generate for request video_gen_02d35aafeb1e46399788ce9181d04de8
(APIServer pid=199924) INFO 03-20 01:33:30 [diffusion_engine.py:86] Pre-processing completed in 0.0000 seconds
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:15<00:00, 1.56s/it]
(APIServer pid=199924) INFO 03-20 01:33:47 [diffusion_engine.py:94] Generation completed successfully.
(APIServer pid=199924) INFO 03-20 01:33:47 [diffusion_engine.py:116] Post-processing completed in 0.0542 seconds
(APIServer pid=199924) INFO 03-20 01:33:47 [diffusion_engine.py:119] DiffusionEngine.step breakdown: preprocess=0.04 ms, add_req_and_wait=17790.30 ms, postprocess=54.16 ms, total=17845.49 ms
(APIServer pid=199924) INFO 03-20 01:33:47 [omni_diffusion.py:133] OmniDiffusion.generate total: 17846.18 ms
(APIServer pid=199924) INFO 03-20 01:33:48 [serving_video.py:159] Video response encoding (MP4+base64): 327.40 ms
(APIServer pid=199924) INFO 03-20 01:33:48 [api_server.py:1778] Video request video_gen_02d35aafeb1e46399788ce9181d04de8 persisted /tmp/storage/video_gen_02d35aafeb1e46399788ce9181d04de8.mp4 output file.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the test style doc
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update. If your change is user-facing, please update the release notes draft.

BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Lancer <maruixiang6688@gmail.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b2c9d6a9cf

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +119 to +123
if use_hsdp and hasattr(t, "to_local"):
local_t = t.to_local()
else:
local_t = t
weights_with_local.append((name, t, local_t))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Don't rematerialize HSDP blocks from to_local() shards

With --use-hsdp --enable-layerwise-offload, _to_cpu() now snapshots each DTensor via to_local() and prefetch_layer() later writes that shard back as a plain Tensor. Because HookRegistry wraps forward() instead of registering a forward-pre-hook (vllm_omni/diffusion/hooks/base.py:165-172), the block's FSDP/HSDP pre-hook runs outside this path; the eagerly prefetched first block in LayerWiseOffloadBackend.enable() therefore enters its first call with only rank-local weights, not a DTensor that FSDP can all-gather. On multi-GPU HSDP, any dim-0-sharded weight will then run with truncated per-rank shapes or fail outright.

Useful? React with 👍 / 👎.

@RuixiangMa RuixiangMa changed the title Fixlayeroffloadhsdp [Buggix] Fix layer-wise offload incompatibility with HSDP Mar 19, 2026
@RuixiangMa RuixiangMa changed the title [Buggix] Fix layer-wise offload incompatibility with HSDP [Bugfix] Fix layer-wise offload incompatibility with HSDP Mar 19, 2026
@hsliuustc0106
Copy link
Copy Markdown
Collaborator

@yuanheng-zhao PTAL

Comment on lines +119 to +122
if use_hsdp and hasattr(t, "to_local"):
local_t = t.to_local()
else:
local_t = t
Copy link
Copy Markdown
Collaborator

@yuanheng-zhao yuanheng-zhao Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does HSDP convert all of the model parameters to DTensor?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only parameters of sharded modules (matched by hsdp_shard_conditions) become DTensor, other parameter remain as regular

weights_with_local = []
for name, t in name2weights.items():
if use_hsdp and hasattr(t, "to_local"):
local_t = t.to_local()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any chances that DTensor.to_local returns AsyncCollectiveTensor at here

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to_local() returns the local shard directly without triggering communication,i think it is saft

)

param_or_buf.data = torch.empty((), device=device, dtype=dtype)
original_tensor.data = torch.empty((), device=device, dtype=dtype)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DTensor.data = XXX is accessing Dtensor data directly rather than its local tensor. Could you help to check if this is consistent and compatible?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. That would bypass DTensor semantics, so I changed the layerwise backend to keep the DTensor wrapper and update _local_tensor instead

Comment on lines 184 to 192
for metadata in ordered_metadata:
target_name = metadata["name"]
target_param_or_buf = (
layer_params[target_name] if target_name in layer_params else layer_bufs[target_name]
)

target_param_or_buf.data = gpu_weight[metadata["offset"] : metadata["offset"] + metadata["numel"]].view(
metadata["shape"]
)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above. The assignment and materialization of gpu tensors during prefetch for DTensor directly happen here (e.g., target_param_or_buf.data = ...)

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix addresses the crash correctly. Two notes:

  1. The use_hsdp flag feels redundant — just checking hasattr(t, "to_local") everywhere would be simpler, no config plumbing needed, and works for any future DTensor scenario.

  2. yuanheng-zhao's .data assignment question (lines 142, 192) is important — assigning plain tensors back into DTensor parameters during prefetch_layer may break FSDP bookkeeping. offload_layer() has the same pattern but wasn't updated either.

Copy link
Copy Markdown
Collaborator

@lishunyang12 lishunyang12 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a couple comments. The _to_cpu fix looks correct, but prefetch_layer and offload_layer still assign plain tensors to .data on DTensor parameters — that seems like it'll break HSDP bookkeeping on the reload path.

# When HSDP is enabled, tensors are DTensor and we need to_local() for correct numel/shape
weights_with_local = []
for name, t in name2weights.items():
if use_hsdp and hasattr(t, "to_local"):
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use_hsdp guard is redundant — just check hasattr(t, "to_local") unconditionally. That removes the need to thread a bool through OffloadConfigLayerwiseOffloadHook_to_cpuapply_block_hook, and it'll work for any future DTensor usage, not just HSDP.

Suggested change
if use_hsdp and hasattr(t, "to_local"):
if hasattr(t, "to_local"):

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the local use_hsdp plumbing in this module and switched to capability-based handling for DTensor tensors only

"offset": current_offset,
"numel": numel,
"shape": param_or_buf.shape,
"shape": local_tensor.shape,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prefetch_layer uses metadata["shape"] to .view() the GPU slice and then assigns it to target_param_or_buf.data. With this PR, shape is now the local tensor shape, but target_param_or_buf is still a DTensor. That .data assignment replaces the DTensor internals with a plain tensor — doesn't this break FSDP/HSDP state tracking on the reload path? Same concern for offload_layer which does param.data = torch.empty(...).

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applied the same DTensor-safe storage update in prefetch_layer() as well, so both prefetch and offload follow the same handling

enable_layerwise_offload = getattr(od_config, "enable_layerwise_offload", False)
pin_cpu_memory = getattr(od_config, "pin_cpu_memory", True)

parallel_config = getattr(od_config, "parallel_config", None)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you drop the explicit use_hsdp flag (see other comment about duck-typing to_local), this config plumbing and the field on OffloadConfig can all go away.

@RuixiangMa
Copy link
Copy Markdown
Collaborator Author

Fix addresses the crash correctly. Two notes:

  1. The use_hsdp flag feels redundant — just checking hasattr(t, "to_local") everywhere would be simpler, no config plumbing needed, and works for any future DTensor scenario.
  2. yuanheng-zhao's .data assignment question (lines 142, 192) is important — assigning plain tensors back into DTensor parameters during prefetch_layer may break FSDP bookkeeping. offload_layer() has the same pattern but wasn't updated either.

Good suggestion, my initial commit used a similar approach, but I updated it to make the intent clearer for HSDP, i will modify it.

Signed-off-by: Lancer <maruixiang6688@gmail.com>
@wtomin
Copy link
Copy Markdown
Collaborator

wtomin commented Mar 31, 2026

Please fix the pre-commit errors. This bugfix looks good to me.

Signed-off-by: Lancer <maruixiang6688@gmail.com>
@RuixiangMa
Copy link
Copy Markdown
Collaborator Author

Please fix the pre-commit errors. This bugfix looks good to me.

fixed

@wtomin wtomin added the ready label to trigger buildkite CI label Mar 31, 2026
@wtomin wtomin merged commit dd0b6fd into vllm-project:main Apr 1, 2026
7 of 8 checks passed
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready label to trigger buildkite CI

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants