[Bugfix] Fix layer-wise offload incompatibility with HSDP#2021
Conversation
Signed-off-by: Lancer <maruixiang6688@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b2c9d6a9cf
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if use_hsdp and hasattr(t, "to_local"): | ||
| local_t = t.to_local() | ||
| else: | ||
| local_t = t | ||
| weights_with_local.append((name, t, local_t)) |
There was a problem hiding this comment.
Don't rematerialize HSDP blocks from
to_local() shards
With --use-hsdp --enable-layerwise-offload, _to_cpu() now snapshots each DTensor via to_local() and prefetch_layer() later writes that shard back as a plain Tensor. Because HookRegistry wraps forward() instead of registering a forward-pre-hook (vllm_omni/diffusion/hooks/base.py:165-172), the block's FSDP/HSDP pre-hook runs outside this path; the eagerly prefetched first block in LayerWiseOffloadBackend.enable() therefore enters its first call with only rank-local weights, not a DTensor that FSDP can all-gather. On multi-GPU HSDP, any dim-0-sharded weight will then run with truncated per-rank shapes or fail outright.
Useful? React with 👍 / 👎.
|
@yuanheng-zhao PTAL |
| if use_hsdp and hasattr(t, "to_local"): | ||
| local_t = t.to_local() | ||
| else: | ||
| local_t = t |
There was a problem hiding this comment.
Does HSDP convert all of the model parameters to DTensor?
There was a problem hiding this comment.
only parameters of sharded modules (matched by hsdp_shard_conditions) become DTensor, other parameter remain as regular
| weights_with_local = [] | ||
| for name, t in name2weights.items(): | ||
| if use_hsdp and hasattr(t, "to_local"): | ||
| local_t = t.to_local() |
There was a problem hiding this comment.
Is there any chances that DTensor.to_local returns AsyncCollectiveTensor at here
There was a problem hiding this comment.
to_local() returns the local shard directly without triggering communication,i think it is saft
| ) | ||
|
|
||
| param_or_buf.data = torch.empty((), device=device, dtype=dtype) | ||
| original_tensor.data = torch.empty((), device=device, dtype=dtype) |
There was a problem hiding this comment.
DTensor.data = XXX is accessing Dtensor data directly rather than its local tensor. Could you help to check if this is consistent and compatible?
There was a problem hiding this comment.
Agreed. That would bypass DTensor semantics, so I changed the layerwise backend to keep the DTensor wrapper and update _local_tensor instead
| for metadata in ordered_metadata: | ||
| target_name = metadata["name"] | ||
| target_param_or_buf = ( | ||
| layer_params[target_name] if target_name in layer_params else layer_bufs[target_name] | ||
| ) | ||
|
|
||
| target_param_or_buf.data = gpu_weight[metadata["offset"] : metadata["offset"] + metadata["numel"]].view( | ||
| metadata["shape"] | ||
| ) |
There was a problem hiding this comment.
Same as above. The assignment and materialization of gpu tensors during prefetch for DTensor directly happen here (e.g., target_param_or_buf.data = ...)
lishunyang12
left a comment
There was a problem hiding this comment.
Fix addresses the crash correctly. Two notes:
-
The
use_hsdpflag feels redundant — just checkinghasattr(t, "to_local")everywhere would be simpler, no config plumbing needed, and works for any future DTensor scenario. -
yuanheng-zhao's
.dataassignment question (lines 142, 192) is important — assigning plain tensors back into DTensor parameters duringprefetch_layermay break FSDP bookkeeping.offload_layer()has the same pattern but wasn't updated either.
lishunyang12
left a comment
There was a problem hiding this comment.
Left a couple comments. The _to_cpu fix looks correct, but prefetch_layer and offload_layer still assign plain tensors to .data on DTensor parameters — that seems like it'll break HSDP bookkeeping on the reload path.
| # When HSDP is enabled, tensors are DTensor and we need to_local() for correct numel/shape | ||
| weights_with_local = [] | ||
| for name, t in name2weights.items(): | ||
| if use_hsdp and hasattr(t, "to_local"): |
There was a problem hiding this comment.
The use_hsdp guard is redundant — just check hasattr(t, "to_local") unconditionally. That removes the need to thread a bool through OffloadConfig → LayerwiseOffloadHook → _to_cpu → apply_block_hook, and it'll work for any future DTensor usage, not just HSDP.
| if use_hsdp and hasattr(t, "to_local"): | |
| if hasattr(t, "to_local"): |
There was a problem hiding this comment.
removed the local use_hsdp plumbing in this module and switched to capability-based handling for DTensor tensors only
| "offset": current_offset, | ||
| "numel": numel, | ||
| "shape": param_or_buf.shape, | ||
| "shape": local_tensor.shape, |
There was a problem hiding this comment.
prefetch_layer uses metadata["shape"] to .view() the GPU slice and then assigns it to target_param_or_buf.data. With this PR, shape is now the local tensor shape, but target_param_or_buf is still a DTensor. That .data assignment replaces the DTensor internals with a plain tensor — doesn't this break FSDP/HSDP state tracking on the reload path? Same concern for offload_layer which does param.data = torch.empty(...).
There was a problem hiding this comment.
applied the same DTensor-safe storage update in prefetch_layer() as well, so both prefetch and offload follow the same handling
| enable_layerwise_offload = getattr(od_config, "enable_layerwise_offload", False) | ||
| pin_cpu_memory = getattr(od_config, "pin_cpu_memory", True) | ||
|
|
||
| parallel_config = getattr(od_config, "parallel_config", None) |
There was a problem hiding this comment.
If you drop the explicit use_hsdp flag (see other comment about duck-typing to_local), this config plumbing and the field on OffloadConfig can all go away.
Good suggestion, my initial commit used a similar approach, but I updated it to make the intent clearer for HSDP, i will modify it. |
|
Please fix the pre-commit errors. This bugfix looks good to me. |
fixed |
…ct#2021) Signed-off-by: Lancer <maruixiang6688@gmail.com>
…ct#2021) Signed-off-by: Lancer <maruixiang6688@gmail.com>
…ct#2021) Signed-off-by: Lancer <maruixiang6688@gmail.com>
Purpose
vllm serve Wan-AI/Wan2.2-TI2V-5B-Diffusers --omni --use-hsdp --hsdp-shard-size 4 --enable-layerwise-offload --port 8004Fix layer-wise offload incompatibility with HSDP by properly handling DTensor to local tensor conversion
Test Plan
Test Result
(APIServer pid=199924) INFO: 127.0.0.1:52912 - "POST /v1/videos HTTP/1.1" 200 OK
(APIServer pid=199924) INFO 03-20 01:33:30 [serving_video.py:118] Boundary ratio parse: request=0.875 gen_params=0.875
(APIServer pid=199924) INFO 03-20 01:33:30 [serving_video.py:128] Video sampling params: steps=10 guidance=4.0 guidance_2=4.0 seed=42
(APIServer pid=199924) INFO 03-20 01:33:30 [serving_video.py:204] Video generation routing: stage_configs=present, has_stage_list=True, engine_type=AsyncOmni
(APIServer pid=199924) INFO 03-20 01:33:30 [async_omni.py:511] [AsyncOrchestrator] Inline diffusion generate for request video_gen_02d35aafeb1e46399788ce9181d04de8
(APIServer pid=199924) INFO 03-20 01:33:30 [diffusion_engine.py:86] Pre-processing completed in 0.0000 seconds
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
INFO 03-20 01:33:30 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
WARNING 03-20 01:33:30 [kv_transfer_manager.py:381] No connector available for receiving KV cache
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████| 10/10 [00:15<00:00, 1.56s/it]
(APIServer pid=199924) INFO 03-20 01:33:47 [diffusion_engine.py:94] Generation completed successfully.
(APIServer pid=199924) INFO 03-20 01:33:47 [diffusion_engine.py:116] Post-processing completed in 0.0542 seconds
(APIServer pid=199924) INFO 03-20 01:33:47 [diffusion_engine.py:119] DiffusionEngine.step breakdown: preprocess=0.04 ms, add_req_and_wait=17790.30 ms, postprocess=54.16 ms, total=17845.49 ms
(APIServer pid=199924) INFO 03-20 01:33:47 [omni_diffusion.py:133] OmniDiffusion.generate total: 17846.18 ms
(APIServer pid=199924) INFO 03-20 01:33:48 [serving_video.py:159] Video response encoding (MP4+base64): 327.40 ms
(APIServer pid=199924) INFO 03-20 01:33:48 [api_server.py:1778] Video request video_gen_02d35aafeb1e46399788ce9181d04de8 persisted /tmp/storage/video_gen_02d35aafeb1e46399788ce9181d04de8.mp4 output file.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)