
[Feat] support for multi-block layerwise offloading, fix top-level parameters/buffers staying on CPU#1486

Merged
lishunyang12 merged 16 commits into vllm-project:main from RuixiangMa:multiblockoffload
Apr 6, 2026

Conversation

@RuixiangMa
Contributor

@RuixiangMa RuixiangMa commented Feb 25, 2026

Purpose

Some diffusion models (e.g., Flux, LongCat, Ovis) have two types of transformer blocks (e.g., transformer_blocks and single_transformer_blocks); the previous implementation supported only a single block type, limiting layerwise offloading effectiveness for these models.

  • Introduce a _layerwise_offload_blocks_attrs attribute to support models with multiple block types
  • Stay compatible with existing single-block models that use _layerwise_offload_blocks_attr
  • Add support for Flux, Flux2-Klein and Z-Image (single-block) models
  • Bug fix: fix top-level parameters/buffers staying on CPU during offloading
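As a rough sketch of the two declaration styles (only the `_layerwise_offload_blocks_attrs` / `_layerwise_offload_blocks_attr` names come from this PR; real models subclass `torch.nn.Module` and hold blocks in `nn.ModuleList`, so plain classes and lists stand in here to keep the example dependency-free):

```python
class MultiBlockTransformer:
    # New multi-block form: several block-list attribute names (Flux-style).
    _layerwise_offload_blocks_attrs = ("transformer_blocks", "single_transformer_blocks")

    def __init__(self):
        # Stand-ins for nn.ModuleList of transformer blocks.
        self.transformer_blocks = ["dual_blk0", "dual_blk1"]
        self.single_transformer_blocks = ["single_blk0", "single_blk1", "single_blk2"]


class SingleBlockTransformer:
    # Legacy single-attribute form, still supported for existing models.
    _layerwise_offload_blocks_attr = "blocks"

    def __init__(self):
        self.blocks = ["blk0", "blk1", "blk2", "blk3"]
```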

Test Plan

Test Result

NVIDIA-4090(24G)

vllm serve --model /data/models/black-forest-labs/FLUX* --omni --enable_layerwise_offload --port 8004

curl -X POST http://localhost:8004/v1/images/generations   -H "Content-Type: application/json"   -d '{
    "prompt": "a majestic dragon perched on the mountain ridge of Vermont, misty morning atmosphere, photorealistic style",
    "size": "1024x1024",
    "num_inference_steps": 50,
    "cfg_scale": 4.0,
    "guidance_scale": 4.0,
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > dragon.png
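The same request can be issued from Python using only the standard library; the decoding helper below mirrors the `jq`/`base64` part of the pipeline (the function names are ours for illustration, not part of the server API):

```python
import base64
import json
import urllib.request

def decode_first_image(response_json: str) -> bytes:
    # Mirrors `jq -r '.data[0].b64_json' | base64 -d`.
    return base64.b64decode(json.loads(response_json)["data"][0]["b64_json"])

def generate(prompt: str,
             url: str = "http://localhost:8004/v1/images/generations") -> bytes:
    # Same payload as the curl command above.
    payload = {
        "prompt": prompt,
        "size": "1024x1024",
        "num_inference_steps": 50,
        "cfg_scale": 4.0,
        "guidance_scale": 4.0,
        "seed": 42,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return decode_first_image(resp.read().decode())
```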
Generated images (table omitted) for: FLUX.1-dev, FLUX.2-klein-4B, FLUX.2-klein-9B, Qwen-Image-2512.

Note: the FLUX series adopts a multi-block architecture, while Qwen-Image-2512 uses a single block type.

Offload VS no offload

Since FLUX.1-dev, FLUX.2-klein-9B, et al. incur OOM without layer offloading, we use FLUX.2-klein-4B and Z-Image as representative examples to illustrate memory usage:

| Model | VRAM (No Offload) | VRAM (With Offload) |
| --- | --- | --- |
| FLUX.2-klein-4B | 19.7GB | 13.8GB |
| Z-Image | 22.7GB | 15.5GB |

Comment thread: vllm_omni/diffusion/offloader/layerwise_backend.py (outdated)

```python
def __init__(self):
    self.blocks = nn.ModuleList([...])  # Transformer blocks
```
Collaborator
This PR adds multi-block layerwise offloading but provides no test coverage. Add tests to verify: (1) multi-block offloading works correctly with different block types, (2) memory usage is reduced as expected, (3) output quality is maintained, and (4) edge cases like empty or invalid block attributes are handled.

Contributor Author

I kept these out of e2e, as they're pure logic tests for block parsing (single, multi, empty, invalid, etc.), and put them in a new file instead. Please take a look.

Comment threads: vllm_omni/diffusion/offloader/layerwise_backend.py (2 threads)

@lishunyang12 lishunyang12 left a comment


Left a couple comments on the backend changes. The multi-block approach looks right for Flux-style models.

Comment threads: vllm_omni/diffusion/offloader/layerwise_backend.py (3 threads)
@RuixiangMa
Contributor Author

RuixiangMa commented Feb 28, 2026

Z-Image is also supported in this PR to validate memory savings.

Collaborator

@lishunyang12 lishunyang12 left a comment


Deprecation warning and validation look good. Two minor items still open — see inline threads.

@RuixiangMa RuixiangMa changed the title [Feat] support for multi-block layerwise offloading [Feat] support for multi-block layerwise offloading and fix top-level parameters/buffers staying on CPU Mar 5, 2026
@RuixiangMa RuixiangMa changed the title [Feat] support for multi-block layerwise offloading and fix top-level parameters/buffers staying on CPU [Feat] support for multi-block layerwise offloading, fix top-level parameters/buffers staying on CPU Mar 5, 2026
@RuixiangMa
Contributor Author

Deprecation warning and validation look good. Two minor items still open — see inline threads.

Done

@Gaohan123
Copy link
Copy Markdown
Collaborator

Hello, any updates? There are some reviews left unresolved, especially supplementing the essential test cases.

@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 14, 2026
@RuixiangMa
Contributor Author

Hello, any updates? There are some reviews left unresolved, especially supplementing the essential test cases.

Sorry, forgot to click resolve. Most changes are done; I'll add unit tests.

@RuixiangMa
Contributor Author

@hsliuustc0106 @alex-jw-brooks @lishunyang12 @wtomin This PR has been open for a while. Could you take a look again? I need the features included. Thx!

Contributor

@alex-jw-brooks alex-jw-brooks left a comment


One thought, but looks good to me!

model.__class__.__name__,
)
continue
blocks.extend(attr)
Contributor


This depends on the attribute value being iterable, which I think is fine for most cases, but it might be a good idea to check and warn if it isn't, and/or append non-iterable objects instead.
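One way the suggested guard could look (hypothetical helper name; a sketch, not the merged implementation):

```python
import logging
from collections.abc import Iterable

logger = logging.getLogger(__name__)

def extend_blocks(blocks: list, attr, name: str) -> list:
    # Extend with iterables of blocks; append a single non-iterable
    # object with a warning instead of letting blocks.extend() raise.
    if isinstance(attr, Iterable) and not isinstance(attr, (str, bytes)):
        blocks.extend(attr)
    else:
        logger.warning("attribute %s is not iterable; appending it as a single block", name)
        blocks.append(attr)
    return blocks
```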

Contributor Author


good point, fixed it

@gcanlin
Collaborator

gcanlin commented Mar 31, 2026

cc @yuanheng-zhao @wtomin

Collaborator

@gcanlin gcanlin left a comment


LGTM, but it's better to make sure the original layerwise offload doesn't regress in performance for existing models like Wan2.2.

@gcanlin gcanlin added the ready label to trigger buildkite CI label Mar 31, 2026
@yuanheng-zhao
Contributor

LGTM.

HunyuanVideo15Transformer3DModel and HeliosTransformer3DModel are using _layerwise_offload_blocks_attr for now, could you also update and test them?

@RuixiangMa
Contributor Author

LGTM, but it's better to make sure the original layerwise offload doesn't regress in performance for existing models like Wan2.2.

Tested on Wan2.2 and Qwen-Image, results look good.

Signed-off-by: Lancer <maruixiang6688@gmail.com>
@RuixiangMa
Contributor Author

LGTM.

HunyuanVideo15Transformer3DModel and HeliosTransformer3DModel are using _layerwise_offload_blocks_attr for now, could you also update and test them?

Updated. Due to insufficient VRAM, only HunyuanVideo 1.5 was verified.

Comment thread: tests/diffusion/offloader/test_layerwise_backend.py (outdated)
@wtomin
Collaborator

wtomin commented Apr 1, 2026

Please edit the L4 tests under tests/e2e/online_serving/ for corresponding models, so that the test cases would cover layerwise cpu offloading.

BTW, I remember some document saying that layerwise cpu offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise cpu offloading and HSDP. I am a little confused: is layerwise cpu offloading compatible with most parallelism methods now?

name,
model.__class__.__name__,
)
continue
Collaborator


I think if some attr_name does not exist, it should throw an error instead of skipping it with an error log. I recommend stricter param checking here, since the model's transformer block attribute names wouldn't change very often.

Contributor Author


ok, changed this to fail fast

@RuixiangMa
Contributor Author

BTW, I remember some document saying that layerwise cpu offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise cpu offloading and HSDP. I am a little confused: is layerwise cpu offloading compatible with most parallelism methods now?

Based on my tests, layerwise offloading with TP or SP works fine.

RuixiangMa and others added 4 commits April 1, 2026 19:28
@RuixiangMa
Contributor Author

Please edit the L4 tests under tests/e2e/online_serving/ for corresponding models, so that the test cases would cover layerwise cpu offloading.

BTW, I remember some document saying that layerwise cpu offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise cpu offloading and HSDP. I am a little confused: is layerwise cpu offloading compatible with most parallelism methods now?

Added/updated layerwise + Ulysses/Ring, layerwise + TP, and layerwise + HSDP coverage for FLUX.2-klein and Z-Image.

@RuixiangMa
Contributor Author

```
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[parallel_cachedit_fp8_ring2_tp2] PASSED [ 20%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[parallel_teacache_fp8_ulysses2_ring2] PASSED [ 40%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[layerwise_ulysses2_ring2] PASSED [ 60%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[layerwise_tp2] PASSED [ 80%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[layerwise_hsdp] PASSED [100%]

tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[omni_server0] PASSED [ 25%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[layerwise_ulysses2_ring2] PASSED [ 50%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[layerwise_tp2] PASSED [ 75%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[layerwise_hsdp] PASSED [100%]
```

@yuanheng-zhao
Contributor

Can this be merged if ready? cc @hsliuustc0106 @RuixiangMa

@lishunyang12 lishunyang12 merged commit 328de58 into vllm-project:main Apr 6, 2026
8 checks passed
skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026
…rameters/buffers staying on CPU (vllm-project#1486)

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Lancer <402430575@qq.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…rameters/buffers staying on CPU (vllm-project#1486)

bob-021206 pushed a commit to jasonlee-1024/vllm-omni that referenced this pull request Apr 21, 2026
…rameters/buffers staying on CPU (vllm-project#1486)


Labels

ready label to trigger buildkite CI


8 participants