
[Feat] support for multi-block layerwise offloading, fix top-level parameters/buffers staying on CPU#1486

Merged
lishunyang12 merged 16 commits into vllm-project:main from RuixiangMa:multiblockoffload
Apr 6, 2026

Conversation

@RuixiangMa
Contributor

@RuixiangMa RuixiangMa commented Feb 25, 2026

Purpose

Some diffusion models (e.g., Flux, LongCat, Ovis) have two types of transformer blocks (e.g., transformer_blocks and single_transformer_blocks); the previous implementation supported only a single block type, limiting layerwise offloading effectiveness for these models.

  • Introduce a _layerwise_offload_blocks_attrs attribute to support models with multiple block types
  • Stay compatible with existing single-block models that use _layerwise_offload_blocks_attr
  • Add support for Flux, Flux2-Klein and Z-Image (single-block) models
  • Bug fix: fix top-level parameters/buffers staying on CPU during offloading
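As a rough sketch of the two declaration styles (only the `_layerwise_offload_blocks_attrs` / `_layerwise_offload_blocks_attr` names come from this PR; real models subclass `torch.nn.Module` and hold blocks in `nn.ModuleList`, so plain classes and lists stand in here to keep the example dependency-free):

```python
class MultiBlockTransformer:
    # New multi-block form: several block-list attribute names (Flux-style).
    _layerwise_offload_blocks_attrs = ("transformer_blocks", "single_transformer_blocks")

    def __init__(self):
        # Stand-ins for nn.ModuleList of transformer blocks.
        self.transformer_blocks = ["dual_blk0", "dual_blk1"]
        self.single_transformer_blocks = ["single_blk0", "single_blk1", "single_blk2"]


class SingleBlockTransformer:
    # Legacy single-attribute form, still supported for existing models.
    _layerwise_offload_blocks_attr = "blocks"

    def __init__(self):
        self.blocks = ["blk0", "blk1", "blk2", "blk3"]
```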

Test Plan

Test Result

NVIDIA-4090(24G)

vllm serve --model /data/models/black-forest-labs/FLUX* --omni --enable_layerwise_offload --port 8004

curl -X POST http://localhost:8004/v1/images/generations   -H "Content-Type: application/json"   -d '{
    "prompt": "a majestic dragon perched on the mountain ridge of Vermont, misty morning atmosphere, photorealistic style",
    "size": "1024x1024",
    "num_inference_steps": 50,
    "cfg_scale": 4.0,
    "guidance_scale": 4.0,
    "seed": 42
  }' | jq -r '.data[0].b64_json' | base64 -d > dragon.png
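The same request can be issued from Python using only the standard library; the decoding helper below mirrors the `jq`/`base64` part of the pipeline (the function names are ours for illustration, not part of the server API):

```python
import base64
import json
import urllib.request

def decode_first_image(response_json: str) -> bytes:
    # Mirrors `jq -r '.data[0].b64_json' | base64 -d`.
    return base64.b64decode(json.loads(response_json)["data"][0]["b64_json"])

def generate(prompt: str,
             url: str = "http://localhost:8004/v1/images/generations") -> bytes:
    # Same payload as the curl command above.
    payload = {
        "prompt": prompt,
        "size": "1024x1024",
        "num_inference_steps": 50,
        "cfg_scale": 4.0,
        "guidance_scale": 4.0,
        "seed": 42,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return decode_first_image(resp.read().decode())
```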
Generated images (table omitted) for: FLUX.1-dev, FLUX.2-klein-4B, FLUX.2-klein-9B, Qwen-Image-2512.

Note: the FLUX series adopts a multi-block architecture, while Qwen-Image-2512 uses a single block type.

Offload VS no offload

Since FLUX.1-dev, FLUX.2-klein-9B, et al. incur OOM without layer offloading, we use FLUX.2-klein-4B and Z-Image as representative examples to illustrate memory usage:

| Model | VRAM (No Offload) | VRAM (With Offload) |
| --- | --- | --- |
| FLUX.2-klein-4B | 19.7GB | 13.8GB |
| Z-Image | 22.7GB | 15.5GB |

Comment thread: vllm_omni/diffusion/offloader/layerwise_backend.py (outdated)

```python
def __init__(self):
    self.blocks = nn.ModuleList([...])  # Transformer blocks
```
Collaborator
This PR adds multi-block layerwise offloading but provides no test coverage. Add tests to verify: (1) multi-block offloading works correctly with different block types, (2) memory usage is reduced as expected, (3) output quality is maintained, and (4) edge cases like empty or invalid block attributes are handled.

Contributor Author

I kept these out of e2e, as they're pure logic tests for block parsing (single, multi, empty, invalid, etc.), and put them in a new file instead. Please take a look.

Comment threads: vllm_omni/diffusion/offloader/layerwise_backend.py (2 threads)

@lishunyang12 lishunyang12 left a comment


Left a couple comments on the backend changes. The multi-block approach looks right for Flux-style models.

Comment threads: vllm_omni/diffusion/offloader/layerwise_backend.py (3 threads)
@RuixiangMa
Contributor Author

RuixiangMa commented Feb 28, 2026

Z-Image is also supported in this PR to validate memory savings.

Collaborator

@lishunyang12 lishunyang12 left a comment


Deprecation warning and validation look good. Two minor items still open — see inline threads.

@RuixiangMa RuixiangMa changed the title [Feat] support for multi-block layerwise offloading [Feat] support for multi-block layerwise offloading and fix top-level parameters/buffers staying on CPU Mar 5, 2026
@RuixiangMa RuixiangMa changed the title [Feat] support for multi-block layerwise offloading and fix top-level parameters/buffers staying on CPU [Feat] support for multi-block layerwise offloading, fix top-level parameters/buffers staying on CPU Mar 5, 2026
@RuixiangMa
Contributor Author

Deprecation warning and validation look good. Two minor items still open — see inline threads.

Done

@Gaohan123
Copy link
Copy Markdown
Collaborator

Hello, any updates? There are some reviews left unresolved, especially supplementing the essential test cases.

@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 14, 2026
@RuixiangMa
Contributor Author

Hello, any updates? There are some reviews left unresolved, especially supplementing the essential test cases.

Sorry, forgot to click resolve. Most changes are done; I'll add unit tests.

@RuixiangMa
Contributor Author

@hsliuustc0106 @alex-jw-brooks @lishunyang12 @wtomin This PR has been open for a while. Could you take a look again? I need the features included. Thx!

Contributor

@alex-jw-brooks alex-jw-brooks left a comment


One thought, but looks good to me!

model.__class__.__name__,
)
continue
blocks.extend(attr)
Contributor


This depends on the attribute value being iterable, which I think is fine for most cases, but it might be a good idea to check and warn if it isn't, and/or append non-iterable objects instead.
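One way the suggested guard could look (hypothetical helper name; a sketch, not the merged implementation):

```python
import logging
from collections.abc import Iterable

logger = logging.getLogger(__name__)

def extend_blocks(blocks: list, attr, name: str) -> list:
    # Extend with iterables of blocks; append a single non-iterable
    # object with a warning instead of letting blocks.extend() raise.
    if isinstance(attr, Iterable) and not isinstance(attr, (str, bytes)):
        blocks.extend(attr)
    else:
        logger.warning("attribute %s is not iterable; appending it as a single block", name)
        blocks.append(attr)
    return blocks
```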

Contributor Author


good point, fixed it

@gcanlin
Collaborator

gcanlin commented Mar 31, 2026

cc @yuanheng-zhao @wtomin

Collaborator

@gcanlin gcanlin left a comment


LGTM, but it's better to make sure the original layerwise offload doesn't regress in performance for existing models like Wan2.2.

@gcanlin gcanlin added the ready label to trigger buildkite CI label Mar 31, 2026
@yuanheng-zhao
Contributor

LGTM.

HunyuanVideo15Transformer3DModel and HeliosTransformer3DModel are using _layerwise_offload_blocks_attr for now, could you also update and test them?

@RuixiangMa
Contributor Author

LGTM, but it's better to make sure the original layerwise offload doesn't regress in performance for existing models like Wan2.2.

Tested on Wan2.2 and Qwen-Image, results look good.

Signed-off-by: Lancer <maruixiang6688@gmail.com>
@RuixiangMa
Contributor Author

LGTM.

HunyuanVideo15Transformer3DModel and HeliosTransformer3DModel are using _layerwise_offload_blocks_attr for now, could you also update and test them?

Updated. Due to insufficient VRAM, only HunyuanVideo 1.5 was verified.

Comment thread: tests/diffusion/offloader/test_layerwise_backend.py (outdated)
@wtomin
Collaborator

wtomin commented Apr 1, 2026

Please edit the L4 tests under tests/e2e/online_serving/ for corresponding models, so that the test cases would cover layerwise cpu offloading.

BTW, I remember some document saying that layerwise cpu offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise cpu offloading and HSDP. I am a little confused: is layerwise cpu offloading compatible with most parallelism methods now?

name,
model.__class__.__name__,
)
continue
Collaborator


I think if some attr_name does not exist, it should throw an error instead of skipping it with an error log. I recommend stricter param checking here, since the model's transformer block attribute names wouldn't change very often.

Contributor Author


ok, changed this to fail fast

@RuixiangMa
Contributor Author

BTW, I remember some document saying that layerwise cpu offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise cpu offloading and HSDP. I am a little confused: is layerwise cpu offloading compatible with most parallelism methods now?

Based on my tests, layerwise offloading with TP or SP works fine.

RuixiangMa and others added 4 commits April 1, 2026 19:28
@RuixiangMa
Contributor Author

Please edit the L4 tests under tests/e2e/online_serving/ for corresponding models, so that the test cases would cover layerwise cpu offloading.

BTW, I remember some document saying that layerwise cpu offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise cpu offloading and HSDP. I am a little confused: is layerwise cpu offloading compatible with most parallelism methods now?

Added/updated layerwise + Ulysses/Ring, layerwise + TP, and layerwise + HSDP coverage for FLUX.2-klein and Z-Image.

@RuixiangMa
Contributor Author

```
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[parallel_cachedit_fp8_ring2_tp2] PASSED [ 20%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[parallel_teacache_fp8_ulysses2_ring2] PASSED [ 40%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[layerwise_ulysses2_ring2] PASSED [ 60%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[layerwise_tp2] PASSED [ 80%]
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[layerwise_hsdp] PASSED [100%]

tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[omni_server0] PASSED [ 25%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[layerwise_ulysses2_ring2] PASSED [ 50%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[layerwise_tp2] PASSED [ 75%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[layerwise_hsdp] PASSED [100%]
```

@yuanheng-zhao
Contributor

Can this be merged if ready? cc @hsliuustc0106 @RuixiangMa

@lishunyang12 lishunyang12 merged commit 328de58 into vllm-project:main Apr 6, 2026
8 checks passed
skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026
…rameters/buffers staying on CPU (vllm-project#1486)

Signed-off-by: Lancer <maruixiang6688@gmail.com>
Signed-off-by: Lancer <402430575@qq.com>
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…rameters/buffers staying on CPU (vllm-project#1486)

bob-021206 pushed a commit to jasonlee-1024/vllm-omni that referenced this pull request Apr 21, 2026
…rameters/buffers staying on CPU (vllm-project#1486)


Labels

ready label to trigger buildkite CI


8 participants