[Feat] support for multi-block layerwise offloading, fix top-level parameters/buffers staying on CPU #1486
Conversation
Signed-off-by: Lancer <maruixiang6688@gmail.com>
```python
def __init__(self):
    self.blocks = nn.ModuleList([...])  # Transformer blocks
```
This PR adds multi-block layerwise offloading but provides no test coverage. Add tests to verify: (1) multi-block offloading works correctly with different block types, (2) memory usage is reduced as expected, (3) output quality is maintained, and (4) edge cases like empty or invalid block attributes are handled.
I kept these out of e2e, since they're pure logic tests for block parsing (single, multi, empty, invalid, etc.). I put them in a new file instead. Please take a look.
lishunyang12
left a comment
Left a couple comments on the backend changes. The multi-block approach looks right for Flux-style models.
Z-Image is also supported in this PR to validate the memory savings.
Signed-off-by: Lancer <maruixiang6688@gmail.com>
lishunyang12
left a comment
Deprecation warning and validation look good. Two minor items still open — see inline threads.
Signed-off-by: Lancer <maruixiang6688@gmail.com>
Done
Hello, any updates? Some review comments are still unresolved, especially the supplementary essential test cases.
Sorry, I forgot to click resolve. Most changes are done; I'll add unit tests.
Signed-off-by: Lancer <maruixiang6688@gmail.com>
@hsliuustc0106 @alex-jw-brooks @lishunyang12 @wtomin This PR has been open for a while. Could you take another look? I need these features included. Thanks!
alex-jw-brooks
left a comment
One thought, but looks good to me!
```python
        model.__class__.__name__,
    )
    continue
blocks.extend(attr)
```
This depends on the attribute value being iterable, which I think is fine for most cases, but it might be a good idea to check and warn if it isn't, and/or append non-iterable objects instead.
Good point, fixed it.
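The fix described above could look roughly like this (a hypothetical helper, not the PR's actual code): warn when a block attribute isn't iterable and fall back to appending it as a single block.

```python
from collections.abc import Iterable
import logging

logger = logging.getLogger(__name__)

def extend_blocks(blocks: list, attr, attr_name: str) -> None:
    # Iterable attributes (e.g. an nn.ModuleList) are flattened into the
    # block list; anything else is appended as a single block with a warning.
    if isinstance(attr, Iterable):
        blocks.extend(attr)
    else:
        logger.warning(
            "Block attribute %s is not iterable; appending it as a single block",
            attr_name,
        )
        blocks.append(attr)

blocks: list = []
extend_blocks(blocks, ["blk0", "blk1"], "transformer_blocks")  # iterable case
extend_blocks(blocks, object(), "odd_block")                   # non-iterable case
print(len(blocks))  # → 3
```

Note that strings are iterable in Python, so a real implementation might special-case them.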
gcanlin
left a comment
LGTM, but it's better to make sure the original layerwise offload hasn't regressed in performance on models like Wan2.2.
LGTM.
Tested on Wan2.2 and Qwen-Image; results look good.
Updated. Due to insufficient VRAM, only HunyuanVideo 1.5 was verified.
BTW, I remember some document saying that layerwise CPU offloading is not compatible with multi-card parallelism. In #2021, you fixed a bug for the compatibility of layerwise CPU offloading and HSDP. I'm a little confused: is layerwise CPU offloading compatible with most parallelism methods now?
```python
        name,
        model.__class__.__name__,
    )
    continue
```
I think if some attr_name does not exist, it should throw an error instead of skipping it with an error log. I recommend stricter parameter checking here, since a model's transformer block attribute names wouldn't change very often.
OK, changed this to fail fast.
Based on my tests, layerwise offloading with TP or SP works fine.
Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Signed-off-by: Lancer <402430575@qq.com>
# Conflicts:
#	tests/diffusion/offloader/test_layerwise_backend.py
#	tests/e2e/online_serving/test_zimage_expansion.py
Added/updated layerwise + Ulysses/Ring, layerwise + TP, and layerwise + HSDP coverage for FLUX.2-klein and Z-Image.
```
tests/e2e/online_serving/test_zimage_expansion.py::test_zimage[parallel_cachedit_fp8_ring2_tp2] PASSED [ 20%]
tests/e2e/online_serving/test_flux2_expansion.py::test_flux2_klein[omni_server0] PASSED [ 25%]
```
Signed-off-by: Lancer <maruixiang6688@gmail.com>
Can this be merged if it's ready? cc @hsliuustc0106 @RuixiangMa
…rameters/buffers staying on CPU (vllm-project#1486) Signed-off-by: Lancer <maruixiang6688@gmail.com> Signed-off-by: Lancer <402430575@qq.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com>
…rameters/buffers staying on CPU (vllm-project#1486) Signed-off-by: Lancer <maruixiang6688@gmail.com> Signed-off-by: Lancer <402430575@qq.com> Co-authored-by: Didan Deng <33117903+wtomin@users.noreply.github.com> Signed-off-by: bob-021206 <binyan_github@163.com>
Purpose
Some diffusion models (e.g., Flux, LongCat, Ovis) have two types of transformer blocks (e.g., `transformer_blocks` and `single_transformer_blocks`); the previous implementation only supported a single block type, limiting the effectiveness of layerwise offloading for these models.
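The multi-block collection could be sketched roughly as follows (the function and attribute defaults are illustrative; the real backend works on torch `nn.ModuleList`s, for which plain lists stand in here):

```python
# Illustrative sketch: gather transformer blocks from several block
# attributes so both block types participate in layerwise offloading.
def collect_offload_blocks(
    model,
    attr_names=("transformer_blocks", "single_transformer_blocks"),
):
    blocks = []
    for name in attr_names:
        attr = getattr(model, name, None)
        if attr is None:
            continue
        blocks.extend(attr)
    return blocks

class FluxLike:
    # Toy stand-in for a Flux-style model with two block types.
    def __init__(self):
        self.transformer_blocks = ["dual0", "dual1"]
        self.single_transformer_blocks = ["single0", "single1", "single2"]

print(len(collect_offload_blocks(FluxLike())))  # → 5
```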
Test Plan
Test Result
NVIDIA 4090 (24G)
```shell
vllm serve --model /data/models/black-forest-labs/FLUX* --omni --enable_layerwise_offload --port 8004
```

Offload vs. no offload
Since FLUX.1-dev, FLUX.2-klein-9B, et al. incur OOM without layerwise offloading, we use FLUX.2-klein-4B and Z-Image as representative examples to illustrate memory usage:
| Model | No offload | Offload |
| --- | --- | --- |
| FLUX.2-klein-4B | 19.7GB | 13.8GB |
| Z-Image | 22.7GB | 15.5GB |