[Feat][sleepmode] add omni sleepmode and ack protocol#2022
Hi @hsliuustc0106 @princepride @Gaohan123, this is the new sleep mode ACK PR, based on the latest code version. Thanks. I will resolve the merge conflicts.

About the NPU / XPU test details: @hsliuustc0106, please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9b8f612b67
@gcanlin @xuechendi, could you give it a try?

It seems there are a lot of conflicts that need to be resolved first.
Will try. Let's rebase it first.

Okay, I'll address them one by one. It seems more refactoring code has been merged in the last couple of days, so I need to adjust my logic further.
Force-pushed from 0fd1f29 to 5e77863.

Hi @hsliuustc0106 @gcanlin @xuechendi,
Thanks, @gcanlin. XPU is waiting on PT2.11 features for sleep/wakeup. We will catch up with this feature later.
gcanlin left a comment
Thanks for contributing! Please consider these suggestions to clean code first.
lishunyang12 left a comment
A few concerns:
- `all_reduce` after sleep may crash (`diffusion_worker.py`, `handle_sleep_task`): after `self.sleep()` offloads all weights and calls `empty_cache`, the code allocates a new GPU tensor for `all_reduce`. If Level 2 sleep discarded the CUDA memory pools, this allocation could fail. Consider doing the reduction before sleep, or using CPU tensors.
- Sleep fallback fires wake events (`diffusion_engine.py`, `sleep()`): the fallback sets `wake_events`, but `worker_busy_loop` interprets that as `{"type": "wake_up"}`. So the sleep fallback does the opposite of what was intended.
- Dead code in executor shutdown (`multiproc_executor.py`): it iterates over `wake_events`, but the `try` body is `pass`. Was `ev.set()` intended?
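One way to avoid the second and third issues above is to give sleep, wake-up, and shutdown their own signals, so a fallback can never be misread as a wake-up. The following is only a minimal sketch under that assumption: `WorkerSignals`, `next_task`, `request_sleep_fallback`, and `shutdown` are hypothetical names for illustration and are not taken from this PR's code.

```python
import threading
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkerSignals:
    """Hypothetical per-worker signal bundle (illustrative names only)."""
    sleep_event: threading.Event = field(default_factory=threading.Event)
    wake_event: threading.Event = field(default_factory=threading.Event)
    shutdown_event: threading.Event = field(default_factory=threading.Event)


def next_task(signals: WorkerSignals) -> Optional[dict]:
    """Translate pending signals into one task dict; shutdown wins."""
    if signals.shutdown_event.is_set():
        return {"type": "shutdown"}
    if signals.sleep_event.is_set():
        signals.sleep_event.clear()
        return {"type": "sleep"}
    if signals.wake_event.is_set():
        signals.wake_event.clear()
        return {"type": "wake_up"}
    return None


def request_sleep_fallback(all_signals: List[WorkerSignals]) -> None:
    """Fallback path: set the sleep signal, never the wake signal."""
    for signals in all_signals:
        signals.sleep_event.set()


def shutdown(all_signals: List[WorkerSignals]) -> None:
    """Actually set the events instead of leaving the loop body as `pass`."""
    for signals in all_signals:
        signals.shutdown_event.set()
        # Also set the wake signal so a worker blocked on it can exit.
        signals.wake_event.set()
```

With separate events, a worker that polls `next_task` sees `{"type": "sleep"}` after the fallback and `{"type": "shutdown"}` after shutdown, regardless of what else is pending.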
For the XPU platform, sleep mode is not fully ready on the vLLM side. We depend on torch 2.11 APIs, and a draft PR is ready at 37149. So please go ahead first and we will cover XPU later.
lishunyang12 left a comment
Left a few comments. The dead code in shutdown and the getattr default for sleep mode need fixing.

Hi @gcanlin, I have updated the code based on the review suggestions. It is ready for testing on NPU resources.
Thanks! I will test it today.

@gcanlin We need to accelerate the progress. Can we merge this feature before v0.18.0?
I think we can. Will be done tonight on my side.

@Flink-ddd @princepride Hey, could you please give me an example? I'm not familiar with this feature. Is

Hi @gcanlin, yes, that command is correct:
I notice that adapting NPU will take some effort, so I will unblock this PR. For NPU I will submit a follow-up PR later. @princepride Please check the code details :)
Force-pushed from 04c27a6 to 1b278bc.
Signed-off-by: vensen <vensenmu@gmail.com>

@hsliuustc0106 @Gaohan123, I have updated the code based on the review advice, and all CI checks have passed. Do you have any suggestions for the next step?
solved
Purpose
This PR implements Omni Sleep Mode (Tiered Memory Orchestration) for NVIDIA, AMD, XPU, NPU, and other platforms, as proposed in RFC #1316.
Key enhancements include:
- Tiered Offloading Logic: Support for Level 1 (Weight offloading) and Level 2 (Full de-mapping) sleep stages.
- Hardware Abstraction Layer: Unified VRAM auditing and reclamation logic across CUDA and ROCm.
- Deterministic Orchestration: Ensuring physical memory release before co-located task execution to prevent OOM.
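The interplay between the two sleep levels and the release-before-run guarantee can be sketched with a toy model. This is not the PR's implementation: `ToyEngine`, `SleepAck`, and `run_colocated` are hypothetical names, memory is tracked as plain numbers rather than real device allocations, and the split into "weights" and "pools" is an assumption made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Callable, TypeVar

T = TypeVar("T")


class SleepLevel(IntEnum):
    LEVEL_1 = 1  # weight offloading
    LEVEL_2 = 2  # full de-mapping


@dataclass
class SleepAck:
    """Hypothetical ACK returned only after memory has been released."""
    level: SleepLevel
    resident_gib: float


class ToyEngine:
    """Stand-in for an engine; models device memory as two numbers."""

    def __init__(self, weights_gib: float, pools_gib: float) -> None:
        self.weights_gib = weights_gib
        self.pools_gib = pools_gib
        self.weights_on_device = True
        self.pools_mapped = True

    def resident_gib(self) -> float:
        total = self.weights_gib if self.weights_on_device else 0.0
        total += self.pools_gib if self.pools_mapped else 0.0
        return total

    def sleep(self, level: SleepLevel) -> SleepAck:
        self.weights_on_device = False      # both levels offload weights
        if level >= SleepLevel.LEVEL_2:
            self.pools_mapped = False       # Level 2 also unmaps the pools
        return SleepAck(level, self.resident_gib())

    def wake_up(self) -> None:
        self.weights_on_device = True
        self.pools_mapped = True


def run_colocated(engine: ToyEngine, level: SleepLevel, capacity_gib: float,
                  needed_gib: float, task: Callable[[], T]) -> T:
    """Run `task` only after the ACK confirms enough memory is free."""
    ack = engine.sleep(level)
    try:
        if capacity_gib - ack.resident_gib < needed_gib:
            raise MemoryError("not enough memory released; refusing to start")
        return task()
    finally:
        engine.wake_up()  # restore the engine whether or not the task ran
```

The design point is that the co-located task is gated on the acknowledged resident size, not on the sleep request having been sent, so an insufficient release fails fast instead of surfacing later as an OOM.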
Test Plan
Six unit tests were run on both AMD and NVIDIA systems. The test file is tests/entrypoints/test_omni_sleep_mode.py. The test scenarios combine ACK signals, LLM and generation, and the sleep and wake-up states of Diffusion.
Two tests deserve particular attention:
- Unit Test 4: inference consistency and bit-level precision verification after Diffusion wake-up.
- Unit Test 6: full-cycle audit of the Diffusion memory lifecycle. It was run on different platforms to verify the accuracy and usability of the Diffusion model across its entire lifecycle, from active to sleep to wake-up.
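A bit-level consistency check of the kind Unit Test 4 describes can be sketched by comparing digests of the raw output bytes before sleep and after wake-up. This is an illustrative sketch only: `fake_generate` is a hypothetical stand-in for a seeded generation call, and the actual test in this PR may compare tensors differently.

```python
import hashlib
import random
import struct
from typing import List, Sequence


def fake_generate(seed: int, n: int = 8) -> List[float]:
    """Hypothetical stand-in for a seeded, deterministic generation call."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def digest(values: Sequence[float]) -> str:
    """Hash the exact IEEE-754 bytes, so even 0.0 vs -0.0 is detected."""
    raw = struct.pack(f"<{len(values)}d", *values)
    return hashlib.sha256(raw).hexdigest()


def outputs_bitwise_equal(before: Sequence[float], after: Sequence[float]) -> bool:
    """True only if both outputs are identical at the byte level."""
    return len(before) == len(after) and digest(before) == digest(after)
```

Comparing bytes rather than using a numeric tolerance is stricter than an approximate closeness check, which is the point of a bit-level test: any drift introduced by offloading and restoring weights shows up as a digest mismatch.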
Test Result
NVIDIA A6000 TP = 2 (Core pytest test output)
AMD MI300X TP = 2 (Core pytest test output)
[NVIDIA A6000] Coordinated Cross-Device VRAM Audit
Validating coordinated VRAM auditing for heterogeneous engines (LLM-Talker and Diffusion) on NVIDIA A6000, demonstrating seamless parallel weight offloading across inter-process components.
[NVIDIA A6000] Diffusion VRAM Lifecycle Audit
Full lifecycle audit of a single Diffusion engine on NVIDIA A6000, confirming efficient physical VRAM reclamation in Level 2 Sleep Mode and successful partial weight recovery.
[AMD MI300X] Coordinated Cross-Device VRAM Audit
Demonstrating multi-vendor compatibility of the coordinated sleep mechanism on AMD MI300X (ROCm), showing deterministic VRAM scavenging and state synchronization between heterogeneous engines in a TP environment.
[AMD MI300X] Diffusion VRAM Lifecycle Audit
Auditing dynamic VRAM evolution on AMD MI300X to verify that the Deep Sleep mechanism maintains high-precision physical resource reclamation even within large-capacity memory architectures.
Note: The stable, non-zero VRAM floor observed (approx. 2.007 GiB on MI300X / 1.2 GiB on A6000) represents the mandatory driver runtime footprint and persistent metadata required to ensure deterministic, near-instantaneous recovery after deep sleep.
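Because of this non-zero floor, an audit is more meaningful when it measures reclamation against the floor rather than against zero. A minimal sketch of that bookkeeping, assuming readings in GiB taken while active and while asleep; `VramAudit` and its fields are hypothetical names, and the tolerance default is an arbitrary illustrative value:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VramAudit:
    """Hypothetical record of one sleep cycle's VRAM readings, in GiB."""
    active_gib: float   # usage while the engine is active
    asleep_gib: float   # usage after the sleep ACK
    floor_gib: float    # expected driver/runtime floor for the platform

    @property
    def reclaimed_gib(self) -> float:
        return self.active_gib - self.asleep_gib

    @property
    def reclaimed_fraction(self) -> float:
        """Share of the reclaimable memory (above the floor) that was freed."""
        reclaimable = self.active_gib - self.floor_gib
        return self.reclaimed_gib / reclaimable if reclaimable > 0 else 0.0

    def at_floor(self, tol_gib: float = 0.1) -> bool:
        """True if the asleep reading is within `tol_gib` of the floor."""
        return abs(self.asleep_gib - self.floor_gib) <= tol_gib
```

The `floor_gib` value would be platform-specific, for example the approximate floors quoted in the note above.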