[Feat][sleepmode] add omni sleepmode and ack protocol#2022

Merged
Gaohan123 merged 7 commits into
vllm-project:mainfrom
Flink-ddd:feat/omni-sleepmode-v1
Apr 21, 2026

Conversation

@Flink-ddd

@Flink-ddd Flink-ddd commented Mar 19, 2026

Contributor

Purpose

This PR implements Omni Sleep Mode (Tiered Memory Orchestration) for NVIDIA, AMD, XPU, NPU, and other platforms, as proposed in RFC #1316.

Key enhancements include:

Tiered Offloading Logic: Support for Level 1 (Weight offloading) and Level 2 (Full de-mapping) sleep stages.

Hardware Abstraction Layer: Unified VRAM auditing and reclamation logic across CUDA and ROCm.

Deterministic Orchestration: Ensuring physical memory release before co-located task execution to prevent OOM.
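
The tiered levels and the status transitions that appear in the logs further down (RUNNING, TRANSITIONING, SLEEPING) can be sketched as a small state machine. This is only an illustration: `StageLifecycle`, `SleepLevel`, and `StageStatus` are hypothetical names, not the classes added by this PR, and the real stage dispatches tasks and waits for ACKs where the comments indicate.

```python
from enum import Enum


class StageStatus(Enum):
    RUNNING = "RUNNING"
    TRANSITIONING = "TRANSITIONING"
    SLEEPING = "SLEEPING"


class SleepLevel(Enum):
    WEIGHT_OFFLOAD = 1  # Level 1: weight offloading
    FULL_DEMAP = 2      # Level 2: full de-mapping


class StageLifecycle:
    """Hypothetical tracker for one stage's sleep/wake status."""

    def __init__(self) -> None:
        self.status = StageStatus.RUNNING
        self.level = None
        self.history = [self.status]

    def _move(self, new_status: StageStatus) -> None:
        self.status = new_status
        self.history.append(new_status)

    def sleep(self, level: SleepLevel) -> None:
        if self.status is not StageStatus.RUNNING:
            raise RuntimeError(f"cannot sleep from {self.status.name}")
        self._move(StageStatus.TRANSITIONING)
        # A real stage would dispatch the sleep task and await worker ACKs here.
        self.level = level
        self._move(StageStatus.SLEEPING)

    def wake_up(self) -> None:
        if self.status is not StageStatus.SLEEPING:
            raise RuntimeError(f"cannot wake up from {self.status.name}")
        # A real stage would dispatch the wake-up task and await the ACK here.
        self.level = None
        self._move(StageStatus.RUNNING)
```

Rejecting a sleep request from any status other than RUNNING keeps a second request from racing with one that is still waiting for its ACKs.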

Test Plan

Six unit tests were run on both AMD and NVIDIA systems. They are defined in tests/entrypoints/test_omni_sleep_mode.py. The test scenarios combine ACK signals, LLM and generation, and the sleep and wake-up states of Diffusion.

The following tests deserve particular attention:
Unit Test 4: inference consistency and bit-level precision verification after Diffusion wake-up.
Unit Test 6: full-cycle audit of the Diffusion memory lifecycle.

Unit Test 6 was run on different platforms to verify the accuracy and usability of the Diffusion model throughout its entire lifecycle, from active to sleep to wake-up.
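
A bit-level check of the kind Unit Test 4 describes can be sketched by comparing fingerprints of the raw output bytes before sleep and after wake-up. The helper names below are hypothetical, and the sketch assumes the outputs have already been flattened to float32 values; it is not the test code from this PR.

```python
import hashlib
import struct


def fingerprint(values) -> str:
    """SHA-256 over the little-endian float32 encoding of `values`."""
    raw = struct.pack(f"<{len(values)}f", *values)
    return hashlib.sha256(raw).hexdigest()


def assert_bit_identical(before, after) -> None:
    """Raise if two flattened outputs differ in any bit of their float32 form."""
    if fingerprint(before) != fingerprint(after):
        raise AssertionError("outputs differ at the bit level after wake-up")
```

Unlike a tolerance-based comparison, this fails on any single-bit difference, which is the stricter property a lossless weight restore should satisfy.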

  1. e2e test

Test Result

NVIDIA A6000 TP = 2 (Core pytest test output)

tests/entrypoints/test_omni_sleep_mode.py::TestOmniSleepMode::test_diffusion_vram_lifecycle_audit 
=== PRE-TEST GPU CLEANUP ===
GPU cleanup disabled
INFO 02-24 09:58:39 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
--- Running test: test_diffusion_vram_lifecycle_audit
INFO 02-24 09:58:39 [omni.py:117] Initializing stages for model: ByteDance-Seed/BAGEL-7B-MoT
..................................................
[Stage-0] INFO 02-24 09:59:08 [async_omni_diffusion.py:113] AsyncOmniDiffusion initialized with model: black-forest-labs/FLUX.2-klein-4B
INFO 02-24 09:59:08 [omni.py:334] [AsyncOrchestrator] Stage-0 reported ready
INFO 02-24 09:59:08 [omni.py:360] [AsyncOrchestrator] All stages initialized successfully
WARNING 02-24 09:59:08 [async_omni.py:274] [AsyncOrchestrator] No LLM stage found, processors will not be available. This may cause issues with OpenAIServingModels.
INFO 02-24 09:59:08 [omni_stage.py:566] [Stage-0] Status transitioned to: TRANSITIONING
INFO 02-24 09:59:08 [omni_stage.py:570] [Stage-0] Submitting SLEEP task (Level: 2)
INFO 02-24 09:59:08 [async_omni.py:894] [AsyncOrchestrator] Sleep initiated. Awaiting confirmation from 1 workers...
[Stage-0] INFO 02-24 09:59:08 [async_omni_diffusion.py:309] [Entrypoint] Relaying Sleep Task: 05eda0a7-27f5-4843-ad1a-24291b0dc7d4 (Level: 2)
[Stage-0] INFO 02-24 09:59:08 [diffusion_engine.py:401] [Diffusion Engine Relay] Dispatching Sleep Task 05eda0a7-27f5-4843-ad1a-24291b0dc7d4 (Level: 2)
[Stage-0] INFO 02-24 09:59:08 [diffusion_worker.py:221] [Diffusion Worker 0] Handshake Received: Task 05eda0a7-27f5-4843-ad1a-24291b0dc7d4, Level 2
[Stage-0] INFO 02-24 09:59:08 [cumem.py:213] CuMemAllocator: sleep freed 14.89 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 14.89 GiB is discarded directly.
[Stage-0] INFO 02-24 09:59:09 [diffusion_worker.py:198] [Diffusion Worker 0] Level 2 Sleep: Freed 18.40 GiB. 0.62GiB memory is still in use.
[Stage-0] INFO 02-24 09:59:09 [diffusion_worker.py:239] [Diffusion Worker 0]: ACK emitted. Freed 18.40 GiB.
[Stage-0] INFO 02-24 09:59:09 [omni_stage.py:1323] [Stage-0] Sleep ACKs forwarded to Orchestrator
INFO 02-24 09:59:09 [async_omni.py:641] [AsyncOrchestrator] Intercepted wrapped ACK for task 05eda0a7-27f5-4843-ad1a-24291b0dc7d4 from stage-0
INFO 02-24 09:59:09 [omni_stage.py:566] [Stage-0] Status transitioned to: SLEEPING
INFO 02-24 09:59:10 [omni_stage.py:579] [Stage-0] Submitting WAKE_UP task
INFO 02-24 09:59:10 [async_omni.py:921] [AsyncOrchestrator] Wake-up initiated. Awaiting confirmation from 1 workers...
[Stage-0] INFO 02-24 09:59:10 [async_omni_diffusion.py:318] [Entrypoint] Relaying WakeUp Task: c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43
[Stage-0] INFO 02-24 09:59:10 [diffusion_engine.py:424] [Diffusion Engine Relay] Dispatching Wake-up Task c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43 to workers...
[Stage-0] INFO 02-24 09:59:10 [diffusion_worker.py:250] [Diffusion Worker 0] Responding to Wake-up Task: c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43
[Stage-0] INFO 02-24 09:59:10 [diffusion_worker.py:210] [Diffusion Worker 0] Wake-up complete.
[Stage-0] INFO 02-24 09:59:10 [diffusion_worker.py:259] [Diffusion Worker 0] Wake-up confirmed.
INFO 02-24 09:59:10 [async_omni.py:641] [AsyncOrchestrator] Intercepted wrapped ACK for task c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43 from stage-0
INFO 02-24 09:59:10 [omni_stage.py:566] [Stage-0] Status transitioned to: RUNNING
INFO 02-24 09:59:11 [async_omni.py:378] [AsyncOrchestrator] Entering scheduling loop: stages=1, final_stage=0
[Stage-0] INFO 02-24 09:59:11 [manager.py:538] Deactivating all adapters: 0 layers
[Stage-0] WARNING 02-24 09:59:11 [kv_transfer_manager.py:356] No connector available for receiving KV cache
[Stage-0] INFO 02-24 09:59:11 [diffusion_engine.py:92] Generation completed successfully.
[Stage-0] INFO 02-24 09:59:11 [diffusion_engine.py:110] Post-processing completed in 0.0583 seconds
PASSED[Stage-0] INFO 02-24 09:59:26 [diffusion_worker.py:400] Worker 0: Received shutdown message
[Stage-0] INFO 02-24 09:59:26 [diffusion_worker.py:421] event loop terminated.
[Stage-0] INFO 02-24 09:59:26 [diffusion_worker.py:452] Worker 0: Shutdown complete.
[Stage-0] INFO 02-24 09:59:29 [async_omni_diffusion.py:213] AsyncOmniDiffusion closed
[Stage-0] INFO 02-24 09:59:29 [omni_stage.py:1428] Stage worker exiting
GPU cleanup disabled


=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 6 passed, 2 warnings in 309.81s (0:05:09) ===================
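
The handshake visible in this log (a task ID is dispatched, the orchestrator waits for a fixed number of worker confirmations, then the stage status changes) can be sketched with asyncio futures. `AckResolver` and its methods are hypothetical names for illustration; the orchestrator in this PR may track ACKs differently.

```python
import asyncio
import uuid


class AckResolver:
    """Hypothetical orchestrator-side tracker for sleep/wake ACKs."""

    def __init__(self) -> None:
        # task_id -> (expected ACK count, freed bytes received so far, future)
        self._pending = {}

    def register(self, expected_acks: int) -> str:
        """Create a task ID and a future that resolves once all ACKs arrive."""
        task_id = str(uuid.uuid4())
        future = asyncio.get_running_loop().create_future()
        self._pending[task_id] = (expected_acks, [], future)
        return task_id

    def on_ack(self, task_id: str, freed_bytes: int) -> None:
        entry = self._pending.get(task_id)
        if entry is None:
            return  # Unknown or already-resolved task: ignore the stray ACK.
        expected, received, future = entry
        received.append(freed_bytes)
        if len(received) >= expected and not future.done():
            future.set_result(sum(received))

    async def wait(self, task_id: str, timeout: float) -> int:
        """Return total freed bytes, or raise a timeout error if ACKs are missing."""
        _, _, future = self._pending[task_id]
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(task_id, None)


async def demo() -> int:
    resolver = AckResolver()
    task_id = resolver.register(expected_acks=2)
    loop = asyncio.get_running_loop()
    # Simulate two workers acknowledging shortly after dispatch.
    loop.call_later(0.01, resolver.on_ack, task_id, 3 * 1024**3)
    loop.call_later(0.02, resolver.on_ack, task_id, 5 * 1024**3)
    return await resolver.wait(task_id, timeout=1.0)
```

Summing the reported byte counts on the host side means the orchestrator never has to allocate device memory to aggregate the result.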

AMD MI300X TP = 2 (Core pytest test output)

WARNING 03-19 18:07:36 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3.12
cachedir: .pytest_cache
rootdir: /app/vllm-omni
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.12.1
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 6 items
................................................................................................................................................................
tests/entrypoints/test_omni_sleep_mode.py::TestOmniSleepMode::test_cross_device_cleanup 
=== PRE-TEST GPU CLEANUP ===
GPU cleanup disabled
INFO 03-19 18:09:27 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
--- Running test: test_cross_device_cleanup
INFO 03-19 18:09:27 [weight_utils.py:50] Using model weights format ['*']
INFO 03-19 18:09:27 [omni.py:195] Initializing stages for model: ByteDance-Seed/BAGEL-7B-MoT
INFO 03-19 18:09:27 [omni.py:329] No omni_master_address provided, defaulting to localhost (127.0.0.1)
INFO 03-19 18:09:27 [omni.py:269] [AsyncOrchestrator] Using stages provided directly via arguments
INFO 03-19 18:09:27 [initialization.py:35] No OmniTransferConfig provided
INFO 03-19 18:09:27 [omni.py:363] [AsyncOrchestrator] Loaded 1 stages
INFO 03-19 18:09:27 [multiproc_executor.py:88] Starting server...
WARNING 03-19 18:09:34 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
WARNING 03-19 18:09:34 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
INFO 03-19 18:09:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-19 18:09:39 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 03-19 18:09:39 [diffusion_worker.py:424] Worker 0 created result MessageQueue
INFO 03-19 18:09:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-19 18:09:39 [vllm.py:754] Asynchronous scheduling is enabled.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 03-19 18:09:39 [diffusion_worker.py:122] Worker 0: Initialized device and distributed environment.
INFO 03-19 18:09:39 [diffusion_worker.py:122] Worker 1: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-19 18:09:39 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 03-19 18:09:39 [parallel_state.py:630] SP group details for rank 1: sp_group=[1], ulysses_group=[1], ring_group=[1]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-19 18:09:39 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 03-19 18:09:39 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-19 18:09:40 [weight_utils.py:50] Using model weights format ['*']
INFO 03-19 18:09:40 [weight_utils.py:50] Using model weights format ['*']
INFO 03-19 18:09:42 [weight_utils.py:618] No diffusion_pytorch_model.safetensors.index.json found in remote.

Multi-thread loading shards:   0% Completed | 0/2 [00:00<?, ?it/s]
INFO 03-19 18:09:42 [weight_utils.py:618] No diffusion_pytorch_model.safetensors.index.json found in remote.

Multi-thread loading shards:  50% Completed | 1/2 [00:01<00:01,  1.10s/it]

Multi-thread loading shards: 100% Completed | 2/2 [00:08<00:00,  4.66s/it]

Multi-thread loading shards: 100% Completed | 2/2 [00:08<00:00,  4.13s/it]

INFO 03-19 18:09:50 [pipeline_bagel.py:743] BagelPipeline weight filter kept 1466/1467 tensors (shape mismatches seen: 0)
INFO 03-19 18:09:50 [pipeline_bagel.py:743] BagelPipeline weight filter kept 1466/1467 tensors (shape mismatches seen: 0)
INFO 03-19 18:09:51 [diffusers_loader.py:321] Loading weights took 9.68 seconds
INFO 03-19 18:09:52 [diffusers_loader.py:321] Loading weights took 10.06 seconds
INFO 03-19 18:09:52 [diffusion_model_runner.py:134] Model loading took 14.2129 GiB and 12.788842 seconds
INFO 03-19 18:09:52 [diffusion_model_runner.py:139] Model runner: Model loaded successfully.
INFO 03-19 18:09:52 [diffusion_model_runner.py:173] Model runner: Initialization complete.
WARNING 03-19 18:09:52 [gpu_memory_utils.py:88] NVML init failed, will use profiling fallback: NVML Shared Library Not Found
INFO 03-19 18:09:52 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 03-19 18:09:52 [diffusion_worker.py:91] Worker 0: Initialization complete.
INFO 03-19 18:09:52 [diffusion_worker.py:589] Worker 0: Scheduler loop started.
INFO 03-19 18:09:52 [diffusion_worker.py:495] Worker 0 ready to receive requests via shared memory
INFO 03-19 18:09:53 [diffusion_model_runner.py:134] Model loading took 14.2129 GiB and 13.289822 seconds
INFO 03-19 18:09:53 [diffusion_model_runner.py:139] Model runner: Model loaded successfully.
INFO 03-19 18:09:53 [diffusion_model_runner.py:173] Model runner: Initialization complete.
WARNING 03-19 18:09:53 [gpu_memory_utils.py:88] NVML init failed, will use profiling fallback: NVML Shared Library Not Found
INFO 03-19 18:09:53 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:1, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 03-19 18:09:53 [diffusion_worker.py:91] Worker 1: Initialization complete.
INFO 03-19 18:09:53 [diffusion_worker.py:589] Worker 1: Scheduler loop started.
INFO 03-19 18:09:53 [diffusion_worker.py:495] Worker 1 ready to receive requests via shared memory
INFO 03-19 18:09:53 [scheduler.py:42] SyncScheduler initialized result MessageQueue
INFO 03-19 18:09:53 [diffusion_engine.py:415] dummy run to warm up the model
INFO 03-19 18:09:53 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-19 18:09:53 [kv_transfer_manager.py:479] Request has no ID, cannot receive KV cache
INFO 03-19 18:09:53 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-19 18:09:53 [kv_transfer_manager.py:479] Request has no ID, cannot receive KV cache
[aiter] import [module_fmha_v3_varlen_fwd] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_fmha_v3_varlen_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_varlen_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, min_seqlen_q: int, dropout_p: float, softmax_scale: float, logits_soft_cap: float, zero_tensors: bool, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, block_table: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None, cu_seqlens_q_padded: Optional[torch.Tensor] = None, cu_seqlens_k_padded: Optional[torch.Tensor] = None) -> List[torch.Tensor]
[aiter] hipModuleLoad: /usr/local/lib/python3.12/dist-packages/aiter_meta/hsa//gfx942/fmha_v3_fwd/MI300/fwd_hd128_bf16_causal_rtna_group.co GetFunction: _ZN5aiter37fmha_fwd_hd128_bf16_causal_rtna_groupE Success
[aiter] import [module_fmha_v3_varlen_fwd] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_fmha_v3_varlen_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_varlen_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, min_seqlen_q: int, dropout_p: float, softmax_scale: float, logits_soft_cap: float, zero_tensors: bool, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, block_table: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None, cu_seqlens_q_padded: Optional[torch.Tensor] = None, cu_seqlens_k_padded: Optional[torch.Tensor] = None) -> List[torch.Tensor]
[aiter] hipModuleLoad: /usr/local/lib/python3.12/dist-packages/aiter_meta/hsa//gfx942/fmha_v3_fwd/MI300/fwd_hd128_bf16_causal_rtna_group.co GetFunction: _ZN5aiter37fmha_fwd_hd128_bf16_causal_rtna_groupE Success
INFO 03-19 18:10:04 [omni.py:444] [AsyncOrchestrator] Inline diffusion mode active – stage worker subprocess bypassed
INFO 03-19 18:10:04 [async_omni.py:229] [AsyncOrchestrator] Pro-active link: Inline Engine -> Stage-0
INFO 03-19 18:10:04 [omni_stage.py:700] [Stage-0] Status transitioned to: TRANSITIONING
INFO 03-19 18:10:04 [omni_stage.py:704] [Stage-0] Submitting SLEEP task (Level: 1)
INFO 03-19 18:10:04 [diffusion_engine.py:459] [Diffusion Engine] Attempting sleep 8b99f389-d23b-4514-8711-9de42f734922 (Level: 1)...
INFO 03-19 18:10:04 [diffusion_engine.py:463] [Diffusion Engine] Physical Sleep Command 8b99f389-d23b-4514-8711-9de42f734922 dispatched via MQ.
INFO 03-19 18:10:04 [async_omni.py:1187] [AsyncOrchestrator] Sleep initiated (Task: 8b99f389-d23b-4514-8711-9de42f734922). Awaiting 1 ACKs...
INFO 03-19 18:10:04 [async_omni.py:932] Orchestrator is polling Stage-0
INFO 03-19 18:10:04 [diffusion_worker.py:268] [Worker 0] Handshake Received: Task 8b99f389-d23b-4514-8711-9de42f734922
INFO 03-19 18:10:04 [diffusion_worker.py:268] [Worker 1] Handshake Received: Task 8b99f389-d23b-4514-8711-9de42f734922
INFO 03-19 18:10:07 [cumem.py:216] CuMemAllocator: sleep freed 14.03 GiB memory in total, of which 14.03 GiB is backed up in CPU and the rest 0.00 GiB is discarded directly.
INFO 03-19 18:10:08 [diffusion_worker.py:241] [Worker 1] Sleep Level 1 scavenged 14.03 GiB from GPU.
INFO 03-19 18:10:08 [diffusion_worker.py:275] [Worker 1] Preparing ACK: freed_bytes=15.04 GiB.
INFO 03-19 18:10:08 [cumem.py:216] CuMemAllocator: sleep freed 14.03 GiB memory in total, of which 14.03 GiB is backed up in CPU and the rest 0.00 GiB is discarded directly.
INFO 03-19 18:10:08 [diffusion_worker.py:241] [Worker 0] Sleep Level 1 scavenged 14.03 GiB from GPU.
INFO 03-19 18:10:08 [diffusion_worker.py:275] [Worker 0] Preparing ACK: freed_bytes=15.04 GiB.
INFO 03-19 18:10:08 [diffusion_worker.py:298] [Worker 0] ACK emitted. Freed 30.07 GiB.
INFO 03-19 18:10:09 [async_omni.py:954] [AsyncOrchestrator] Intercepted wrapped ACK for task 8b99f389-d23b-4514-8711-9de42f734922 from stage-0
INFO 03-19 18:10:09 [async_omni.py:156] [Resolver] Task 8b99f389-d23b-4514-8711-9de42f734922 progress: 1/1 ACKs received.
INFO 03-19 18:10:09 [async_omni.py:163] [Resolver] Task 8b99f389-d23b-4514-8711-9de42f734922 completed successfully in 4.44s.
INFO 03-19 18:10:09 [omni_stage.py:700] [Stage-0] Status transitioned to: SLEEPING
PASSEDGPU cleanup disabled
..............................................................................................................................
------------------------------ Captured log call -------------------------------
ERROR    OmniTest:test_omni_sleep_mode.py:254 Coordinated test failed:
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/entrypoints/test_omni_sleep_mode.py::TestOmniSleepMode::test_coordinated_cross_device
============= 6 passed, 2 warnings in 426.55s (0:07:06) ==============
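
The `CuMemAllocator` lines in the two logs above make the difference between the levels easy to audit: the Level 1 run reports everything backed up in CPU, while the Level 2 run reports everything discarded. A minimal sketch for extracting those numbers from captured log text follows; the regular expression is written against the message format shown here, and `SleepAudit` and `parse_sleep_line` are hypothetical helper names.

```python
import re
from typing import NamedTuple, Optional

# Matches the allocator message format that appears in the logs above.
_SLEEP_LINE = re.compile(
    r"sleep freed (?P<total>\d+(?:\.\d+)?) GiB memory in total, "
    r"of which (?P<backed>\d+(?:\.\d+)?) GiB is backed up in CPU "
    r"and the rest (?P<discarded>\d+(?:\.\d+)?) GiB is discarded directly"
)


class SleepAudit(NamedTuple):
    total_gib: float
    backed_up_gib: float
    discarded_gib: float


def parse_sleep_line(line: str) -> Optional[SleepAudit]:
    """Return the freed/backed-up/discarded figures, or None if no match."""
    match = _SLEEP_LINE.search(line)
    if match is None:
        return None
    return SleepAudit(
        float(match["total"]),
        float(match["backed"]),
        float(match["discarded"]),
    )
```

A test could then assert, for example, that `backed_up_gib` equals `total_gib` after a Level 1 sleep and is zero after a Level 2 sleep.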

[NVIDIA A6000] Coordinated Cross-Device VRAM Audit

Validating coordinated VRAM auditing for heterogeneous engines (LLM-Talker and Diffusion) on NVIDIA A6000, demonstrating seamless parallel weight offloading across inter-process components.

Screenshot 2026-03-20 at 01 08 52

[NVIDIA A6000] Diffusion VRAM Lifecycle Audit

Full lifecycle audit of a single Diffusion engine on NVIDIA A6000, confirming efficient physical VRAM reclamation in Level 2 Sleep Mode and successful partial weight recovery.

Screenshot 2026-03-20 at 01 09 12

[AMD MI300X] Coordinated Cross-Device VRAM Audit

Demonstrating multi-vendor compatibility of the coordinated sleep mechanism on AMD MI300X (ROCm), showing deterministic VRAM scavenging and state synchronization between heterogeneous engines in a TP environment.

Screenshot 2026-03-20 at 01 08 26

[AMD MI300X] Diffusion VRAM Lifecycle Audit

Auditing dynamic VRAM evolution on AMD MI300X to verify that the Deep Sleep mechanism maintains high-precision physical resource reclamation even within large-capacity memory architectures.

Screenshot 2026-03-20 at 01 08 40

Note: The stable, non-zero VRAM floor observed (approx. 2.007 GiB on MI300X / 1.2 GiB on A6000) represents the mandatory driver runtime footprint and persistent metadata required to ensure deterministic, near-instantaneous recovery after deep sleep.
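
A floor check like the one described in the note can be sketched as a before/after comparison of device memory in use. The sketch below takes a `probe` callable that returns `(free_bytes, total_bytes)`, the same shape that `torch.cuda.mem_get_info()` returns, so it runs without a GPU; `used_gib`, `audit_sleep`, and the `max_floor_gib` threshold are hypothetical and not part of this PR.

```python
from typing import Callable, Tuple

GIB = 1024 ** 3


def used_gib(probe: Callable[[], Tuple[int, int]]) -> float:
    """Device memory in use, in GiB; `probe` returns (free_bytes, total_bytes)."""
    free_bytes, total_bytes = probe()
    return (total_bytes - free_bytes) / GIB


def audit_sleep(
    probe: Callable[[], Tuple[int, int]],
    sleep_fn: Callable[[], None],
    max_floor_gib: float,
) -> float:
    """Run `sleep_fn`, check the residual floor, and return the GiB freed."""
    before = used_gib(probe)
    sleep_fn()
    after = used_gib(probe)
    if after > max_floor_gib:
        raise AssertionError(
            f"residual usage {after:.2f} GiB exceeds floor {max_floor_gib:.2f} GiB"
        )
    return before - after
```

The threshold should be chosen per platform, since the note reports a different floor on each device.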

  1. Introduces Omni Sleep Mode to enable deterministic physical VRAM reclamation and restoration across standalone Diffusion and multi-stage models, with full support for tensor_parallel_size=1,2
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /app/vllm-omni
configfile: pyproject.toml
plugins: mock-3.15.1, asyncio-1.3.0, anyio-4.12.1
asyncio: mode=Mode.AUTO
collected 4 items

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_diffusion_model_sleep_tp[1] 
--- Running test: test_diffusion_model_sleep_tp[1]
[TP=1] Triggering Level 2 Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 80.21 GiB.
[TP=1] VRAM Reserved: 29.97 GiB is still in use.
[TP=1] Waking up...
Diffusion TP=1 Lifecycle OK
PASSED

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_diffusion_model_sleep_tp[2] 
--- Running test: test_diffusion_model_sleep_tp[2]
[TP=2] Triggering Level 2 Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 80.21 GiB.
[TP=2] VRAM Reserved: 29.98 GiB is still in use.
[TP=2] Waking up...
Diffusion TP=2 Lifecycle OK
PASSED

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_omni_model_sleep_tp[1] 
--- Running test: test_omni_model_sleep_tp[1]
Testing Stage 0 (LLM) Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 79.89 GiB.
Testing Stage 1 (Diffusion) Sleep...
INFO: [Diffusion Worker 0] Sleep Level 2: physically freed 27.61 GiB, 3.77 GiB is still use.
Omni Multi-stage TP=1 Lifecycle OK
PASSED

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_omni_model_sleep_tp[2] 
--- Running test: test_omni_model_sleep_tp[2]
Testing Stage 0 (LLM) Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 79.72 GiB.
Testing Stage 1 (Diffusion) Sleep...
INFO: [Diffusion Worker 0] Sleep Level 2: physically freed 27.61 GiB, 3.77 GiB is still use.
Omni Multi-stage TP=2 Lifecycle OK
PASSED

================== 4 passed, 6 warnings in 399.89s (0:06:39) ===================

@Flink-ddd

Flink-ddd commented Mar 19, 2026

Contributor Author

Hi @hsliuustc0106 @princepride @Gaohan123, this is the new sleep mode ACK PR, based on the latest code version. Thanks.

I will resolve the merge conflicts.

@Flink-ddd

Contributor Author

About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

@Flink-ddd Flink-ddd marked this pull request as ready for review March 19, 2026 18:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b8f612b67

Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
Comment thread vllm_omni/entrypoints/async_omni.py Outdated
Comment thread vllm_omni/diffusion/diffusion_engine.py Outdated
@hsliuustc0106

Collaborator

About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

@gcanlin @xuechendi can have a try?

@hsliuustc0106

Collaborator

It seems there are a lot of conflicts that need to be resolved first.

@gcanlin

gcanlin commented Mar 20, 2026

Collaborator

About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

@gcanlin @xuechendi can have a try?

Will try. Let's rebase it first.

@Flink-ddd

Contributor Author

Okay, I'll address them one by one. It seems more refactoring has been merged in the last couple of days, so I need to further adjust my logic.

@Flink-ddd Flink-ddd force-pushed the feat/omni-sleepmode-v1 branch 2 times, most recently from 0fd1f29 to 5e77863 Compare March 20, 2026 11:43
@Flink-ddd

Contributor Author

Hi @hsliuustc0106 @gcanlin @xuechendi,
All conflicts are resolved and the branch is synchronized with main. The logic has been verified on AMD and NVIDIA, and the PR is ready for XPU and NPU resource tests. Please let me know if you have any questions; I'll address them promptly.

@xuechendi

Contributor

About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

@gcanlin @xuechendi can have a try?

Will try. Let's rebase it first.

Thanks, @gcanlin , XPU is waiting for PT2.11 features for sleep/wakeup. Will catch up with this feature later

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated

@gcanlin gcanlin left a comment

Collaborator

Thanks for contributing! Please consider these suggestions to clean up the code first.

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated
Comment thread vllm_omni/engine/arg_utils.py Outdated
Comment thread vllm_omni/entrypoints/async_omni.py Outdated
Comment thread vllm_omni/platforms/cuda/platform.py Outdated
Comment thread vllm_omni/platforms/rocm/platform.py Outdated
Comment thread vllm_omni/worker/gpu_ar_model_runner.py

@lishunyang12 lishunyang12 left a comment

Collaborator

A few concerns:

  1. all_reduce after sleep may crash (diffusion_worker.py, handle_sleep_task) — after self.sleep() offloads all weights and calls empty_cache, the code allocates a new GPU tensor for all_reduce. If Level 2 sleep discarded CUDA memory pools, this allocation could fail. Consider doing the reduction before sleep, or using CPU tensors.

  2. Sleep fallback fires wake events (diffusion_engine.py, sleep()) — the fallback sets wake_events, but worker_busy_loop interprets that as {"type": "wake_up"}. So the sleep fallback does the opposite of what was intended.

  3. Dead code in executor shutdown (multiproc_executor.py) — iterates over wake_events but the try body is pass. Was ev.set() intended?
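
Concerns 2 and 3 can be illustrated with a minimal worker loop in which sleep requests travel as typed commands and the wake event is reserved for waking a parked worker. This is a sketch under assumed names (`worker_loop`, `shutdown`, the `commands` queue); it is not the fix that landed in this PR.

```python
import queue
import threading


def worker_loop(commands: "queue.Queue[dict]", wake_event: threading.Event, log: list) -> None:
    """Hypothetical worker loop: one typed command channel, one wake event."""
    while True:
        msg = commands.get()
        if msg["type"] == "sleep":
            log.append("sleeping")
            wake_event.wait()   # Park until explicitly woken.
            wake_event.clear()
            log.append("awake")
        elif msg["type"] == "shutdown":
            log.append("shutdown")
            return


def shutdown(commands: "queue.Queue[dict]", wake_events) -> None:
    """Queue the shutdown command, then release any parked workers."""
    commands.put({"type": "shutdown"})
    for ev in wake_events:
        ev.set()  # Without this, a sleeping worker never sees the shutdown.


def demo() -> list:
    commands: "queue.Queue[dict]" = queue.Queue()
    wake_event = threading.Event()
    log: list = []
    worker = threading.Thread(target=worker_loop, args=(commands, wake_event, log))
    worker.start()
    commands.put({"type": "sleep"})
    shutdown(commands, [wake_event])
    worker.join(timeout=5)
    return log
```

Because a sleep request is never signalled by setting the wake event, a fallback path cannot accidentally wake the worker it meant to put to sleep, and the `ev.set()` in `shutdown` does real work instead of sitting in an empty `try` body.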

@yma11

yma11 commented Mar 22, 2026

Contributor

About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

@gcanlin @xuechendi can have a try?

For the XPU platform, sleep mode is not fully ready on the vLLM side. We depend on torch 2.11 APIs, and a draft PR is ready at 37149. So please go ahead first and we will cover XPU later.

@lishunyang12 lishunyang12 left a comment

Collaborator

Left a few comments. The dead code in shutdown and the getattr default for sleep mode need fixing.

Comment thread vllm_omni/diffusion/executor/multiproc_executor.py Outdated
Comment thread vllm_omni/diffusion/worker/diffusion_worker.py
Comment thread vllm_omni/worker/base.py
Comment thread vllm_omni/worker/base.py
Comment thread vllm_omni/diffusion/diffusion_engine.py Outdated
@Flink-ddd Flink-ddd requested a review from gcanlin March 23, 2026 06:18
@Flink-ddd

Contributor Author

Hi @gcanlin , I have updated some code by review suggestion, ready for NPU resource tests.

@gcanlin

gcanlin commented Mar 23, 2026

Collaborator

Hi @gcanlin , I have updated some code by review suggestion, ready for NPU resource tests.

Thanks! I will test it today.

@princepride

Collaborator

@gcanlin We need accelerate the progress, can we merge this feature before v0.18.0?

@gcanlin

gcanlin commented Mar 23, 2026

Collaborator

@gcanlin We need accelerate the progress, can we merge this feature before v0.18.0?

I think we can. Will be done tonight on my side.

@gcanlin

gcanlin commented Mar 23, 2026

Collaborator

@Flink-ddd @princepride Hey, could you please give me an example? I'm not familiar with this feature. Is vllm serve Wan-AI/Wan2.1-T2V-14B-Diffusers --omni --enable-sleep-mode correct?

@gcanlin gcanlin added this to the v0.18.0 milestone Mar 23, 2026
@Flink-ddd

Contributor Author

Hi @gcanlin , yes, that command is correct:
vllm serve Wan-AI/Wan2.1-T2V-14B-Diffusers --omni --enable-sleep-mode
This flag enables the underlying memory pool capability. Once the server is running, you can trigger the physical VRAM (or NPU memory) reclamation by sending a Sleep RPC to the engine (or through the Orchestrator or a test script).
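
As an illustration of sending such a request over HTTP, the sketch below builds the two calls with the standard library. The `/sleep?level=N` and `/wake_up` paths are an assumption modeled on upstream vLLM's development-mode endpoints; whether this server exposes them, and under which paths, is not confirmed in this thread. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request


def build_sleep_request(base_url: str, level: int) -> urllib.request.Request:
    """Build a POST to the assumed /sleep endpoint; `level` is 1 or 2."""
    if level not in (1, 2):
        raise ValueError("level must be 1 or 2")
    query = urllib.parse.urlencode({"level": level})
    return urllib.request.Request(f"{base_url}/sleep?{query}", method="POST")


def build_wake_request(base_url: str) -> urllib.request.Request:
    """Build a POST to the assumed /wake_up endpoint."""
    return urllib.request.Request(f"{base_url}/wake_up", method="POST")
```

Passing either request to `urllib.request.urlopen` would send it to a running server; the builders are kept separate so the URLs can be inspected first.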

Comment thread vllm_omni/worker/base.py Outdated

@gcanlin gcanlin left a comment

Collaborator

I noticed that adapting this to NPU will take some effort, so I will unblock this PR and submit a follow-up PR for NPU later. @princepride Please check the code details :)

@Flink-ddd Flink-ddd force-pushed the feat/omni-sleepmode-v1 branch from 04c27a6 to 1b278bc Compare April 18, 2026 10:30
Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd

Contributor Author

@hsliuustc0106 @Gaohan123, I have updated the code based on the review advice, and all CI checks have passed. Do you have any suggestions for the next step?

@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 20, 2026
@Gaohan123 Gaohan123 added merge-test label to trigger buildkite merge test CI and removed nightly-test label to trigger buildkite nightly test CI labels Apr 20, 2026
@Gaohan123 Gaohan123 dismissed stale reviews from lishunyang12, knlnguyen1802, and hsliuustc0106 April 20, 2026 13:48

solved

@Gaohan123 Gaohan123 left a comment

Collaborator

LGTM. Thanks

@Gaohan123 Gaohan123 enabled auto-merge (squash) April 20, 2026 13:49
@Gaohan123 Gaohan123 disabled auto-merge April 20, 2026 13:51
Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated
Signed-off-by: vensen <vensenmu@gmail.com>
Signed-off-by: vensen <vensenmu@gmail.com>
@Gaohan123 Gaohan123 merged commit e076378 into vllm-project:main Apr 21, 2026
8 checks passed
lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 21, 2026
nainiu258 pushed a commit to nainiu258/vllm-omni that referenced this pull request Apr 21, 2026
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request May 28, 2026
quyifei23 pushed a commit to quyifei23/vllm-omni that referenced this pull request Jun 6, 2026

Labels

merge-test label to trigger buildkite merge test CI ready label to trigger buildkite CI
