[Feat][sleepmode] add omni sleepmode and ack protocol by Flink-ddd · Pull Request #2022 · vllm-project/vllm-omni

Flink-ddd · 2026-03-19T18:32:23Z

Purpose

This PR implements Omni Sleep Mode (Tiered Memory Orchestration) across NVIDIA, AMD, XPU, NPU, and other platforms, as proposed in RFC #1316.

Key enhancements include:

Tiered Offloading Logic: Support for Level 1 (Weight offloading) and Level 2 (Full de-mapping) sleep stages.

Hardware Abstraction Layer: Unified VRAM auditing and reclamation logic across CUDA and ROCm.

Deterministic Orchestration: Ensuring physical memory release before co-located task execution to prevent OOM.
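
As a rough illustration of the two tiers, the sketch below models how a sleeping worker might account for its memory. `SleepLevel`, `SleepReport`, and `plan_sleep` are hypothetical names, not the PR's API. The only behavior taken from the logs in this PR is that Level 1 backs the freed memory up in CPU while Level 2 discards it directly.

```python
from dataclasses import dataclass
from enum import IntEnum


class SleepLevel(IntEnum):
    """Hypothetical names for the two tiers described above."""

    WEIGHT_OFFLOAD = 1  # Level 1: freed memory is backed up in CPU memory
    FULL_DEMAP = 2      # Level 2: freed memory is discarded directly


@dataclass(frozen=True)
class SleepReport:
    freed_gib: float      # total GPU memory released
    backed_up_gib: float  # portion preserved in CPU memory
    discarded_gib: float  # portion dropped without a CPU copy


def plan_sleep(level: SleepLevel, resident_gib: float) -> SleepReport:
    """Split the resident GPU memory into backed-up and discarded parts."""
    if level == SleepLevel.WEIGHT_OFFLOAD:
        return SleepReport(resident_gib, resident_gib, 0.0)
    return SleepReport(resident_gib, 0.0, resident_gib)


if __name__ == "__main__":
    print(plan_sleep(SleepLevel.WEIGHT_OFFLOAD, 14.03))
    print(plan_sleep(SleepLevel.FULL_DEMAP, 14.89))
```

The split mirrors the shape of the `CuMemAllocator` log lines in the test output, which report a total, a backed-up share, and a discarded share.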

Test Plan

Six unit tests were run on both AMD and NVIDIA systems. They are defined in tests/entrypoints/test_omni_sleep_mode.py. The test scenarios combine ACK signals, LLM and generation, and the sleep and wake-up states of Diffusion.

Two tests deserve particular attention:
Unit Test 4: inference consistency and bit-level precision verification after Diffusion wake-up.
Unit Test 6: full-cycle audit of the Diffusion memory lifecycle.

Unit Test 6 was run on different platforms to verify that the Diffusion model stays accurate and usable throughout its entire lifecycle, from active to sleeping to woken up.
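
One way to express the bit-level check in Unit Test 4 is to hash the raw output bytes before sleep and after wake-up and require the digests to match. The sketch below is an assumption about how such a check could look, not the PR's test code. `generate` stands in for a seeded, deterministic inference call, and `sleep` and `wake_up` stand in for the engine's control calls.

```python
import hashlib
from typing import Callable


def digest(payload: bytes) -> str:
    """Return a SHA-256 hex digest of raw output bytes."""
    return hashlib.sha256(payload).hexdigest()


def outputs_bit_identical(
    generate: Callable[[], bytes],
    sleep: Callable[[], None],
    wake_up: Callable[[], None],
) -> bool:
    """Run generate() before sleep and after wake-up, then compare digests."""
    before = digest(generate())
    sleep()
    wake_up()
    after = digest(generate())
    return before == after


if __name__ == "__main__":
    # Stand-in engine: deterministic bytes, no-op sleep and wake-up.
    def fake_generate() -> bytes:
        return bytes(range(16))

    print(outputs_bit_identical(fake_generate, lambda: None, lambda: None))
```

Comparing digests rather than using a numeric tolerance makes any single-bit drift after wake-up fail the check.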

e2e test

Test Result

NVIDIA A6000 TP = 2 (core pytest output)

tests/entrypoints/test_omni_sleep_mode.py::TestOmniSleepMode::test_diffusion_vram_lifecycle_audit 
=== PRE-TEST GPU CLEANUP ===
GPU cleanup disabled
INFO 02-24 09:58:39 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=2048.
--- Running test: test_diffusion_vram_lifecycle_audit
INFO 02-24 09:58:39 [omni.py:117] Initializing stages for model: ByteDance-Seed/BAGEL-7B-MoT
..................................................
[Stage-0] INFO 02-24 09:59:08 [async_omni_diffusion.py:113] AsyncOmniDiffusion initialized with model: black-forest-labs/FLUX.2-klein-4B
INFO 02-24 09:59:08 [omni.py:334] [AsyncOrchestrator] Stage-0 reported ready
INFO 02-24 09:59:08 [omni.py:360] [AsyncOrchestrator] All stages initialized successfully
WARNING 02-24 09:59:08 [async_omni.py:274] [AsyncOrchestrator] No LLM stage found, processors will not be available. This may cause issues with OpenAIServingModels.
INFO 02-24 09:59:08 [omni_stage.py:566] [Stage-0] Status transitioned to: TRANSITIONING
INFO 02-24 09:59:08 [omni_stage.py:570] [Stage-0] Submitting SLEEP task (Level: 2)
INFO 02-24 09:59:08 [async_omni.py:894] [AsyncOrchestrator] Sleep initiated. Awaiting confirmation from 1 workers...
[Stage-0] INFO 02-24 09:59:08 [async_omni_diffusion.py:309] [Entrypoint] Relaying Sleep Task: 05eda0a7-27f5-4843-ad1a-24291b0dc7d4 (Level: 2)
[Stage-0] INFO 02-24 09:59:08 [diffusion_engine.py:401] [Diffusion Engine Relay] Dispatching Sleep Task 05eda0a7-27f5-4843-ad1a-24291b0dc7d4 (Level: 2)
[Stage-0] INFO 02-24 09:59:08 [diffusion_worker.py:221] [Diffusion Worker 0] Handshake Received: Task 05eda0a7-27f5-4843-ad1a-24291b0dc7d4, Level 2
[Stage-0] INFO 02-24 09:59:08 [cumem.py:213] CuMemAllocator: sleep freed 14.89 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 14.89 GiB is discarded directly.
[Stage-0] INFO 02-24 09:59:09 [diffusion_worker.py:198] [Diffusion Worker 0] Level 2 Sleep: Freed 18.40 GiB. 0.62GiB memory is still in use.
[Stage-0] INFO 02-24 09:59:09 [diffusion_worker.py:239] [Diffusion Worker 0]: ACK emitted. Freed 18.40 GiB.
[Stage-0] INFO 02-24 09:59:09 [omni_stage.py:1323] [Stage-0] Sleep ACKs forwarded to Orchestrator
INFO 02-24 09:59:09 [async_omni.py:641] [AsyncOrchestrator] Intercepted wrapped ACK for task 05eda0a7-27f5-4843-ad1a-24291b0dc7d4 from stage-0
INFO 02-24 09:59:09 [omni_stage.py:566] [Stage-0] Status transitioned to: SLEEPING
INFO 02-24 09:59:10 [omni_stage.py:579] [Stage-0] Submitting WAKE_UP task
INFO 02-24 09:59:10 [async_omni.py:921] [AsyncOrchestrator] Wake-up initiated. Awaiting confirmation from 1 workers...
[Stage-0] INFO 02-24 09:59:10 [async_omni_diffusion.py:318] [Entrypoint] Relaying WakeUp Task: c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43
[Stage-0] INFO 02-24 09:59:10 [diffusion_engine.py:424] [Diffusion Engine Relay] Dispatching Wake-up Task c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43 to workers...
[Stage-0] INFO 02-24 09:59:10 [diffusion_worker.py:250] [Diffusion Worker 0] Responding to Wake-up Task: c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43
[Stage-0] INFO 02-24 09:59:10 [diffusion_worker.py:210] [Diffusion Worker 0] Wake-up complete.
[Stage-0] INFO 02-24 09:59:10 [diffusion_worker.py:259] [Diffusion Worker 0] Wake-up confirmed.
INFO 02-24 09:59:10 [async_omni.py:641] [AsyncOrchestrator] Intercepted wrapped ACK for task c9a8e0a2-699a-4f3c-ac62-6b7bb08b3b43 from stage-0
INFO 02-24 09:59:10 [omni_stage.py:566] [Stage-0] Status transitioned to: RUNNING
INFO 02-24 09:59:11 [async_omni.py:378] [AsyncOrchestrator] Entering scheduling loop: stages=1, final_stage=0
[Stage-0] INFO 02-24 09:59:11 [manager.py:538] Deactivating all adapters: 0 layers
[Stage-0] WARNING 02-24 09:59:11 [kv_transfer_manager.py:356] No connector available for receiving KV cache
[Stage-0] INFO 02-24 09:59:11 [diffusion_engine.py:92] Generation completed successfully.
[Stage-0] INFO 02-24 09:59:11 [diffusion_engine.py:110] Post-processing completed in 0.0583 seconds
PASSED[Stage-0] INFO 02-24 09:59:26 [diffusion_worker.py:400] Worker 0: Received shutdown message
[Stage-0] INFO 02-24 09:59:26 [diffusion_worker.py:421] event loop terminated.
[Stage-0] INFO 02-24 09:59:26 [diffusion_worker.py:452] Worker 0: Shutdown complete.
[Stage-0] INFO 02-24 09:59:29 [async_omni_diffusion.py:213] AsyncOmniDiffusion closed
[Stage-0] INFO 02-24 09:59:29 [omni_stage.py:1428] Stage worker exiting
GPU cleanup disabled


=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
================== 6 passed, 2 warnings in 309.81s (0:05:09) ===================

AMD MI300X TP = 2 (core pytest output)

WARNING 03-19 18:07:36 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3.12
cachedir: .pytest_cache
rootdir: /app/vllm-omni
configfile: pyproject.toml
plugins: asyncio-1.3.0, anyio-4.12.1
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collecting ... collected 6 items
................................................................................................................................................................
tests/entrypoints/test_omni_sleep_mode.py::TestOmniSleepMode::test_cross_device_cleanup 
=== PRE-TEST GPU CLEANUP ===
GPU cleanup disabled
INFO 03-19 18:09:27 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
--- Running test: test_cross_device_cleanup
INFO 03-19 18:09:27 [weight_utils.py:50] Using model weights format ['*']
INFO 03-19 18:09:27 [omni.py:195] Initializing stages for model: ByteDance-Seed/BAGEL-7B-MoT
INFO 03-19 18:09:27 [omni.py:329] No omni_master_address provided, defaulting to localhost (127.0.0.1)
INFO 03-19 18:09:27 [omni.py:269] [AsyncOrchestrator] Using stages provided directly via arguments
INFO 03-19 18:09:27 [initialization.py:35] No OmniTransferConfig provided
INFO 03-19 18:09:27 [omni.py:363] [AsyncOrchestrator] Loaded 1 stages
INFO 03-19 18:09:27 [multiproc_executor.py:88] Starting server...
WARNING 03-19 18:09:34 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
WARNING 03-19 18:09:34 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
[aiter] import [module_aiter_enum] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_aiter_enum.so
INFO 03-19 18:09:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-19 18:09:39 [vllm.py:754] Asynchronous scheduling is enabled.
INFO 03-19 18:09:39 [diffusion_worker.py:424] Worker 0 created result MessageQueue
INFO 03-19 18:09:39 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 03-19 18:09:39 [vllm.py:754] Asynchronous scheduling is enabled.
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
INFO 03-19 18:09:39 [diffusion_worker.py:122] Worker 0: Initialized device and distributed environment.
INFO 03-19 18:09:39 [diffusion_worker.py:122] Worker 1: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-19 18:09:39 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 03-19 18:09:39 [parallel_state.py:630] SP group details for rank 1: sp_group=[1], ulysses_group=[1], ring_group=[1]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-19 18:09:39 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 03-19 18:09:39 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 03-19 18:09:40 [weight_utils.py:50] Using model weights format ['*']
INFO 03-19 18:09:40 [weight_utils.py:50] Using model weights format ['*']
INFO 03-19 18:09:42 [weight_utils.py:618] No diffusion_pytorch_model.safetensors.index.json found in remote.

Multi-thread loading shards:   0% Completed | 0/2 [00:00<?, ?it/s]
INFO 03-19 18:09:42 [weight_utils.py:618] No diffusion_pytorch_model.safetensors.index.json found in remote.

Multi-thread loading shards:  50% Completed | 1/2 [00:01<00:01,  1.10s/it]

Multi-thread loading shards: 100% Completed | 2/2 [00:08<00:00,  4.66s/it]

Multi-thread loading shards: 100% Completed | 2/2 [00:08<00:00,  4.13s/it]

INFO 03-19 18:09:50 [pipeline_bagel.py:743] BagelPipeline weight filter kept 1466/1467 tensors (shape mismatches seen: 0)
INFO 03-19 18:09:50 [pipeline_bagel.py:743] BagelPipeline weight filter kept 1466/1467 tensors (shape mismatches seen: 0)
INFO 03-19 18:09:51 [diffusers_loader.py:321] Loading weights took 9.68 seconds
INFO 03-19 18:09:52 [diffusers_loader.py:321] Loading weights took 10.06 seconds
INFO 03-19 18:09:52 [diffusion_model_runner.py:134] Model loading took 14.2129 GiB and 12.788842 seconds
INFO 03-19 18:09:52 [diffusion_model_runner.py:139] Model runner: Model loaded successfully.
INFO 03-19 18:09:52 [diffusion_model_runner.py:173] Model runner: Initialization complete.
WARNING 03-19 18:09:52 [gpu_memory_utils.py:88] NVML init failed, will use profiling fallback: NVML Shared Library Not Found
INFO 03-19 18:09:52 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 03-19 18:09:52 [diffusion_worker.py:91] Worker 0: Initialization complete.
INFO 03-19 18:09:52 [diffusion_worker.py:589] Worker 0: Scheduler loop started.
INFO 03-19 18:09:52 [diffusion_worker.py:495] Worker 0 ready to receive requests via shared memory
INFO 03-19 18:09:53 [diffusion_model_runner.py:134] Model loading took 14.2129 GiB and 13.289822 seconds
INFO 03-19 18:09:53 [diffusion_model_runner.py:139] Model runner: Model loaded successfully.
INFO 03-19 18:09:53 [diffusion_model_runner.py:173] Model runner: Initialization complete.
WARNING 03-19 18:09:53 [gpu_memory_utils.py:88] NVML init failed, will use profiling fallback: NVML Shared Library Not Found
INFO 03-19 18:09:53 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:1, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 03-19 18:09:53 [diffusion_worker.py:91] Worker 1: Initialization complete.
INFO 03-19 18:09:53 [diffusion_worker.py:589] Worker 1: Scheduler loop started.
INFO 03-19 18:09:53 [diffusion_worker.py:495] Worker 1 ready to receive requests via shared memory
INFO 03-19 18:09:53 [scheduler.py:42] SyncScheduler initialized result MessageQueue
INFO 03-19 18:09:53 [diffusion_engine.py:415] dummy run to warm up the model
INFO 03-19 18:09:53 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-19 18:09:53 [kv_transfer_manager.py:479] Request has no ID, cannot receive KV cache
INFO 03-19 18:09:53 [manager.py:608] Deactivating all adapters: 0 layers
WARNING 03-19 18:09:53 [kv_transfer_manager.py:479] Request has no ID, cannot receive KV cache
[aiter] import [module_fmha_v3_varlen_fwd] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_fmha_v3_varlen_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_varlen_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, min_seqlen_q: int, dropout_p: float, softmax_scale: float, logits_soft_cap: float, zero_tensors: bool, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, block_table: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None, cu_seqlens_q_padded: Optional[torch.Tensor] = None, cu_seqlens_k_padded: Optional[torch.Tensor] = None) -> List[torch.Tensor]
[aiter] hipModuleLoad: /usr/local/lib/python3.12/dist-packages/aiter_meta/hsa//gfx942/fmha_v3_fwd/MI300/fwd_hd128_bf16_causal_rtna_group.co GetFunction: _ZN5aiter37fmha_fwd_hd128_bf16_causal_rtna_groupE Success
[aiter] import [module_fmha_v3_varlen_fwd] under /usr/local/lib/python3.12/dist-packages/aiter/jit/module_fmha_v3_varlen_fwd.so
[aiter] type hints mismatch, override to --> fmha_v3_varlen_fwd(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, cu_seqlens_q: torch.Tensor, cu_seqlens_k: torch.Tensor, max_seqlen_q: int, max_seqlen_k: int, min_seqlen_q: int, dropout_p: float, softmax_scale: float, logits_soft_cap: float, zero_tensors: bool, is_causal: bool, window_size_left: int, window_size_right: int, return_softmax_lse: bool, return_dropout_randval: bool, how_v3_bf16_cvt: int, out: Optional[torch.Tensor] = None, block_table: Optional[torch.Tensor] = None, bias: Optional[torch.Tensor] = None, alibi_slopes: Optional[torch.Tensor] = None, gen: Optional[torch.Generator] = None, cu_seqlens_q_padded: Optional[torch.Tensor] = None, cu_seqlens_k_padded: Optional[torch.Tensor] = None) -> List[torch.Tensor]
[aiter] hipModuleLoad: /usr/local/lib/python3.12/dist-packages/aiter_meta/hsa//gfx942/fmha_v3_fwd/MI300/fwd_hd128_bf16_causal_rtna_group.co GetFunction: _ZN5aiter37fmha_fwd_hd128_bf16_causal_rtna_groupE Success
INFO 03-19 18:10:04 [omni.py:444] [AsyncOrchestrator] Inline diffusion mode active – stage worker subprocess bypassed
INFO 03-19 18:10:04 [async_omni.py:229] [AsyncOrchestrator] Pro-active link: Inline Engine -> Stage-0
INFO 03-19 18:10:04 [omni_stage.py:700] [Stage-0] Status transitioned to: TRANSITIONING
INFO 03-19 18:10:04 [omni_stage.py:704] [Stage-0] Submitting SLEEP task (Level: 1)
INFO 03-19 18:10:04 [diffusion_engine.py:459] [Diffusion Engine] Attempting sleep 8b99f389-d23b-4514-8711-9de42f734922 (Level: 1)...
INFO 03-19 18:10:04 [diffusion_engine.py:463] [Diffusion Engine] Physical Sleep Command 8b99f389-d23b-4514-8711-9de42f734922 dispatched via MQ.
INFO 03-19 18:10:04 [async_omni.py:1187] [AsyncOrchestrator] Sleep initiated (Task: 8b99f389-d23b-4514-8711-9de42f734922). Awaiting 1 ACKs...
INFO 03-19 18:10:04 [async_omni.py:932] Orchestrator is polling Stage-0
INFO 03-19 18:10:04 [diffusion_worker.py:268] [Worker 0] Handshake Received: Task 8b99f389-d23b-4514-8711-9de42f734922
INFO 03-19 18:10:04 [diffusion_worker.py:268] [Worker 1] Handshake Received: Task 8b99f389-d23b-4514-8711-9de42f734922
INFO 03-19 18:10:07 [cumem.py:216] CuMemAllocator: sleep freed 14.03 GiB memory in total, of which 14.03 GiB is backed up in CPU and the rest 0.00 GiB is discarded directly.
INFO 03-19 18:10:08 [diffusion_worker.py:241] [Worker 1] Sleep Level 1 scavenged 14.03 GiB from GPU.
INFO 03-19 18:10:08 [diffusion_worker.py:275] [Worker 1] Preparing ACK: freed_bytes=15.04 GiB.
INFO 03-19 18:10:08 [cumem.py:216] CuMemAllocator: sleep freed 14.03 GiB memory in total, of which 14.03 GiB is backed up in CPU and the rest 0.00 GiB is discarded directly.
INFO 03-19 18:10:08 [diffusion_worker.py:241] [Worker 0] Sleep Level 1 scavenged 14.03 GiB from GPU.
INFO 03-19 18:10:08 [diffusion_worker.py:275] [Worker 0] Preparing ACK: freed_bytes=15.04 GiB.
INFO 03-19 18:10:08 [diffusion_worker.py:298] [Worker 0] ACK emitted. Freed 30.07 GiB.
INFO 03-19 18:10:09 [async_omni.py:954] [AsyncOrchestrator] Intercepted wrapped ACK for task 8b99f389-d23b-4514-8711-9de42f734922 from stage-0
INFO 03-19 18:10:09 [async_omni.py:156] [Resolver] Task 8b99f389-d23b-4514-8711-9de42f734922 progress: 1/1 ACKs received.
INFO 03-19 18:10:09 [async_omni.py:163] [Resolver] Task 8b99f389-d23b-4514-8711-9de42f734922 completed successfully in 4.44s.
INFO 03-19 18:10:09 [omni_stage.py:700] [Stage-0] Status transitioned to: SLEEPING
PASSEDGPU cleanup disabled
..............................................................................................................................
------------------------------ Captured log call -------------------------------
ERROR    OmniTest:test_omni_sleep_mode.py:254 Coordinated test failed:
=============================== warnings summary ===============================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/entrypoints/test_omni_sleep_mode.py::TestOmniSleepMode::test_coordinated_cross_device
============= 6 passed, 2 warnings in 426.55s (0:07:06) ==============
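
The orchestrator side of the handshake seen in these logs (a task ID is issued, the expected number of ACKs is awaited, then the stage transitions) can be sketched with asyncio. `AckResolver` and its methods are hypothetical names. The real logic lives in async_omni.py and may differ.

```python
import asyncio
import uuid


class AckResolver:
    """Hypothetical tracker that resolves a task once enough ACKs arrive."""

    def __init__(self) -> None:
        self._expected: dict[str, int] = {}
        self._received: dict[str, int] = {}
        self._done: dict[str, asyncio.Future] = {}

    def register(self, expected_acks: int) -> str:
        """Create a task ID and remember how many ACKs it needs."""
        task_id = str(uuid.uuid4())
        self._expected[task_id] = expected_acks
        self._received[task_id] = 0
        self._done[task_id] = asyncio.get_running_loop().create_future()
        return task_id

    def on_ack(self, task_id: str) -> None:
        """Count one ACK and resolve the task when the quota is met."""
        if task_id not in self._done:
            return  # unknown or already finished task
        self._received[task_id] += 1
        future = self._done[task_id]
        if self._received[task_id] >= self._expected[task_id] and not future.done():
            future.set_result(self._received[task_id])

    async def wait(self, task_id: str, timeout: float) -> int:
        """Block until all ACKs arrive or raise asyncio.TimeoutError."""
        try:
            return await asyncio.wait_for(self._done[task_id], timeout)
        finally:
            for table in (self._expected, self._received, self._done):
                table.pop(task_id, None)


async def demo() -> int:
    resolver = AckResolver()
    task_id = resolver.register(expected_acks=2)
    loop = asyncio.get_running_loop()
    loop.call_soon(resolver.on_ack, task_id)  # ACK from worker 0
    loop.call_soon(resolver.on_ack, task_id)  # ACK from worker 1
    return await resolver.wait(task_id, timeout=1.0)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The timeout matters in this design: a worker that never acknowledges should surface as an error instead of leaving the stage in TRANSITIONING indefinitely.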

[NVIDIA A6000] Coordinated Cross-Device VRAM Audit

Validating coordinated VRAM auditing for heterogeneous engines (LLM-Talker and Diffusion) on NVIDIA A6000, demonstrating seamless parallel weight offloading across inter-process components.

[NVIDIA A6000] Diffusion VRAM Lifecycle Audit

Full lifecycle audit of a single Diffusion engine on NVIDIA A6000, confirming efficient physical VRAM reclamation in Level 2 Sleep Mode and successful partial weight recovery.

[AMD MI300X] Coordinated Cross-Device VRAM Audit

Demonstrating multi-vendor compatibility of the coordinated sleep mechanism on AMD MI300X (ROCm), showing deterministic VRAM scavenging and state synchronization between heterogeneous engines in a TP environment.

[AMD MI300X] Diffusion VRAM Lifecycle Audit

Auditing dynamic VRAM evolution on AMD MI300X to verify that the Deep Sleep mechanism maintains high-precision physical resource reclamation even within large-capacity memory architectures.

Note: The stable, non-zero VRAM floor observed (approx. 2.007 GiB on MI300X / 1.2 GiB on A6000) represents the mandatory driver runtime footprint and persistent metadata required to ensure deterministic, near-instantaneous recovery after deep sleep.
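
A floor like this can be observed by sampling free device memory around the sleep call. The sketch below assumes a probe with the same `(free_bytes, total_bytes)` return shape as `torch.cuda.mem_get_info`. `audit_sleep` and `FakeDevice` are hypothetical helpers, and the numbers in the demo are invented for illustration, not taken from the test runs.

```python
from typing import Callable, Tuple

GIB = 1024 ** 3
MemProbe = Callable[[], Tuple[int, int]]  # returns (free_bytes, total_bytes)


def audit_sleep(probe: MemProbe, sleep: Callable[[], None]) -> Tuple[float, float]:
    """Return (freed_gib, still_used_gib) measured around a sleep call."""
    free_before, _ = probe()
    sleep()
    free_after, total = probe()
    freed = (free_after - free_before) / GIB
    still_used = (total - free_after) / GIB
    return freed, still_used


class FakeDevice:
    """Stand-in for a GPU so the sketch runs without hardware."""

    def __init__(self, total_gib: int, used_gib: int, floor_gib: int) -> None:
        self.total = total_gib * GIB
        self.used = used_gib * GIB
        self.floor = floor_gib * GIB

    def probe(self) -> Tuple[int, int]:
        return self.total - self.used, self.total

    def sleep(self) -> None:
        # Everything above the runtime floor is released.
        self.used = self.floor


if __name__ == "__main__":
    device = FakeDevice(total_gib=48, used_gib=20, floor_gib=2)
    print(audit_sleep(device.probe, device.sleep))
```

On a real CUDA or ROCm device with PyTorch, `torch.cuda.mem_get_info` could be passed as `probe`, since it returns free and total bytes for the current device.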

This PR introduces Omni Sleep Mode to enable deterministic physical VRAM reclamation and restoration across standalone Diffusion and multi-stage models, with full support for tensor_parallel_size=1 and tensor_parallel_size=2.

============================= test session starts ==============================
platform linux -- Python 3.12.12, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /app/vllm-omni
configfile: pyproject.toml
plugins: mock-3.15.1, asyncio-1.3.0, anyio-4.12.1
asyncio: mode=Mode.AUTO
collected 4 items

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_diffusion_model_sleep_tp[1] 
--- Running test: test_diffusion_model_sleep_tp[1]
[TP=1] Triggering Level 2 Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 80.21 GiB.
[TP=1] VRAM Reserved: 29.97 GiB is still in use.
[TP=1] Waking up...
Diffusion TP=1 Lifecycle OK
PASSED

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_diffusion_model_sleep_tp[2] 
--- Running test: test_diffusion_model_sleep_tp[2]
[TP=2] Triggering Level 2 Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 80.21 GiB.
[TP=2] VRAM Reserved: 29.98 GiB is still in use.
[TP=2] Waking up...
Diffusion TP=2 Lifecycle OK
PASSED

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_omni_model_sleep_tp[1] 
--- Running test: test_omni_model_sleep_tp[1]
Testing Stage 0 (LLM) Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 79.89 GiB.
Testing Stage 1 (Diffusion) Sleep...
INFO: [Diffusion Worker 0] Sleep Level 2: physically freed 27.61 GiB, 3.77 GiB is still use.
Omni Multi-stage TP=1 Lifecycle OK
PASSED

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_omni_model_sleep_tp[2] 
--- Running test: test_omni_model_sleep_tp[2]
Testing Stage 0 (LLM) Sleep...
INFO: [LLM Worker 0] Level 2 Sleep: Freed 79.72 GiB.
Testing Stage 1 (Diffusion) Sleep...
INFO: [Diffusion Worker 0] Sleep Level 2: physically freed 27.61 GiB, 3.77 GiB is still use.
Omni Multi-stage TP=2 Lifecycle OK
PASSED

================== 4 passed, 6 warnings in 399.89s (0:06:39) ===================

Flink-ddd · 2026-03-19T18:42:20Z

Hi @hsliuustc0106 @princepride @Gaohan123, this is the new sleep mode ACK PR, based on the latest code. Thanks.

I will resolve the merge conflicts.

Flink-ddd · 2026-03-19T18:42:50Z

About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9b8f612b67

hsliuustc0106 · 2026-03-20T00:27:05Z

> About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.

@gcanlin @xuechendi can have a try?

hsliuustc0106 · 2026-03-20T00:27:34Z

It seems there are a lot of conflicts that need to be resolved first.

gcanlin · 2026-03-20T00:52:18Z

> > About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.
>
> @gcanlin @xuechendi can have a try?

Will try. Let's rebase it first.

Flink-ddd · 2026-03-20T06:34:10Z

Okay, I'll address them one by one. It seems there's been more refactoring code merged in the last couple of days, so I need to further adjust my logic.

Flink-ddd · 2026-03-20T13:47:08Z

Hi @hsliuustc0106 @gcanlin @xuechendi,
All conflicts are resolved and the branch is synchronized with main. The logic has been verified on AMD and NVIDIA and is ready for XPU and NPU testing. Please let me know if you have any questions; I'll address them promptly.

xuechendi · 2026-03-20T22:07:47Z

> > > About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.
> >
> > @gcanlin @xuechendi can have a try?
>
> Will try. Let's rebase it first.

Thanks, @gcanlin , XPU is waiting for PT2.11 features for sleep/wakeup. Will catch up with this feature later

gcanlin

Thanks for contributing! Please consider these suggestions to clean up the code first.

lishunyang12

A few concerns:

all_reduce after sleep may crash (diffusion_worker.py, handle_sleep_task) — after self.sleep() offloads all weights and calls empty_cache, the code allocates a new GPU tensor for all_reduce. If Level 2 sleep discarded CUDA memory pools, this allocation could fail. Consider doing the reduction before sleep, or using CPU tensors.
Sleep fallback fires wake events (diffusion_engine.py, sleep()) — the fallback sets wake_events, but worker_busy_loop interprets that as {"type": "wake_up"}. So the sleep fallback does the opposite of what was intended.
Dead code in executor shutdown (multiproc_executor.py) — iterates over wake_events but the try body is pass. Was ev.set() intended?
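
The second concern can be illustrated with a toy control loop. The sketch below is not the PR's code: `ControlSignals`, `next_command`, and the two fallback functions are hypothetical, and the `{"type": "sleep"}` payload is an assumed counterpart to the `{"type": "wake_up"}` message named in the review. It only shows why reusing the wake event as a sleep fallback delivers the wrong command, and how a dedicated sleep event avoids that.

```python
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ControlSignals:
    """One event per command, so a fallback cannot be misread."""

    sleep_event: threading.Event = field(default_factory=threading.Event)
    wake_event: threading.Event = field(default_factory=threading.Event)


def next_command(signals: ControlSignals) -> Optional[dict]:
    """Translate a set event into the command a worker loop would act on."""
    if signals.wake_event.is_set():
        signals.wake_event.clear()
        return {"type": "wake_up"}
    if signals.sleep_event.is_set():
        signals.sleep_event.clear()
        return {"type": "sleep"}  # assumed payload shape
    return None


def sleep_fallback_buggy(signals: ControlSignals) -> None:
    # Mirrors the reported issue: the sleep path sets the wake event.
    signals.wake_event.set()


def sleep_fallback_fixed(signals: ControlSignals) -> None:
    # A dedicated event keeps the intent unambiguous.
    signals.sleep_event.set()


if __name__ == "__main__":
    buggy, fixed = ControlSignals(), ControlSignals()
    sleep_fallback_buggy(buggy)
    sleep_fallback_fixed(fixed)
    print(next_command(buggy), next_command(fixed))
```

For the first concern, the same principle applies in a different form: any value that must be reduced across ranks is safer to compute before the device memory pools are released, or to hold in CPU tensors.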

yma11 · 2026-03-22T03:07:26Z

> > About NPU / XPU test details, @hsliuustc0106 please help me coordinate testing resources. I will respond and make changes promptly based on the review feedback. Thanks.
>
> @gcanlin @xuechendi can have a try?

For the XPU platform, sleep mode is not fully ready on the vLLM side. We depend on torch 2.11 APIs, and a draft PR is ready at 37149. So please go ahead first and we will cover XPU later.

lishunyang12

Left a few comments. The dead code in shutdown and the getattr default for sleep mode need fixing.

Flink-ddd · 2026-03-23T06:25:05Z

Hi @gcanlin , I have updated some code by review suggestion, ready for NPU resource tests.

gcanlin · 2026-03-23T06:34:06Z

> Hi @gcanlin , I have updated some code by review suggestion, ready for NPU resource tests.

Thanks! I will test it today.

princepride · 2026-03-23T07:11:04Z

@gcanlin We need accelerate the progress, can we merge this feature before v0.18.0?

gcanlin · 2026-03-23T07:24:28Z

> @gcanlin We need accelerate the progress, can we merge this feature before v0.18.0?

I think we can. Will be done tonight on my side.

gcanlin · 2026-03-23T12:24:10Z

@Flink-ddd @princepride Hey, could you please give me an example? I'm not familiar with this feature. Is vllm serve Wan-AI/Wan2.1-T2V-14B-Diffusers --omni --enable-sleep-mode correct?

Flink-ddd · 2026-03-23T12:34:27Z

Hi @gcanlin , yes, that command is correct:
vllm serve Wan-AI/Wan2.1-T2V-14B-Diffusers --omni --enable-sleep-mode
This flag enables the underlying memory pool capability. Once the server is running, you can trigger the physical VRAM (or NPU memory) reclamation by sending a Sleep RPC to the engine (or through the Orchestrator or a test script).
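
For illustration, upstream vLLM exposes `POST /sleep?level=N` and `POST /wake_up` when the server runs with `VLLM_SERVER_DEV_MODE=1`. Whether vllm-omni serves the same routes for this feature is an assumption here, and the host and port below are placeholders. The sketch only builds the requests; the actual call is left commented out.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # placeholder address, adjust to your server


def build_sleep_request(level: int = 1, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request asking the server to enter the given sleep level."""
    if level not in (1, 2):
        raise ValueError("sleep level must be 1 or 2")
    return urllib.request.Request(f"{base_url}/sleep?level={level}", method="POST")


def build_wake_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build a POST request asking the server to wake up."""
    return urllib.request.Request(f"{base_url}/wake_up", method="POST")


if __name__ == "__main__":
    for request in (build_sleep_request(level=2), build_wake_request()):
        print(request.get_method(), request.full_url)
        # urllib.request.urlopen(request)  # uncomment against a running server
```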

gcanlin

I noticed that adapting NPU will take some effort, so I will unblock this PR and submit a follow-up PR for NPU later. @princepride Please check the code details :)

Flink-ddd · 2026-04-19T09:15:04Z

@hsliuustc0106 @Gaohan123, I have updated the code per the review advice and all CI checks have passed. Do you have any suggestions for the next step?

solved

Gaohan123

LGTM. Thanks

Flink-ddd marked this pull request as ready for review March 19, 2026 18:42

Flink-ddd requested a review from hsliuustc0106 as a code owner March 19, 2026 18:42

chatgpt-codex-connector Bot reviewed Mar 19, 2026

Comment thread vllm_omni/entrypoints/omni_stage.py Outdated

Comment thread vllm_omni/entrypoints/async_omni.py Outdated

Comment thread vllm_omni/diffusion/diffusion_engine.py Outdated

Flink-ddd force-pushed the feat/omni-sleepmode-v1 branch 2 times, most recently from 0fd1f29 to 5e77863 Compare March 20, 2026 11:43

gcanlin reviewed Mar 21, 2026

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated

gcanlin reviewed Mar 21, 2026

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated

gcanlin reviewed Mar 21, 2026

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py Outdated

gcanlin requested changes Mar 21, 2026

lishunyang12 reviewed Mar 21, 2026

lishunyang12 reviewed Mar 22, 2026

Comment thread vllm_omni/diffusion/executor/multiproc_executor.py Outdated

Comment thread vllm_omni/diffusion/worker/diffusion_worker.py

Comment thread vllm_omni/worker/base.py

Comment thread vllm_omni/worker/base.py

Comment thread vllm_omni/diffusion/diffusion_engine.py Outdated

Flink-ddd requested a review from gcanlin March 23, 2026 06:18

gcanlin added this to the v0.18.0 milestone Mar 23, 2026

gcanlin reviewed Mar 23, 2026

Comment thread vllm_omni/worker/base.py Outdated

gcanlin approved these changes Mar 23, 2026

Flink-ddd force-pushed the feat/omni-sleepmode-v1 branch from 04c27a6 to 1b278bc Compare April 18, 2026 10:30

remove test code

1925d8c

Signed-off-by: vensen <vensenmu@gmail.com>

Gaohan123 added this to the v0.20.0 milestone Apr 20, 2026

Gaohan123 added the merge-test label and removed the nightly-test label Apr 20, 2026

Merge branch 'main' into feat/omni-sleepmode-v1

0715fa0

Gaohan123 approved these changes Apr 20, 2026

Gaohan123 enabled auto-merge (squash) April 20, 2026 13:49

Gaohan123 disabled auto-merge April 20, 2026 13:51

hsliuustc0106 reviewed Apr 20, 2026

Comment thread vllm_omni/entrypoints/openai/api_server.py Outdated

Flink-ddd added 3 commits April 20, 2026 16:42

update code

d2acf23

Signed-off-by: vensen <vensenmu@gmail.com>

resolve CI error

8c8dde1

Signed-off-by: vensen <vensenmu@gmail.com>

Merge branch 'main' into feat/omni-sleepmode-v1

a6d4ccf

Gaohan123 merged commit e076378 into vllm-project:main Apr 21, 2026
8 checks passed

lvliang-intel pushed a commit to lvliang-intel/vllm-omni that referenced this pull request Apr 21, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

38b3fad

nainiu258 pushed a commit to nainiu258/vllm-omni that referenced this pull request Apr 21, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

04c557a

Signed-off-by: nainiu258 <cperfect02@163.com>

lengrongfu mentioned this pull request Apr 22, 2026

add sleep and wake_up api in sleep model #2742

Closed

5 tasks

qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

4a3ec58

knlnguyen1802 mentioned this pull request Apr 29, 2026

[Rebase] Rebase to vllm 0.20.0 #3232

Merged

5 tasks

lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

29ee2f1

This was referenced May 4, 2026

[Frontend] Enable sleep/wake HTTP API without VLLM_SERVER_DEV_MODE #2224

Open

[Feature]: online inference support enable-sleep-mode level2 for diffusion model #1502

Open

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

ececf9b

daixinning pushed a commit to daixinning/vllm-omni that referenced this pull request May 28, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

e355b50

quyifei23 pushed a commit to quyifei23/vllm-omni that referenced this pull request Jun 6, 2026

[Feat][sleepmode] add omni sleepmode and ack protocol (vllm-project#2022)

457e2b9

hsliuustc0106 mentioned this pull request Jun 13, 2026

[RFC/Bug] Improve diffusion worker/engine control-plane reliability and output contracts #4400

Open
