
[Feat](sleep mode): implement sleep and wake_up APIs for engine lifecycle management #1160

Open
Flink-ddd wants to merge 2 commits into vllm-project:main from Flink-ddd:fix/omni-sleep-fp8-garbage


@Flink-ddd
Contributor

Purpose

Currently, the Omni orchestrator lacks programmatic control for sleep and wake_up states. This prevents users from releasing VRAM during idle periods in a multi-stage distributed environment.

This PR implements the necessary infrastructure to broadcast lifecycle commands from the Orchestrator down to the distributed workers, enabling efficient memory management without losing model state.

Key Changes
Orchestrator API: Added sleep(level) and wake_up() to the Omni entry point to manage all active stages.

Instruction Forwarding: Updated StageController and OmniStageTaskType to support standardized lifecycle task signaling across processes.

Worker Integration: Enabled the underlying workers to receive and execute memory release/recovery instructions via the established task queue.
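The forwarding path described above can be sketched with plain queues and threads. This is only an illustration under stated assumptions: `TaskType`, `ToyStage`, and `ToyOrchestrator` are hypothetical stand-ins, not the actual `OmniStageTaskType`, `StageController`, or `Omni` classes, and the real workers run in separate processes and release or restore GPU memory instead of flipping a flag.

```python
import queue
import threading
from enum import Enum


class TaskType(Enum):
    # Hypothetical stand-ins for the lifecycle task types; not the real enum.
    SLEEP = "sleep"
    WAKE_UP = "wake_up"
    SHUTDOWN = "shutdown"


class ToyStage:
    """Stand-in for one stage worker that consumes tasks from a queue."""

    def __init__(self, stage_id: int) -> None:
        self.stage_id = stage_id
        self.in_q: queue.Queue = queue.Queue()
        self.out_q: queue.Queue = queue.Queue()
        self.asleep = False
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        while True:
            task = self.in_q.get()
            if task["type"] is TaskType.SHUTDOWN:
                break
            if task["type"] is TaskType.SLEEP:
                self.asleep = True  # a real worker would release VRAM here
            elif task["type"] is TaskType.WAKE_UP:
                self.asleep = False  # a real worker would restore memory here
            # Acknowledge so the orchestrator can wait for completion.
            self.out_q.put({"stage_id": self.stage_id, "type": task["type"]})

    def stop(self) -> None:
        self.in_q.put({"type": TaskType.SHUTDOWN})
        self._thread.join()


class ToyOrchestrator:
    """Broadcasts one lifecycle task to every stage and collects the acks."""

    def __init__(self, stages) -> None:
        self.stages = list(stages)

    def _broadcast(self, task: dict, timeout: float = 5.0) -> list:
        for stage in self.stages:
            stage.in_q.put(task)
        # Block until every stage has acknowledged the command.
        return [stage.out_q.get(timeout=timeout) for stage in self.stages]

    def sleep(self, level: int = 1) -> list:
        return self._broadcast({"type": TaskType.SLEEP, "level": level})

    def wake_up(self) -> list:
        return self._broadcast({"type": TaskType.WAKE_UP})
```

Waiting for an acknowledgement from each stage before returning is a design choice of this sketch: it lets the caller treat `sleep()` as complete only once every stage has processed the command.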

Test Result

The implementation was verified using Qwen/Qwen2.5-Omni-3B with FP8 quantization enabled. Verification machine: RTX A6000 x 2

Verification logic:

Initial failure confirmed (AttributeError) when calling engine.sleep().

After applying the fix, the engine successfully entered Level 2 Sleep (verified VRAM release).

Upon wake_up(), the model produced bit-identical Token IDs compared to the baseline, confirming that the internal quantization states and KV caches are preserved correctly in the omni architecture.
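When the token IDs do not match, it helps to know where the two sequences diverge instead of only seeing a failed assertion. The helper below is a minimal sketch; `first_divergence` is a hypothetical name and not part of this PR. The comparison is only meaningful when decoding is deterministic (for example greedy sampling with a fixed seed).

```python
from typing import Optional, Sequence


def first_divergence(base_ids: Sequence[int], post_ids: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if identical."""
    for i, (base, post) in enumerate(zip(base_ids, post_ids)):
        if base != post:
            return i
    # All shared positions match; a length difference is still a divergence.
    if len(base_ids) != len(post_ids):
        return min(len(base_ids), len(post_ids))
    return None
```

For example, `first_divergence(base_ids, post_ids)` returning `None` corresponds to the identical-output case checked in the test script.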

  1. Reproduction of Missing Methods
    Initially, the orchestrator crashed as it could not handle lifecycle commands.
(vllm-venv) root@8ed4e0d980ee:/workspace/vllm-omni# python test.py
# ... 
[1, 284, 220, 15, 59, 701, 714, 2474, 17767, 81, 57758, 4157, 387, 7168, 320, 1782, 6118, 1035, 614, 902, 3082, 701, 582, 2908, 279, 3930, 438, 17767, 81, 57758, 19827, 220, 15, 504, 279, 1290, 13, 1634, 17767, 81, 57758, 19827, 220, 15, 11, 1124, 1188, 16, 3795, 29776, 18, 57758, 19827, 220, 16, 11, 773, 17767, 16, 481, 320, 16, 3795, 29776, 18, 57758, 19827, 220, 15, 13, 4354, 1, 31267, 11, 279, 7192, 3082, 374, 16994, 979, 17767, 81, 57758, 374, 1101, 10078, 7046, 1091, 220, 15, 11, 714, 369, 279, 7428, 315, 419, 3491, 11, 582, 2908, 279, 31787, 7192, 979, 17767, 81, 57758, 374, 1602, 2613, 714, 537, 7168, 382, 44500, 11, 279, 7192, 3204, 3082, 315, 279, 6118, 374, 510, 59, 9640, 59, 79075, 35702, 37018, 69094, 15170, 18, 3417, 624, 59, 2533, 33975, 25, 21144, 264, 8500, 4512, 52847, 553, 400, 64, 62, 16, 284, 220, 17, 54876, 323, 369, 400, 77, 1124, 709, 80, 220, 17, 54876, 400, 64, 1089, 284, 308, 1124, 50853, 264, 15159, 77, 12, 16, 92, 488, 308, 61, 17, 12947, 7379, 279, 24632, 7546, 400, 74, 3, 1741, 429, 400, 64, 4698, 861, 220, 16, 15, 15, 15, 3, 382, 1249, 1477, 279, 24632, 7546, 17767, 595, 1124, 8, 1741, 429, 17767, 264, 4698, 861, 220, 16, 15, 15, 15, 1124, 8, 369, 279, 8500, 4512, 553, 17767, 264, 62, 16, 284, 220, 17, 1124, 8, 323, 17767, 264, 1089, 284, 308, 1124, 50853, 264, 15159, 77, 12, 16, 92, 488, 308, 61, 17, 1124, 8, 369, 17767, 308, 1124, 709, 80, 220, 17, 1124, 701, 582, 686, 12564, 279, 3793, 315, 279, 8500, 3019, 553, 3019, 3080, 582, 1477, 279, 12685, 4647, 382, 5338, 11, 582, 12564, 17767, 264, 62, 17, 1124, 982, 59, 9640, 64, 62, 17, 284, 220, 17, 1124, 50853, 264, 62, 16, 488, 220, 17, 61, 17, 284, 220, 17, 61, 17, 284, 220, 17, 488, 220, 19, 284, 220, 23, 198, 59, 2533, 5847, 11, 582, 12564, 17767, 264, 62, 18, 1124, 982, 59, 9640, 64, 62, 18, 284, 220, 18, 1124, 50853, 264, 62, 17, 488, 220, 18, 61, 17, 284, 220, 18, 1124, 50853, 220, 23, 488, 220, 24, 284, 220, 18, 18, 198, 59, 2533, 5847, 11, 582, 12564, 17767, 264, 62, 19, 
1124, 982, 59, 9640, 64, 62, 19, 284, 220, 19, 1124, 50853, 264, 62, 18, 488, 220, 19, 61, 17, 284, 220, 19, 1124, 50853, 220, 18, 18, 488, 220, 16, 21, 284, 220, 16, 19, 23, 198, 59, 2533, 5847, 11, 582, 12564, 17767, 264, 62, 20, 1124, 982, 59, 9640, 64, 62, 20, 1124, 50853, 264, 62, 19, 488, 220, 20, 61, 17, 284, 220, 16, 19, 23, 488, 220, 17, 20, 284, 220, 22, 21, 20, 198, 59, 2533, 5847, 11, 582, 12564, 17767, 264, 62, 21, 1124, 982, 59, 9640, 64, 62, 21, 284, 220, 21, 1124, 50853, 264, 62, 20, 488, 220, 21, 61, 17, 284, 220, 21, 1124, 50853, 220, 22, 21, 20, 488, 220, 18, 21, 284, 220, 19, 21, 17, 21, 198, 59, 2533, 1654, 1490, 429, 17767, 264, 62, 21, 284, 220, 19, 21, 17, 21, 1124, 701, 892, 374, 7046, 1091, 220, 16, 15, 15, 15, 13, 15277, 11, 279, 24632, 7546, 17767, 595, 1124, 8, 1741, 429, 17767, 264, 4698, 861, 220, 16, 15, 15, 15, 1124, 8, 374, 17767, 595, 284, 220, 21, 1124, 3593, 44500, 11, 279, 4226, 374, 510, 59, 9640, 59, 79075]

3. Testing Sleep Level 2 (VRAM Release)...
Traceback (most recent call last):
  File "/workspace/vllm-omni/test.py", line 63, in <module>
    main()
  File "/workspace/vllm-omni/test.py", line 41, in main
    engine.sleep(level=2)
AttributeError: 'Omni' object has no attribute 'sleep'

[Stage-2] INFO 02-02 17:28:35 [omni_stage.py:779] Received shutdown signal
[Stage-0] INFO 02-02 17:28:35 [omni_stage.py:779] Received shutdown signal
[Stage-1] INFO 02-02 17:28:35 [omni_stage.py:779] Received shutdown signal
  2. After the fix
    Running test.py again after the fix, the orchestrator recognized and executed the sleep and wake-up commands.
3. Testing Sleep Level 2 (VRAM Release)...
Checking VRAM... (Should be 0 in nvidia-smi). Waiting 10s...

4. Waking up the engine...

5. Testing Post-Wakeup Generation...
Adding requests: 0%| ...
Processed prompts: 0%| ...
(vllm-venv) root@8ed4e0d980ee:/workspace/vllm-omni# python test.py

Next, we compute $a_3$:
$$a_3 = 3 \cdot a_2 + 3^2 = 3 \cdot 8 + 9 = 33$$

Next, we compute $a_4$:
$$a_4 = 4 \cdot a_3 + 4^2 = 4 \cdot 33 + 16 = 148$$

Next, we compute $a_5$:
$$a_5 = 5 \cdot a_4 + 5^2 = 5 \cdot 148 + 25 = 765$$

Next, we compute $a_6$:
$$a_6 = 6 \cdot a_5 + 6^2 = 6 \cdot 765 + 36 = 4626$$

We see that $a_6 = 4626$, which is greater than 1000. Therefore, the smallest integer $k$ such that $a_k > 1000$ is $k = 6$.

Thus, the answer is:
\boxed{6}

SUCCESS: Model output is identical (Text & Tokens)!
FP8 Scaling factors re-initialized and KV Cache cleared correctly.
[Stage-0] INFO 02-02 17:39:19 [omni_stage.py:787] Received shutdown signal
[Stage-2] INFO 02-02 17:39:19 [omni_stage.py:787] Received shutdown signal
[Stage-1] INFO 02-02 17:39:19 [omni_stage.py:787] Received shutdown signal
[rank0]:[W202 17:39:19.216876953 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
  3. Related script
import time
import logging
from vllm_omni.entrypoints.omni import Omni

logging.basicConfig(level=logging.INFO, format='%(levelname)s [%(filename)s] %(message)s')
logger = logging.getLogger(__name__)

def main():
    model_name = "Qwen/Qwen2.5-Omni-3B"
    
    stage_0 = {
        "stage_id": 0,
        "stage_type": "llm",
        "runtime": {"process": True, "devices": "0", "max_batch_size": 1},
        "engine_args": {
            "model": model_name,
            "model_stage": "thinker",
            "enable_sleep_mode": True,
            "quantization": "fp8",
            "kv_cache_dtype": "fp8",
            "gpu_memory_utilization": 0.4,
            "enforce_eager": True,
            "trust_remote_code": True
        }
    }

    print("\n 1. Initializing Omni Engine (FP8 Enabled)")
    engine = Omni(model_name, stages=[stage_0], enable_sleep_mode=True)

    prompt = {"prompt": "The capital of France is", "max_tokens": 10}

    # 1. Baseline
    print("\n 2.Running Baseline Generation...")
    res_base = engine.generate(prompt)
    base_ids = res_base[0].request_output[0].outputs[0].token_ids
    print(f"Baseline Output IDs: {base_ids}")

    # 2. Sleep Test
    print("\n 3. Testing Sleep Level 2 (VRAM Release)...")
    engine.sleep(level=2)
    print("Checking VRAM... (Should be 0 in nvidia-smi). Waiting 10s...")
    time.sleep(10)

    # 3. Wake up
    print("\n 4.Waking up the engine...")
    engine.wake_up()
    time.sleep(5)

    # 4. Verification
    print("\n 5. Testing Post-Wakeup Generation...")
    res_post = engine.generate(prompt)
    
    base_text = res_base[0].request_output[0].outputs[0].text
    post_text = res_post[0].request_output[0].outputs[0].text
    
    base_ids = res_base[0].request_output[0].outputs[0].token_ids
    post_ids = res_post[0].request_output[0].outputs[0].token_ids


    try:
        # The assertion message is shown only on failure, so it must describe a mismatch.
        assert base_text == post_text, f"Text mismatch!\nBase: {base_text}\nPost: {post_text}"
        assert base_ids == post_ids, f"Token ID mismatch!\nBase: {base_ids}\nPost: {post_ids}"
        
        print("\n SUCCESS: Model output is identical (Text & Tokens)!")
        print("FP8 Scaling factors re-initialized and KV Cache cleared correctly.")
        
    except AssertionError as e:
        print("\n FAIL: Consistency Check Failed!")
        print(str(e))
        print(f"Last 5 IDs (Base): {base_ids[-5:]}")
        print(f"Last 5 IDs (Post): {post_ids[-5:]}")

if __name__ == "__main__":
    main()
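
The script above relies on watching nvidia-smi by hand during the 10 s wait. A minimal sketch of reading the same number programmatically follows; `parse_used_mib` and `gpu_memory_used_mib` are hypothetical helpers, not part of this PR, and they assume `nvidia-smi` is on the PATH and reports a plain integer per GPU. Used memory typically does not drop to exactly 0 while the process still holds a CUDA context, so comparing readings before and after `sleep()` is more informative than expecting 0.

```python
import subprocess
from typing import List


def parse_used_mib(csv_text: str) -> List[int]:
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in csv_text.splitlines() if line.strip()]


def gpu_memory_used_mib() -> List[int]:
    """Return memory.used per GPU in MiB, as reported by nvidia-smi."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_used_mib(result.stdout)
```

A possible use is to call `gpu_memory_used_mib()` once before `engine.sleep(level=2)` and once after it, then log both lists next to the baseline and post-wake-up token IDs.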

Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd Flink-ddd force-pushed the fix/omni-sleep-fp8-garbage branch from cf33769 to 294a798 Compare February 2, 2026 18:12
@Flink-ddd Flink-ddd marked this pull request as ready for review February 2, 2026 18:14

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 294a7980a8


Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
Signed-off-by: vensen <vensenmu@gmail.com>
@Flink-ddd Flink-ddd force-pushed the fix/omni-sleep-fp8-garbage branch from 294a798 to e47653c Compare February 2, 2026 18:48
@hsliuustc0106
Collaborator

@knlnguyen1802 PTAL

@hsliuustc0106
Collaborator

@princepride

@princepride
Collaborator

@Flink-ddd I think you should open an RFC first, because this feature matters for RL, where the inference engine and the training engine share the same device. I also have a few questions:

  • It seems you have only implemented the sleep-mode frontend API. Have you noticed that OmniDiffusion doesn't have a sleep method?
  • I ran your test code on 2 x H200 and it didn't work; I got an OOM error.
Details ``` root@deepseek-v3-2-vllm-85c4fdb9f9-6nzg9:/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni# python3 test.py
  1. Initializing Omni Engine (FP8 Enabled)
    INFO 02-02 18:24:22 [omni.py:119] Initializing stages for model: Qwen/Qwen2.5-Omni-3B
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    INFO 02-02 18:24:23 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
    INFO 02-02 18:24:23 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
    INFO 02-02 18:24:23 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
    INFO 02-02 18:24:23 [factory.py:46] Created connector: SharedMemoryConnector
    INFO 02-02 18:24:23 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
    INFO 02-02 18:24:23 [factory.py:46] Created connector: SharedMemoryConnector
    INFO 02-02 18:24:23 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
    INFO 02-02 18:24:23 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 0, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'thinker', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.8, 'enforce_eager': True, 'trust_remote_code': True, 'engine_output_type': 'latent', 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'max_num_seqs': 1, 'async_chunk': False}, 'is_comprehension': True, 'final_output': True, 'final_output_type': 'text', 'default_sampling_params': {'temperature': 0.0, 'top_p': 1.0, 'top_k': -1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.1}}
    INFO 02-02 18:24:23 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 1, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '1', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'talker', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'ar', 'scheduler_cls': 'vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler', 'gpu_memory_utilization': 0.8, 'enforce_eager': True, 'trust_remote_code': True, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'engine_output_type': 'latent', 'max_num_seqs': 1, 'async_chunk': False}, 'engine_input_source': [0], 'custom_process_input_func': 'vllm_omni.model_executor.stage_input_processors.qwen2_5_omni.thinker2talker', 'default_sampling_params': {'temperature': 0.9, 'top_p': 0.8, 'top_k': 40, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.05, 'stop_token_ids': [8294]}}
    INFO 02-02 18:24:23 [omni_stage.py:100] [OmniStage] stage_config: {'stage_id': 2, 'stage_type': 'llm', 'runtime': {'process': True, 'devices': '0', 'max_batch_size': 1}, 'engine_args': {'model_stage': 'code2wav', 'model_arch': 'Qwen2_5OmniForConditionalGeneration', 'worker_type': 'generation', 'scheduler_cls': 'vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler', 'gpu_memory_utilization': 0.15, 'enforce_eager': True, 'trust_remote_code': True, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'async_scheduling': False, 'engine_output_type': 'audio', 'max_num_seqs': 1, 'async_chunk': False}, 'engine_input_source': [1], 'final_output': True, 'final_output_type': 'audio', 'default_sampling_params': {'temperature': 0.0, 'top_p': 1.0, 'top_k': -1, 'max_tokens': 2048, 'seed': 42, 'detokenize': True, 'repetition_penalty': 1.1}}
    INFO 02-02 18:24:23 [omni.py:338] [Orchestrator] Waiting for 3 stages to initialize (timeout: 300s)
    [Stage-0] INFO 02-02 18:24:31 [omni_stage.py:511] Starting stage worker with model: Qwen/Qwen2.5-Omni-3B
    [Stage-0] INFO 02-02 18:24:31 [omni_stage.py:524] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    [Stage-2] INFO 02-02 18:24:31 [omni_stage.py:511] Starting stage worker with model: Qwen/Qwen2.5-Omni-3B
    [Stage-2] INFO 02-02 18:24:31 [omni_stage.py:524] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
    [Stage-1] INFO 02-02 18:24:31 [omni_stage.py:511] Starting stage worker with model: Qwen/Qwen2.5-Omni-3B
    [Stage-1] INFO 02-02 18:24:31 [omni_stage.py:524] [Stage] Set VLLM_WORKER_MULTIPROC_METHOD=spawn
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    [Stage-0] INFO 02-02 18:24:32 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
    [Stage-0] INFO 02-02 18:24:32 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
    [Stage-0] INFO 02-02 18:24:32 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
    [Stage-0] INFO 02-02 18:24:32 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-0] INFO 02-02 18:24:32 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
    [Stage-0] INFO 02-02 18:24:32 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-0] INFO 02-02 18:24:32 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    [Stage-1] INFO 02-02 18:24:32 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
    [Stage-1] INFO 02-02 18:24:32 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
    [Stage-1] INFO 02-02 18:24:32 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
    [Stage-1] INFO 02-02 18:24:32 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-1] INFO 02-02 18:24:32 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
    [Stage-1] INFO 02-02 18:24:32 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-1] INFO 02-02 18:24:32 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    [Stage-0] INFO 02-02 18:24:32 [model.py:530] Resolved architecture: Qwen2_5OmniModel
    [Stage-0] INFO 02-02 18:24:32 [model.py:1545] Using max model len 32768
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    [Stage-1] INFO 02-02 18:24:33 [model.py:530] Resolved architecture: Qwen2_5OmniModel
    [Stage-1] INFO 02-02 18:24:33 [model.py:1545] Using max model len 32768
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    [Stage-0] INFO 02-02 18:24:41 [model.py:212] Resolved architecture: Qwen2_5OmniForConditionalGeneration
    [Stage-0] INFO 02-02 18:24:41 [model.py:1545] Using max model len 32768
    [Stage-0] INFO 02-02 18:24:41 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=32768.
    [Stage-0] INFO 02-02 18:24:41 [vllm.py:630] Asynchronous scheduling is enabled.
    [Stage-0] INFO 02-02 18:24:41 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
    [Stage-0] WARNING 02-02 18:24:41 [vllm.py:665] Enforce eager set, overriding optimization level to -O0
    [Stage-0] INFO 02-02 18:24:41 [vllm.py:765] Cudagraph is disabled under eager mode
    [Stage-1] INFO 02-02 18:24:42 [model.py:212] Resolved architecture: Qwen2_5OmniForConditionalGeneration
    [Stage-1] INFO 02-02 18:24:42 [model.py:1545] Using max model len 32768
    [Stage-1] INFO 02-02 18:24:42 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=32768.
    [Stage-1] INFO 02-02 18:24:42 [vllm.py:630] Asynchronous scheduling is enabled.
    [Stage-1] INFO 02-02 18:24:42 [vllm.py:637] Disabling NCCL for DP synchronization when using async scheduling.
    [Stage-1] WARNING 02-02 18:24:42 [vllm.py:665] Enforce eager set, overriding optimization level to -O0
    [Stage-1] INFO 02-02 18:24:42 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:50 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='Qwen/Qwen2.5-Omni-3B', speculative_config=None, tokenizer='Qwen/Qwen2.5-Omni-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-Omni-3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': 
False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:50 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='Qwen/Qwen2.5-Omni-3B', speculative_config=None, tokenizer='Qwen/Qwen2.5-Omni-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-Omni-3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': 
False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:51 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.45.52:57047 backend=nccl
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:51 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
    [Stage-2] WARNING 02-02 18:24:51 [omni_stage.py:665] Timeout waiting for device 0 initialization lock, proceeding anyway
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:51 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.45.52:57559 backend=nccl
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:52 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
    (EngineCore_DP0 pid=181642) The image processor of type Qwen2VLImageProcessor is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with use_fast=False. Note that this behavior will be extended to all models in a future release.
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    (EngineCore_DP0 pid=181645) The image processor of type Qwen2VLImageProcessor is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with use_fast=False. Note that this behavior will be extended to all models in a future release.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    [Stage-2] INFO 02-02 18:24:52 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('0', '1')
    [Stage-2] INFO 02-02 18:24:52 [initialization.py:197] Auto-configuring SharedMemoryConnector for edge ('1', '2')
    [Stage-2] INFO 02-02 18:24:52 [initialization.py:234] Loaded OmniTransferConfig with 2 connector configurations
    [Stage-2] INFO 02-02 18:24:52 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-2] INFO 02-02 18:24:52 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
    [Stage-2] INFO 02-02 18:24:52 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-2] INFO 02-02 18:24:52 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    [Stage-2] INFO 02-02 18:24:53 [model.py:530] Resolved architecture: Qwen2_5OmniModel
    [Stage-2] INFO 02-02 18:24:53 [model.py:1545] Using max model len 32768
    The argument trust_remote_code is to be used with Auto classes. It has no effect here and is ignored.
    Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:57 [gpu_model_runner.py:3808] Starting to load model Qwen/Qwen2.5-Omni-3B...
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:57 [vllm.py:630] Asynchronous scheduling is enabled.
    (EngineCore_DP0 pid=181642) [Stage-0] WARNING 02-02 18:24:57 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:57 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=181642) [Stage-0] WARNING 02-02 18:24:57 [qwen2_5_omni_thinker.py:272] flash_attn is not available, the model may not yield the exactly same result as the transformers implementation in the audio tower part.
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:57 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
    (EngineCore_DP0 pid=181642) [Stage-0] WARNING 02-02 18:24:57 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:57 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=181642) Warning: mrope_section check is disabled in Qwen2.5-Omni, this may cause errors, and should be restored in the future.
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:24:57 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
    Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 86.38it/s]
    (EngineCore_DP0 pid=181642)
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:59 [gpu_model_runner.py:3808] Starting to load model Qwen/Qwen2.5-Omni-3B...
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:59 [vllm.py:630] Asynchronous scheduling is enabled.
    (EngineCore_DP0 pid=181645) [Stage-1] WARNING 02-02 18:24:59 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:59 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=181645) [Stage-1] WARNING 02-02 18:24:59 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:59 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=181645) Warning: mrope_section check is disabled in Qwen2.5-Omni, this may cause errors, and should be restored in the future.
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:59 [cuda.py:351] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:24:59 [mm_encoder_attention.py:86] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
    Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:00 [default_loader.py:291] Loading weights took 2.33 seconds
    Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 87.12it/s]
    (EngineCore_DP0 pid=181645)
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:01 [gpu_model_runner.py:3905] Model loading took 8.84 GiB memory and 3.225263 seconds
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:01 [gpu_model_runner.py:4715] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
    [Stage-2] INFO 02-02 18:25:01 [model.py:212] Resolved architecture: Qwen2_5OmniForConditionalGeneration
    [Stage-2] INFO 02-02 18:25:01 [model.py:1545] Using max model len 32768
    [Stage-2] INFO 02-02 18:25:01 [scheduler.py:229] Chunked prefill is enabled with max_num_batched_tokens=32768.
    [Stage-2] INFO 02-02 18:25:01 [vllm.py:630] Asynchronous scheduling is disabled.
    [Stage-2] WARNING 02-02 18:25:01 [vllm.py:665] Enforce eager set, overriding optimization level to -O0
    [Stage-2] INFO 02-02 18:25:01 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:03 [gpu_worker.py:358] Available KV cache memory: 100.43 GiB
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:03 [kv_cache_utils.py:1305] GPU KV cache size: 2,925,232 tokens
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:03 [kv_cache_utils.py:1310] Maximum concurrency for 32,768 tokens per request: 89.27x
    (EngineCore_DP0 pid=181642) 2026-02-02 18:25:03,837 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
    (EngineCore_DP0 pid=181642) 2026-02-02 18:25:03,867 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:04 [core.py:273] init engine (profile, create kv cache, warmup model) took 2.60 seconds
    (EngineCore_DP0 pid=181642) [Stage-0] WARNING 02-02 18:25:04 [scheduler.py:171] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
    (EngineCore_DP0 pid=181642) [Stage-0] WARNING 02-02 18:25:04 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=181642) [Stage-0] INFO 02-02 18:25:04 [vllm.py:765] Cudagraph is disabled under eager mode
    [Stage-0] INFO 02-02 18:25:04 [omni_llm.py:174] Supported_tasks: ['generate']
    [Stage-0] INFO 02-02 18:25:04 [initialization.py:288] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
    [Stage-0] INFO 02-02 18:25:04 [omni_stage.py:745] Max batch size: 1
    INFO 02-02 18:25:04 [omni.py:331] [Orchestrator] Stage-0 reported ready
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:07 [qwen2_5_omni_talker.py:196] [Model Loaded] name=Qwen2_5OmniTalkerForConditionalGeneration, success=True, size=3225.26 MB, device=cuda:0
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:07 [default_loader.py:291] Loading weights took 6.40 seconds
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:07 [gpu_model_runner.py:3905] Model loading took 3.76 GiB memory and 7.483684 seconds
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:07 [gpu_model_runner.py:4715] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 1 video items of the maximum feature size.
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:09 [gpu_worker.py:358] Available KV cache memory: 106.79 GiB
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:09 [kv_cache_utils.py:1305] GPU KV cache size: 9,331,536 tokens
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:09 [kv_cache_utils.py:1310] Maximum concurrency for 32,768 tokens per request: 284.78x
    (EngineCore_DP0 pid=181645) 2026-02-02 18:25:09,464 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
    (EngineCore_DP0 pid=181645) 2026-02-02 18:25:09,494 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:09 [core.py:273] init engine (profile, create kv cache, warmup model) took 1.81 seconds
    (EngineCore_DP0 pid=181645) [Stage-1] WARNING 02-02 18:25:09 [scheduler.py:171] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:09 [core.py:97] Initializing a V1 LLM engine (v0.14.0) with config: model='Qwen/Qwen2.5-Omni-3B', speculative_config=None, tokenizer='Qwen/Qwen2.5-Omni-3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen2.5-Omni-3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': 
    False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': True}, 'local_cache_dir': None}
    (EngineCore_DP0 pid=181645) [Stage-1] WARNING 02-02 18:25:10 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=181645) [Stage-1] INFO 02-02 18:25:10 [vllm.py:765] Cudagraph is disabled under eager mode
    [Stage-1] INFO 02-02 18:25:10 [omni_llm.py:174] Supported_tasks: ['generate']
    [Stage-1] INFO 02-02 18:25:10 [initialization.py:288] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
    [Stage-1] INFO 02-02 18:25:10 [factory.py:46] Created connector: SharedMemoryConnector
    [Stage-1] INFO 02-02 18:25:10 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
    [Stage-1] INFO 02-02 18:25:10 [omni_stage.py:745] Max batch size: 1
    INFO 02-02 18:25:10 [omni.py:331] [Orchestrator] Stage-1 reported ready
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:11 [parallel_state.py:1214] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.244.45.52:39251 backend=nccl
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:11 [parallel_state.py:1425] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
    (EngineCore_DP0 pid=182507) The image processor of type Qwen2VLImageProcessor is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with use_fast=False. Note that this behavior will be extended to all models in a future release.
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:16 [gpu_model_runner.py:3808] Starting to load model Qwen/Qwen2.5-Omni-3B...
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:16 [vllm.py:630] Asynchronous scheduling is disabled.
    (EngineCore_DP0 pid=182507) [Stage-2] WARNING 02-02 18:25:16 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:16 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=182507) [Stage-2] WARNING 02-02 18:25:16 [vllm.py:672] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:16 [vllm.py:765] Cudagraph is disabled under eager mode
    (EngineCore_DP0 pid=182507) [Stage-2] WARNING 02-02 18:25:16 [utils.py:59] Trying to guess the arguments for old-style model class <class 'vllm_omni.model_executor.models.qwen2_5_omni.qwen2_5_omni_token2wav.Qwen2_5OmniToken2WavModel'>
    Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
    Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:00<00:00, 88.92it/s]
    (EngineCore_DP0 pid=182507)
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:17 [weight_utils.py:46] Using model weights format ['*.pt']
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:18 [qwen2_5_omni_token2wav.py:1759] [Model Loaded] name=Qwen2_5OmniToken2WavForConditionalGenerationVLLM, success=True, size=1492.80 MB, device=cuda:0
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:18 [default_loader.py:291] Loading weights took 1.47 seconds
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:19 [gpu_model_runner.py:3905] Model loading took 1.46 GiB memory and 2.075440 seconds
    (EngineCore_DP0 pid=182507) 2026-02-02 18:25:19,297 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
    (EngineCore_DP0 pid=182507) [Stage-2] INFO 02-02 18:25:19 [qwen2_5_omni.py:941] Currently, we do not use the chunked process, we only use the token2wav.process_chunk for the whole sequence. The stream mode will be implemented in the future.
    (EngineCore_DP0 pid=182507) 2026-02-02 18:25:19,620 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] EngineCore failed to start.
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] Traceback (most recent call last):
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] super().__init__(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 113, in __init__
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 270, in _initialize_kv_caches
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] self.model_executor.initialize_from_config(kv_cache_configs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] self.collective_rpc("compile_or_warm_up_model")
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] result = run_method(self.driver_worker, method, args, kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 451, in compile_or_warm_up_model
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] kernel_warmup(self)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py", line 41, in kernel_warmup
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] flashinfer_autotune(worker.model_runner)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py", line 93, in flashinfer_autotune
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] runner._dummy_run(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/worker/gpu_generation_model_runner.py", line 633, in _dummy_run
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] outputs = self.model(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py", line 343, in forward
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] audio_tensor = self.generate_audio(code, voice_type)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py", line 541, in generate_audio
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] audio_tensor = self._codec_to_audio(code_tensor, voice_type)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py", line 961, in _codec_to_audio
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] _, audio_chunk = self.token2wav.process_chunk(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1871, in process_chunk
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] _mel, out = self.process_little_chunk(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1851, in process_little_chunk
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] mel = self.token2wav(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1518, in forward
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] mel_spectrogram = self.code2wav_dit_model.sample(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1333, in sample
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] solution_trajectory = ode_solver.integrate(time_embedding)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1150, in integrate
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] delta_value, _ = self._compute_step(self.function, time_start, time_step, time_end, current_value)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1116, in _compute_step
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] function_value_start = function(time_start, value_start)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1309, in ode_function
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] model_output = self(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1253, in forward
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] hidden_states = transformer_block(
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 652, in forward
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] attention_mask=(block_diff >= -float(self.look_backward_block))
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) [Stage-2] ERROR 02-02 18:25:19 [core.py:936] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.82 GiB. GPU 0 has a total capacity of 139.81 GiB of which 16.54 GiB is free. Process 181642 has 113.43 GiB memory in use. Including non-PyTorch memory, this process has 9.83 GiB memory in use. Of the allocated memory 9.07 GiB is allocated by PyTorch, and 28.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    (EngineCore_DP0 pid=182507) Process EngineCore_DP0:
    (EngineCore_DP0 pid=182507) Traceback (most recent call last):
    (EngineCore_DP0 pid=182507) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    (EngineCore_DP0 pid=182507) self.run()
    (EngineCore_DP0 pid=182507) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    (EngineCore_DP0 pid=182507) self._target(*self._args, **self._kwargs)
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 940, in run_engine_core
    (EngineCore_DP0 pid=182507) raise e
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 927, in run_engine_core
    (EngineCore_DP0 pid=182507) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 692, in __init__
    (EngineCore_DP0 pid=182507) super().__init__(
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 113, in __init__
    (EngineCore_DP0 pid=182507) num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 270, in _initialize_kv_caches
    (EngineCore_DP0 pid=182507) self.model_executor.initialize_from_config(kv_cache_configs)
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 116, in initialize_from_config
    (EngineCore_DP0 pid=182507) self.collective_rpc("compile_or_warm_up_model")
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 75, in collective_rpc
    (EngineCore_DP0 pid=182507) result = run_method(self.driver_worker, method, args, kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/serial_utils.py", line 461, in run_method
    (EngineCore_DP0 pid=182507) return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 451, in compile_or_warm_up_model
    (EngineCore_DP0 pid=182507) kernel_warmup(self)
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py", line 41, in kernel_warmup
    (EngineCore_DP0 pid=182507) flashinfer_autotune(worker.model_runner)
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/warmup/kernel_warmup.py", line 93, in flashinfer_autotune
    (EngineCore_DP0 pid=182507) runner._dummy_run(
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore_DP0 pid=182507) return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/worker/gpu_generation_model_runner.py", line 633, in _dummy_run
    (EngineCore_DP0 pid=182507) outputs = self.model(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py", line 343, in forward
    (EngineCore_DP0 pid=182507) audio_tensor = self.generate_audio(code, voice_type)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py", line 541, in generate_audio
    (EngineCore_DP0 pid=182507) audio_tensor = self._codec_to_audio(code_tensor, voice_type)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni.py", line 961, in _codec_to_audio
    (EngineCore_DP0 pid=182507) _, audio_chunk = self.token2wav.process_chunk(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore_DP0 pid=182507) return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1871, in process_chunk
    (EngineCore_DP0 pid=182507) _mel, out = self.process_little_chunk(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    (EngineCore_DP0 pid=182507) return func(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1851, in process_little_chunk
    (EngineCore_DP0 pid=182507) mel = self.token2wav(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1518, in forward
    (EngineCore_DP0 pid=182507) mel_spectrogram = self.code2wav_dit_model.sample(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1333, in sample
    (EngineCore_DP0 pid=182507) solution_trajectory = ode_solver.integrate(time_embedding)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1150, in integrate
    (EngineCore_DP0 pid=182507) delta_value, _ = self._compute_step(self.function, time_start, time_step, time_end, current_value)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1116, in _compute_step
    (EngineCore_DP0 pid=182507) function_value_start = function(time_start, value_start)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1309, in ode_function
    (EngineCore_DP0 pid=182507) model_output = self(
    (EngineCore_DP0 pid=182507) ^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 1253, in forward
    (EngineCore_DP0 pid=182507) hidden_states = transformer_block(
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    (EngineCore_DP0 pid=182507) return self._call_impl(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    (EngineCore_DP0 pid=182507) return forward_call(*args, **kwargs)
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/model_executor/models/qwen2_5_omni/qwen2_5_omni_token2wav.py", line 652, in forward
    (EngineCore_DP0 pid=182507) attention_mask=(block_diff >= -float(self.look_backward_block))
    (EngineCore_DP0 pid=182507) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    (EngineCore_DP0 pid=182507) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 26.82 GiB. GPU 0 has a total capacity of 139.81 GiB of which 16.54 GiB is free. Process 181642 has 113.43 GiB memory in use. Including non-PyTorch memory, this process has 9.83 GiB memory in use. Of the allocated memory 9.07 GiB is allocated by PyTorch, and 28.50 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
    [rank0]:[W202 18:25:20.118736521 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
    Process SpawnProcess-3:
    Traceback (most recent call last):
    File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
    File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
    File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 715, in _stage_worker
    stage_engine = OmniLLM(model=model, **engine_args)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni_llm.py", line 158, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args=engine_args, usage_context=UsageContext.LLM_CLASS)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 176, in from_engine_args
    return cls(
    ^^^^
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/llm_engine.py", line 110, in __init__
    self.engine_core = EngineCoreClient.make_client(
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 94, in make_client
    return SyncMPClient(vllm_config, executor_class, log_stats)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 652, in __init__
    super().__init__(
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 479, in __init__
    with launch_core_engines(vllm_config, executor_class, log_stats) as (
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
    next(self.gen)
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 921, in launch_core_engines
    wait_for_engine_startup(
    File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 980, in wait_for_engine_startup
    raise RuntimeError(
    RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
    WARNING 02-02 18:29:23 [omni.py:362] [Orchestrator] Initialization timeout: 2/3 stages ready. Missing stages: [2]
    WARNING 02-02 18:29:23 [omni.py:377] [Orchestrator] Stage initialization timeout. Troubleshooting Steps:
    WARNING 02-02 18:29:23 [omni.py:377] 1) Ignore this warning if the model weight download / load from disk time is longer than 300s.
    WARNING 02-02 18:29:23 [omni.py:377] 2) Verify GPU/device assignment in config (runtime.devices) is correct.
    WARNING 02-02 18:29:23 [omni.py:377] 3) Check GPU/host memory availability; reduce model or batch size if needed.
    WARNING 02-02 18:29:23 [omni.py:377] 4) Check model weights path and network reachability (if loading remotely).
    WARNING 02-02 18:29:23 [omni.py:377] 5) Increase initialization wait time (stage_init_timeout or call-site timeout).

2.Running Baseline Generation...
Adding requests: 0%| | 0/1 [00:00<?, ?it/s^CTraceback (most recent call last): | 0/1 [00:00<?, ?it/s, est. speed input: 0.00 unit/s, output: 0.00 unit/s]
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/test.py", line 96, in <module>
main()
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/test.py", line 56, in main
res_base = engine.generate(prompt)
^^^^^^^^^^^^^^^^^^^^^^^
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 597, in generate
outputs = list(self._run_generation(prompts, sampling_params_list, use_tqdm))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 859, in _run_generation
time.sleep(0.005)
KeyboardInterrupt
[rank0]:[W202 18:33:28.190464993 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[rank0]:[W202 18:33:28.194966739 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
[Stage-1] ERROR 02-02 18:33:28 [core_client.py:610] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
[Stage-0] ERROR 02-02 18:33:28 [core_client.py:610] Engine core proc EngineCore_DP0 died unexpectedly, shutting down client.
Process SpawnProcess-1:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 788, in _stage_worker
task = in_q.get()
^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/queues.py", line 103, in get
res = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 395, in _recv
chunk = read(handle, remaining)
^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
Process SpawnProcess-2:
Traceback (most recent call last):
File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 788, in _stage_worker
task = in_q.get()
^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/queues.py", line 103, in get
res = self._recv_bytes()
^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 430, in _recv_bytes
buf = self._recv(4)
^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/connection.py", line 395, in _recv
chunk = read(handle, remaining)
^^^^^^^^^^^^^^^^^^^^^^^
KeyboardInterrupt
^CException ignored in atexit callback: <bound method finalize._exitfunc of <class 'weakref.finalize'>>
Traceback (most recent call last):
File "/usr/lib/python3.12/weakref.py", line 666, in _exitfunc
f()
File "/usr/lib/python3.12/weakref.py", line 590, in __call__
return info.func(*info.args, **(info.kwargs or {}))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni.py", line 60, in _weak_close_cleanup
stage.stop_stage_worker()
File "/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 374, in stop_stage_worker
self._proc.join(timeout=5)
File "/usr/lib/python3.12/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 43, in wait
return self.poll(os.WNOHANG if timeout == 0.0 else 0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/multiprocessing/popen_fork.py", line 27, in poll
pid, sts = os.waitpid(self.pid, flag)

</details>

@Flink-ddd
Contributor Author

Hi @princepride, thanks for your feedback! Regarding your concerns:

About the OOM error: I checked your log, and the OOM actually happened during the initialization of Stage-2 (its EngineCore failed to start), before any sleep or wake_up command was issued. The log shows a 26.82 GiB allocation failing with only 16.54 GiB free on GPU 0, where another process already holds 113.43 GiB. Lowering gpu_memory_utilization or placing the stages on separate devices should fix this initialization issue.
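To make the size of the gap concrete, here is a minimal sketch that pulls the requested and free sizes out of a PyTorch CUDA OOM message like the one in the log. The helper `oom_shortfall_gib` is hypothetical (it is not part of vLLM or vllm-omni) and assumes the message uses the "Tried to allocate ... GiB" and "of which ... GiB is free" wording seen above.

```python
import re

# Hypothetical helper (not part of vLLM or vllm-omni): extract the sizes
# reported in a PyTorch CUDA OOM message and compute the shortfall.
_GIB = r"([\d.]+) GiB"


def oom_shortfall_gib(message: str) -> float:
    """Return requested minus free memory in GiB (positive means it cannot fit)."""
    requested = re.search(rf"Tried to allocate {_GIB}", message)
    free = re.search(rf"of which {_GIB} is free", message)
    if requested is None or free is None:
        raise ValueError("message does not look like a CUDA OOM report")
    return float(requested.group(1)) - float(free.group(1))


# Excerpt of the message from the log above.
msg = (
    "CUDA out of memory. Tried to allocate 26.82 GiB. GPU 0 has a total "
    "capacity of 139.81 GiB of which 16.54 GiB is free."
)
print(f"shortfall: {oom_shortfall_gib(msg):.2f} GiB")
```

A positive result means the allocation could not fit in the memory left on that device, which points at how the stages share the GPU rather than at the sleep/wake_up path.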

About OmniDiffusion: Yes, OmniDiffusion currently lacks a sleep method. Sleep mode support in Omni is still incomplete and needs further work.
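As a rough illustration of one way an orchestrator could cope with a stage that has no sleep method, the sketch below calls `sleep(level)` only where it exists and reports which stages were skipped. All class and function names here (`SleepableEngine`, `DiffusionEngine`, `sleep_all`) are hypothetical stand-ins and do not reflect the actual vllm-omni classes.

```python
from typing import Iterable, List


class SleepableEngine:
    """Hypothetical stand-in for a stage engine that implements sleep()."""

    def __init__(self) -> None:
        self.sleeping_level = 0

    def sleep(self, level: int = 1) -> None:
        self.sleeping_level = level


class DiffusionEngine:
    """Hypothetical stand-in for a stage engine with no sleep() method."""


def sleep_all(engines: Iterable[object], level: int = 1) -> List[int]:
    """Call sleep(level) where available; return indices of skipped engines."""
    skipped = []
    for index, engine in enumerate(engines):
        sleep_fn = getattr(engine, "sleep", None)
        if callable(sleep_fn):
            sleep_fn(level)
        else:
            skipped.append(index)
    return skipped


engines = [SleepableEngine(), SleepableEngine(), DiffusionEngine()]
skipped = sleep_all(engines, level=2)
print("stages without sleep support:", skipped)
```

Returning the skipped indices instead of raising lets the caller decide whether partial sleep is acceptable; whether that is the right policy is exactly the kind of question an RFC would settle.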

About the RFC: I'd be happy to open an RFC so we can discuss how to build a complete sleep mode.

@knlnguyen1802
Contributor

@Flink-ddd You can take a look at this PR #355

If you want to add sleep and wake_up, please add them to AsyncOmni too.

@Flink-ddd
Contributor Author

Hi @knlnguyen1802, sure. I looked at PR #355, which already has some methods for sleep and wake-up. I'd like to plan these out first and submit an RFC; then we can discuss the Omni sleep mode feature.

@knlnguyen1802
Contributor

> Hi @knlnguyen1802, sure. I looked at PR #355, which already has some methods for sleep and wake-up. I'd like to plan these out first and submit an RFC; then we can discuss the Omni sleep mode feature.

Sure, please also notify me when you submit the new RFC. Thanks.

@Flink-ddd
Contributor Author

Flink-ddd commented Feb 4, 2026

Sure, thanks.

@Gaohan123
Collaborator

@Flink-ddd Any updates? Is there an RFC now?

@Flink-ddd
Contributor Author

Hi @Gaohan123, I'm running a demo for verification and should be able to submit the RFC soon. Thanks.

@Flink-ddd
Contributor Author

Flink-ddd commented Feb 10, 2026

@princepride @hsliuustc0106 @Gaohan123 @knlnguyen1802 I've submitted the RFC; PTAL. Thank you for your time.

@hsliuustc0106
Collaborator

@vllm-omni-reviewer

@Gaohan123
Collaborator

@Flink-ddd Hello, any updates?

@Flink-ddd
Contributor Author

Hi @Gaohan123, I'm considering closing this PR, because the sleep mode ACK work will cover the complete functionality. I'll open a new PR for sleep mode ACK soon; I'm currently testing and integrating it.
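For readers unfamiliar with the idea, the following is a minimal sketch of what an acknowledged lifecycle broadcast could look like. It is an assumption about the general pattern, not the design of the upcoming PR: the task types, message shapes, and the functions `stage_worker` and `broadcast_and_wait` are all hypothetical, and threads with `queue.Queue` stand in for the real stage processes and their task queues.

```python
import queue
import threading

# Hypothetical task types; the real enum values are not shown in this thread.
SLEEP, WAKE_UP, SHUTDOWN = "sleep", "wake_up", "shutdown"


def stage_worker(stage_id, in_q, out_q):
    """Hypothetical worker loop: handle lifecycle tasks and acknowledge each one."""
    while True:
        task = in_q.get()
        if task["type"] == SHUTDOWN:
            break
        # A real worker would release or restore GPU memory here.
        out_q.put({"stage_id": stage_id, "ack": task["type"]})


def broadcast_and_wait(in_queues, out_q, task_type, timeout=5.0):
    """Send one task to every stage, then block until each stage has acknowledged."""
    for q in in_queues:
        q.put({"type": task_type})
    acked = set()
    while len(acked) < len(in_queues):
        reply = out_q.get(timeout=timeout)  # raises queue.Empty on timeout
        if reply["ack"] == task_type:
            acked.add(reply["stage_id"])
    return acked


in_queues = [queue.Queue() for _ in range(3)]
out_q = queue.Queue()
threads = [
    threading.Thread(target=stage_worker, args=(i, q, out_q))
    for i, q in enumerate(in_queues)
]
for t in threads:
    t.start()

acked_sleep = broadcast_and_wait(in_queues, out_q, SLEEP)
acked_wake = broadcast_and_wait(in_queues, out_q, WAKE_UP)

for q in in_queues:
    q.put({"type": SHUTDOWN})
for t in threads:
    t.join()
print(sorted(acked_sleep), sorted(acked_wake))
```

Waiting for one acknowledgement per stage means the caller only returns once every stage has processed the command, and the timeout surfaces a stage that never replies instead of hanging.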

5 participants