
[BugFix][CI]: Fix test_omni_sleep_mode ci bug #3010

Merged
hsliuustc0106 merged 1 commit into vllm-project:main from princepride:fix-ci-bug
Apr 22, 2026

Conversation

@princepride (Collaborator) commented Apr 22, 2026


Purpose

Related:
#2936
https://buildkite.com/vllm/vllm-omni/builds/7576/steps/canvas?sid=019db2f8-b300-4acb-a9d9-b75a0ae1a896&tab=output

The error output confirms this: with tp_size=2, the ack list contained exactly 2 entries, one per stage, both with rank: 0. The original assertion len(acks) == 2 * tp_size expected 4 acks, but only 2 were returned. The sleep API aggregates across TP ranks and returns one ack per stage, so the expected count should be independent of tp_size.
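For illustration, here is a minimal sketch of the corrected expectation, assuming the fix simply asserts one ack per requested stage; the field names come from the ack dicts in the failure output below, and the exact diff in this PR may differ:

```python
# Hypothetical sketch, not the literal diff in this PR.
async def check_sleep_acks(engine, stage_ids=(0, 1)):
    # The sleep API aggregates across TP ranks, so each stage yields
    # exactly one ack regardless of tensor_parallel_size.
    acks = await engine.sleep(stage_ids=list(stage_ids), level=2)

    # One ack per stage, not 2 * tp_size.
    assert len(acks) == len(stage_ids)

    for ack in acks:
        # Fields observed in the failure output: error_msg, freed_bytes,
        # metadata, rank.
        assert ack["error_msg"] is None
```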

Test Plan

pytest -v tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]

Test Result

Before

root@job-afbdf05bd2b0de31-7qvjv:/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni# pytest -v tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]
=========================================== test session starts ===========================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0 -- /usr/bin/python3.12
cachedir: .pytest_cache
rootdir: /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni
configfile: pyproject.toml
plugins: forked-1.6.0, timeout-2.4.0, hydra-core-1.3.2, asyncio-1.3.0, rerunfailures-16.1, shard-0.1.2, anyio-4.13.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                                          
Running 1 items in this shard: tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2] FAILED           [100%]

================================================ FAILURES =================================================
______________________________________ test_multistage_sleep_h100[2] ______________________________________

tp_size = 2

    @pytest.mark.advanced_model
    @pytest.mark.omni
    @pytest.mark.parametrize("tp_size", [1, 2])
    @hardware_test(res={"cuda": "H100", "rocm": "MI325"}, num_cards=2)
    @pytest.mark.asyncio
    async def test_multistage_sleep_h100(tp_size):
        num_gpus = torch.cuda.device_count()
        if num_gpus < tp_size * 2:
            pytest.skip("Not enough GPUs")
    
        stages = []
        for i in range(2):
            devs = get_dynamic_devices(i, 2, tp_size)
            stages.append(
                {
                    "stage_id": i,
                    "stage_type": "llm" if i == 0 else "diffusion",
                    "runtime": {"process": True, "devices": devs},
                    "engine_args": {
                        "model": MODEL,
                        "model_stage": "thinker" if i == 0 else "base",
                        "tensor_parallel_size": tp_size,
                        "gpu_memory_utilization": 0.4,
                        "dtype": "bfloat16",
                        "enable_sleep_mode": True,
                        "trust_remote_code": True,
                    },
                }
            )
    
        connectors = [{"src_stage_id": 0, "dst_stage_id": 1, "connector_type": "queue"}]
    
        engine = AsyncOmni(
            model=MODEL, stages=stages, connectors=connectors, enable_sleep_mode=True, stage_init_timeout=1200
        )
        try:
            sp = OmniDiffusionSamplingParams(num_inference_steps=2)
            async for _ in engine.generate("warmup", sampling_params=[SamplingParams(), sp]):
                pass
    
            acks = await engine.sleep(stage_ids=[0, 1], level=2)
>           assert len(acks) == 2 * tp_size
E           AssertionError: assert 2 == (2 * 2)
E            +  where 2 = len([{'error_msg': None, 'freed_bytes': 0, 'metadata': {'rank_residual_gib': '62.08', 'source': 'omni_platform_audit', 'total_freed_gib': '0.00'}, 'rank': 0, ...}, {'error_msg': None, 'freed_bytes': 0, 'metadata': {'rank_residual_gib': '27.52', 'source': 'Platform_NVIDIA H200', 'total_freed_gib': '0.00'}, 'rank': 0, ...}])

tests/e2e/offline_inference/test_omni_sleep_mode.py:126: AssertionError
------------------------------------------ Captured stdout setup ------------------------------------------
INFO 04-22 03:50:44 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-22 03:50:44 [vllm.py:790] Asynchronous scheduling is enabled.

=== PRE-TEST GPU CLEANUP ===

Skipping GPU memory cleanup check (typically: instance already up; no check needed between tests)

--- Running test: test_multistage_sleep_h100[2]
------------------------------------------ Captured stdout call -------------------------------------------
INFO 04-22 03:50:44 [weight_utils.py:50] Using model weights format ['*']
INFO 04-22 03:50:44 [omni_base.py:146] [AsyncOmni] Initializing with model ByteDance-Seed/BAGEL-7B-MoT
INFO 04-22 03:50:44 [async_omni_engine.py:274] [AsyncOmniEngine] Initializing with model ByteDance-Seed/BAGEL-7B-MoT
INFO 04-22 03:50:45 [async_omni_engine.py:331] [AsyncOmniEngine] Launching Orchestrator thread with 2 stages
INFO 04-22 03:50:45 [initialization.py:314] Auto-configuring SharedMemoryConnector for edge ('0', '1')
INFO 04-22 03:50:45 [initialization.py:351] Loaded OmniTransferConfig with 1 connector configurations
INFO 04-22 03:50:45 [async_omni_engine.py:735] [AsyncOmniEngine] Initializing stage 0
INFO 04-22 03:50:45 [stage_init_utils.py:385] [stage_init] Stage-0 set runtime devices: 0
INFO 04-22 03:50:45 [async_omni_engine.py:735] [AsyncOmniEngine] Initializing stage 1
WARNING 04-22 03:50:46 [config.py:347] Config format `mistral` is already registered, and will be overwritten by the new parser class `<class 'vllm_omni.model_executor.models.voxtral_tts.configuration_voxtral_tts.VoxtralTTSConfigParser'>`.
INFO 04-22 03:50:46 [config.py:358] Registered config parser `<class 'vllm_omni.model_executor.models.voxtral_tts.configuration_voxtral_tts.VoxtralTTSConfigParser'>` with config format `mistral`
INFO 04-22 03:50:46 [model.py:549] Resolved architecture: OmniBagelForConditionalGeneration
INFO 04-22 03:50:46 [model.py:1678] Using max model len 32768
INFO 04-22 03:50:46 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=32768.
WARNING 04-22 03:50:46 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
WARNING 04-22 03:50:46 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 04-22 03:50:46 [vllm.py:1025] Cudagraph is disabled under eager mode
WARNING 04-22 03:50:46 [cuda.py:199] Forcing --disable_chunked_mm_input for models with multimodal-bidirectional attention.
INFO 04-22 03:50:46 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
INFO 04-22 03:50:46 [async_omni_engine.py:447] [AsyncOmniEngine] Stage 0 engine launch started
INFO 04-22 03:50:46 [stage_init_utils.py:385] [stage_init] Stage-1 set runtime devices: 0
(StageEngineCoreProc pid=1534834) INFO 04-22 03:50:53 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='ByteDance-Seed/BAGEL-7B-MoT', speculative_config=None, tokenizer='ByteDance-Seed/BAGEL-7B-MoT', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=ByteDance-Seed/BAGEL-7B-MoT, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(StageEngineCoreProc pid=1534834) WARNING 04-22 03:50:53 [multiproc_executor.py:1014] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(StageEngineCoreProc pid=1534834) INFO 04-22 03:50:53 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.68.221 (local), world_size=1, local_world_size=1
INFO 04-22 03:50:56 [multiproc_executor.py:138] Starting server...
(Worker pid=1535327) INFO 04-22 03:51:02 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:55693 backend=nccl
(Worker pid=1535327) INFO 04-22 03:51:02 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
INFO 04-22 03:51:05 [diffusion_worker.py:527] Worker 0 created result MessageQueue
INFO 04-22 03:51:05 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=2048.
INFO 04-22 03:51:05 [vllm.py:790] Asynchronous scheduling is enabled.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-22 03:51:05 [diffusion_worker.py:131] Worker 0: Initialized device and distributed environment.
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-22 03:51:05 [parallel_state.py:588] Building SP subgroups from explicit sp_group_ranks (sp_size=1, ulysses=1, ring=1, use_ulysses_low=True).
INFO 04-22 03:51:05 [parallel_state.py:630] SP group details for rank 0: sp_group=[0], ulysses_group=[0], ring_group=[0]
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
INFO 04-22 03:51:05 [diffusion_worker.py:462] [Worker 0] Activating Diffusion CuMem pool for tag: weights
INFO 04-22 03:51:06 [weight_utils.py:50] Using model weights format ['*']
INFO 04-22 03:51:07 [weight_utils.py:625] No diffusion_pytorch_model.safetensors.index.json found in remote.
(Worker pid=1535327) INFO 04-22 03:51:11 [kv_transfer_manager.py:428] Initializing OmniConnector type=SharedMemoryConnector role=sender
(Worker pid=1535327) INFO 04-22 03:51:11 [factory.py:46] Created connector: SharedMemoryConnector
(Worker pid=1535327) INFO 04-22 03:51:11 [kv_transfer_manager.py:333] Sender connector eagerly initialized
(Worker pid=1535327) INFO 04-22 03:51:11 [base.py:185] [LLM Worker 0] Sleep Mode ENABLED. Activating CuMem pool for tag: weights
(Worker pid=1535327) INFO 04-22 03:51:12 [base.py:185] [LLM Worker 0] Sleep Mode ENABLED. Activating CuMem pool for tag: weights
(Worker pid=1535327) INFO 04-22 03:51:12 [gpu_model_runner.py:4735] Starting to load model ByteDance-Seed/BAGEL-7B-MoT...
(Worker pid=1535327) INFO 04-22 03:51:12 [vllm.py:790] Asynchronous scheduling is enabled.
(Worker pid=1535327) WARNING 04-22 03:51:12 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(Worker pid=1535327) WARNING 04-22 03:51:12 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=1535327) INFO 04-22 03:51:12 [vllm.py:1025] Cudagraph is disabled under eager mode
(Worker pid=1535327) INFO 04-22 03:51:12 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(Worker pid=1535327) INFO 04-22 03:51:12 [cuda.py:334] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=1535327) WARNING 04-22 03:51:12 [bagel.py:391] Overriding vit_config.num_hidden_layers from 27 to 26 to match the Bagel model checkpoint.
(Worker pid=1535327) WARNING 04-22 03:51:12 [bagel.py:397] Setting vit_config.vision_use_head to False as it is not present in the Bagel model checkpoint.
(Worker pid=1535327) INFO 04-22 03:51:12 [cuda.py:390] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker pid=1535327) INFO 04-22 03:51:12 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
INFO 04-22 03:51:13 [pipeline_bagel.py:939] BagelPipeline weight filter kept 1467/1467 tensors (shape mismatches seen: 0)
INFO 04-22 03:51:14 [diffusers_loader.py:324] Loading weights took 6.80 seconds
INFO 04-22 03:51:15 [diffusion_model_runner.py:142] Model loading took 27.4895 GiB and 9.598238 seconds
INFO 04-22 03:51:15 [diffusion_model_runner.py:147] Model runner: Model loaded successfully.
INFO 04-22 03:51:15 [diffusion_model_runner.py:188] Model runner: Initialization complete.
INFO 04-22 03:51:15 [diffusion_worker.py:183] Worker 0: Process-scoped GPU memory after model loading: 28.21 GiB.
INFO 04-22 03:51:15 [manager.py:96] Initializing DiffusionLoRAManager: device=cuda:0, dtype=torch.bfloat16, max_cached_adapters=1, static_lora_path=None
INFO 04-22 03:51:15 [diffusion_worker.py:95] Worker 0: Initialization complete.
INFO 04-22 03:51:15 [diffusion_worker.py:687] Worker 0: Scheduler loop started.
INFO 04-22 03:51:15 [diffusion_worker.py:597] Worker 0 ready to receive requests via shared memory
INFO 04-22 03:51:15 [diffusion_engine.py:446] dummy run to warm up the model
INFO 04-22 03:51:15 [kv_transfer_manager.py:428] Initializing OmniConnector type=SharedMemoryConnector role=receiver
INFO 04-22 03:51:15 [factory.py:46] Created connector: SharedMemoryConnector
INFO 04-22 03:51:15 [kv_transfer_manager.py:1010] Wait for KV cache for request dummy_req_id from stage 0 to 1 via 1 key(s)...
(Worker pid=1535327) INFO 04-22 03:51:20 [default_loader.py:384] Loading weights took 5.98 seconds
(Worker pid=1535327) INFO 04-22 03:51:20 [gpu_model_runner.py:4820] Model loading took 27.37 GiB memory and 7.980278 seconds
(Worker pid=1535327) INFO 04-22 03:51:21 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 32768 tokens, and profiled with 3 img2img items of the maximum feature size.
(Worker pid=1535327) INFO 04-22 03:51:24 [base.py:142] Available KV cache memory: 34.44 GiB (process-scoped)
(StageEngineCoreProc pid=1534834) INFO 04-22 03:51:24 [kv_cache_utils.py:1319] GPU KV cache size: 644,880 tokens
(StageEngineCoreProc pid=1534834) INFO 04-22 03:51:24 [kv_cache_utils.py:1324] Maximum concurrency for 32,768 tokens per request: 19.68x
(StageEngineCoreProc pid=1534834) INFO 04-22 03:51:25 [core.py:283] init engine (profile, create kv cache, warmup model) took 4.63 seconds
(StageEngineCoreProc pid=1534834) WARNING 04-22 03:51:26 [scheduler.py:180] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(StageEngineCoreProc pid=1534834) INFO 04-22 03:51:35 [vllm.py:790] Asynchronous scheduling is enabled.
(StageEngineCoreProc pid=1534834) WARNING 04-22 03:51:35 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(StageEngineCoreProc pid=1534834) WARNING 04-22 03:51:35 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 04-22 03:51:35 [async_omni_engine.py:464] [AsyncOmniEngine] Stage 0 engine startup completed
(StageEngineCoreProc pid=1534834) INFO 04-22 03:51:35 [vllm.py:1025] Cudagraph is disabled under eager mode
(StageEngineCoreProc pid=1534834) INFO 04-22 03:51:35 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
ERROR 04-22 03:51:45 [kv_transfer_manager.py:1111] Timeout waiting for KV cache for request dummy_req_id after 30.0s
INFO 04-22 03:51:46 [diffusion_model_runner.py:213] Peak GPU memory (this request): 28.69 GB reserved, 28.06 GB allocated, 0.64 GB pool overhead (2.2%)
INFO 04-22 03:51:46 [stage_diffusion_proc.py:67] StageDiffusionProc initialized with model: ByteDance-Seed/BAGEL-7B-MoT
INFO 04-22 03:51:46 [stage_diffusion_client.py:144] [StageDiffusionClient] Stage-1 initialized (owns_process=True, batch_size=1)
INFO 04-22 03:51:46 [async_omni_engine.py:791] [AsyncOmniEngine] Stage 1 initialized (diffusion, batch_size=1)
INFO 04-22 03:51:46 [stage_engine_core_client.py:131] [StageEngineCoreClient] Stage-0 initializing EngineCore
INFO 04-22 03:51:46 [stage_engine_core_client.py:171] [StageEngineCoreClient] Stage-0 EngineCore running
INFO 04-22 03:51:56 [async_omni_engine.py:670] [AsyncOmniEngine] Stage 0 initialized
INFO 04-22 03:51:56 [orchestrator.py:187] [Orchestrator] Starting event loop
INFO 04-22 03:51:56 [async_omni_engine.py:358] [AsyncOmniEngine] Orchestrator ready with 2 stages
INFO 04-22 03:51:56 [omni_base.py:159] [AsyncOmni] AsyncOmniEngine initialized in 72.30 seconds
INFO 04-22 03:51:56 [omni_base.py:178] [AsyncOmni] Initialized with 2 stages for model ByteDance-Seed/BAGEL-7B-MoT
WARNING 04-22 03:51:56 [input_processor.py:235] Passing raw prompts to InputProcessor is deprecated and will be removed in v0.18. You should instead pass the outputs of Renderer.render_cmpl() or Renderer.render_chat().
INFO 04-22 03:51:56 [orchestrator.py:859] [Orchestrator] _handle_add_request: stage=0 req= prompt_type=OmniEngineCoreRequest original_prompt_type=str final_stage=1 num_sampling_params=2
INFO 04-22 03:51:56 [stage_engine_core_client.py:227] [StageEngineCoreClient] Stage-0 adding request: 
(Worker pid=1535327) INFO 04-22 03:51:56 [kv_transfer_manager.py:906] KV cache serialized for  in 1.2 ms
(Worker pid=1535327) INFO 04-22 03:51:56 [kv_transfer_manager.py:920] KV transfer OK: , 119072 bytes across 1 key(s), 0.000s, 354.4 MB/s
INFO 04-22 03:51:56 [kv_transfer_manager.py:713] Sender info updated: host=10.244.68.221, base_port=50151, adjusted_port=50151 (local_rank=0)
WARNING 04-22 03:51:56 [kv_transfer_manager.py:1172] Request has no ID, cannot receive KV cache
INFO 04-22 03:52:17 [diffusion_model_runner.py:213] Peak GPU memory (this request): 32.43 GB reserved, 29.79 GB allocated, 2.64 GB pool overhead (8.1%)
INFO 04-22 03:52:17 [diffusion_engine.py:127] Generation completed successfully.
INFO 04-22 03:52:17 [diffusion_engine.py:174] Post-processing completed in 0.0000 seconds
INFO 04-22 03:52:17 [diffusion_engine.py:177] DiffusionEngine.step breakdown: preprocess=0.00 ms, add_req_and_wait=20863.19 ms, postprocess=0.00 ms, total=20863.37 ms
INFO 04-22 03:52:17 [omni_base.py:251] [Summary] {}
INFO 04-22 03:52:17 [async_omni.py:768] [AsyncOrchestrator] Sleep initiated (Task: 89c4b240-0622-499b-8983-6d117a8dc151). Awaiting 2 ACKs...
(Worker pid=1535327) INFO 04-22 03:52:17 [base.py:230] [LLM Worker 0] Handshake Received: Task 89c4b240-0622-499b-8983-6d117a8dc151, Level 2
(Worker pid=1535327) INFO 04-22 03:52:17 [cumem.py:216] CuMemAllocator: sleep freed 61.95 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 61.95 GiB is discarded directly.
(Worker pid=1535327) INFO 04-22 03:52:18 [base.py:208] [LLM Worker 0] Level 2 Sleep: Freed 0.00 GiB. 62.08GiB memory is still in use.
(Worker pid=1535327) INFO 04-22 03:52:18 [base.py:263] [LLM Worker 0] ACK emitted for Task 89c4b240-0622-499b-8983-6d117a8dc151
INFO 04-22 03:52:18 [diffusion_worker.py:367] [Worker 0] Handshake Received: Task 89c4b240-0622-499b-8983-6d117a8dc151
INFO 04-22 03:52:18 [cumem.py:216] CuMemAllocator: sleep freed 27.61 GiB memory in total, of which 0.00 GiB is backed up in CPU and the rest 27.61 GiB is discarded directly.
INFO 04-22 03:52:19 [diffusion_worker.py:325] [Diffusion Worker 0] Sleep Level 2: physically freed 27.61 GiB, 2.20 GiB is still in use.
INFO 04-22 03:52:19 [diffusion_worker.py:332] [Worker 0] Memory usage before sleep: 27.61 GiB.
INFO 04-22 03:52:19 [diffusion_worker.py:375] [Worker 0] Preparing ACK: freed_bytes=0.00 GiB.
INFO 04-22 03:52:19 [diffusion_worker.py:399] [Worker 0] ACK emitted. Freed 0.00 GiB.
INFO 04-22 03:52:19 [async_omni.py:92] [Resolver] Task 89c4b240-0622-499b-8983-6d117a8dc151 progress: 1/2 ACKs received.
INFO 04-22 03:52:19 [async_omni.py:92] [Resolver] Task 89c4b240-0622-499b-8983-6d117a8dc151 progress: 2/2 ACKs received.
INFO 04-22 03:52:19 [async_omni.py:99] [Resolver] Task 89c4b240-0622-499b-8983-6d117a8dc151 completed successfully in 1.85s.
INFO 04-22 03:52:19 [omni_base.py:415] [AsyncOmni] Shutting down
INFO 04-22 03:52:19 [async_omni_engine.py:1723] [AsyncOmniEngine] Shutting down Orchestrator
INFO 04-22 03:52:19 [orchestrator.py:247] [Orchestrator] Received shutdown signal
INFO 04-22 03:52:19 [orchestrator.py:1163] [Orchestrator] Shutting down all stages
(Worker pid=1535327) INFO 04-22 03:52:19 [multiproc_executor.py:764] Parent process exited, terminating worker queues
(Worker pid=1535327) INFO 04-22 03:52:19 [multiproc_executor.py:859] WorkerProc shutting down.
INFO 04-22 03:52:22 [orchestrator.py:1167] [Orchestrator] Stage 0 shut down
INFO 04-22 03:52:22 [diffusion_worker.py:637] Worker 0: Received shutdown message
INFO 04-22 03:52:22 [diffusion_worker.py:658] event loop terminated.
INFO 04-22 03:52:23 [diffusion_worker.py:695] Worker 0: Shutdown complete.
WARNING 04-22 03:52:29 [async_omni_engine.py:1732] [AsyncOmniEngine] Orchestrator thread did not exit in time
------------------------------------------ Captured stderr call -------------------------------------------
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.18.1.dev285+gc1ba86ab5
 --> vLLM version 0.19.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.18.1.dev285+gc1ba86ab5
 --> vLLM version 0.19.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.18.1.dev285+gc1ba86ab5
 --> vLLM version 0.19.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
 --> vLLM-Omni version 0.18.1.dev285+gc1ba86ab5
 --> vLLM version 0.19.0
This will likely cause compatibility issues.
  warn_if_misaligned_vllm_version()
[W422 03:51:02.347730596 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
Multi-thread loading shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Multi-thread loading shards:  50% Completed | 1/2 [00:00<00:00,  1.07it/s]
(Worker pid=1535327) /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
(Worker pid=1535327)   return self.preprocess(images, **kwargs)
Multi-thread loading shards: 100% Completed | 2/2 [00:06<00:00,  3.37s/it]
Multi-thread loading shards: 100% Completed | 2/2 [00:06<00:00,  3.01s/it]

Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00, 77.67it/s]
(Worker pid=1535327) 
(Worker pid=1535327) /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/diffusion/models/bagel/bagel_transformer.py:1067: UserWarning: Using a non-tuple sequence for multidimensional indexing is deprecated and will be changed in pytorch 2.9; use x[tuple(seq)] instead of x[seq]. In pytorch 2.9 this will be interpreted as tensor index, x[torch.tensor(seq)], which will result either in an error or a different result (Triggered internally at /pytorch/torch/csrc/autograd/python_variable_indexing.cpp:347.)
(Worker pid=1535327)   return self.pos_embed[position_ids]
(Worker pid=1535327) 2026-04-22 03:51:24,201 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=1535327) 2026-04-22 03:51:25,879 - INFO - autotuner.py:268 - flashinfer.jit: [Autotuner]: Autotuning process ends
(StageEngineCoreProc pid=1534834) /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
(StageEngineCoreProc pid=1534834)   return self.preprocess(images, **kwargs)
(Worker pid=1535327) /usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py:83: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
(Worker pid=1535327)   return func(*args, **kwargs)
!!!!!!! Segfault encountered !!!!!!!

---------------------------------------- Captured stdout teardown -----------------------------------------

Skipping GPU memory cleanup check (typically: instance already up; no check needed between tests)

============================================ warnings summary =============================================
vllm_omni/version.py:55
  /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
   --> vLLM-Omni version 0.18.1.dev285+gc1ba86ab5
   --> vLLM version 0.19.0
  This will likely cause compatibility issues.
    warn_if_misaligned_vllm_version()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]
  /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
    return self.preprocess(images, **kwargs)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
========================================= short test summary info =========================================
FAILED tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2] - AssertionError: assert 2 == (2 * 2)
=============================== 1 failed, 18 warnings in 108.19s (0:01:48) ================================

After

root@job-afbdf05bd2b0de31-7qvjv:/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni# pytest -v tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]
================================ test session starts ================================
platform linux -- Python 3.12.13, pytest-9.0.3, pluggy-1.6.0 -- /usr/bin/python3.12
cachedir: .pytest_cache
rootdir: /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni
configfile: pyproject.toml
plugins: forked-1.6.0, timeout-2.4.0, hydra-core-1.3.2, asyncio-1.3.0, rerunfailures-16.1, shard-0.1.2, anyio-4.13.0
asyncio: mode=Mode.AUTO, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 1 item                                                                    
Running 1 items in this shard: tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2] PASSED [100%]

================================= warnings summary ==================================
vllm_omni/version.py:55
  /proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/version.py:55: RuntimeWarning: vLLM and vLLM-Omni appear to have mismatched major/minor versions:
   --> vLLM-Omni version 0.18.1.dev285+gc1ba86ab5
   --> vLLM version 0.19.0
  This will likely cause compatibility issues.
    warn_if_misaligned_vllm_version()

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

../../../../../usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: 14 warnings
  /usr/local/lib/python3.12/dist-packages/torch/jit/_script.py:362: DeprecationWarning: `torch.jit.script_method` is deprecated. Please switch to `torch.compile` or `torch.export`.
    warnings.warn(

tests/e2e/offline_inference/test_omni_sleep_mode.py::test_multistage_sleep_h100[2]
  /usr/local/lib/python3.12/dist-packages/transformers/image_processing_utils.py:51: UserWarning: The following named arguments are not valid for `SiglipImageProcessor.preprocess` and were ignored: 'truncation'
    return self.preprocess(images, **kwargs)

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
--- Running Summary
==================== 1 passed, 18 warnings in 129.21s (0:02:09) =====================

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

princepride changed the title from "[CI][]BugFix]: Fix test_omni_sleep_mode ci bug" to "[BugFix][CI]: Fix test_omni_sleep_mode ci bug" on Apr 22, 2026
princepride added the ready label (label to trigger buildkite CI) on Apr 22, 2026
princepride enabled auto-merge (squash) on April 22, 2026 04:01
@princepride (Collaborator, Author) commented

@lishunyang12 @hsliuustc0106 Could one of you help approve this?

Review thread on tests/e2e/offline_inference/test_omni_sleep_mode.py:

            pass

        acks = await engine.sleep(stage_ids=[0, 1], level=2)
        assert len(acks) == 2 * tp_size
Collaborator: Why did this test pass previously?

princepride (Author): I don't know, but it now fails both on my branch and on the latest main branch.

princepride (Author): Maybe it's an L4 test that was only just merged into main.

princepride (Author): Also, my branch has the merge-test and nightly-test suites enabled, so I was able to catch this bug immediately.

hsliuustc0106 added the merge-test label (label to trigger buildkite merge test CI) on Apr 22, 2026
hsliuustc0106 disabled auto-merge on April 22, 2026 04:35
hsliuustc0106 merged commit e1cdd0c into vllm-project:main on Apr 22, 2026
5 of 8 checks passed
qinganrice pushed a commit to qinganrice/vllm-omni that referenced this pull request Apr 23, 2026
Signed-off-by: princepride <wangzhipeng628@gmail.com>