[Bugfix] Remove duplicated config keyword max batch size #1851

Merged

Gaohan123 merged 7 commits into vllm-project:main from tzhouam:dev/remove-max-batch-size on Mar 20, 2026

Conversation

@tzhouam (Collaborator) commented Mar 12, 2026

Purpose

This PR migrates stage concurrency configuration from runtime.max_batch_size to engine_args.max_num_seqs across code, docs, examples, and test configs, as described in #695.

The change standardizes concurrency control under engine_args, aligns stage config semantics with vLLM scheduler behavior, and removes the legacy runtime-to-engine-args mapping.

  • Replaced runtime.max_batch_size with engine_args.max_num_seqs in:
    • stage config docs and quickstart references
    • model/executor stage YAMLs
    • platform-specific stage configs (NPU/ROCm/XPU)
    • E2E/perf test stage configs
    • offline inference example READMEs
  • Updated runtime code paths to read concurrency from engine_args.max_num_seqs:
    • default stage construction in omni.py and async_omni.py
    • stage worker batching path in omni_stage.py
  • Removed legacy conversion logic from load_stage_configs_from_yaml that copied
    runtime.max_batch_size into engine_args.max_num_seqs.
  • Renamed helper parameter in dynamic chunk-size computation from
    max_batch_size to max_num_seqs for consistency.
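
For illustration, the YAML-level change looks roughly like the sketch below; the surrounding structure (stage list, stage_id) is hypothetical, and only the max_batch_size to max_num_seqs move reflects this PR:

# Before (legacy): concurrency configured under the stage's runtime section
- stage_id: 0
  runtime:
    max_batch_size: 4

# After (this PR): concurrency configured as a vLLM engine argument
- stage_id: 0
  engine_args:
    max_num_seqs: 4  # read directly by the vLLM scheduler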

Test Plan

Tested with Qwen3-Omni online serving.

Test Result

/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
INFO 03-12 13:19:00 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 03-12 13:19:00 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 03-12 13:19:00 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 03-12 13:19:00 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 03-12 13:19:00 [logo.py:45] 
(APIServer pid=440639) INFO 03-12 13:19:00 [utils.py:302] vLLM server version 0.17.0, serving model Qwen/Qwen3-Omni-30B-A3B-Instruct
(APIServer pid=440639) INFO 03-12 13:19:00 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen3-Omni-30B-A3B-Instruct', 'port': 8091, 'model': 'Qwen/Qwen3-Omni-30B-A3B-Instruct'}
(APIServer pid=440639) INFO 03-12 13:19:00 [weight_utils.py:50] Using model weights format ['*']
(APIServer pid=440639) INFO 03-12 13:19:00 [omni.py:194] Initializing stages for model: Qwen/Qwen3-Omni-30B-A3B-Instruct
(APIServer pid=440639) INFO 03-12 13:19:00 [omni.py:389] No omni_master_address provided, defaulting to localhost (127.0.0.1)
(APIServer pid=440639) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(APIServer pid=440639) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(APIServer pid=440639) WARNING 03-12 13:19:00 [utils.py:111] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:251] Auto-configuring SharedMemoryConnector for edge ('0', '1')
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:251] Auto-configuring SharedMemoryConnector for edge ('1', '2')
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:270] Loaded OmniTransferConfig with 2 connector configurations
(APIServer pid=440639) INFO 03-12 13:19:00 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [omni.py:426] [AsyncOrchestrator] Loaded 3 stages
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
[Stage-0] INFO 03-12 13:19:07 [omni_stage.py:1289] [Stage-0] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-0] INFO 03-12 13:19:07 [initialization.py:324] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 03-12 13:19:07 [arg_utils.py:640] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-Omni-30B-A3B-Instruct] to model_path [/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695]
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved', 'mrope_interleaved'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved'}
[Stage-1] INFO 03-12 13:19:07 [omni_stage.py:1289] [Stage-1] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-1] INFO 03-12 13:19:07 [initialization.py:324] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 03-12 13:19:07 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 03-12 13:19:07 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 03-12 13:19:07 [arg_utils.py:640] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-Omni-30B-A3B-Instruct] to model_path [/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695]
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved', 'mrope_interleaved'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved'}
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(APIServer pid=440639) INFO 03-12 13:19:16 [omni.py:609] [AsyncOrchestrator] Waiting for 3 stages to initialize (timeout: 600s)
[Stage-2] INFO 03-12 13:19:16 [omni_stage.py:1289] [Stage-2] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-2] INFO 03-12 13:19:16 [initialization.py:324] [Stage-2] Initializing OmniConnectors with config keys: ['from_stage_1']
[Stage-2] INFO 03-12 13:19:16 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 03-12 13:19:16 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
[Stage-0] INFO 03-12 13:19:17 [model.py:531] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-0] INFO 03-12 13:19:17 [model.py:1554] Using max model len 65536
[Stage-0] INFO 03-12 13:19:17 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-0] INFO 03-12 13:19:17 [vllm.py:747] Asynchronous scheduling is enabled.
[Stage-1] INFO 03-12 13:19:17 [model.py:531] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-1] INFO 03-12 13:19:17 [model.py:1554] Using max model len 65536
[Stage-1] INFO 03-12 13:19:17 [model.py:1554] Using max model len 65536
[Stage-1] INFO 03-12 13:19:17 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-1] INFO 03-12 13:19:17 [vllm.py:747] Asynchronous scheduling is enabled.
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:19:31 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=442543) [Stage-1] WARNING 03-12 13:19:31 [multiproc_executor.py:945] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:19:31 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.5 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:19:31 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=442528) [Stage-0] WARNING 03-12 13:19:31 [multiproc_executor.py:945] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:19:31 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.5 (local), world_size=1, local_world_size=1
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(Worker pid=443104) [Stage-1] INFO 03-12 13:19:38 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:38153 backend=nccl
(Worker pid=443104) [Stage-1] INFO 03-12 13:19:38 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=443112) [Stage-0] INFO 03-12 13:19:39 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48707 backend=nccl
(Worker pid=443112) [Stage-0] INFO 03-12 13:19:39 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=443104) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=443112) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=443104) [Stage-1] INFO 03-12 13:19:42 [base.py:106] Offloader set to NoopOffloader
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:42 [gpu_model_runner.py:4255] Starting to load model /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker pid=443112) [Stage-0] INFO 03-12 13:19:42 [base.py:106] Offloader set to NoopOffloader
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:42 [gpu_model_runner.py:4255] Starting to load model /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:42 [vllm.py:747] Asynchronous scheduling is enabled.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:42 [vllm.py:747] Asynchronous scheduling is enabled.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(Worker pid=443104) (Worker pid=443104) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=443104) (Worker pid=443104) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [unquantized.py:186] Using TRITON backend for Unquantized MoE
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
(Worker pid=443112) (Worker pid=443112) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=443112) (Worker pid=443112) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:08,  1.64it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:09,  1.40it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:00<00:04,  2.81it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:00<00:04,  3.19it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:00<00:02,  4.50it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:00<00:02,  4.07it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:01<00:02,  5.29it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:00<00:01,  5.72it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:01<00:01,  6.41it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:01<00:01,  6.06it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:01<00:01,  6.83it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:01<00:01,  7.11it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:01<00:01,  7.40it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:01<00:01,  7.62it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:01<00:00,  7.76it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:01<00:00,  7.59it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:01<00:00,  7.81it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:01<00:00,  7.69it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:01<00:00,  8.22it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:01<00:00,  8.14it/s]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:01<00:00,  8.68it/s]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:01<00:00,  8.61it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:01<00:00,  8.52it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:01<00:00,  8.47it/s]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:02<00:00,  5.65it/s]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:02<00:00,  4.13it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:02<00:00,  3.52it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:02<00:00,  3.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:03<00:00,  3.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:02<00:00,  4.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:03<00:00,  4.98it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:02<00:00,  5.16it/s]

(Worker pid=443112) (Worker pid=443112) 
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:50 [qwen3_omni_moe_talker.py:424] [Model Loaded] name=Qwen3OmniMoeTalkerForConditionalGeneration, success=True, size=8604.35 MB, device=cuda:0
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:50 [qwen3_omni.py:1186] Loaded 1200 weights for Qwen3OmniMoe (stage=talker)
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:51 [default_loader.py:293] Loading weights took 7.69 seconds
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:51 [gpu_model_runner.py:4338] Model loading took 8.5 GiB memory and 8.300574 seconds
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:51 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 62720 tokens, and profiled with 1 video items of the maximum feature size.
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] THIS FUNCTION RETURNS DUMMY MULTIMODAL EMBEDDINGS FOR PROFILE RUN, SHOULD NOT BE CALLED IN INFERENCE.
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:53 [qwen3_omni_moe_code_predictor_mtp.py:351] code_predictor: torch.compile enabled
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:01 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/0ca64d7e43/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:01 [backends.py:976] Dynamo bytecode transform time: 3.08 s
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:02 [backends.py:350] Cache the graph of compile range (1, 32768) for later use
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:20:02 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /app/rebase/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=384,device_name=NVIDIA_H800.json
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:03 [backends.py:366] Compiling a graph for compile range (1, 32768) takes 1.38 s
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:03 [monitor.py:35] torch.compile takes 5.08 s in total
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:03 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/9d9a6c693dbb9103abdcbf3bc6971f25c9b4a26c66877586ee4a61161628b9be/rank_0_0/model
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:04 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/9d9a6c693dbb9103abdcbf3bc6971f25c9b4a26c66877586ee4a61161628b9be/rank_0_0/model
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:05 [base.py:102] Available KV cache memory: 37.61 GiB (profiling fallback)
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:05 [kv_cache_utils.py:1314] GPU KV cache size: 1,971,664 tokens
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:05 [kv_cache_utils.py:1319] Maximum concurrency for 65,536 tokens per request: 30.09x
(Worker pid=443104) (Worker pid=443104) 2026-03-12 13:20:05,019 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=443104) (Worker pid=443104) 2026-03-12 13:20:05,077 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:04<00:00,  4.70it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  9.60it/s]
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:11 [gpu_model_runner.py:5360] Graph capturing finished in 6 secs, took -0.48 GiB
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:11 [core.py:282] init engine (profile, create kv cache, warmup model) took 19.42 seconds
(EngineCore_DP0 pid=442543) [Stage-1] WARNING 03-12 13:20:11 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=442543) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:15 [vllm.py:747] Asynchronous scheduling is enabled.
[Stage-2] INFO 03-12 13:20:15 [arg_utils.py:640] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-Omni-30B-A3B-Instruct] to model_path [/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695]
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved'}
(APIServer pid=440639) INFO 03-12 13:20:16 [omni.py:599] [AsyncOrchestrator] Stage-1 reported ready
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:16 [qwen3_omni.py:1186] Loaded 1183 weights for Qwen3OmniMoe (stage=thinker)
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:17 [default_loader.py:293] Loading weights took 34.23 seconds
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:18 [gpu_model_runner.py:4338] Model loading took 59.54 GiB memory and 34.846597 seconds
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:18 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 62720 tokens, and profiled with 1 video items of the maximum feature size.
[Stage-2] INFO 03-12 13:20:23 [model.py:531] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-2] INFO 03-12 13:20:23 [model.py:1554] Using max model len 65536
[Stage-2] INFO 03-12 13:20:23 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=1000000.
[Stage-2] INFO 03-12 13:20:23 [vllm.py:747] Asynchronous scheduling is disabled.
[Stage-2] WARNING 03-12 13:20:23 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
[Stage-2] WARNING 03-12 13:20:23 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
[Stage-2] INFO 03-12 13:20:23 [vllm.py:957] Cudagraph is disabled under eager mode
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:27 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/b013c44458/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:27 [backends.py:976] Dynamo bytecode transform time: 6.98 s
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:28 [backends.py:350] Cache the graph of compile range (1, 32768) for later use
(Worker pid=443112) (Worker pid=443112) [Stage-0] WARNING 03-12 13:20:28 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /app/rebase/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_H800.json
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:30 [backends.py:366] Compiling a graph for compile range (1, 32768) takes 2.12 s
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:30 [monitor.py:35] torch.compile takes 10.43 s in total
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:30 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/0af8c31dda6c4febc9f63271538997fbfe88f8101ae37f480eefeaaec970591f/rank_0_0/model
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:31 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/0af8c31dda6c4febc9f63271538997fbfe88f8101ae37f480eefeaaec970591f/rank_0_0/model
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:32 [base.py:102] Available KV cache memory: 8.68 GiB (profiling fallback)
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:32 [kv_cache_utils.py:1314] GPU KV cache size: 94,768 tokens
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:32 [kv_cache_utils.py:1319] Maximum concurrency for 65,536 tokens per request: 1.45x
(Worker pid=443112) (Worker pid=443112) 2026-03-12 13:20:32,218 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=443112) (Worker pid=443112) 2026-03-12 13:20:32,323 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:01<00:00, 10.56it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 14.47it/s]
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:35 [gpu_model_runner.py:5360] Graph capturing finished in 3 secs, took -1.64 GiB
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:35 [core.py:282] init engine (profile, create kv cache, warmup model) took 17.36 seconds
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(EngineCore_DP0 pid=442528) [Stage-0] WARNING 03-12 13:20:36 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=442528) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:20:36 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [1000000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:20:36 [multiproc_executor.py:945] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:20:36 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.5 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:39 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=440639) INFO 03-12 13:20:40 [omni.py:599] [AsyncOrchestrator] Stage-0 reported ready
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(Worker pid=444935) [Stage-2] INFO 03-12 13:20:44 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59477 backend=nccl
(Worker pid=444935) [Stage-2] INFO 03-12 13:20:44 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=444935) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [base.py:106] Offloader set to NoopOffloader
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [gpu_model_runner.py:4255] Starting to load model /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [vllm.py:747] Asynchronous scheduling is disabled.
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [vllm.py:957] Cudagraph is disabled under eager mode
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [vllm.py:957] Cudagraph is disabled under eager mode
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:00<00:00, 55.33it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:00<00:00, 55.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 47.46it/s]
(Worker pid=444935) (Worker pid=444935) 
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:50 [qwen3_omni_code2wav.py:273] [Model Loaded] name=Qwen3OmniMoeCode2Wav, success=True, size=412.02 MB, device=cuda:0
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:50 [qwen3_omni.py:1186] Loaded 230 weights for Qwen3OmniMoe (stage=code2wav)
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:50 [default_loader.py:293] Loading weights took 0.87 seconds
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:51 [gpu_model_runner.py:4338] Model loading took 0.41 GiB memory and 1.004897 seconds
(Worker pid=444935) (Worker pid=444935) 2026-03-12 13:20:51,545 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=444935) (Worker pid=444935) 2026-03-12 13:20:58,018 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:58 [gpu_generation_model_runner.py:462] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:20:58 [core.py:282] init engine (profile, create kv cache, warmup model) took 6.84 seconds
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:20:58 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:20:58 [core.py:137] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=444604) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:21:01 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:21:01 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:21:01 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:21:01 [vllm.py:957] Cudagraph is disabled under eager mode
(APIServer pid=440639) INFO 03-12 13:21:02 [omni.py:599] [AsyncOrchestrator] Stage-2 reported ready
(APIServer pid=440639) INFO 03-12 13:21:02 [omni.py:628] [AsyncOrchestrator] All stages initialized successfully
(APIServer pid=440639) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(APIServer pid=440639) INFO 03-12 13:21:09 [async_omni.py:267] [AsyncOrchestrator] Initialized input_processor, io_processor, and model_config from stage-0
(APIServer pid=440639) INFO 03-12 13:21:09 [api_server.py:455] Supported tasks: {'speech', 'generate'}
(APIServer pid=440639) INFO 03-12 13:21:09 [api_server.py:517] Initialized io_processor for AsyncOmni
(APIServer pid=440639) INFO 03-12 13:21:09 [serving.py:185] Warming up chat template processing...
(APIServer pid=440639) INFO 03-12 13:21:10 [hf.py:318] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=440639) INFO 03-12 13:21:10 [serving.py:210] Chat template warmup completed in 739.3ms
(APIServer pid=440639) INFO 03-12 13:21:10 [serving_speech.py:114] Loaded 3 supported speakers: ['aiden', 'chelsie', 'ethan']
(APIServer pid=440639) INFO 03-12 13:21:10 [serving_speech.py:115] Loaded 0 uploaded speakers
(APIServer pid=440639) WARNING 03-12 13:21:10 [serving_speech.py:140] Failed to load codec frame rate from speech tokenizer config: /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695 does not appear to have a file named speech_tokenizer/config.json. Checkout 'https://huggingface.co//models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695/tree/main' for available files.
(APIServer pid=440639) INFO 03-12 13:21:10 [api_server.py:248] Starting vLLM API server 0 on http://0.0.0.0:8091
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:38] Available routes are:
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:58] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=440639) INFO:     Started server process [440639]
(APIServer pid=440639) INFO:     Waiting for application startup.
(APIServer pid=440639) INFO:     Application startup complete.
(APIServer pid=440639) WARNING 03-12 13:22:11 [protocol.py:51] The following fields were present in the request but ignored: {'sampling_params_list', 'modalities'}
(APIServer pid=440639) /app/rebase/vllm-omni/vllm_omni/entrypoints/chat_utils.py:31: UserWarning: PySoundFile failed. Trying audioread instead.
(APIServer pid=440639)   return librosa.load(file_path, sr=16000)
(APIServer pid=440639) /app/rebase/.venv/lib/python3.12/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
(APIServer pid=440639)  Deprecated as of librosa version 0.10.0.
(APIServer pid=440639)  It will be removed in librosa version 1.0.
(APIServer pid=440639)   y, sr_native = __audioread_load(path, offset, duration, dtype)
(APIServer pid=440639) INFO 03-12 13:22:13 [async_omni.py:401] [AsyncOrchestrator] Entering scheduling loop: stages=3, final_stage=2
(Worker pid=443112) (Worker pid=443112) [Stage-0] WARNING 03-12 13:22:13 [gpu_model_runner.py:336] additional_information on request data is deprecated, use model_intermediate_buffer
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:22:16 [gpu_model_runner.py:336] additional_information on request data is deprecated, use model_intermediate_buffer
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:22:16 [mrope.py:345] Multimodal token idx changed!
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:22:17 [gpu_model_runner.py:1363] _merge_additional_information_update is deprecated, use _update_intermediate_buffer
[Stage-0] INFO 03-12 13:22:19 [loggers.py:259] Engine 000: Avg prompt throughput: 371.9 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:22:24 [mrope.py:345] Multimodal token idx changed!
(APIServer pid=440639) WARNING 03-12 13:22:25 [protocol.py:51] The following fields were present in the request but ignored: {'reasoning_content'}
(APIServer pid=440639) INFO:     127.0.0.1:46978 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[Stage-1] INFO 03-12 13:22:25 [loggers.py:259] Engine 000: Avg prompt throughput: 368.5 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[Stage-0] INFO 03-12 13:22:29 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[Stage-2] INFO 03-12 13:22:31 [loggers.py:259] Engine 000: Avg prompt throughput: 694.4 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[Stage-1] INFO 03-12 13

tzhouam added 2 commits March 12, 2026 06:33
…cross multiple files

This commit updates the documentation and configuration files to replace instances of `max_batch_size` with `max_num_seqs`, reflecting a change in how concurrent processing is managed in the system. The updates include changes in stage configurations, user guides, and examples to ensure consistency and clarity in the usage of the new parameter.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@tzhouam tzhouam requested a review from hsliuustc0106 as a code owner March 12, 2026 13:24
@tzhouam tzhouam added the "ready label to trigger buildkite CI" label Mar 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2021d3e276


Comment on lines 282 to 284
if hasattr(stage_arg, "runtime") and stage_arg.runtime is not None and stage_type != "diffusion":
    runtime_cfg = stage_arg.runtime
    max_batch_size = int(runtime_cfg.get("max_batch_size", 1) or 1)
    base_engine_args_tmp["max_num_seqs"] = max_batch_size
base_engine_args_tmp.async_chunk = global_async_chunk
stage_arg.engine_args = base_engine_args_tmp

P1: Keep runtime.max_batch_size compatibility in YAML loader

This change removes the runtime.max_batch_size -> engine_args.max_num_seqs mapping, but there are still shipped stage configs that only set the runtime key (for example tests/perf/stage_configs/qwen3_tts.yaml:12 and benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs4.yaml:10). After this commit those configs load without max_num_seqs, so _stage_worker falls back to batch size 1 and serializes requests, which materially skews perf/benchmark results.
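
To make the regression concrete, here is a minimal sketch of how such a legacy-only config resolves under the new code path (the dict stands in for the parsed YAML; values are illustrative):

# Stage config as parsed from a legacy YAML that only sets the runtime key.
stage_cfg = {
    "runtime": {"max_batch_size": 4},  # legacy key, no longer consumed
    "engine_args": {},                 # max_num_seqs was never set
}

# The new worker path reads concurrency from engine_args only:
max_batch_size = int(stage_cfg["engine_args"].get("max_num_seqs", 1))
assert max_batch_size == 1  # falls back to 1, so requests serialize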


Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
pass

-max_batch_size = int(runtime_cfg.get("max_batch_size", 1) or 1)
+max_batch_size = int(engine_args.get("max_num_seqs", 1))

P2: Read max_num_seqs before filtering diffusion args

For diffusion stages, engine_args is filtered by OmniDiffusionConfig before this line runs, and that dataclass has no max_num_seqs field, so the value is dropped and this always resolves to the default of 1. As a result, any diffusion stage config that sets engine_args.max_num_seqs cannot actually increase batching anymore.
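
A sketch of the ordering fix this comment implies: read max_num_seqs from the raw engine_args before the diffusion filter drops unknown fields (filter_diffusion_args is a hypothetical stand-in for the OmniDiffusionConfig filtering step):

def resolve_stage_batch_size(engine_args: dict, stage_type: str, filter_diffusion_args):
    # Snapshot the concurrency limit before any filtering runs.
    max_batch_size = int(engine_args.get("max_num_seqs", 1))
    if stage_type == "diffusion":
        # OmniDiffusionConfig has no max_num_seqs field, so filtering
        # first would silently drop the user's setting.
        engine_args = filter_diffusion_args(engine_args)
    return max_batch_size, engine_args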


Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Do we use this to control max_concurrency?

Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Summary

Comprehensive migration from runtime.max_batch_size to engine_args.max_num_seqs across the codebase. This standardizes concurrency control under engine_args and aligns stage config semantics with vLLM scheduler behavior.
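
In config terms the migration looks roughly like this (an illustrative stage entry shown as the parsed dict; field names other than runtime, engine_args, max_batch_size, and max_num_seqs are made up):

# Before: concurrency lived in the stage's runtime section.
legacy_stage = {
    "stage_id": 0,
    "runtime": {"max_batch_size": 4},
}

# After: concurrency is a vLLM engine argument on the same stage.
migrated_stage = {
    "stage_id": 0,
    "engine_args": {"max_num_seqs": 4},
}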

Validated

  • ✅ DCO signed
  • ✅ All CI checks passed (build, pre-commit, buildkite, docs)
  • ✅ Runtime code paths updated in omni.py, async_omni.py, omni_stage.py
  • ✅ Legacy conversion logic removed from load_stage_configs_from_yaml
  • ✅ All docs and examples updated consistently
  • ✅ Test result shows Qwen3-Omni online serving works

Scope

51 files touched covering:

  • Stage config docs and quickstart
  • All model/executor stage YAMLs
  • Platform-specific configs (NPU/ROCm/XPU)
  • E2E/perf test configs
  • Offline inference example READMEs

Clean refactor with clear motivation (issue #695).

@tzhouam
Collaborator Author

tzhouam commented Mar 13, 2026

Do we use this to control max_concurrency?

Yes. As discussed before, vLLM uses max_num_seqs to control the maximum concurrency, so the max_batch_size we defined ourselves is duplicated and misleading.
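
For reference, max_num_seqs is the vLLM engine argument that caps how many sequences the scheduler runs concurrently in a single step; a minimal usage sketch (model reused from the test plan above; the value 8 is illustrative):

from vllm import LLM

# max_num_seqs bounds the number of sequences scheduled per engine step,
# which is the concurrency knob the stage configs now expose directly.
llm = LLM(model="Qwen/Qwen3-Omni-30B-A3B-Instruct", max_num_seqs=8)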

@tzhouam tzhouam changed the title Dev/remove max batch size [Debug] Remove duplicated config keyword max batch size Mar 13, 2026
Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
pass

-max_batch_size = int(runtime_cfg.get("max_batch_size", 1) or 1)
+max_batch_size = int(engine_args.get("max_num_seqs", 1))
Collaborator


Should we consider backward compatibility for max_batch_size?

Collaborator


Same question: if users have already put this into production, how do we inform them of this YAML arg change?

Collaborator Author


Added both a deprecation warning in the runtime code and a note in the docs.
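
A minimal sketch of what such a compatibility warning could look like in the YAML loader (helper name and message wording are illustrative, not the merged implementation):

import warnings

def migrate_legacy_max_batch_size(stage_cfg: dict) -> dict:
    runtime_cfg = stage_cfg.get("runtime") or {}
    engine_args = stage_cfg.setdefault("engine_args", {})
    if "max_batch_size" in runtime_cfg:
        warnings.warn(
            "runtime.max_batch_size is deprecated; set "
            "engine_args.max_num_seqs instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicit engine_args.max_num_seqs takes precedence.
        engine_args.setdefault("max_num_seqs", int(runtime_cfg["max_batch_size"]))
    return stage_cfg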

Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 16, 2026
@linyueqian
Collaborator

Please resolve the conflicts.

tzhouam added 3 commits March 20, 2026 02:06
…ross multiple YAML files and update related documentation. This change preserves backward compatibility and prepares for the future deprecation of max_batch_size. Tests were added to verify the migration and handling of the new parameter.
…tageConfigFactory.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@Gaohan123 Gaohan123 changed the title [Debug] Remove duplicated config keyword max batch size [Bugfix] Remove duplicated config keyword max batch size Mar 20, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment


LGTM. Thanks

@Gaohan123 Gaohan123 merged commit 05d8de8 into vllm-project:main Mar 20, 2026
8 checks passed
yJader added a commit to omni-nicelab/vllm-omni-batching that referenced this pull request Mar 24, 2026
…g_reqs and add max_num_seqs to config

aligned with changes introduced in vllm-project#1851

Signed-off-by: jader <yjader@foxmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…t#1851)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
