[Bugfix] Remove duplicated config keyword max batch size #1851

Merged

Gaohan123 merged 7 commits into vllm-project:main from tzhouam:dev/remove-max-batch-size on Mar 20, 2026

Conversation

@tzhouam (Collaborator) commented Mar 12, 2026

Purpose

This PR migrates stage concurrency configuration from runtime.max_batch_size to engine_args.max_num_seqs across code, docs, examples, and test configs, as described in #695.

The change standardizes concurrency control under engine_args, aligns stage config semantics with vLLM scheduler behavior, and removes the legacy runtime-to-engine-args mapping.

  • Replaced runtime.max_batch_size with engine_args.max_num_seqs in:
    • stage config docs and quickstart references
    • model/executor stage YAMLs
    • platform-specific stage configs (NPU/ROCm/XPU)
    • E2E/perf test stage configs
    • offline inference example READMEs
  • Updated runtime code paths to read concurrency from engine_args.max_num_seqs:
    • default stage construction in omni.py and async_omni.py
    • stage worker batching path in omni_stage.py
  • Removed legacy conversion logic from load_stage_configs_from_yaml that copied
    runtime.max_batch_size into engine_args.max_num_seqs.
  • Renamed helper parameter in dynamic chunk-size computation from
    max_batch_size to max_num_seqs for consistency.
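
For illustration, the YAML-level change looks roughly like the sketch below; the surrounding structure (stage list, stage_id) is hypothetical, and only the max_batch_size to max_num_seqs move reflects this PR:

# Before (legacy): concurrency configured under the stage's runtime section
- stage_id: 0
  runtime:
    max_batch_size: 4

# After (this PR): concurrency configured as a vLLM engine argument
- stage_id: 0
  engine_args:
    max_num_seqs: 4  # read directly by the vLLM scheduler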

Test Plan

Tested with Qwen3-Omni online serving.

Test Result

/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
INFO 03-12 13:19:00 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 03-12 13:19:00 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 03-12 13:19:00 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 03-12 13:19:00 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 03-12 13:19:00 [logo.py:45] 
(APIServer pid=440639) INFO 03-12 13:19:00 [utils.py:302] vLLM server version 0.17.0, serving model Qwen/Qwen3-Omni-30B-A3B-Instruct
(APIServer pid=440639) INFO 03-12 13:19:00 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen3-Omni-30B-A3B-Instruct', 'port': 8091, 'model': 'Qwen/Qwen3-Omni-30B-A3B-Instruct'}
(APIServer pid=440639) INFO 03-12 13:19:00 [weight_utils.py:50] Using model weights format ['*']
(APIServer pid=440639) INFO 03-12 13:19:00 [omni.py:194] Initializing stages for model: Qwen/Qwen3-Omni-30B-A3B-Instruct
(APIServer pid=440639) INFO 03-12 13:19:00 [omni.py:389] No omni_master_address provided, defaulting to localhost (127.0.0.1)
(APIServer pid=440639) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(APIServer pid=440639) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_interleaved', 'interleaved', 'mrope_section'}
(APIServer pid=440639) Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'interleaved', 'mrope_section'}
(APIServer pid=440639) WARNING 03-12 13:19:00 [utils.py:111] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:251] Auto-configuring SharedMemoryConnector for edge ('0', '1')
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:251] Auto-configuring SharedMemoryConnector for edge ('1', '2')
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:270] Loaded OmniTransferConfig with 2 connector configurations
(APIServer pid=440639) INFO 03-12 13:19:00 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
(APIServer pid=440639) INFO 03-12 13:19:00 [omni.py:426] [AsyncOrchestrator] Loaded 3 stages
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
[Stage-0] INFO 03-12 13:19:07 [omni_stage.py:1289] [Stage-0] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-0] INFO 03-12 13:19:07 [initialization.py:324] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 03-12 13:19:07 [arg_utils.py:640] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-Omni-30B-A3B-Instruct] to model_path [/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695]
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved', 'mrope_interleaved'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved'}
[Stage-1] INFO 03-12 13:19:07 [omni_stage.py:1289] [Stage-1] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-1] INFO 03-12 13:19:07 [initialization.py:324] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 03-12 13:19:07 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 03-12 13:19:07 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 03-12 13:19:07 [arg_utils.py:640] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-Omni-30B-A3B-Instruct] to model_path [/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695]
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved', 'mrope_interleaved'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved'}
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(APIServer pid=440639) INFO 03-12 13:19:16 [omni.py:609] [AsyncOrchestrator] Waiting for 3 stages to initialize (timeout: 600s)
[Stage-2] INFO 03-12 13:19:16 [omni_stage.py:1289] [Stage-2] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-2] INFO 03-12 13:19:16 [initialization.py:324] [Stage-2] Initializing OmniConnectors with config keys: ['from_stage_1']
[Stage-2] INFO 03-12 13:19:16 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-2] INFO 03-12 13:19:16 [initialization.py:60] Created connector for 1 -> 2: SharedMemoryConnector
[Stage-0] INFO 03-12 13:19:17 [model.py:531] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-0] INFO 03-12 13:19:17 [model.py:1554] Using max model len 65536
[Stage-0] INFO 03-12 13:19:17 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-0] INFO 03-12 13:19:17 [vllm.py:747] Asynchronous scheduling is enabled.
[Stage-1] INFO 03-12 13:19:17 [model.py:531] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-1] INFO 03-12 13:19:17 [model.py:1554] Using max model len 65536
[Stage-1] INFO 03-12 13:19:17 [model.py:1554] Using max model len 65536
[Stage-1] INFO 03-12 13:19:17 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=32768.
[Stage-1] INFO 03-12 13:19:17 [vllm.py:747] Asynchronous scheduling is enabled.
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:19:31 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=442543) [Stage-1] WARNING 03-12 13:19:31 [multiproc_executor.py:945] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:19:31 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.5 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:19:31 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [32768], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 128, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=442528) [Stage-0] WARNING 03-12 13:19:31 [multiproc_executor.py:945] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:19:31 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.5 (local), world_size=1, local_world_size=1
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(Worker pid=443104) [Stage-1] INFO 03-12 13:19:38 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:38153 backend=nccl
(Worker pid=443104) [Stage-1] INFO 03-12 13:19:38 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=443112) [Stage-0] INFO 03-12 13:19:39 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48707 backend=nccl
(Worker pid=443112) [Stage-0] INFO 03-12 13:19:39 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=443104) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=443112) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=443104) [Stage-1] INFO 03-12 13:19:42 [base.py:106] Offloader set to NoopOffloader
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:42 [gpu_model_runner.py:4255] Starting to load model /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker pid=443112) [Stage-0] INFO 03-12 13:19:42 [base.py:106] Offloader set to NoopOffloader
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:42 [gpu_model_runner.py:4255] Starting to load model /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:42 [vllm.py:747] Asynchronous scheduling is enabled.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:42 [vllm.py:747] Asynchronous scheduling is enabled.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(Worker pid=443104) (Worker pid=443104) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=443104) (Worker pid=443104) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:43 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:19:43 [unquantized.py:186] Using TRITON backend for Unquantized MoE
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
(Worker pid=443112) (Worker pid=443112) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=443112) (Worker pid=443112) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:08,  1.64it/s]
Loading safetensors checkpoint shards:   7% Completed | 1/15 [00:00<00:09,  1.40it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:00<00:04,  2.81it/s]
Loading safetensors checkpoint shards:  13% Completed | 2/15 [00:00<00:04,  3.19it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:00<00:02,  4.50it/s]
Loading safetensors checkpoint shards:  20% Completed | 3/15 [00:00<00:02,  4.07it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:01<00:02,  5.29it/s]
Loading safetensors checkpoint shards:  27% Completed | 4/15 [00:00<00:01,  5.72it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:01<00:01,  6.41it/s]
Loading safetensors checkpoint shards:  33% Completed | 5/15 [00:01<00:01,  6.06it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:01<00:01,  6.83it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:01<00:01,  7.11it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:01<00:01,  7.40it/s]
Loading safetensors checkpoint shards:  47% Completed | 7/15 [00:01<00:01,  7.62it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:01<00:00,  7.76it/s]
Loading safetensors checkpoint shards:  53% Completed | 8/15 [00:01<00:00,  7.59it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:01<00:00,  7.81it/s]
Loading safetensors checkpoint shards:  60% Completed | 9/15 [00:01<00:00,  7.69it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:01<00:00,  8.22it/s]
Loading safetensors checkpoint shards:  67% Completed | 10/15 [00:01<00:00,  8.14it/s]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:01<00:00,  8.68it/s]
Loading safetensors checkpoint shards:  73% Completed | 11/15 [00:01<00:00,  8.61it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:01<00:00,  8.52it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:01<00:00,  8.47it/s]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:02<00:00,  5.65it/s]
Loading safetensors checkpoint shards:  87% Completed | 13/15 [00:02<00:00,  4.13it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:02<00:00,  3.52it/s]
Loading safetensors checkpoint shards:  93% Completed | 14/15 [00:02<00:00,  3.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:03<00:00,  3.93it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:02<00:00,  4.16it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:03<00:00,  4.98it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:02<00:00,  5.16it/s]

(Worker pid=443112) (Worker pid=443112) 
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:50 [qwen3_omni_moe_talker.py:424] [Model Loaded] name=Qwen3OmniMoeTalkerForConditionalGeneration, success=True, size=8604.35 MB, device=cuda:0
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:50 [qwen3_omni.py:1186] Loaded 1200 weights for Qwen3OmniMoe (stage=talker)
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:51 [default_loader.py:293] Loading weights took 7.69 seconds
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:51 [gpu_model_runner.py:4338] Model loading took 8.5 GiB memory and 8.300574 seconds
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:51 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 62720 tokens, and profiled with 1 video items of the maximum feature size.
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] THIS FUNCTION RETURNS DUMMY MULTIMODAL EMBEDDINGS FOR PROFILE RUN, SHOULD NOT BE CALLED IN INFERENCE.
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:19:53 [qwen3_omni_moe_talker.py:336] 
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:19:53 [qwen3_omni_moe_code_predictor_mtp.py:351] code_predictor: torch.compile enabled
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:01 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/0ca64d7e43/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:01 [backends.py:976] Dynamo bytecode transform time: 3.08 s
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:02 [backends.py:350] Cache the graph of compile range (1, 32768) for later use
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:20:02 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /app/rebase/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=384,device_name=NVIDIA_H800.json
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:03 [backends.py:366] Compiling a graph for compile range (1, 32768) takes 1.38 s
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:03 [monitor.py:35] torch.compile takes 5.08 s in total
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:03 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/9d9a6c693dbb9103abdcbf3bc6971f25c9b4a26c66877586ee4a61161628b9be/rank_0_0/model
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:04 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/9d9a6c693dbb9103abdcbf3bc6971f25c9b4a26c66877586ee4a61161628b9be/rank_0_0/model
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:05 [base.py:102] Available KV cache memory: 37.61 GiB (profiling fallback)
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:05 [kv_cache_utils.py:1314] GPU KV cache size: 1,971,664 tokens
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:05 [kv_cache_utils.py:1319] Maximum concurrency for 65,536 tokens per request: 30.09x
(Worker pid=443104) (Worker pid=443104) 2026-03-12 13:20:05,019 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=443104) (Worker pid=443104) 2026-03-12 13:20:05,077 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:04<00:00,  4.70it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:01<00:00,  9.60it/s]
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:20:11 [gpu_model_runner.py:5360] Graph capturing finished in 6 secs, took -0.48 GiB
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:11 [core.py:282] init engine (profile, create kv cache, warmup model) took 19.42 seconds
(EngineCore_DP0 pid=442543) [Stage-1] WARNING 03-12 13:20:11 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=442543) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=442543) [Stage-1] INFO 03-12 13:20:15 [vllm.py:747] Asynchronous scheduling is enabled.
[Stage-2] INFO 03-12 13:20:15 [arg_utils.py:640] HF_HUB_OFFLINE is True, replace model_id [Qwen/Qwen3-Omni-30B-A3B-Instruct] to model_path [/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695]
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'mrope_interleaved', 'interleaved'}
Unrecognized keys in `rope_scaling` for 'rope_type'='default': {'mrope_section', 'interleaved'}
(APIServer pid=440639) INFO 03-12 13:20:16 [omni.py:599] [AsyncOrchestrator] Stage-1 reported ready
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:16 [qwen3_omni.py:1186] Loaded 1183 weights for Qwen3OmniMoe (stage=thinker)
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:17 [default_loader.py:293] Loading weights took 34.23 seconds
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:18 [gpu_model_runner.py:4338] Model loading took 59.54 GiB memory and 34.846597 seconds
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:18 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 62720 tokens, and profiled with 1 video items of the maximum feature size.
[Stage-2] INFO 03-12 13:20:23 [model.py:531] Resolved architecture: Qwen3OmniMoeForConditionalGeneration
[Stage-2] INFO 03-12 13:20:23 [model.py:1554] Using max model len 65536
[Stage-2] INFO 03-12 13:20:23 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=1000000.
[Stage-2] INFO 03-12 13:20:23 [vllm.py:747] Asynchronous scheduling is disabled.
[Stage-2] WARNING 03-12 13:20:23 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
[Stage-2] WARNING 03-12 13:20:23 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
[Stage-2] INFO 03-12 13:20:23 [vllm.py:957] Cudagraph is disabled under eager mode
The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:27 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/b013c44458/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:27 [backends.py:976] Dynamo bytecode transform time: 6.98 s
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:28 [backends.py:350] Cache the graph of compile range (1, 32768) for later use
(Worker pid=443112) (Worker pid=443112) [Stage-0] WARNING 03-12 13:20:28 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /app/rebase/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=768,device_name=NVIDIA_H800.json
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:30 [backends.py:366] Compiling a graph for compile range (1, 32768) takes 2.12 s
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:30 [monitor.py:35] torch.compile takes 10.43 s in total
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:30 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/0af8c31dda6c4febc9f63271538997fbfe88f8101ae37f480eefeaaec970591f/rank_0_0/model
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:31 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/0af8c31dda6c4febc9f63271538997fbfe88f8101ae37f480eefeaaec970591f/rank_0_0/model
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:32 [base.py:102] Available KV cache memory: 8.68 GiB (profiling fallback)
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:32 [kv_cache_utils.py:1314] GPU KV cache size: 94,768 tokens
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:32 [kv_cache_utils.py:1319] Maximum concurrency for 65,536 tokens per request: 1.45x
(Worker pid=443112) (Worker pid=443112) 2026-03-12 13:20:32,218 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=443112) (Worker pid=443112) 2026-03-12 13:20:32,323 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 19/19 [00:01<00:00, 10.56it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 11/11 [00:00<00:00, 14.47it/s]
(Worker pid=443112) (Worker pid=443112) [Stage-0] INFO 03-12 13:20:35 [gpu_model_runner.py:5360] Graph capturing finished in 3 secs, took -1.64 GiB
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:35 [core.py:282] init engine (profile, create kv cache, warmup model) took 17.36 seconds
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(EngineCore_DP0 pid=442528) [Stage-0] WARNING 03-12 13:20:36 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=442528) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:20:36 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', speculative_config=None, tokenizer='/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=65536, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [1000000], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:20:36 [multiproc_executor.py:945] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:20:36 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.5 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=442528) [Stage-0] INFO 03-12 13:20:39 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=440639) INFO 03-12 13:20:40 [omni.py:599] [AsyncOrchestrator] Stage-0 reported ready
/app/rebase/.venv/lib/python3.12/site-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
(Worker pid=444935) [Stage-2] INFO 03-12 13:20:44 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:59477 backend=nccl
(Worker pid=444935) [Stage-2] INFO 03-12 13:20:44 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(Worker pid=444935) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [base.py:106] Offloader set to NoopOffloader
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [gpu_model_runner.py:4255] Starting to load model /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695...
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [vllm.py:747] Asynchronous scheduling is disabled.
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [vllm.py:957] Cudagraph is disabled under eager mode
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:49 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:49 [vllm.py:957] Cudagraph is disabled under eager mode
Loading safetensors checkpoint shards:   0% Completed | 0/15 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  40% Completed | 6/15 [00:00<00:00, 55.33it/s]
Loading safetensors checkpoint shards:  80% Completed | 12/15 [00:00<00:00, 55.46it/s]
Loading safetensors checkpoint shards: 100% Completed | 15/15 [00:00<00:00, 47.46it/s]
(Worker pid=444935) (Worker pid=444935) 
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:50 [qwen3_omni_code2wav.py:273] [Model Loaded] name=Qwen3OmniMoeCode2Wav, success=True, size=412.02 MB, device=cuda:0
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:50 [qwen3_omni.py:1186] Loaded 230 weights for Qwen3OmniMoe (stage=code2wav)
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:50 [default_loader.py:293] Loading weights took 0.87 seconds
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:20:51 [gpu_model_runner.py:4338] Model loading took 0.41 GiB memory and 1.004897 seconds
(Worker pid=444935) (Worker pid=444935) 2026-03-12 13:20:51,545 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=444935) (Worker pid=444935) 2026-03-12 13:20:58,018 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker pid=444935) (Worker pid=444935) [Stage-2] WARNING 03-12 13:20:58 [gpu_generation_model_runner.py:462] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:20:58 [core.py:282] init engine (profile, create kv cache, warmup model) took 6.84 seconds
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:20:58 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:20:58 [core.py:137] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=444604) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:21:01 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:21:01 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=444604) [Stage-2] WARNING 03-12 13:21:01 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=444604) [Stage-2] INFO 03-12 13:21:01 [vllm.py:957] Cudagraph is disabled under eager mode
(APIServer pid=440639) INFO 03-12 13:21:02 [omni.py:599] [AsyncOrchestrator] Stage-2 reported ready
(APIServer pid=440639) INFO 03-12 13:21:02 [omni.py:628] [AsyncOrchestrator] All stages initialized successfully
(APIServer pid=440639) The image processor of type `Qwen2VLImageProcessor` is now loaded as a fast processor by default, even if the model checkpoint was saved with a slow processor. This is a breaking change and may produce slightly different outputs. To continue using the slow processor, instantiate this class with `use_fast=False`. Note that this behavior will be extended to all models in a future release.
(APIServer pid=440639) INFO 03-12 13:21:09 [async_omni.py:267] [AsyncOrchestrator] Initialized input_processor, io_processor, and model_config from stage-0
(APIServer pid=440639) INFO 03-12 13:21:09 [api_server.py:455] Supported tasks: {'speech', 'generate'}
(APIServer pid=440639) INFO 03-12 13:21:09 [api_server.py:517] Initialized io_processor for AsyncOmni
(APIServer pid=440639) INFO 03-12 13:21:09 [serving.py:185] Warming up chat template processing...
(APIServer pid=440639) INFO 03-12 13:21:10 [hf.py:318] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=440639) INFO 03-12 13:21:10 [serving.py:210] Chat template warmup completed in 739.3ms
(APIServer pid=440639) INFO 03-12 13:21:10 [serving_speech.py:114] Loaded 3 supported speakers: ['aiden', 'chelsie', 'ethan']
(APIServer pid=440639) INFO 03-12 13:21:10 [serving_speech.py:115] Loaded 0 uploaded speakers
(APIServer pid=440639) WARNING 03-12 13:21:10 [serving_speech.py:140] Failed to load codec frame rate from speech tokenizer config: /models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695 does not appear to have a file named speech_tokenizer/config.json. Checkout 'https://huggingface.co//models/hub/models--Qwen--Qwen3-Omni-30B-A3B-Instruct/snapshots/26291f793822fb6be9555850f06dfe95f2d7e695/tree/main' for available files.
(APIServer pid=440639) INFO 03-12 13:21:10 [api_server.py:248] Starting vLLM API server 0 on http://0.0.0.0:8091
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:38] Available routes are:
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=440639) INFO 03-12 13:21:10 [launcher.py:58] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=440639) INFO:     Started server process [440639]
(APIServer pid=440639) INFO:     Waiting for application startup.
(APIServer pid=440639) INFO:     Application startup complete.
(APIServer pid=440639) WARNING 03-12 13:22:11 [protocol.py:51] The following fields were present in the request but ignored: {'sampling_params_list', 'modalities'}
(APIServer pid=440639) /app/rebase/vllm-omni/vllm_omni/entrypoints/chat_utils.py:31: UserWarning: PySoundFile failed. Trying audioread instead.
(APIServer pid=440639)   return librosa.load(file_path, sr=16000)
(APIServer pid=440639) /app/rebase/.venv/lib/python3.12/site-packages/librosa/core/audio.py:184: FutureWarning: librosa.core.audio.__audioread_load
(APIServer pid=440639)  Deprecated as of librosa version 0.10.0.
(APIServer pid=440639)  It will be removed in librosa version 1.0.
(APIServer pid=440639)   y, sr_native = __audioread_load(path, offset, duration, dtype)
(APIServer pid=440639) INFO 03-12 13:22:13 [async_omni.py:401] [AsyncOrchestrator] Entering scheduling loop: stages=3, final_stage=2
(Worker pid=443112) (Worker pid=443112) [Stage-0] WARNING 03-12 13:22:13 [gpu_model_runner.py:336] additional_information on request data is deprecated, use model_intermediate_buffer
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:22:16 [gpu_model_runner.py:336] additional_information on request data is deprecated, use model_intermediate_buffer
(Worker pid=443104) (Worker pid=443104) [Stage-1] INFO 03-12 13:22:16 [mrope.py:345] Multimodal token idx changed!
(Worker pid=443104) (Worker pid=443104) [Stage-1] WARNING 03-12 13:22:17 [gpu_model_runner.py:1363] _merge_additional_information_update is deprecated, use _update_intermediate_buffer
[Stage-0] INFO 03-12 13:22:19 [loggers.py:259] Engine 000: Avg prompt throughput: 371.9 tokens/s, Avg generation throughput: 14.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(Worker pid=444935) (Worker pid=444935) [Stage-2] INFO 03-12 13:22:24 [mrope.py:345] Multimodal token idx changed!
(APIServer pid=440639) WARNING 03-12 13:22:25 [protocol.py:51] The following fields were present in the request but ignored: {'reasoning_content'}
(APIServer pid=440639) INFO:     127.0.0.1:46978 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[Stage-1] INFO 03-12 13:22:25 [loggers.py:259] Engine 000: Avg prompt throughput: 368.5 tokens/s, Avg generation throughput: 43.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[Stage-0] INFO 03-12 13:22:29 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[Stage-2] INFO 03-12 13:22:31 [loggers.py:259] Engine 000: Avg prompt throughput: 694.4 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
[Stage-1] INFO 03-12 13

tzhouam added 2 commits March 12, 2026 06:33
…cross multiple files

This commit updates the documentation and configuration files to replace instances of `max_batch_size` with `max_num_seqs`, reflecting a change in how concurrent processing is managed in the system. The updates include changes in stage configurations, user guides, and examples to ensure consistency and clarity in the usage of the new parameter.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@tzhouam tzhouam requested a review from hsliuustc0106 as a code owner March 12, 2026 13:24
@tzhouam tzhouam added the "ready label to trigger buildkite CI" label Mar 12, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2021d3e276


Comment on lines 282 to 284
if hasattr(stage_arg, "runtime") and stage_arg.runtime is not None and stage_type != "diffusion":
    runtime_cfg = stage_arg.runtime
    max_batch_size = int(runtime_cfg.get("max_batch_size", 1) or 1)
    base_engine_args_tmp["max_num_seqs"] = max_batch_size
base_engine_args_tmp.async_chunk = global_async_chunk
stage_arg.engine_args = base_engine_args_tmp

P1: Keep runtime.max_batch_size compatibility in YAML loader

This change removes the runtime.max_batch_size -> engine_args.max_num_seqs mapping, but there are still shipped stage configs that only set the runtime key (for example tests/perf/stage_configs/qwen3_tts.yaml:12 and benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs4.yaml:10). After this commit those configs load without max_num_seqs, so _stage_worker falls back to batch size 1 and serializes requests, which materially skews perf/benchmark results.
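
To make the regression concrete, here is a minimal sketch of how such a legacy-only config resolves under the new code path (the dict stands in for the parsed YAML; values are illustrative):

# Stage config as parsed from a legacy YAML that only sets the runtime key.
stage_cfg = {
    "runtime": {"max_batch_size": 4},  # legacy key, no longer consumed
    "engine_args": {},                 # max_num_seqs was never set
}

# The new worker path reads concurrency from engine_args only:
max_batch_size = int(stage_cfg["engine_args"].get("max_num_seqs", 1))
assert max_batch_size == 1  # falls back to 1, so requests serialize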


Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
pass

-max_batch_size = int(runtime_cfg.get("max_batch_size", 1) or 1)
+max_batch_size = int(engine_args.get("max_num_seqs", 1))

P2: Read max_num_seqs before filtering diffusion args

For diffusion stages, engine_args is filtered by OmniDiffusionConfig before this line runs, and that dataclass has no max_num_seqs field, so the value is dropped and this always resolves to the default of 1. As a result, any diffusion stage config that sets engine_args.max_num_seqs cannot actually increase batching anymore.
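
A sketch of the ordering fix this comment implies: read max_num_seqs from the raw engine_args before the diffusion filter drops unknown fields (filter_diffusion_args is a hypothetical stand-in for the OmniDiffusionConfig filtering step):

def resolve_stage_batch_size(engine_args: dict, stage_type: str, filter_diffusion_args):
    # Snapshot the concurrency limit before any filtering runs.
    max_batch_size = int(engine_args.get("max_num_seqs", 1))
    if stage_type == "diffusion":
        # OmniDiffusionConfig has no max_num_seqs field, so filtering
        # first would silently drop the user's setting.
        engine_args = filter_diffusion_args(engine_args)
    return max_batch_size, engine_args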


Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Do we use this to control max_concurrency?

Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Summary

Comprehensive migration from runtime.max_batch_size to engine_args.max_num_seqs across the codebase. This standardizes concurrency control under engine_args and aligns stage config semantics with vLLM scheduler behavior.
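
In config terms the migration looks roughly like this (an illustrative stage entry shown as the parsed dict; field names other than runtime, engine_args, max_batch_size, and max_num_seqs are made up):

# Before: concurrency lived in the stage's runtime section.
legacy_stage = {
    "stage_id": 0,
    "runtime": {"max_batch_size": 4},
}

# After: concurrency is a vLLM engine argument on the same stage.
migrated_stage = {
    "stage_id": 0,
    "engine_args": {"max_num_seqs": 4},
}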

Validated

  • ✅ DCO signed
  • ✅ All CI checks passed (build, pre-commit, buildkite, docs)
  • ✅ Runtime code paths updated in omni.py, async_omni.py, omni_stage.py
  • ✅ Legacy conversion logic removed from load_stage_configs_from_yaml
  • ✅ All docs and examples updated consistently
  • ✅ Test result shows Qwen3-Omni online serving works

Scope

51 files touched covering:

  • Stage config docs and quickstart
  • All model/executor stage YAMLs
  • Platform-specific configs (NPU/ROCm/XPU)
  • E2E/perf test configs
  • Offline inference example READMEs

Clean refactor with clear motivation (issue #695).

@tzhouam
Collaborator Author

tzhouam commented Mar 13, 2026

Do we use this to control max_concurrency?

Yes. As discussed before, vLLM uses max_num_seqs to control the maximum concurrency, so the max_batch_size we defined ourselves is duplicated and misleading.
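
For reference, max_num_seqs is the vLLM engine argument that caps how many sequences the scheduler runs concurrently in a single step; a minimal usage sketch (model reused from the test plan above; the value 8 is illustrative):

from vllm import LLM

# max_num_seqs bounds the number of sequences scheduled per engine step,
# which is the concurrency knob the stage configs now expose directly.
llm = LLM(model="Qwen/Qwen3-Omni-30B-A3B-Instruct", max_num_seqs=8)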

@tzhouam tzhouam changed the title Dev/remove max batch size [Debug] Remove duplicated config keyword max batch size Mar 13, 2026
Comment thread vllm_omni/entrypoints/omni_stage.py Outdated
pass

-max_batch_size = int(runtime_cfg.get("max_batch_size", 1) or 1)
+max_batch_size = int(engine_args.get("max_num_seqs", 1))
Collaborator


Should we consider backward compatibility for max_batch_size?

Collaborator


Same question: if users have already put this into production, how do we inform them of this YAML arg change?

Collaborator Author


Added both a deprecation warning in the runtime code and a note in the docs.
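
A minimal sketch of what such a compatibility warning could look like in the YAML loader (helper name and message wording are illustrative, not the merged implementation):

import warnings

def migrate_legacy_max_batch_size(stage_cfg: dict) -> dict:
    runtime_cfg = stage_cfg.get("runtime") or {}
    engine_args = stage_cfg.setdefault("engine_args", {})
    if "max_batch_size" in runtime_cfg:
        warnings.warn(
            "runtime.max_batch_size is deprecated; set "
            "engine_args.max_num_seqs instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        # An explicit engine_args.max_num_seqs takes precedence.
        engine_args.setdefault("max_num_seqs", int(runtime_cfg["max_batch_size"]))
    return stage_cfg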

Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 16, 2026
@linyueqian
Collaborator

Please resolve the conflicts.

tzhouam added 3 commits March 20, 2026 02:06
…ross multiple YAML files and update related documentation. This change preserves backward compatibility and prepares for the future deprecation of max_batch_size. Tests were added to verify the migration and handling of the new parameter.
…tageConfigFactory.

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
@Gaohan123 Gaohan123 changed the title [Debug] Remove duplicated config keyword max batch size [Bugfix] Remove duplicated config keyword max batch size Mar 20, 2026
Collaborator

@Gaohan123 Gaohan123 left a comment


LGTM. Thanks

@Gaohan123 Gaohan123 merged commit 05d8de8 into vllm-project:main Mar 20, 2026
8 checks passed
yJader added a commit to omni-nicelab/vllm-omni-batching that referenced this pull request Mar 24, 2026
…g_reqs and add max_num_seqs to config

aligned with changes introduced in vllm-project#1851

Signed-off-by: jader <yjader@foxmail.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…t#1851)

Signed-off-by: tzhouam <tzhouam@connect.ust.hk>
Signed-off-by: Zhou Taichang <tzhouam@connect.ust.hk>
