
[Feature]: Remove some useless hf_overrides in yaml #1898

Merged
princepride merged 2 commits into vllm-project:main from princepride:remove-useless-hf_overrides
Mar 18, 2026

Conversation

@princepride
Collaborator

Purpose

I have cleaned up redundant hf_overrides from several Qwen3-TTS and Fish Speech configuration files.
The architectures override is redundant because vllm_omni/config/model.py implements a property that automatically uses model_arch if it is set:

    @property
    def architectures(self) -> list[str]:
        if self.model_arch is not None:
            return [self.model_arch]
        return super().architectures

And vllm_omni/engine/arg_utils.py ensures this property is written back to the HF config structure:

        omni_config.hf_config.architectures = omni_config.architectures
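To make the resolution order concrete, here is a minimal, runnable sketch of this pattern. FakeHFConfig, FakeOmniModelConfig, and the simplified create_model_config are stand-ins for the real classes; only the property fallback and the write-back mirror the snippets above:

    # Minimal sketch of the model_arch -> architectures -> hf_config flow.
    # FakeHFConfig / FakeOmniModelConfig are stand-ins, not the real classes.
    from dataclasses import dataclass, field

    @dataclass
    class FakeHFConfig:
        # What config.json would normally provide.
        architectures: list[str] = field(
            default_factory=lambda: ["Qwen3TTSForConditionalGeneration"]
        )

    @dataclass
    class FakeOmniModelConfig:
        hf_config: FakeHFConfig
        model_arch: str | None = None

        @property
        def architectures(self) -> list[str]:
            # Same fallback logic as OmniModelConfig.architectures.
            if self.model_arch is not None:
                return [self.model_arch]
            return self.hf_config.architectures

    def create_model_config(model_arch: str | None) -> FakeOmniModelConfig:
        cfg = FakeOmniModelConfig(hf_config=FakeHFConfig(), model_arch=model_arch)
        # The write-back performed in OmniEngineArgs.create_model_config().
        cfg.hf_config.architectures = cfg.architectures
        return cfg

    # With model_arch set in the YAML, no hf_overrides entry is needed:
    assert create_model_config("Qwen3TTSCode2Wav").hf_config.architectures == [
        "Qwen3TTSCode2Wav"
    ]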

Test Plan

vllm-omni serve Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice \
    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts.yaml \
    --omni --port 8091
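Once the server reports ready, a quick smoke test can be run against the /v1/audio/speech route shown in the logs below. This is only a sketch: the payload fields ("model", "input", "voice") are illustrative assumptions, not a documented schema:

    # Hypothetical smoke test against the server started above. The route
    # exists per the logs, but the payload schema here is an assumption.
    import requests

    resp = requests.post(
        "http://localhost:8091/v1/audio/speech",
        json={
            "model": "Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice",
            "input": "Hello from vllm-omni!",
            "voice": "vivian",  # one of the speakers listed in the logs
        },
        timeout=120,
    )
    resp.raise_for_status()
    with open("out.wav", "wb") as f:
        f.write(resp.content)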

Test Result

INFO 03-14 15:40:26 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 03-14 15:40:26 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 03-14 15:40:26 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 03-14 15:40:26 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 03-14 15:40:26 [logo.py:45] 
(APIServer pid=2620562) INFO 03-14 15:40:26 [utils.py:302] vLLM server version 0.17.0, serving model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
(APIServer pid=2620562) INFO 03-14 15:40:26 [utils.py:238] non-default args: {'model_tag': 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', 'port': 8091, 'model': 'Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice'}
(APIServer pid=2620562) INFO 03-14 15:40:26 [weight_utils.py:50] Using model weights format ['*']
(APIServer pid=2620562) INFO 03-14 15:40:26 [omni.py:195] Initializing stages for model: Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice
(APIServer pid=2620562) INFO 03-14 15:40:26 [omni.py:322] No omni_master_address provided, defaulting to localhost (127.0.0.1)
(APIServer pid=2620562) WARNING 03-14 15:40:26 [utils.py:111] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=2620562) INFO 03-14 15:40:26 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=2620562) INFO 03-14 15:40:26 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=2620562) INFO 03-14 15:40:26 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=2620562) INFO 03-14 15:40:26 [omni.py:356] [AsyncOrchestrator] Loaded 2 stages
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/__init__.py:32: RuntimeWarning: Failed to import version from _version.py: No module named 'vllm_omni._version'
This typically happens in development mode before building.
Using fallback version 'dev'.
  from .version import __version__, __version_tuple__  # isort:skip
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/__init__.py:32: RuntimeWarning: Failed to import version from _version.py: No module named 'vllm_omni._version'
This typically happens in development mode before building.
Using fallback version 'dev'.
  from .version import __version__, __version_tuple__  # isort:skip
(APIServer pid=2620562) INFO 03-14 15:40:35 [omni.py:539] [AsyncOrchestrator] Waiting for 2 stages to initialize (timeout: 600s)
[Stage-0] INFO 03-14 15:40:35 [omni_stage.py:1292] [Stage-0] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-0] INFO 03-14 15:40:35 [initialization.py:324] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-1] INFO 03-14 15:40:35 [omni_stage.py:1292] [Stage-1] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-1] INFO 03-14 15:40:35 [initialization.py:324] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 03-14 15:40:35 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 03-14 15:40:35 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-1] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-1] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-1] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-0] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-14 15:40:35 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-1] INFO 03-14 15:40:45 [model.py:531] Resolved architecture: Qwen3TTSCode2Wav
[Stage-0] INFO 03-14 15:40:45 [model.py:531] Resolved architecture: Qwen3TTSTalkerForConditionalGeneration
[Stage-1] INFO 03-14 15:40:46 [model.py:1554] Using max model len 32768
[Stage-1] INFO 03-14 15:40:46 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
[Stage-1] INFO 03-14 15:40:46 [vllm.py:747] Asynchronous scheduling is enabled.
[Stage-1] WARNING 03-14 15:40:46 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
[Stage-1] WARNING 03-14 15:40:46 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
[Stage-1] INFO 03-14 15:40:46 [vllm.py:957] Cudagraph is disabled under eager mode
[Stage-0] INFO 03-14 15:40:46 [model.py:1554] Using max model len 4096
[Stage-0] INFO 03-14 15:40:46 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=512.
[Stage-0] INFO 03-14 15:40:46 [vllm.py:747] Asynchronous scheduling is enabled.
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/__init__.py:32: RuntimeWarning: Failed to import version from _version.py: No module named 'vllm_omni._version'
This typically happens in development mode before building.
Using fallback version 'dev'.
  from .version import __version__, __version_tuple__  # isort:skip
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/__init__.py:32: RuntimeWarning: Failed to import version from _version.py: No module named 'vllm_omni._version'
This typically happens in development mode before building.
Using fallback version 'dev'.
  from .version import __version__, __version_tuple__  # isort:skip
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:40:55 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=2621717) [Stage-0] WARNING 03-14 15:40:55 [multiproc_executor.py:945] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:40:55 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.186.186 (local), world_size=1, local_world_size=1
(EngineCore_DP0 pid=2621711) [Stage-1] INFO 03-14 15:40:55 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=2621711) [Stage-1] WARNING 03-14 15:40:55 [multiproc_executor.py:945] Reducing Torch parallelism from 96 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore_DP0 pid=2621711) [Stage-1] INFO 03-14 15:40:55 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.244.186.186 (local), world_size=1, local_world_size=1
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/usr/local/lib/python3.12/dist-packages/transformers/models/auto/image_processing_auto.py:647: FutureWarning: The image_processor_class argument is deprecated and will be removed in v4.42. Please use `slow_image_processor_class`, or `fast_image_processor_class` instead
  warnings.warn(
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/__init__.py:32: RuntimeWarning: Failed to import version from _version.py: No module named 'vllm_omni._version'
This typically happens in development mode before building.
Using fallback version 'dev'.
  from .version import __version__, __version_tuple__  # isort:skip
/proj-tango-pvc/users/zhipeng.wang/workspace/vllm-omni/vllm_omni/__init__.py:32: RuntimeWarning: Failed to import version from _version.py: No module named 'vllm_omni._version'
This typically happens in development mode before building.
Using fallback version 'dev'.
  from .version import __version__, __version_tuple__  # isort:skip
(Worker pid=2622225) [Stage-0] INFO 03-14 15:41:03 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:48261 backend=nccl
(Worker pid=2622228) [Stage-1] INFO 03-14 15:41:03 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:42993 backend=nccl
[W314 15:41:03.639374508 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
[W314 15:41:03.639582392 socket.cpp:207] [c10d] The hostname of the client socket cannot be retrieved. err=-3
(Worker pid=2622228) [Stage-1] INFO 03-14 15:41:03 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=2622225) [Stage-0] INFO 03-14 15:41:03 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(Worker pid=2622225) [Stage-0] INFO 03-14 15:41:03 [base.py:106] Offloader set to NoopOffloader
(Worker pid=2622228) [Stage-1] INFO 03-14 15:41:03 [base.py:106] Offloader set to NoopOffloader
/bin/sh: 1: sox: not found
(Worker pid=2622228) [2026-03-14 15:41:03] WARNING __init__.py:10: SoX could not be found!
(Worker pid=2622228) 
(Worker pid=2622228)     If you do not have SoX, proceed here:
(Worker pid=2622228)      - - - http://sox.sourceforge.net/ - - -
(Worker pid=2622228) 
(Worker pid=2622228)     If you do (or think that you should) have SoX, double-check your
(Worker pid=2622228)     path variables.
(Worker pid=2622228)     
/bin/sh: 1: sox: not found
(Worker pid=2622225) [2026-03-14 15:41:03] WARNING __init__.py:10: SoX could not be found!
(Worker pid=2622225) 
(Worker pid=2622225)     If you do not have SoX, proceed here:
(Worker pid=2622225)      - - - http://sox.sourceforge.net/ - - -
(Worker pid=2622225) 
(Worker pid=2622225)     If you do (or think that you should) have SoX, double-check your
(Worker pid=2622225)     path variables.
(Worker pid=2622225)     
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:03 [gpu_model_runner.py:4255] Starting to load model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:03 [gpu_model_runner.py:4255] Starting to load model Qwen/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:04 [default_loader.py:293] Loading weights took 1603542.37 seconds
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:04 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:04 [flash_attn.py:587] Using FlashAttention version 3
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:04 [gpu_model_runner.py:4338] Model loading took 0.0 GiB memory and 0.001065 seconds
(Worker pid=2622225) (Worker pid=2622225) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(Worker pid=2622225) (Worker pid=2622225) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:04 [vllm.py:747] Asynchronous scheduling is enabled.
(Worker pid=2622228) (Worker pid=2622228) 2026-03-14 15:41:04,947 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:05 [weight_utils.py:601] No model.safetensors.index.json found in remote.
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker pid=2622228) (Worker pid=2622228) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:05 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:05 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:05 [cuda_graph_decoder_wrapper.py:75] Starting CUDA Graph warmup for 12 sizes: [2, 4, 8, 16, 25, 32, 50, 100, 150, 200, 250, 300]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.59it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  1.59it/s]
(Worker pid=2622225) (Worker pid=2622225) 
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:05 [qwen3_tts_talker.py:1557] Loaded 305 weights for Qwen3TTSTalkerForConditionalGeneration
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:05 [default_loader.py:293] Loading weights took 0.69 seconds
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:06 [gpu_model_runner.py:4338] Model loading took 3.62 GiB memory and 1.688287 seconds
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=2
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=4
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=8
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=16
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:06 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:06 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:06 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:06 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:06 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=25
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=32
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=50
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=100
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=150
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:06 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=200
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:07 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=250
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:07 [cuda_graph_decoder_wrapper.py:94]   Captured CUDA Graph for size=300
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:07 [cuda_graph_decoder_wrapper.py:99] CUDA Graph warmup complete. Captured 12 graphs.
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:07 [modeling_qwen3_tts_tokenizer_v2.py:900] CUDA Graph enabled for decoder with sizes: [2, 4, 8, 16, 25, 32, 50, 100, 150, 200, 250, 300]
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:07 [qwen3_tts_code2wav.py:131] Code2Wav decoder CUDA Graph enabled
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] INFO 03-14 15:41:07 [qwen3_tts_code2wav.py:255] Code2Wav codec: frames=512 q=16 uniq=1 range=[0,0] batch=1
(Worker pid=2622228) (Worker pid=2622228) 2026-03-14 15:41:07,098 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] WARNING 03-14 15:41:07 [qwen3_tts_code2wav.py:226] Code2Wav input_ids length 1 not divisible by num_quantizers 16, likely a warmup run; returning empty audio.
(Worker pid=2622228) (Worker pid=2622228) [Stage-1] WARNING 03-14 15:41:07 [gpu_generation_model_runner.py:462] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=2621711) [Stage-1] INFO 03-14 15:41:07 [core.py:282] init engine (profile, create kv cache, warmup model) took 2.44 seconds
(EngineCore_DP0 pid=2621711) [Stage-1] WARNING 03-14 15:41:08 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=2621711) [Stage-1] WARNING 03-14 15:41:08 [core.py:137] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=2621711) [Stage-1] INFO 03-14 15:41:08 [factory.py:46] Created connector: SharedMemoryConnector
(EngineCore_DP0 pid=2621711) [Stage-1] INFO 03-14 15:41:08 [vllm.py:747] Asynchronous scheduling is enabled.
(EngineCore_DP0 pid=2621711) [Stage-1] WARNING 03-14 15:41:08 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=2621711) [Stage-1] WARNING 03-14 15:41:08 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=2621711) [Stage-1] INFO 03-14 15:41:08 [vllm.py:957] Cudagraph is disabled under eager mode
(APIServer pid=2620562) INFO 03-14 15:41:08 [omni.py:529] [AsyncOrchestrator] Stage-1 reported ready
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/96f3b50373/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:11 [backends.py:976] Dynamo bytecode transform time: 4.88 s
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:16 [backends.py:350] Cache the graph of compile range (1, 512) for later use
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:20 [backends.py:366] Compiling a graph for compile range (1, 512) takes 7.62 s
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:20 [monitor.py:35] torch.compile takes 13.65 s in total
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:20 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/ab74f1923f4b3cfdfa4e294c36e55d3d738443ba792987b5d34faf5054963aa7/rank_0_0/model
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:21 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/ab74f1923f4b3cfdfa4e294c36e55d3d738443ba792987b5d34faf5054963aa7/rank_0_0/model
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:25 [base.py:81] Available KV cache memory: 37.5 GiB (process-scoped)
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:41:25 [kv_cache_utils.py:1314] GPU KV cache size: 351,104 tokens
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:41:25 [kv_cache_utils.py:1319] Maximum concurrency for 4,096 tokens per request: 85.72x
(Worker pid=2622225) (Worker pid=2622225) 2026-03-14 15:41:25,514 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(Worker pid=2622225) (Worker pid=2622225) 2026-03-14 15:41:25,531 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████| 5/5 [00:00<00:00, 25.65it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████| 4/4 [00:00<00:00, 34.90it/s]
(Worker pid=2622225) (Worker pid=2622225) [Stage-0] INFO 03-14 15:41:26 [gpu_model_runner.py:5360] Graph capturing finished in 1 secs, took 0.07 GiB
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:41:26 [core.py:282] init engine (profile, create kv cache, warmup model) took 20.08 seconds
(EngineCore_DP0 pid=2621717) [Stage-0] WARNING 03-14 15:41:27 [scheduler.py:173] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:41:27 [factory.py:46] Created connector: SharedMemoryConnector
(EngineCore_DP0 pid=2621717) [Stage-0] INFO 03-14 15:41:27 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=2620562) INFO 03-14 15:41:28 [omni.py:529] [AsyncOrchestrator] Stage-0 reported ready
(APIServer pid=2620562) INFO 03-14 15:41:28 [omni.py:558] [AsyncOrchestrator] All stages initialized successfully
(APIServer pid=2620562) INFO 03-14 15:41:29 [async_omni.py:271] [AsyncOrchestrator] Initialized input_processor, io_processor, and model_config from stage-0
(APIServer pid=2620562) WARNING 03-14 15:41:29 [api_server.py:488] vllm_config is None, some features may not work correctly
(APIServer pid=2620562) INFO 03-14 15:41:29 [api_server.py:496] Supported tasks: {'speech'}
(APIServer pid=2620562) WARNING 03-14 15:41:29 [api_server.py:567] Cannot initialize processors: vllm_config is None. OpenAIServingModels may fail.
(APIServer pid=2620562) INFO 03-14 15:41:29 [serving_speech.py:160] Loaded 9 supported speakers: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']
(APIServer pid=2620562) INFO 03-14 15:41:29 [serving_speech.py:161] Loaded 0 uploaded speakers
(APIServer pid=2620562) INFO 03-14 15:41:29 [serving_speech.py:180] Loaded codec frame rate: 12.5 Hz (output_sample_rate=24000, encode_downsample_rate=1920)
(APIServer pid=2620562) INFO 03-14 15:41:29 [api_server.py:284] Starting vLLM API server (pure diffusion mode) on http://0.0.0.0:8091
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:38] Available routes are:
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/audio/voices, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/audio/voices/{name}, Methods: DELETE
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/videos, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/videos/{video_id}, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/videos/{video_id}, Methods: DELETE
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:47] Route: /v1/videos/{video_id}/content, Methods: GET
(APIServer pid=2620562) INFO 03-14 15:41:29 [launcher.py:58] Route: /v1/audio/speech/stream, Endpoint: streaming_speech
(APIServer pid=2620562) INFO:     Started server process [2620562]
(APIServer pid=2620562) INFO:     Waiting for application startup.
(APIServer pid=2620562) INFO:     Application startup complete.

Signed-off-by: princepride <wangzhipeng628@gmail.com>
@princepride
Collaborator Author

@tzhouam PTAL


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 58fa85029b


@@ -43,9 +40,6 @@ stage_args:
engine_args:
model_stage: code2wav
model_arch: Qwen3TTSCode2Wav


P1: Restore architecture override for the Code2Wav stage

Removing hf_overrides.architectures from the Qwen3-TTS Code2Wav stage disables the config path that strips rope_parameters for this non-LLM decoder. Qwen3TTSConfig.get_text_config() only performs that strip when architectures contains Code2Wav (vllm_omni/model_executor/models/qwen3_tts/configuration_qwen3_tts.py), but OmniEngineArgs.create_model_config() sets hf_config.architectures only after OmniModelConfig initialization (vllm_omni/engine/arg_utils.py), while OmniModelConfig.__post_init__ already computes hf_text_config (vllm_omni/config/model.py). As a result, Code2Wav can run with mRoPE still enabled, which regresses stage-1 runtime behavior and latency for all qwen3_tts configs touched here.


Collaborator Author


The statement in this review is inaccurate. All changes can be kept, and there is no need to revert any of them. Below is a detailed analysis:

The Reviewer's Argument

The reviewer states that removing hf_overrides.architectures will cause Qwen3TTSConfig.get_text_config() to fail to strip rope_parameters, thereby causing Code2Wav to mistakenly enable mRoPE. They cited three points in the execution timeline:

  1. get_text_config() only strips rope_parameters when architectures includes "Code2Wav".
  2. create_model_config() sets hf_config.architectures after the initialization of OmniModelConfig.
  3. OmniModelConfig.__post_init__ has already computed hf_text_config during initialization.

Why This Argument is Invalid

Key Point 1: uses_mrope is a lazy property, not cached during __post_init__

    @property
    def uses_mrope(self) -> bool:
        return uses_mrope(self.hf_config)

This is a standard @property (not a @cached_property), meaning it is re-evaluated every time it is accessed.
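A standalone illustration of why this distinction matters: a plain @property re-reads its inputs on every access, whereas functools.cached_property would freeze the value computed on first access:

    # Demo: @property re-evaluates on each access; cached_property does not.
    from functools import cached_property

    class Config:
        def __init__(self):
            self.architectures = ["Original"]

        @property
        def live_arch(self) -> list[str]:
            return list(self.architectures)

        @cached_property
        def frozen_arch(self) -> list[str]:
            return list(self.architectures)

    cfg = Config()
    _ = cfg.frozen_arch                        # caches ["Original"]
    cfg.architectures = ["Qwen3TTSCode2Wav"]   # the later write-back
    assert cfg.live_arch == ["Qwen3TTSCode2Wav"]  # @property sees the update
    assert cfg.frozen_arch == ["Original"]        # cached value is stale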

Key Point 2: The uses_mrope() function calls config.get_text_config() every time

def _uses_mrope(config: PretrainedConfig) -> bool:
    rope_parameters = getattr(config, "rope_parameters", None)
    if rope_parameters is None:
        return False
    return "mrope_section" in rope_parameters

def uses_mrope(config: PretrainedConfig) -> bool:
    """Detect if the model with this config uses M-ROPE."""
    return (
        _uses_mrope(config)
        or _uses_mrope(config.get_text_config())
        or thinker_uses_mrope(config)
    )

Key Point 3: hf_config.architectures is correctly set before the model runner accesses uses_mrope

In create_model_config() within arg_utils.py, at line 209:

        omni_config.hf_config.architectures = omni_config.architectures

And the OmniModelConfig.architectures property returns [self.model_arch]:

    @property
    def architectures(self) -> list[str]:
        if self.model_arch is not None:
            return [self.model_arch]
        return super().architectures

Since model_arch: Qwen3TTSCode2Wav is preserved in the YAML, omni_config.architectures evaluates to ["Qwen3TTSCode2Wav"]. This is assigned to hf_config.architectures before create_model_config() returns.

Complete Timeline Analysis

  1. During __post_init__: hf_config.architectures holds its original value (from config.json), which does not include "Code2Wav". When hf_text_config is computed, rope_parameters is not yet stripped.
  2. Before create_model_config() returns (Line 209): hf_config.architectures = ["Qwen3TTSCode2Wav"] is set.
  3. During Model Runner initialization: model_config.uses_mrope is accessed → calls uses_mrope(hf_config) → calls hf_config.get_text_config(). By this point, self.architectures already contains "Code2Wav" → rope_parameters is correctly stripped → uses_mrope returns False.
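
The three steps above can be replayed with a toy model. ToyConfig, get_text_config, and uses_mrope here are simplified stand-ins for the real vLLM code, reproducing only the ordering argument:

    # Toy replay of the timeline. rope_parameters survives step 1, but by
    # the time uses_mrope() is evaluated (step 3), the write-back (step 2)
    # has happened and get_text_config() strips it.
    class ToyConfig:
        def __init__(self):
            # Step 1: state during __post_init__ (original config.json values).
            self.architectures = ["Qwen3TTSForConditionalGeneration"]
            self.rope_parameters = {"mrope_section": [16, 24, 24]}

        def get_text_config(self):
            # Mirrors the conditional strip in Qwen3TTSConfig.get_text_config().
            if any("Code2Wav" in arch for arch in self.architectures):
                self.rope_parameters = None
            return self

    def uses_mrope(config: ToyConfig) -> bool:
        text = config.get_text_config()
        return text.rope_parameters is not None and "mrope_section" in text.rope_parameters

    cfg = ToyConfig()
    cfg.architectures = ["Qwen3TTSCode2Wav"]  # Step 2: the write-back
    assert uses_mrope(cfg) is False           # Step 3: lazy evaluation, mRoPE off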

The Impact of rope_parameters During __post_init__

The only place where hf_text_config.rope_parameters is used during __post_init__ is in _get_and_verify_max_len (to calculate max_model_len). However, the Code2Wav YAML explicitly specifies max_model_len: 32768, and the max_position_embeddings in the talker config is also 32768, so this validation will not fail.

The Case of FishSpeech

FishSpeechS2ProConfig.get_text_config() does not have a similar stripping logic for rope_parameters:

    def get_text_config(self, **kwargs) -> FishSpeechSlowARConfig:
        return self.text_config

Therefore, the removal of hf_overrides for FishSpeech has absolutely no impact.

Conclusion

All changes can be kept, and there is no need to revert any of them. While the timing issue pointed out by the reviewer does indeed exist during the __post_init__ phase (architectures has not been set at that point), it will not cause Code2Wav to mistakenly use mRoPE at runtime. This is because uses_mrope is lazily evaluated. When it is actually used by the model runner, hf_config.architectures has already been correctly set to ["Qwen3TTSCode2Wav"].

Collaborator Author


I used Claude-4.6-Opus-high to review the review.

@princepride added the ready label to trigger buildkite CI on Mar 15, 2026
Collaborator

@tzhouam left a comment


LGTM, will merge after the frontend rebase is finished.

@princepride merged commit 61e170c into vllm-project:main on Mar 18, 2026
7 checks passed
yiliu30 pushed a commit to yiliu30/vllm-omni-fork that referenced this pull request Mar 20, 2026

Signed-off-by: princepride <wangzhipeng628@gmail.com>

Signed-off-by: yiliu30 <yi4.liu@intel.com>
hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 22, 2026
### vllm-omni-audio-tts
- Source: [PR #2059](vllm-project/vllm-omni#2059) - [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool
- Changes:
  - Bug fix: [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool

### vllm-omni-perf
- Source: [PR #2059](vllm-project/vllm-omni#2059) - [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool
- Changes:
  - Bug fix: [BugFix][Qwen3TTS] CodePredictor CudaGraph Pool

### vllm-omni-api
- Source: [PR #2058](vllm-project/vllm-omni#2058) - [Bugfix] Fix Fish Speech and CosyVoice3 online serving - missing is_comprehension and broken model detection
- Changes:
  - Bug fix: [Bugfix] Fix Fish Speech and CosyVoice3 online serving - missing is_comprehension and broken model detection

### vllm-omni-contrib
- Source: [PR #2045](vllm-project/vllm-omni#2045) - [Voxtral] Improve example

### vllm-omni-cicd
- Source: [PR #2045](vllm-project/vllm-omni#2045) - [Voxtral] Improve example

### vllm-omni-api
- Source: [PR #2042](vllm-project/vllm-omni#2042) - [bugfix] /chat/completion doesn't read extra_body for diffusion model
- Changes:
  - Bug fix: [bugfix] /chat/completion doesn't read extra_body for diffusion model

### vllm-omni-perf
- Source: [PR #2042](vllm-project/vllm-omni#2042) - [bugfix] /chat/completion doesn't read extra_body for diffusion model
- Changes:
  - Bug fix: [bugfix] /chat/completion doesn't read extra_body for diffusion model

### vllm-omni-contrib
- Source: [PR #2038](vllm-project/vllm-omni#2038) - [Doc] Update docs and dockerfiles for rebase of vllm v0.18.0

### vllm-omni-serving
- Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0

### vllm-omni-contrib
- Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0

### vllm-omni-api
- Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0

### vllm-omni-cicd
- Source: [PR #2037](vllm-project/vllm-omni#2037) - [Rebase] Rebase to vllm v0.18.0

### vllm-omni-cicd
- Source: [PR #2032](vllm-project/vllm-omni#2032) - [CI] Change Bagel online test environment variable `VLLM_TEST_CLEAN_GPU_MEMORY` to `0`

### vllm-omni-cicd
- Source: [PR #2031](vllm-project/vllm-omni#2031) - [CI] Fix test.
- Changes:
  - Bug fix: [CI] Fix test.

### vllm-omni-cicd
- Source: [PR #2017](vllm-project/vllm-omni#2017) - [CI] [ROCm] Setup `test-ready.yml` and `test-merge.yml`

### vllm-omni-cicd
- Source: [PR #2014](vllm-project/vllm-omni#2014) - [Test] Implement mock HTTP request handling in benchmark CLI tests

### vllm-omni-perf
- Source: [PR #2014](vllm-project/vllm-omni#2014) - [Test] Implement mock HTTP request handling in benchmark CLI tests

### vllm-omni-serving
- Source: [PR #2012](vllm-project/vllm-omni#2012) - [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips
- Changes:
  - Bug fix: [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips

### vllm-omni-image-gen
- Source: [PR #2012](vllm-project/vllm-omni#2012) - [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips
- Changes:
  - Bug fix: [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips

### vllm-omni-perf
- Source: [PR #2012](vllm-project/vllm-omni#2012) - [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips
- Changes:
  - Bug fix: [Fixbug][Perf] Qwen3-omni: code predictor with re-prefill + SDPA and eliminate decode hot-path CPU round-trips

### vllm-omni-serving
- Source: [PR #2009](vllm-project/vllm-omni#2009) - [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni
- Changes:
  - Bug fix: [Bugfix] revert PR#1758 which introduced the accuracy problem of qwen3-omni

### vllm-omni-image-gen
- Source: [PR #2007](vllm-project/vllm-omni#2007) - [Bugfix]Fix bug of online server can not return mutli images
- Changes:
  - Bug fix: [Bugfix]Fix bug of online server can not return mutli images
- Additions:
  - Qwen-Image-Layered

### vllm-omni-api
- Source: [PR #2007](vllm-project/vllm-omni#2007) - [Bugfix]Fix bug of online server can not return mutli images
- Changes:
  - Bug fix: [Bugfix]Fix bug of online server can not return mutli images

### vllm-omni-cicd
- Source: [PR #1998](vllm-project/vllm-omni#1998) - [CI] Split BAGEL tests into dummy/real weight tiers (L2/L3)

### vllm-omni-serving
- Source: [PR #1985](vllm-project/vllm-omni#1985) - [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls
- Changes:
  - Performance improvement: [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls

### vllm-omni-audio-tts
- Source: [PR #1985](vllm-project/vllm-omni#1985) - [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls
- Changes:
  - Performance improvement: [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls

### vllm-omni-perf
- Source: [PR #1985](vllm-project/vllm-omni#1985) - [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls
- Changes:
  - Performance improvement: [Perf] [Qwen3-TTS] Keep audio_codes and last_talker_hidden on GPU to eliminate per-step sync stalls

### vllm-omni-serving
- Source: [PR #1984](vllm-project/vllm-omni#1984) - [CI] [ROCm] Bugfix device environment issue
- Changes:
  - Bug fix: [CI] [ROCm] Bugfix device environment issue

### vllm-omni-api
- Source: [PR #1984](vllm-project/vllm-omni#1984) - [CI] [ROCm] Bugfix device environment issue
- Changes:
  - Bug fix: [CI] [ROCm] Bugfix device environment issue

### vllm-omni-serving
- Source: [PR #1982](vllm-project/vllm-omni#1982) - [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__
- Changes:
  - Bug fix: [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__

### vllm-omni-cicd
- Source: [PR #1982](vllm-project/vllm-omni#1982) - [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__
- Changes:
  - Bug fix: [Fix] Fix slow hasattr in CUDAGraphWrapper.__getattr__

### vllm-omni-api
- Source: [PR #1979](vllm-project/vllm-omni#1979) - [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series)
- Changes:
  - Bug fix: [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series)
- Additions:
  - `/v1/chat/completions`

### vllm-omni-perf
- Source: [PR #1979](vllm-project/vllm-omni#1979) - [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series)
- Changes:
  - Bug fix: [Bugfix] Fix config misalignment between offline and online diffusion inference (Wan2.2, Qwen-Image series)

### vllm-omni-contrib
- Source: [PR #1976](vllm-project/vllm-omni#1976) - [skip ci][Docs] Update WeChat QR code (fix filename case)
- Changes:
  - Bug fix: [skip ci][Docs] Update WeChat QR code (fix filename case)

### vllm-omni-contrib
- Source: [PR #1974](vllm-project/vllm-omni#1974) - [Docs] Update WeChat QR code for community support

### vllm-omni-cicd
- Source: [PR #1945](vllm-project/vllm-omni#1945) - Fix Base voice clone streaming quality and stop-token crash
- Changes:
  - Bug fix: Fix Base voice clone streaming quality and stop-token crash

### vllm-omni-cicd
- Source: [PR #1938](vllm-project/vllm-omni#1938) - [Test] L4 complete diffusion feature test for Bagel models
- Changes:
  - New feature: [Test] L4 complete diffusion feature test for Bagel models

### vllm-omni-perf
- Source: [PR #1938](vllm-project/vllm-omni#1938) - [Test] L4 complete diffusion feature test for Bagel models
- Changes:
  - New feature: [Test] L4 complete diffusion feature test for Bagel models

### vllm-omni-perf
- Source: [PR #1934](vllm-project/vllm-omni#1934) - Fix OmniGen2 transformer config loading for HF models
- Changes:
  - Bug fix: Fix OmniGen2 transformer config loading for HF models

### vllm-omni-audio-tts
- Source: [PR #1930](vllm-project/vllm-omni#1930) - [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request

### vllm-omni-perf
- Source: [PR #1930](vllm-project/vllm-omni#1930) - [Bug][Qwen3TTS][Streaming] remove dynamic initial chunk and only compute on initial request

### vllm-omni-audio-tts
- Source: [PR #1926](vllm-project/vllm-omni#1926) - [Misc] removed qwen3_tts.py as it is out-dated

### vllm-omni-contrib
- Source: [PR #1920](vllm-project/vllm-omni#1920) - [Docs] Add Wan2.1-T2V as supported video generation models
- Changes:
  - New feature: [Docs] Add Wan2.1-T2V as supported video generation models

### vllm-omni-video-gen
- Source: [PR #1915](vllm-project/vllm-omni#1915) - [Bugfix] fix helios video generate use cpu device
- Changes:
  - Bug fix: [Bugfix] fix helios video generate use cpu device

### vllm-omni-perf
- Source: [PR #1915](vllm-project/vllm-omni#1915) - [Bugfix] fix helios video generate use cpu device
- Changes:
  - Bug fix: [Bugfix] fix helios video generate use cpu device

### vllm-omni-audio-tts
- Source: [PR #1913](vllm-project/vllm-omni#1913) - [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False

### vllm-omni-perf
- Source: [PR #1913](vllm-project/vllm-omni#1913) - [Optim][Qwen3TTS][CodePredictor] support torch.compile with reduce-overhead and dynamic False

### vllm-omni-api
- Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring

### vllm-omni-perf
- Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring

### vllm-omni-contrib
- Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring

### vllm-omni-serving
- Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring

### vllm-omni-cicd
- Source: [PR #1908](vllm-project/vllm-omni#1908) - [Entrypoint][Refactor] vLLM-Omni Entrypoint Refactoring

### vllm-omni-image-gen
- Source: [PR #1900](vllm-project/vllm-omni#1900) - [Feat] support HSDP for Flux family
- Changes:
  - New feature: [Feat] support HSDP for Flux family

### vllm-omni-contrib
- Source: [PR #1900](vllm-project/vllm-omni#1900) - [Feat] support HSDP for Flux family
- Changes:
  - New feature: [Feat] support HSDP for Flux family

### vllm-omni-distributed
- Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml
- Changes:
  - New feature: [Feature]: Remove some useless `hf_overrides` in yaml

### vllm-omni-quantization
- Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml
- Changes:
  - New feature: [Feature]: Remove some useless `hf_overrides` in yaml

### vllm-omni-cicd
- Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml
- Changes:
  - New feature: [Feature]: Remove some useless `hf_overrides` in yaml

### vllm-omni-perf
- Source: [PR #1898](vllm-project/vllm-omni#1898) - [Feature]: Remove some useless `hf_overrides` in yaml
- Changes:
  - New feature: [Feature]: Remove some useless `hf_overrides` in yaml

### vllm-omni-contrib
- Source: [PR #1890](vllm-project/vllm-omni#1890) - [NPU] Upgrade to v0.17.0

### vllm-omni-contrib
- Source: [PR #1889](vllm-project/vllm-omni#1889) - Add `Governance` section
- Changes:
  - New feature: Add `Governance` section

### vllm-omni-distributed
- Source: [PR #1881](vllm-project/vllm-omni#1881) - [Feat] Support T5 Tensor Parallelism
- Changes:
  - New feature: [Feat] Support T5 Tensor Parallelism

### vllm-omni-cicd
- Source: [PR #1881](vllm-project/vllm-omni#1881) - [Feat] Support T5 Tensor Parallelism
- Changes:
  - New feature: [Feat] Support T5 Tensor Parallelism
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026

Signed-off-by: princepride <wangzhipeng628@gmail.com>