
[BugFix] Return proper HTTP status for ErrorResponse in create_speech#1687

Merged
hsliuustc0106 merged 1 commit into vllm-project:main from Lidang-Jiang:fix/speech-error-response-http-200
Mar 5, 2026

Conversation

Contributor

@Lidang-Jiang commented Mar 5, 2026

Summary

  • The /v1/audio/speech endpoint returns HTTP 200 even when the request fails (e.g., model not found): the ErrorResponse produced by _check_model() is returned as a raw Pydantic object, which FastAPI serializes to JSON with the route's default 200 status
  • This fix wraps the ErrorResponse in a JSONResponse carrying the correct HTTP status code, following the pattern already used in create_chat_completion (api_server.py:788-792)

Bug Details

Root Cause

In serving_speech.py, _check_model() returns an ErrorResponse Pydantic object (with error.code=404) when the requested model name doesn't match any served model:

# serving_speech.py:502-505
error_check_ret = await self._check_model(request)
if error_check_ret is not None:
    logger.error("Error with model %s", error_check_ret)
    return error_check_ret  # Returns ErrorResponse Pydantic object

In api_server.py, the create_speech endpoint directly returns this object:

# Before fix
return await handler.create_speech(request, raw_request)
# FastAPI receives ErrorResponse (a Pydantic BaseModel), serializes to JSON → HTTP 200

Existing Correct Pattern

The create_chat_completion endpoint already handles this correctly:

# api_server.py:788-792
if isinstance(generator, ErrorResponse):
    return JSONResponse(
        content=generator.model_dump(),
        status_code=generator.error.code if generator.error else 400,
    )

Fix

Apply the same pattern to create_speech:

# After fix
result = await handler.create_speech(request, raw_request)
if isinstance(result, ErrorResponse):
    return JSONResponse(
        content=result.model_dump(),
        status_code=result.error.code if result.error else 400,
    )
return result

Reproduction

Setup: Deploy Qwen3-TTS via vllm-omni without --served-model-name (so the served name defaults to the full model path).

Test command:

curl -s -o /dev/null -w "HTTP %{http_code}\n" \
  -X POST http://localhost:8100/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3-TTS", "input": "test", "voice": "Chelsie"}'

Before fix: HTTP 200 (bug: error content returned with a success status)
After fix: HTTP 404 (correct: the model-name mismatch is properly reported)

Server Log (before fix)

tts.log (before fix): HTTP 200 returned for an error
WARNING 03-05 19:28:15 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
INFO 03-05 19:28:16 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 03-05 19:28:16 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 03-05 19:28:16 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 03-05 19:28:16 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 03-05 19:28:16 [logo.py:45] 
(APIServer pid=155200) INFO 03-05 19:28:16 [utils.py:287] vLLM server version 0.16.0, serving model /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice
(APIServer pid=155200) INFO 03-05 19:28:16 [utils.py:223] non-default args: {'model_tag': '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', 'host': '0.0.0.0', 'port': 8100, 'chat_template': '/ssd1/jianglidang/workspace/deploy/tts_chat_template.jinja', 'model': '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', 'trust_remote_code': True, 'enforce_eager': True}
(APIServer pid=155200) INFO 03-05 19:28:16 [omni.py:183] Initializing stages for model: /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice
(APIServer pid=155200) INFO 03-05 19:28:16 [omni.py:318] No omni_master_address provided, defaulting to localhost (127.0.0.1)
(APIServer pid=155200) WARNING 03-05 19:28:16 [utils.py:111] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=155200) INFO 03-05 19:28:16 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=155200) INFO 03-05 19:28:16 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=155200) INFO 03-05 19:28:16 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=155200) INFO 03-05 19:28:16 [omni.py:352] [AsyncOrchestrator] Loaded 2 stages
[Stage-1] WARNING 03-05 19:28:25 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-0] WARNING 03-05 19:28:25 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(APIServer pid=155200) INFO 03-05 19:28:25 [omni.py:463] [AsyncOrchestrator] Waiting for 2 stages to initialize (timeout: 600s)
[Stage-1] INFO 03-05 19:28:25 [omni_stage.py:1233] [Stage-1] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-1] INFO 03-05 19:28:25 [initialization.py:324] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 03-05 19:28:25 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 03-05 19:28:25 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 03-05 19:28:25 [omni_stage.py:83] Waiting for global engine init lock (/tmp/vllm_omni_engine_init.lock)...
[Stage-1] INFO 03-05 19:28:25 [omni_stage.py:85] Acquired global engine init lock
[Stage-1] INFO 03-05 19:28:25 [omni_stage.py:122] Using sequential init locks (nvml_available=True, pid_host=False)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-1] INFO 03-05 19:28:25 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-05 19:28:25 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-1] INFO 03-05 19:28:25 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-05 19:28:25 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-1] INFO 03-05 19:28:25 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-05 19:28:26 [omni_stage.py:1233] [Stage-0] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-0] INFO 03-05 19:28:26 [initialization.py:324] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 03-05 19:28:26 [omni_stage.py:83] Waiting for global engine init lock (/tmp/vllm_omni_engine_init.lock)...
[Stage-1] INFO 03-05 19:28:36 [model.py:529] Resolved architecture: Qwen3TTSCode2Wav
[Stage-1] INFO 03-05 19:28:38 [model.py:1871] Downcasting torch.float32 to torch.bfloat16.
[Stage-1] INFO 03-05 19:28:38 [model.py:1549] Using max model len 32768
[Stage-1] INFO 03-05 19:28:38 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=8192.
[Stage-1] INFO 03-05 19:28:38 [vllm.py:689] Asynchronous scheduling is disabled.
[Stage-1] WARNING 03-05 19:28:38 [vllm.py:727] Enforce eager set, overriding optimization level to -O0
[Stage-1] INFO 03-05 19:28:38 [vllm.py:845] Cudagraph is disabled under eager mode
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
[Stage-1] WARNING 03-05 19:28:48 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(EngineCore_DP0 pid=157178) [Stage-1] INFO 03-05 19:28:48 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 
'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=157178) [Stage-1] WARNING 03-05 19:28:48 [multiproc_executor.py:921] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-1] WARNING 03-05 19:28:57 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-1] INFO 03-05 19:28:58 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:21505 backend=nccl
[Stage-1] INFO 03-05 19:28:58 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(Worker pid=157625) [Stage-1] INFO 03-05 19:28:58 [gpu_model_runner.py:4124] Starting to load model /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=157625) [Stage-1] INFO 03-05 19:28:59 [default_loader.py:293] Loading weights took 5213329.11 seconds
(Worker pid=157625) [Stage-1] INFO 03-05 19:29:00 [gpu_model_runner.py:4221] Model loading took 0.0 GiB memory and 0.001023 seconds
(Worker pid=157625) [Stage-1] INFO 03-05 19:29:00 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(Worker pid=157625) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=157625) [Stage-1] INFO 03-05 19:29:00 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(Worker pid=157625) [Stage-1] INFO 03-05 19:29:00 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(Worker pid=157625) [Stage-1] WARNING 03-05 19:29:00 [qwen3_tts_code2wav.py:208] Code2Wav input_ids length 4 not divisible by num_quantizers 16, likely a warmup run; returning empty audio.
(Worker pid=157625) [Stage-1] WARNING 03-05 19:29:00 [gpu_generation_model_runner.py:451] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=157178) [Stage-1] INFO 03-05 19:29:00 [core.py:278] init engine (profile, create kv cache, warmup model) took 0.83 seconds
(EngineCore_DP0 pid=157178) The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=157178) [Stage-1] WARNING 03-05 19:29:01 [scheduler.py:166] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=157178) [Stage-1] WARNING 03-05 19:29:01 [core.py:130] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=157178) [Stage-1] INFO 03-05 19:29:01 [factory.py:46] Created connector: SharedMemoryConnector
(EngineCore_DP0 pid=157178) [Stage-1] INFO 03-05 19:29:01 [vllm.py:689] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=157178) [Stage-1] WARNING 03-05 19:29:01 [vllm.py:734] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=157178) [Stage-1] INFO 03-05 19:29:01 [vllm.py:845] Cudagraph is disabled under eager mode
[Stage-1] INFO 03-05 19:29:01 [omni_stage.py:102] Released global engine init lock
[Stage-0] INFO 03-05 19:29:01 [omni_stage.py:85] Acquired global engine init lock
[Stage-0] INFO 03-05 19:29:01 [omni_stage.py:122] Using sequential init locks (nvml_available=True, pid_host=False)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 03-05 19:29:01 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-05 19:29:01 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-0] INFO 03-05 19:29:01 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-05 19:29:01 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-05 19:29:01 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=155200) INFO 03-05 19:29:02 [omni.py:453] [AsyncOrchestrator] Stage-1 reported ready
[Stage-0] INFO 03-05 19:29:12 [model.py:529] Resolved architecture: Qwen3TTSTalkerForConditionalGeneration
[Stage-0] INFO 03-05 19:29:14 [model.py:1871] Downcasting torch.float32 to torch.bfloat16.
[Stage-0] INFO 03-05 19:29:14 [model.py:1549] Using max model len 4096
[Stage-0] INFO 03-05 19:29:14 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=512.
[Stage-0] INFO 03-05 19:29:14 [vllm.py:689] Asynchronous scheduling is disabled.
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
[Stage-0] WARNING 03-05 19:29:24 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(EngineCore_DP0 pid=158706) [Stage-0] INFO 03-05 19:29:24 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=158706) [Stage-0] WARNING 03-05 19:29:24 [multiproc_executor.py:921] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-0] WARNING 03-05 19:29:33 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-0] INFO 03-05 19:29:34 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:10003 backend=nccl
[Stage-0] INFO 03-05 19:29:34 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:35 [gpu_model_runner.py:4124] Starting to load model /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:35 [cuda.py:367] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:36 [vllm.py:689] Asynchronous scheduling is disabled.
(Worker pid=159165) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker pid=159165) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.11s/it]
(Worker pid=159165) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:02<00:00,  2.11s/it]
(Worker pid=159165) 
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:38 [qwen3_tts_talker.py:1537] Loaded 305 weights for Qwen3TTSTalkerForConditionalGeneration
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:38 [default_loader.py:293] Loading weights took 2.30 seconds
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:39 [gpu_model_runner.py:4221] Model loading took 3.62 GiB memory and 3.019834 seconds
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:45 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:45 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:45 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:45 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:45 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:47 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/d8a554edba/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:47 [backends.py:976] Dynamo bytecode transform time: 7.44 s
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:54 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 512) from the cache, took 2.564 s
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:54 [monitor.py:34] torch.compile takes 10.01 s in total
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:56 [base.py:102] Available KV cache memory: 35.97 GiB (profiling fallback)
(EngineCore_DP0 pid=158706) [Stage-0] INFO 03-05 19:29:56 [kv_cache_utils.py:1307] GPU KV cache size: 336,784 tokens
(EngineCore_DP0 pid=158706) [Stage-0] INFO 03-05 19:29:56 [kv_cache_utils.py:1312] Maximum concurrency for 4,096 tokens per request: 82.22x
(Worker pid=159165) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/4 [00:00<?, ?it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  50%|█████     | 2/4 [00:00<00:00, 14.79it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:00<00:00, 15.40it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:00<00:00, 15.29it/s]
(Worker pid=159165) 
Capturing CUDA graphs (decode, FULL):   0%|          | 0/3 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):  67%|██████▋   | 2/3 [00:00<00:00, 16.58it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 3/3 [00:00<00:00, 18.03it/s]
(Worker pid=159165) [Stage-0] INFO 03-05 19:29:58 [gpu_model_runner.py:5246] Graph capturing finished in 2 secs, took 0.06 GiB
(EngineCore_DP0 pid=158706) [Stage-0] INFO 03-05 19:29:58 [core.py:278] init engine (profile, create kv cache, warmup model) took 18.85 seconds
(EngineCore_DP0 pid=158706) The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=158706) [Stage-0] WARNING 03-05 19:29:58 [scheduler.py:166] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=158706) [Stage-0] INFO 03-05 19:29:58 [factory.py:46] Created connector: SharedMemoryConnector
(EngineCore_DP0 pid=158706) [Stage-0] INFO 03-05 19:29:59 [vllm.py:689] Asynchronous scheduling is disabled.
[Stage-0] INFO 03-05 19:29:59 [omni_stage.py:102] Released global engine init lock
(APIServer pid=155200) INFO 03-05 19:29:59 [omni.py:453] [AsyncOrchestrator] Stage-0 reported ready
(APIServer pid=155200) INFO 03-05 19:29:59 [omni.py:482] [AsyncOrchestrator] All stages initialized successfully
(APIServer pid=155200) The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=155200) INFO 03-05 19:30:00 [async_omni.py:234] [AsyncOrchestrator] Initialized input_processor, io_processor, and model_config from stage-0
(APIServer pid=155200) WARNING 03-05 19:30:00 [api_server.py:469] vllm_config is None, some features may not work correctly
(APIServer pid=155200) INFO 03-05 19:30:00 [api_server.py:477] Supported tasks: {'generate'}
(APIServer pid=155200) WARNING 03-05 19:30:00 [api_server.py:548] Cannot initialize processors: vllm_config is None. OpenAIServingModels may fail.
(APIServer pid=155200) WARNING 03-05 19:30:00 [model.py:1350] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.9, 'max_tokens': 8192}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=155200) INFO 03-05 19:30:00 [serving.py:188] Warming up chat template processing...
(APIServer pid=155200) INFO 03-05 19:30:00 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=155200) INFO 03-05 19:30:00 [serving.py:213] Chat template warmup completed in 15.0ms
(APIServer pid=155200) INFO 03-05 19:30:00 [serving_speech.py:76] Loaded 9 supported speakers: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']
(APIServer pid=155200) INFO 03-05 19:30:00 [serving_speech.py:94] Loaded codec frame rate: 12.5 Hz (output_sample_rate=24000, encode_downsample_rate=1920)
(APIServer pid=155200) INFO 03-05 19:30:00 [api_server.py:265] Starting vLLM API server (pure diffusion mode) on http://0.0.0.0:8100
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:38] Available routes are:
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=155200) INFO 03-05 19:30:00 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=155200) INFO:     Started server process [155200]
(APIServer pid=155200) INFO:     Waiting for application startup.
(APIServer pid=155200) INFO:     Application startup complete.
(APIServer pid=155200) ERROR 03-05 19:37:11 [serving_speech.py:504] Error with model error=ErrorInfo(message='The model `Qwen3-TTS` does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=155200) INFO:     127.0.0.1:10002 - "POST /v1/audio/speech HTTP/1.1" 200 OK

Server Log (after fix)

fixed_tts.log (after fix): HTTP 404 returned correctly
WARNING 03-05 19:39:32 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
INFO 03-05 19:39:33 [logo.py:45]        █     █     █▄   ▄█       ▄▀▀▀▀▄ █▄   ▄█ █▄    █ ▀█▀ 
INFO 03-05 19:39:33 [logo.py:45]  ▄▄ ▄█ █     █     █ ▀▄▀ █  ▄▄▄  █    █ █ ▀▄▀ █ █ ▀▄  █  █  
INFO 03-05 19:39:33 [logo.py:45]   █▄█▀ █     █     █     █       █    █ █     █ █   ▀▄█  █  
INFO 03-05 19:39:33 [logo.py:45]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀        ▀▀▀▀  ▀     ▀ ▀     ▀ ▀▀▀ 
INFO 03-05 19:39:33 [logo.py:45] 
(APIServer pid=2058) INFO 03-05 19:39:33 [utils.py:287] vLLM server version 0.16.0, serving model /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice
(APIServer pid=2058) INFO 03-05 19:39:33 [utils.py:223] non-default args: {'model_tag': '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', 'host': '0.0.0.0', 'port': 8100, 'chat_template': '/ssd1/jianglidang/workspace/deploy/tts_chat_template.jinja', 'model': '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', 'trust_remote_code': True, 'enforce_eager': True}
(APIServer pid=2058) INFO 03-05 19:39:33 [omni.py:183] Initializing stages for model: /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice
(APIServer pid=2058) INFO 03-05 19:39:33 [omni.py:318] No omni_master_address provided, defaulting to localhost (127.0.0.1)
(APIServer pid=2058) WARNING 03-05 19:39:33 [utils.py:111] Filtered out 1 callable object(s) from base_engine_args that are not compatible with OmegaConf: ['dispatch_function']. 
(APIServer pid=2058) INFO 03-05 19:39:34 [initialization.py:270] Loaded OmniTransferConfig with 1 connector configurations
(APIServer pid=2058) INFO 03-05 19:39:34 [factory.py:46] Created connector: SharedMemoryConnector
(APIServer pid=2058) INFO 03-05 19:39:34 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
(APIServer pid=2058) INFO 03-05 19:39:34 [omni.py:352] [AsyncOrchestrator] Loaded 2 stages
[Stage-1] WARNING 03-05 19:39:42 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-0] WARNING 03-05 19:39:42 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(APIServer pid=2058) INFO 03-05 19:39:43 [omni.py:463] [AsyncOrchestrator] Waiting for 2 stages to initialize (timeout: 600s)
[Stage-1] INFO 03-05 19:39:43 [omni_stage.py:1233] [Stage-1] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-1] INFO 03-05 19:39:43 [initialization.py:324] [Stage-1] Initializing OmniConnectors with config keys: ['from_stage_0']
[Stage-1] INFO 03-05 19:39:43 [factory.py:46] Created connector: SharedMemoryConnector
[Stage-1] INFO 03-05 19:39:43 [initialization.py:60] Created connector for 0 -> 1: SharedMemoryConnector
[Stage-1] INFO 03-05 19:39:43 [omni_stage.py:83] Waiting for global engine init lock (/tmp/vllm_omni_engine_init.lock)...
[Stage-1] INFO 03-05 19:39:43 [omni_stage.py:85] Acquired global engine init lock
[Stage-1] INFO 03-05 19:39:43 [omni_stage.py:122] Using sequential init locks (nvml_available=True, pid_host=False)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-1] INFO 03-05 19:39:43 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-05 19:39:43 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-1] INFO 03-05 19:39:43 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-1] INFO 03-05 19:39:43 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-1] INFO 03-05 19:39:43 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-05 19:39:43 [omni_stage.py:1233] [Stage-0] ZMQ transport detected; disabling SHM IPC (shm_threshold_bytes set to maxsize)
[Stage-0] INFO 03-05 19:39:43 [initialization.py:324] [Stage-0] Initializing OmniConnectors with config keys: ['to_stage_1']
[Stage-0] INFO 03-05 19:39:43 [omni_stage.py:83] Waiting for global engine init lock (/tmp/vllm_omni_engine_init.lock)...
[Stage-1] INFO 03-05 19:39:53 [model.py:529] Resolved architecture: Qwen3TTSCode2Wav
[Stage-1] INFO 03-05 19:39:55 [model.py:1871] Downcasting torch.float32 to torch.bfloat16.
[Stage-1] INFO 03-05 19:39:55 [model.py:1549] Using max model len 32768
[Stage-1] INFO 03-05 19:39:55 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=8192.
[Stage-1] INFO 03-05 19:39:55 [vllm.py:689] Asynchronous scheduling is disabled.
[Stage-1] WARNING 03-05 19:39:55 [vllm.py:727] Enforce eager set, overriding optimization level to -O0
[Stage-1] INFO 03-05 19:39:55 [vllm.py:845] Cudagraph is disabled under eager mode
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
[Stage-1] WARNING 03-05 19:40:05 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(EngineCore_DP0 pid=3946) [Stage-1] INFO 03-05 19:40:05 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=32768, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 
'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=3946) [Stage-1] WARNING 03-05 19:40:05 [multiproc_executor.py:921] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-1] WARNING 03-05 19:40:14 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-1] INFO 03-05 19:40:14 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:10003 backend=nccl
[Stage-1] INFO 03-05 19:40:15 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(Worker pid=4382) [Stage-1] INFO 03-05 19:40:15 [gpu_model_runner.py:4124] Starting to load model /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=4382) [Stage-1] INFO 03-05 19:40:16 [default_loader.py:293] Loading weights took 5214006.10 seconds
(Worker pid=4382) [Stage-1] INFO 03-05 19:40:17 [gpu_model_runner.py:4221] Model loading took 0.0 GiB memory and 0.001044 seconds
(Worker pid=4382) [Stage-1] INFO 03-05 19:40:17 [kernel_warmup.py:44] Skipping FlashInfer autotune because it is disabled.
(Worker pid=4382) `torch_dtype` is deprecated! Use `dtype` instead!
(Worker pid=4382) [Stage-1] INFO 03-05 19:40:17 [configuration_qwen3_tts_tokenizer_v2.py:156] encoder_config is None. Initializing encoder with default values
(Worker pid=4382) [Stage-1] INFO 03-05 19:40:17 [configuration_qwen3_tts_tokenizer_v2.py:159] decoder_config is None. Initializing decoder with default values
(Worker pid=4382) [Stage-1] WARNING 03-05 19:40:17 [qwen3_tts_code2wav.py:208] Code2Wav input_ids length 4 not divisible by num_quantizers 16, likely a warmup run; returning empty audio.
(Worker pid=4382) [Stage-1] WARNING 03-05 19:40:17 [gpu_generation_model_runner.py:451] Dummy sampler run is not implemented for generation model
(EngineCore_DP0 pid=3946) [Stage-1] INFO 03-05 19:40:17 [core.py:278] init engine (profile, create kv cache, warmup model) took 0.77 seconds
(EngineCore_DP0 pid=3946) The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=3946) [Stage-1] WARNING 03-05 19:40:18 [scheduler.py:166] Using custom scheduler class vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=3946) [Stage-1] WARNING 03-05 19:40:18 [core.py:130] Disabling chunked prefill for model without KVCache
(EngineCore_DP0 pid=3946) [Stage-1] INFO 03-05 19:40:18 [factory.py:46] Created connector: SharedMemoryConnector
(EngineCore_DP0 pid=3946) [Stage-1] INFO 03-05 19:40:18 [vllm.py:689] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=3946) [Stage-1] WARNING 03-05 19:40:18 [vllm.py:734] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore_DP0 pid=3946) [Stage-1] INFO 03-05 19:40:18 [vllm.py:845] Cudagraph is disabled under eager mode
[Stage-1] INFO 03-05 19:40:18 [omni_stage.py:102] Released global engine init lock
[Stage-0] INFO 03-05 19:40:18 [omni_stage.py:85] Acquired global engine init lock
[Stage-0] INFO 03-05 19:40:18 [omni_stage.py:122] Using sequential init locks (nvml_available=True, pid_host=False)
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
[Stage-0] INFO 03-05 19:40:18 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-05 19:40:18 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
[Stage-0] INFO 03-05 19:40:18 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
[Stage-0] INFO 03-05 19:40:18 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
[Stage-0] INFO 03-05 19:40:18 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(APIServer pid=2058) INFO 03-05 19:40:19 [omni.py:453] [AsyncOrchestrator] Stage-1 reported ready
[Stage-0] INFO 03-05 19:40:29 [model.py:529] Resolved architecture: Qwen3TTSTalkerForConditionalGeneration
[Stage-0] INFO 03-05 19:40:31 [model.py:1871] Downcasting torch.float32 to torch.bfloat16.
[Stage-0] INFO 03-05 19:40:31 [model.py:1549] Using max model len 4096
[Stage-0] INFO 03-05 19:40:31 [scheduler.py:224] Chunked prefill is enabled with max_num_batched_tokens=512.
[Stage-0] INFO 03-05 19:40:31 [vllm.py:689] Asynchronous scheduling is disabled.
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
[Stage-0] WARNING 03-05 19:40:41 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
(EngineCore_DP0 pid=5463) [Stage-0] INFO 03-05 19:40:41 [core.py:97] Initializing a V1 LLM engine (v0.16.0) with config: model='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', speculative_config=None, tokenizer='/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=5463) [Stage-0] WARNING 03-05 19:40:41 [multiproc_executor.py:921] Reducing Torch parallelism from 80 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
[Stage-0] WARNING 03-05 19:40:50 [envs.py:94] No Flash Attention backend found, using pytorch SDPA implementation
[Stage-0] INFO 03-05 19:40:50 [parallel_state.py:1234] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:31155 backend=nccl
[Stage-0] INFO 03-05 19:40:51 [parallel_state.py:1445] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A
(Worker pid=5898) [Stage-0] INFO 03-05 19:40:51 [gpu_model_runner.py:4124] Starting to load model /ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice...
(Worker pid=5898) [Stage-0] INFO 03-05 19:40:52 [cuda.py:367] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(Worker pid=5898) [Stage-0] INFO 03-05 19:40:52 [vllm.py:689] Asynchronous scheduling is disabled.
(Worker pid=5898) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(Worker pid=5898) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.98s/it]
(Worker pid=5898) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00,  1.98s/it]
(Worker pid=5898) 
(Worker pid=5898) [Stage-0] INFO 03-05 19:40:54 [qwen3_tts_talker.py:1537] Loaded 305 weights for Qwen3TTSTalkerForConditionalGeneration
(Worker pid=5898) [Stage-0] INFO 03-05 19:40:54 [default_loader.py:293] Loading weights took 2.12 seconds
(Worker pid=5898) [Stage-0] INFO 03-05 19:40:55 [gpu_model_runner.py:4221] Model loading took 3.62 GiB memory and 2.803808 seconds
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:02 [configuration_qwen3_tts.py:489] talker_config is None. Initializing talker model with default values
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:02 [configuration_qwen3_tts.py:492] speaker_encoder_config is None. Initializing talker model with default values
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:02 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:02 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:02 [configuration_qwen3_tts.py:441] code_predictor_config is None. Initializing code_predictor model with default values
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:03 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/d8a554edba/rank_0_0/backbone for vLLM's torch.compile
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:03 [backends.py:976] Dynamo bytecode transform time: 7.47 s
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:11 [backends.py:267] Directly load the compiled graph(s) for compile range (1, 512) from the cache, took 2.696 s
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:11 [monitor.py:34] torch.compile takes 10.17 s in total
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:12 [base.py:102] Available KV cache memory: 35.97 GiB (profiling fallback)
(EngineCore_DP0 pid=5463) [Stage-0] INFO 03-05 19:41:12 [kv_cache_utils.py:1307] GPU KV cache size: 336,784 tokens
(EngineCore_DP0 pid=5463) [Stage-0] INFO 03-05 19:41:12 [kv_cache_utils.py:1312] Maximum concurrency for 4,096 tokens per request: 82.22x
(Worker pid=5898) 
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):   0%|          | 0/4 [00:00<?, ?it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE):  50%|█████     | 2/4 [00:00<00:00, 15.40it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:00<00:00, 16.88it/s]
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 4/4 [00:00<00:00, 16.62it/s]
(Worker pid=5898) 
Capturing CUDA graphs (decode, FULL):   0%|          | 0/3 [00:00<?, ?it/s]
Capturing CUDA graphs (decode, FULL):  67%|██████▋   | 2/3 [00:00<00:00, 18.89it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 3/3 [00:00<00:00, 20.68it/s]
(Worker pid=5898) [Stage-0] INFO 03-05 19:41:13 [gpu_model_runner.py:5246] Graph capturing finished in 1 secs, took 0.06 GiB
(EngineCore_DP0 pid=5463) [Stage-0] INFO 03-05 19:41:13 [core.py:278] init engine (profile, create kv cache, warmup model) took 18.41 seconds
(EngineCore_DP0 pid=5463) The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(EngineCore_DP0 pid=5463) [Stage-0] WARNING 03-05 19:41:14 [scheduler.py:166] Using custom scheduler class vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler. This scheduler interface is not public and compatibility may not be maintained.
(EngineCore_DP0 pid=5463) [Stage-0] INFO 03-05 19:41:14 [factory.py:46] Created connector: SharedMemoryConnector
(EngineCore_DP0 pid=5463) [Stage-0] INFO 03-05 19:41:14 [vllm.py:689] Asynchronous scheduling is disabled.
[Stage-0] INFO 03-05 19:41:15 [omni_stage.py:102] Released global engine init lock
(APIServer pid=2058) INFO 03-05 19:41:15 [omni.py:453] [AsyncOrchestrator] Stage-0 reported ready
(APIServer pid=2058) INFO 03-05 19:41:15 [omni.py:482] [AsyncOrchestrator] All stages initialized successfully
(APIServer pid=2058) The tokenizer you are loading from '/ssd1/models/qwen3-tts/Qwen3-TTS-12Hz-1.7B-CustomVoice' with an incorrect regex pattern: https://huggingface.co/mistralai/Mistral-Small-3.1-24B-Instruct-2503/discussions/84#69121093e8b480e709447d5e. This will lead to incorrect tokenization. You should set the `fix_mistral_regex=True` flag when loading this tokenizer to fix this issue.
(APIServer pid=2058) INFO 03-05 19:41:16 [async_omni.py:234] [AsyncOrchestrator] Initialized input_processor, io_processor, and model_config from stage-0
(APIServer pid=2058) WARNING 03-05 19:41:16 [api_server.py:469] vllm_config is None, some features may not work correctly
(APIServer pid=2058) INFO 03-05 19:41:16 [api_server.py:477] Supported tasks: {'generate'}
(APIServer pid=2058) WARNING 03-05 19:41:16 [api_server.py:548] Cannot initialize processors: vllm_config is None. OpenAIServingModels may fail.
(APIServer pid=2058) WARNING 03-05 19:41:16 [model.py:1350] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'repetition_penalty': 1.05, 'temperature': 0.9, 'max_tokens': 8192}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=2058) INFO 03-05 19:41:16 [serving.py:188] Warming up chat template processing...
(APIServer pid=2058) INFO 03-05 19:41:16 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=2058) INFO 03-05 19:41:16 [serving.py:213] Chat template warmup completed in 12.6ms
(APIServer pid=2058) INFO 03-05 19:41:16 [serving_speech.py:76] Loaded 9 supported speakers: ['aiden', 'dylan', 'eric', 'ono_anna', 'ryan', 'serena', 'sohee', 'uncle_fu', 'vivian']
(APIServer pid=2058) INFO 03-05 19:41:16 [serving_speech.py:94] Loaded codec frame rate: 12.5 Hz (output_sample_rate=24000, encode_downsample_rate=1920)
(APIServer pid=2058) INFO 03-05 19:41:16 [api_server.py:265] Starting vLLM API server (pure diffusion mode) on http://0.0.0.0:8100
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:38] Available routes are:
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/audio/speech, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/audio/voices, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/images/generations, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/images/edits, Methods: POST
(APIServer pid=2058) INFO 03-05 19:41:16 [launcher.py:47] Route: /v1/videos, Methods: POST
(APIServer pid=2058) INFO:     Started server process [2058]
(APIServer pid=2058) INFO:     Waiting for application startup.
(APIServer pid=2058) INFO:     Application startup complete.
(APIServer pid=2058) ERROR 03-05 19:41:39 [serving_speech.py:504] Error with model error=ErrorInfo(message='The model `Qwen3-TTS` does not exist.', type='NotFoundError', param='model', code=404)
(APIServer pid=2058) INFO:     127.0.0.1:20868 - "POST /v1/audio/speech HTTP/1.1" 404 Not Found
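The 404 in the log above comes from the endpoint-side unwrap described in the PR. A hedged, self-contained sketch of that wiring, assuming the names used in the PR (`create_speech`, `ErrorResponse`, `JSONResponse`) — these are simplified stand-ins, not the real vLLM classes:

```python
import asyncio
from dataclasses import asdict, dataclass


# Hypothetical stand-ins mirroring the names in the PR, not the real classes.
@dataclass
class ErrorInfo:
    message: str
    code: int


@dataclass
class ErrorResponse:
    error: ErrorInfo

    def model_dump(self):  # mirrors the Pydantic method the fix calls
        return asdict(self)


@dataclass
class JSONResponse:
    content: dict
    status_code: int = 200


async def create_speech(model_exists: bool):
    # Stand-in for the handler: returns an ErrorResponse when
    # _check_model-style validation fails, audio bytes otherwise.
    if not model_exists:
        return ErrorResponse(
            ErrorInfo("The model `Qwen3-TTS` does not exist.", 404)
        )
    return b"fake-audio-bytes"


async def create_speech_endpoint(model_exists: bool):
    result = await create_speech(model_exists)
    # The fix: unwrap ErrorResponse into a JSONResponse so the HTTP
    # status code propagates instead of the default 200.
    if isinstance(result, ErrorResponse):
        return JSONResponse(
            content=result.model_dump(),
            status_code=result.error.code if result.error else 400,
        )
    return result


resp = asyncio.run(create_speech_endpoint(model_exists=False))
print(resp.status_code)  # 404
```

The success path is untouched: when the model name matches, the handler's normal return value passes through unchanged, which is why the fix has no side effects on working TTS requests.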

Test plan

  • Syntax validation passes (python -c "import ast; ast.parse(...)")
  • Deployed on server without --served-model-name, confirmed HTTP 404 returned for mismatched model name
  • Deployed with --served-model-name Qwen3-TTS, confirmed normal TTS generation works
  • CI passes

Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review

Rating: 9.5/10 | Verdict: ✅ Approved

Summary

Clean fix for the HTTP status code bug in the /v1/audio/speech endpoint. When a request fails (e.g., model not found), the endpoint now correctly returns HTTP 404 instead of HTTP 200.

Strengths

  1. Follows Existing Pattern - Reuses the exact same error handling pattern from create_chat_completion (api_server.py:788-792)

  2. Correct Behavior - Per API spec:

    • Before: HTTP 200 with error content (bug)
    • After: HTTP 404 for model not found (correct)
  3. Minimal & Safe - Only 7 lines added, no side effects on normal TTS generation

  4. Well-Documented - Clear root cause analysis and reproduction steps


Great catch! Simple fix, big impact on API correctness. 🚀

@hsliuustc0106 hsliuustc0106 added the ready label to trigger buildkite CI label Mar 5, 2026
…ode in create_speech

The create_speech endpoint returns ErrorResponse as a raw Pydantic object
when model validation fails (e.g. _check_model() returns 404). FastAPI
serializes it as JSON with HTTP 200, making it impossible for clients to
detect errors via status code.

This follows the same pattern already used in create_chat_completion
(api_server.py:788-792): check isinstance(result, ErrorResponse) and
wrap in JSONResponse with the correct status_code.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
@Lidang-Jiang Lidang-Jiang force-pushed the fix/speech-error-response-http-200 branch from 519c05b to 6ce018e Compare March 5, 2026 12:04
@Lidang-Jiang Lidang-Jiang changed the title fix(speech): return proper HTTP status for ErrorResponse in create_speech [BugFix] Return proper HTTP status for ErrorResponse in create_speech Mar 5, 2026
@hsliuustc0106
Collaborator

@yenuo26 this may be used in L5 test for error

@hsliuustc0106 hsliuustc0106 merged commit 28c2200 into vllm-project:main Mar 5, 2026
7 checks passed
linyueqian pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 5, 2026
@yenuo26
Collaborator

yenuo26 commented Mar 6, 2026

@yenuo26 this may be used in L5 test for error

Once the test cases for the basic features of this L4 iteration are completed, we will add reliability cases covering abnormal scenarios in the next iteration.

hsliuustc0106 added a commit to hsliuustc0106/vllm-omni-skills that referenced this pull request Mar 7, 2026
### vllm-omni-api
- Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"
- Changes:
  - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"

### vllm-omni-contrib
- Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"
- Changes:
  - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)"

### vllm-omni-api
- Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]:  Add vae-patch-parallel CLI argument in online serving
- Changes:
  - New feature: [Feature]:  Add vae-patch-parallel CLI argument in online serving

### vllm-omni-contrib
- Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]:  Add vae-patch-parallel CLI argument in online serving
- Changes:
  - New feature: [Feature]:  Add vae-patch-parallel CLI argument in online serving

### vllm-omni-contrib
- Source: [PR #1693](vllm-project/vllm-omni#1693) - [skip CI][Docs] Add TTS model developer guide
- Changes:
  - New feature: [skip CI][Docs] Add TTS model developer guide

### vllm-omni-audio-tts
- Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1
- Changes:
  - Bug fix: [MiMo-Audio] Bugfix tp lg than 1

### vllm-omni-distributed
- Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1
- Changes:
  - Bug fix: [MiMo-Audio] Bugfix tp lg than 1

### vllm-omni-perf
- Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1
- Changes:
  - Bug fix: [MiMo-Audio] Bugfix tp lg than 1

### vllm-omni-perf
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech

### vllm-omni-distributed
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech

### vllm-omni-api
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Additions:
  - `/v1/audio/speech`

### vllm-omni-quantization
- Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech
- Changes:
  - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech

### vllm-omni-cicd
- Source: [PR #1683](vllm-project/vllm-omni#1683) - [CI] Remove high concurrency tests before issue #1374 fixed.
- Changes:
  - Bug fix: [CI] Remove high concurrency tests before issue #1374 fixed.

### vllm-omni-audio-tts
- Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS
- Changes:
  - New feature: Add non-async chunk support for Qwen3-TTS

### vllm-omni-cicd
- Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS
- Changes:
  - New feature: Add non-async chunk support for Qwen3-TTS

### vllm-omni-cicd
- Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type

### vllm-omni-perf
- Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type

### vllm-omni-serving
- Source: [PR #1675](vllm-project/vllm-omni#1675) - [Misc] remove logits_processor_pattern this field, because vllm have …

### vllm-omni-cicd
- Source: [PR #1666](vllm-project/vllm-omni#1666) - [Cleanup] Move cosyvoice3 tests to model subdirectory

### vllm-omni-audio-tts
- Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder
- Changes:
  - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder

### vllm-omni-cicd
- Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder
- Changes:
  - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder

### vllm-omni-distributed
- Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk

### vllm-omni-contrib
- Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk

### vllm-omni-quantization
- Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models
- Changes:
  - New feature: [UX] Add progress bar for diffusion models

### vllm-omni-perf
- Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models
- Changes:
  - New feature: [UX] Add progress bar for diffusion models

### vllm-omni-distributed
- Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project

### vllm-omni-quantization
- Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project

### vllm-omni-perf
- Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project

### vllm-omni-contrib
- Source: [PR #1649](vllm-project/vllm-omni#1649) - [Misc] update wechat

### vllm-omni-perf
- Source: [PR #1642](vllm-project/vllm-omni#1642) - [chore] add _repeated_blocks for regional compilation support
- Changes:
  - New feature: [chore] add _repeated_blocks for regional compilation support

### vllm-omni-api
- Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes
- Changes:
  - New feature: [Bugfix] Add TTS request validation to prevent engine crashes

### vllm-omni-cicd
- Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes
- Changes:
  - New feature: [Bugfix] Add TTS request validation to prevent engine crashes

### vllm-omni-image-gen
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Additions:
  - text-to-image
  - Text-to-Image
  - Flux

### vllm-omni-quantization
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Additions:
  - FP8 support or improvements

### vllm-omni-contrib
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer

### vllm-omni-perf
- Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer
- Changes:
  - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer

### vllm-omni-contrib
- Source: [PR #1631](vllm-project/vllm-omni#1631) - [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup
- Changes:
  - Bug fix: [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup

### vllm-omni-cicd
- Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases

### vllm-omni-perf
- Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases

### vllm-omni-perf
- Source: [PR #1619](vllm-project/vllm-omni#1619) - [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context
- Changes:
  - Bug fix: [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context

### vllm-omni-perf
- Source: [PR #1617](vllm-project/vllm-omni#1617) - [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph
- Changes:
  - Performance improvement: [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph

### vllm-omni-contrib
- Source: [PR #1615](vllm-project/vllm-omni#1615) - [Doc] Fix links in the configuration doc
- Changes:
  - Bug fix: [Doc] Fix links in the configuration doc

### vllm-omni-audio-tts
- Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor
- Changes:
  - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor

### vllm-omni-perf
- Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor
- Changes:
  - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor

### vllm-omni-image-gen
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Additions:
  - GLM-Image

### vllm-omni-api
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation

### vllm-omni-perf
- Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation
- Changes:
  - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation

### vllm-omni-contrib
- Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios  from ByteDance

### vllm-omni-perf
- Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios  from ByteDance

### vllm-omni-serving
- Source: [PR #1602](vllm-project/vllm-omni#1602) - [Bugfix] fix kernel error for qwen3-omni
- Changes:
  - Bug fix: [Bugfix] fix kernel error for qwen3-omni

### vllm-omni-distributed
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-image-gen
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Additions:
  - HunyuanImage3
  - HunyuanImage3Pipeline
  - HunyuanImage-3

### vllm-omni-quantization
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-perf
- Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0
- Changes:
  - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0

### vllm-omni-audio-tts
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-api
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-cicd
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-contrib
- Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase
- Changes:
  - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase

### vllm-omni-api
- Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor

### vllm-omni-serving
- Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor

### vllm-omni-distributed
- Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode
- Changes:
  - New feature: [Feature][Bagel] Add CFG parallel mode

### vllm-omni-cicd
- Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode
- Changes:
  - New feature: [Feature][Bagel] Add CFG parallel mode

### vllm-omni-perf
- Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode
- Changes:
  - New feature: [Feature][Bagel] Add CFG parallel mode

### vllm-omni-contrib
- Source: [PR #1576](vllm-project/vllm-omni#1576) - 0.16.0 release

### vllm-omni-audio-tts
- Source: [PR #1570](vllm-project/vllm-omni#1570) - [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio
- Changes:
  - Bug fix: [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio

### vllm-omni-api
- Source: [PR #1566](vllm-project/vllm-omni#1566) - [Bugfix] Import InputPreprocessor into Renderer
- Changes:
  - Bug fix: [Bugfix] Import InputPreprocessor into Renderer

### vllm-omni-distributed
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-quantization
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-perf
- Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai

### vllm-omni-image-gen
- Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln
- Changes:
  - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln
- Additions:
  - mindiesd
  - Qwen-Image-Edit-2509

### vllm-omni-perf
- Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln
- Changes:
  - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln

### vllm-omni-serving
- Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving
- Changes:
  - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving

### vllm-omni-perf
- Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving
- Changes:
  - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving
lishunyang12 pushed a commit to lishunyang12/vllm-omni that referenced this pull request Mar 11, 2026
…vllm-project#1687)

Signed-off-by: Lidang-Jiang <lidangjiang@gmail.com>
Signed-off-by: lishunyang <lishunyang12@163.com>

Labels: ready (label to trigger buildkite CI)

Projects: None yet

3 participants