
Add Fish Speech S2 Pro support with online serving and voice cloning#1798

Merged: linyueqian merged 7 commits into vllm-project:main from linyueqian:feature/fish-speech-s2-pro on Mar 12, 2026

Conversation

Collaborator

@linyueqian linyueqian commented Mar 11, 2026

Summary

  • Add Fish Speech S2 Pro (fishaudio/s2-pro) model support with dual-AR architecture (4B Slow AR + Fast AR + DAC decoder)
  • Implement online serving via /v1/audio/speech endpoint with streaming support
  • Add voice cloning: DAC-encode reference audio to semantic tokens, prepend as system message conditioning

Changes

Model files (vllm_omni/model_executor/models/fish_speech/)

  • fish_speech_slow_ar.py — Slow AR decoder with RoPE fix and sqrt normalization
  • fish_speech_fast_ar.py — Fast AR decoder with interleaved RoPE
  • fish_speech_dac_decoder.py — DAC codec decoder (44.1kHz output)
  • dac_encoder.py — CPU-based DAC encoder for voice cloning reference audio
  • configuration_fish_speech.py — Model config (fish_qwen3_omni)
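For readers unfamiliar with the term, "interleaved" RoPE (the GPT-J convention that this PR cites as matching Fish Speech training) rotates adjacent pairs (x0, x1), (x2, x3), and so on, whereas the half-split (GPT-NeoX) convention pairs element i with element i + d/2. The following is a minimal pure-Python sketch of the difference, not the PR's tensor code; the function names and the base of 10000 are illustrative assumptions.

```python
import math


def rope_angles(position: int, head_dim: int, base: float = 10000.0) -> list[float]:
    """Return one rotation angle per pair of dimensions for a given position."""
    return [position / (base ** (2 * i / head_dim)) for i in range(head_dim // 2)]


def rope_interleaved(x: list[float], position: int) -> list[float]:
    """Rotate adjacent pairs (x[0], x[1]), (x[2], x[3]), ... (GPT-J style)."""
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x))):
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out


def rope_half_split(x: list[float], position: int) -> list[float]:
    """Rotate pairs (x[i], x[i + d/2]) (GPT-NeoX style), shown for contrast."""
    half = len(x) // 2
    out = list(x)
    for i, theta in enumerate(rope_angles(position, len(x))):
        a, b = x[i], x[i + half]
        out[i] = a * math.cos(theta) - b * math.sin(theta)
        out[i + half] = a * math.sin(theta) + b * math.cos(theta)
    return out
```

Both variants apply the same angles; they differ only in which elements form a pair, which is why loading weights trained with one convention into code using the other produces wrong attention scores.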

Online serving

  • Fish Speech prompt builder in serving_speech.py with voice cloning support
  • Voice cloning: encode ref audio → semantic tokens → system message prefix

Stage config & input processors

  • fish_speech_s2_pro.yaml — Two-stage pipeline with async chunk streaming
  • fish_speech.py — Stage input processor for Slow AR → DAC decoder

Examples

  • examples/offline_inference/fish_speech/end2end.py — Offline inference
  • examples/online_serving/fish_speech/run_server.sh — Server launch script
  • examples/online_serving/fish_speech/speech_client.py — API client with voice cloning

Test plan

  • Offline inference produces intelligible speech
  • Online serving POST /v1/audio/speech returns valid WAV (44.1kHz, 3.76s)
  • Streaming mode returns PCM chunks (254KB)
  • Voice cloning with ref_audio + ref_text produces cloned voice output (2.83s)
  • Multi-request batching
  • Long-form text synthesis
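The "valid WAV (44.1kHz, ...)" check in the test plan can be reproduced with the standard-library `wave` module. This is a sketch that assumes the server returns a PCM-encoded WAV with a correct header; `wav_info` and `make_silence` are hypothetical helper names, and the silent clip only stands in for real server output.

```python
import io
import wave


def wav_info(data: bytes) -> tuple[int, int, float]:
    """Return (sample_rate_hz, channels, duration_seconds) for WAV bytes."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate


def make_silence(sample_rate: int, seconds: float) -> bytes:
    """Build a mono 16-bit silent WAV in memory, used here as stand-in data."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * int(sample_rate * seconds))
    return buf.getvalue()
```

For a file saved from the endpoint, `wav_info(open("output.wav", "rb").read())` should report a sample rate of 44100 if the output matches the PR description.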

Usage

# Start server
vllm-omni serve fishaudio/s2-pro \
    --stage-configs-path vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml \
    --omni --trust-remote-code --enforce-eager

# Basic TTS
curl -X POST http://localhost:8091/v1/audio/speech \
    -H "Content-Type: application/json" \
    -d '{"model":"fishaudio/s2-pro","input":"Hello world","voice":"default"}' \
    --output output.wav

# Voice cloning
python examples/online_serving/fish_speech/speech_client.py \
    --text "Hello world" --ref-audio ref.wav --ref-text "Reference transcript"
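The basic TTS curl call above can also be issued from Python with only the standard library. This sketch reuses the endpoint, port, and JSON fields from the curl example; the function names are illustrative and it is not the `speech_client.py` shipped with the PR.

```python
import json
import urllib.request


def build_speech_request(
    text: str,
    base_url: str = "http://localhost:8091",
    model: str = "fishaudio/s2-pro",
    voice: str = "default",
) -> urllib.request.Request:
    """Build the POST request for /v1/audio/speech without sending it."""
    payload = {"model": model, "input": text, "voice": voice}
    return urllib.request.Request(
        url=f"{base_url}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def synthesize(text: str, out_path: str = "output.wav") -> None:
    """Send the request and write the response body to disk."""
    with urllib.request.urlopen(build_speech_request(text)) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

Calling `synthesize("Hello world")` requires the server from the first command to be running.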


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 444cdb7efb

Comment on lines +359 to +362
semantic_code = semantic_token_id.reshape(bsz) - semantic_begin

all_codes = torch.empty(bsz, num_cb, dtype=torch.long, device=device)
all_codes[:, 0] = semantic_code
P1 Badge Clamp non-semantic token ids before emitting codec codes

When the sampler emits <|im_end|> (which is explicitly allowed by the logits mask), semantic_token_id - semantic_begin_id becomes negative here and is written into codebook-0. That frame is then forwarded downstream as codec input, but DAC decoding expects non-negative code indices, so end-of-sequence steps can produce invalid codes and corrupted/failing decode at request tail. Please map out-of-range semantic ids to a safe pad value before filling all_codes.
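The requested mapping can be sketched in plain Python as follows. The helper name, the example ids, and the codebook size are hypothetical; the pad value of 0 matches what the follow-up commit message in this PR describes. In tensor code the same idea would typically be expressed with a masked select rather than a loop.

```python
def semantic_codes(
    token_ids: list[int],
    semantic_begin: int,
    codebook_size: int,
    pad_code: int = 0,
) -> list[int]:
    """Map sampled token ids to codebook-0 indices, replacing out-of-range ones."""
    codes = []
    for token_id in token_ids:
        code = token_id - semantic_begin
        # Control tokens such as <|im_end|> fall outside [0, codebook_size)
        # and would otherwise become negative or oversized codec indices.
        codes.append(code if 0 <= code < codebook_size else pad_code)
    return codes
```

With `semantic_begin=1000` and `codebook_size=4096` (made-up values), an id of 7 standing in for `<|im_end|>` maps to the pad code instead of -993.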

Comment on lines +324 to +325
with torch.cuda.amp.autocast(dtype=torch.float32):
    wav, audio_lengths = self._codec.decode(codes_bqf, feature_lengths)
P1 Badge Drop unsupported float32 autocast in DAC decode path

This decode block enters CUDA autocast with dtype=torch.float32, but CUDA AMP only supports reduced-precision autocast dtypes (fp16/bf16). In CUDA deployments this can raise at runtime when decoding audio, turning normal synthesis requests into failures. If full precision is desired, remove autocast (or disable it) instead of requesting fp32 autocast.

Comment on lines +791 to +794
additional_information: dict[str, Any] = {
    "text": [request.input],
    "max_new_tokens": [request.max_new_tokens or 4096],
}
P2 Badge Enforce Fish Speech max_new_tokens in actual sampling

max_new_tokens is stored in additional_information here, but the Fish Speech generation path still uses stage default sampling (max_tokens: 200) and no Fish model code reads this field, so caller-specified output length is silently ignored. This makes API behavior inconsistent (requests that ask for shorter/longer generations do not take effect).
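One way to make the caller's value take effect is to resolve it against the stage default before sampling parameters are built. The sketch below uses plain dictionaries and hypothetical helper names; it is not the code the PR later added, and falling back to the stage default when the caller sends nothing is a design assumption.

```python
from typing import Optional

STAGE_DEFAULT_MAX_TOKENS = 200  # stage default cited in the review comment


def resolve_max_tokens(
    requested: Optional[int],
    stage_default: int = STAGE_DEFAULT_MAX_TOKENS,
) -> int:
    """Prefer the caller's max_new_tokens; otherwise keep the stage default."""
    if requested is None:
        return stage_default
    if requested <= 0:
        raise ValueError("max_new_tokens must be positive")
    return requested


def apply_override(stage_sampling: dict, requested: Optional[int]) -> dict:
    """Return a copy of the stage sampling settings with max_tokens resolved."""
    updated = dict(stage_sampling)
    default = stage_sampling.get("max_tokens", STAGE_DEFAULT_MAX_TOKENS)
    updated["max_tokens"] = resolve_max_tokens(requested, default)
    return updated
```

Returning a copy avoids mutating the shared stage configuration, so one request's limit cannot leak into the next.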

Collaborator

@lishunyang12 lishunyang12 left a comment

Left a couple of comments, mainly around the duplicated DAC codec construction and the resampling quality.

Comment thread vllm_omni/model_executor/models/fish_speech/fish_speech_dac_decoder.py Outdated
Comment thread vllm_omni/model_executor/models/fish_speech/dac_encoder.py
@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 12, 2026
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment

Summary

Adds Fish Speech S2 Pro model support with:

  • Dual-AR architecture (4B Slow AR + Fast AR + DAC decoder)
  • Online serving via /v1/audio/speech with streaming
  • Voice cloning via DAC-encoded reference audio
  • Comprehensive docs and examples

Validated

  • ✅ DCO signed
  • ✅ All CI checks passed
  • ✅ Offline inference, online serving, streaming mode, and voice cloning tested per PR description
  • ✅ Docs updated (supported_models.md, speech_api.md, examples)
  • ✅ Stage config with async chunk streaming

Scope

24 files with clean model structure:

  • Model files: slow_ar, fast_ar, dac_decoder, dac_encoder, configuration
  • Online serving: prompt builder with voice cloning support
  • Examples: offline inference, online serving, gradio demo
  • Tests mentioned but not in diff (assume tested locally)

Comprehensive new model integration.

@hsliuustc0106
Collaborator

any inference speed result?

@linyueqian
Collaborator Author

any inference speed result?

RTF is about 0.52 and TTFP is 131 ms. @Sy0307 is working on several optimizations in a subsequent PR.
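For context, the real-time factor (RTF) is synthesis time divided by the duration of the audio produced, so values below 1 mean faster than real time. TTFP is presumably time to first packet, which is separate from RTF. A small sketch of the arithmetic, with illustrative function names:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time_for(audio_seconds: float, rtf: float) -> float:
    """Estimate generation time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf
```

At the reported RTF of 0.52, a 10-second clip would take roughly 5.2 seconds to generate, assuming the factor holds across clip lengths.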

Implements the Dual-AR TTS pipeline for Fish Speech S2 Pro with two stages:
- Stage 0: Slow AR (Qwen3-based text model) generates semantic tokens with
  Fast AR codebook predictor for residual codes
- Stage 1: DAC decoder converts codec indices to 44.1kHz audio waveform

Key implementation details:
- Interleaved (GPT-J) RoPE style matching Fish Speech training
- Codebook embedding normalization by sqrt(num_codebooks + 1)
- DAC hop length of 2048 (512 decoder * 4 quantizer upsample)
- Async chunk streaming with left-context overlap for smooth audio
- Semantic token masking (only semantic range + im_end allowed)
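The hop-length figure above fixes how many audio samples each codec frame produces. A short sketch of that arithmetic, using the 44.1 kHz rate and the 512 * 4 factorization stated in this commit message; the function names are illustrative.

```python
SAMPLE_RATE_HZ = 44_100
HOP_LENGTH = 512 * 4  # decoder upsampling * quantizer upsampling = 2048


def frames_to_samples(num_frames: int) -> int:
    """Number of waveform samples produced by a run of codec frames."""
    return num_frames * HOP_LENGTH


def frames_to_seconds(num_frames: int) -> float:
    """Audio duration, in seconds, of a run of codec frames."""
    return frames_to_samples(num_frames) / SAMPLE_RATE_HZ


def frame_rate_hz() -> float:
    """Codec frames generated per second of audio."""
    return SAMPLE_RATE_HZ / HOP_LENGTH
```

Each frame therefore covers about 46 ms of audio, or roughly 21.5 frames per second. As a rough cross-check, 81 frames come to about 3.76 s, which is consistent with the clip length in the test plan, although the PR does not state the frame count.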

Signed-off-by: linyueqian <linyueqian@outlook.com>
- Add fish_speech_slow_ar to TTS model stages in serving_speech.py
- Build Fish Speech prompts with chat template and <|voice|> token
- Support voice cloning via ref_audio + ref_text (DAC-encodes reference
  audio to semantic tokens on CPU and prepends as system message)
- Add DAC encoder utility for reference audio encoding
- Add server launch script and client example

Signed-off-by: linyueqian <linyueqian@outlook.com>
…kens

- Clamp semantic token IDs to valid codebook range in Fast AR; im_end
  or other non-semantic tokens now map to 0 instead of going negative
- Replace unsupported float32 autocast with autocast(enabled=False)
  in DAC decoder to avoid CUDA AMP runtime errors
- Override Stage-0 max_tokens from caller-specified max_new_tokens
  so Fish Speech API requests respect output length parameter

Signed-off-by: linyueqian <linyueqian@outlook.com>
- Interactive web UI with text input and voice cloning support
- Streaming (progressive PCM) and non-streaming modes
- Voice cloning via audio upload/URL + transcript
- Combined server + demo launch script (run_gradio_demo.sh)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the feature/fish-speech-s2-pro branch from 792076b to c69aaff Compare March 12, 2026 17:00
Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian added the ready label to trigger buildkite CI label Mar 12, 2026
@linyueqian linyueqian merged commit 366b336 into vllm-project:main Mar 12, 2026
6 of 7 checks passed
Fishermanykx pushed a commit to Fishermanykx/vllm-omni that referenced this pull request Mar 13, 2026
…llm-project#1798)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: KexiongYu <yukexiong1@huawei.com>
@univa-HARRY

univa-HARRY commented Mar 16, 2026

Does this not work on the vllm/vllm-openai:v0.16.0-based image?
A language_model_only Pydantic error occurs.

Process SpawnProcess-1:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/app/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 1234, in _stage_worker_async_entry
    asyncio.run(_stage_worker_async(model, stage_payload, in_q, out_q, batch_timeout, stage_init_timeout))
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 1368, in _stage_worker_async
    vllm_config = omni_engine_args.create_engine_config(usage_context=usage_context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1410, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/vllm-omni/vllm_omni/engine/arg_utils.py", line 304, in create_model_config
    omni_config = OmniModelConfig(
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 141, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
pydantic_core._pydantic_core.ValidationError: 1 validation error for OmniModelConfig
language_model_only
  Unexpected keyword argument [type=unexpected_keyword_argument, input_value=False, input_type=bool]
    For further information visit https://errors.pydantic.dev/2.9/v/unexpected_keyword_argument
The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
Process SpawnProcess-2:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/app/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 1234, in _stage_worker_async_entry
    asyncio.run(_stage_worker_async(model, stage_payload, in_q, out_q, batch_timeout, stage_init_timeout))
  File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
    return runner.run(main)
           ^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
    return self._loop.run_until_complete(task)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/asyncio/base_events.py", line 691, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/app/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 1368, in _stage_worker_async
    vllm_config = omni_engine_args.create_engine_config(usage_context=usage_context)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1410, in create_engine_config
    model_config = self.create_model_config()
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/app/vllm-omni/vllm_omni/engine/arg_utils.py", line 304, in create_model_config
    omni_config = OmniModelConfig(
                  ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 141, in __init__
    s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
pydantic_core._pydantic_core.ValidationError: 1 validation error for OmniModelConfig
language_model_only
  Unexpected keyword argument [type=unexpected_keyword_argument, input_value=False, input_type=bool]
    For further information visit https://errors.pydantic.dev/2.9/v/unexpected_keyword_argument
(APIServer pid=7) Traceback (most recent call last):
(APIServer pid=7)   File "/usr/local/bin/vllm-omni", line 10, in <module>
(APIServer pid=7)     sys.exit(main())
(APIServer pid=7)              ^^^^^^
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/cli/main.py", line 53, in main
(APIServer pid=7)     args.dispatch_function(args)
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/cli/serve.py", line 80, in cmd
(APIServer pid=7)     uvloop.run(omni_run_server(args))
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=7)     return __asyncio.run(
(APIServer pid=7)            ^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=7)     return runner.run(main)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=7)     return self._loop.run_until_complete(task)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=7)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=7)     return await main
(APIServer pid=7)            ^^^^^^^^^^
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/openai/api_server.py", line 232, in omni_run_server
(APIServer pid=7)     await omni_run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/openai/api_server.py", line 250, in omni_run_server_worker
(APIServer pid=7)     async with build_async_omni(
(APIServer pid=7)                ^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=7)     return await anext(self.gen)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/openai/api_server.py", line 354, in build_async_omni
(APIServer pid=7)     async with build_async_omni_from_stage_config(
(APIServer pid=7)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=7)     return await anext(self.gen)
(APIServer pid=7)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/openai/api_server.py", line 397, in build_async_omni_from_stage_config
(APIServer pid=7)     async_omni = AsyncOmni(model=args.model, **kwargs)
(APIServer pid=7)                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 132, in __init__
(APIServer pid=7)     super().__init__(model, **kwargs)
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/omni.py", line 196, in __init__
(APIServer pid=7)     self._initialize_stages(model, kwargs)
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/omni.py", line 375, in _initialize_stages
(APIServer pid=7)     self._wait_for_stages_ready(timeout=init_timeout)
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 256, in _wait_for_stages_ready
(APIServer pid=7)     super()._wait_for_stages_ready(timeout)
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/omni.py", line 548, in _wait_for_stages_ready
(APIServer pid=7)     if result := stage.try_collect():
(APIServer pid=7)                  ^^^^^^^^^^^^^^^^^^^
(APIServer pid=7)   File "/app/vllm-omni/vllm_omni/entrypoints/omni_stage.py", line 714, in try_collect
(APIServer pid=7)     raise RuntimeError(f"OmniStage Worker process died unexpectedly with exit code {self._proc.exitcode}")
(APIServer pid=7) RuntimeError: OmniStage Worker process died unexpectedly with exit code 1

@mru4913

mru4913 commented Mar 17, 2026

What's the release date? (for docker deployment)

@linyueqian
Collaborator Author

Does this not work on the vllm/vllm-openai:v0.16.0-based image? A language_model_only Pydantic error occurs.


You need to use 0.17.0; the v0.16.0-based image will not work.
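A deployment script could check the installed version up front rather than failing inside a stage worker. The sketch below compares only the leading numeric release, so a pre-release such as 0.17.0rc2 is treated as 0.17.0; the helper names are hypothetical and the default of 0.17.0 comes from the reply above.

```python
import re


def release_tuple(version: str) -> tuple[int, ...]:
    """Extract the leading numeric release, e.g. '0.17.0rc2.dev1' -> (0, 17, 0)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(installed: str, minimum: str = "0.17.0") -> bool:
    """True if the installed release is at least the minimum release."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    # Pad with zeros so that '0.17' compares equal to '0.17.0'.
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b
```

The installed version string can be obtained with `importlib.metadata.version("vllm")` and passed to `meets_minimum`.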

@linyueqian
Collaborator Author

What's the release date? (for docker deployment)

Please refer to https://docs.google.com/document/d/1OY_11S0FdOzY5txBdLrtPp4NleEhr7S3OE1YnpOWu9w/ (the next formal release date is Mar 27).

yiliu30 pushed a commit to yiliu30/vllm-omni-fork that referenced this pull request Mar 20, 2026
…llm-project#1798)

Signed-off-by: linyueqian <linyueqian@outlook.com>

Signed-off-by: yiliu30 <yi4.liu@intel.com>
@Morris-Lucifer

Subject: Error BadRequestError: This model does not support generation when serving Fish Speech S2 Pro

Hi @linyueqian ,

I am testing the newly merged Fish Speech S2 Pro support using vllm-omni (version 0.17.0rc2.dev121+gb33d7637d). I encountered a 400 BadRequestError during online serving.

Server Start Command:

vllm serve /path/to/s2-pro \
  --served-model-name s2-pro \
  --stage-configs-path vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml \
  --omni --trust-remote-code --enforce-eager

Request Payload:
I am using the speech_client.py provided in the examples.

Error Log:

(APIServer pid=5191) INFO 03-20 21:36:47 [serving_speech.py:992] TTS speech request speech-xxx: text='...', model=s2-pro
(APIServer pid=5191) INFO: 172.22.16.1:58043 - "POST /v1/audio/speech HTTP/1.1" 400 Bad Request
{"error":{"message":"This model does not support generation","type":"BadRequestError","param":null,"code":400}}

I have already ensured model_type is fish_qwen3_omni, but the error persists.
Is there any specific requirement for the config.json or additional flags needed to enable the generation path for this dual-AR architecture?

Thanks!

@linyueqian
Collaborator Author

Can you try to add is_comprehension=True in your config and see if that works?

@BadDeveloper2022

v0.18.0rc1 bug:

WARNING 03-23 11:37:05 [serving_speech.py:331] Failed to estimate TTS prompt length, using fallback 2048: 'FishSpeechConfig' object has no attribute 'talker_config'
(APIServer pid=21972) INFO 03-23 11:37:05 [serving_speech.py:992] TTS speech request speech-aded8318469b416f: text='Hello ,Good Morning', model=Base
(APIServer pid=21972) INFO: 127.0.0.1:26695 - "POST /v1/audio/speech HTTP/1.1" 200 OK
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] Streaming speech generation failed for speech-aded8318469b416f: This model does not support generation
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] Traceback (most recent call last):
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/entrypoints/openai/serving_speech.py", line 695, in _generate_audio_chunks
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] async for res in generator:
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 193, in generate
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] await self.engine.add_request_async(
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 952, in add_request_async
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] self.add_request(
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 923, in add_request
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] msg = self._build_add_request_message(
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 638, in _build_add_request_message
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] request = self.input_processor.process_inputs(
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/vllm/v1/engine/input_processor.py", line 201, in process_inputs
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] self._validate_params(params, supported_tasks)
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/vllm/v1/engine/input_processor.py", line 94, in _validate_params
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] raise ValueError("This model does not support generation")
(APIServer pid=21972) ERROR 03-23 11:37:05 [serving_speech.py:749] ValueError: This model does not support generation
(APIServer pid=21972) ERROR: Exception in ASGI application
(APIServer pid=21972) + Exception Group Traceback (most recent call last):
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/_utils.py", line 81, in collapse_excgroups
(APIServer pid=21972) | yield
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/responses.py", line 270, in __call__
(APIServer pid=21972) | async with anyio.create_task_group() as task_group:
(APIServer pid=21972) | ^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/anyio/_backends/_asyncio.py", line 783, in __aexit__
(APIServer pid=21972) | raise BaseExceptionGroup(
(APIServer pid=21972) | ExceptionGroup: unhandled errors in a TaskGroup (1 sub-exception)
(APIServer pid=21972) +-+---------------- 1 ----------------
(APIServer pid=21972) | Traceback (most recent call last):
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=21972) | result = await app( # type: ignore[func-returns-value]
(APIServer pid=21972) | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
(APIServer pid=21972) | return await self.app(scope, receive, send)
(APIServer pid=21972) | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/fastapi/applications.py", line 1160, in __call__
(APIServer pid=21972) | await super().__call__(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/applications.py", line 107, in __call__
(APIServer pid=21972) | await self.middleware_stack(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/middleware/errors.py", line 186, in __call__
(APIServer pid=21972) | raise exc
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/middleware/errors.py", line 164, in __call__
(APIServer pid=21972) | await self.app(scope, receive, _send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/middleware/cors.py", line 87, in __call__
(APIServer pid=21972) | await self.app(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 177, in __call__
(APIServer pid=21972) | raise exc
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/prometheus_fastapi_instrumentator/middleware.py", line 175, in __call__
(APIServer pid=21972) | await self.app(scope, receive, send_wrapper)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/middleware/exceptions.py", line 63, in __call__
(APIServer pid=21972) | await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=21972) | raise exc
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=21972) | await app(scope, receive, sender)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/fastapi/middleware/asyncexitstack.py", line 18, in __call__
(APIServer pid=21972) | await self.app(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/routing.py", line 716, in __call__
(APIServer pid=21972) | await self.middleware_stack(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/routing.py", line 736, in app
(APIServer pid=21972) | await route.handle(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/routing.py", line 290, in handle
(APIServer pid=21972) | await self.app(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/fastapi/routing.py", line 119, in app
(APIServer pid=21972) | await wrap_app_handling_exceptions(app, request)(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
(APIServer pid=21972) | raise exc
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/_exception_handler.py", line 42, in wrapped_app
(APIServer pid=21972) | await app(scope, receive, sender)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/fastapi/routing.py", line 106, in app
(APIServer pid=21972) | await response(scope, receive, send)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/responses.py", line 269, in __call__
(APIServer pid=21972) | with collapse_excgroups():
(APIServer pid=21972) | ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/contextlib.py", line 158, in __exit__
(APIServer pid=21972) | self.gen.throw(value)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/_utils.py", line 87, in collapse_excgroups
(APIServer pid=21972) | raise exc
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/responses.py", line 273, in wrap
(APIServer pid=21972) | await func()
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/starlette/responses.py", line 253, in stream_response
(APIServer pid=21972) | async for chunk in self.body_iterator:
(APIServer pid=21972) | File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/entrypoints/openai/serving_speech.py", line 695, in _generate_audio_chunks
(APIServer pid=21972) | async for res in generator:
(APIServer pid=21972) | File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/entrypoints/async_omni.py", line 193, in generate
(APIServer pid=21972) | await self.engine.add_request_async(
(APIServer pid=21972) | File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 952, in add_request_async
(APIServer pid=21972) | self.add_request(
(APIServer pid=21972) | File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 923, in add_request
(APIServer pid=21972) | msg = self._build_add_request_message(
(APIServer pid=21972) | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) | File "/home/ai/minicpm/0.18/vllm-omni/vllm_omni/engine/async_omni_engine.py", line 638, in _build_add_request_message
(APIServer pid=21972) | request = self.input_processor.process_inputs(
(APIServer pid=21972) | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/vllm/v1/engine/input_processor.py", line 201, in process_inputs
(APIServer pid=21972) | self._validate_params(params, supported_tasks)
(APIServer pid=21972) | File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/vllm/v1/engine/input_processor.py", line 94, in _validate_params
(APIServer pid=21972) | raise ValueError("This model does not support generation")
(APIServer pid=21972) | ValueError: This model does not support generation
(APIServer pid=21972) +------------------------------------
(APIServer pid=21972)
(APIServer pid=21972) During handling of the above exception, another exception occurred:
(APIServer pid=21972)
(APIServer pid=21972) Traceback (most recent call last):
(APIServer pid=21972) File "/root/miniconda3/envs/vllm-omni/lib/python3.12/site-packages/uvicorn/protocols/http/httptools_impl.py", line 416, in run_asgi
(APIServer pid=21972) result = await app( # type: ignore[func-returns-value]
(APIServer pid=21972) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
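
In the log above, the `200 OK` line is emitted before the streaming failure. That ordering is consistent with a streaming response that sends its status before it starts iterating the body. A minimal sketch of that ordering using only `asyncio`; `failing_chunks`, `stream_response`, and `run_demo` are hypothetical stand-ins for illustration, not the vllm-omni or Starlette code:

```python
import asyncio


async def failing_chunks():
    # Hypothetical stand-in for the audio chunk generator:
    # it fails before producing the first chunk.
    raise ValueError("This model does not support generation")
    yield b""  # unreachable; its presence makes this an async generator


async def stream_response(body, send):
    # The status (200) is sent before the body iterator is touched.
    await send({"type": "http.response.start", "status": 200})
    async for chunk in body:
        await send({"type": "http.response.body", "body": chunk, "more_body": True})
    await send({"type": "http.response.body", "body": b"", "more_body": False})


async def run_demo():
    events = []

    async def send(message):
        # Record only the message type so the ordering is easy to inspect.
        events.append(message["type"])

    error = None
    try:
        await stream_response(failing_chunks(), send)
    except ValueError as exc:
        error = str(exc)
    return events, error


events, error = asyncio.run(run_demo())
print(events, error)
```

In this sketch only the response start is recorded before the exception propagates, so a client would see a 200 status followed by a broken stream rather than an error status.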

@linyueqian
Collaborator Author

This was fixed in #2058: the stage configs were missing is_comprehension: true, so supported_tasks did not include "generate". Please update to the latest main and try again.
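
For anyone checking a local stage config before updating, a minimal sketch of a sanity check follows. It assumes the YAML has already been parsed into a list of per-stage dicts. Only the `is_comprehension` key comes from the fix described above; the `stage_id` key, the list layout, and the `comprehension_stages` helper are hypothetical and may not match the real schema.

```python
def comprehension_stages(stage_args):
    """Return the ids of stages that explicitly set is_comprehension to true.

    stage_args is assumed to be a list of dicts, one per pipeline stage
    (hypothetical layout; adapt the key names to the real config file).
    """
    return [
        stage.get("stage_id")
        for stage in stage_args
        if stage.get("is_comprehension") is True
    ]


# Hypothetical config as it might look without the flag.
before_fix = [{"stage_id": 0}, {"stage_id": 1}]

# Hypothetical config with the flag set on the first stage.
after_fix = [{"stage_id": 0, "is_comprehension": True}, {"stage_id": 1}]

print(comprehension_stages(before_fix))
print(comprehension_stages(after_fix))
```

An empty result for a local config would suggest it predates the fix, under the assumptions stated above.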

@1615070057

The voice cloning quality feels mediocre: the output sounds like a Westerner speaking Chinese. Are there specific requirements for the reference audio? My test clip is an 18-second MP3.

clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026