Add Fish Speech S2 Pro support with online serving and voice cloning #1798
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 444cdb7efb
    semantic_code = semantic_token_id.reshape(bsz) - semantic_begin

    all_codes = torch.empty(bsz, num_cb, dtype=torch.long, device=device)
    all_codes[:, 0] = semantic_code
Clamp non-semantic token ids before emitting codec codes
When the sampler emits <|im_end|> (which is explicitly allowed by the logits mask), semantic_token_id - semantic_begin_id becomes negative here and is written into codebook-0. That frame is then forwarded downstream as codec input, but DAC decoding expects non-negative code indices, so end-of-sequence steps can produce invalid codes and corrupted/failing decode at request tail. Please map out-of-range semantic ids to a safe pad value before filling all_codes.
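The mapping requested above can be sketched in plain Python. This is an illustration only, not the code that landed in the PR: the function name `to_codebook0`, the constants, and the `codebook_size` bound are assumptions, and the real code path operates on tensors rather than lists.

```python
def to_codebook0(token_ids, semantic_begin, codebook_size, pad_code=0):
    """Map sampled token ids to codebook-0 indices.

    Any id that falls outside [semantic_begin, semantic_begin + codebook_size)
    is replaced by pad_code, so no negative or oversized index reaches the codec.
    """
    codes = []
    for token_id in token_ids:
        code = token_id - semantic_begin
        if 0 <= code < codebook_size:
            codes.append(code)
        else:
            codes.append(pad_code)
    return codes


# Hypothetical values, chosen only for the example.
SEMANTIC_BEGIN = 1000
CODEBOOK_SIZE = 4096
IM_END_ID = 7  # stands in for a non-semantic token such as <|im_end|>

codes = to_codebook0([1000, 1005, IM_END_ID], SEMANTIC_BEGIN, CODEBOOK_SIZE)
print(codes)
```

Without the range check, the last id would yield `7 - 1000 = -993`, which is the negative codebook-0 index the review describes.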
    with torch.cuda.amp.autocast(dtype=torch.float32):
        wav, audio_lengths = self._codec.decode(codes_bqf, feature_lengths)
Drop unsupported float32 autocast in DAC decode path
This decode block enters CUDA autocast with dtype=torch.float32, but CUDA AMP only supports reduced-precision autocast dtypes (fp16/bf16). In CUDA deployments this can raise at runtime when decoding audio, turning normal synthesis requests into failures. If full precision is desired, remove autocast (or disable it) instead of requesting fp32 autocast.
    additional_information: dict[str, Any] = {
        "text": [request.input],
        "max_new_tokens": [request.max_new_tokens or 4096],
    }
Enforce Fish Speech max_new_tokens in actual sampling
max_new_tokens is stored in additional_information here, but the Fish Speech generation path still uses stage default sampling (max_tokens: 200) and no Fish model code reads this field, so caller-specified output length is silently ignored. This makes API behavior inconsistent (requests that ask for shorter/longer generations do not take effect).
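One way to make the caller's limit take effect is to override the stage's sampling default before generation. The sketch below is a hedged illustration: `resolve_max_tokens` and `apply_request_limit` are hypothetical helpers, the sampling parameters are modeled as a plain dict, and falling back to the stage default when the caller sends nothing is an assumption rather than the PR's confirmed behavior.

```python
STAGE_DEFAULT_MAX_TOKENS = 200  # stage default cited in the review comment


def resolve_max_tokens(max_new_tokens, stage_default=STAGE_DEFAULT_MAX_TOKENS):
    """Return the token budget to sample with.

    A caller-supplied value wins; None falls back to the stage default.
    """
    if max_new_tokens is None:
        return stage_default
    if max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be a positive integer")
    return max_new_tokens


def apply_request_limit(stage_sampling_params, max_new_tokens):
    """Return a copy of the stage sampling params with max_tokens overridden."""
    params = dict(stage_sampling_params)  # copy so the shared default is untouched
    params["max_tokens"] = resolve_max_tokens(
        max_new_tokens, params.get("max_tokens", STAGE_DEFAULT_MAX_TOKENS)
    )
    return params


stage_defaults = {"max_tokens": 200, "temperature": 0.8}
overridden = apply_request_limit(stage_defaults, 512)
unchanged = apply_request_limit(stage_defaults, None)
print(overridden["max_tokens"], unchanged["max_tokens"])
```

Copying the dict keeps one request's limit from leaking into the next request that reuses the same stage defaults.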
Force-pushed from ded4c6b to a49fca5.
lishunyang12 left a comment:
Left a couple of comments, mainly around the duplicated DAC codec construction and the resampling quality.
hsliuustc0106 left a comment:
Summary
Adds Fish Speech S2 Pro model support with:
- Dual-AR architecture (4B Slow AR + Fast AR + DAC decoder)
- Online serving via /v1/audio/speech with streaming
- Voice cloning via DAC-encoded reference audio
- Comprehensive docs and examples
Validated
- ✅ DCO signed
- ✅ All CI checks passed
- ✅ Offline inference, online serving, streaming mode, and voice cloning tested per PR description
- ✅ Docs updated (supported_models.md, speech_api.md, examples)
- ✅ Stage config with async chunk streaming
Scope
24 files with clean model structure:
- Model files: slow_ar, fast_ar, dac_decoder, dac_encoder, configuration
- Online serving: prompt builder with voice cloning support
- Examples: offline inference, online serving, gradio demo
- Tests mentioned but not in diff (assume tested locally)
Comprehensive new model integration.
Any inference speed results?
RTF is about 0.52 and TTFP is 131 ms. @Sy0307 is working on several optimizations in subsequent PRs.
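For readers unfamiliar with the metric, here is a short worked example. It assumes the usual definition of RTF (real-time factor) as generation time divided by the duration of the generated audio, and reads TTFP as time to first packet; neither definition is spelled out in the thread. The 3.76 s clip length comes from the PR test plan, and nothing here claims the reported RTF was measured on that clip.

```python
def real_time_factor(generation_seconds, audio_seconds):
    """RTF = wall-clock generation time / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def estimated_generation_seconds(audio_seconds, rtf):
    """Invert the definition: how long generating a clip would take at a given RTF."""
    return audio_seconds * rtf


# At RTF 0.52, a 3.76 s clip would take roughly 1.96 s to generate.
estimate = estimated_generation_seconds(3.76, 0.52)
print(round(estimate, 2))
```

An RTF below 1.0 means audio is produced faster than it plays back, which is what makes streaming playback feasible once the first packet has arrived.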
Implements the Dual-AR TTS pipeline for Fish Speech S2 Pro with two stages:
- Stage 0: Slow AR (Qwen3-based text model) generates semantic tokens with Fast AR codebook predictor for residual codes
- Stage 1: DAC decoder converts codec indices to 44.1kHz audio waveform

Key implementation details:
- Interleaved (GPT-J) RoPE style matching Fish Speech training
- Codebook embedding normalization by sqrt(num_codebooks + 1)
- DAC hop length of 2048 (512 decoder * 4 quantizer upsample)
- Async chunk streaming with left-context overlap for smooth audio
- Semantic token masking (only semantic range + im_end allowed)

Signed-off-by: linyueqian <linyueqian@outlook.com>
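The hop-length arithmetic and the left-context chunking mentioned in this commit message can be sketched as follows. The constants come from the message (44.1kHz, 512 * 4 = 2048); the helper names and the chunking scheme are assumptions for illustration, not the PR's actual streaming code.

```python
SAMPLE_RATE_HZ = 44_100
DECODER_UPSAMPLE = 512
QUANTIZER_UPSAMPLE = 4
HOP_LENGTH = DECODER_UPSAMPLE * QUANTIZER_UPSAMPLE  # 2048 samples per codec frame


def frames_to_samples(num_frames):
    """Number of waveform samples produced by num_frames codec frames."""
    return num_frames * HOP_LENGTH


def frames_to_seconds(num_frames):
    """Audio duration in seconds for num_frames codec frames."""
    return frames_to_samples(num_frames) / SAMPLE_RATE_HZ


def chunk_spans(num_frames, chunk_frames, left_context_frames):
    """Return (decode_start, emit_start, emit_end) frame indices per chunk.

    Each chunk is decoded together with up to left_context_frames of earlier
    frames; the audio for frames before emit_start is then discarded, so only
    new audio is emitted while the decoder still sees some history.
    """
    if chunk_frames <= 0:
        raise ValueError("chunk_frames must be positive")
    spans = []
    for emit_start in range(0, num_frames, chunk_frames):
        emit_end = min(emit_start + chunk_frames, num_frames)
        decode_start = max(0, emit_start - left_context_frames)
        spans.append((decode_start, emit_start, emit_end))
    return spans


frame_rate = SAMPLE_RATE_HZ / HOP_LENGTH  # about 21.5 codec frames per second
spans = chunk_spans(num_frames=10, chunk_frames=4, left_context_frames=2)
print(HOP_LENGTH, frame_rate, spans)
```

With these numbers, every chunk after the first decodes two extra frames of history and would drop the first `2 * HOP_LENGTH` samples of its output before emitting.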
- Add fish_speech_slow_ar to TTS model stages in serving_speech.py
- Build Fish Speech prompts with chat template and <|voice|> token
- Support voice cloning via ref_audio + ref_text (DAC-encodes reference audio to semantic tokens on CPU and prepends as system message)
- Add DAC encoder utility for reference audio encoding
- Add server launch script and client example

Signed-off-by: linyueqian <linyueqian@outlook.com>
…kens
- Clamp semantic token IDs to valid codebook range in Fast AR; im_end or other non-semantic tokens now map to 0 instead of going negative
- Replace unsupported float32 autocast with autocast(enabled=False) in DAC decoder to avoid CUDA AMP runtime errors
- Override Stage-0 max_tokens from caller-specified max_new_tokens so Fish Speech API requests respect output length parameter

Signed-off-by: linyueqian <linyueqian@outlook.com>
- Interactive web UI with text input and voice cloning support
- Streaming (progressive PCM) and non-streaming modes
- Voice cloning via audio upload/URL + transcript
- Combined server + demo launch script (run_gradio_demo.sh)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: linyueqian <linyueqian@outlook.com>
Force-pushed from 792076b to c69aaff.
Signed-off-by: linyueqian <linyueqian@outlook.com>
…llm-project#1798) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: KexiongYu <yukexiong1@huawei.com>
Does this not work on the
What's the release date? (for docker deployment)
Needs to use 0.17.0.
Please refer to https://docs.google.com/document/d/1OY_11S0FdOzY5txBdLrtPp4NleEhr7S3OE1YnpOWu9w/; the next formal release date is Mar 27.
…llm-project#1798) Signed-off-by: linyueqian <linyueqian@outlook.com> Signed-off-by: yiliu30 <yi4.liu@intel.com>
Subject: Error
Hi @linyueqian, I am testing the newly merged Fish Speech S2 Pro support using
Server Start Command:
    vllm serve /path/to/s2-pro \
        --served-model-name s2-pro \
        --stage-configs-path vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml \
        --omni --trust-remote-code --enforce-eager
Request Payload:
Error Log:
I have already ensured
Thanks!
Can you try to add is_comprehension=True in your config and see if that works? |
v0.18.0rc1 bug:
WARNING 03-23 11:37:05 [serving_speech.py:331] Failed to estimate TTS prompt length, using fallback 2048: 'FishSpeechConfig' object has no attribute 'talker_config'
This was fixed in #2058: the stage configs were missing
The cloning effect feels mediocre, sounding like a Westerner speaking Chinese. Are there specific requirements for the cloning audio? My test audio is 18 seconds in MP3 format. |
…llm-project#1798) Signed-off-by: linyueqian <linyueqian@outlook.com>
Summary
- Fish Speech S2 Pro (fishaudio/s2-pro) model support with dual-AR architecture (4B Slow AR + Fast AR + DAC decoder)
- Online serving via the /v1/audio/speech endpoint with streaming support

Changes
Model files (vllm_omni/model_executor/models/fish_speech/)
- fish_speech_slow_ar.py: Slow AR decoder with RoPE fix and sqrt normalization
- fish_speech_fast_ar.py: Fast AR decoder with interleaved RoPE
- fish_speech_dac_decoder.py: DAC codec decoder (44.1kHz output)
- dac_encoder.py: CPU-based DAC encoder for voice cloning reference audio
- configuration_fish_speech.py: Model config (fish_qwen3_omni)

Online serving
- serving_speech.py with voice cloning support

Stage config & input processors
- fish_speech_s2_pro.yaml: Two-stage pipeline with async chunk streaming
- fish_speech.py: Stage input processor for Slow AR → DAC decoder

Examples
- examples/offline_inference/fish_speech/end2end.py: Offline inference
- examples/online_serving/fish_speech/run_server.sh: Server launch script
- examples/online_serving/fish_speech/speech_client.py: API client with voice cloning

Test plan
- POST /v1/audio/speech returns valid WAV (44.1kHz, 3.76s)
- ref_audio + ref_text produces cloned voice output (2.83s)

Usage
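A minimal sketch of building a request for the endpoint, using only the standard library. The field names `input`, `max_new_tokens`, `ref_audio`, and `ref_text` appear elsewhere in this PR; the `model` field, the `localhost:8000` base URL, the helper name `build_speech_request`, and the rule that both reference fields must be supplied together are assumptions. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_speech_request(base_url, text, model="s2-pro",
                         ref_audio=None, ref_text=None, max_new_tokens=None):
    """Build (but do not send) a POST request for /v1/audio/speech."""
    # Assumption: voice cloning needs the reference audio and its transcript together.
    if (ref_audio is None) != (ref_text is None):
        raise ValueError("ref_audio and ref_text must be provided together")

    payload = {"model": model, "input": text}
    if ref_audio is not None:
        payload["ref_audio"] = ref_audio
        payload["ref_text"] = ref_text
    if max_new_tokens is not None:
        payload["max_new_tokens"] = max_new_tokens

    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_speech_request("http://localhost:8000", "Hello from Fish Speech.")
print(request.get_method(), request.full_url)
```

Against a running server, passing the object to `urllib.request.urlopen(request)` and writing the response bytes to a `.wav` file would complete the round trip; the examples/online_serving/fish_speech/speech_client.py script listed above is the reference client.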