[Perf] Improve Fish Speech S2 Pro inference performance #1859
@hsliuustc0106 I will merge #1798 first.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: bd6be15da3
if self._is_fish_speech:
    if not request.input or not request.input.strip():
        raise ValueError("Input text cannot be empty")
    ref_audio_data = None
    if request.ref_audio is not None:
Run Fish Speech request validation before generation
Fish Speech requests skip _validate_tts_request because they return from the if self._is_fish_speech branch before the elif self._is_tts validation path runs. Since OpenAICreateSpeechRequest.max_new_tokens is not range-limited in the schema, values like -1 or very large numbers can pass through and then be written into sampling params, which can cause runtime errors (negative token budget) or excessive generation budgets. Applying the existing validator (or equivalent bounds checks) to Fish Speech requests would prevent this.
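A minimal sketch of the kind of bounds check the comment above asks for. The `SpeechRequest` dataclass is a stand-in for `OpenAICreateSpeechRequest`, and both `validate_fish_speech_request` and the `MAX_NEW_TOKENS_LIMIT` value are hypothetical names chosen for illustration; the project's real validator and limits may differ.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical upper bound, used only for this illustration.
MAX_NEW_TOKENS_LIMIT = 4096


@dataclass
class SpeechRequest:
    """Stand-in for OpenAICreateSpeechRequest (only the fields used here)."""

    input: str
    max_new_tokens: Optional[int] = None


def validate_fish_speech_request(request: SpeechRequest) -> None:
    """Reject empty input and out-of-range max_new_tokens before generation."""
    if not request.input or not request.input.strip():
        raise ValueError("Input text cannot be empty")
    if request.max_new_tokens is not None:
        # Negative or zero budgets fail here instead of inside the sampler,
        # and very large budgets are capped by policy.
        if not 1 <= request.max_new_tokens <= MAX_NEW_TOKENS_LIMIT:
            raise ValueError(
                f"max_new_tokens must be in [1, {MAX_NEW_TOKENS_LIMIT}], "
                f"got {request.max_new_tokens}"
            )
```

Calling a check like this at the top of the Fish Speech branch would surface bad values as a request error rather than a runtime failure during generation.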
if self._is_fish_speech and request.max_new_tokens is not None and sampling_params_list:
    import copy

    sampling_params_list = copy.deepcopy(sampling_params_list)
    sampling_params_list[0].max_tokens = request.max_new_tokens
Apply Fish Speech default max_new_tokens to sampling params
The Fish Speech prompt builder sets a default max_new_tokens of 4096 in additional_information, but this block only updates stage-0 sampling_params_list[0].max_tokens when the caller explicitly provides request.max_new_tokens. For requests that omit the field, generation falls back to the stage config default (fish_speech_s2_pro.yaml uses max_tokens: 200), causing premature truncation relative to the advertised 4096-token default behavior.
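One way the suggestion above could look, as a sketch only. `StageSamplingParams` is a hypothetical stand-in for the real sampling params class, and `resolve_sampling_params` is an illustrative helper name; the 4096 and 200 values are the ones quoted in the comment.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional

# Default named in the review comment for Fish Speech requests.
FISH_SPEECH_DEFAULT_MAX_NEW_TOKENS = 4096


@dataclass
class StageSamplingParams:
    """Hypothetical stand-in for a stage's sampling params."""

    max_tokens: int = 200  # stage config default cited in the review


def resolve_sampling_params(
    params_list: List[StageSamplingParams],
    requested_max_new_tokens: Optional[int],
) -> List[StageSamplingParams]:
    """Return a copy whose stage-0 max_tokens honors the request or the default."""
    if not params_list:
        return params_list
    # Deep-copy so the shared stage defaults are never mutated per request.
    resolved = copy.deepcopy(params_list)
    if requested_max_new_tokens is not None:
        resolved[0].max_tokens = requested_max_new_tokens
    else:
        resolved[0].max_tokens = FISH_SPEECH_DEFAULT_MAX_NEW_TOKENS
    return resolved
```

With this shape, a request that omits `max_new_tokens` gets the 4096 default on stage 0 instead of the stage config's 200, while later stages keep their own settings.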
@vllm-omni-reviewer
resolve conflicts please
cbeec8d to e8f000e
Resolved.
linyueqian left a comment
LGTM! Nice perf wins across the board. One bug to fix:
fish_speech_dac_decoder.py — empty valid_codes_qf will crash
If all requests in a batch have invalid/empty codes, valid_codes_qf is empty and valid_codes_qf[0].device (line ~228) raises IndexError. Please add an early return guard:
if not valid_codes_qf:
    return audios, srs
before the feature_lengths = torch.tensor(...) block.
if not valid_codes_qf:
    return OmniOutput(
        text_hidden_states=None,
        multimodal_outputs={
            "model_outputs": [empty] * num_req,
            "sr": [sr_tensor] * num_req,
        },
    )
We have already set some guards in the code. Is this enough?
I think so.
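The failure mode discussed in this thread can be illustrated without the real decoder. The sketch below uses plain Python lists in place of tensors; `decode_valid_codes` and its placeholder "decode" step are hypothetical and only show why the guard has to run before anything indexes the first valid entry.

```python
from typing import List, Tuple


def decode_valid_codes(
    codes_per_request: List[List[int]], sample_rate: int
) -> Tuple[List[List[float]], List[int]]:
    """Return one (possibly empty) audio list and one sample rate per request."""
    num_req = len(codes_per_request)
    audios: List[List[float]] = [[] for _ in range(num_req)]
    srs = [sample_rate] * num_req

    # Keep only requests that actually produced codes.
    valid = [(i, codes) for i, codes in enumerate(codes_per_request) if codes]

    # Early return guard: without it, valid[0] below raises IndexError
    # when every request in the batch is empty.
    if not valid:
        return audios, srs

    # Analogous to reading valid_codes_qf[0].device in the real decoder.
    _first_len = len(valid[0][1])

    for i, codes in valid:
        audios[i] = [float(c) for c in codes]  # placeholder for real decoding
    return audios, srs
```

The important property is that an all-empty batch still returns one output slot per request, so callers do not need a special case.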
resolve conflicts and should be good to go.
Signed-off-by: sy0307 <sy0307@users.noreply.github.com>
fix pre-commit please
Signed-off-by: Sy03 <1370724210@qq.com>
Fixed.
…#1859) Signed-off-by: sy0307 <sy0307@users.noreply.github.com> Signed-off-by: Sy03 <1370724210@qq.com> Co-authored-by: sy0307 <sy0307@users.noreply.github.com>
Purpose
This PR focuses on Fish Speech S2 Pro inference performance and builds on #1798. It is waiting for #1798 to merge.
The current diff is shown against main, so it also contains the full Fish Speech S2 Pro integration stack. However, the main intent of this PR is the performance work on top of that support.
The main optimizations are:
- Keep last_slow_ar_hidden GPU-resident across decode steps to avoid the per-step GPU -> CPU -> GPU round-trip
- threading.Condition() wakeups plus bounded backoff
- .cpu().tolist() / torch.tensor(list) reconstruction
In practice, these changes reduce:
This improves both:
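The optimization list above mentions threading.Condition() wakeups plus bounded backoff. A minimal sketch of that general pattern follows, assuming a simple producer/consumer hand-off; `ChunkMailbox`, its method names, and the wait values are hypothetical and are not taken from the PR's code.

```python
import threading
import time
from collections import deque
from typing import Any, Deque, Optional


class ChunkMailbox:
    """Hand-off queue: consumers sleep on a Condition instead of busy-polling."""

    def __init__(self, min_wait: float = 0.001, max_wait: float = 0.05) -> None:
        self._cond = threading.Condition()
        self._items: Deque[Any] = deque()
        self._min_wait = min_wait
        self._max_wait = max_wait  # upper bound on a single wait

    def put(self, item: Any) -> None:
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer immediately

    def get(self, timeout: float) -> Optional[Any]:
        """Return the next item, or None if nothing arrives within timeout."""
        deadline = time.monotonic() + timeout
        wait = self._min_wait
        with self._cond:
            while not self._items:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                # Normally notify() ends the wait early; the timeout is a
                # safety net that grows geometrically up to max_wait.
                self._cond.wait(timeout=min(wait, remaining))
                wait = min(wait * 2, self._max_wait)
            return self._items.popleft()
```

The design point is that a producer's `notify()` removes most of the latency of fixed-interval polling, while the capped, doubling timeout keeps idle consumers from spinning.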
Test Result
Measured on RTX 5090 with fishaudio/s2-pro.
The comparison below uses the same config before and after the perf changes:
- stage0/stage1 max_batch_size=8
- max_inflight=8
Single request
Concurrency = 8
Concurrency = 16
Validation:
c=1, c=8, and c=16
cc @linyueqian