[Perf] Improve Fish Speech S2 Pro inference performance by Sy0307 · Pull Request #1859 · vllm-project/vllm-omni

Sy0307 · 2026-03-12T20:00:11Z

Purpose

This PR improves Fish Speech S2 Pro inference performance and builds on #1798. It is waiting for #1798 to merge.

The current diff is shown against main, so it also contains the full Fish Speech S2 Pro integration stack. However, the main intent of this PR is the performance work on top of that support.

The main optimizations are:

batch DAC decode in Stage-1 instead of decoding each request separately
keep last_slow_ar_hidden GPU-resident across decode steps to avoid the per-step GPU -> CPU -> GPU round-trip
replace connector busy-spin polling with threading.Condition() wakeups plus bounded backoff
store async chunk frames as tensors and pack them with tensor ops instead of repeated .cpu().tolist() / torch.tensor(list) reconstruction
keep the guarded Fast AR compiled path for the single-request fast path
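
As an illustration of the connector change in the list above, here is a minimal sketch of replacing a busy-spin poll with threading.Condition wakeups and a bounded backoff. ChunkMailbox, its method names, and the wait bounds are hypothetical and are not taken from the vllm-omni connector code.

```python
import threading
import time
from collections import deque


class ChunkMailbox:
    """Hypothetical stand-in for a connector queue; not the vllm-omni class."""

    def __init__(self, min_wait=0.001, max_wait=0.05):
        self._items = deque()
        self._cond = threading.Condition()
        self._min_wait = min_wait
        self._max_wait = max_wait

    def put(self, item):
        with self._cond:
            self._items.append(item)
            self._cond.notify()  # wake one waiting consumer right away

    def get(self, timeout=1.0):
        """Block until an item arrives; return None after `timeout` seconds."""
        deadline = time.monotonic() + timeout
        wait = self._min_wait
        with self._cond:
            while not self._items:
                remaining = deadline - time.monotonic()
                if remaining <= 0:
                    return None
                # Sleep on the condition instead of spinning. The wait doubles
                # on each miss but never exceeds max_wait (bounded backoff).
                self._cond.wait(min(wait, remaining))
                wait = min(wait * 2, self._max_wait)
            return self._items.popleft()


mailbox = ChunkMailbox()
producer = threading.Timer(0.02, mailbox.put, args=("chunk-0",))
producer.start()
first = mailbox.get(timeout=1.0)    # woken by notify() once the chunk lands
producer.join()
empty = mailbox.get(timeout=0.01)   # nothing queued, so this times out
```

The consumer releases the lock while it waits, so the producer is never blocked by polling, and the capped backoff keeps the worst-case wakeup delay small even if a notification is missed.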

In practice, these changes reduce:

repeated small DAC decode launches under concurrency
per-step host/device copies in the Slow AR loop
connector-side CPU polling overhead
Python-heavy chunk packing overhead in the streaming path

This improves both:

single-request latency / RTF
loaded throughput under concurrency

Test Result

Measured on an RTX 5090 with fishaudio/s2-pro.

The comparison below uses the same config before and after the perf changes:

stage0/stage1 max_batch_size=8
max_inflight=8

Single request

Metric	Before perf changes	After perf changes	Delta
RTF	0.446	0.392	-12.0%
TTFP	0.121s	0.114s	-5.8%
Request throughput	0.381 req/s	0.432 req/s	+13.6%

Concurrency = 8

Metric	Before perf changes	After perf changes	Delta
RTF	1.113	0.760	-31.7%
TTFP	0.342s	0.287s	-16.2%
Request throughput	1.204 req/s	1.757 req/s	+46.0%
Audio throughput	7.099 s/s	10.363 s/s	+46.0%

Concurrency = 16

Metric	Before perf changes	After perf changes	Delta
RTF	1.629	1.159	-28.9%
TTFP	3.547s	2.567s	-27.6%
Request throughput	1.251 req/s	1.754 req/s	+40.2%
Audio throughput	7.376 s/s	10.343 s/s	+40.2%
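
For readers checking the tables above, the Delta column appears to be the relative change from the "before" value to the "after" value. That reading is an assumption, and the helper below is hypothetical. Recomputing from the rounded table values can differ from the reported deltas by a few tenths of a percentage point, presumably because the reported figures came from unrounded measurements.

```python
def relative_delta_percent(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# (before, after, reported delta) copied from the tables above.
rows = {
    "c=1 RTF": (0.446, 0.392, -12.0),
    "c=1 request throughput": (0.381, 0.432, 13.6),
    "c=8 RTF": (1.113, 0.760, -31.7),
    "c=8 request throughput": (1.204, 1.757, 46.0),
    "c=16 RTF": (1.629, 1.159, -28.9),
    "c=16 TTFP": (3.547, 2.567, -27.6),
}

recomputed = {name: relative_delta_percent(b, a) for name, (b, a, _) in rows.items()}
for name, (_, _, reported) in rows.items():
    print(f"{name}: recomputed {recomputed[name]:+.1f}%, reported {reported:+.1f}%")
```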

Validation:

transfer adapter unit tests passed
Fish Speech end-to-end benchmark completed for c=1, c=8, and c=16

cc @linyueqian

linyueqian · 2026-03-12T20:02:30Z

@hsliuustc0106 i will merge #1798 first

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: bd6be15da3

chatgpt-codex-connector · 2026-03-12T20:10:03Z

+        if self._is_fish_speech:
+            if not request.input or not request.input.strip():
+                raise ValueError("Input text cannot be empty")
+            ref_audio_data = None
+            if request.ref_audio is not None:


Run Fish Speech request validation before generation

Fish Speech requests skip _validate_tts_request because they return from the if self._is_fish_speech branch before the elif self._is_tts validation path runs. Since OpenAICreateSpeechRequest.max_new_tokens is not range-limited in the schema, values like -1 or very large numbers can pass through and then be written into sampling params, which can cause runtime errors (negative token budget) or excessive generation budgets. Applying the existing validator (or equivalent bounds checks) to Fish Speech requests would prevent this.
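
A minimal sketch of the kind of bounds check suggested here, assuming an inclusive range of 1 to some upper limit. The function name and the 4096 ceiling are hypothetical; the real limits enforced by _validate_tts_request are not shown in this thread.

```python
from typing import Optional

# Hypothetical ceiling for illustration; the real limit is not stated here.
MAX_NEW_TOKENS_LIMIT = 4096


def validate_max_new_tokens(
    value: Optional[int], limit: int = MAX_NEW_TOKENS_LIMIT
) -> Optional[int]:
    """Return `value` if it is None or within [1, limit]; raise ValueError otherwise."""
    if value is None:
        return None  # field omitted; defaults are resolved elsewhere
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError("max_new_tokens must be an integer")
    if value < 1 or value > limit:
        raise ValueError(f"max_new_tokens must be between 1 and {limit}, got {value}")
    return value
```

Calling this at the top of the Fish Speech branch would reject -1 and oversized budgets before they reach the sampling params.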

chatgpt-codex-connector · 2026-03-12T20:10:03Z

+        if self._is_fish_speech and request.max_new_tokens is not None and sampling_params_list:
+            import copy
+
+            sampling_params_list = copy.deepcopy(sampling_params_list)
+            sampling_params_list[0].max_tokens = request.max_new_tokens


Apply Fish Speech default max_new_tokens to sampling params

The Fish Speech prompt builder sets a default max_new_tokens of 4096 in additional_information, but this block only updates stage-0 sampling_params_list[0].max_tokens when the caller explicitly provides request.max_new_tokens. For requests that omit the field, generation falls back to the stage config default (fish_speech_s2_pro.yaml uses max_tokens: 200), causing premature truncation relative to the advertised 4096-token default behavior.
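
One possible shape for the fix, sketched with a stand-in dataclass: resolve the effective budget first (the request value if given, otherwise the 4096 default mentioned above), then write it into a copy of the stage-0 params. StageSamplingParams and apply_max_new_tokens are hypothetical names, and the two-stage list with max_tokens=200 on both stages is only illustrative.

```python
import copy
from dataclasses import dataclass
from typing import List, Optional

# 4096 is the prompt-builder default quoted in the review comment.
FISH_SPEECH_DEFAULT_MAX_NEW_TOKENS = 4096


@dataclass
class StageSamplingParams:
    """Hypothetical stand-in for a per-stage sampling params object."""
    max_tokens: int


def apply_max_new_tokens(
    params_list: List[StageSamplingParams], requested: Optional[int]
) -> List[StageSamplingParams]:
    """Return a copy of `params_list` with the stage-0 budget set explicitly."""
    if not params_list:
        return params_list
    effective = requested if requested is not None else FISH_SPEECH_DEFAULT_MAX_NEW_TOKENS
    updated = copy.deepcopy(params_list)  # leave the shared stage defaults untouched
    updated[0].max_tokens = effective
    return updated


stage_defaults = [StageSamplingParams(max_tokens=200), StageSamplingParams(max_tokens=200)]
omitted = apply_max_new_tokens(stage_defaults, None)   # falls back to 4096
explicit = apply_max_new_tokens(stage_defaults, 512)   # caller's value wins
```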

hsliuustc0106 · 2026-03-13T03:53:15Z

@vllm-omni-reviewer

hsliuustc0106 · 2026-03-13T05:45:52Z

resolve conflicts please

Sy0307 · 2026-03-16T20:07:56Z

resolve conflicts please

Resolved.

Sy0307 · 2026-03-17T07:56:04Z

PTAL @linyueqian @hsliuustc0106

linyueqian

LGTM! Nice perf wins across the board. One bug to fix:

fish_speech_dac_decoder.py — empty valid_codes_qf will crash

If all requests in a batch have invalid/empty codes, valid_codes_qf is empty and valid_codes_qf[0].device (line ~228) raises IndexError. Please add an early return guard:

if not valid_codes_qf:
    return audios, srs

before the feature_lengths = torch.tensor(...) block.
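
To make the failure mode concrete, here is a toy version of a batched decode with the guard in place. Plain lists stand in for tensors, and decode_batch is a hypothetical function, not the code in fish_speech_dac_decoder.py.

```python
from typing import List, Optional


def decode_batch(batch_codes: List[Optional[List[int]]]) -> List[List[float]]:
    """Toy batched decode: one output list per request, empty for invalid input."""
    outputs: List[List[float]] = [[] for _ in batch_codes]
    valid = [(i, codes) for i, codes in enumerate(batch_codes) if codes]

    # Guard: return the empty per-request outputs when nothing is decodable.
    if not valid:
        return outputs

    # Without the guard, this lookup raises IndexError on an all-invalid batch
    # (the analogue of valid_codes_qf[0].device in the comment above).
    _, first_codes = valid[0]
    pad_value = type(first_codes[0])(0)

    max_len = max(len(codes) for _, codes in valid)  # stand-in for feature_lengths
    for i, codes in valid:
        padded = codes + [pad_value] * (max_len - len(codes))
        # Placeholder "decode": scale the codes, then trim the padding again.
        outputs[i] = [c / 10.0 for c in padded][: len(codes)]
    return outputs


all_invalid = decode_batch([None, []])
mixed = decode_batch([[10, 20], None, [30]])
```

Returning one empty result per request, rather than a shorter list, keeps the output aligned with the request order.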

Sy0307 · 2026-03-18T03:33:13Z

        if not valid_codes_qf:
            return OmniOutput(
                text_hidden_states=None,
                multimodal_outputs={
                    "model_outputs": [empty] * num_req,
                    "sr": [sr_tensor] * num_req,
                },
            )

We already have this guard in the code. Is that enough?

linyueqian · 2026-03-18T03:36:55Z

i think so

linyueqian · 2026-03-18T19:17:01Z

resolve conflicts and should be good to go.

linyueqian · 2026-03-20T21:39:42Z

fix pre-commit please

Sy0307 · 2026-03-20T21:42:27Z

fix pre-commit please

Fixed.

Sy0307 requested a review from hsliuustc0106 as a code owner March 12, 2026 20:00

chatgpt-codex-connector Bot reviewed Mar 12, 2026

Sy0307 force-pushed the dev/fish_perf branch 3 times, most recently from cbeec8d to e8f000e Compare March 16, 2026 20:04

linyueqian approved these changes Mar 17, 2026

linyueqian added this to the v0.18.0 milestone Mar 18, 2026

linyueqian mentioned this pull request Mar 18, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

Improve Fish Speech inference performance

c41cdbb

Signed-off-by: sy0307 <sy0307@users.noreply.github.com>

Sy0307 force-pushed the dev/fish_perf branch from 2ed3be9 to 570ff88 Compare March 20, 2026 21:37

linyueqian added the ready label (triggers Buildkite CI) Mar 20, 2026

Tune Fish Speech inference path

aafef1b

Signed-off-by: Sy03 <1370724210@qq.com>

Sy0307 force-pushed the dev/fish_perf branch from 570ff88 to aafef1b Compare March 20, 2026 21:41

linyueqian merged commit 072647e into vllm-project:main Mar 21, 2026
7 of 8 checks passed