
[Feat] Add Fish Speech S2 Pro benchmark workflow and baseline results #2515

Open

zwhzzz0821 wants to merge 9 commits into vllm-project:main from zwhzzz0821:benchmark/fish-speech

Conversation


zwhzzz0821 commented Apr 6, 2026

Purpose

Closes #2432.

This PR adds a complete Fish Speech benchmark workflow under benchmarks/fish-speech/ and provides the
initial baseline comparison between vllm-omni and sglang-omni for fishaudio/s2-pro.

  • adds a dedicated Fish Speech benchmark directory with runnable benchmark entrypoints for both vllm-omni
    and sglang-omni
  • adds reusable benchmark configs for both backends so the setup is reproducible
  • adds shared Fish Speech benchmark utilities and plotting scripts
  • updates the Fish Speech benchmark README with setup, run instructions, and comparison workflow
  • provides baseline benchmark outputs and a reference plot artifact

Overall, this PR turns the Fish Speech benchmark from an ad hoc local experiment into a reproducible
benchmark workflow that can be rerun and compared more easily.

Test Plan

Environment

  • OS: Linux
  • Python: 3.12
  • GPU: NVIDIA H100 PCIe
  • vllm-omni: local source checkout, based on upstream main commit
    ca02351a1ef8aa6397126c60154a80ee06ae3553
  • sglang-omni: local source checkout, commit 3a9accf7fde58e0808c6623ebbe91de87cffdc98

End-to-end benchmark workflow

  1. Start vllm-omni with benchmarks/fish-speech/config/vllm_omni/fish_speech_s2_pro.yaml
  2. Start sglang-omni with benchmarks/fish-speech/config/sglang_omni/s2pro_tts.yaml
  3. Run the Fish Speech benchmark clients against both servers with the same prompt set and concurrency
    settings
  4. Generate comparison plots with benchmarks/fish-speech/plot_results.py
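For concreteness, here is a minimal sketch of what a single streaming benchmark request looks like. The endpoint path, port, and payload shape are assumptions for illustration only; the actual request construction lives in the per-backend bench_fish_server.py scripts under benchmarks/fish-speech/:

```python
import asyncio
import time

import aiohttp

# Assumed endpoint, port, and payload -- illustrative, not the real client code.
URL = "http://localhost:8000/v1/audio/speech"
PAYLOAD = {"model": "fishaudio/s2-pro", "input": "Hello there.", "stream": True}


async def one_request() -> tuple[float, float]:
    """Stream one TTS request, returning (TTFP, E2E) latencies in seconds."""
    start = time.perf_counter()
    ttfp = None
    async with aiohttp.ClientSession() as session:
        async with session.post(URL, json=PAYLOAD) as resp:
            resp.raise_for_status()
            async for _chunk in resp.content.iter_chunked(4096):
                if ttfp is None:
                    # Time to first packet: first audio bytes off the wire.
                    ttfp = time.perf_counter() - start
    return ttfp, time.perf_counter() - start


if __name__ == "__main__":
    print(asyncio.run(one_request()))
```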

Test Result

All static and script checks passed locally.

The benchmark workflow in this PR was run successfully and produced benchmark JSON outputs and plots under
benchmarks/fish-speech/results/.

Baseline results used for the initial comparison:

| Framework | Concurrency | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|
| vllm-omni | 1 | 612.81 | 2779.21 | 0.6011 | 1.6775 |
| vllm-omni | 4 | 1008.74 | 4699.35 | 1.0023 | 3.9226 |
| vllm-omni | 10 | 6473.68 | 9932.86 | 2.1220 | 4.4595 |
| sglang-omni | 1 | 858.69 | 10601.55 | 1.9379 | 0.4731 |
| sglang-omni | 4 | 3229.71 | 31559.87 | 6.0196 | 0.6435 |
| sglang-omni | 10 | 5335.88 | 70247.22 | 13.2144 | 0.6806 |

*(Plot artifact: fish_speech_benchmark comparison chart)*
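For readers skimming the table: RTF here presumably follows the conventional real-time-factor definition for TTS benchmarks (the README added in this PR defines the exact metrics), i.e.

$$\mathrm{RTF} = \frac{\text{wall-clock synthesis time}}{\text{duration of generated audio}}$$

so RTF < 1 means the backend generates audio faster than real time.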

Additional observations from local runs:

  • sglang-omni used noticeably more GPU memory than vllm-omni in this setup, with roughly 20 GB higher
    memory usage during serving.
  • sglang-omni startup was also significantly slower than vllm-omni under the same local environment and
    model setup.

Documentation was also updated in benchmarks/fish-speech/README.md to describe the setup and benchmark
workflow.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/).
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples
    for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update.

Signed-off-by: zwhzzz0821 <2831474076@qq.com>

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6abff8174


Comment thread on benchmarks/fish-speech/plot_results.py (outdated)

```python
def print_comparison_table(all_results: list[list[dict]], labels: list[str]) -> None:
    """Print a markdown-formatted comparison table."""
    concurrencies = sorted(set(result["concurrency"] for result in all_results[0]))
```

P2: Include union of concurrency levels in report table

print_comparison_table builds concurrencies from only all_results[0], so when compared runs have different concurrency sets (for example, one backend fails/OOMs at higher concurrency and omits those rows), the markdown table and improvement section silently drop those missing levels. That hides failed or missing experiment points in the textual summary even though the plot path already handles unioned concurrencies.
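A minimal sketch of one possible fix, taking the union of concurrency levels across all compared runs (illustrative only, not the PR's final code; names mirror the snippet above):

```python
def union_concurrencies(all_results: list[list[dict]]) -> list[int]:
    """Collect every concurrency level seen in any compared run."""
    levels: set[int] = set()
    for run in all_results:
        levels.update(result["concurrency"] for result in run)
    return sorted(levels)
```

Rows missing for a given backend can then be rendered as "n/a" in the table rather than silently dropped.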


Member

ywang96 left a comment


A few things that should be addressed:

  1. Let's remove the benchmark result png from the PR itself. (You can still keep it in the PR description).
  2. Could you update the benchmark result with the corresponding version of vllm-omni and sglang-omni? Generally speaking, showing package/library versions is a minimum requirement in any benchmark results.

Contributor

Sy0307 commented Apr 6, 2026

Totally good. Details to be modified as @ywang96 mentioned. The vllm-omni TTFP at c=10 is quite odd, and I will do some research on it. Also cc @linyueqian.

Also, please make the title say Fish Speech S2 Pro for clarity.

zwhzzz0821 changed the title from "[Feat] Add Fish Speech benchmark workflow and baseline results" to "[Feat] Add Fish Speech S2 Pro benchmark workflow and baseline results" on Apr 6, 2026
@linyueqian linyueqian self-requested a review April 6, 2026 17:54
@linyueqian
Collaborator

Thanks for putting together this benchmark workflow - the reproducible setup is great.

One concern: the sglang-omni numbers here (RTF 1.94, TTFP 859ms at c=1) are significantly worse than their official claims in the Fish Audio S2 technical report - RTF 0.34 and TTFA ~140ms on a single H200 at batch size 1. That's roughly a 5.7x gap in RTF, which seems too large to be explained by the H100 PCIe vs H200 hardware difference alone.

I compared the sglang-omni config in this PR against the official default and most fields match, but the PR adds stream_vocoder_device: cpu to the tts_engine executor args, which is not in the official default config. Could this be causing the vocoder streaming to bottleneck on CPU? It would be worth testing with the minimal official config (just config_cls, model_path, relay_backend) to see if that alone closes some of the gap.

Also, the official benchmarks were run on H200 while this PR uses H100 PCIe - worth noting this hardware difference in the README so readers have the right context when interpreting the comparison.

@zwhzzz0821
Author

@linyueqian Thanks, this is a very good point.

One clarification from the upstream sglang-omni code: stream_vocoder_device: cpu is not a benchmark-only override on my side; it is the effective upstream default when the field
is omitted.

Code-level evidence:

  • The upstream minimal config at sglang-omni/examples/configs/s2pro_tts.yaml is:

    ```yaml
    config_cls: S2ProPipelineConfig
    model_path: fishaudio/s2-pro
    relay_backend: shm
    ```

  • The default S2ProPipelineConfig in sglang_omni/models/fishaudio_s2_pro/config.py does not explicitly set stream_vocoder_device.

  • In sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py, create_sglang_tts_engine_executor(...) defines stream_vocoder_device: str | None = None and then immediately sets:

    ```python
    if stream_vocoder_device is None:
        stream_vocoder_device = "cpu"
    ```
So the original benchmark config was making the upstream default explicit rather than changing it.

I also ran the additional experiment you suggested. I compared:

  1. the upstream minimal config at examples/configs/s2pro_tts.yaml (which implicitly uses stream_vocoder_device=cpu), and
  2. a modified config with stream_vocoder_device=cuda:0.

Results on my H100 PCIe setup:

| Config | Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|---|
| sglang-omni/examples/configs/s2pro_tts.yaml | 1 | 50 | 0 | 886.58 | 10522.89 | 1.9227 | 0.4929 |
| sglang-omni/examples/configs/s2pro_tts.yaml | 4 | 50 | 0 | 2370.26 | 32733.56 | 6.2347 | 0.6188 |
| sglang-omni/examples/configs/s2pro_tts.yaml | 10 | 50 | 0 | 6227.39 | 70959.05 | 13.4166 | 0.6765 |
| Modified config (stream_vocoder_device=cuda:0) | 1 | 15 | 35 | 160.77 | 2237.67 | 0.4343 | 1.3352 |
| Modified config (stream_vocoder_device=cuda:0) | 4 | 10 | 40 | 264.06 | 2628.48 | 1.0334 | 2.5087 |
| Modified config (stream_vocoder_device=cuda:0) | 10 | 4 | 46 | 369.29 | 2021.11 | 1.8446 | 1.1484 |

So forcing the streaming vocoder onto GPU does improve latency/RTF for the requests that complete, but it is not stable on my H100 PCIe setup: I observed partial-request failures
with CUDA out of memory, CUBLAS_STATUS_ALLOC_FAILED, and CUDNN_STATUS_INTERNAL_ERROR in the vocoder path.

```
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/zwh/sglang-omni/sglang_omni/serve/openai_api.py", line 483, in _speech_stream
    async for chunk in client.generate(gen_req, request_id=request_id):
  File "/home/zwh/sglang-omni/sglang_omni/client/client.py", line 58, in generate
    async for msg in self._coordinator.stream(req_id, omni_request):
  File "/home/zwh/sglang-omni/sglang_omni/pipeline/coordinator.py", line 136, in stream
    raise RuntimeError(msg.error or "Unknown error")
RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 65.06 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 77.28 GiB is allocated by PyTorch, with 25.59 MiB allocated in private pools (e.g., CUDA Graphs), and 202.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/zwh/sglang-omni/sglang_omni/serve/openai_api.py", line 483, in _speech_stream
    async for chunk in client.generate(gen_req, request_id=request_id):
  File "/home/zwh/sglang-omni/sglang_omni/client/client.py", line 58, in generate
    async for msg in self._coordinator.stream(req_id, omni_request):
  File "/home/zwh/sglang-omni/sglang_omni/pipeline/coordinator.py", line 136, in stream
    raise RuntimeError(msg.error or "Unknown error")
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```


My interpretation is that the upstream CPU default is likely a practical stability/memory tradeoff rather than an arbitrary benchmark choice. Under the default setting, sglang-omni was already operating with a significantly tighter memory budget than vllm-omni in my setup (roughly 20+ GB higher GPU memory usage), so moving the streaming vocoder to GPU made that pressure worse and pushed the pipeline into OOM / allocator failures. I did not observe the same memory instability with vllm-omni under the same benchmark workflow.

I will also update the README to make the hardware context explicit: these measurements were collected on H100 PCIe, while the Fish Audio technical report numbers were reported on
H200.

Member

ywang96 commented Apr 7, 2026

@zwhzzz0821 Thanks for the explanation and changes!

Just so I understand correctly, I should be able to run the same script on H200 to generate a more representative result for sglang-omni compared to their reported performance. Is that the right understanding?

@linyueqian
Collaborator

@zwhzzz0821 Thanks for the thorough follow-up on the stream_vocoder_device investigation.

I ran an independent reproduction on 8×H20-3e (143 GB per GPU) to further validate the results. The H20 has enough VRAM for both configurations without OOM, so we can compare them fairly on the same hardware.

H20-3e Results (143 GB VRAM)

sglang-omni with stream_vocoder_device=cpu (upstream default):

| Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|
| 1 | 45 | 5 | 7460.1 | 17877.3 | 3.7063 | 0.1593 |
| 4 | 38 | 12 | 9049.5 | 52297.1 | 10.4923 | 0.2014 |
| 10 | 19 | 31 | 14952.4 | 75670.1 | 17.4454 | 0.1344 |

sglang-omni with stream_vocoder_device=cuda:0:

| Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|
| 1 | 50 | 0 | 94.2 | 2286.5 | 0.2608 | 3.7274 |
| 4 | 50 | 0 | 168.5 | 3120.2 | 0.3884 | 8.6121 |
| 10 | 50 | 0 | 228.8 | 3299.5 | 0.5216 | 12.2218 |

vllm-omni (PR config):

| Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|
| 1 | 50 | 0 | 435.9 | 2210.9 | 0.4777 | 2.1058 |
| 4 | 50 | 0 | 694.0 | 3593.5 | 0.7636 | 5.1899 |
| 10 | 50 | 0 | 4692.7 | 7443.5 | 1.6275 | 5.8412 |

GPU Memory at Serving Time

| Framework | VRAM Used |
|---|---|
| sglang-omni | 127.5 GB |
| vllm-omni | 87.3 GB |

Analysis

The stream_vocoder_device setting is the dominant factor here. With the CPU vocoder (upstream default), sglang-omni is severely bottlenecked -- RTF 3.71 at c=1, with request failures even at low concurrency. With the GPU vocoder, sglang-omni reaches RTF 0.26 and TTFP 94ms at c=1, which aligns well with Fish Audio's official claims (RTF 0.34, TTFP ~140ms on H200).

The reason stream_vocoder_device=cpu is the upstream default appears to be a memory constraint: sglang-omni already uses ~127.5 GB just for serving, which leaves no headroom on 80 GB GPUs (H100 PCIe) and even 96 GB GPUs would be tight. On H20 (143 GB), the GPU vocoder fits comfortably with zero failures.

This means the comparison in the PR as currently presented is somewhat asymmetric -- vllm-omni runs its full pipeline on GPU while sglang-omni is bottlenecked by a CPU vocoder. I'd suggest:

  1. Add the GPU vocoder config as an additional benchmark variant, at least for hardware that can support it, so readers see the full picture.
  2. Note prominently that the CPU vocoder default is a memory tradeoff, not a performance-optimal configuration, and that results will differ significantly on GPUs with >128 GB VRAM.
  3. The memory efficiency gap (87.3 GB vs 127.5 GB) is a legitimate advantage for vllm-omni and worth highlighting -- it enables full GPU acceleration on a wider range of hardware.

@zwhzzz0821
Author

@ywang96 That is correct. Using an H200 would provide the necessary VRAM headroom for sglang-omni to run its full GPU pipeline effectively.

@zwhzzz0821
Author

@linyueqian Thanks for the detailed reproduction and analysis.

I have updated the benchmark setup accordingly.

Specifically:

  • I added two sglang-omni config variants:
    • s2pro_tts_upstream.yaml, which mirrors the upstream default/minimal config path
    • s2pro_tts_gpu_vocoder.yaml, which enables stream_vocoder_device=cuda:0 for high-VRAM hardware
  • I also updated the README to explain why these two sglang-omni configs are both included, what tradeoff
    they represent, and why the CPU-vocoder default should be understood as a memory/stability tradeoff rather
    than a performance-optimal configuration.

Author

zwhzzz0821 commented Apr 8, 2026

I ran an additional two-GPU experiment.

For vllm-omni, the result below comes from the code in PR #2520.

The vllm-omni deployment used stage-level placement across two GPUs (not TP):

  • CUDA_VISIBLE_DEVICES=0,1
  • stage-0 (fish_speech_slow_ar) on GPU 0
  • stage-1 (dac_decoder) on GPU 1

The sglang-omni run also used a dual-GPU layout, with the TTS engine on GPU 0 and the streaming/full vocoder path on GPU 1.

Results:

| Framework | Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput | Request Throughput |
|---|---|---|---|---|---|---|---|---|
| sglang-omni | 1 | 50 | 0 | 163.61 | 2259.19 | 0.4360 | 2.2956 | 0.4425 |
| sglang-omni | 4 | 50 | 0 | 265.72 | 3022.26 | 0.5821 | 6.6495 | 1.2819 |
| sglang-omni | 10 | 50 | 0 | 379.37 | 4063.02 | 0.7788 | 12.0722 | 2.3223 |
| vllm-omni (PR #2520) | 1 | 50 | 0 | 633.12 | 2826.97 | 0.6023 | 1.6718 | 0.3537 |
| vllm-omni (PR #2520) | 4 | 50 | 0 | 992.97 | 4256.90 | 0.9266 | 4.2486 | 0.9158 |
| vllm-omni (PR #2520) | 10 | 50 | 0 | 5920.66 | 8940.00 | 1.9641 | 4.8646 | 1.0417 |

vllm-omni config:

```yaml
async_chunk: true
stage_args:
  - stage_id: 0
    stage_type: llm
    is_comprehension: true
    runtime:
      devices: "0"
      max_batch_size: 16
    engine_args:
      max_num_seqs: 4
      model_stage: fish_speech_slow_ar
      model_arch: FishSpeechSlowARForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      enforce_eager: false
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: latent
      gpu_memory_utilization: 0.6
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 3072
      max_model_len: 16384
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.fish_speech.slow_ar_to_dac_decoder_async_chunk
    output_connectors:
      to_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.8
      top_k: 30
      top_p: 0.9
      max_tokens: 2048
      seed: 42
      detokenize: false
      repetition_penalty: 1.0
      stop_token_ids: [151645]

  - stage_id: 1
    stage_type: llm
    runtime:
      devices: "1"
      max_batch_size: 16
    engine_args:
      max_num_seqs: 1
      model_stage: dac_decoder
      model_arch: FishSpeechDACDecoder
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: audio
      gpu_memory_utilization: 0.1
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 8192
      max_model_len: 16384
    engine_input_source: [0]
    final_output: true
    final_output_type: audio
    input_connectors:
      from_stage_0: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      detokenize: true
      repetition_penalty: 1.0

runtime:
  enabled: true
  defaults:
    window_size: -1
    max_inflight: 16

  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536
        codec_streaming: true
        connector_get_sleep_s: 0.01
        connector_get_max_wait_first_chunk: 3000
        connector_get_max_wait: 300
        codec_chunk_frames: 25
        codec_left_context_frames: 25
        initial_codec_chunk_frames: 4

  edges:
    - from: 0
      to: 1
      window_size: -1
```

sglang-omni config:

```yaml
config_cls: S2ProPipelineConfig
model_path: fishaudio/s2-pro
entry_stage: preprocessing
relay_backend: shm
stages:
  - name: preprocessing
    executor:
      factory: sglang_omni.models.fishaudio_s2_pro.pipeline.stages.create_preprocessing_executor
      args: {}
    get_next: sglang_omni.models.fishaudio_s2_pro.pipeline.next_stage.preprocessing_next
    relay:
      slot_size_mb: 512
      credits: 2
      rank: null
      world_size: null
      device: cpu
    num_workers: 1
    stream_to: []
  - name: tts_engine
    executor:
      factory: sglang_omni.models.fishaudio_s2_pro.pipeline.stages.create_sglang_tts_engine_executor
      args:
        device: cuda:0
        max_new_tokens: 2048
        stream_vocoder_device: cuda:1
    get_next: sglang_omni.models.fishaudio_s2_pro.pipeline.next_stage.tts_engine_next
    relay:
      slot_size_mb: 512
      credits: 2
      rank: null
      world_size: null
      device: cuda:0
    num_workers: 1
    stream_to: []
  - name: vocoder
    executor:
      factory: sglang_omni.models.fishaudio_s2_pro.pipeline.stages.create_vocoder_executor
      args:
        device: cuda:1
    get_next: sglang_omni.models.fishaudio_s2_pro.pipeline.next_stage.vocoder_next
    relay:
      slot_size_mb: 512
      credits: 2
      rank: null
      world_size: null
      device: cpu
    num_workers: 1
    stream_to: []
```

cc @Sy0307 @linyueqian

Contributor

Sy0307 commented Apr 8, 2026

sgl-omni has a higher hit rate in the voice-clone scenario because it caches ref_audio. Maybe we should do the same thing; it would be welcomed, as #2561 suggested.
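A minimal sketch of the caching idea, assuming a hypothetical encode_ref_audio helper (not an actual vllm-omni API); keying on a hash of the reference audio bytes lets repeated voice-clone requests skip the DAC encode:

```python
import hashlib
from typing import Any, Callable

_ref_codes_cache: dict[str, Any] = {}


def get_ref_codes(ref_audio: bytes, encode_ref_audio: Callable[[bytes], Any]) -> Any:
    """Return DAC codes for a reference clip, encoding only on first sight."""
    key = hashlib.sha256(ref_audio).hexdigest()
    if key not in _ref_codes_cache:
        # Expensive path: run the DAC encoder once per distinct ref_audio.
        _ref_codes_cache[key] = encode_ref_audio(ref_audio)
    return _ref_codes_cache[key]
```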

@linyueqian
Collaborator

> sgl-omni has a higher hit rate in the voice-clone scenario because it caches ref_audio. Maybe we should do the same thing; it would be welcomed, as #2561 suggested.

Makes sense. I'll see if that helps.

linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 8, 2026
Reuses fish_bench_utils from PR vllm-project#2515 to compare:
  A) Inline ref_audio (no cache, DAC encode every request)
  B) Uploaded voice (cache hits after 1st request)

Reports TTFP/E2E/RTF comparison table.
@zwhzzz0821 zwhzzz0821 requested a review from ywang96 April 9, 2026 08:53
linyueqian pushed a commit to linyueqian/vllm-omni that referenced this pull request Apr 13, 2026
Fold DAC codec decoding into the Slow AR model so AR generation and
audio synthesis run in one vLLM engine process.  Eliminates the second
engine, SharedMemoryConnector, and OmniGenerationScheduler overhead.

New files:
- fish_speech_single_stage.py: subclasses SlowAR, overrides
  make_omni_output to decode audio_codes inline via DAC codec
- fish_speech_s2_pro_single_stage.yaml: single-stage config
- plan/fish_speech_single_stage_analysis.md: analysis doc

Ref: vllm-project#2515 (Fish Speech benchmark baseline)
Collaborator

lishunyang12 left a comment


Review: Fish Speech S2 Pro Benchmark Workflow

Overall this is a well-structured, clearly documented benchmark addition. The shared utility library (fish_bench_utils.py) is a good design choice that keeps the per-backend wrappers thin, and the shell runner script is nicely parameterized. A few observations and suggestions below.

Positives

  • Clean separation between shared infrastructure (fish_bench_utils.py) and backend-specific payload construction. This will make it easy to add future TTS model benchmarks.
  • The SSE vs raw-audio stream auto-detection in send_streaming_request is a nice touch for cross-backend compatibility.
  • The shell script is well-documented with env-var overrides and includes server health probes before running.
  • Good README with hardware caveats, metric definitions, and architecture notes.

Suggestions

  1. Missing __init__.py files: The vllm_omni/ and sglang_omni/ subdirectories under benchmarks/fish-speech/ have no __init__.py. While not strictly needed since the scripts use sys.path.insert, adding empty __init__.py files would make it possible to import these modules cleanly if future tooling or tests need to reference them.

  2. TIMESTAMP variable in run_benchmark.sh is unused: The shell script defines TIMESTAMP (line 23 of the script) and uses it for the plot filename, but the benchmark JSON files get their own timestamp from fish_bench_utils.save_results() using datetime.now(). This means the plot timestamp and the JSON timestamps will differ slightly. Consider either passing the shell-level timestamp into the Python scripts or just accepting the minor drift.

  3. Warmup requests run without concurrency limiting: In run_benchmark() (fish_bench_utils.py), the warmup phase fires num_warmups requests via asyncio.gather without the semaphore. For num_warmups=3 this is fine, but if someone increases it, all warmup requests would hit the server simultaneously. Minor concern, just worth noting (see the sketch after this list).

  4. pcm_bytes_to_duration assumes mono audio: The helper computes duration as num_bytes / sample_width / sample_rate, which is correct for mono. If the model ever outputs stereo, this would silently double the reported duration. Consider adding a channels=1 parameter with a default, or at least a docstring note that it assumes mono (see the sketch after this list).

  5. Hardcoded 44100 sample rate in both bench scripts: Both vllm_omni/bench_fish_server.py and sglang_omni/bench_fish_server.py hardcode SAMPLE_RATE = 44100. If this ever changes per-model or per-config, consider making it a CLI argument with the current value as default, similar to how --request-timeout is exposed.

  6. plot_results.py uses assert for argument validation: The line assert len(args.results) == len(args.labels) will be stripped under optimized Python (python -O). Consider a proper if ... raise SystemExit(...) or parser.error(...) instead (see the sketch after this list).

  7. Shell script check_server for sglang uses /health only: The vllm probe tries /v1/audio/voices first then falls back to /health, which is good. The sglang probe only checks /health. If sglang-omni exposes a speech-specific readiness endpoint, it might be worth probing that too for consistency, though this is minor.

  8. No .gitignore for results/: The results/ directory will accumulate JSON and PNG files from benchmark runs. Consider adding a benchmarks/fish-speech/results/.gitignore with * and !.gitignore to prevent accidental commits of large result artifacts.
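Hedged sketches for suggestions 3, 4, and 6 above (parameter names and defaults are illustrative; the real helpers live in fish_bench_utils.py and plot_results.py):

```python
import asyncio


def pcm_bytes_to_duration(
    num_bytes: int,
    sample_rate: int = 44100,
    sample_width: int = 2,  # bytes per sample (16-bit PCM)
    channels: int = 1,      # explicit mono default; pass 2 for stereo
) -> float:
    """Duration in seconds of a raw PCM buffer (suggestion 4)."""
    return num_bytes / (sample_width * channels * sample_rate)


async def run_warmup(send_fn, num_warmups: int, max_concurrency: int) -> None:
    """Cap warmup concurrency with the same semaphore pattern as the
    measured phase (suggestion 3)."""
    sem = asyncio.Semaphore(max_concurrency)

    async def limited() -> None:
        async with sem:
            await send_fn()

    await asyncio.gather(*(limited() for _ in range(num_warmups)))


# Suggestion 6: validation that survives `python -O`, unlike a bare assert.
# Inside plot_results.py, after args = parser.parse_args():
#     if len(args.results) != len(args.labels):
#         parser.error("--results and --labels must have the same length")
```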

Questions

  • The PR description mentions baseline results were collected but I don't see any JSON result files checked in. Was that intentional? If so, that's good (keeps the repo clean). If baseline results are meant to be checked in for regression tracking, they should probably live in a separate directory or be clearly marked.

Verdict

This is a solid benchmark addition. The suggestions above are mostly minor improvements and do not block merging. Nice work turning an ad-hoc experiment into a reproducible workflow.

@zwhzzz0821
Author

@lishunyang12 Thanks for the detailed review. I have updated the code following your suggestions.

Regarding the baseline results: this was intentional. The new results/.gitignore keeps all generated artifacts out of the repo, which is what we want. If we ever need checked-in baselines for regression tracking, we can move them to a dedicated directory later.

@zwhzzz0821 zwhzzz0821 requested a review from lishunyang12 April 19, 2026 17:17
Collaborator

lishunyang12 left a comment


LGTM, thanks for the thorough follow-up.



Successfully merging this pull request may close these issues.

[Feature]: Establish baseline and profile fish-speech's performance
