
[Feat] Add Fish Speech S2 Pro benchmark workflow and baseline results #2515

Open

zwhzzz0821 wants to merge 9 commits into vllm-project:main from zwhzzz0821:benchmark/fish-speech

Conversation


zwhzzz0821 commented Apr 6, 2026

Purpose

Closes #2432.

This PR adds a complete Fish Speech benchmark workflow under benchmarks/fish-speech/ and provides the
initial baseline comparison between vllm-omni and sglang-omni for fishaudio/s2-pro.

  • adds a dedicated Fish Speech benchmark directory with runnable benchmark entrypoints for both vllm-omni
    and sglang-omni
  • adds reusable benchmark configs for both backends so the setup is reproducible
  • adds shared Fish Speech benchmark utilities and plotting scripts
  • updates the Fish Speech benchmark README with setup, run instructions, and comparison workflow
  • provides baseline benchmark outputs and a reference plot artifact

Overall, this PR turns the Fish Speech benchmark from an ad hoc local experiment into a reproducible
benchmark workflow that can be rerun and compared more easily.

Test Plan

Environment

  • OS: Linux
  • Python: 3.12
  • GPU: NVIDIA H100 PCIe
  • vllm-omni: local source checkout, based on upstream main commit
    ca02351a1ef8aa6397126c60154a80ee06ae3553
  • sglang-omni: local source checkout, commit 3a9accf7fde58e0808c6623ebbe91de87cffdc98

End-to-end benchmark workflow

  1. Start vllm-omni with benchmarks/fish-speech/config/vllm_omni/fish_speech_s2_pro.yaml
  2. Start sglang-omni with benchmarks/fish-speech/config/sglang_omni/s2pro_tts.yaml
  3. Run the Fish Speech benchmark clients against both servers with the same prompt set and concurrency
    settings
  4. Generate comparison plots with benchmarks/fish-speech/plot_results.py
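For concreteness, here is a minimal sketch of what a single streaming benchmark request looks like. The endpoint path, port, and payload shape are assumptions for illustration only; the actual request construction lives in the per-backend bench_fish_server.py scripts under benchmarks/fish-speech/:

```python
import asyncio
import time

import aiohttp

# Assumed endpoint, port, and payload -- illustrative, not the real client code.
URL = "http://localhost:8000/v1/audio/speech"
PAYLOAD = {"model": "fishaudio/s2-pro", "input": "Hello there.", "stream": True}


async def one_request() -> tuple[float, float]:
    """Stream one TTS request, returning (TTFP, E2E) latencies in seconds."""
    start = time.perf_counter()
    ttfp = None
    async with aiohttp.ClientSession() as session:
        async with session.post(URL, json=PAYLOAD) as resp:
            resp.raise_for_status()
            async for _chunk in resp.content.iter_chunked(4096):
                if ttfp is None:
                    # Time to first packet: first audio bytes off the wire.
                    ttfp = time.perf_counter() - start
    return ttfp, time.perf_counter() - start


if __name__ == "__main__":
    print(asyncio.run(one_request()))
```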

Test Result

All static and script checks passed locally.

The benchmark workflow in this PR was run successfully and produced benchmark JSON outputs and plots under
benchmarks/fish-speech/results/.

Baseline results used for the initial comparison:

| Framework | Concurrency | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|
| vllm-omni | 1 | 612.81 | 2779.21 | 0.6011 | 1.6775 |
| vllm-omni | 4 | 1008.74 | 4699.35 | 1.0023 | 3.9226 |
| vllm-omni | 10 | 6473.68 | 9932.86 | 2.1220 | 4.4595 |
| sglang-omni | 1 | 858.69 | 10601.55 | 1.9379 | 0.4731 |
| sglang-omni | 4 | 3229.71 | 31559.87 | 6.0196 | 0.6435 |
| sglang-omni | 10 | 5335.88 | 70247.22 | 13.2144 | 0.6806 |

*(Plot artifact: fish_speech_benchmark comparison chart)*
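For readers skimming the table: RTF here presumably follows the conventional real-time-factor definition for TTS benchmarks (the README added in this PR defines the exact metrics), i.e.

$$\mathrm{RTF} = \frac{\text{wall-clock synthesis time}}{\text{duration of generated audio}}$$

so RTF < 1 means the backend generates audio faster than real time.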

Additional observations from local runs:

  • sglang-omni used noticeably more GPU memory than vllm-omni in this setup, with roughly 20 GB higher
    memory usage during serving.
  • sglang-omni startup was also significantly slower than vllm-omni under the same local environment and
    model setup.

Documentation was also updated in benchmarks/fish-speech/README.md to describe the setup and benchmark
workflow.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/).
  • The test results. Please paste the results comparison before and after, or the e2e results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples
    for a new model. Please run mkdocs serve to sync the documentation editions to ./docs.
  • (Optional) Release notes update.

Signed-off-by: zwhzzz0821 <2831474076@qq.com>

chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6abff8174


Comment thread on benchmarks/fish-speech/plot_results.py (outdated)

```python
def print_comparison_table(all_results: list[list[dict]], labels: list[str]) -> None:
    """Print a markdown-formatted comparison table."""
    concurrencies = sorted(set(result["concurrency"] for result in all_results[0]))
```

P2: Include union of concurrency levels in report table

print_comparison_table builds concurrencies from only all_results[0], so when compared runs have different concurrency sets (for example, one backend fails/OOMs at higher concurrency and omits those rows), the markdown table and improvement section silently drop those missing levels. That hides failed or missing experiment points in the textual summary even though the plot path already handles unioned concurrencies.
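A minimal sketch of one possible fix, taking the union of concurrency levels across all compared runs (illustrative only, not the PR's final code; names mirror the snippet above):

```python
def union_concurrencies(all_results: list[list[dict]]) -> list[int]:
    """Collect every concurrency level seen in any compared run."""
    levels: set[int] = set()
    for run in all_results:
        levels.update(result["concurrency"] for result in run)
    return sorted(levels)
```

Rows missing for a given backend can then be rendered as "n/a" in the table rather than silently dropped.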


Member

ywang96 left a comment


A few things that should be addressed:

  1. Let's remove the benchmark result png from the PR itself. (You can still keep it in the PR description).
  2. Could you update the benchmark result with the corresponding version of vllm-omni and sglang-omni? Generally speaking, showing package/library versions is a minimum requirement in any benchmark results.

Contributor

Sy0307 commented Apr 6, 2026

Totally good. Details to be modified as @ywang96 mentioned. The vllm-omni TTFP at c=10 is quite odd, and I will do some research on it. Also cc @linyueqian.

Also, please make the title say Fish Speech S2 Pro for clarity.

zwhzzz0821 changed the title from "[Feat] Add Fish Speech benchmark workflow and baseline results" to "[Feat] Add Fish Speech S2 Pro benchmark workflow and baseline results" on Apr 6, 2026
@linyueqian linyueqian self-requested a review April 6, 2026 17:54
@linyueqian
Collaborator

Thanks for putting together this benchmark workflow - the reproducible setup is great.

One concern: the sglang-omni numbers here (RTF 1.94, TTFP 859ms at c=1) are significantly worse than their official claims in the Fish Audio S2 technical report - RTF 0.34 and TTFA ~140ms on a single H200 at batch size 1. That's roughly a 5.7x gap in RTF, which seems too large to be explained by the H100 PCIe vs H200 hardware difference alone.

I compared the sglang-omni config in this PR against the official default and most fields match, but the PR adds stream_vocoder_device: cpu to the tts_engine executor args, which is not in the official default config. Could this be causing the vocoder streaming to bottleneck on CPU? It would be worth testing with the minimal official config (just config_cls, model_path, relay_backend) to see if that alone closes some of the gap.

Also, the official benchmarks were run on H200 while this PR uses H100 PCIe - worth noting this hardware difference in the README so readers have the right context when interpreting the comparison.

@zwhzzz0821
Author

@linyueqian Thanks, this is a very good point.

One clarification from the upstream sglang-omni code: stream_vocoder_device: cpu is not a benchmark-only override on my side; it is the effective upstream default when the field
is omitted.

Code-level evidence:

  • The upstream minimal config at sglang-omni/examples/configs/s2pro_tts.yaml is:

    ```yaml
    config_cls: S2ProPipelineConfig
    model_path: fishaudio/s2-pro
    relay_backend: shm
    ```

  • The default S2ProPipelineConfig in sglang_omni/models/fishaudio_s2_pro/config.py does not explicitly set stream_vocoder_device.

  • In sglang_omni/models/fishaudio_s2_pro/pipeline/stages.py, create_sglang_tts_engine_executor(...) defines stream_vocoder_device: str | None = None and then immediately sets:

    ```python
    if stream_vocoder_device is None:
        stream_vocoder_device = "cpu"
    ```
So the original benchmark config was making the upstream default explicit rather than changing it.

I also ran the additional experiment you suggested. I compared:

  1. the upstream minimal config at examples/configs/s2pro_tts.yaml (which implicitly uses stream_vocoder_device=cpu), and
  2. a modified config with stream_vocoder_device=cuda:0.

Results on my H100 PCIe setup:

| Config | Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|---|
| sglang-omni/examples/configs/s2pro_tts.yaml | 1 | 50 | 0 | 886.58 | 10522.89 | 1.9227 | 0.4929 |
| sglang-omni/examples/configs/s2pro_tts.yaml | 4 | 50 | 0 | 2370.26 | 32733.56 | 6.2347 | 0.6188 |
| sglang-omni/examples/configs/s2pro_tts.yaml | 10 | 50 | 0 | 6227.39 | 70959.05 | 13.4166 | 0.6765 |
| Modified config (stream_vocoder_device=cuda:0) | 1 | 15 | 35 | 160.77 | 2237.67 | 0.4343 | 1.3352 |
| Modified config (stream_vocoder_device=cuda:0) | 4 | 10 | 40 | 264.06 | 2628.48 | 1.0334 | 2.5087 |
| Modified config (stream_vocoder_device=cuda:0) | 10 | 4 | 46 | 369.29 | 2021.11 | 1.8446 | 1.1484 |

So forcing the streaming vocoder onto GPU does improve latency/RTF for the requests that complete, but it is not stable on my H100 PCIe setup: I observed partial-request failures
with CUDA out of memory, CUBLAS_STATUS_ALLOC_FAILED, and CUDNN_STATUS_INTERNAL_ERROR in the vocoder path.

```
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/zwh/sglang-omni/sglang_omni/serve/openai_api.py", line 483, in _speech_stream
    async for chunk in client.generate(gen_req, request_id=request_id):
  File "/home/zwh/sglang-omni/sglang_omni/client/client.py", line 58, in generate
    async for msg in self._coordinator.stream(req_id, omni_request):
  File "/home/zwh/sglang-omni/sglang_omni/pipeline/coordinator.py", line 136, in stream
    raise RuntimeError(msg.error or "Unknown error")
RuntimeError: CUDA out of memory. Tried to allocate 84.00 MiB. GPU 0 has a total capacity of 79.19 GiB of which 65.06 MiB is free. Including non-PyTorch memory, this process has 79.12 GiB memory in use. Of the allocated memory 77.28 GiB is allocated by PyTorch, with 25.59 MiB allocated in private pools (e.g., CUDA Graphs), and 202.09 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/zwh/sglang-omni/sglang_omni/serve/openai_api.py", line 483, in _speech_stream
    async for chunk in client.generate(gen_req, request_id=request_id):
  File "/home/zwh/sglang-omni/sglang_omni/client/client.py", line 58, in generate
    async for msg in self._coordinator.stream(req_id, omni_request):
  File "/home/zwh/sglang-omni/sglang_omni/pipeline/coordinator.py", line 136, in stream
    raise RuntimeError(msg.error or "Unknown error")
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)`
```


My interpretation is that the upstream CPU default is likely a practical stability/memory tradeoff rather than an arbitrary benchmark choice. Under the default setting, sglang-omni was already operating with a significantly tighter memory budget than vllm-omni in my setup (roughly 20+ GB higher GPU memory usage), so moving the streaming vocoder to GPU made that pressure worse and pushed the pipeline into OOM / allocator failures. I did not observe the same memory instability with vllm-omni under the same benchmark workflow.

I will also update the README to make the hardware context explicit: these measurements were collected on H100 PCIe, while the Fish Audio technical report numbers were reported on
H200.

Member

ywang96 commented Apr 7, 2026

@zwhzzz0821 Thanks for the explanation and changes!

Just so I understand correctly, I should be able to run the same script on H200 to generate a more representative result for sglang-omni compared to their reported performance. Is that the right understanding?

@linyueqian
Collaborator

@zwhzzz0821 Thanks for the thorough follow-up on the stream_vocoder_device investigation.

I ran an independent reproduction on 8×H20-3e (143 GB per GPU) to further validate the results. The H20 has enough VRAM for both configurations without OOM, so we can compare them fairly on the same hardware.

H20-3e Results (143 GB VRAM)

sglang-omni with stream_vocoder_device=cpu (upstream default):

| Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|
| 1 | 45 | 5 | 7460.1 | 17877.3 | 3.7063 | 0.1593 |
| 4 | 38 | 12 | 9049.5 | 52297.1 | 10.4923 | 0.2014 |
| 10 | 19 | 31 | 14952.4 | 75670.1 | 17.4454 | 0.1344 |

sglang-omni with stream_vocoder_device=cuda:0:

| Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|
| 1 | 50 | 0 | 94.2 | 2286.5 | 0.2608 | 3.7274 |
| 4 | 50 | 0 | 168.5 | 3120.2 | 0.3884 | 8.6121 |
| 10 | 50 | 0 | 228.8 | 3299.5 | 0.5216 | 12.2218 |

vllm-omni (PR config):

| Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput |
|---|---|---|---|---|---|---|
| 1 | 50 | 0 | 435.9 | 2210.9 | 0.4777 | 2.1058 |
| 4 | 50 | 0 | 694.0 | 3593.5 | 0.7636 | 5.1899 |
| 10 | 50 | 0 | 4692.7 | 7443.5 | 1.6275 | 5.8412 |

GPU Memory at Serving Time

| Framework | VRAM Used |
|---|---|
| sglang-omni | 127.5 GB |
| vllm-omni | 87.3 GB |

Analysis

The stream_vocoder_device setting is the dominant factor here. With the CPU vocoder (upstream default), sglang-omni is severely bottlenecked -- RTF 3.71 at c=1, with request failures even at low concurrency. With the GPU vocoder, sglang-omni reaches RTF 0.26 and TTFP 94ms at c=1, which aligns well with Fish Audio's official claims (RTF 0.34, TTFP ~140ms on H200).

The reason stream_vocoder_device=cpu is the upstream default appears to be a memory constraint: sglang-omni already uses ~127.5 GB just for serving, which leaves no headroom on 80 GB GPUs (H100 PCIe) and even 96 GB GPUs would be tight. On H20 (143 GB), the GPU vocoder fits comfortably with zero failures.

This means the comparison in the PR as currently presented is somewhat asymmetric -- vllm-omni runs its full pipeline on GPU while sglang-omni is bottlenecked by a CPU vocoder. I'd suggest:

  1. Add the GPU vocoder config as an additional benchmark variant, at least for hardware that can support it, so readers see the full picture.
  2. Note prominently that the CPU vocoder default is a memory tradeoff, not a performance-optimal configuration, and that results will differ significantly on GPUs with >128 GB VRAM.
  3. The memory efficiency gap (87.3 GB vs 127.5 GB) is a legitimate advantage for vllm-omni and worth highlighting -- it enables full GPU acceleration on a wider range of hardware.

@zwhzzz0821
Author

@ywang96 That is correct. Using an H200 would provide the necessary VRAM headroom for sglang-omni to run its full GPU pipeline effectively.

@zwhzzz0821
Author

@linyueqian Thanks for the detailed reproduction and analysis.

I have updated the benchmark setup accordingly.

Specifically:

  • I added two sglang-omni config variants:
    • s2pro_tts_upstream.yaml, which mirrors the upstream default/minimal config path
    • s2pro_tts_gpu_vocoder.yaml, which enables stream_vocoder_device=cuda:0 for high-VRAM hardware
  • I also updated the README to explain why these two sglang-omni configs are both included, what tradeoff
    they represent, and why the CPU-vocoder default should be understood as a memory/stability tradeoff rather
    than a performance-optimal configuration.

Author

zwhzzz0821 commented Apr 8, 2026

I ran an additional two-GPU experiment.

For vllm-omni, the result below comes from the code in PR #2520.

The vllm-omni deployment used stage-level placement across two GPUs (not TP):

  • CUDA_VISIBLE_DEVICES=0,1
  • stage-0 (fish_speech_slow_ar) on GPU 0
  • stage-1 (dac_decoder) on GPU 1

The sglang-omni run also used a dual-GPU layout, with the TTS engine on GPU 0 and the streaming/full vocoder path on GPU 1.

Results:

| Framework | Concurrency | Completed | Failed | Mean TTFP (ms) | Mean E2E (ms) | Mean RTF | Audio Throughput | Request Throughput |
|---|---|---|---|---|---|---|---|---|
| sglang-omni | 1 | 50 | 0 | 163.61 | 2259.19 | 0.4360 | 2.2956 | 0.4425 |
| sglang-omni | 4 | 50 | 0 | 265.72 | 3022.26 | 0.5821 | 6.6495 | 1.2819 |
| sglang-omni | 10 | 50 | 0 | 379.37 | 4063.02 | 0.7788 | 12.0722 | 2.3223 |
| vllm-omni (PR #2520) | 1 | 50 | 0 | 633.12 | 2826.97 | 0.6023 | 1.6718 | 0.3537 |
| vllm-omni (PR #2520) | 4 | 50 | 0 | 992.97 | 4256.90 | 0.9266 | 4.2486 | 0.9158 |
| vllm-omni (PR #2520) | 10 | 50 | 0 | 5920.66 | 8940.00 | 1.9641 | 4.8646 | 1.0417 |

vllm-omni config:

```yaml
async_chunk: true
stage_args:
  - stage_id: 0
    stage_type: llm
    is_comprehension: true
    runtime:
      devices: "0"
      max_batch_size: 16
    engine_args:
      max_num_seqs: 4
      model_stage: fish_speech_slow_ar
      model_arch: FishSpeechSlowARForConditionalGeneration
      worker_type: ar
      scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
      enforce_eager: false
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: latent
      gpu_memory_utilization: 0.6
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 3072
      max_model_len: 16384
      custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.fish_speech.slow_ar_to_dac_decoder_async_chunk
    output_connectors:
      to_stage_1: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.8
      top_k: 30
      top_p: 0.9
      max_tokens: 2048
      seed: 42
      detokenize: false
      repetition_penalty: 1.0
      stop_token_ids: [151645]

  - stage_id: 1
    stage_type: llm
    runtime:
      devices: "1"
      max_batch_size: 16
    engine_args:
      max_num_seqs: 1
      model_stage: dac_decoder
      model_arch: FishSpeechDACDecoder
      worker_type: generation
      scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
      enforce_eager: true
      trust_remote_code: true
      async_scheduling: false
      enable_prefix_caching: false
      engine_output_type: audio
      gpu_memory_utilization: 0.1
      distributed_executor_backend: "mp"
      max_num_batched_tokens: 8192
      max_model_len: 16384
    engine_input_source: [0]
    final_output: true
    final_output_type: audio
    input_connectors:
      from_stage_0: connector_of_shared_memory
    default_sampling_params:
      temperature: 0.0
      top_p: 1.0
      top_k: -1
      max_tokens: 65536
      seed: 42
      detokenize: true
      repetition_penalty: 1.0

runtime:
  enabled: true
  defaults:
    window_size: -1
    max_inflight: 16

  connectors:
    connector_of_shared_memory:
      name: SharedMemoryConnector
      extra:
        shm_threshold_bytes: 65536
        codec_streaming: true
        connector_get_sleep_s: 0.01
        connector_get_max_wait_first_chunk: 3000
        connector_get_max_wait: 300
        codec_chunk_frames: 25
        codec_left_context_frames: 25
        initial_codec_chunk_frames: 4

  edges:
    - from: 0
      to: 1
      window_size: -1
```

sglang-omni config:

```yaml
config_cls: S2ProPipelineConfig
model_path: fishaudio/s2-pro
entry_stage: preprocessing
relay_backend: shm
stages:
  - name: preprocessing
    executor:
      factory: sglang_omni.models.fishaudio_s2_pro.pipeline.stages.create_preprocessing_executor
      args: {}
    get_next: sglang_omni.models.fishaudio_s2_pro.pipeline.next_stage.preprocessing_next
    relay:
      slot_size_mb: 512
      credits: 2
      rank: null
      world_size: null
      device: cpu
    num_workers: 1
    stream_to: []
  - name: tts_engine
    executor:
      factory: sglang_omni.models.fishaudio_s2_pro.pipeline.stages.create_sglang_tts_engine_executor
      args:
        device: cuda:0
        max_new_tokens: 2048
        stream_vocoder_device: cuda:1
    get_next: sglang_omni.models.fishaudio_s2_pro.pipeline.next_stage.tts_engine_next
    relay:
      slot_size_mb: 512
      credits: 2
      rank: null
      world_size: null
      device: cuda:0
    num_workers: 1
    stream_to: []
  - name: vocoder
    executor:
      factory: sglang_omni.models.fishaudio_s2_pro.pipeline.stages.create_vocoder_executor
      args:
        device: cuda:1
    get_next: sglang_omni.models.fishaudio_s2_pro.pipeline.next_stage.vocoder_next
    relay:
      slot_size_mb: 512
      credits: 2
      rank: null
      world_size: null
      device: cpu
    num_workers: 1
    stream_to: []
```

cc @Sy0307 @linyueqian

Contributor

Sy0307 commented Apr 8, 2026

sgl-omni has a higher hit rate in the voice-clone scenario because it caches ref_audio. Maybe we should do the same thing; it would be welcomed, as #2561 suggested.
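A minimal sketch of the caching idea, assuming a hypothetical encode_ref_audio helper (not an actual vllm-omni API); keying on a hash of the reference audio bytes lets repeated voice-clone requests skip the DAC encode:

```python
import hashlib
from typing import Any, Callable

_ref_codes_cache: dict[str, Any] = {}


def get_ref_codes(ref_audio: bytes, encode_ref_audio: Callable[[bytes], Any]) -> Any:
    """Return DAC codes for a reference clip, encoding only on first sight."""
    key = hashlib.sha256(ref_audio).hexdigest()
    if key not in _ref_codes_cache:
        # Expensive path: run the DAC encoder once per distinct ref_audio.
        _ref_codes_cache[key] = encode_ref_audio(ref_audio)
    return _ref_codes_cache[key]
```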

@linyueqian
Collaborator

> sgl-omni has a higher hit rate in the voice-clone scenario because it caches ref_audio. Maybe we should do the same thing; it would be welcomed, as #2561 suggested.

Makes sense. I'll see if that helps.

linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 8, 2026
Reuses fish_bench_utils from PR vllm-project#2515 to compare:
  A) Inline ref_audio (no cache, DAC encode every request)
  B) Uploaded voice (cache hits after 1st request)

Reports TTFP/E2E/RTF comparison table.
@zwhzzz0821 zwhzzz0821 requested a review from ywang96 April 9, 2026 08:53
linyueqian pushed a commit to linyueqian/vllm-omni that referenced this pull request Apr 13, 2026
Fold DAC codec decoding into the Slow AR model so AR generation and
audio synthesis run in one vLLM engine process.  Eliminates the second
engine, SharedMemoryConnector, and OmniGenerationScheduler overhead.

New files:
- fish_speech_single_stage.py: subclasses SlowAR, overrides
  make_omni_output to decode audio_codes inline via DAC codec
- fish_speech_s2_pro_single_stage.yaml: single-stage config
- plan/fish_speech_single_stage_analysis.md: analysis doc

Ref: vllm-project#2515 (Fish Speech benchmark baseline)
Collaborator

lishunyang12 left a comment


Review: Fish Speech S2 Pro Benchmark Workflow

Overall this is a well-structured, clearly documented benchmark addition. The shared utility library (fish_bench_utils.py) is a good design choice that keeps the per-backend wrappers thin, and the shell runner script is nicely parameterized. A few observations and suggestions below.

Positives

  • Clean separation between shared infrastructure (fish_bench_utils.py) and backend-specific payload construction. This will make it easy to add future TTS model benchmarks.
  • The SSE vs raw-audio stream auto-detection in send_streaming_request is a nice touch for cross-backend compatibility.
  • The shell script is well-documented with env-var overrides and includes server health probes before running.
  • Good README with hardware caveats, metric definitions, and architecture notes.

Suggestions

  1. Missing __init__.py files: The vllm_omni/ and sglang_omni/ subdirectories under benchmarks/fish-speech/ have no __init__.py. While not strictly needed since the scripts use sys.path.insert, adding empty __init__.py files would make it possible to import these modules cleanly if future tooling or tests need to reference them.

  2. TIMESTAMP variable in run_benchmark.sh is unused: The shell script defines TIMESTAMP (line 23 of the script) and uses it for the plot filename, but the benchmark JSON files get their own timestamp from fish_bench_utils.save_results() using datetime.now(). This means the plot timestamp and the JSON timestamps will differ slightly. Consider either passing the shell-level timestamp into the Python scripts or just accepting the minor drift.

  3. Warmup requests run without concurrency limiting: In run_benchmark() (fish_bench_utils.py), the warmup phase fires num_warmups requests via asyncio.gather without the semaphore. For num_warmups=3 this is fine, but if someone increases it, all warmup requests would hit the server simultaneously. Minor concern, just worth noting (see the sketch after this list).

  4. pcm_bytes_to_duration assumes mono audio: The helper computes duration as num_bytes / sample_width / sample_rate, which is correct for mono. If the model ever outputs stereo, this would silently double the reported duration. Consider adding a channels=1 parameter with a default, or at least a docstring note that it assumes mono (see the sketch after this list).

  5. Hardcoded 44100 sample rate in both bench scripts: Both vllm_omni/bench_fish_server.py and sglang_omni/bench_fish_server.py hardcode SAMPLE_RATE = 44100. If this ever changes per-model or per-config, consider making it a CLI argument with the current value as default, similar to how --request-timeout is exposed.

  6. plot_results.py uses assert for argument validation: The line assert len(args.results) == len(args.labels) will be stripped under optimized Python (python -O). Consider a proper if ... raise SystemExit(...) or parser.error(...) instead (see the sketch after this list).

  7. Shell script check_server for sglang uses /health only: The vllm probe tries /v1/audio/voices first then falls back to /health, which is good. The sglang probe only checks /health. If sglang-omni exposes a speech-specific readiness endpoint, it might be worth probing that too for consistency, though this is minor.

  8. No .gitignore for results/: The results/ directory will accumulate JSON and PNG files from benchmark runs. Consider adding a benchmarks/fish-speech/results/.gitignore with * and !.gitignore to prevent accidental commits of large result artifacts.
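Hedged sketches for suggestions 3, 4, and 6 above (parameter names and defaults are illustrative; the real helpers live in fish_bench_utils.py and plot_results.py):

```python
import asyncio


def pcm_bytes_to_duration(
    num_bytes: int,
    sample_rate: int = 44100,
    sample_width: int = 2,  # bytes per sample (16-bit PCM)
    channels: int = 1,      # explicit mono default; pass 2 for stereo
) -> float:
    """Duration in seconds of a raw PCM buffer (suggestion 4)."""
    return num_bytes / (sample_width * channels * sample_rate)


async def run_warmup(send_fn, num_warmups: int, max_concurrency: int) -> None:
    """Cap warmup concurrency with the same semaphore pattern as the
    measured phase (suggestion 3)."""
    sem = asyncio.Semaphore(max_concurrency)

    async def limited() -> None:
        async with sem:
            await send_fn()

    await asyncio.gather(*(limited() for _ in range(num_warmups)))


# Suggestion 6: validation that survives `python -O`, unlike a bare assert.
# Inside plot_results.py, after args = parser.parse_args():
#     if len(args.results) != len(args.labels):
#         parser.error("--results and --labels must have the same length")
```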

Questions

  • The PR description mentions baseline results were collected but I don't see any JSON result files checked in. Was that intentional? If so, that's good (keeps the repo clean). If baseline results are meant to be checked in for regression tracking, they should probably live in a separate directory or be clearly marked.

Verdict

This is a solid benchmark addition. The suggestions above are mostly minor improvements and do not block merging. Nice work turning an ad-hoc experiment into a reproducible workflow.

@zwhzzz0821
Author

@lishunyang12 Thanks for the detailed review. I have updated the code following your suggestions.

Regarding the baseline results: this was intentional. The new results/.gitignore keeps all generated artifacts out of the repo, which is what we want. If we ever need checked-in baselines for regression tracking, we can move them to a dedicated directory later.

@zwhzzz0821 zwhzzz0821 requested a review from lishunyang12 April 19, 2026 17:17
Collaborator

lishunyang12 left a comment


LGTM, thanks for the thorough follow-up.



Successfully merging this pull request may close these issues.

[Feature]: Establish baseline and profile fish-speech's performance
