[Feat] Add Fish Speech S2 Pro benchmark workflow and baseline results #2515
zwhzzz0821 wants to merge 9 commits into vllm-project:main
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6abff8174
```python
def print_comparison_table(all_results: list[list[dict]], labels: list[str]) -> None:
    """Print a markdown-formatted comparison table."""
    concurrencies = sorted(set(result["concurrency"] for result in all_results[0]))
```
Include union of concurrency levels in report table
print_comparison_table builds concurrencies from only all_results[0], so when compared runs have different concurrency sets (for example, one backend fails/OOMs at higher concurrency and omits those rows), the markdown table and improvement section silently drop those missing levels. That hides failed or missing experiment points in the textual summary even though the plot path already handles unioned concurrencies.
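A minimal sketch of that fix, assuming the result dicts carry a `ttfp_ms` field (hypothetical; substitute whichever metric keys the real results use):

```python
def print_comparison_table(all_results: list[list[dict]], labels: list[str]) -> None:
    """Print a markdown-formatted comparison table."""
    # Union the concurrency levels across every run, so a backend that
    # failed or OOMed at high concurrency still appears as a missing row.
    concurrencies = sorted(
        {result["concurrency"] for results in all_results for result in results}
    )
    print("| Concurrency | " + " | ".join(labels) + " |")
    print("|" + "---|" * (len(labels) + 1))
    for concurrency in concurrencies:
        cells = [str(concurrency)]
        for results in all_results:
            match = next((r for r in results if r["concurrency"] == concurrency), None)
            cells.append(f"{match['ttfp_ms']:.0f} ms" if match else "n/a")
        print("| " + " | ".join(cells) + " |")
```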
A few things that should be addressed:
- Let's remove the benchmark result png from the PR itself. (You can still keep it in the PR description.)
- Could you update the benchmark result with the corresponding versions of vllm-omni and sglang-omni? Generally speaking, showing package/library versions is a minimum requirement in any benchmark results.
Totally agreed. Details will be modified as @ywang96 mentioned. The vllm-omni TTFP at c=10 is very odd and I will do some research on it. Also cc @linyueqian. And please make the title say Fish Speech S2 Pro for clarity.
Thanks for putting together this benchmark workflow; the reproducible setup is great.

One concern: the sglang-omni numbers here (RTF 1.94, TTFP 859 ms at c=1) are significantly worse than the official claims in the Fish Audio S2 technical report, which are RTF 0.34 and TTFA ~140 ms on a single H200 at batch size 1. That's roughly a 5.7x gap in RTF, which seems too large to be explained by the H100 PCIe vs H200 hardware difference alone. I compared the sglang-omni config in this PR against the official default and most fields match, but the PR adds an explicit setting that keeps the streaming vocoder on CPU; it would be worth checking whether moving it to GPU closes the gap.

Also, the official benchmarks were run on H200 while this PR uses H100 PCIe; it's worth noting this hardware difference in the README so readers have the right context when interpreting the comparison.
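For context on the metrics under discussion, TTFP and RTF in a streaming TTS benchmark are typically derived from per-request stream timing, roughly as in this sketch (illustrative only; the async chunk iterator and the 16-bit mono 44.1 kHz PCM assumption are mine, not the exact fish_bench_utils code):

```python
import time

async def measure_stream(stream) -> dict:
    """Time one streaming TTS request; `stream` yields raw PCM byte chunks."""
    start = time.perf_counter()
    first_packet_at = None
    pcm_bytes = 0
    async for chunk in stream:
        if first_packet_at is None:
            first_packet_at = time.perf_counter()  # first audio packet arrived
        pcm_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    audio_seconds = pcm_bytes / (2 * 44100)  # assumes 16-bit mono at 44.1 kHz
    return {
        "ttfp_ms": (first_packet_at - start) * 1000.0,
        "e2e_s": elapsed,
        # RTF < 1 means audio is synthesized faster than real time; an RTF
        # of 1.94 therefore means generation is ~2x slower than playback.
        "rtf": elapsed / audio_seconds,
    }
```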
@linyueqian Thanks, this is a very good point. One clarification based on the upstream code: keeping the streaming vocoder on CPU is the upstream default, so the original benchmark config was making that default explicit rather than changing it.

I also ran the additional experiment you suggested and compared the default CPU streaming vocoder against forcing it onto the GPU.

Results on my H100 PCIe setup: forcing the streaming vocoder onto GPU does improve latency/RTF for the requests that complete, but it is not stable; I observed partial-request failures.

My interpretation is that the upstream CPU default is likely a practical stability/memory tradeoff rather than an arbitrary benchmark choice, and the numbers in this PR reflect that default setting.

I will also update the README to make the hardware context explicit: these measurements were collected on H100 PCIe, while the Fish Audio technical report numbers were reported on H200.
@zwhzzz0821 Thanks for the explanation and changes! Just so I understand correctly, I should be able to run the same script on H200 to generate a more representative result for sglang-omni?
@zwhzzz0821 Thanks for the thorough follow-up on the streaming vocoder question. I ran an independent reproduction on 8×H20-3e (143 GB per GPU) to further validate the results. The H20 has enough VRAM for both configurations without OOM, so we can compare them fairly on the same hardware.

H20-3e Results (143 GB VRAM)
- sglang-omni with the streaming vocoder on GPU
- sglang-omni with the streaming vocoder on CPU (upstream default)
- vllm-omni (PR config)

GPU Memory at Serving Time

Analysis

This means the comparison in the PR as currently presented is somewhat asymmetric: vllm-omni runs its full pipeline on GPU while sglang-omni is bottlenecked by a CPU vocoder.
@ywang96 That is correct. Using an H200 would provide the necessary VRAM headroom for sglang-omni to run its full GPU pipeline effectively.
@linyueqian Thanks for the detailed reproduction and analysis. I have updated the benchmark setup accordingly.
I ran an additional two-GPU experiment.

Results:
- vllm-omni config:
- sglang-omni config:
sglang-omni has a higher cache hit rate in the voice-clone scenario because it caches ref_audio. Maybe we should do the same thing; it would be a welcome change, as #2561 suggests.

Makes sense. I'll see if that helps.
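For illustration, a minimal sketch of that kind of ref_audio cache; the encode callback and key-by-content-hash design are assumptions, not sglang-omni's actual implementation:

```python
import hashlib
from typing import Any, Callable

class RefAudioCache:
    """Cache encoded reference audio (e.g. DAC codes) keyed by content hash."""

    def __init__(self, encode_fn: Callable[[bytes], Any]):
        self._encode = encode_fn  # hypothetical DAC encode step
        self._cache: dict[str, Any] = {}

    def get_or_encode(self, ref_audio: bytes) -> Any:
        key = hashlib.sha256(ref_audio).hexdigest()
        if key not in self._cache:
            # The first request with this voice pays the encode cost once;
            # later voice-clone requests with the same bytes are cache hits.
            self._cache[key] = self._encode(ref_audio)
        return self._cache[key]
```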
Reuses fish_bench_utils from PR vllm-project#2515 to compare: A) Inline ref_audio (no cache, DAC encode every request) B) Uploaded voice (cache hits after 1st request). Reports TTFP/E2E/RTF comparison table.
Fold DAC codec decoding into the Slow AR model so AR generation and audio synthesis run in one vLLM engine process. Eliminates the second engine, SharedMemoryConnector, and OmniGenerationScheduler overhead.
New files:
- fish_speech_single_stage.py: subclasses SlowAR, overrides make_omni_output to decode audio_codes inline via DAC codec
- fish_speech_s2_pro_single_stage.yaml: single-stage config
- plan/fish_speech_single_stage_analysis.md: analysis doc
Ref: vllm-project#2515 (Fish Speech benchmark baseline)
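A rough sketch of the subclassing pattern that commit describes; SlowAR and make_omni_output come from the commit message, while the import path, field names, and decoder handle below are assumptions for illustration only:

```python
# Import path is hypothetical; see fish_speech_single_stage.py in the commit.
from fish_speech_stages import SlowAR

class FishSpeechSingleStage(SlowAR):
    """Run AR generation and DAC audio decoding in one engine process."""

    def make_omni_output(self, request_output):
        # Instead of handing audio_codes to a second engine over the
        # SharedMemoryConnector, decode them to a waveform inline.
        audio_codes = request_output.audio_codes       # assumed field name
        waveform = self.dac_codec.decode(audio_codes)  # assumed decoder handle
        return waveform
```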
lishunyang12 left a comment:
Review: Fish Speech S2 Pro Benchmark Workflow
Overall this is a well-structured, clearly documented benchmark addition. The shared utility library (fish_bench_utils.py) is a good design choice that keeps the per-backend wrappers thin, and the shell runner script is nicely parameterized. A few observations and suggestions below.
Positives
- Clean separation between shared infrastructure (fish_bench_utils.py) and backend-specific payload construction. This will make it easy to add future TTS model benchmarks.
- The SSE vs raw-audio stream auto-detection in send_streaming_request is a nice touch for cross-backend compatibility.
- The shell script is well-documented with env-var overrides and includes server health probes before running.
- Good README with hardware caveats, metric definitions, and architecture notes.
Suggestions
- Missing __init__.py files: The vllm_omni/ and sglang_omni/ subdirectories under benchmarks/fish-speech/ have no __init__.py. While not strictly needed since the scripts use sys.path.insert, adding empty __init__.py files would make it possible to import these modules cleanly if future tooling or tests need to reference them.
- TIMESTAMP in run_benchmark.sh is inconsistent: The shell script defines TIMESTAMP (line 23 of the script) and uses it for the plot filename, but the benchmark JSON files get their own timestamp from fish_bench_utils.save_results() using datetime.now(). This means the plot timestamp and the JSON timestamps will differ slightly. Consider either passing the shell-level timestamp into the Python scripts or just accepting the minor drift.
- Warmup requests run without concurrency limiting: In run_benchmark() (fish_bench_utils.py), the warmup phase fires num_warmups requests via asyncio.gather without the semaphore. For num_warmups=3 this is fine, but if someone increases it, all warmup requests would hit the server simultaneously. Minor concern, just worth noting.
- pcm_bytes_to_duration assumes mono audio: The helper computes duration as num_bytes / sample_width / sample_rate, which is correct for mono. If the model ever outputs stereo, this would silently double the reported duration (and correspondingly halve the reported RTF). Consider adding a channels=1 parameter with a default, or at least a docstring note that it assumes mono; see the sketch after this list.
- Hardcoded 44100 sample rate in both bench scripts: Both vllm_omni/bench_fish_server.py and sglang_omni/bench_fish_server.py hardcode SAMPLE_RATE = 44100. If this ever changes per-model or per-config, consider making it a CLI argument with the current value as default, similar to how --request-timeout is exposed.
- plot_results.py uses assert for argument validation: The line assert len(args.results) == len(args.labels) will be stripped in optimized Python (python -O). Consider a proper if ... raise SystemExit(...) or parser.error(...) instead; see the sketch after this list.
- Shell script check_server for sglang uses /health only: The vllm probe tries /v1/audio/voices first and then falls back to /health, which is good. The sglang probe only checks /health. If sglang-omni exposes a speech-specific readiness endpoint, it might be worth probing that too for consistency, though this is minor.
- No .gitignore for results/: The results/ directory will accumulate JSON and PNG files from benchmark runs. Consider adding a benchmarks/fish-speech/results/.gitignore with * and !.gitignore to prevent accidental commits of large result artifacts.
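As referenced in the mono-audio and assert items above, a minimal sketch of both fixes; the function and argument names mirror the review comments, and the channels parameter is an addition, not the current code:

```python
import argparse

def pcm_bytes_to_duration(
    num_bytes: int,
    sample_rate: int = 44100,
    sample_width: int = 2,
    channels: int = 1,  # default mono; stereo callers must pass channels=2
) -> float:
    """Return the duration in seconds of raw PCM audio data."""
    return num_bytes / (sample_width * channels * sample_rate)

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser()
    parser.add_argument("--results", nargs="+", required=True)
    parser.add_argument("--labels", nargs="+", required=True)
    args = parser.parse_args()
    # parser.error() exits with a usage message and, unlike a bare
    # assert, survives `python -O`.
    if len(args.results) != len(args.labels):
        parser.error("--results and --labels must have the same length")
    return args
```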
Questions
- The PR description mentions baseline results were collected but I don't see any JSON result files checked in. Was that intentional? If so, that's good (keeps the repo clean). If baseline results are meant to be checked in for regression tracking, they should probably live in a separate directory or be clearly marked.
Verdict
This is a solid benchmark addition. The suggestions above are mostly minor improvements and do not block merging. Nice work turning an ad-hoc experiment into a reproducible workflow.
@lishunyang12 Thanks for the detailed review. I have updated the code following your suggestions. Regarding the baseline results: this was intentional. The new results/.gitignore keeps all generated artifacts out of the repo, which is what we want. If we ever need checked-in baselines for regression tracking, we can move them to a dedicated directory later.
lishunyang12 left a comment:
LGTM, thanks for the thorough follow-up.
Purpose

Closes #2432.

This PR adds a complete Fish Speech benchmark workflow under benchmarks/fish-speech/ and provides the initial baseline comparison between vllm-omni and sglang-omni for fishaudio/s2-pro.

Overall, this PR turns the Fish Speech benchmark from an ad hoc local experiment into a reproducible benchmark workflow that can be rerun and compared more easily.
Test Plan

Environment
- vllm-omni: local source checkout, based on upstream main commit ca02351a1ef8aa6397126c60154a80ee06ae3553
- sglang-omni: local source checkout, commit 3a9accf7fde58e0808c6623ebbe91de87cffdc98

End-to-end benchmark workflow
- vllm-omni with benchmarks/fish-speech/config/vllm_omni/fish_speech_s2_pro.yaml
- sglang-omni with benchmarks/fish-speech/config/sglang_omni/s2pro_tts.yaml settings
- benchmarks/fish-speech/plot_results.py

Test Result
All static/script checks above passed locally.

The benchmark workflow in this PR was run successfully and produced benchmark JSON outputs and plots under benchmarks/fish-speech/results/.

Baseline results used for the initial comparison:

Additional observations from local runs:
- sglang-omni used noticeably more GPU memory than vllm-omni in this setup, with roughly 20 GB higher memory usage during serving.
- sglang-omni startup was also significantly slower than vllm-omni under the same local environment and model setup.

Documentation was also updated in benchmarks/fish-speech/README.md to describe the setup and benchmark workflow.