feat: add MOSS-TTS-Nano single-stage TTS support#2753
BLOCKER scan: This PR has a merge conflict. Please resolve the conflict before proceeding with review. OVERALL: MERGE CONFLICT VERDICT: REQUEST_CHANGES
I think you can update the skills to allow community users to contribute more effectively
will do
lishunyang12
left a comment
Code Review: feat: add MOSS-TTS-Nano single-stage TTS support
Overall this is a well-structured PR that follows existing patterns in the codebase (VoxCPM-style generator, stage config YAML, registry entry, serving layer extension). The examples, tests, and documentation are thorough. A few issues to address before merging:
Issues
1. _build_moss_tts_params does not handle ref_audio (serving layer bug)
vllm_omni/entrypoints/openai/serving_speech.py — _build_moss_tts_params() maps ref_text to prompt_text but completely ignores request.ref_audio. The online serving README documents custom voice cloning via ref_audio as a supported feature, and the model's _create_stream_gen() reads prompt_audio_path from additional_information. Without wiring ref_audio through the serving layer, voice cloning via the /v1/audio/speech endpoint will silently fail to use the reference audio.
You need to resolve ref_audio (download/decode the data URL) and pass the path through as prompt_audio_path in the params dict, similar to how CosyVoice3 or VoxCPM2 handle it.
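A minimal sketch of the wiring item 1 asks for, assuming ref_audio arrives as a base64 data URL. The helper names `data_url_to_temp_wav` and `build_params` are hypothetical stand-ins, not the project's API; the real serving code would presumably go through its own media-loading utilities instead of parsing the URL by hand.

```python
import base64
import os
import tempfile
from typing import Optional


def data_url_to_temp_wav(data_url: str) -> str:
    """Decode a base64 data URL and write the payload to a temp .wav file."""
    header, sep, payload = data_url.partition(",")
    if not sep or not header.startswith("data:") or not header.endswith(";base64"):
        raise ValueError("expected a base64-encoded data URL")
    raw = base64.b64decode(payload, validate=True)
    fd, path = tempfile.mkstemp(suffix=".wav")
    with os.fdopen(fd, "wb") as f:
        f.write(raw)
    return path


def build_params(ref_audio: Optional[str], ref_text: Optional[str]) -> dict:
    """Hypothetical stand-in for _build_moss_tts_params."""
    params: dict = {}
    if ref_text is not None:
        params["prompt_text"] = ref_text
    if ref_audio is not None:
        # Without this branch the reference audio is silently dropped.
        params["prompt_audio_path"] = data_url_to_temp_wav(ref_audio)
    return params
```

The caller owns the temp file in this sketch and has to delete it once the request finishes.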
2. _ar_emit_stop_token is shared mutable state across a batch — race condition
modeling_moss_tts_nano.py — self._ar_emit_stop_token is a single boolean on the model instance. In forward(), it is set to all(last_chunk_flags) — meaning if any request in the batch is still generating, all requests get non-EOS logits from compute_logits(). This means finished requests are kept alive (emitting empty audio) until the slowest request in the batch completes. For max_num_seqs: 4 this could be significant. Consider making this per-request (e.g., a dict keyed by request ID) so the AR scheduler can finish individual requests independently.
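The difference between the shared flag and a per-request flag can be shown with a toy stand-in for compute_logits(). This is only a sketch: the function names are hypothetical, the token ids are placeholders chosen for illustration, and real code would mask logits rows instead of returning token ids.

```python
EOS_TOKEN_ID = 2       # placeholder stop id for this sketch
CONTINUE_TOKEN_ID = 0  # hypothetical "keep going" id


def next_tokens_shared(last_chunk_flags):
    """One boolean for the whole batch: nobody stops until everybody is done."""
    emit_stop = all(last_chunk_flags)
    return [EOS_TOKEN_ID if emit_stop else CONTINUE_TOKEN_ID for _ in last_chunk_flags]


def next_tokens_per_request(last_chunk_flags):
    """One boolean per row: each finished request gets its own EOS."""
    return [EOS_TOKEN_ID if done else CONTINUE_TOKEN_ID for done in last_chunk_flags]


flags = [True, False, True, False]
print(next_tokens_shared(flags))       # every row keeps generating
print(next_tokens_per_request(flags))  # rows 0 and 2 stop immediately
```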
3. _create_stream_gen buffers all chunks then yields — "streaming" is misleading
The docstring says "yields one audio chunk per forward() call" for progressive streaming, and the PR description claims "TTFP reduced from ~3.1s to ~0.11s." However, looking at the generator, it calls self._lm.inference_stream() which does yield events progressively, and those are yielded out. This part looks correct on re-read. But the comment at line ~1949 says "We buffer first because inference_stream mixes audio events with a final result event" — this comment is outdated/misleading since you actually yield chunk, False inside the loop. Please clean up the comment to match the actual streaming behavior.
4. torch.manual_seed() in _create_stream_gen sets global RNG state
modeling_moss_tts_nano.py line ~1925 — calling torch.manual_seed(seed) and torch.cuda.manual_seed_all(seed) sets the global RNG state. In a concurrent batch with max_num_seqs: 4, one request's seed will overwrite another's. Consider using a torch.Generator for per-request determinism, or at least document that seeding is best-effort and not safe under concurrency.
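The per-request generator idea can be illustrated without PyTorch. The sketch below uses the standard library's random.Random as a stand-in for torch.Generator; the function names are hypothetical. It only carries over to the model if the sampling calls accept an explicit generator, which is an assumption here.

```python
import random


def make_request_rng(seed: int) -> random.Random:
    """A private RNG per request; seeding it never touches global state."""
    return random.Random(seed)


def interleaved_draws(seed_a: int, seed_b: int, n: int):
    """Draw from two request RNGs in alternation, as a batched step would."""
    rng_a, rng_b = make_request_rng(seed_a), make_request_rng(seed_b)
    out_a, out_b = [], []
    for _ in range(n):
        out_a.append(rng_a.random())
        out_b.append(rng_b.random())
    return out_a, out_b


def solo_draws(seed: int, n: int):
    """Draw from a single request RNG with no other request running."""
    rng = make_request_rng(seed)
    return [rng.random() for _ in range(n)]
```

With private generators, a request's samples are the same whether it runs alone or interleaved with another request. Reseeding one global RNG per request does not give that guarantee.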
5. _stream_gens dict is not thread-safe
self._stream_gens is a plain dict mutated without the existing self._lock. While the AR worker is likely single-threaded, the _lock is already used for load_weights, so if there's any possibility of concurrent forward calls (e.g., from async scheduling), this could corrupt state. Either document the single-thread assumption or protect mutations with the lock.
6. Offline test _collect_audio has a typo: AssertionError
tests/e2e/offline_inference/test_moss_tts_nano.py line 1331: the exception name in the raise AssertionError statement is misspelled. The misspelled name would raise a NameError at runtime instead of the intended AssertionError.
Minor / Nits
- `## MOSS-TTS-Nano` comment in registry.py: other entries don't use `##` headers. Use `#` for consistency.
- `for _ in weights: pass` in `load_weights()`: add a comment explaining this drains the iterator to satisfy the vLLM weight-loading protocol, since it's non-obvious.
- Gradio demo is 690 lines: the inline AudioWorklet JS and HTML templates are large. Consider extracting them to separate files under the example directory for maintainability (not blocking).
- `_DEFAULT_MODE = "continuation"` vs online serving README says default voice is "Junhao", which implies `voice_clone` mode: the defaults are inconsistent between offline and online paths. The offline example uses `voice_clone` as default mode while the model code uses `continuation`. Clarify which is intended.
- CI step uses `gpu_1_queue`: confirm this is the right queue for L4 GPUs, as the test decorator specifies `res={"cuda": "L4"}`.
- `_REPO_ROOT` calculation in `end2end.py` uses `Path(__file__).resolve().parents[4]`: this is fragile and breaks if the file is moved. Consider using a more robust path resolution or accepting it as a required CLI arg.
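For the `_REPO_ROOT` nit, one more robust option is to walk up the parent directories until a marker file is found, rather than hard-coding a depth. This is a sketch only: `find_repo_root` is a hypothetical helper, and using `pyproject.toml` as the marker is an assumption about the repository layout.

```python
from pathlib import Path


def find_repo_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Return the nearest ancestor of `start` that contains `marker`."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"no ancestor of {start} contains {marker!r}")


# Usage inside a script such as end2end.py (illustrative):
# _REPO_ROOT = find_repo_root(Path(__file__))
```

Unlike `parents[4]`, this keeps working if the script moves to a different depth, and it fails loudly when the marker is missing.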
What looks good
- Clean single-stage architecture with well-documented YAML config
- Proper use of the VoxCPM generator pattern for streaming
- Good test coverage (offline: English, Chinese, deterministic, batch, voice presets; online: WAV, streaming PCM, Chinese)
- Serving layer integration follows existing patterns cleanly
- Gradio demo with AudioWorklet streaming is a nice addition
Please address items 1, 2, 4, and 6 before merging. Items 3 and 5 are lower priority but worth fixing.
@Sy0307 PTAL.
Overall LGTM, one minor issue:
When a request is cancelled, timed out, or preempted, the AR scheduler notifies the model via on_requests_finished. Suggest adding cleanup along the following lines:

    def on_requests_finished(self, finished_req_ids: set[str] | list[str]) -> None:
        for req_id in finished_req_ids:
            gen = self._stream_gens.pop(str(req_id), None)
            if gen is not None:
                gen.close()
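Why calling close() is enough can be checked in plain Python: close() raises GeneratorExit at the suspended yield, so a finally block inside the generator still runs. The sketch below uses a hypothetical `chunk_stream` generator and `ToyModel` class; only the shape of on_requests_finished follows the suggestion above.

```python
import os
import tempfile
from typing import Iterable


def chunk_stream(chunks, created_paths):
    """Toy generator that owns a temp file and removes it in `finally`."""
    fd, path = tempfile.mkstemp(suffix=".wav")
    os.close(fd)
    created_paths.append(path)
    try:
        for chunk in chunks:
            yield chunk
    finally:
        # Runs on normal exhaustion and on close() (GeneratorExit).
        if os.path.exists(path):
            os.unlink(path)


class ToyModel:
    def __init__(self):
        self._stream_gens = {}

    def on_requests_finished(self, finished_req_ids: Iterable[str]) -> None:
        for req_id in finished_req_ids:
            gen = self._stream_gens.pop(str(req_id), None)
            if gen is not None:
                gen.close()
```

A generator that was never advanced has not run any of its body, so in this sketch closing it has nothing to clean up.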
Integrates OpenMOSS-Team/MOSS-TTS-Nano (0.1B) as a single GPUGenerationWorker stage. Both the AR LM and MOSS-Audio-Tokenizer-Nano codec run inside MossTTSNanoForGeneration.forward(), removing the need for an inter-stage connector.
Key design choices:
- Weights loaded in load_weights() not __init__ (avoids pre-CUDA alloc)
- trust_remote_code delegates to upstream HF model classes
- codec path read from config.audio_tokenizer_pretrained_name_or_path
- inference_stream() collects progressive audio chunks for low latency
- 48 kHz stereo output; voice clone + continuation modes supported
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…codec unavailable Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
…io/speech integration
- Register moss_tts_nano as TTS model stage in serving_speech.py (model detection, validation, prompt building)
- Fix registry mod_relname: moss_tts_nano -> modeling_moss_tts_nano
- Fix stage config: is_comprehension=true (required for generate task)
- Fix default mode: voice_clone -> continuation (built-in presets)
- Add compute_logits stub for VllmModelForTextGeneration protocol
- Remove unused _sentinel nn.Parameter
- Add Gradio demo with AudioWorklet streaming player (48kHz stereo)
- Add online/offline serving docs and launch scripts
TODO: Single-stage generation models don't support true streaming (progressive audio chunks). Current TTFP = full generation time. Needs multi-step scheduling support in GPUGenerationModelRunner.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- Add online serving e2e test (non-streaming, streaming, Chinese)
- Add online serving user guide doc (API, voices, Gradio, curl/Python)
- Add offline inference user guide doc
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- Add MOSS-TTS-Nano E2E test to .buildkite/test-merge.yml
- Remove docs from docs/ (will be synced from examples/ READMEs)
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Switch from GPUGenerationWorker to AR worker with OmniARScheduler, using the VoxCPM-style generator pattern for streaming:
- inference_stream() stored per-request in _stream_gens dict
- Each forward() call yields one audio chunk via next(generator)
- compute_logits() emits EOS only when last chunk is yielded
- AR scheduler loops until EOS, enabling progressive audio output
TTFP reduced from ~3.1s to ~0.11s (30x improvement).
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
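The generator-per-request pattern described in this commit can be sketched in a few lines of plain Python. Everything here is illustrative: `with_last_flag` and `ToyStreamer` are hypothetical names, and the one-item lookahead used to detect the final chunk is an assumption, not necessarily how the model decides when to emit EOS.

```python
def with_last_flag(chunks):
    """Yield (chunk, is_last) pairs using a one-item lookahead."""
    it = iter(chunks)
    try:
        prev = next(it)
    except StopIteration:
        return
    for cur in it:
        yield prev, False
        prev = cur
    yield prev, True


class ToyStreamer:
    def __init__(self):
        self._stream_gens = {}  # request id -> live generator

    def forward(self, req_id, make_chunks):
        """One scheduler step: create the generator lazily, pull one chunk."""
        gen = self._stream_gens.get(req_id)
        if gen is None:
            gen = with_last_flag(make_chunks())
            self._stream_gens[req_id] = gen
        chunk, is_last = next(gen, (None, True))
        if is_last:
            # The last chunk is where EOS would be emitted; drop the generator.
            self._stream_gens.pop(req_id, None)
        return chunk, is_last
```

Each call returns as soon as one chunk is available, which is what lets the first audio reach the client before generation completes.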
The upstream inference_stream() API has no voice/spk parameter; voice preset selection is not yet wired into the call. Remove the dead assignment to silence ruff F841.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
- Wire ref_audio through /v1/audio/speech: _build_moss_tts_params is now
async, resolves ref_audio via MediaConnector and passes the waveform as
prompt_audio_array so the model can materialise a temp WAV for upstream
inference_stream (previously voice cloning via the REST endpoint was a
no-op). Serving now also requires ref_text when ref_audio is provided.
- Fix per-request EOS in batched decode: replace the shared
_ar_emit_stop_token bool with a _ar_last_chunk_flags list so
compute_logits emits EOS per row; finished requests no longer wait for
the slowest peer in a max_num_seqs=4 batch.
- Snapshot and restore CPU+CUDA RNG state around torch.manual_seed to
limit global-state bleed; add comment noting deterministic output
under concurrent batching is best-effort (upstream inference_stream
uses the global RNG).
- Align _DEFAULT_MODE with the offline example and tests ("voice_clone").
- Clean up outdated "buffer first" comment in _create_stream_gen; document
single-threaded AR-worker assumption for _stream_gens; add one-liner
explaining the load_weights iterator drain.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
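The snapshot-and-restore idea from the RNG bullet above can be illustrated with the standard library's random module standing in for torch. In PyTorch the analogous calls are torch.get_rng_state / torch.set_rng_state and their torch.cuda *_all counterparts. `temporarily_seeded` is a hypothetical helper name, and this is a sketch of the technique rather than the code in the commit.

```python
import random
from contextlib import contextmanager


@contextmanager
def temporarily_seeded(seed: int):
    """Seed the global RNG for the body, then put the old state back."""
    saved = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        # Restore even if the body raises, so the seed does not leak out.
        random.setstate(saved)


random.seed(99)
before = random.random()
with temporarily_seeded(7):
    seeded_value = random.random()  # reproducible for seed 7
after = random.random()             # continues the seed-99 sequence
```

This limits how far a request's seed bleeds into later code, but it does not make concurrent requests independent, since they all still draw from the one global generator while the body runs.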
Follow-up to vllm-project#2958. MOSS-TTS-Nano was authored before the schema refactor and still shipped as a legacy stage_configs yaml; this aligns it with the same layout the 5 migrated TTS models now use.
- vllm_omni/model_executor/models/moss_tts_nano/pipeline.py declares MOSS_TTS_NANO_PIPELINE (single LLM_AR stage, owns_tokenizer, audio output, stop_token_ids=[2] as a hard EOS backstop).
- vllm_omni/deploy/moss_tts_nano.yaml holds runtime knobs (max_num_seqs, gpu_memory_utilization, enforce_eager, default_sampling_params, skip_mm_profiling); trust_remote_code stays at deploy top-level.
- vllm_omni/config/pipeline_registry.py registers the entry so the lazy registry can resolve it.
- moss_tts_nano/__init__.py exports MossTTSNanoForGeneration (VoxCPM2 pattern).
- Removed vllm_omni/model_executor/stage_configs/moss_tts_nano.yaml.
Examples, shell scripts, READMEs, and buildkite-invoked tests are updated to use `vllm serve <model> --omni` / `--deploy-config` (auto-load kicks in; no --stage-configs-path or --trust-remote-code).
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
forward() only popped _stream_gens on normal completion, so cancelled, timed-out, or preempted requests leaked their generator and skipped the finally block that unlinks the temp WAV files. Implement on_requests_finished to close each finished generator, which raises GeneratorExit inside it and triggers the existing cleanup.
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
…fs clash
Fish Speech's example README also has a `## Model` heading. mkdocs-autorefs treats both as primary URLs for the symbol `model`, producing a warning per cross-ref site (>100 warnings). With `--strict` + `fail_on_warning: true` in .readthedocs.yml, this fails the docs build (RTD #32425202). Renaming this PR's new heading to `## Model checkpoint` removes the slug conflict and gets the docs build green.
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
MOSS-TTS-Nano upstream is voice-cloning-only: there are no built-in speaker presets. The integration shipped 15 invented voice names (Junhao/Ava/Adam/...) with no resolution layer mapping name → audio, so every request defaulted to mode='voice_clone' with prompt_audio_path unset and the model raised ValueError on serve. The post-merge L4 build caught this (RTD #8107, MOSS-TTS-Nano E2E Test).
Changes:
- serving_speech: require ref_audio + ref_text in /v1/audio/speech; ignore the OpenAI-schema voice field with a clear error message.
- modeling: drop _DEFAULT_VOICE; dummy run no longer carries voice.
- examples/online: rewrite README + gradio_demo around required ref audio upload. Drop the 15-row preset table + dropdown + examples.
- examples/offline: --prompt-audio and --prompt-text now required. Drop --voice and --batch (no per-voice batch makes sense without presets). README points users to upstream assets/audio/zh_1.wav.
- tests: session-scoped fixture downloads upstream zh_1.wav (~50 KB) and reuses it across cases. Drop test_moss_tts_nano_voice_presets (no presets to test).
All paths use XDG_CACHE_HOME or pytest tmp_path_factory, with no /tmp shared-dir writes.
Refs: vllm-project#2753
Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com> Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: Yueqian Lin <linyueqian@outlook.com> Signed-off-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com> Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
Summary
Doc/skill updates split into #2806.
Changes
Model integration:
Serving layer:
Streaming (AR runner):
Examples:
Tests:
Test plan