Support funaudiochat s2s #1748

Open
nemoramo wants to merge 15 commits into vllm-project:main from nemoramo:support-funaudiochat-s2s

Conversation

@nemoramo

@nemoramo nemoramo commented Mar 9, 2026

PR Description

Summary

This PR adds Speech-to-Speech (S2S) support for FunAudioLLM/Fun-Audio-Chat-8B in vLLM-Omni.

Following the vLLM-Omni contributing guide, this PR keeps the scope limited to the upstream-ready integration itself:

  • model implementation
  • model/config registration
  • stage config and stage input processing
  • runtime plumbing required by the S2S pipeline
  • focused tests
  • supported-model documentation

Changes

Model integration

  • add the FunAudioChat stage-0 implementation
  • add the CosyVoice3 code2wav stage-1 path used by the FunAudioChat S2S pipeline
  • add the default FunAudioChat S2S stage config
  • add the FunAudioChat stage input processor

Runtime support

  • register the FunAudioChat model and config so the pipeline can be resolved and loaded correctly
  • update the relevant entrypoint / scheduler / output-processing code paths for FunAudioChat S2S
  • fix OmniRequestOutput handling when pipeline request_output is a list

Tests and docs

  • add focused unit tests for:
    • FunAudioChat native helper logic
    • FunAudioChat stage input processing
    • entrypoint/config resolution
    • OmniRequestOutput regression coverage
  • document FunAudioChat in docs/models/supported_models.md

Notes

  • contributor-local runtime path fallbacks were removed; the integration now relies on the installed package or
    FUN_AUDIO_CHAT_HOME
  • benchmark / diagnostic scripts and local intermediate artifacts are intentionally excluded from this PR to keep the
    change minimal and upstream-ready
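
The lookup order described in the first note could be sketched roughly as follows. This is an illustration only: the helper name `resolve_funaudiochat_home` and the `package_name` argument are hypothetical, and the PR's actual resolution logic may differ.

```python
import importlib.util
import os
from pathlib import Path
from typing import Optional


def resolve_funaudiochat_home(package_name: str) -> Optional[Path]:
    """Hypothetical helper: prefer an installed package, then FUN_AUDIO_CHAT_HOME."""
    # 1. An installed top-level package wins if Python can locate it.
    spec = importlib.util.find_spec(package_name)
    if spec is not None and spec.origin is not None:
        return Path(spec.origin).resolve().parent

    # 2. Otherwise fall back to the environment variable named in the notes.
    env_home = os.environ.get("FUN_AUDIO_CHAT_HOME")
    if env_home:
        candidate = Path(env_home).expanduser()
        if candidate.is_dir():
            return candidate

    # 3. No contributor-local fallbacks: let the caller report "not found".
    return None
```

Returning `None` instead of guessing a path keeps the failure explicit, which matches the stated goal of removing contributor-local fallbacks.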

Testing

Ran the following checks locally:

python -m py_compile \
  vllm_omni/model_executor/models/funaudiochat/common.py \
  vllm_omni/model_executor/models/funaudiochat/funaudiochat.py \
  vllm_omni/model_executor/models/funaudiochat/funaudiochat_code2wav.py \
  vllm_omni/model_executor/stage_input_processors/funaudiochat.py \
  vllm_omni/outputs.py \
  vllm_omni/engine/arg_utils.py \
  vllm_omni/entrypoints/omni.py \
  vllm_omni/entrypoints/omni_stage.py \
  vllm_omni/core/sched/omni_generation_scheduler.py \
  vllm_omni/worker/gpu_ar_model_runner.py

python -m pytest \
  tests/test_outputs.py \
  tests/entrypoints/test_funaudiochat_contrib.py \
  tests/model_executor/models/test_funaudiochat_native.py \
  tests/model_executor/stage_input_processors/test_funaudiochat.py -q

Result:

- 26 passed


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d2f3e1f43

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread vllm_omni/engine/arg_utils.py Outdated
if hf_config_path is not None or model_arch not in _COSYVOICE3_MODEL_ARCHES:
    return hf_config_path

return str(Path(__file__).resolve().parent.parent / "model_executor" / "models" / "cosyvoice3" / "hf_config")

P1 Badge Avoid defaulting hf_config_path to a nonexistent bundle

This fallback always returns .../model_executor/models/cosyvoice3/hf_config when hf_config_path is unset, but that directory is not present in this repo/package. Default stage configs (for example funaudiochat_s2s.yaml, which does not set hf_config_path) will therefore pass an invalid local path into model config creation and fail before loading the model unless users manually override hf_config_path.
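
One way to guard against this, shown as a minimal sketch rather than the PR's code: return the bundled directory only when it actually exists. The function name `resolve_default_hf_config_path` and the `bundled_dir` parameter are hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_default_hf_config_path(
    hf_config_path: Optional[str],
    bundled_dir: Path,
) -> Optional[str]:
    """Return an explicit path unchanged; use the bundled dir only if it exists."""
    if hf_config_path is not None:
        return hf_config_path
    # Guard: never hand a nonexistent directory to model config creation.
    if bundled_dir.is_dir():
        return str(bundled_dir)
    return None
```

With this shape, a missing bundle yields `None`, so the caller can fall back to the model's own config instead of failing on an invalid local path.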


Comment on lines +33 to +34
valid_rows = audio_token_ids.any(dim=-1)
audio_token_ids = audio_token_ids[valid_rows]

P2 Badge Keep all-zero codec rows when flattening stage-0 tokens

Using audio_token_ids.any(dim=-1) drops any 2D row made entirely of zeros, but token 0 is treated as valid elsewhere in this same function (filtered >= 0) and in tests, so a legitimate all-zero codec group from stage-0 would be silently removed before code2wav, shortening or corrupting synthesized audio for that request.
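
The difference can be illustrated without tensors. The sketch below uses plain Python lists and assumes, based on the `>= 0` filter mentioned above, that negative ids mark padding; both helper names are hypothetical.

```python
def drop_rows_with_any(rows):
    """Mimics `any(dim=-1)`: keeps a row only if it has a nonzero entry."""
    return [row for row in rows if any(row)]


def drop_rows_by_validity(rows):
    """Keeps a row if it holds at least one non-negative token id (0 is valid)."""
    return [row for row in rows if any(tok >= 0 for tok in row)]


rows = [
    [0, 0, 0],     # legitimate all-zero codec group
    [5, 3, 1],
    [-1, -1, -1],  # padding (assumed convention)
]

kept_by_any = drop_rows_with_any(rows)          # loses the all-zero row
kept_by_validity = drop_rows_by_validity(rows)  # keeps it, drops the padding
```

Note that a truthiness test also keeps the all-negative padding row, since `-1` is nonzero, so it filters on the wrong property in both directions.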


@linyueqian
Collaborator

resolve conflicts please

@linyueqian
Collaborator

Tried to run this locally but wasn't able to start the server. There are several API incompatibilities with the current main branch:

  • OmniModelConfig.__post_init__ in the PR adds language_model_only as a parameter, but the installed vLLM/vllm-omni doesn't have this as an InitVar field, so pydantic rejects it
  • OmniInputProcessor.__init__ passes renderer to the parent InputProcessor, but the current InputProcessor.__init__ doesn't accept that kwarg
  • The process_inputs method signature also diverges from main (e.g., ProcessorInputs vs DictPrompt | TokPrompt, different parameter ordering)

It looks like this PR was developed against a different version of vllm-omni. Could you rebase onto the latest main?

@linyueqian linyueqian self-requested a review March 9, 2026 18:43
@nemoramo nemoramo force-pushed the support-funaudiochat-s2s branch 2 times, most recently from 1768dd8 to b4d8620 Compare March 10, 2026 02:13
@nemoramo
Author

Thanks for your review.

  1. OmniModelConfig.__post_init__ / language_model_only: this has been resolved.
  2. OmniInputProcessor.__init__ passing renderer: this has also been aligned with the current main. The rebased branch no longer depends on the older parent InputProcessor.__init__ signature.
  3. process_inputs signature divergence: the rebase resolved this as well.

I also re-tested the updated branch against the latest main and verified that the server starts successfully and /v1/chat/completions works.

@nemoramo
Author

Sorry, wait. There still seems to be a mismatch problem.

@nemoramo nemoramo force-pushed the support-funaudiochat-s2s branch 2 times, most recently from 4921519 to b6ae819 Compare March 11, 2026 02:31
@linyueqian
Collaborator

@ramos please let me know if it is ready to be reviewed again. thanks!

@nemoramo
Author

ok! When it's done, I will notify you asap.

@nemoramo nemoramo force-pushed the support-funaudiochat-s2s branch 9 times, most recently from 0a7d6df to fe924de Compare March 12, 2026 04:49
@nemoramo
Author

Thanks for waiting. I have checked this branch and I believe it's ready right now @linyueqian

@nemoramo
Author

@claude can you also help review this again

@nemoramo
Author

@codex please also review this?


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 813dc0e16f

Comment on lines +112 to +113
flat = prompt_token_ids[0] if isinstance(prompt_token_ids[0], list) else prompt_token_ids
if len(flat) == 0:

P2 Badge Split prompt_token_ids by request before decoding

This helper collapses batched prompt_token_ids to a single 1-D sequence (prompt_token_ids[0] for list inputs), so when stage-1 runs with more than one request in a batch, requests after index 0 will decode from the wrong codec stream. This corrupts per-request audio output as soon as max_batch_size is increased beyond 1 for throughput.
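
A minimal sketch of the alternative, assuming the input is either one flat token list or a list of per-request lists: normalize to one sequence per request instead of taking index 0. The helper name `split_prompt_token_ids` is hypothetical.

```python
from typing import List


def split_prompt_token_ids(prompt_token_ids) -> List[List[int]]:
    """Return one token list per request instead of collapsing to request 0."""
    if len(prompt_token_ids) == 0:
        return []
    # Batched input: a sequence of per-request sequences.
    if isinstance(prompt_token_ids[0], (list, tuple)):
        return [list(seq) for seq in prompt_token_ids]
    # Unbatched input: a single flat sequence of token ids.
    return [list(prompt_token_ids)]
```

Callers can then loop over the result and decode each request's codec stream separately, regardless of batch size.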


del positions, intermediate_tensors, inputs_embeds

sampling_metadata = kwargs.get("sampling_metadata")
token, is_dummy_profile = self._build_decode_tokens(input_ids, sampling_metadata)

P2 Badge Decode batched code2wav inputs per request

The forward path builds a single token tensor and emits one concatenated waveform, without using per-request boundaries, so batched requests are merged into one utterance instead of producing isolated outputs. If operators raise stage-1 batching (runtime.max_batch_size > 1), this mixes users’ codec tokens and returns incorrect audio for all requests in that batch.
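
Per-request decoding could look something like the sketch below. It assumes the runner can supply the number of tokens belonging to each request; `decode_per_request`, `tokens_per_request`, and the `decode` callable are hypothetical stand-ins, not names from the PR.

```python
from typing import Callable, List, Sequence


def decode_per_request(
    flat_tokens: Sequence[int],
    tokens_per_request: Sequence[int],
    decode: Callable[[List[int]], List[float]],
) -> List[List[float]]:
    """Slice a flat token buffer at request boundaries and decode each slice."""
    if sum(tokens_per_request) != len(flat_tokens):
        raise ValueError("request lengths do not cover the token buffer")

    waveforms: List[List[float]] = []
    start = 0
    for count in tokens_per_request:
        # Each request only ever sees its own codec tokens.
        waveforms.append(decode(list(flat_tokens[start:start + count])))
        start += count
    return waveforms
```

The length check fails fast on inconsistent boundaries, which is preferable to silently mixing one request's tokens into another's audio.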


Signed-off-by: ramos.ma <wyrmyf@gmail.com>
Signed-off-by: mayufeng <mayufeng@asr-h100>
Signed-off-by: ramos.ma <wyrmyf@gmail.com>
Signed-off-by: ramos.ma <wyrmyf@gmail.com>
Signed-off-by: mayufeng <mayufeng@asr-h100>
Signed-off-by: mayufeng <mayufeng@asr-h100>
Signed-off-by: mayufeng <mayufeng@asr-h100>
Signed-off-by: ramos.ma <wyrmyf@gmail.com>
Signed-off-by: mayufeng <mayufeng@asr-h100>
@nemoramo nemoramo force-pushed the support-funaudiochat-s2s branch from fdb750d to 567e6d5 Compare March 13, 2026 00:10
@linyueqian
Collaborator

Tested E2E with local checkpoints (~/ckpt/Fun-Audio-Chat-8B + ~/ckpt/Fun-CosyVoice3-0.5B-2512). Audio generation works after fixing the issues below. Nice work getting the full pipeline wired up!

Bugs found

1. defer_finalize kwarg doesn't exist in upstream vLLM

Both gpu_ar_model_runner.py:297 and gpu_generation_model_runner.py:283 call:

self.maybe_get_kv_connector_output(scheduler_output, defer_finalize=defer_finalize)

But the installed vLLM's KVConnectorModelRunnerMixin.maybe_get_kv_connector_output() accepts clear_metadata, not defer_finalize. This crashes on every inference call.

Fix: clear_metadata=not defer_finalize
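
To make the inversion concrete, here is a small sketch using stub classes. `StubKVConnectorMixin` and `StubModelRunner` are stand-ins written for this example; only the keyword names `clear_metadata` and `defer_finalize` come from the review, and the default value and return value of the stub are assumptions.

```python
class StubKVConnectorMixin:
    """Stand-in for the upstream mixin; it only records what it was called with."""

    def maybe_get_kv_connector_output(self, scheduler_output, clear_metadata=True):
        return {"scheduler_output": scheduler_output, "clear_metadata": clear_metadata}


class StubModelRunner(StubKVConnectorMixin):
    def run(self, scheduler_output, defer_finalize=False):
        # Deferring finalization means the metadata must NOT be cleared yet,
        # so the flag is passed inverted under the keyword upstream accepts.
        return self.maybe_get_kv_connector_output(
            scheduler_output, clear_metadata=not defer_finalize
        )
```

Passing the old keyword to the stub raises `TypeError`, which mirrors the crash described above.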

2. Stage config auto-detection fails

resolve_model_config_path() looks for {model_type}.yaml in stage_configs/, which resolves to funaudiochat.yaml. But the PR names the file funaudiochat_s2s.yaml, so auto-detection silently falls back to a single-stage config.

Fix: Rename to funaudiochat.yaml, or add a mapping for the funaudiochat model type.
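
The mapping option might be sketched as below. The override table `STAGE_CONFIG_OVERRIDES` and the helper `find_stage_config` are hypothetical; this is not the body of resolve_model_config_path().

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical override table: model_type -> stage config file name.
STAGE_CONFIG_OVERRIDES: Dict[str, str] = {"funaudiochat": "funaudiochat_s2s.yaml"}


def find_stage_config(model_type: str, stage_configs_dir: Path) -> Optional[Path]:
    """Look up an explicit mapping first, then fall back to {model_type}.yaml."""
    file_name = STAGE_CONFIG_OVERRIDES.get(model_type, f"{model_type}.yaml")
    candidate = stage_configs_dir / file_name
    # Return None rather than silently picking a different config.
    return candidate if candidate.is_file() else None
```

Renaming the file is simpler; a mapping only pays off if several model types need non-default file names.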

3. language_model_only: true strips the discrete audio tower needed for S2S

The bundled YAML sets language_model_only: true, which replaces both continuous_audio_tower and audio_tower with StageMissingLayer. But funaudiochat.py:434 calls self.audio_tower() during speech generation; the discrete encoder is essential for CRQ decoding.

The underlying issue is that the continuous audio tower's profiler generates 300s dummy audio (from max_source_positions=1500), which triggers a flash_attn v2 requirement. Workaround: use hf_overrides: {"audio_config": {"max_source_positions": 100}} with limit_mm_per_prompt: {"audio": 1} to keep the towers loaded while keeping dummy audio short enough to avoid flash_attn.
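
For a rough sense of the numbers, the sketch below assumes the dummy audio length scales linearly with max_source_positions, which is inferred only from the 1500 to 300 s figure quoted above. It also writes the suggested overrides as a plain Python dict; where exactly these keys belong in the stage YAML is not shown in this thread.

```python
# Figures quoted in the review: 1500 positions correspond to 300 s of dummy audio.
SECONDS_PER_POSITION = 300 / 1500  # 0.2 s per position, assuming linear scaling


def dummy_audio_seconds(max_source_positions: int) -> float:
    """Estimated dummy-audio duration under the linear-scaling assumption."""
    return max_source_positions * SECONDS_PER_POSITION


# The overrides suggested in the review, expressed as a plain dict.
stage0_engine_overrides = {
    "hf_overrides": {"audio_config": {"max_source_positions": 100}},
    "limit_mm_per_prompt": {"audio": 1},
}

estimated = dummy_audio_seconds(
    stage0_engine_overrides["hf_overrides"]["audio_config"]["max_source_positions"]
)
```

Under that assumption, lowering max_source_positions to 100 would shorten the profiling audio to roughly 20 s.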

4. Bundled YAML uses HF repo IDs

The stage config references FunAudioLLM/Fun-Audio-Chat-8B and FunAudioLLM/Fun-CosyVoice3-0.5B-2512. These won't resolve for users with local checkpoints. Consider documenting that users need to override model paths.

Signed-off-by: mayufeng <mayufeng@asr-h100>
Signed-off-by: ramos.ma <wyrmyf@gmail.com>
@nemoramo
Author

Thanks for your thorough review.
I believe these issues have been fixed. Could you review this again? Thanks very much! @linyueqian

@codex please review this again, following the vllm-omni-contrib rules.


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 600f45bd90

Comment on lines +68 to +69
local_path = Path(model_path)
if local_path.exists():

P2 Badge Expand user path before probing local checkpoint

_resolve_model_path tests Path(model_path).exists() without expanding ~, so a valid local override like ~/ckpt/Fun-CosyVoice3-0.5B-2512 is treated as missing and the code falls through to snapshot_download(model_path). In stage YAML/CLI workflows this causes local-checkpoint runs to fail with a repo/path resolution error even though the directory exists; expanding user/env components before the existence check would keep local S2S setups working.
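
A minimal sketch of the suggested expansion, using only standard-library calls. The helper name `find_local_checkpoint` is hypothetical and is not the PR's _resolve_model_path.

```python
import os
from pathlib import Path
from typing import Optional


def find_local_checkpoint(model_path: str) -> Optional[Path]:
    """Return the expanded local path if it exists, else None."""
    # Expand $VARS first, then a leading "~", before touching the filesystem.
    expanded = Path(os.path.expandvars(model_path)).expanduser()
    return expanded if expanded.exists() else None
```

A caller would fall through to a remote download only when this returns `None`, so overrides such as ~/ckpt/Fun-CosyVoice3-0.5B-2512 resolve locally.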


@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 13, 2026
@linyueqian
Collaborator

resolve conflicts please.

@linyueqian
Collaborator

Could you also add the corresponding buildkite CI entries (test-merge.yml / test-nightly.yml) so these tests actually run on L4? See #1911 for reference on how the Qwen3-TTS tests are wired up.

mayufeng and others added 6 commits March 20, 2026 22:48
Signed-off-by: mayufeng <mayufeng@example.com>
Signed-off-by: mayufeng <mayufeng@example.com>
Signed-off-by: mayufeng <mayufeng@example.com>
Signed-off-by: mayufeng <mayufeng@example.com>
Signed-off-by: mayufeng <mayufeng@example.com>
Signed-off-by: ramos.ma <wyrmyf@gmail.com>
@nemoramo nemoramo force-pushed the support-funaudiochat-s2s branch from 0013f49 to 1076195 Compare March 20, 2026 23:29
@linyueqian
Collaborator

@nemoramo please rebase and fix the conflicts.

@linyueqian linyueqian removed this from the v0.18.0 milestone Mar 24, 2026
