Support funaudiochat s2s by nemoramo · Pull Request #1748 · vllm-project/vllm-omni

nemoramo · 2026-03-09T08:28:18Z

PR Description

Summary

This PR adds Speech-to-Speech (S2S) support for FunAudioLLM/Fun-Audio-Chat-8B in vLLM-Omni.

Following the vLLM-Omni contributing guide, this PR keeps the scope limited to the upstream-ready integration itself:

- model implementation
- model/config registration
- stage config and stage input processing
- runtime plumbing required by the S2S pipeline
- focused tests
- supported-model documentation

Changes

Model integration

- add the FunAudioChat stage-0 implementation
- add the CosyVoice3 code2wav stage-1 path used by the FunAudioChat S2S pipeline
- add the default FunAudioChat S2S stage config
- add the FunAudioChat stage input processor

Runtime support

- register the FunAudioChat model and config so the pipeline can be resolved and loaded correctly
- update the relevant entrypoint / scheduler / output-processing code paths for FunAudioChat S2S
- fix OmniRequestOutput handling when pipeline request_output is a list

Tests and docs

- add focused unit tests for:
  - FunAudioChat native helper logic
  - FunAudioChat stage input processing
  - entrypoint/config resolution
  - OmniRequestOutput regression coverage
- document FunAudioChat in docs/models/supported_models.md

Notes

- contributor-local runtime path fallbacks were removed; the integration now relies on the installed package or FUN_AUDIO_CHAT_HOME
- benchmark / diagnostic scripts and local intermediate artifacts are intentionally excluded from this PR to keep the change minimal and upstream-ready

Testing

Ran the following checks locally:

python -m py_compile \
  vllm_omni/model_executor/models/funaudiochat/common.py \
  vllm_omni/model_executor/models/funaudiochat/funaudiochat.py \
  vllm_omni/model_executor/models/funaudiochat/funaudiochat_code2wav.py \
  vllm_omni/model_executor/stage_input_processors/funaudiochat.py \
  vllm_omni/outputs.py \
  vllm_omni/engine/arg_utils.py \
  vllm_omni/entrypoints/omni.py \
  vllm_omni/entrypoints/omni_stage.py \
  vllm_omni/core/sched/omni_generation_scheduler.py \
  vllm_omni/worker/gpu_ar_model_runner.py

python -m pytest \
  tests/test_outputs.py \
  tests/entrypoints/test_funaudiochat_contrib.py \
  tests/model_executor/models/test_funaudiochat_native.py \
  tests/model_executor/stage_input_processors/test_funaudiochat.py -q

Result:

- 26 passed

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5d2f3e1f43

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-03-09T08:36:04Z

+    if hf_config_path is not None or model_arch not in _COSYVOICE3_MODEL_ARCHES:
+        return hf_config_path
+
+    return str(Path(__file__).resolve().parent.parent / "model_executor" / "models" / "cosyvoice3" / "hf_config")


Avoid defaulting hf_config_path to a nonexistent bundle

This fallback always returns .../model_executor/models/cosyvoice3/hf_config when hf_config_path is unset, but that directory is not present in this repo/package, so default stage configs (for example funaudiochat_s2s.yaml, which does not set hf_config_path) will pass an invalid local path into model config creation and fail before loading the model unless users manually override hf_config_path.
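
One possible shape for a safer fallback is sketched below. It is not the PR's code: `resolve_hf_config_path`, its `bundle_dir` parameter, and the architecture name in `COSYVOICE3_ARCHES` are hypothetical stand-ins, since the contents of `_COSYVOICE3_MODEL_ARCHES` are not shown here. The idea is only to return the bundled directory when it actually exists.

```python
from pathlib import Path
from typing import Optional

# Assumption: stand-in for the PR's _COSYVOICE3_MODEL_ARCHES set.
COSYVOICE3_ARCHES = {"CosyVoice3Code2Wav"}  # hypothetical architecture name


def resolve_hf_config_path(
    hf_config_path: Optional[str],
    model_arch: str,
    bundle_dir: Path,
) -> Optional[str]:
    """Return the bundled config directory only when it exists on disk."""
    # An explicit path, or an unrelated architecture, passes through unchanged.
    if hf_config_path is not None or model_arch not in COSYVOICE3_ARCHES:
        return hf_config_path
    if bundle_dir.is_dir():
        return str(bundle_dir)
    # No bundle shipped: leave the path unset so the caller does not
    # receive an invalid local path.
    return None
```

With this shape, a stage config that omits hf_config_path would see `None` rather than a path to a missing directory.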

chatgpt-codex-connector · 2026-03-09T08:36:04Z

+        valid_rows = audio_token_ids.any(dim=-1)
+        audio_token_ids = audio_token_ids[valid_rows]


Keep all-zero codec rows when flattening stage-0 tokens

Using audio_token_ids.any(dim=-1) drops any 2D row made entirely of zeros, but token 0 is treated as valid elsewhere in this same function (filtered >= 0) and in tests, so a legitimate all-zero codec group from stage-0 would be silently removed before code2wav, shortening or corrupting synthesized audio for that request.
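
The difference can be illustrated without tensors. The sketch below uses plain Python lists and assumes that negative ids mark padding, which is consistent with the `>= 0` filter mentioned above but is otherwise an assumption; both function names are hypothetical.

```python
from typing import List


def keep_valid_rows(rows: List[List[int]]) -> List[List[int]]:
    """Keep every row that holds at least one non-negative token id."""
    return [row for row in rows if any(tok >= 0 for tok in row)]


def drop_falsy_rows(rows: List[List[int]]) -> List[List[int]]:
    """Mirror the truthiness test of `.any(dim=-1)`: all-zero rows vanish."""
    return [row for row in rows if any(row)]


rows = [[3, 7], [0, 0], [-1, -1]]
kept = keep_valid_rows(rows)      # the all-zero row survives, padding is dropped
dropped = drop_falsy_rows(rows)   # the all-zero row is lost, padding survives
```

A validity mask built from the same `>= 0` condition used elsewhere in the function would keep the two filters consistent.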

linyueqian · 2026-03-09T18:32:17Z

resolve conflicts please

linyueqian · 2026-03-09T18:42:57Z

Tried to run this locally but wasn't able to start the server. There are several API incompatibilities with the current main branch:

OmniModelConfig.__post_init__ in the PR adds language_model_only as a parameter, but the installed vLLM/vllm-omni doesn't have this as an InitVar field, so pydantic rejects it
OmniInputProcessor.__init__ passes renderer to the parent InputProcessor, but the current InputProcessor.__init__ doesn't accept that kwarg
The process_inputs method signature also diverges from main (e.g., ProcessorInputs vs DictPrompt | TokPrompt, different parameter ordering)

It looks like this PR was developed against a different version of vllm-omni. Could you rebase onto the latest main?

nemoramo · 2026-03-10T02:23:40Z

Thanks for your review.

OmniModelConfig.__post_init__ / language_model_only: this has been resolved.
OmniInputProcessor.__init__ passing renderer: this has also been aligned with the current main. The rebased branch no longer depends on the older parent InputProcessor.__init__ signature.
process_inputs signature divergence: the rebase resolved this as well.
I also re-tested the updated branch against the latest main and verified that the server starts successfully and /v1/chat/completions works.

nemoramo · 2026-03-10T02:54:02Z

Sorry, wait: there still seems to be a mismatch problem.

linyueqian · 2026-03-11T02:51:51Z

@ramos please let me know if it is ready to be reviewed again. thanks!

nemoramo · 2026-03-11T03:16:58Z

ok! When it's done, I will notify you asap.

nemoramo · 2026-03-12T10:04:42Z

Thanks for waiting. I have checked this branch and I believe it is ready now, @linyueqian.

nemoramo · 2026-03-12T10:05:07Z

@claude can you also help review this again

nemoramo · 2026-03-12T10:08:39Z

@codex please also review this?

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 813dc0e16f

chatgpt-codex-connector · 2026-03-12T10:21:45Z

+            flat = prompt_token_ids[0] if isinstance(prompt_token_ids[0], list) else prompt_token_ids
+            if len(flat) == 0:


Split prompt_token_ids by request before decoding

This helper collapses batched prompt_token_ids to a single 1-D sequence (prompt_token_ids[0] for list inputs), so when stage-1 runs with more than one request in a batch, requests after index 0 will decode from the wrong codec stream. This corrupts per-request audio output as soon as max_batch_size is increased beyond 1 for throughput.
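
A minimal sketch of per-request normalization, assuming the input is either a flat token sequence (one request) or a list of token sequences (one per request). `split_by_request` is a hypothetical helper name, not a function from the PR.

```python
from typing import List, Sequence, Union

TokenBatch = Union[Sequence[int], Sequence[Sequence[int]]]


def split_by_request(prompt_token_ids: TokenBatch) -> List[List[int]]:
    """Return one token list per request instead of keeping only index 0."""
    if len(prompt_token_ids) == 0:
        return []
    if isinstance(prompt_token_ids[0], (list, tuple)):
        # Batched input: preserve every request's own codec stream.
        return [list(seq) for seq in prompt_token_ids]
    # Flat input is treated as a single request.
    return [list(prompt_token_ids)]
```

Each returned list can then be decoded on its own, so a request at index 1 or later never reads the tokens of request 0.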

chatgpt-codex-connector · 2026-03-12T10:21:46Z

+        del positions, intermediate_tensors, inputs_embeds
+
+        sampling_metadata = kwargs.get("sampling_metadata")
+        token, is_dummy_profile = self._build_decode_tokens(input_ids, sampling_metadata)


Decode batched code2wav inputs per request

The forward path builds a single token tensor and emits one concatenated waveform, without using per-request boundaries, so batched requests are merged into one utterance instead of producing isolated outputs. If operators raise stage-1 batching (runtime.max_batch_size > 1), this mixes users’ codec tokens and returns incorrect audio for all requests in that batch.
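
If the runner can supply per-request token counts (an assumption; how those boundaries are exposed in vLLM-Omni is not shown in this thread), the flat token stream could be sliced before decoding. The sketch below uses a hypothetical `decode` callable in place of the real code2wav model.

```python
from typing import Callable, List, Sequence


def decode_per_request(
    flat_tokens: Sequence[int],
    tokens_per_request: Sequence[int],
    decode: Callable[[List[int]], List[float]],
) -> List[List[float]]:
    """Slice a concatenated token stream and decode each request separately."""
    if sum(tokens_per_request) != len(flat_tokens):
        raise ValueError("request lengths do not cover the token stream")
    outputs: List[List[float]] = []
    start = 0
    for count in tokens_per_request:
        # Each request only ever sees its own slice of codec tokens.
        outputs.append(decode(list(flat_tokens[start:start + count])))
        start += count
    return outputs


# Toy decoder standing in for code2wav: one sample per token.
waveforms = decode_per_request([2, 4, 6], [2, 1], lambda toks: [t * 0.5 for t in toks])
```

The result is one waveform per request rather than a single concatenated utterance.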

linyueqian · 2026-03-13T03:17:19Z

Tested E2E with local checkpoints (~/ckpt/Fun-Audio-Chat-8B + ~/ckpt/Fun-CosyVoice3-0.5B-2512). Audio generation works after fixing the issues below. Nice work getting the full pipeline wired up!

Bugs found

1. `defer_finalize` kwarg doesn't exist in upstream vLLM

Both gpu_ar_model_runner.py:297 and gpu_generation_model_runner.py:283 call:

self.maybe_get_kv_connector_output(scheduler_output, defer_finalize=defer_finalize)

But the installed vLLM's KVConnectorModelRunnerMixin.maybe_get_kv_connector_output() accepts clear_metadata, not defer_finalize. This crashes on every inference call.

Fix: clear_metadata=not defer_finalize
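
To make the inversion explicit, here is a self-contained sketch. `FakeKVConnectorMixin` is a stand-in that models only the keyword name reported above; its default value and return type are assumptions and do not reflect the real vLLM mixin.

```python
class FakeKVConnectorMixin:
    """Stand-in for the upstream mixin; only the keyword name is modelled."""

    def maybe_get_kv_connector_output(self, scheduler_output, clear_metadata=True):
        # The real method returns connector output; this one echoes its inputs.
        return {"scheduler_output": scheduler_output, "clear_metadata": clear_metadata}


class Runner(FakeKVConnectorMixin):
    def run(self, scheduler_output, defer_finalize=False):
        # Deferring finalization means the metadata must not be cleared yet,
        # so the flag is passed inverted under the keyword the callee accepts.
        return self.maybe_get_kv_connector_output(
            scheduler_output, clear_metadata=not defer_finalize
        )


result = Runner().run("step-0", defer_finalize=True)
```

Passing `defer_finalize=` directly to the stand-in would raise a TypeError, which matches the crash described above.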

2. Stage config auto-detection fails

resolve_model_config_path() looks for {model_type}.yaml in stage_configs/, which resolves to funaudiochat.yaml. But the PR names the file funaudiochat_s2s.yaml, so auto-detection silently falls back to a single-stage config.

Fix: Rename to funaudiochat.yaml, or add a mapping for the funaudiochat model type.
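
The mapping option could look roughly like the sketch below. `find_stage_config` and `STAGE_CONFIG_ALIASES` are hypothetical names; the real resolve_model_config_path() may be structured differently.

```python
from pathlib import Path
from typing import Dict, Optional

# Hypothetical alias table: model_type -> stage config file name.
STAGE_CONFIG_ALIASES: Dict[str, str] = {"funaudiochat": "funaudiochat_s2s.yaml"}


def find_stage_config(model_type: str, stage_configs_dir: Path) -> Optional[Path]:
    """Try `{model_type}.yaml` first, then an explicit alias if one exists."""
    candidates = [f"{model_type}.yaml"]
    if model_type in STAGE_CONFIG_ALIASES:
        candidates.append(STAGE_CONFIG_ALIASES[model_type])
    for name in candidates:
        path = stage_configs_dir / name
        if path.is_file():
            return path
    # Returning None lets the caller report the miss instead of silently
    # falling back to a single-stage config.
    return None
```

Renaming the file to funaudiochat.yaml avoids the table entirely, which is the smaller change if no other model needs an alias.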

3. `language_model_only: true` strips the discrete audio tower needed for S2S

The bundled YAML sets language_model_only: true, which replaces both continuous_audio_tower and audio_tower with StageMissingLayer. But funaudiochat.py:434 calls self.audio_tower() during speech generation - the discrete encoder is essential for CRQ decoding.

The underlying issue is that the continuous audio tower's profiler generates 300s dummy audio (from max_source_positions=1500), which triggers a flash_attn v2 requirement. Workaround: use hf_overrides: {"audio_config": {"max_source_positions": 100}} with limit_mm_per_prompt: {"audio": 1} to keep the towers loaded while keeping dummy audio short enough to avoid flash_attn.
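
For reference, the workaround values are collected below as a plain Python dict, together with a rough estimate of the resulting dummy-audio length. Where these keys sit in the stage YAML is not shown in this thread, and the estimate assumes the dummy audio scales linearly with max_source_positions (300 s at 1500 positions, per the figures above); both points are assumptions.

```python
# Values from the review comment; their placement in the stage config is assumed.
stage0_overrides = {
    "hf_overrides": {"audio_config": {"max_source_positions": 100}},
    "limit_mm_per_prompt": {"audio": 1},
}

REF_POSITIONS = 1500  # default max_source_positions quoted in the review
REF_SECONDS = 300     # dummy audio length the profiler generated at that default


def dummy_audio_seconds(max_source_positions: int) -> float:
    """Estimate profiling audio length, assuming linear scaling."""
    return max_source_positions * REF_SECONDS / REF_POSITIONS


positions = stage0_overrides["hf_overrides"]["audio_config"]["max_source_positions"]
estimated = dummy_audio_seconds(positions)
```

Under that linear assumption, the override shortens the profiling audio from 300 s to roughly 20 s.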

4. Bundled YAML uses HF repo IDs

The stage config references FunAudioLLM/Fun-Audio-Chat-8B and FunAudioLLM/Fun-CosyVoice3-0.5B-2512. These won't resolve for users with local checkpoints. Consider documenting that users need to override model paths.

nemoramo · 2026-03-13T09:21:17Z

Thanks for your thorough review.
I believe these issues have been fixed. Could you review this again? Thanks very much! @linyueqian

@codex please review this again, following the vllm-omni-contrib rules.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 600f45bd90

chatgpt-codex-connector · 2026-03-13T09:29:36Z

+        local_path = Path(model_path)
+        if local_path.exists():


Expand user path before probing local checkpoint

_resolve_model_path tests Path(model_path).exists() without expanding ~, so a valid local override like ~/ckpt/Fun-CosyVoice3-0.5B-2512 is treated as missing and the code falls through to snapshot_download(model_path). In stage YAML/CLI workflows this causes local-checkpoint runs to fail with a repo/path resolution error even though the directory exists; expanding user/env components before the existence check would keep local S2S setups working.
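
A minimal sketch of the expansion step using only the standard library. `expand_local_path` and `is_local_checkpoint` are hypothetical helper names, not functions from the PR.

```python
import os
from pathlib import Path


def expand_local_path(model_path: str) -> Path:
    """Expand `~` and environment variables before touching the filesystem."""
    return Path(os.path.expandvars(os.path.expanduser(model_path)))


def is_local_checkpoint(model_path: str) -> bool:
    """True when the expanded path exists locally; otherwise treat it as a repo ID."""
    return expand_local_path(model_path).exists()
```

Only when `is_local_checkpoint` returns False would the code fall through to snapshot_download(model_path).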

linyueqian · 2026-03-18T18:57:05Z

resolve conflicts please.

linyueqian · 2026-03-18T18:58:17Z

Could you also add the corresponding buildkite CI entries (test-merge.yml / test-nightly.yml) so these tests actually run on L4? See #1911 for reference on how the Qwen3-TTS tests are wired up.

linyueqian · 2026-03-24T04:48:11Z

@nemoramo please rebase and fix the conflicts.

nemoramo requested a review from hsliuustc0106 as a code owner March 9, 2026 08:28

nemoramo mentioned this pull request Mar 9, 2026

how to achieve acceleration for the S2S Speech-to-Speech mode？ FunAudioLLM/Fun-Audio-Chat#45

Open

chatgpt-codex-connector Bot reviewed Mar 9, 2026

nemoramo mentioned this pull request Mar 9, 2026

support funaudiochat ST2T mode inference-only in vLLM, speed up around 20x+ in short audios (1-30s) ,and around 50x+ in long audios FunAudioLLM/Fun-Audio-Chat#39

Open

nemoramo force-pushed the support-funaudiochat-s2s branch 2 times, most recently from 5910a38 to 3807c1b Compare March 9, 2026 09:33

nemoramo mentioned this pull request Mar 9, 2026

[RFC][Model] Add Fun-Audio-Chat-8B Support #452

Closed

5 tasks

linyueqian self-requested a review March 9, 2026 18:43

nemoramo force-pushed the support-funaudiochat-s2s branch 2 times, most recently from 1768dd8 to b4d8620 Compare March 10, 2026 02:13

nemoramo force-pushed the support-funaudiochat-s2s branch 2 times, most recently from 4921519 to b6ae819 Compare March 11, 2026 02:31

nemoramo force-pushed the support-funaudiochat-s2s branch 9 times, most recently from 0a7d6df to fe924de Compare March 12, 2026 04:49

chatgpt-codex-connector Bot reviewed Mar 12, 2026

nemoramo added 7 commits March 13, 2026 08:08

Support FunAudioChat S2S in vLLM-Omni

9a71146

Signed-off-by: ramos.ma <wyrmyf@gmail.com> Signed-off-by: mayufeng <mayufeng@asr-h100>

Fix FunAudioChat remote validation regressions

d51fb28

Signed-off-by: ramos.ma <wyrmyf@gmail.com>

Add FunAudioChat sampler regression test

573e4ce

Signed-off-by: ramos.ma <wyrmyf@gmail.com>

Reduce FunAudioChat runner scope

925197e

Signed-off-by: mayufeng <mayufeng@asr-h100>

Trim nonessential runner test scaffolding

2105386

Signed-off-by: mayufeng <mayufeng@asr-h100>

Reduce shared runner scope for FunAudioChat

0a7485c

Signed-off-by: mayufeng <mayufeng@asr-h100>

Trim shared runner scope for FunAudioChat

567e6d5

Signed-off-by: ramos.ma <wyrmyf@gmail.com> Signed-off-by: mayufeng <mayufeng@asr-h100>

nemoramo force-pushed the support-funaudiochat-s2s branch from fdb750d to 567e6d5 Compare March 13, 2026 00:10

nemoramo added 2 commits March 13, 2026 14:04

Address latest FunAudioChat review fixes

f0fb323

Signed-off-by: mayufeng <mayufeng@asr-h100>

Move FunAudioChat local checkpoint note to stage config

600f45b

Signed-off-by: ramos.ma <wyrmyf@gmail.com>

chatgpt-codex-connector Bot reviewed Mar 13, 2026

Gaohan123 added this to the v0.18.0 milestone Mar 13, 2026

linyueqian mentioned this pull request Mar 18, 2026

[RFC]: TTS Development Roadmap - March 2026 #1795

Open

mayufeng and others added 6 commits March 20, 2026 22:48

Merge upstream/main into support-funaudiochat-s2s for FunAudioChat

50ccc73

Signed-off-by: mayufeng <mayufeng@example.com>

Restore serving compatibility after upstream merge

63c592d

Signed-off-by: mayufeng <mayufeng@example.com>

Fix KV connector finalize API after upstream merge

edd50c0

Signed-off-by: mayufeng <mayufeng@example.com>

Preserve stage-specific model paths during stage init

7757604

Signed-off-by: mayufeng <mayufeng@example.com>

Fix generation KV connector API after upstream merge

f821a27

Signed-off-by: mayufeng <mayufeng@example.com>

Fix FunAudioChat conflict resolution regressions

1076195

Signed-off-by: ramos.ma <wyrmyf@gmail.com>

nemoramo force-pushed the support-funaudiochat-s2s branch from 0013f49 to 1076195 Compare March 20, 2026 23:29

linyueqian removed this from the v0.18.0 milestone Mar 24, 2026
