[RFC][Model] Add Fun-Audio-Chat-8B Support by JaredforReal · Pull Request #452 · vllm-project/vllm-omni

JaredforReal · 2025-12-24T05:45:08Z

This PR adds support for Fun-Audio-Chat-8B, an omni-modal model that supports speech-to-text (S2T) and speech-to-speech (S2S) capabilities.

Model: FunAudioLLM/Fun-Audio-Chat-8B

Current Status

Phase 1 (S2T) - ✅ Completed

Ported FunAudioChatAudioEncoder (Whisper-like, 32-layer transformer)
Ported FunAudioChatDiscreteEncoder (embedding + group averaging)
Implemented main model class with Qwen3 backbone integration
Basic tests passing
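
A minimal sketch of what "embedding + group averaging" could look like, written in plain Python rather than PyTorch so that it runs standalone. The function name `group_average`, the group size, and the handling of a trailing partial group are assumptions for illustration; they are not taken from the ported FunAudioChatDiscreteEncoder.

```python
from typing import List


def group_average(embeddings: List[List[float]], group_size: int) -> List[List[float]]:
    """Average consecutive groups of `group_size` embedding vectors.

    Assumption: a trailing partial group is averaged over the frames it
    actually contains instead of being padded or dropped.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    pooled = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        dim = len(group[0])
        # Element-wise mean over the vectors in this group.
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


# Five hypothetical 2-dimensional token embeddings, pooled in groups of two.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
print(group_average(frames, 2))  # [[2.0, 3.0], [6.0, 7.0], [9.0, 10.0]]
```

Pooling this way shortens the sequence by roughly the group size, which is why the group size matters when aligning the discrete features with other audio features.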

Phase 2 (S2S) - ✅ Completed

FunAudioChatDecoder (CRQ Transformer for speech token generation)
CosyVoice3 integration for token-to-waveform synthesis

Phase 3 (E2E tests and Docs) - WIP

Questions for Discussion

With Phase 2, I'd like feedback on:

Stage Architecture: Should S2S use a 2-stage pipeline (LLM → CRQ+CosyVoice, or LLM+CRQ → CosyVoice) or a 3-stage pipeline (LLM → CRQ → CosyVoice, current)?
CosyVoice3 Model Loading: The model requires Fun-CosyVoice3-0.5B-2512 as the vocoder. What's the preferred approach?
- Auto-download from HuggingFace (current)
- Require users to specify a local path
- Bundle path in config
GPU Memory: CosyVoice3 needs ~1GB VRAM. Should it share a GPU with the LLM, or should separate GPU allocation be supported?

hsliuustc0106 · 2025-12-24T06:54:06Z

@linyueqian PTAL

JaredforReal · 2025-12-24T10:10:10Z

In stage_configs/, we have two new YAML files: fun_audio_chat_s2s.yaml and fun_audio_chat.yaml.
Should I name them:
- fun_audio_chat_s2s.yaml and fun_audio_chat_s2t.yaml (which is more trivial), or
- fun_audio_chat_multiconnector.yaml and fun_audio_chat.yaml (to align with current files)?
WDYT @hsliuustc0106 @linyueqian

hsliuustc0106 · 2025-12-24T13:38:54Z

In stage_configs/, we have two new YAML files: fun_audio_chat_s2s.yaml and fun_audio_chat.yaml. Should I name them fun_audio_chat_s2s.yaml and fun_audio_chat_s2t.yaml (which is more trivial), or fun_audio_chat_multiconnector.yaml and fun_audio_chat.yaml (to align with current files)? WDYT @hsliuustc0106 @linyueqian

I vote for the trivial one

Signed-off-by: JaredforReal <w13431838023@gmail.com>

- Implement FunAudioChatForConditionalGeneration main model
- Integrate both audio encoders
- Use vLLM Qwen3ForCausalLM for language model
- Dual-resolution audio representation fusion
- Complete weight loading logic
- Add multimodal processor for audio inputs
- Update registry and stage configuration

S2T mode only (audio_invert_tower disabled). CosyVoice integration for S2S planned in Phase 2.

Signed-off-by: JaredforReal <w13431838023@gmail.com>

Verified import, registry, audio_encoders, hf_processor and hf_config Signed-off-by: JaredforReal <w13431838023@gmail.com>

Add Speech-to-Speech (S2S) support for Fun-Audio-Chat with 3-stage pipeline:
- Stage 0 (Main): Audio understanding + Text generation with latent output
- Stage 1 (CRQ Decoder): Hidden states → Speech tokens (25Hz)
- Stage 2 (CosyVoice): Speech tokens → Audio waveform (24kHz)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
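
The two rates quoted in this commit message imply a fixed nominal ratio between speech tokens and output samples. The sketch below only does that arithmetic; the function names are hypothetical, and it assumes the vocoder adds no padding or trimming, which the commit message does not state.

```python
TOKEN_RATE_HZ = 25        # speech tokens per second (Stage 1 output, from the commit message)
SAMPLE_RATE_HZ = 24_000   # waveform samples per second (Stage 2 output, from the commit message)


def samples_per_token() -> int:
    """Nominal number of waveform samples covered by one speech token."""
    return SAMPLE_RATE_HZ // TOKEN_RATE_HZ


def duration_seconds(num_speech_tokens: int) -> float:
    """Nominal audio duration for a given number of speech tokens."""
    return num_speech_tokens / TOKEN_RATE_HZ


def expected_waveform_length(num_speech_tokens: int) -> int:
    """Nominal waveform length in samples (assumes no padding or trimming)."""
    return num_speech_tokens * samples_per_token()


print(samples_per_token())            # 960
print(duration_seconds(250))          # 10.0
print(expected_waveform_length(250))  # 240000
```

A check like this can be handy in an end-to-end test to confirm that the Stage 2 output length is in the expected range for the number of tokens Stage 1 produced.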

Verified imports, registry, stage config, stage processors and CRQ decoder architecture Signed-off-by: JaredforReal <w13431838023@gmail.com>

Register custom model types (like 'funaudiochat') to Transformers' CONFIG_MAPPING so that AutoConfig.from_pretrained() can recognize them even without auto_map in the model's config.json. Signed-off-by: JaredforReal <w13431838023@gmail.com>

- Fix FunAudioChatMultiModalProcessor to inherit from BaseMultiModalProcessor
- Fix FunAudioChatProcessingInfo to inherit from BaseProcessingInfo
- Fix FunAudioChatDummyInputsBuilder to inherit from BaseDummyInputsBuilder
- Add configuration_fun_audio_chat.py with proper FunAudioChatConfig
- Add processing_fun_audio_chat.py for HF processor registration
- Update arg_utils.py to register both Config and Processor to Transformers
- Update run scripts to auto-detect local pretrained models
- Fix end2end.py stage config path
- Add compute_logits() and sample() methods

Signed-off-by: JaredforReal <w13431838023@gmail.com>

Major changes:
- Adapt embed_multimodal to vLLM v1 interface (**kwargs signature)
- Implement _get_mm_fields_config and _get_prompt_updates for proper batching
- Add handle_oov_mm_token parameter to embed_input_ids
- Support both packed and legacy audio feature formats
- Add chunking strategy to continuous audio encoder
- Fix feature_exist_mask 2D->1D squeeze in discrete encoder
- Update example scripts with proper Fun-Audio-Chat prompts
- Add unit tests for processor and field config

Known issue: S2T output still garbled - root cause under investigation

Signed-off-by: JaredforReal <w13431838023@gmail.com>

## Bugs Fixed

### Bug 1: Attention Output Reshape in Continuous Encoder

**File**: vllm_omni/model_executor/models/fun_audio_chat/audio_encoder.py (Line ~109)
**Issue**: Multi-head attention output was missing transpose before reshape
**Root Cause**: Attention output shape was (batch, heads, seq, head_dim) but reshape expected (batch, seq, heads, head_dim)
**Fix**: Added .transpose(1, 2) before reshape:
- Before: attn_output.reshape(seq_length, -1)
- After: attn_output.transpose(1, 2).reshape(seq_length, -1).contiguous()
**Impact**: Max difference reduced from 3.17 to 0.0625 (bfloat16 precision limit)
**Validation**: Unit test test_continuous_encoder_detailed() confirmed fix via layer-by-layer comparison

### Bug 2: Language Model Head Weight Tying

**File**: vllm_omni/model_executor/models/fun_audio_chat/fun_audio_chat.py (Line 436)
**Issue**: text_config.tie_word_embeddings incorrectly set to True, causing lm_head to share weights with embed_tokens
**Root Cause**: vLLM's config initialization mutated text_config during model setup
**Impact**: Language model computed wrong logits - predicted token 110153 ('悲剧') instead of 77045 ('Absolutely')
**Fix**: Force text_config.tie_word_embeddings = False before vLLM model initialization
**Validation**:
- HF lm_head weights: mean=-0.000224, std=0.025269 (correct)
- Before fix: vLLM lm_head shared embed_tokens weights
- After fix: vLLM lm_head has separate weights matching HF
- S2T inference now produces correct output: 'Absolutely, music can be a powerful tool for calming your mind...'

## Testing Summary

- Created comprehensive unit test suite (tests/fun_audio_chat/test_unit_comparison.py)
- 8 test functions with layer-by-layer diagnostics
- All tests passing with bfloat16 precision tolerance
- Full pipeline output matches HF reference within expected precision limits
- S2T offline inference verified working correctly

## Code Quality

- Removed all debug logging from production code
- Translated all Chinese comments to English
- Fixed import ordering per linting standards
- Added memory configuration parameters (max-model-len, gpu-memory-utilization) to end2end.py

Signed-off-by: JaredforReal <w13431838023@gmail.com>
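
To illustrate why the transpose in Bug 1 matters, here is a small standalone sketch in plain Python (not the PyTorch code from audio_encoder.py). It drops the batch dimension, so the layout is (heads, seq, head_dim) and the axis swap is between axes 0 and 1 rather than 1 and 2. All helper names are hypothetical. Each element is labelled with its (head, token, dim) origin so that the rows produced by the reshape can be inspected.

```python
from itertools import chain

HEADS, SEQ, HEAD_DIM = 2, 3, 2

# attn[h][s][d] holds the label (h, s, d), so we can see where each element lands.
attn = [[[(h, s, d) for d in range(HEAD_DIM)] for s in range(SEQ)] for h in range(HEADS)]


def flatten(nested):
    """Flatten a 3-level nested list in row-major order, as a reshape would read it."""
    return list(chain.from_iterable(chain.from_iterable(nested)))


def reshape_rows(flat, rows):
    """Split a flat list into `rows` equal-width rows, like reshape(rows, -1)."""
    width = len(flat) // rows
    return [flat[r * width:(r + 1) * width] for r in range(rows)]


def swap_first_two_axes(nested):
    """[h][s][d] -> [s][h][d], the batch-free analogue of transpose(1, 2)."""
    return [[nested[h][s] for h in range(len(nested))] for s in range(len(nested[0]))]


def tokens_in_row(row):
    """Set of token indices whose features ended up in this output row."""
    return {s for (_, s, _) in row}


buggy = reshape_rows(flatten(attn), SEQ)                       # reshape without transpose
fixed = reshape_rows(flatten(swap_first_two_axes(attn)), SEQ)  # transpose, then reshape

print([sorted(tokens_in_row(r)) for r in buggy])  # [[0, 1], [0, 2], [1, 2]]
print([sorted(tokens_in_row(r)) for r in fixed])  # [[0], [1], [2]]
```

Without the swap, every output row mixes features from two different tokens; with it, row i contains only the heads of token i, which is what the downstream projection expects.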

hsliuustc0106 · 2025-12-31T15:42:56Z

Please check whether this model can be merged before the end of this week; we are going to release our next version. @linyueqian @JaredforReal

JaredforReal · 2025-12-31T16:01:02Z

@hsliuustc0106 I'll do my best

…nfig_path Signed-off-by: JaredforReal <w13431838023@gmail.com>

…list) Signed-off-by: JaredforReal <w13431838023@gmail.com>

tensorflowt · 2026-01-04T07:02:05Z

@hsliuustc0106 do my best

@hsliuustc0106 Excuse me, is this feature currently available?

hsliuustc0106 · 2026-01-04T07:12:08Z

@hsliuustc0106 do my best

@hsliuustc0106 Excuse me, is this feature currently available?

which feature?

tensorflowt · 2026-01-04T08:31:04Z

@hsliuustc0106 do my best

@hsliuustc0106 Excuse me, is this feature currently available?

which feature?

Fun-Audio-Chat-8B S2S inference. Are there any example inference commands? Can I use this pull request directly after building from source?

JaredforReal · 2026-01-04T08:36:52Z

@tensorflowt S2T is available now; I'm working on getting S2S right.

hsliuustc0106 · 2026-01-04T14:10:23Z

@Bounty-hunter PTAL

JaredforReal · 2026-01-04T15:17:27Z

So sorry, maybe I should clean up the code and deliver S2T first, then complete the S2S stage in a follow-up PR?

hsliuustc0106 · 2026-01-05T00:55:06Z

So sorry, maybe I should clean up the code, and deliver s2t first. I will complete the s2s stage in follow-up PR?

Yes, let's split the huge PR into smaller PRs, but please take care of the tests.

siddharth1712 · 2026-01-15T05:31:46Z

Hi,

Just wanted to check whether this implementation is complete. Can one run the Fun-Audio-Chat model with vLLM-Omni for S2S?

lishunyang12 · 2026-02-21T07:55:34Z

@JaredforReal This is a big one — S2T phase looks complete with the audio encoder and discrete encoder ported, and S2S with CosyVoice is in progress. What's blocking the S2S side? Is it the CRQ decoder integration or something else? Would be great to get this across the finish line.

nemoramo · 2026-03-09T15:25:26Z

#1748

Signed-off-by: Jared Wen <w13431838023@gmail.com>

linyueqian · 2026-03-12T15:19:59Z

Hi @JaredforReal, thanks for the work on this! It looks like #1748 also adds Fun-Audio-Chat S2S support. Could you take a look at that PR and help test it? Would be great to consolidate efforts and avoid duplicated work.

JaredforReal · 2026-03-13T02:30:03Z

@hsliuustc0106 @lishunyang12 So sorry for the huge delay. I am cleaning up the code and testing S2T mode, and will hand the half-finished project over to @nemoramo and help test the S2S mode.

JaredforReal mentioned this pull request Dec 24, 2025

[New Model]: Fun-Audio-Chat support #436

Open

1 task

hsliuustc0106 mentioned this pull request Dec 25, 2025

[Roadmap]: preparing for v0.12.0 release #165

Closed

61 tasks

linyueqian reviewed Dec 25, 2025

Comment thread vllm_omni/model_executor/models/fun_audio_chat/cosyvoice.py

linyueqian reviewed Dec 25, 2025

Comment thread vllm_omni/model_executor/models/fun_audio_chat/audio_encoder.py Outdated

JaredforReal added 16 commits December 31, 2025 23:15

file init

5c19a70

Signed-off-by: JaredforReal <w13431838023@gmail.com>

test(fun-audio-chat): implement S2T stage tests

e30a234

Verified import, registry, audio_encoders, hf_processor and hf_config Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix(fun-audio-chat): add type annotations to fix strict mkdocs build

b14b112

Signed-off-by: JaredforReal <w13431838023@gmail.com>

test(fun-audio-chat): implement S2S stage tests

3772d8a

Verified imports, registry, stage config, stage processors and CRQ decoder architecture Signed-off-by: JaredforReal <w13431838023@gmail.com>

add offline end2end and online serving example

37e30b2

Signed-off-by: JaredforReal <w13431838023@gmail.com>

simplify s2s config

999c975

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix crq_decoder logic

daf0e74

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix: crq_decoder weight loading returns correct prefix names

bdc94fb

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix: ensure CRQ decoder returns tensors on correct device

cab7336

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix: return correct shape hidden_states in CRQ decoder dummy run

f0d7704

Signed-off-by: JaredforReal <w13431838023@gmail.com>

JaredforReal force-pushed the feat/fun_audio_chat branch from d87fa28 to d6e78cb Compare December 31, 2025 15:15

hsliuustc0106 mentioned this pull request Dec 31, 2025

[Model] Fun cosy voice3-0.5-b-2512 #498

Merged

5 tasks

fix: convert bfloat16 to float32 before numpy serialization

a933fe3

Signed-off-by: JaredforReal <w13431838023@gmail.com>

JaredforReal force-pushed the feat/fun_audio_chat branch from d6e78cb to a933fe3 Compare December 31, 2025 16:27

fix: add stage_configs_path to engine_args to bypass resolve_model_co…

00d3c9e

…nfig_path Signed-off-by: JaredforReal <w13431838023@gmail.com>

JaredforReal added 3 commits January 1, 2026 00:54

did some utils changes to allow custom model to be recognized

a3c0235

Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix: remove text_response from additional_information (not Tensor or …

992521a

…list) Signed-off-by: JaredforReal <w13431838023@gmail.com>

fix: use CRQ BOS token instead of text tokens for CRQ decoder prompt

5be2904

Signed-off-by: JaredforReal <w13431838023@gmail.com>

david6666666 mentioned this pull request Jan 16, 2026

vLLM-Omni Model Support #808

Open

63 tasks

JaredforReal added 3 commits March 12, 2026 10:53

Merge branch 'main' into feat/fun_audio_chat

127364e

Signed-off-by: Jared Wen <w13431838023@gmail.com>

update configuration

b521db2

Signed-off-by: JaredforReal <w13431838023@gmail.com>

clean up processor

2b79ee4

Signed-off-by: JaredforReal <w13431838023@gmail.com>

JaredforReal added 9 commits March 13, 2026 11:49

update encoder attention

96b75ae

Signed-off-by: JaredforReal <w13431838023@gmail.com>

Merge branch 'main' into feat/fun_audio_chat

68d096b

align get_text_config

2f7db28

Signed-off-by: JaredforReal <w13431838023@gmail.com>

change super() position

2cfcac9

Signed-off-by: JaredforReal <w13431838023@gmail.com>

processor_dict in _get_arguments_from_pretrained

c5e9553

Signed-off-by: JaredforReal <w13431838023@gmail.com>

reduce gpu memory utilization

d75a878

Signed-off-by: JaredforReal <w13431838023@gmail.com>

update BaseDummyInputsBuilder position

07e4882

Signed-off-by: JaredforReal <w13431838023@gmail.com>

debug process input additional information

b366894

Signed-off-by: JaredforReal <w13431838023@gmail.com>

add solution for None, ndarray, tuple for additional information

165fe40

Signed-off-by: JaredforReal <w13431838023@gmail.com>

JaredforReal closed this Mar 20, 2026

JaredforReal deleted the feat/fun_audio_chat branch March 27, 2026 02:51


7 participants
