[RFC][Model] Add Fun-Audio-Chat-8B Support #452

Closed
JaredforReal wants to merge 35 commits into vllm-project:main from JaredforReal:feat/fun_audio_chat

Conversation

@JaredforReal
Contributor

@JaredforReal JaredforReal commented Dec 24, 2025

This PR adds support for Fun-Audio-Chat-8B, an omni-modal model that supports speech-to-text (S2T) and speech-to-speech (S2S) capabilities.

Model: FunAudioLLM/Fun-Audio-Chat-8B

Current Status

Phase 1 (S2T) - ✅ Completed

  • Ported FunAudioChatAudioEncoder (Whisper-like, 32-layer transformer)
  • Ported FunAudioChatDiscreteEncoder (embedding + group averaging)
  • Implemented main model class with Qwen3 backbone integration
  • Basic tests passing
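
The "embedding + group averaging" step of the discrete encoder can be pictured roughly as below. This is only a minimal sketch: the function name, the group size, and the handling of a trailing partial group are assumptions made for illustration and are not taken from the ported FunAudioChatDiscreteEncoder.

```python
def group_average(embeddings, group_size):
    """Average consecutive embedding vectors in groups of `group_size`.

    `embeddings` is a list of equal-length float lists (one per speech token).
    A trailing partial group is averaged over the tokens it contains
    (an assumption; a real encoder may pad or drop it instead).
    """
    pooled = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        dim = len(group[0])
        pooled.append([sum(vec[i] for vec in group) / len(group) for i in range(dim)])
    return pooled


# Toy example: 5 token embeddings of dimension 2, pooled in groups of 2
# (the group size here is illustrative only).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0], [5.0, 5.0]]
pooled = group_average(token_embeddings, group_size=2)
print(pooled)
```

Pooling this way shortens the sequence by roughly the group size while keeping the embedding dimension unchanged.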

Phase 2 (S2S) - ✅ Completed

  • FunAudioChatDecoder (CRQ Transformer for speech token generation)
  • CosyVoice3 integration for token-to-waveform synthesis

Phase 3 (E2E tests and docs) - WIP

Questions for Discussion

With Phase 2 in place, I'd like feedback on:

  1. Stage Architecture: Should S2S use a 2-stage pipeline (LLM → CRQ+CosyVoice, or LLM+CRQ → CosyVoice)
    or a 3-stage pipeline (LLM → CRQ → CosyVoice, current)?

  2. CosyVoice3 Model Loading: The model requires Fun-CosyVoice3-0.5B-2512 as the vocoder. What's the preferred approach?

    • Auto-download from HuggingFace (current)
    • Require users to specify a local path
    • Bundle the path in the config
  3. GPU Memory: CosyVoice3 needs ~1GB VRAM. Should it share a GPU with the LLM, or support separate GPU allocation?

Related



@hsliuustc0106
Collaborator

@linyueqian PTAL

@JaredforReal
Contributor Author

In stage_configs/, we have two new YAML files, fun_audio_chat_s2s.yaml and fun_audio_chat.yaml.
Should I name them fun_audio_chat_s2s.yaml and fun_audio_chat_s2t.yaml (which is more trivial), or
fun_audio_chat_multiconnector.yaml and fun_audio_chat.yaml (aligned with the current files)?
WDYT @hsliuustc0106 @linyueqian

@hsliuustc0106
Collaborator

In stage_configs/, we have two new YAML files, fun_audio_chat_s2s.yaml and fun_audio_chat.yaml. Should I name them fun_audio_chat_s2s.yaml and fun_audio_chat_s2t.yaml (which is more trivial), or fun_audio_chat_multiconnector.yaml and fun_audio_chat.yaml (aligned with the current files)? WDYT @hsliuustc0106 @linyueqian

I vote for the trivial one

Comment thread vllm_omni/model_executor/models/fun_audio_chat/cosyvoice.py
Comment thread vllm_omni/model_executor/models/fun_audio_chat/audio_encoder.py Outdated
Signed-off-by: JaredforReal <w13431838023@gmail.com>
- Implement FunAudioChatForConditionalGeneration main model
  - Integrate both audio encoders
  - Use vLLM Qwen3ForCausalLM for language model
  - Dual-resolution audio representation fusion
  - Complete weight loading logic

- Add multimodal processor for audio inputs
- Update registry and stage configuration

S2T mode only (audio_invert_tower disabled).
CosyVoice integration for S2S planned in Phase 2.

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Verified import, registry, audio_encoders, hf_processor and hf_config

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Add Speech-to-Speech (S2S) support for Fun-Audio-Chat with 3-stage pipeline:
- Stage 0 (Main): Audio understanding + Text generation with latent output
- Stage 1 (CRQ Decoder): Hidden states → Speech tokens (25Hz)
- Stage 2 (CosyVoice): Speech tokens → Audio waveform (24kHz)
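
The rates quoted for Stage 1 and Stage 2 imply a fixed number of waveform samples per speech token. A quick arithmetic sketch follows; the helper names are hypothetical, only the 25 Hz and 24 kHz figures come from the commit message, and it assumes the synthesized audio is exactly token-aligned with no padding or trimming.

```python
SPEECH_TOKEN_RATE_HZ = 25         # Stage 1 output rate (from the commit message)
WAVEFORM_SAMPLE_RATE_HZ = 24_000  # Stage 2 output rate (from the commit message)

# Each speech token accounts for this many waveform samples.
SAMPLES_PER_TOKEN = WAVEFORM_SAMPLE_RATE_HZ // SPEECH_TOKEN_RATE_HZ


def expected_waveform_samples(num_speech_tokens: int) -> int:
    """Waveform length implied by a given number of speech tokens."""
    return num_speech_tokens * SAMPLES_PER_TOKEN


def expected_duration_seconds(num_speech_tokens: int) -> float:
    """Audio duration implied by a given number of speech tokens."""
    return num_speech_tokens / SPEECH_TOKEN_RATE_HZ


print(SAMPLES_PER_TOKEN)
print(expected_waveform_samples(50), expected_duration_seconds(50))
```

A check like this can be handy in an end-to-end test to confirm that the Stage 2 output length is in the right ballpark for the number of tokens produced by Stage 1.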

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Verified imports, registry, stage config, stage processors and CRQ decoder architecture

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Register custom model types (like 'funaudiochat') to Transformers' CONFIG_MAPPING
so that AutoConfig.from_pretrained() can recognize them even without auto_map
in the model's config.json.

Signed-off-by: JaredforReal <w13431838023@gmail.com>
- Fix FunAudioChatMultiModalProcessor to inherit from BaseMultiModalProcessor
- Fix FunAudioChatProcessingInfo to inherit from BaseProcessingInfo
- Fix FunAudioChatDummyInputsBuilder to inherit from BaseDummyInputsBuilder
- Add configuration_fun_audio_chat.py with proper FunAudioChatConfig
- Add processing_fun_audio_chat.py for HF processor registration
- Update arg_utils.py to register both Config and Processor to Transformers
- Update run scripts to auto-detect local pretrained models
- Fix end2end.py stage config path
- Add compute_logits() and sample() methods

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Major changes:
- Adapt embed_multimodal to vLLM v1 interface (**kwargs signature)
- Implement _get_mm_fields_config and _get_prompt_updates for proper batching
- Add handle_oov_mm_token parameter to embed_input_ids
- Support both packed and legacy audio feature formats
- Add chunking strategy to continuous audio encoder
- Fix feature_exist_mask 2D->1D squeeze in discrete encoder
- Update example scripts with proper Fun-Audio-Chat prompts
- Add unit tests for processor and field config

Known issue: S2T output still garbled - root cause under investigation

Signed-off-by: JaredforReal <w13431838023@gmail.com>
## Bugs Fixed

### Bug 1: Attention Output Reshape in Continuous Encoder
**File**: vllm_omni/model_executor/models/fun_audio_chat/audio_encoder.py (Line ~109)
**Issue**: Multi-head attention output was missing transpose before reshape
**Root Cause**: Attention output shape was (batch, heads, seq, head_dim) but reshape expected (batch, seq, heads, head_dim)
**Fix**: Added .transpose(1, 2) before reshape:
  - Before: attn_output.reshape(seq_length, -1)
  - After: attn_output.transpose(1, 2).reshape(seq_length, -1).contiguous()
**Impact**: Max difference reduced from 3.17 to 0.0625 (bfloat16 precision limit)
**Validation**: Unit test test_continuous_encoder_detailed() confirmed fix via layer-by-layer comparison
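
To see why the missing transpose scrambles the output, here is a small pure-Python illustration. Nested lists stand in for tensors, the single-item batch dimension is omitted, and the sizes and helper names are made up for the example; it is not the encoder code itself.

```python
HEADS, SEQ_LEN, HEAD_DIM = 2, 3, 2

# Toy attention output laid out as (heads, seq, head_dim).
# Each label records which head and token position the value belongs to.
attn = [[[f"h{h}t{t}d{d}" for d in range(HEAD_DIM)]
         for t in range(SEQ_LEN)]
        for h in range(HEADS)]


def flatten(nested):
    """Flatten nested lists in row-major (memory) order."""
    if not isinstance(nested, list):
        return [nested]
    return [leaf for item in nested for leaf in flatten(item)]


def reshape_rows(flat, rows):
    """Row-major reshape of a flat list into `rows` equal-length rows."""
    width = len(flat) // rows
    return [flat[r * width:(r + 1) * width] for r in range(rows)]


# Buggy path: reshape (heads, seq, head_dim) straight to (seq, heads * head_dim).
wrong = reshape_rows(flatten(attn), SEQ_LEN)

# Fixed path: swap the heads and seq axes first, then reshape.
transposed = [[attn[h][t] for h in range(HEADS)] for t in range(SEQ_LEN)]
right = reshape_rows(flatten(transposed), SEQ_LEN)

print(wrong[0])  # mixes token positions t0 and t1 from head 0
print(right[0])  # every entry belongs to token position t0
```

Without the transpose, each output row is filled with values from neighbouring token positions of a single head instead of all heads for one position, which is consistent with the large error reported before the fix.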

### Bug 2: Language Model Head Weight Tying
**File**: vllm_omni/model_executor/models/fun_audio_chat/fun_audio_chat.py (Line 436)
**Issue**: text_config.tie_word_embeddings incorrectly set to True, causing lm_head to share weights with embed_tokens
**Root Cause**: vLLM's config initialization mutated text_config during model setup
**Impact**: Language model computed wrong logits - predicted token 110153 ('悲剧') instead of 77045 ('Absolutely')
**Fix**: Force text_config.tie_word_embeddings = False before vLLM model initialization
**Validation**:
  - HF lm_head weights: mean=-0.000224, std=0.025269 (correct)
  - Before fix: vLLM lm_head shared embed_tokens weights
  - After fix: vLLM lm_head has separate weights matching HF
  - S2T inference now produces correct output: 'Absolutely, music can be a powerful tool for calming your mind...'
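
The effect of the unwanted weight tying can be mimicked with a toy model. This sketch uses a plain namespace in place of the real text_config, tiny hand-written weight tables, and hypothetical helper names; only the tie_word_embeddings flag and the idea of forcing it to False before model construction come from the fix described above.

```python
from types import SimpleNamespace

# Toy weights (vocab=2, hidden=2). The checkpoint's lm_head is deliberately
# different from the embedding table, as it is for an untied model.
EMBED_TOKENS = [[1.0, 0.0], [0.0, 1.0]]
LM_HEAD = [[0.0, 1.0], [1.0, 0.0]]


def select_output_weights(text_config):
    """Mimic a model that reuses the embeddings as lm_head when tying is on."""
    return EMBED_TOKENS if text_config.tie_word_embeddings else LM_HEAD


def predict_token(hidden, text_config):
    """Greedy next-token prediction with the selected output projection."""
    weights = select_output_weights(text_config)
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in weights]
    return logits.index(max(logits))


def force_untied_lm_head(text_config):
    """Hypothetical guard: disable tying before the model is constructed."""
    text_config.tie_word_embeddings = False
    return text_config


hidden_state = [0.9, 0.1]
config = SimpleNamespace(tie_word_embeddings=True)  # state after the mutation

token_when_tied = predict_token(hidden_state, config)
token_when_untied = predict_token(hidden_state, force_untied_lm_head(config))
print(token_when_tied, token_when_untied)
```

The same hidden state yields different tokens depending on which projection is used, which is the toy analogue of predicting the wrong first token in the S2T output.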

## Testing Summary
- Created comprehensive unit test suite (tests/fun_audio_chat/test_unit_comparison.py)
- 8 test functions with layer-by-layer diagnostics
- All tests passing with bfloat16 precision tolerance
- Full pipeline output matches HF reference within expected precision limits
- S2T offline inference verified working correctly
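
One way to read the 0.0625 figure as a bfloat16 precision limit: bfloat16 keeps 7 explicit mantissa bits, so the gap between adjacent representable values in [8, 16) is 2^-4 = 0.0625. Whether the compared activations actually fall in that range is an assumption; the sketch below only shows the spacing arithmetic, and the function name is hypothetical.

```python
import math

BFLOAT16_MANTISSA_BITS = 7  # explicit fraction bits in the bfloat16 format


def bfloat16_spacing(x: float) -> float:
    """Gap between adjacent bfloat16 values around |x| (normal range, x != 0)."""
    _, exp = math.frexp(abs(x))  # abs(x) == m * 2**exp with 0.5 <= m < 1
    return 2.0 ** (exp - 1 - BFLOAT16_MANTISSA_BITS)


for value in (1.0, 8.0, 15.5, 16.0):
    print(value, bfloat16_spacing(value))
```

A tolerance derived this way scales with the magnitude of the values being compared, which is usually more robust in layer-by-layer comparisons than a single fixed absolute threshold.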

## Code Quality
- Removed all debug logging from production code
- Translated all Chinese comments to English
- Fixed import ordering per linting standards
- Added memory configuration parameters (max-model-len, gpu-memory-utilization) to end2end.py

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
@hsliuustc0106
Collaborator

hsliuustc0106 commented Dec 31, 2025

Please check whether this model can be merged before the end of this week; we are going to release our next version. @linyueqian @JaredforReal

@JaredforReal
Contributor Author

@hsliuustc0106 do my best

Signed-off-by: JaredforReal <w13431838023@gmail.com>
…nfig_path

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
…list)

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
@tensorflowt

@hsliuustc0106 do my best

@hsliuustc0106 Excuse me, is this feature currently available?

@hsliuustc0106
Collaborator

@hsliuustc0106 do my best

@hsliuustc0106 Excuse me, is this feature currently available?

which feature?

@tensorflowt

@hsliuustc0106 do my best

@hsliuustc0106 Excuse me, is this feature currently available?

which feature?

Fun-Audio-Chat-8B S2S inference. Are there any example inference commands? Can I use this pull request directly after compiling the source code?

@JaredforReal
Contributor Author

@tensorflowt S2T is available now; I'm working on getting S2S right.

@hsliuustc0106
Collaborator

@Bounty-hunter PTAL

@JaredforReal
Contributor Author

So sorry, maybe I should clean up the code and deliver S2T first, then complete the S2S stage in a follow-up PR?

@hsliuustc0106
Collaborator

So sorry, maybe I should clean up the code and deliver S2T first, then complete the S2S stage in a follow-up PR?

Yes, let's split this huge PR into smaller PRs, but please take care of the tests.

@siddharth1712

Hi,

Just wanted to check whether this implementation is complete, and whether one can run the Fun-Audio-Chat model with vLLM-Omni for S2S.

@david6666666 david6666666 mentioned this pull request Jan 16, 2026
63 tasks
@lishunyang12
Collaborator

@JaredforReal This is a big one — S2T phase looks complete with the audio encoder and discrete encoder ported, and S2S with CosyVoice is in progress. What's blocking the S2S side? Is it the CRQ decoder integration or something else? Would be great to get this across the finish line.

@nemoramo

nemoramo commented Mar 9, 2026

#1748

Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
@linyueqian
Collaborator

Hi @JaredforReal, thanks for the work on this! It looks like #1748 also adds Fun-Audio-Chat S2S support. Could you take a look at that PR and help test it? Would be great to consolidate efforts and avoid duplicated work.

@JaredforReal
Contributor Author

@hsliuustc0106 @lishunyang12 so sorry for the huge delay. I am cleaning up the code and testing S2T mode, and will hand the half-finished project over to @nemoramo and help test the S2S mode.

Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
@JaredforReal JaredforReal deleted the feat/fun_audio_chat branch March 27, 2026 02:51

Development

Successfully merging this pull request may close these issues.

[New Model]: Fun-Audio-Chat support

7 participants