[RFC][Model] Add Fun-Audio-Chat-8B Support #452
|
@linyueqian PTAL |
|
in |
I vote for the trivial one |
- Implement FunAudioChatForConditionalGeneration main model
- Integrate both audio encoders
- Use vLLM Qwen3ForCausalLM for language model
- Dual-resolution audio representation fusion
- Complete weight loading logic
- Add multimodal processor for audio inputs
- Update registry and stage configuration
S2T mode only (audio_invert_tower disabled). CosyVoice integration for S2S planned in Phase 2.
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Verified import, registry, audio_encoders, hf_processor and hf_config
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Add Speech-to-Speech (S2S) support for Fun-Audio-Chat with 3-stage pipeline:
- Stage 0 (Main): Audio understanding + Text generation with latent output
- Stage 1 (CRQ Decoder): Hidden states → Speech tokens (25Hz)
- Stage 2 (CosyVoice): Speech tokens → Audio waveform (24kHz)
Signed-off-by: JaredforReal <w13431838023@gmail.com>
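The rates quoted in the stage list imply a fixed ratio between speech tokens and waveform samples: 24,000 samples per second divided by 25 tokens per second gives 960 samples per token. The sketch below only works through that arithmetic. The constant and function names are illustrative and do not come from the PR, and it assumes the vocoder emits exactly the nominal number of samples per token, with no padding or trimming.

```python
# Rates taken from the commit message above; the names are hypothetical.
SPEECH_TOKEN_RATE_HZ = 25         # Stage 1 (CRQ decoder) output rate
WAVEFORM_SAMPLE_RATE_HZ = 24_000  # Stage 2 (CosyVoice) output rate


def samples_per_token() -> int:
    """Number of waveform samples covered by one speech token."""
    return WAVEFORM_SAMPLE_RATE_HZ // SPEECH_TOKEN_RATE_HZ


def expected_waveform_samples(num_speech_tokens: int) -> int:
    """Nominal waveform length for a given number of speech tokens."""
    if num_speech_tokens < 0:
        raise ValueError("num_speech_tokens must be non-negative")
    return num_speech_tokens * samples_per_token()


def expected_duration_seconds(num_speech_tokens: int) -> float:
    """Nominal audio duration for a given number of speech tokens."""
    if num_speech_tokens < 0:
        raise ValueError("num_speech_tokens must be non-negative")
    return num_speech_tokens / SPEECH_TOKEN_RATE_HZ
```

A check like this may be useful in an end-to-end test to catch a stage that drops or duplicates tokens, since the output length would then drift from the nominal value.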
Verified imports, registry, stage config, stage processors and CRQ decoder architecture
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Register custom model types (like 'funaudiochat') to Transformers' CONFIG_MAPPING so that AutoConfig.from_pretrained() can recognize them even without auto_map in the model's config.json.
Signed-off-by: JaredforReal <w13431838023@gmail.com>
- Fix FunAudioChatMultiModalProcessor to inherit from BaseMultiModalProcessor
- Fix FunAudioChatProcessingInfo to inherit from BaseProcessingInfo
- Fix FunAudioChatDummyInputsBuilder to inherit from BaseDummyInputsBuilder
- Add configuration_fun_audio_chat.py with proper FunAudioChatConfig
- Add processing_fun_audio_chat.py for HF processor registration
- Update arg_utils.py to register both Config and Processor to Transformers
- Update run scripts to auto-detect local pretrained models
- Fix end2end.py stage config path
- Add compute_logits() and sample() methods
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Major changes:
- Adapt embed_multimodal to vLLM v1 interface (**kwargs signature)
- Implement _get_mm_fields_config and _get_prompt_updates for proper batching
- Add handle_oov_mm_token parameter to embed_input_ids
- Support both packed and legacy audio feature formats
- Add chunking strategy to continuous audio encoder
- Fix feature_exist_mask 2D->1D squeeze in discrete encoder
- Update example scripts with proper Fun-Audio-Chat prompts
- Add unit tests for processor and field config
Known issue: S2T output still garbled - root cause under investigation
Signed-off-by: JaredforReal <w13431838023@gmail.com>
## Bugs Fixed
### Bug 1: Attention Output Reshape in Continuous Encoder
**File**: vllm_omni/model_executor/models/fun_audio_chat/audio_encoder.py (Line ~109)
**Issue**: Multi-head attention output was missing transpose before reshape
**Root Cause**: Attention output shape was (batch, heads, seq, head_dim) but reshape expected (batch, seq, heads, head_dim)
**Fix**: Added .transpose(1, 2) before reshape:
- Before: attn_output.reshape(seq_length, -1)
- After: attn_output.transpose(1, 2).reshape(seq_length, -1).contiguous()
**Impact**: Max difference reduced from 3.17 to 0.0625 (bfloat16 precision limit)
**Validation**: Unit test test_continuous_encoder_detailed() confirmed fix via layer-by-layer comparison
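To illustrate why the missing transpose scrambles the output, here is a minimal pure-Python sketch that uses nested lists in place of tensors and drops the batch dimension. The helper names are hypothetical and this is not the PR's code. Each value encodes its own (head, token, dim) index as 100*head + 10*token + dim, so it is easy to see which head and token every element came from.

```python
def make_attn_output(num_heads, seq_len, head_dim):
    """Build a [heads][seq][head_dim] nested list whose values encode their index."""
    return [
        [[100 * h + 10 * t + d for d in range(head_dim)] for t in range(seq_len)]
        for h in range(num_heads)
    ]


def reshape_without_transpose(x, seq_len):
    """Mimic reshape(seq_len, -1) on a heads-major layout (the buggy path)."""
    flat = [v for head in x for token in head for v in token]
    width = len(flat) // seq_len
    return [flat[i * width:(i + 1) * width] for i in range(seq_len)]


def reshape_with_transpose(x):
    """Mimic transpose to token-major order, then reshape(seq_len, -1) (the fix)."""
    num_heads, seq_len = len(x), len(x[0])
    return [
        [v for h in range(num_heads) for v in x[h][t]]
        for t in range(seq_len)
    ]


attn = make_attn_output(num_heads=2, seq_len=3, head_dim=2)
buggy = reshape_without_transpose(attn, seq_len=3)
fixed = reshape_with_transpose(attn)
```

In `fixed`, row `t` holds every head's features for token `t`. In `buggy`, the first row mixes tokens 0 and 1 of head 0, which matches the kind of large numerical mismatch reported above.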
### Bug 2: Language Model Head Weight Tying
**File**: vllm_omni/model_executor/models/fun_audio_chat/fun_audio_chat.py (Line 436)
**Issue**: text_config.tie_word_embeddings incorrectly set to True, causing lm_head to share weights with embed_tokens
**Root Cause**: vLLM's config initialization mutated text_config during model setup
**Impact**: Language model computed wrong logits - predicted token 110153 ('悲剧') instead of 77045 ('Absolutely')
**Fix**: Force text_config.tie_word_embeddings = False before vLLM model initialization
**Validation**:
- HF lm_head weights: mean=-0.000224, std=0.025269 (correct)
- Before fix: vLLM lm_head shared embed_tokens weights
- After fix: vLLM lm_head has separate weights matching HF
- S2T inference now produces correct output: 'Absolutely, music can be a powerful tool for calming your mind...'
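The effect of the tying flag can be sketched with a toy example. Everything below is a hypothetical stand-in: `TextConfig`, `select_lm_head_weights` and the 2x2 matrices are illustrative only and are not vLLM or Transformers APIs. The sketch shows how a flag that is wrongly set to True makes the logits come from the embedding matrix and changes the predicted token, and how forcing it to False restores the separate lm_head.

```python
from dataclasses import dataclass


@dataclass
class TextConfig:
    """Hypothetical stand-in for text_config; only the flag discussed here."""
    tie_word_embeddings: bool = False


def select_lm_head_weights(config, embed_tokens, lm_head):
    """Return the matrix used for logits: the embeddings if tied, else lm_head."""
    return embed_tokens if config.tie_word_embeddings else lm_head


def logits(hidden, weights):
    """Dot product of the hidden state with each vocabulary row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]


def argmax(values):
    return max(range(len(values)), key=values.__getitem__)


# Toy weights: the checkpoint ships an lm_head that differs from embed_tokens.
embed_tokens = [[1.0, 0.0], [0.0, 1.0]]
lm_head = [[0.0, 1.0], [1.0, 0.0]]
hidden = [0.9, 0.1]

config = TextConfig(tie_word_embeddings=True)  # state after the unwanted mutation
tied_prediction = argmax(
    logits(hidden, select_lm_head_weights(config, embed_tokens, lm_head))
)

config.tie_word_embeddings = False  # the fix: force the flag off before model init
untied_prediction = argmax(
    logits(hidden, select_lm_head_weights(config, embed_tokens, lm_head))
)
```

With the flag on, the toy model predicts token 0. With it off, it predicts token 1, which is what the separate lm_head weights call for.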
## Testing Summary
- Created comprehensive unit test suite (tests/fun_audio_chat/test_unit_comparison.py)
- 8 test functions with layer-by-layer diagnostics
- All tests passing with bfloat16 precision tolerance
- Full pipeline output matches HF reference within expected precision limits
- S2T offline inference verified working correctly
## Code Quality
- Removed all debug logging from production code
- Translated all Chinese comments to English
- Fixed import ordering per linting standards
- Added memory configuration parameters (max-model-len, gpu-memory-utilization) to end2end.py
Signed-off-by: JaredforReal <w13431838023@gmail.com>
d87fa28 to d6e78cb
|
Please check whether this model can be merged before the end of this week; we are going to release our next version. @linyueqian @JaredforReal |
|
@hsliuustc0106 I'll do my best |
Signed-off-by: JaredforReal <w13431838023@gmail.com>
d6e78cb to a933fe3
…nfig_path Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
…list) Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
@hsliuustc0106 Excuse me, is this feature currently available? |
which feature? |
Fun-Audio-Chat-8B S2S inference. Are there any example inference commands? Can I use this pull request directly after compiling the source code? |
|
@tensorflowt S2T is available now; I'm working to get S2S right |
|
@Bounty-hunter PTAL |
|
So sorry, maybe I should clean up the code and deliver S2T first, then complete the S2S stage in a follow-up PR?
Yes, let's split a huge PR into smaller PRs, but please take care of the tests |
|
Hi, just wanted to check whether this implementation is complete and whether one can run the Fun-Audio-Chat model with vLLM-Omni for S2S. |
|
@JaredforReal This is a big one — S2T phase looks complete with the audio encoder and discrete encoder ported, and S2S with CosyVoice is in progress. What's blocking the S2S side? Is it the CRQ decoder integration or something else? Would be great to get this across the finish line. |
Signed-off-by: Jared Wen <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
Signed-off-by: JaredforReal <w13431838023@gmail.com>
|
Hi @JaredforReal, thanks for the work on this! It looks like #1748 also adds Fun-Audio-Chat S2S support. Could you take a look at that PR and help test it? Would be great to consolidate efforts and avoid duplicated work. |
|
@hsliuustc0106 @lishunyang12 So sorry for the huge delay. I am cleaning up the code and testing S2T mode, and I will hand over the half-finished project to @nemoramo and help test the S2S mode |
Signed-off-by: JaredforReal <w13431838023@gmail.com>
This PR adds support for Fun-Audio-Chat-8B, an omni-modal model that supports speech-to-text (S2T) and speech-to-speech (S2S) capabilities.
Model: FunAudioLLM/Fun-Audio-Chat-8B
Current Status
Phase 1 (S2T) - ✅ Completed
- FunAudioChatAudioEncoder (Whisper-like, 32-layer transformer)
- FunAudioChatDiscreteEncoder (embedding + group averaging)
Phase 2 (S2S) - ✅ Completed
- FunAudioChatDecoder (CRQ Transformer for speech token generation)
Phase 3 (E2E tests and Docs) - WIP
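The "group averaging" noted for FunAudioChatDiscreteEncoder can be pictured as mean-pooling consecutive token embeddings. The sketch below is an assumption-laden illustration, not the PR's implementation: the function name, the `group_size` parameter, and the choice to average a shorter trailing group over its actual length are all hypothetical.

```python
def group_average(embeddings, group_size):
    """Mean-pool consecutive embedding vectors in groups of `group_size`.

    `embeddings` is a list of equal-length float vectors. A trailing group with
    fewer than `group_size` vectors is averaged over its actual length (an
    assumption made for this sketch).
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    pooled = []
    for start in range(0, len(embeddings), group_size):
        group = embeddings[start:start + group_size]
        dim = len(group[0])
        pooled.append(
            [sum(vec[i] for vec in group) / len(group) for i in range(dim)]
        )
    return pooled
```

Pooling of this kind reduces the sequence length by roughly a factor of `group_size`, which is one way a discrete token stream can be brought to a lower frame rate before fusion with other features.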
Questions for Discussion
with Phase 2, I'd like feedback on:
- Stage Architecture: Should S2S use a 2-stage pipeline (LLM → CRQ+CosyVoice or LLM+CRQ → CosyVoice) or a 3-stage pipeline (LLM → CRQ → CosyVoice, current)?
- CosyVoice3 Model Loading: The model requires Fun-CosyVoice3-0.5B-2512 for the vocoder. What's the preferred approach?
- GPU Memory: CosyVoice3 needs ~1GB VRAM. Should it share a GPU with the LLM or support separate GPU allocation?
Related