Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
35 commits
Select commit Hold shift + click to select a range
5c19a70
file init
JaredforReal Dec 23, 2025
46d18b2
feat(fun-audio-chat): S2T support
JaredforReal Dec 24, 2025
e30a234
test(fun-audio-chat): implement S2T stage tests
JaredforReal Dec 24, 2025
b14b112
fix(fun-audio-chat): add type annotations to fix strict mkdocs build
JaredforReal Dec 24, 2025
13ce858
feat(fun-audio-chat): implement 3-stage S2S pipeline
JaredforReal Dec 24, 2025
3772d8a
test(fun-audio-chat): implement S2S stage tests
JaredforReal Dec 24, 2025
37e30b2
add offline end2end and online serving example
JaredforReal Dec 24, 2025
e0b1d59
fix(fun-audio-chat): register funaudiochat model type to Transformers
JaredforReal Dec 24, 2025
1538ac5
fix: Fun-Audio-Chat multimodal processor signature for vLLM registry
JaredforReal Dec 24, 2025
13dcda6
WIP: Fun-Audio-Chat vLLM v1 multimodal interface adaptation
JaredforReal Dec 25, 2025
72294b2
Fix Fun-Audio-Chat S2T output - Bug Report
JaredforReal Dec 25, 2025
999c975
simplify s2s config
JaredforReal Dec 31, 2025
daf0e74
fix crq_decoder logic
JaredforReal Dec 31, 2025
bdc94fb
fix: crq_decoder weight loading returns correct prefix names
JaredforReal Dec 31, 2025
cab7336
fix: ensure CRQ decoder returns tensors on correct device
JaredforReal Dec 31, 2025
f0d7704
fix: return correct shape hidden_states in CRQ decoder dummy run
JaredforReal Dec 31, 2025
a933fe3
fix: convert bfloat16 to float32 before numpy serialization
JaredforReal Dec 31, 2025
00d3c9e
fix: add stage_configs_path to engine_args to bypass resolve_model_co…
JaredforReal Dec 31, 2025
0e2d72d
remove unit tests
JaredforReal Dec 31, 2025
055b00e
remove unit tests
JaredforReal Dec 31, 2025
a3c0235
did some utils changes to allow custon model to be recognized
JaredforReal Dec 31, 2025
992521a
fix: remove text_response from additional_information (not Tensor or …
JaredforReal Dec 31, 2025
5be2904
fix: use CRQ BOS token instead of text tokens for CRQ decoder prompt
JaredforReal Dec 31, 2025
127364e
Merge branch 'main' into feat/fun_audio_chat
JaredforReal Mar 12, 2026
b521db2
update configuration
JaredforReal Mar 12, 2026
2b79ee4
clean up processor
JaredforReal Mar 12, 2026
96b75ae
update encoder attention
JaredforReal Mar 13, 2026
68d096b
Merge branch 'main' into feat/fun_audio_chat
JaredforReal Mar 13, 2026
2f7db28
align get_text_config
JaredforReal Mar 19, 2026
2cfcac9
change super() position
JaredforReal Mar 19, 2026
c5e9553
processor_dict in _get_arguments_from_pretrained
JaredforReal Mar 19, 2026
d75a878
reduce gpu memory utilization
JaredforReal Mar 19, 2026
07e4882
update BaseDummyInputsBuilder position
JaredforReal Mar 19, 2026
b366894
debug preocess input additional information
JaredforReal Mar 19, 2026
165fe40
add solution for None, ndarray, tuple for additional informance
JaredforReal Mar 19, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
281 changes: 281 additions & 0 deletions examples/offline_inference/fun_audio_chat/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,281 @@
# Fun-Audio-Chat-8B

## Overview

Fun-Audio-Chat-8B is a multimodal speech dialogue model developed by Alibaba's FunAudioLLM team, supporting:

- **Speech-to-Text (S2T)**: Speech understanding and transcription
- **Speech-to-Speech (S2S)**: End-to-end voice conversation (audio in, audio out)

## Model Architecture

Fun-Audio-Chat uses a 3-stage pipeline for S2S functionality:

```
┌─────────────────────────────────────────────────────────────────────────────┐
│ Fun-Audio-Chat S2S Pipeline │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Stage 0: Main Model (Fun-Audio-Chat-8B) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Input: Audio (16kHz WAV) + Text Prompt │ │
│ │ Process: Audio Encoder → Qwen3-8B LLM → Text Generation │ │
│ │ Output: Text Response + Hidden States (for S2S) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Stage 1: CRQ Decoder (Residual Quantization Decoder) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Input: Hidden States from Stage 0 │ │
│ │ Process: CRQ Transformer (28 layers, 1024 hidden) → Speech Tokens │ │
│ │ Output: Speech Tokens @ 25Hz (group_size=5, codebook=6565) │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Stage 2: CosyVoice (Token-to-Waveform Synthesis) │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Input: Speech Tokens from Stage 1 │ │
│ │ Process: CosyVoice3-0.5B Flow Matching → Audio Waveform │ │
│ │ Output: Audio Waveform @ 24kHz │ │
│ └─────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
```

## Hardware Requirements

| Mode | GPU Configuration | VRAM Required |
| ------------------------ | ----------------- | ------------------------ |
| S2T (understanding only) | 1x GPU | ~20GB |
| S2S (full pipeline) | 1x GPU | ~35GB |
| S2S (recommended) | 2x GPU | GPU0: ~20GB, GPU1: ~15GB |

**Tested on:**

- 2x NVIDIA H100 80GB
- 1x NVIDIA A100 80GB (single-card S2S)

## Installation

### 1. Install vLLM-Omni

```bash
cd /path/to/vllm-omni
pip install -e .
```

### 2. Install CosyVoice (S2S mode only)

```bash
# Clone CosyVoice repository
git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice

# Install dependencies
pip install -e .

# Install Matcha-TTS
cd third_party/Matcha-TTS
pip install -e .
```

Or set environment variable pointing to existing CosyVoice installation:

```bash
export COSYVOICE_PATH=/path/to/CosyVoice
```

### 3. Download Models

Models will be downloaded automatically on first run. You can also pre-download:

```bash
# Use China mirror for faster downloads (optional)
export HF_ENDPOINT=https://hf-mirror.com

# Download Fun-Audio-Chat-8B
huggingface-cli download FunAudioLLM/Fun-Audio-Chat-8B

# Download CosyVoice3 (S2S mode only)
huggingface-cli download FunAudioLLM/Fun-CosyVoice3-0.5B-2512
```

## Quick Start

### Speech-to-Text (S2T) Mode

Speech understanding only, outputs text:

```bash
cd examples/offline_inference/fun_audio_chat

# Use default audio
bash run_s2t.sh

# Use custom audio
python end2end.py --mode s2t --audio-path /path/to/audio.wav
```

### Speech-to-Speech (S2S) Mode

Full voice conversation, audio in → audio out:

```bash
cd examples/offline_inference/fun_audio_chat

# Single GPU mode (requires ~35GB VRAM)
bash run_s2s.sh

# Dual GPU mode (recommended)
bash run_s2s_2gpu.sh

# Use custom audio
python end2end.py --mode s2s --audio-path /path/to/audio.wav --output-dir output
```

## Command Line Arguments

```bash
python end2end.py [OPTIONS]

Options:
--mode Run mode: s2t (speech-to-text) or s2s (speech-to-speech)
Default: s2t

--audio-path, -a Input audio file path (supports WAV, MP3, FLAC)
Default: uses built-in test audio

--output-dir Output directory
Default: output

--model-path Fun-Audio-Chat model path
Default: FunAudioLLM/Fun-Audio-Chat-8B

--cosyvoice-path CosyVoice3 model path (S2S mode only)
Default: auto-download

--stage-configs Custom stage config file path
Default: auto-select s2t or s2s config

--speaker Speaker name or embedding path
Default: 中文女

--sampling-rate Input audio sampling rate
Default: 16000

--temperature Generation temperature
Default: 0.7

--max-tokens Maximum tokens to generate
Default: 2048

--seed Random seed
Default: 42

--enable-stats Enable detailed statistics logging
Default: False
```

## Configuration Files

### S2T Config (1 stage)

```yaml
# vllm_omni/model_executor/stage_configs/fun_audio_chat.yaml
stage_args:
- stage_id: 0
runtime:
devices: "0"
engine_args:
model_arch: FunAudioChatForConditionalGeneration
gpu_memory_utilization: 0.8
engine_output_type: token
```

### S2S Config (3 stages)

```yaml
# vllm_omni/model_executor/stage_configs/fun_audio_chat_s2s.yaml
stage_args:
- stage_id: 0 # Main: audio understanding
runtime:
devices: "0"
engine_args:
model_arch: FunAudioChatForConditionalGeneration
gpu_memory_utilization: 0.7
engine_output_type: latent # output hidden states

- stage_id: 1 # CRQ: speech token generation
runtime:
devices: "1" # can be on different GPU
engine_args:
model_arch: FunAudioChatCRQDecoder

- stage_id: 2 # CosyVoice: speech synthesis
runtime:
devices: "1"
engine_args:
model_arch: FunAudioChatCosyVoice
```

## Output Format

### S2T Mode

```
output/
├── request_0.txt # Transcribed text
└── request_0.json # Full metadata
```

### S2S Mode

```
output/
├── request_0.txt # Transcribed text
├── request_0.wav # Generated audio (24kHz)
└── request_0.json # Full metadata
```

## FAQ

### Q: librosa backend error?

Install ffmpeg:

```bash
sudo apt update
sudo apt install ffmpeg
```

### Q: Out of memory?

1. Use S2T mode (only needs ~20GB)
2. Use dual-GPU configuration
3. Lower `gpu_memory_utilization`

### Q: CosyVoice import failed?

Ensure paths are set correctly:

```bash
export COSYVOICE_PATH=/path/to/CosyVoice
export PYTHONPATH=$COSYVOICE_PATH:$COSYVOICE_PATH/third_party/Matcha-TTS:$PYTHONPATH
```

### Q: Model download timeout?

Use China mirror:

```bash
export HF_ENDPOINT=https://hf-mirror.com
```

## References

- [Fun-Audio-Chat Official Repo](https://github.com/FunAudioLLM/Fun-Audio-Chat)
- [Fun-Audio-Chat-8B Model](https://huggingface.co/FunAudioLLM/Fun-Audio-Chat-8B)
- [CosyVoice3 Model](https://huggingface.co/FunAudioLLM/Fun-CosyVoice3-0.5B-2512)
- [CosyVoice Repo](https://github.com/FunAudioLLM/CosyVoice)

## License

Apache 2.0
Loading