Merged
76 changes: 67 additions & 9 deletions docs/design/feature/async_chunk_design.md
@@ -19,7 +19,7 @@ The `async_chunk` feature enables asynchronous, chunked processing of data acros

For qwen3-omni:
- **Thinker → Talker**: Per decode step (typically chunk_size=1)
- **Talker → Code2Wav**: Accumulated to `codec_chunk_frames` (default=25) before sending. During the initial phase, a dynamic initial chunk size (IC) is automatically selected based on server load to reduce TTFA. Use the per-request `initial_codec_chunk_frames` API field to override.
- **Talker → Code2Wav**: Accumulated to `codec_chunk_frames` (default=25) before sending. During the initial phase, a dynamic initial chunk size (IC) is automatically selected based on server load to reduce TTFP. Use the per-request `initial_codec_chunk_frames` API field to override.
- **Code2Wav**: Streaming decode with code2wav chunk_size

With `async_chunk`:
@@ -75,26 +75,84 @@ Enabling **async_chunk** (False→True) sharply reduces time-to-first-audio (TTF
</p>

## Architecture
### Data Flow

#### Sequential Flow
### Async Chunk Pipeline Overview

The following diagram illustrates the **Async Chunk Architecture** for multi-stage models (e.g., Qwen3-Omni with Thinker → Talker → Code2Wav). It shows how data flows through the 4-stage pipeline (Thinker, Talker, Code2Wav, and the output stage), with stages running in parallel and text and audio streamed as separate outputs:

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-non-async-chunk.png">
<img alt="Data Flow between stages" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-non-async-chunk.png" width=100%>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-async-chunk.png">
<img alt="Async Chunk Pipeline Architecture" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-async-chunk.png" width=100%>
</picture>
</p>

#### Async Chunk Flow
**Diagram Legend:**

| Step | Stage Type | Description |
|:------:|:-----------:|:------------|
| `prefill` | Initialization | Context processing, KV cache initialization |
| `decode` | Autoregressive | Token-by-token generation in AR stages |
| `codes` | Audio Encoding | RVQ codec codes from Talker stage |
| `output` | Final Output | Text chunks or audio waveforms |

### Data Flow

#### Stage 0: Thinker (Multimodal Understanding + Text Generation)
- **Prefill**: Processes multimodal input (text/image/audio/video), initializes KV cache
- **Decode Loop**: Generates text tokens autoregressively
- **Chunk Triggers**: Each decode step (typically `chunk_size=1`) can trigger downstream processing
- **Dual Output**:
  - **Text Stream**: `text_0`, `text_1`, `text_2`... `text_n` streamed to output
  - **Hidden States**: Passed to Talker stage for audio synthesis

#### Stage 1: Talker (Text → RVQ Audio Codes)
- **Prefill**: Receives hidden states from Thinker as semantic condition
- **Decode Loop**: Generates RVQ codec codes autoregressively
- **Accumulation**: Codes accumulate to `codec_chunk_frames` (default=25) before forwarding
- **Dynamic IC**: The initial chunk size (IC) is auto-selected based on server load to reduce TTFP; the per-request `initial_codec_chunk_frames` API field overrides it
- **Output**: `codes` blocks (chunk 0, 1, ... n) sent to Code2Wav
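
The accumulation rule above can be sketched in a few lines of Python. This is a minimal illustration, not the project's implementation: the function name `chunk_codec_frames` is hypothetical, and it assumes that the initial chunk size applies only to the first chunk and that leftover frames are flushed when the stream ends.

```python
from typing import Iterable, Iterator, List, Optional


def chunk_codec_frames(
    frames: Iterable[int],
    codec_chunk_frames: int = 25,
    initial_codec_chunk_frames: Optional[int] = None,
) -> Iterator[List[int]]:
    """Group a stream of codec frames into chunks (hypothetical helper).

    Assumption: only the first chunk uses `initial_codec_chunk_frames`;
    every later chunk uses `codec_chunk_frames`.
    """
    if initial_codec_chunk_frames is not None:
        target = initial_codec_chunk_frames
    else:
        target = codec_chunk_frames

    buffer: List[int] = []
    for frame in frames:
        buffer.append(frame)
        if len(buffer) >= target:
            yield buffer
            buffer = []
            target = codec_chunk_frames  # switch to the steady-state size
    if buffer:
        yield buffer  # flush the final partial chunk at end of stream


# A small first chunk leaves early; later chunks use the default size.
sizes = [len(c) for c in chunk_codec_frames(range(60), 25, 5)]
print(sizes)
```

A smaller first chunk trades a little vocoder efficiency for earlier first audio, which is the motivation the list gives for selecting the initial size dynamically.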

#### Stage 2: Code2Wav (Vocoder Decoder)
- **Non-Autoregressive**: Processes RVQ codes in parallel batches
- **Streaming Decode**: Converts codes to audio waveforms chunk-by-chunk
- **Batching**: Supports batched inference for multiple concurrent requests
- **Output**: Audio segments `audio_0`, `audio_1`, ... `audio_n`

#### Stage 3: Output (Dual Stream)
- **Text Streaming**: `text_0` → `text_1` → `text_2` → ... (user sees response in real-time)
- **Audio Streaming**: `audio_0` → `audio_1` → ... (user hears audio progressively)
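
The stage hand-offs described above can be pictured as coroutines connected by queues. The sketch below is a toy model under stated assumptions: all function names are hypothetical, `asyncio.Queue` stands in for whatever inter-stage transport the system actually uses, and the Talker is simplified to emit one code frame per Thinker step.

```python
import asyncio

_DONE = object()  # sentinel marking end of stream


async def thinker(n_tokens, text_out, hidden_out):
    # Each decode step emits a text chunk and a hidden state.
    for i in range(n_tokens):
        await asyncio.sleep(0)  # stand-in for one decode step
        await text_out.put(f"text_{i}")
        await hidden_out.put(i)
    await text_out.put(_DONE)
    await hidden_out.put(_DONE)


async def talker(hidden_in, codes_out, chunk_frames):
    # Simplification: one code frame per hidden state, forwarded in chunks.
    buffer = []
    while (item := await hidden_in.get()) is not _DONE:
        buffer.append(item)
        if len(buffer) == chunk_frames:
            await codes_out.put(buffer)
            buffer = []
    if buffer:
        await codes_out.put(buffer)
    await codes_out.put(_DONE)


async def code2wav(codes_in, audio_out):
    # Turns each code chunk into one audio segment as soon as it arrives.
    idx = 0
    while (await codes_in.get()) is not _DONE:
        await audio_out.put(f"audio_{idx}")
        idx += 1
    await audio_out.put(_DONE)


async def drain(queue):
    items = []
    while (item := await queue.get()) is not _DONE:
        items.append(item)
    return items


async def run_pipeline(n_tokens=6, chunk_frames=2):
    text_q, hidden_q, codes_q, audio_q = (asyncio.Queue() for _ in range(4))
    results = await asyncio.gather(
        thinker(n_tokens, text_q, hidden_q),
        talker(hidden_q, codes_q, chunk_frames),
        code2wav(codes_q, audio_q),
        drain(text_q),
        drain(audio_q),
    )
    return results[3], results[4]  # (text stream, audio stream)


if __name__ == "__main__":
    text, audio = asyncio.run(run_pipeline())
    print(text, audio)
```

Because every stage consumes from its input queue as items arrive, downstream stages start working before upstream stages finish, and the text and audio streams are drained independently.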

### Execution Timeline

```
Timeline: Parallel vs Sequential

Sequential (async_chunk=false):
[Thinker: ████████████████████] (2.0s)
[Talker: ████████████████████] (3.0s)
[Code2Wav: ████] (1.0s)
Total: 6.0s, TTFP: 6.0s

Async Chunk (async_chunk=true):
[Thinker: ████░░░░████░░░░████] (2.0s, streaming)
[Talker: ░░████░░░░████░░] (3.0s, parallel)
[Code2Wav: ░░░░████░░] (1.0s, batched)
Total: ~3.5s, TTFP: ~0.5s

█ = Active computation ░ = Waiting/idle
```
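
The sequential figures in the timeline follow directly from the stage durations, as the short sketch below shows. The helper names are hypothetical. The async figures (~3.5s total, ~0.5s TTFP) are illustrative values from the timeline and are not derived here; the code only computes the lower bound that full overlap cannot beat, which is the duration of the slowest stage.

```python
from typing import Dict, Tuple

# Stage durations taken from the timeline above (seconds).
STAGE_SECONDS: Dict[str, float] = {"thinker": 2.0, "talker": 3.0, "code2wav": 1.0}


def sequential_latency(stages: Dict[str, float]) -> Tuple[float, float]:
    """Return (total, TTFP) when each stage waits for the previous one.

    Nothing is emitted until the last stage finishes, so TTFP equals the total.
    """
    total = sum(stages.values())
    return total, total


def pipelined_total_lower_bound(stages: Dict[str, float]) -> float:
    """Even with perfect overlap, the slowest stage must still run in full."""
    return max(stages.values())


print(sequential_latency(STAGE_SECONDS))
print(pipelined_total_lower_bound(STAGE_SECONDS))
```

The ~3.5s async total in the timeline sits between this lower bound and the sequential total, as expected when overlap is substantial but not perfect.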

#### Sequential Flow (for comparison)
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-async-chunk.png">
<img alt="Data Flow between stages" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-async-chunk.png" width=100%>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-non-async-chunk.png">
<img alt="Sequential Data Flow" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/qwen3-omni-non-async-chunk.png" width=100%>
</picture>
</p>

### Async Chunk architecture
In sequential mode, each stage must wait for the previous stage to complete entirely before starting.

### Async Chunk System Architecture
<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" src="https://raw.githubusercontent.com/vllm-project/vllm-omni/refs/heads/main/docs/source/architecture/async-chunk-architecture.png">
Binary file modified docs/source/architecture/qwen3-omni-async-chunk.png
Binary file modified docs/source/architecture/qwen3-omni-non-async-chunk.png