diff --git a/docs/design/feature/async_chunk_design.md b/docs/design/feature/async_chunk_design.md index 202ef0e18e8..45314a0aec6 100644 --- a/docs/design/feature/async_chunk_design.md +++ b/docs/design/feature/async_chunk_design.md @@ -19,7 +19,7 @@ The `async_chunk` feature enables asynchronous, chunked processing of data acros For qwen3-omni: - **Thinker → Talker**: Per decode step (typically chunk_size=1) -- **Talker → Code2Wav**: Accumulated to `codec_chunk_frames` (default=25) before sending. During the initial phase, a dynamic initial chunk size (IC) is automatically selected based on server load to reduce TTFA. Use the per-request `initial_codec_chunk_frames` API field to override. +- **Talker → Code2Wav**: Accumulated to `codec_chunk_frames` (default=25) before sending. During the initial phase, a dynamic initial chunk size (IC) is automatically selected based on server load to reduce TTFP. Use the per-request `initial_codec_chunk_frames` API field to override. - **Code2Wav**: Streaming decode with code2wav chunk_size With `async_chunk`: @@ -75,26 +75,84 @@ Enabling **async_chunk** (False→True) sharply reduces time-to-first-audio (TTF

## Architecture -### Data Flow -#### Sequential Flow +### Async Chunk Pipeline Overview + +The following diagram illustrates the **Async Chunk Architecture** for multi-stage models (e.g., Qwen3-Omni with Thinker → Talker → Code2Wav), showing how data flows through the 4-stage pipeline with parallel processing and dual-stream output: +

- - Data Flow between stages + + Async Chunk Pipeline Architecture

-#### Async Chunk Flow +**Diagram Legend:** +| Step | Stage Type | Description | +|:------:|:-----------:|:------------| +| `prefill` | Initialization | Context processing, KV cache initialization | +| `decode` | Autoregressive | Token-by-token generation in AR stages | +| `codes` | Audio Encoding | RVQ codec codes from Talker stage | +| `output` | Final Output | Text chunks or audio waveforms | +### Data Flow + +#### Stage 0: Thinker (Multimodal Understanding + Text Generation) +- **Prefill**: Processes multimodal input (text/image/audio/video), initializes KV cache +- **Decode Loop**: Generates text tokens autoregressively +- **Chunk Triggers**: Each decode step (typically `chunk_size=1`) can trigger downstream processing +- **Dual Output**: + - **Text Stream**: `text_0`, `text_1`, `text_2`... `text_n` streamed to output + - **Hidden States**: Passed to Talker stage for audio synthesis + +#### Stage 1: Talker (Text → RVQ Audio Codes) +- **Prefill**: Receives hidden states from Thinker as semantic condition +- **Decode Loop**: Generates RVQ codec codes autoregressively +- **Accumulation**: Codes accumulate to `codec_chunk_frames` (default=25) before forwarding +- **Dynamic IC**: Initial chunk size auto-selected based on server load to optimize TTFP +- **Output**: `codes` blocks (chunk 0, 1, ... n) sent to Code2Wav + +#### Stage 2: Code2Wav (Vocoder Decoder) +- **Non-Autoregressive**: Processes RVQ codes in parallel batches +- **Streaming Decode**: Converts codes to audio waveforms chunk-by-chunk +- **Batching**: Supports batched inference for multiple concurrent requests +- **Output**: Audio segments `audio_0`, `audio_1`, ... `audio_n` + +#### Stage 3: Output (Dual Stream) +- **Text Streaming**: `text_0` → `text_1` → `text_2` → ... (user sees response in real-time) +- **Audio Streaming**: `audio_0` → `audio_1` → ... (user hears audio progressively) + +### Execution Timeline + +``` +Timeline: Parallel vs Sequential + +Sequential (async_chunk=false): +[Thinker: ████████████████████] (2.0s) + [Talker: ████████████████████] (3.0s) + [Code2Wav: ████] (1.0s) +Total: 6.0s, TTFP: 6.0s + +Async Chunk (async_chunk=true): +[Thinker: ████░░░░████░░░░████] (2.0s, streaming) + [Talker: ░░████░░░░████░░] (3.0s, parallel) + [Code2Wav: ░░░░████░░] (1.0s, batched) +Total: ~3.5s, TTFP: ~0.5s + +█ = Active computation ░ = Waiting/idle +``` + +#### Sequential Flow (for comparison)

- - Data Flow between stages + + Sequential Data Flow

-### Async Chunk architecture +In sequential mode, each stage must wait for the previous stage to complete entirely before starting. + +### Async Chunk System Architecture

diff --git a/docs/source/architecture/qwen3-omni-async-chunk.png b/docs/source/architecture/qwen3-omni-async-chunk.png index b2d98b80f33..e73ca84b283 100644 Binary files a/docs/source/architecture/qwen3-omni-async-chunk.png and b/docs/source/architecture/qwen3-omni-async-chunk.png differ diff --git a/docs/source/architecture/qwen3-omni-non-async-chunk.png b/docs/source/architecture/qwen3-omni-non-async-chunk.png index da5610a11bb..47a9ba66a5e 100644 Binary files a/docs/source/architecture/qwen3-omni-non-async-chunk.png and b/docs/source/architecture/qwen3-omni-non-async-chunk.png differ