Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 48 additions & 0 deletions examples/online_serving/qwen3_tts/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -378,6 +378,54 @@ Server -> Client:
{"type": "session.done", "total_sentences": 1}
```

## Choosing an Execution Backend: Uniproc vs Multiprocessing

Qwen3-TTS stage configs support two execution backends controlled by the
`distributed_executor_backend` engine arg. The performance tradeoff between
them is **both hardware- and task-dependent**, so there is no single best
default (see [#2603](https://github.com/vllm-project/vllm-omni/issues/2603),
[#2604](https://github.com/vllm-project/vllm-omni/pull/2604) for the full
investigation).

| Backend | Stage config setting | Behaviour |
| ------- | -------------------- | --------- |
| **Uniproc** (default, world_size=1) | `distributed_executor_backend` omitted | Both stages run inside the orchestrator process. Avoids IPC serialisation, D2H copies, and msgpack overhead between stages. |
| **Multiprocessing** | `distributed_executor_backend: "mp"` | Each stage runs in its own subprocess. The Talker can continue decoding while Code2Wav runs the vocoder in parallel, improving pipeline utilisation under concurrency. |

> **Note:** When `distributed_executor_backend` is omitted and `world_size=1`,
> vLLM [automatically uses the uniproc executor](https://github.com/vllm-project/vllm/blob/main/vllm/config/parallel.py#L825).
> When `world_size > 1`, it defaults to `mp`.

### When uniproc wins

The uniproc path eliminates inter-process data transfer (D2H copies,
msgpack serialisation/deserialisation, tensor detaching). This matters most
when per-request processing is heavy relative to autoregressive decode.

The Base cloning task involves reference-audio encoding on every request, making IPC
overhead a larger fraction of total cost. Qwen3-Omni shows a similar pattern.

### When multiprocessing (`mp`) wins

For lighter per-request workloads, process-level parallelism between the
Talker and Code2Wav stages dominates.

CustomVoice is lighter per-request (no reference audio encoding), so the
process-level parallelism of `mp` outweighs its serialisation cost at
concurrency ≥ 4.

### How to switch

To use the uniproc executor on a single-GPU setup, pass the
`qwen3_tts_uniproc.yaml` stage config:

```bash
vllm serve Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--omni \
--stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_uniproc.yaml \
--port 8091
```

## Limitations

- **Single request**: Batch processing is not yet optimized for online serving.
Expand Down
97 changes: 97 additions & 0 deletions vllm_omni/model_executor/stage_configs/qwen3_tts_uniproc.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
async_chunk: true
stage_args:
- stage_id: 0
stage_type: llm
is_comprehension: true
runtime:
devices: "0"
engine_args:
model_stage: qwen3_tts
max_num_seqs: 10
model_arch: Qwen3TTSTalkerForConditionalGeneration
worker_type: ar
scheduler_cls: vllm_omni.core.sched.omni_ar_scheduler.OmniARScheduler
enforce_eager: false
trust_remote_code: true
async_scheduling: true
enable_prefix_caching: false
engine_output_type: latent
gpu_memory_utilization: 0.3
max_num_batched_tokens: 512
max_model_len: 4096
custom_process_next_stage_input_func: vllm_omni.model_executor.stage_input_processors.qwen3_tts.talker2code2wav_async_chunk
# Use named connector to apply runtime.connectors.extra.
output_connectors:
to_stage_1: connector_of_shared_memory
default_sampling_params:
temperature: 0.9
top_k: 50
max_tokens: 4096
seed: 42
detokenize: false
repetition_penalty: 1.05
stop_token_ids: [2150]

- stage_id: 1
stage_type: llm
runtime:
devices: "0"
engine_args:
model_stage: code2wav
max_num_seqs: 1
model_arch: Qwen3TTSCode2Wav
worker_type: generation
scheduler_cls: vllm_omni.core.sched.omni_generation_scheduler.OmniGenerationScheduler
enforce_eager: true
trust_remote_code: true
async_scheduling: true
enable_prefix_caching: false
engine_output_type: audio
gpu_memory_utilization: 0.3
# Must be divisible by num_code_groups and cover (left_context + chunk).
# Prefill length is Q * num_frames (e.g. 16 * 2148 = 34368); keep headroom past 32k.
max_num_batched_tokens: 65536
# async_chunk appends windows per step; max_model_len must cover accumulated flat codec stream.
max_model_len: 65536
engine_input_source: [0]
final_output: true
final_output_type: audio
# Distributed connector configuration
input_connectors:
from_stage_0: connector_of_shared_memory
tts_args:
max_instructions_length: 500
default_sampling_params:
temperature: 0.0
top_p: 1.0
top_k: -1
max_tokens: 65536
seed: 42
detokenize: true
repetition_penalty: 1.0

runtime:
enabled: true
defaults:
window_size: -1
max_inflight: 1

connectors:
connector_of_shared_memory:
name: SharedMemoryConnector
extra:
shm_threshold_bytes: 65536
# Frame-aligned codec streaming transport.
codec_streaming: true
# Connector polling / timeout (unit: loop count, sleep interval in seconds).
connector_get_sleep_s: 0.01
connector_get_max_wait_first_chunk: 3000
connector_get_max_wait: 300
# Match the decoder sliding attention window to avoid chunk-boundary noise.
codec_chunk_frames: 25
codec_left_context_frames: 72

edges:
- from: 0
to: 1
window_size: -1
Loading