2 changes: 2 additions & 0 deletions docs/design/qwen3_omni_tts_performance_optimization.md
@@ -411,6 +411,8 @@ Notes:
- `runtime.max_batch_size` controls stage-level batching.
- Thinker/Talker commonly use `enforce_eager: false` for CUDA Graph paths.
- Code2Wav often remains eager (`enforce_eager: true`) depending on runtime behavior.
- Qwen3-Omni defaults to `VLLM_USE_FLASHINFER_MOE_FP16=0` when the variable is not already set. The Triton
  unquantized MoE backend has been more stable and faster than the FlashInfer CUTLASS unquantized MoE backend on
  recent vLLM rebases.

#### 2) Enable async chunk

9 changes: 9 additions & 0 deletions docs/user_guide/examples/online_serving/qwen3_omni.md
@@ -17,6 +17,9 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091

The default deployment configuration at `vllm_omni/deploy/qwen3_omni_moe.yaml` is resolved and loaded
automatically via the model registry, so the `--deploy-config` flag is not needed for standard deployments.
The bundled Qwen3-Omni setup defaults to `VLLM_USE_FLASHINFER_MOE_FP16=0` unless the variable is already set in the
environment. This keeps the Thinker and Talker on vLLM's Triton unquantized MoE path and avoids the performance
regression observed with the FlashInfer CUTLASS unquantized MoE backend.
Asynchronous chunk streaming is **enabled by default** within the bundled configuration.

To use a custom deployment YAML, pass its path explicitly:
@@ -72,6 +75,12 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
```

To experiment with the FlashInfer FP16 MoE path, set `VLLM_USE_FLASHINFER_MOE_FP16=1` before launching the server:
```bash
VLLM_USE_FLASHINFER_MOE_FP16=1 \
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
```

For the stage-based CLI, you usually do **not** need `--stage-overrides` for
that kind of change. Since each command launches one stage, just pass the knob
directly on that stage command:
13 changes: 13 additions & 0 deletions vllm_omni/engine/stage_init_utils.py
@@ -482,6 +482,15 @@ def prepare_engine_environment() -> None:
pass


def _maybe_set_qwen3_omni_moe_env(engine_args_dict: dict[str, Any]) -> None:
    if (
        engine_args_dict.get("model_arch") == "Qwen3OmniMoeForConditionalGeneration"
        and "VLLM_USE_FLASHINFER_MOE_FP16" not in os.environ
    ):
        os.environ["VLLM_USE_FLASHINFER_MOE_FP16"] = "0"
        logger.info("[stage_init] Set VLLM_USE_FLASHINFER_MOE_FP16=0 for Qwen3-Omni stage")


def split_devices_for_replicas(
    devices_str: str | None,
    num_replicas: int,
@@ -762,6 +771,10 @@ def build_engine_args_dict(
    default_sp = _to_dict(getattr(stage_config, "default_sampling_params", {}))
    engine_args_dict["has_sampling_extra_args"] = bool(default_sp.get("extra_args"))

    # TODO: Remove this after the performance regression is fixed
    # Set VLLM_USE_FLASHINFER_MOE_FP16=0 for Qwen3-Omni to avoid performance regression
    _maybe_set_qwen3_omni_moe_env(engine_args_dict)

    return engine_args_dict
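
The default-if-unset behavior added in `_maybe_set_qwen3_omni_moe_env` can be checked in isolation. The following is a minimal sketch rather than the project's code: the `maybe_set_moe_env` name and the injectable `environ` parameter are hypothetical, introduced only so the logic can run against a plain dict instead of the real process environment.

```python
import os
from collections.abc import MutableMapping

ENV_VAR = "VLLM_USE_FLASHINFER_MOE_FP16"
QWEN3_OMNI_ARCH = "Qwen3OmniMoeForConditionalGeneration"


def maybe_set_moe_env(
    engine_args: dict,
    environ: MutableMapping = os.environ,  # hypothetical parameter, for testing only
) -> bool:
    """Set ENV_VAR to "0" only for Qwen3-Omni and only when it is unset.

    Returns True when the default was applied, False otherwise.
    """
    if engine_args.get("model_arch") == QWEN3_OMNI_ARCH and ENV_VAR not in environ:
        environ[ENV_VAR] = "0"
        return True
    return False


# Case 1: variable unset, Qwen3-Omni architecture -> default is applied.
env = {}
applied_default = maybe_set_moe_env({"model_arch": QWEN3_OMNI_ARCH}, env)

# Case 2: the user already exported the variable -> their value is preserved.
user_env = {ENV_VAR: "1"}
applied_over_user = maybe_set_moe_env({"model_arch": QWEN3_OMNI_ARCH}, user_env)

# Case 3: a different architecture -> the environment is left untouched.
other_env = {}
applied_other = maybe_set_moe_env({"model_arch": "SomeOtherArch"}, other_env)
```

Because the check is `not in environ` rather than a comparison of values, any explicit user setting wins, including `VLLM_USE_FLASHINFER_MOE_FP16=1` as shown in the serving guide. Note that in the real function the write goes to `os.environ`, which is process-wide, so the default would also be visible to anything that reads the variable later in the same process.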

