diff --git a/docs/cli/serve.md b/docs/cli/serve.md index 47a873b7211..035fa056731 100644 --- a/docs/cli/serve.md +++ b/docs/cli/serve.md @@ -1,5 +1,59 @@ # vllm-omni serve +## Stage-based CLI quickstart + +The stage-based CLI is designed for deployments that require launching each pipeline stage in an isolated process +(e.g., across separate operating system processes, distinct GPUs, or distributed hosts). + +- For **migrated models** that utilize the bundled deployment YAML configurations located in + `vllm_omni/deploy/`, the `--deploy-config` flag is only required to override the default configuration. By default, executing `vllm serve MODEL --omni ...` + automatically loads the bundled deployment configuration. +- For **legacy models** utilizing configuration files located in + `vllm_omni/model_executor/stage_configs/`, the `--stage-configs-path` parameter remains mandatory. + +Example: Initializing Stage 0 (Orchestrator and API Server): + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --port 8091 \ + --stage-id 0 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +Example: Initializing a Headless Worker Stage (Stage 1): + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +When utilizing a custom deployment YAML based on the new schema, append `--deploy-config /path/to/override.yaml` to each command execution. Conversely, for legacy models, substitute this parameter with `--stage-configs-path /path/to/stage_configs.yaml`. + +In the standard execution paradigm, the `--stage-overrides` argument is utilized to apply stage-specific configurations from a single CLI command. 
+With the **stage-based CLI**, however, each process runs a single stage, so it is recommended to pass tuning parameters as ordinary command-line flags on that stage's command rather than building a composite `--stage-overrides` JSON string. + +For example, instead of the following composite configuration: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}' +``` + +the stage-based CLI lets you launch Stage 1 with explicit flags: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --gpu-memory-utilization 0.5 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + ## JSON CLI Arguments --8<-- "docs/cli/json_tip.inc.md" diff --git a/docs/configuration/stage_configs.md b/docs/configuration/stage_configs.md index 55b4053cc71..4a7c9cc67c5 100644 --- a/docs/configuration/stage_configs.md +++ b/docs/configuration/stage_configs.md @@ -88,6 +88,55 @@ stages: | `--async-chunk` / `--no-async-chunk` | Flip the deploy YAML's `async_chunk:` bool. Unset (default) leaves the YAML value in force. | | `--stage-configs-path` | **Deprecated.** Accepts legacy `stage_args` yamls and (auto-detected) new deploy yamls; emits a deprecation warning. Migrate to `--deploy-config`. To be removed in a follow-up PR. | +### Stage-Based CLI Paradigm + +The stage-based CLI runs each pipeline stage in its own isolated process: + +- **Stage 0** typically hosts the orchestrator and the primary API server. It requires `--stage-id 0`, + `--omni-master-address`, `--omni-master-port`, and the usual API port flag (e.g., `--port`). 
+- **Worker Stages** operate without a distinct API server (i.e., using `--headless`), are assigned sequential `--stage-id` identifiers, and must reference the corresponding + `--omni-master-address` and `--omni-master-port` parameters to successfully register with Stage 0. + +For migrated architectures, the system automatically resolves and loads the bundled deployment YAML. Consequently, the primary execution path +does **not** necessitate the explicit definition of `--deploy-config`: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --port 8091 \ + --stage-id 0 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 + +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +When instantiating a custom deployment YAML conforming to the updated schema, append the `--deploy-config /path/to/override.yaml` directive +to all node invocations. For legacy architectures (e.g., BAGEL) configured via deprecated `stage_args:` schemas, continue to specify the relevant configuration via `--stage-configs-path /path/to/config.yaml`. 
+ +For the standard single-command launch, `--stage-overrides` is the recommended way +to specify stage-specific tuning from the CLI: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}' +``` + +In the **stage-based CLI**, by contrast, each process runs exactly one pipeline stage, so overrides +can be passed as explicit CLI flags on that stage's command, making composite `--stage-overrides` JSON strings unnecessary: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --gpu-memory-utilization 0.5 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + ### Precedence From highest to lowest: @@ -133,6 +182,17 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ --stage-overrides '{"0": {"max_num_seqs": 8}}' ``` +In the stage-based CLI, the equivalent parameters can be passed directly +as command-line arguments to the single-stage process: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 0 \ + --max-num-seqs 8 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + Effective config per stage after the merge: | Stage | Field | Final value | Source | @@ -153,9 +213,14 @@ Therefore, as a core part of vLLM-Omni, the stage configs for a model have sever - Input and output dependencies for each stage. - Default input parameters. -If users want to modify some part of it. The custom stage_configs file can be input as input argument in both online and offline. Just like examples below: +To override specific parameters, pass a custom configuration file +in both the online and offline flows. 
Prefer the `--deploy-config` flag +for new-schema deploy YAMLs, and reserve the `--stage-configs-path` parameter +for compatibility with legacy `stage_args` YAMLs. + +Examples: + -For offline (Assume necessary dependencies have ben imported): +For offline (assuming the necessary dependencies have been imported): ```python model_name = "Qwen/Qwen2.5-Omni-7B" omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml") @@ -163,7 +228,13 @@ omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs. For online serving: ```bash -vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file +vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --deploy-config /path/to/deploy_config.yaml +``` + +Legacy online serving: + +```bash +vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file ``` !!! important We are actively iterating on the definition of stage configs, and we welcome all feedbacks from both community users and developers to help us shape the development! diff --git a/docs/user_guide/examples/online_serving/bagel.md b/docs/user_guide/examples/online_serving/bagel.md index 9de31926aa1..1a3fec9f426 100644 --- a/docs/user_guide/examples/online_serving/bagel.md +++ b/docs/user_guide/examples/online_serving/bagel.md @@ -22,9 +22,16 @@ Or use the convenience script: ```bash cd /workspace/vllm-omni/examples/online_serving/bagel +# Launch both stages in one session (legacy convenience flow) bash run_server.sh + +# Launch a single stage per terminal +bash run_server_stage_cli.sh --stage 0 +bash run_server_stage_cli.sh --stage 1 ``` +If you have a custom stage configs file, launch the server with the command below: + ```bash vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file ``` @@ -115,12 +122,13 @@ mooncake_master \ **2. 
Launch Stage 0 (Thinker / Orchestrator)** on the orchestrator node: ```bash +# API server port for client requests: 8000 vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \ - --port 8000 \ # API server port for client requests + --port 8000 \ --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \ --stage-id 0 \ - -oma \ - -omp 8091 + --omni-master-address \ + --omni-master-port 8091 ``` **3. Launch Stage 1 (DiT)** on the remote node in headless mode: @@ -130,8 +138,8 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \ --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \ --stage-id 1 \ --headless \ - -oma \ - -omp 8091 + --omni-master-address \ + --omni-master-port 8091 ``` **Mooncake Master arguments:** @@ -150,8 +158,8 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \ | :------- | :---------- | | `--stage-id` | Which stage this process runs (0 = Thinker, 1 = DiT) | | `--headless` | Run without the API server (worker-only mode) | -| `-oma` | Orchestrator master address | -| `-omp` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination | +| `--omni-master-address` | Orchestrator master address | +| `--omni-master-port` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination | > [!IMPORTANT] > **Startup Order**: Stage 0 (orchestrator) must be launched **before** Stage 1 (headless). @@ -165,7 +173,7 @@ All nodes must have network connectivity to each other. 
Ensure the following por | :--- | :------- | :------ | :-------- | | 50051 | TCP | Mooncake Master RPC | Worker → Orchestrator | | 8080 | TCP | Mooncake HTTP Metadata Server | Worker → Orchestrator | -| 8091 | TCP | Orchestrator Master (`-omp`) | Worker → Orchestrator | +| 8091 | TCP | Orchestrator Master (`--omni-master-port`) | Worker → Orchestrator | | 8000 | TCP | API Server (`--port`) | Client → Orchestrator | | 9003 | TCP | Metrics (optional) | Monitoring → Orchestrator | diff --git a/docs/user_guide/examples/online_serving/qwen3_omni.md b/docs/user_guide/examples/online_serving/qwen3_omni.md index 611eb6fd3fc..22d89ee8018 100644 --- a/docs/user_guide/examples/online_serving/qwen3_omni.md +++ b/docs/user_guide/examples/online_serving/qwen3_omni.md @@ -15,15 +15,72 @@ Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 ``` -If you want to open async chunking for qwen3-omni, launch the server with command below +The default deploy config at `vllm_omni/deploy/qwen3_omni_moe.yaml` is resolved and loaded +automatically by the model registry, so the `--deploy-config` flag is not needed for standard deployments. +Async-chunk streaming is **enabled by default** in the bundled config. +To use a custom deploy YAML instead, pass its path explicitly: ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /vllm_omni/deploy/qwen3_omni_moe.yaml +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --deploy-config /path/to/deploy_config_file ``` -If you have custom stage configs file, launch the server with command below +### Launch individual stages (stage-based CLI) + +Use the stage-based CLI to launch each stage in its own process. + +**1. 
Stage 0 (Thinker + API server)** + ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /path/to/deploy_config_file +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --port 8091 \ + --stage-id 0 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +**2. Stage 1 (Talker)** + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +**3. Stage 2 (Code2Wav)** + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 2 \ + --headless \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +Add `--deploy-config /path/to/deploy_config_file` to every command if you want +to override the bundled deploy YAML. + +For the regular one-process launch, stage-specific CLI tuning is usually done +with `--stage-overrides`, for example: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}' +``` + +For the stage-based CLI, you usually do **not** need `--stage-overrides` for +that kind of change. 
Since each command launches one stage, just pass the knob +directly on that stage command: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --gpu-memory-utilization 0.5 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 ``` ### Send Multi-modal Request diff --git a/examples/online_serving/bagel/README.md b/examples/online_serving/bagel/README.md index 0939bc5f387..4a87940434b 100644 --- a/examples/online_serving/bagel/README.md +++ b/examples/online_serving/bagel/README.md @@ -19,7 +19,12 @@ Or use the convenience script: ```bash cd /workspace/vllm-omni/examples/online_serving/bagel +# Initialize all stages within a single unified session (legacy operational sequence) bash run_server.sh + +# Initialize each stage in a discrete isolated process terminal +bash run_server_stage_cli.sh --stage 0 +bash run_server_stage_cli.sh --stage 1 ``` ```bash @@ -112,12 +117,13 @@ mooncake_master \ **2. Launch Stage 0 (Thinker / Orchestrator)** on the orchestrator node: ```bash +# API server port for client requests: 8000 vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \ - --port 8000 \ # API server port for client requests + --port 8000 \ --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \ --stage-id 0 \ - -oma \ - -omp 8091 + --omni-master-address \ + --omni-master-port 8091 ``` **3. 
Launch Stage 1 (DiT)** on the remote node in headless mode: @@ -127,8 +133,8 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \ --stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \ --stage-id 1 \ --headless \ - -oma \ - -omp 8091 + --omni-master-address \ + --omni-master-port 8091 ``` **Mooncake Master arguments:** @@ -145,14 +151,10 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \ | Argument | Description | | :------- | :---------- | -| `--stage-id` | Which stage this process runs (0 = Thinker, 1 = DiT) | -| `--headless` | Run without the API server (worker-only mode) | -| `-oma` | Orchestrator master address | -| `-omp` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination | - -> [!IMPORTANT] -> **Startup Order**: Stage 0 (orchestrator) must be launched **before** Stage 1 (headless). -> Stage 0 will appear to hang on startup until Stage 1 (worker) connects — this is expected behavior. +| `--stage-id` | Designates the pipeline stage assigned to the process (e.g., 0 = Thinker, 1 = DiT) | +| `--headless` | Executes the worker stage autonomously without initializing an API server | +| `--omni-master-address` | Specifies the IP address binding the Orchestrator master node | +| `--omni-master-port` | Specifies the targeted port establishing task coordination between Stage 1 and Stage 0 | **Network Requirements** @@ -162,7 +164,7 @@ All nodes must have network connectivity to each other. 
Ensure the following por | :--- | :------- | :------ | :-------- | | 50051 | TCP | Mooncake Master RPC | Worker → Orchestrator | | 8080 | TCP | Mooncake HTTP Metadata Server | Worker → Orchestrator | -| 8091 | TCP | Orchestrator Master (`-omp`) | Worker → Orchestrator | +| 8091 | TCP | Orchestrator Master (`--omni-master-port`) | Worker → Orchestrator | | 8000 | TCP | API Server (`--port`) | Client → Orchestrator | | 9003 | TCP | Metrics (optional) | Monitoring → Orchestrator | diff --git a/examples/online_serving/bagel/run_server_stage_cli.sh b/examples/online_serving/bagel/run_server_stage_cli.sh index 2d0b4bc369e..18b4c937cac 100644 --- a/examples/online_serving/bagel/run_server_stage_cli.sh +++ b/examples/online_serving/bagel/run_server_stage_cli.sh @@ -116,8 +116,8 @@ run_stage_0() { --port "$PORT" \ --stage-configs-path "$STAGE_CONFIGS_PATH" \ --stage-id 0 \ - -oma "$MASTER_ADDRESS" \ - -omp "$MASTER_PORT" \ + --omni-master-address "$MASTER_ADDRESS" \ + --omni-master-port "$MASTER_PORT" \ "${EXTRA_ARGS[@]}" } @@ -127,8 +127,8 @@ run_stage_1() { --stage-configs-path "$STAGE_CONFIGS_PATH" \ --stage-id 1 \ --headless \ - -oma "$MASTER_ADDRESS" \ - -omp "$MASTER_PORT" \ + --omni-master-address "$MASTER_ADDRESS" \ + --omni-master-port "$MASTER_PORT" \ "${EXTRA_ARGS[@]}" } diff --git a/examples/online_serving/qwen3_omni/README.md b/examples/online_serving/qwen3_omni/README.md index 32722b3db4e..c85970555f9 100644 --- a/examples/online_serving/qwen3_omni/README.md +++ b/examples/online_serving/qwen3_omni/README.md @@ -12,21 +12,80 @@ Please refer to [README.md](../../../README.md) vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 ``` -The default deploy config at `vllm_omni/deploy/qwen3_omni_moe.yaml` is loaded -automatically by the model registry — no `--deploy-config` flag needed for the -common case. Async-chunk streaming is **enabled by default** in the bundled config. 
-NPU / ROCm / XPU per-platform deltas are merged in automatically from the -`platforms:` section of the same YAML. +The default deploy config at `vllm_omni/deploy/qwen3_omni_moe.yaml` is resolved and loaded +automatically by the model registry, so the `--deploy-config` flag is not needed for standard deployments. +Async-chunk streaming is **enabled by default** in this bundled config. +NPU, ROCm, and XPU per-platform deltas are merged in automatically from the +`platforms:` section of the same YAML. -**Note:** The OpenAI-style **`/v1/realtime`** WebSocket (streaming PCM audio in, audio + transcription out) is **not supported** when `async_chunk` is enabled. Use the default omni layout or a stage config with `async_chunk: false` for realtime sessions. - -If you have a custom deploy YAML, point at it explicitly: +**Note:** The OpenAI-style **`/v1/realtime`** WebSocket (streaming PCM audio in, audio and transcription out) +is currently **unsupported** while `async_chunk` is enabled. Use the default omni layout or a deploy config with `async_chunk: false` for realtime sessions. + +To use a custom deploy YAML, pass its path explicitly: ```bash vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ --deploy-config /path/to/your_deploy_config.yaml ``` +### Launch individual stages (stage-based CLI) + +Use the stage-based CLI when you want to run one stage per process. + +**1. Stage 0 (Thinker + API server)** + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --port 8091 \ + --stage-id 0 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +**2. 
Stage 1 (Talker)** + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +**3. Stage 2 (Code2Wav)** + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 2 \ + --headless \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + +Append `--deploy-config /path/to/your_deploy_config.yaml` to every stage command if you need +to override the bundled deploy YAML. + +For the standard **unified-process** launch, stage-specific CLI tuning is usually done +with `--stage-overrides`, for example: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}' +``` + +With the stage-based CLI, `--stage-overrides` is usually **unnecessary** +for this kind of change. Because each command launches a single stage, +you can pass the flag directly on that stage's command: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \ + --stage-id 1 \ + --headless \ + --gpu-memory-utilization 0.5 \ + --omni-master-address 127.0.0.1 \ + --omni-master-port 26000 +``` + ### Tuning deployment parameters Most engine knobs (`max_num_batched_tokens`, `max_model_len`, `enforce_eager`, @@ -93,6 +152,9 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ Per-stage values are always treated as explicit and beat YAML defaults for the named stage. Other stages keep their YAML values. +If you switch to the stage-based CLI, the same per-stage tuning can usually be +passed directly on that stage's command instead of using `--stage-overrides`. + #### 3. 
Custom deploy YAML When per-stage overrides get long, write a small overlay YAML that inherits diff --git a/recipes/Qwen/Qwen3-Omni.md b/recipes/Qwen/Qwen3-Omni.md index 081e1453d37..f78e4dda2aa 100644 --- a/recipes/Qwen/Qwen3-Omni.md +++ b/recipes/Qwen/Qwen3-Omni.md @@ -50,13 +50,22 @@ Start the server from the repository root: vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 ``` -To enable async chunking, use the bundled stage config: +Async chunking is enabled by default in the bundled deployment config. For +common runtime tuning, prefer CLI overrides instead of editing or passing a +custom YAML file: ```bash -vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct \ - --omni \ - --port 8091 \ - --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_omni_moe_async_chunk.yaml +# Disable async chunking for /v1/realtime sessions +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --no-async-chunk +``` + +Use a custom deploy config only for advanced cases such as custom topology, +connector wiring, or a larger overlay of stage defaults: + +```bash +vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \ + --deploy-config /path/to/your_qwen3_omni_overrides.yaml ``` #### Verification @@ -85,6 +94,6 @@ curl http://localhost:8091/v1/chat/completions \ #### Notes -- Memory usage: Size depends on runtime options and output modalities; leave headroom for multimodal workloads. -- Key flags: `--omni` is required; `--stage-configs-path` is optional for custom or async-chunk stage configs. -- Known limitations: This starter recipe is intentionally narrow and focuses on the single-GPU online-serving path already documented in the repo examples. +- Memory usage: Size depends on runtime options and output modalities; leave headroom for multimodal workloads. Prefer CLI overrides such as `--gpu-memory-utilization` for routine tuning. 
+- Key flags: `--omni` is required; async chunking is enabled by default; use `--no-async-chunk` for realtime sessions and `--deploy-config` only for advanced custom deployments. +- Known limitations: The `/v1/realtime` WebSocket flow is currently unsupported while async chunking is enabled. This starter recipe is intentionally narrow and focuses on the single-GPU online-serving path already documented in the repo examples.