54 changes: 54 additions & 0 deletions docs/cli/serve.md
@@ -1,5 +1,59 @@
# vllm-omni serve

## Stage-based CLI quickstart

The stage-based CLI is for deployments that launch each pipeline stage in its own process
(e.g., in separate OS processes, on different GPUs, or on different hosts).

- For **migrated models** that use the bundled deploy YAMLs in
`vllm_omni/deploy/`, `--deploy-config` is needed only to override the default. By default, `vllm serve MODEL --omni ...`
loads the bundled deploy config automatically.
- For **legacy models** whose configs live in
`vllm_omni/model_executor/stage_configs/`, `--stage-configs-path` is still required.

Example: Initializing Stage 0 (Orchestrator and API Server):

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--port 8091 \
--stage-id 0 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

Example: Initializing a Headless Worker Stage (Stage 1):

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

If you use a custom deploy YAML in the new schema, add `--deploy-config /path/to/override.yaml` to each command. For legacy models, use `--stage-configs-path /path/to/stage_configs.yaml` instead.

In a standard single-command launch, `--stage-overrides` applies stage-specific settings from one CLI command.
With the **stage-based CLI**, each process runs exactly one stage, so prefer passing tuning parameters as ordinary command-line flags on that stage's command rather than building a combined `--stage-overrides` JSON string.

For example, as an alternative to the following composite configuration:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
```

the stage-based CLI permits the direct initialization of Stage 1 with explicit parameters:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--gpu-memory-utilization 0.5 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```
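
The relationship between the two styles can be sketched in Python. This is an illustration only, not part of vLLM-Omni: `overrides_to_flags` is a hypothetical helper, and the snake_case to kebab-case naming rule is an assumption generalized from the `gpu_memory_utilization` / `--gpu-memory-utilization` pair shown above. It may not hold for every field.

```python
import json


def overrides_to_flags(stage_overrides_json: str) -> dict[int, list[str]]:
    """Turn a --stage-overrides JSON string into per-stage CLI flag lists."""
    overrides = json.loads(stage_overrides_json)
    flags_by_stage: dict[int, list[str]] = {}
    for stage_id, fields in overrides.items():
        flags: list[str] = []
        for key, value in fields.items():
            # Only plain numbers and strings are handled in this sketch;
            # booleans and nested values are rejected instead of guessed at.
            if isinstance(value, bool) or not isinstance(value, (int, float, str)):
                raise TypeError(f"unsupported override value for {key!r}: {value!r}")
            # Assumed naming rule: snake_case field -> kebab-case flag.
            flags += ["--" + key.replace("_", "-"), str(value)]
        flags_by_stage[int(stage_id)] = flags
    return flags_by_stage


print(overrides_to_flags('{"1": {"gpu_memory_utilization": 0.5}}'))
```

Each list in the result would be appended to the command of the stage it is keyed by, which is why a composite JSON string is no longer needed once every stage has its own command.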

## JSON CLI Arguments

--8<-- "docs/cli/json_tip.inc.md"
77 changes: 74 additions & 3 deletions docs/configuration/stage_configs.md
Expand Up @@ -88,6 +88,55 @@ stages:
| `--async-chunk` / `--no-async-chunk` | Flip the deploy YAML's `async_chunk:` bool. Unset (default) leaves the YAML value in force. |
| `--stage-configs-path` | **Deprecated.** Accepts legacy `stage_args` yamls and (auto-detected) new deploy yamls; emits a deprecation warning. Migrate to `--deploy-config`. To be removed in a follow-up PR. |

### Stage-Based CLI Paradigm

The stage-based CLI runs each pipeline stage in its own process:

- **Stage 0** typically hosts the orchestrator and the primary API server. It requires `--stage-id 0`,
`--omni-master-address`, `--omni-master-port`, and the usual port options (e.g., `--port`).
- **Worker stages** run without an API server (`--headless`), take sequential `--stage-id` values, and must pass the same
`--omni-master-address` and `--omni-master-port` as Stage 0 in order to register with it.
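
The two roles described above can be sketched as a small Python command builder. It only assembles argument lists and does not start any process. `build_stage_argv` is a hypothetical helper, and it assumes the typical layout in which stage 0 is the orchestrator and every other stage is a headless worker.

```python
import shlex


def build_stage_argv(model, stage_id, master_address, master_port,
                     api_port=None, extra_args=()):
    """Assemble the `vllm serve` argv for one stage (dry run, nothing is launched)."""
    argv = ["vllm", "serve", model, "--omni"]
    if stage_id == 0:
        # Assumption: stage 0 hosts the API server, so it needs an API port.
        if api_port is None:
            raise ValueError("stage 0 hosts the API server and needs api_port")
        argv += ["--port", str(api_port)]
    argv += ["--stage-id", str(stage_id)]
    if stage_id != 0:
        # Worker stages run without their own API server.
        argv.append("--headless")
    # Every stage points at the same master address and port.
    argv += ["--omni-master-address", master_address,
             "--omni-master-port", str(master_port)]
    argv += list(extra_args)
    return argv


model = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
print(shlex.join(build_stage_argv(model, 0, "127.0.0.1", 26000, api_port=8091)))
print(shlex.join(build_stage_argv(model, 1, "127.0.0.1", 26000)))
```

Keeping the master address and port as shared arguments makes it harder for a worker command to drift away from the values Stage 0 was started with.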

For migrated models, the bundled deploy YAML is resolved and loaded automatically, so the main path
does **not** need `--deploy-config`:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--port 8091 \
--stage-id 0 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000

vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

To use a custom deploy YAML in the new schema, add `--deploy-config /path/to/override.yaml`
to every stage command. For legacy models (e.g., BAGEL) still configured with the deprecated `stage_args:` schema, keep passing `--stage-configs-path /path/to/config.yaml`.

In a standard single-command launch, `--stage-overrides` is the usual way
to set stage-specific tuning from the CLI:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
```

With the **stage-based CLI**, each process launches a single pipeline stage, so overrides
can be passed as ordinary CLI flags on that stage's command, which makes a combined `--stage-overrides` JSON string unnecessary:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--gpu-memory-utilization 0.5 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

### Precedence

From highest to lowest:
Expand Down Expand Up @@ -133,6 +182,17 @@ vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"0": {"max_num_seqs": 8}}'
```

With the stage-based CLI, the same setting can be passed directly
as a command-line flag on the single-stage command:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 0 \
--max-num-seqs 8 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

Effective config per stage after the merge:

| Stage | Field | Final value | Source |
Expand All @@ -153,17 +213,28 @@ Therefore, as a core part of vLLM-Omni, the stage configs for a model have sever
- Input and output dependencies for each stage.
- Default input parameters.

If users want to modify some part of it. The custom stage_configs file can be input as input argument in both online and offline. Just like examples below:
To override specific parameters, pass a custom config file in either online or offline mode.
Use `--deploy-config` for new-schema deploy YAMLs, and keep `--stage-configs-path`
only for legacy `stage_args` YAMLs.
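
As a rough illustration of that choice, the Python sketch below picks a flag by looking for a top-level `stage_args:` key in the YAML text. `pick_config_flag` is a hypothetical helper and the regex check is an assumption made for this example; how vLLM-Omni actually auto-detects the schema is not shown in this document.

```python
import re


def pick_config_flag(yaml_text: str) -> str:
    """Guess which CLI flag fits a config file, based on its top-level keys."""
    # Assumption: legacy files carry a top-level `stage_args:` key,
    # and anything else is treated as a new-schema deploy YAML.
    if re.search(r"^stage_args\s*:", yaml_text, flags=re.MULTILINE):
        return "--stage-configs-path"
    return "--deploy-config"


# Placeholder YAML snippets, not real bundled configs.
print(pick_config_flag("stage_args: []\n"))
print(pick_config_flag("async_chunk: true\nstages: []\n"))
```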

Examples:

For offline (Assume necessary dependencies have ben imported):
For offline (Assume necessary dependencies have been imported):
```python
model_name = "Qwen/Qwen2.5-Omni-7B"
omni = Omni(model=model_name, stage_configs_path="/path/to/custom_stage_configs.yaml")
```

For online serving:
```bash
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
vllm serve Qwen/Qwen2.5-Omni-7B --omni --port 8091 --deploy-config /path/to/deploy_config.yaml
```

Legacy online serving:

```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```
!!! important
We are actively iterating on the definition of stage configs, and we welcome all feedback from both community users and developers to help us shape the development!
24 changes: 16 additions & 8 deletions docs/user_guide/examples/online_serving/bagel.md
Expand Up @@ -22,9 +22,16 @@ Or use the convenience script:

```bash
cd /workspace/vllm-omni/examples/online_serving/bagel
# Launch both stages in one session (legacy convenience flow)
bash run_server.sh

# Launch a single stage per terminal
bash run_server_stage_cli.sh --stage 0
bash run_server_stage_cli.sh --stage 1
```

If you have a custom stage configs file, launch the server with the command below:

```bash
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni --port 8091 --stage-configs-path /path/to/stage_configs_file
```

> **Collaborator:** do we still need to keep this?
>
> **Contributor Author:** #2936 by @princepride will handle docs relative with Bagel
Expand Down Expand Up @@ -115,12 +122,13 @@ mooncake_master \
**2. Launch Stage 0 (Thinker / Orchestrator)** on the orchestrator node:

```bash
# API server port for client requests: 8000
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
--port 8000 \ # API server port for client requests
--port 8000 \
--stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \
--stage-id 0 \
-oma <ORCHESTRATOR_IP> \
-omp 8091
--omni-master-address <ORCHESTRATOR_IP> \
--omni-master-port 8091
```

**3. Launch Stage 1 (DiT)** on the remote node in headless mode:
Expand All @@ -130,8 +138,8 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
--stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \
--stage-id 1 \
--headless \
-oma <ORCHESTRATOR_IP> \
-omp 8091
--omni-master-address <ORCHESTRATOR_IP> \
--omni-master-port 8091
```

**Mooncake Master arguments:**
Expand All @@ -150,8 +158,8 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
| :------- | :---------- |
| `--stage-id` | Which stage this process runs (0 = Thinker, 1 = DiT) |
| `--headless` | Run without the API server (worker-only mode) |
| `-oma` | Orchestrator master address |
| `-omp` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination |
| `--omni-master-address` | Orchestrator master address |
| `--omni-master-port` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination |

> [!IMPORTANT]
> **Startup Order**: Stage 0 (orchestrator) must be launched **before** Stage 1 (headless).
Expand All @@ -165,7 +173,7 @@ All nodes must have network connectivity to each other. Ensure the following por
| :--- | :------- | :------ | :-------- |
| 50051 | TCP | Mooncake Master RPC | Worker → Orchestrator |
| 8080 | TCP | Mooncake HTTP Metadata Server | Worker → Orchestrator |
| 8091 | TCP | Orchestrator Master (`-omp`) | Worker → Orchestrator |
| 8091 | TCP | Orchestrator Master (`--omni-master-port`) | Worker → Orchestrator |
| 8000 | TCP | API Server (`--port`) | Client → Orchestrator |
| 9003 | TCP | Metrics (optional) | Monitoring → Orchestrator |
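
The reachability requirement in the table can be spot-checked from a worker node before launching Stage 1. The sketch below is a generic TCP connect probe in Python, not part of vLLM-Omni. `check_ports` is a hypothetical helper, and it assumes the services listed above are already running on the orchestrator and tolerate a bare connect-and-close.

```python
import socket


def check_ports(host, ports, timeout=2.0):
    """Return {port: True/False} depending on whether a TCP connect succeeds."""
    results = {}
    for port in ports:
        try:
            # create_connection raises OSError on refusal, timeout, or DNS failure.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            results[port] = False
    return results
```

From a worker node, `check_ports("<ORCHESTRATOR_IP>", [50051, 8080, 8091])` would cover the three Worker → Orchestrator rows. A `False` entry points to a firewall rule or to a service that has not started yet.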

65 changes: 61 additions & 4 deletions docs/user_guide/examples/online_serving/qwen3_omni.md
Expand Up @@ -15,15 +15,72 @@ Please refer to [README.md](https://github.com/vllm-project/vllm-omni/tree/main/
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091
```

If you want to open async chunking for qwen3-omni, launch the server with command below
The default deploy config at `vllm_omni/deploy/qwen3_omni_moe.yaml` is resolved and loaded
automatically through the model registry, so standard deployments do not need `--deploy-config`.
Async chunk streaming is **enabled by default** in the bundled config.

To use a custom deploy YAML, pass its path:
```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /vllm_omni/deploy/qwen3_omni_moe.yaml
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--deploy-config /path/to/deploy_config_file
```

If you have custom stage configs file, launch the server with command below
### Launch individual stages (stage-based CLI)

Use the stage-based CLI to launch each stage as its own process.

**1. Stage 0 (Thinker + API server)**

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 --deploy-config /path/to/deploy_config_file
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--port 8091 \
--stage-id 0 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

**2. Stage 1 (Talker)**

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

**3. Stage 2 (Code2Wav)**

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 2 \
--headless \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

Add `--deploy-config /path/to/deploy_config_file` to every command if you want
to override the bundled deploy YAML.

For the regular one-process launch, stage-specific CLI tuning is usually done
with `--stage-overrides`, for example:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni --port 8091 \
--stage-overrides '{"1": {"gpu_memory_utilization": 0.5}}'
```

For the stage-based CLI, you usually do **not** need `--stage-overrides` for
that kind of change. Since each command launches one stage, just pass the knob
directly on that stage command:

```bash
vllm serve Qwen/Qwen3-Omni-30B-A3B-Instruct --omni \
--stage-id 1 \
--headless \
--gpu-memory-utilization 0.5 \
--omni-master-address 127.0.0.1 \
--omni-master-port 26000
```

### Send Multi-modal Request
30 changes: 16 additions & 14 deletions examples/online_serving/bagel/README.md
Expand Up @@ -19,7 +19,12 @@ Or use the convenience script:

```bash
cd /workspace/vllm-omni/examples/online_serving/bagel
# Launch all stages in one session (legacy convenience flow)
bash run_server.sh

# Launch a single stage per terminal
bash run_server_stage_cli.sh --stage 0
bash run_server_stage_cli.sh --stage 1
```

```bash
Expand Down Expand Up @@ -112,12 +117,13 @@ mooncake_master \
**2. Launch Stage 0 (Thinker / Orchestrator)** on the orchestrator node:

```bash
# API server port for client requests: 8000
vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
--port 8000 \ # API server port for client requests
--port 8000 \
--stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \
--stage-id 0 \
-oma <ORCHESTRATOR_IP> \
-omp 8091
--omni-master-address <ORCHESTRATOR_IP> \
--omni-master-port 8091
```

**3. Launch Stage 1 (DiT)** on the remote node in headless mode:
Expand All @@ -127,8 +133,8 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \
--stage-configs-path vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml \
--stage-id 1 \
--headless \
-oma <ORCHESTRATOR_IP> \
-omp 8091
--omni-master-address <ORCHESTRATOR_IP> \
--omni-master-port 8091
```

**Mooncake Master arguments:**
Expand All @@ -145,14 +151,10 @@ vllm serve ByteDance-Seed/BAGEL-7B-MoT --omni \

| Argument | Description |
| :------- | :---------- |
| `--stage-id` | Which stage this process runs (0 = Thinker, 1 = DiT) |
| `--headless` | Run without the API server (worker-only mode) |
| `-oma` | Orchestrator master address |
| `-omp` | Orchestrator master port for Stage 1 to connect to Stage 0 for task coordination |

> [!IMPORTANT]
> **Startup Order**: Stage 0 (orchestrator) must be launched **before** Stage 1 (headless).
> Stage 0 will appear to hang on startup until Stage 1 (worker) connects — this is expected behavior.
| `--stage-id` | Which stage this process runs (e.g., 0 = Thinker, 1 = DiT) |
| `--headless` | Run the worker stage without starting an API server |
| `--omni-master-address` | Address of the orchestrator master |
| `--omni-master-port` | Orchestrator master port that Stage 1 uses to connect to Stage 0 for task coordination |

**Network Requirements**

Expand All @@ -162,7 +164,7 @@ All nodes must have network connectivity to each other. Ensure the following por
| :--- | :------- | :------ | :-------- |
| 50051 | TCP | Mooncake Master RPC | Worker → Orchestrator |
| 8080 | TCP | Mooncake HTTP Metadata Server | Worker → Orchestrator |
| 8091 | TCP | Orchestrator Master (`-omp`) | Worker → Orchestrator |
| 8091 | TCP | Orchestrator Master (`--omni-master-port`) | Worker → Orchestrator |
| 8000 | TCP | API Server (`--port`) | Client → Orchestrator |
| 9003 | TCP | Metrics (optional) | Monitoring → Orchestrator |

8 changes: 4 additions & 4 deletions examples/online_serving/bagel/run_server_stage_cli.sh
Expand Up @@ -116,8 +116,8 @@ run_stage_0() {
--port "$PORT" \
--stage-configs-path "$STAGE_CONFIGS_PATH" \
--stage-id 0 \
-oma "$MASTER_ADDRESS" \
-omp "$MASTER_PORT" \
--omni-master-address "$MASTER_ADDRESS" \
--omni-master-port "$MASTER_PORT" \
"${EXTRA_ARGS[@]}"
}

Expand All @@ -127,8 +127,8 @@ run_stage_1() {
--stage-configs-path "$STAGE_CONFIGS_PATH" \
--stage-id 1 \
--headless \
-oma "$MASTER_ADDRESS" \
-omp "$MASTER_PORT" \
--omni-master-address "$MASTER_ADDRESS" \
--omni-master-port "$MASTER_PORT" \
"${EXTRA_ARGS[@]}"
}
