Merged
6 changes: 3 additions & 3 deletions .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
@@ -12,10 +12,10 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTT
<summary> Essential Elements of an Effective PR Description Checklist </summary>

- [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
- [ ] The test plan. Please providing the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/)
- [ ] The test results. Please pasting the results comparison before and after, or e2e results.
- [ ] The test plan. Please provide the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/)
- [ ] The test results. Please paste the results comparison before and after, or the e2e results.
- [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model. **Please run `mkdocs serve` to sync the documentation editions to `./docs`.**
- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft.
- [ ] (Optional) Release notes update. If your change is user-facing, please update the release notes draft.
</details>

**BEFORE SUBMITTING, PLEASE READ <https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md>** (anything written below this line will be removed by GitHub Actions)
6 changes: 6 additions & 0 deletions docs/.nav.yml
@@ -13,6 +13,7 @@ nav:
- examples/README.md
- Offline Inference:
- BAGEL-7B-MoT: user_guide/examples/offline_inference/bagel.md
- GLM-Image Multistage End-to-End Inference: user_guide/examples/offline_inference/glm_image.md
- Image-To-Image: user_guide/examples/offline_inference/image_to_image.md
- Image-To-Video: user_guide/examples/offline_inference/image_to_video.md
- Qwen2.5-Omni: user_guide/examples/offline_inference/qwen2_5_omni.md
@@ -23,6 +24,7 @@ nav:
- Text-To-Video: user_guide/examples/offline_inference/text_to_video.md
- Online Serving:
- BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md
- GLM-Image Online Serving: user_guide/examples/online_serving/glm_image.md
- Image-To-Image: user_guide/examples/online_serving/image_to_image.md
- Image-To-Video: user_guide/examples/online_serving/image_to_video.md
- Qwen2.5-Omni: user_guide/examples/online_serving/qwen2_5_omni.md
@@ -50,10 +52,13 @@ nav:
- Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
- CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
- LoRA: user_guide/diffusion/lora.md
- Hybrid Sharded Data Parallel: design/feature/hsdp.md
- Custom Pipeline: features/custom_pipeline.md
- ComfyUI: features/comfyui.md
- Developer Guide:
- General:
- contributing/README.md
- pr_reviewer.md
- glob: contributing/*
flatten_single_child_sections: true
- Model Implementation:
@@ -73,6 +78,7 @@ nav:
- design/feature/tensor_parallel.md
- design/feature/cache_dit.md
- design/feature/teacache.md
- design/feature/async_chunk_design.md
- Module Design:
- design/module/ar_module.md
- design/module/dit_module.md
86 changes: 86 additions & 0 deletions docs/user_guide/examples/offline_inference/bagel.md
@@ -154,6 +154,24 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can

------

#### Tensor Parallelism (TP)

For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) by modifying the stage configuration (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)).

1. **Set `tensor_parallel_size`**: Increase this value (e.g., to `2` or `4`).
2. **Set `devices`**: Specify the comma-separated GPU IDs to be used for the stage (e.g., `"0,1"`).

Example configuration for TP=2 on GPUs 0 and 1:
```yaml
engine_args:
  tensor_parallel_size: 2
  ...
runtime:
  devices: "0,1"
```
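
As a quick sanity check before launching, the sketch below verifies that a stage lists at least as many devices as its `tensor_parallel_size`. It assumes one GPU per TP rank and uses a plain dict standing in for the parsed stage config; the helper name `check_tp_devices` is hypothetical and not part of vLLM-Omni.

```python
def check_tp_devices(stage: dict) -> None:
    """Raise ValueError if the stage lists fewer GPUs than TP ranks.

    Assumption: each tensor-parallel rank needs its own device.
    `stage` mirrors the YAML fragment above as a plain dict.
    """
    tp_size = int(stage["engine_args"].get("tensor_parallel_size", 1))
    raw_devices = str(stage["runtime"]["devices"])
    devices = [d.strip() for d in raw_devices.split(",") if d.strip()]
    if len(devices) < tp_size:
        raise ValueError(
            f"tensor_parallel_size={tp_size} but only "
            f"{len(devices)} device(s) listed: {devices}"
        )


stage = {
    "engine_args": {"tensor_parallel_size": 2},
    "runtime": {"devices": "0,1"},
}
check_tp_devices(stage)  # passes silently for the TP=2 example
```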

------

#### 🔗 Runtime Configuration

| Parameter | Value | Description |
@@ -162,6 +180,74 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can
| `max_inflight` | `1` | Maximum inflight requests |
| `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB) |

## Using Mooncake Connector

[Mooncake](https://github.com/kvcache-ai/Mooncake) is a high-performance distributed KV cache transfer engine that enables efficient cross-node data movement via TCP or RDMA, making it ideal for multi-node disaggregated inference.

By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can switch to the Mooncake connector for better performance on multi-GPU setups and to enable multi-node deployment.

### Prerequisites

Install the Mooncake transfer engine:

```bash
# For CUDA-enabled systems (recommended)
pip install mooncake-transfer-engine

# For non-CUDA systems
pip install mooncake-transfer-engine-non-cuda
```

### Step 1: Start the Mooncake Master

On the **primary node**, start the Mooncake master service (run it in a separate terminal, or in the background with `&`):

```bash
# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
mkdir -p ./mc_storage

mooncake_master \
  --rpc_port=50051 \
  --enable_http_metadata_server=true \
  --http_metadata_server_host=0.0.0.0 \
  --http_metadata_server_port=8080 \
  --metrics_port=9003 \
  --root_fs_dir=./mc_storage/ \
  --cluster_id=mc-local-1 &
```
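
Before moving on, it can help to confirm that the master is actually listening. The following is a minimal sketch using only the Python standard library; the port numbers come from the command above, while the host `127.0.0.1` and the helper name `port_open` are assumptions for illustration. A successful TCP connection only shows that something is listening on the port, not that the service is healthy.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Ports taken from the mooncake_master command above; the host is an assumption.
    for name, port in [("rpc", 50051), ("http metadata", 8080), ("metrics", 9003)]:
        state = "reachable" if port_open("127.0.0.1", port) else "not reachable"
        print(f"{name} port {port}: {state}")
```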

### Step 2: Run Offline Inference with Mooncake

Use the provided Mooncake stage config [`bagel_multiconnector.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml). Before launching, update the `metadata_server` and `master` addresses in the YAML to match your Mooncake master node's IP (use `127.0.0.1` for single-node testing).

```bash
cd examples/offline_inference/bagel

# Text to Image with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
  --modality text2img \
  --prompts "A cute cat" \
  --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

# Image to Text with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
  --modality img2text \
  --image-path /path/to/image.jpg \
  --prompts "Describe this image" \
  --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml

# Text to Text with Mooncake
python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
  --modality text2text \
  --prompts "What is the capital of France?" \
  --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
```
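
Updating the `metadata_server` and `master` addresses by hand is easy to get wrong when the same IP appears more than once. The sketch below swaps one host for another in the config text. The helper `replace_host` is hypothetical, and the sample values are illustrative only: they reuse the ports from Step 1 but are not copied from the real `bagel_multiconnector.yaml`, whose exact value format is not shown here.

```python
import re


def replace_host(config_text: str, old_host: str, new_host: str) -> str:
    """Replace every standalone occurrence of old_host with new_host."""
    # Lookarounds stop "10.0.0.1" from matching inside "110.0.0.12".
    pattern = r"(?<![\w.])" + re.escape(old_host) + r"(?![\w.])"
    return re.sub(pattern, lambda _match: new_host, config_text)


# Illustrative config text; the address formats are assumptions.
sample = (
    'metadata_server: "http://10.0.0.1:8080/metadata"\n'
    'master: "10.0.0.1:50051"\n'
)
print(replace_host(sample, "10.0.0.1", "127.0.0.1"))
```

Working on the raw text rather than a parsed YAML tree keeps comments and key order in the file untouched.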

For more details on the Mooncake connector and multi-node setup, see the [Mooncake Store Connector documentation](https://github.com/vllm-project/vllm-omni/tree/main/docs/design/feature/omni_connectors/mooncake_store_connector.md).

------

## FAQ

- If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.
156 changes: 156 additions & 0 deletions docs/user_guide/examples/offline_inference/glm_image.md
@@ -0,0 +1,156 @@
# GLM-Image Multistage End-to-End Inference

Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/glm_image>.


This example demonstrates how to run GLM-Image with the vLLM-Omni multistage architecture.

## Architecture

GLM-Image uses a 2-stage pipeline:

```
┌─────────────────────────────────────────────────────────────┐
│ GLM-Image Pipeline │
├─────────────────────────────────────────────────────────────┤
│ │
│ Stage 0 (AR Model) Stage 1 (Diffusion) │
│ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ vLLM-optimized │ │ GlmImagePipeline │ │
│ │ GlmImageFor │ prior │ ┌───────────────┐ │ │
│ │ Conditional │──tokens───►│ │ DiT Denoiser │ │ │
│ │ Generation │ │ └───────────────┘ │ │
│ │ (9B AR model) │ │ │ │ │
│ └─────────────────┘ │ ▼ │ │
│ ▲ │ ┌───────────────┐ │ │
│ │ │ │ VAE Decode │──┼──► Image
│ Text/Image │ └───────────────┘ │ │
│ Input └─────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```

## Features

- **vLLM-optimized AR**: Uses PagedAttention and tensor parallelism for faster prior token generation
- **Flexible deployment**: AR and Diffusion stages can run on different GPUs
- **Text-to-Image**: Generate images from text descriptions
- **Image-to-Image**: Edit existing images with text prompts

## Usage

### Text-to-Image

```bash
python end2end.py \
  --model-path /path/to/glm-image \
  --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
  --prompt "A beautiful sunset over the ocean with sailing boats" \
  --height 1024 \
  --width 1024 \
  --output output_t2i.png
```

### Image-to-Image (Image Editing)

```bash
python end2end.py \
  --model-path /path/to/glm-image \
  --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
  --prompt "Transform this scene into a winter wonderland" \
  --image input.png \
  --output output_i2i.png
```

### With Custom Parameters

```bash
python end2end.py \
  --model-path /path/to/glm-image \
  --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
  --prompt "A photorealistic cat sitting on a window sill" \
  --height 1024 \
  --width 1024 \
  --num-inference-steps 50 \
  --guidance-scale 1.5 \
  --seed 42 \
  --output output.png
```
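
To run several prompts in a row, the commands above can be assembled programmatically. This is a minimal sketch, not part of the example directory: `build_command` is a hypothetical helper, it only uses flags that appear in the commands above, and each invocation starts a fresh process (so the model is reloaded every time).

```python
import shlex


def build_command(model_path, config_path, prompt, output,
                  height=1024, width=1024, seed=None):
    """Return the argv list for one end2end.py text-to-image run."""
    cmd = [
        "python", "end2end.py",
        "--model-path", model_path,
        "--config-path", config_path,
        "--prompt", prompt,
        "--height", str(height),
        "--width", str(width),
        "--output", output,
    ]
    if seed is not None:
        cmd += ["--seed", str(seed)]
    return cmd


prompts = ["A beautiful sunset over the ocean", "A photorealistic cat"]
for index, prompt in enumerate(prompts):
    cmd = build_command(
        "/path/to/glm-image",
        "../../vllm_omni/model_executor/stage_configs/glm_image.yaml",
        prompt,
        f"output_{index}.png",
        seed=42,
    )
    print(shlex.join(cmd))
    # To execute instead of printing: subprocess.run(cmd, check=True)
```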

## Shell Scripts

### Run Text-to-Image

```bash
./run_t2i.sh
```

### Run Image-to-Image

```bash
./run_i2i.sh --image /path/to/input.png
```

## Stage Configuration

The stage config (`glm_image.yaml`) defines:

- **Stage 0 (AR)**: Uses `GPUARWorker` with vLLM engine
    - Model: `GlmImageForConditionalGeneration`
    - Output: `token_ids` (prior tokens)

- **Stage 1 (Diffusion)**: Uses diffusion engine
    - Model: `GlmImagePipeline`
    - Output: Generated image

See `vllm_omni/model_executor/stage_configs/glm_image.yaml` for full configuration.

## Comparison with Single-Stage

| Aspect | Single-Stage (transformers) | Multistage (vLLM) |
| ----------- | --------------------------- | ------------------- |
| AR Model | transformers native | vLLM PagedAttention |
| Memory | Higher (no KV cache opt) | Lower (optimized) |
| Throughput | Lower | Higher |
| Flexibility | Single GPU | Multi-GPU support |

## Troubleshooting

### OOM Error

Try reducing memory usage:

```yaml
# In glm_image.yaml, adjust:
gpu_memory_utilization: 0.5 # Reduce from 0.6
```

### Slow Initialization

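The first run has to load the model weights, so it can be slow; subsequent runs are faster. If initialization times out (for example, on slow storage), increase the stage initialization timeout: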

```bash
--stage-init-timeout 900 # Increase timeout for slow storage
```

## Requirements

- vLLM-Omni with GLM-Image support
- CUDA-capable GPU (recommended: H100/A100 with 80GB)
- GLM-Image model weights

## Example materials

??? abstract "end2end.py"
    ``````py
    --8<-- "examples/offline_inference/glm_image/end2end.py"
    ``````
??? abstract "run_i2i.sh"
    ``````sh
    --8<-- "examples/offline_inference/glm_image/run_i2i.sh"
    ``````
??? abstract "run_t2i.sh"
    ``````sh
    --8<-- "examples/offline_inference/glm_image/run_t2i.sh"
    ``````
Original file line number Diff line number Diff line change
@@ -69,6 +69,11 @@ Key arguments:
- `--vae-use-tiling`: Enable VAE tiling for memory optimization.
- `--cfg-parallel-size`: Set to `2` to enable CFG Parallel. See more examples in [`user_guide`](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
- `--enable-cpu-offload`: Enable CPU offloading for diffusion models.
- `--use-hsdp`: Enable Hybrid Sharded Data Parallel to shard model weights across GPUs.
- `--hsdp-shard-size`: Number of GPUs to shard model weights across within each replica group. `-1` (default) auto-calculates it as `world_size / replicate_size`.
- `--hsdp-replicate-size`: Number of replica groups for HSDP. Each replica holds a full sharded copy. The default of `1` means pure sharding (no replication).
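
The auto-calculation for `--hsdp-shard-size` can be illustrated with a small worked example. This is a sketch of the arithmetic described above, not the project's implementation: `resolve_hsdp_shard_size` is a hypothetical helper, and the check that `shard_size * replicate_size` equals the world size is an assumption about how the two values must fit together.

```python
def resolve_hsdp_shard_size(world_size: int, replicate_size: int = 1,
                            shard_size: int = -1) -> int:
    """Return the effective shard size for a given world size.

    shard_size == -1 means "auto": world_size / replicate_size.
    """
    if replicate_size < 1 or world_size % replicate_size != 0:
        raise ValueError("world_size must be divisible by replicate_size")
    auto = world_size // replicate_size
    if shard_size == -1:
        return auto
    # Assumption: an explicit shard size must tile the world size exactly.
    if shard_size * replicate_size != world_size:
        raise ValueError("shard_size * replicate_size must equal world_size")
    return shard_size


print(resolve_hsdp_shard_size(8))                    # → 8 (pure sharding)
print(resolve_hsdp_shard_size(8, replicate_size=2))  # → 4 (2 replicas of 4 shards)
```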



> ℹ️ If you encounter OOM errors, try using `--vae-use-slicing` and `--vae-use-tiling` to reduce memory usage.

32 changes: 31 additions & 1 deletion docs/user_guide/examples/offline_inference/qwen3_tts.md
@@ -90,13 +90,43 @@ Examples:
python end2end.py --query-type Base --mode-tag icl
```

## Streaming Mode

Add `--streaming` to stream audio chunks progressively via `AsyncOmni` (requires `async_chunk: true` in the stage config):

```bash
python end2end.py --query-type CustomVoice --streaming --output-dir /tmp/out_stream
```

Each 25-frame Code2Wav chunk is logged as it arrives. The final WAV file is written once generation
completes. This demonstrates that audio data is available progressively rather than only at the end.

> **Note:** Streaming uses `AsyncOmni` internally. The non-streaming path (`Omni`) is unchanged.
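
To make the progressive delivery concrete, here is a toy sketch of consuming fixed-size chunks from an async generator. It does not use the `AsyncOmni` API; `stream_chunks` and `collect` are hypothetical stand-ins, and only the 25-frame chunk size is taken from the text above.

```python
import asyncio
from typing import AsyncIterator, List

CHUNK_FRAMES = 25  # chunk size quoted in the text above


async def stream_chunks(frames: List[int],
                        chunk_frames: int = CHUNK_FRAMES) -> AsyncIterator[List[int]]:
    """Yield the frames in fixed-size chunks, one chunk at a time."""
    for start in range(0, len(frames), chunk_frames):
        await asyncio.sleep(0)  # yield control, standing in for decode latency
        yield frames[start:start + chunk_frames]


async def collect(frames: List[int]) -> List[int]:
    """Consume chunks as they arrive and record each chunk's size."""
    sizes = []
    async for chunk in stream_chunks(frames):
        sizes.append(len(chunk))  # a real consumer could log or play the chunk here
    return sizes


print(asyncio.run(collect(list(range(60)))))  # → [25, 25, 10]
```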

## Batched Decoding

The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, provide a stage config with `max_batch_size > 1` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`.

```bash
python end2end.py --query-type CustomVoice \
  --txt-prompts benchmark_prompts.txt \
  --batch-size 4 \
  --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml
```

**Important:** `--batch-size` must match a CUDA graph capture size (1, 2, 4, 8, 16...) because the Talker's code predictor KV cache is sized to `max_num_seqs`, and CUDA graphs pad the batch to the next capture size. Both stages need `max_batch_size >= batch_size` in the stage config for batching to take effect. If only stage 1 has a higher `max_batch_size`, it won't help — stage 1 can only batch chunks from requests that are in-flight simultaneously, which requires stage 0 to also process multiple requests concurrently.
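
A small helper can catch a mismatched `--batch-size` before launching. The sketch below assumes the capture sizes are the powers of two suggested by the list above (1, 2, 4, 8, 16, and so on); the actual set depends on the engine configuration, and both function names are hypothetical.

```python
def next_capture_size(batch_size: int) -> int:
    """Return the smallest power of two that is >= batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    size = 1
    while size < batch_size:
        size *= 2
    return size


def is_capture_size(batch_size: int) -> bool:
    """True if batch_size already equals an assumed capture size."""
    return batch_size >= 1 and next_capture_size(batch_size) == batch_size


print(is_capture_size(4))    # → True
print(is_capture_size(3))    # → False
print(next_capture_size(3))  # → 4
```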

## Notes

- The script uses the model paths embedded in `end2end.py`. Update them if your local cache path differs.
- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.
- Use `--output-dir` to change the output folder.

## Example materials

??? abstract "benchmark_prompts.txt"
``````txt
--8<-- "examples/offline_inference/qwen3_tts/benchmark_prompts.txt"
``````
??? abstract "end2end.py"
``````py
--8<-- "examples/offline_inference/qwen3_tts/end2end.py"