vllm-project · hsliuustc0106 · Feb 27, 2026
@@ -12,10 +12,10 @@ PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTT
 <summary> Essential Elements of an Effective PR Description Checklist </summary>
 
 - [ ] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
-- [ ] The test plan. Please providing the test scripts & test commands. Please state the reasons if your codes don't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/)
-- [ ] The test results. Please pasting the results comparison before and after, or e2e results.
+- [ ] The test plan. Please provide the test scripts & test commands. Please state the reasons if your code doesn't require additional test scripts. For test file guidelines, please check the [test style doc](https://docs.vllm.ai/projects/vllm-omni/en/latest/contributing/ci/tests_style/)
+- [ ] The test results. Please paste the results comparison before and after, or the e2e results.
 - [ ] (Optional) The necessary documentation update, such as updating `supported_models.md` and `examples` for a new model. **Please run `mkdocs serve` to sync the documentation editions to `./docs`.**
-- [ ] (Optional) Release notes update. If your change is user facing, please update the release notes draft.
+- [ ] (Optional) Release notes update. If your change is user-facing, please update the release notes draft.
 </details>
 
 **BEFORE SUBMITTING, PLEASE READ <https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md>** (anything written below this line will be removed by GitHub Actions)
@@ -13,6 +13,7 @@ nav:
     - examples/README.md
     - Offline Inference:
       - BAGEL-7B-MoT: user_guide/examples/offline_inference/bagel.md
+      - GLM-Image Multistage End-to-End Inference: user_guide/examples/offline_inference/glm_image.md
       - Image-To-Image: user_guide/examples/offline_inference/image_to_image.md
       - Image-To-Video: user_guide/examples/offline_inference/image_to_video.md
       - Qwen2.5-Omni: user_guide/examples/offline_inference/qwen2_5_omni.md
@@ -23,6 +24,7 @@ nav:
       - Text-To-Video: user_guide/examples/offline_inference/text_to_video.md
     - Online Serving:
       - BAGEL-7B-MoT: user_guide/examples/online_serving/bagel.md
+      - GLM-Image Online Serving: user_guide/examples/online_serving/glm_image.md
       - Image-To-Image: user_guide/examples/online_serving/image_to_image.md
       - Image-To-Video: user_guide/examples/online_serving/image_to_video.md
       - Qwen2.5-Omni: user_guide/examples/online_serving/qwen2_5_omni.md
@@ -50,10 +52,13 @@ nav:
       - Parallelism Acceleration: user_guide/diffusion/parallelism_acceleration.md
       - CPU Offloading: user_guide/diffusion/cpu_offload_diffusion.md
       - LoRA: user_guide/diffusion/lora.md
+      - Hybrid Sharded Data Parallel: design/feature/hsdp.md
+      - Custom Pipeline: features/custom_pipeline.md
     - ComfyUI: features/comfyui.md
 - Developer Guide:
   - General:
     - contributing/README.md
+    - pr_reviewer.md
     - glob: contributing/*
       flatten_single_child_sections: true
   - Model Implementation:
@@ -73,6 +78,7 @@ nav:
       - design/feature/tensor_parallel.md
       - design/feature/cache_dit.md
       - design/feature/teacache.md
+      - design/feature/async_chunk_design.md
     - Module Design:
       - design/module/ar_module.md
       - design/module/dit_module.md

@@ -154,6 +154,24 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can
 
 ------
 
+#### Tensor Parallelism (TP)
+
+For larger models or multi-GPU environments, you can enable Tensor Parallelism (TP) by modifying the stage configuration (e.g., [`bagel.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel.yaml)).
+
+1. **Set `tensor_parallel_size`**: Increase this value (e.g., to `2` or `4`).
+2. **Set `devices`**: Specify the comma-separated GPU IDs to be used for the stage (e.g., `"0,1"`).
+
+Example configuration for TP=2 on GPUs 0 and 1:
+```yaml
+    engine_args:
+      tensor_parallel_size: 2
+      ...
+    runtime:
+      devices: "0,1"
+```
+
+------
+
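A quick sanity check for the TP settings above can be sketched in Python. This is a minimal sketch, not part of vLLM-Omni: the helper name `devices_match_tp` is hypothetical, the fallback defaults are illustrative, and the rule that the number of GPU IDs in `devices` should equal `tensor_parallel_size` is an assumption inferred from the TP=2 example.

```python
def devices_match_tp(stage: dict) -> bool:
    """Return True when the number of GPU IDs equals tensor_parallel_size.

    `stage` mirrors the YAML fragment above as a plain dict. The defaults
    used when a key is missing (TP size 1, device "0") are assumptions.
    """
    tp_size = stage.get("engine_args", {}).get("tensor_parallel_size", 1)
    devices = str(stage.get("runtime", {}).get("devices", "0"))
    # Split the comma-separated GPU ID string and drop empty entries.
    gpu_ids = [d.strip() for d in devices.split(",") if d.strip()]
    return len(gpu_ids) == tp_size


# The TP=2 example from above, written as a dict.
stage = {
    "engine_args": {"tensor_parallel_size": 2},
    "runtime": {"devices": "0,1"},
}
print(devices_match_tp(stage))
```

Running a check like this before launch may catch a mismatch between the two settings earlier than a failed stage initialization would.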
 #### 🔗 Runtime Configuration
 
 | Parameter             | Value   | Description                      |
@@ -162,6 +180,74 @@ The default yaml configuration deploys Thinker and DiT on the same GPU. You can
 | `max_inflight`        | `1`     | Maximum inflight requests        |
 | `shm_threshold_bytes` | `65536` | Shared memory threshold (64KB)   |
 
+## Using Mooncake Connector
+
+[Mooncake](https://github.com/kvcache-ai/Mooncake) is a high-performance distributed KV cache transfer engine that enables efficient cross-node data movement via TCP or RDMA, making it ideal for multi-node disaggregated inference.
+
+By default, BAGEL uses `SharedMemoryConnector` for inter-stage communication. You can switch to the Mooncake connector for better performance on multi-GPU setups and to enable multi-node deployment.
+
+### Prerequisites
+
+Install the Mooncake transfer engine:
+
+```bash
+# For CUDA-enabled systems (recommended)
+pip install mooncake-transfer-engine
+
+# For non-CUDA systems
+pip install mooncake-transfer-engine-non-cuda
+```
+
+### Step 1: Start the Mooncake Master
+
+On the **primary node**, start the Mooncake master service (run in a separate terminal or background with `&`):
+
+```bash
+# Optional: enable disk-backed storage by creating a directory and passing --root_fs_dir.
+# Without it, Mooncake runs in memory-only mode, which is sufficient for KV cache transfer.
+mkdir -p ./mc_storage
+
+mooncake_master \
+  --rpc_port=50051 \
+  --enable_http_metadata_server=true \
+  --http_metadata_server_host=0.0.0.0 \
+  --http_metadata_server_port=8080 \
+  --metrics_port=9003 \
+  --root_fs_dir=./mc_storage/ \
+  --cluster_id=mc-local-1 &
+```
+
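Before moving on to Step 2, it can help to confirm that the master is listening. The sketch below uses only the Python standard library. The helper `port_is_open` is hypothetical, the port numbers are the ones passed to `mooncake_master` above, and treating a successful TCP connection as "the service is up" is an assumption rather than a documented Mooncake health check.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the mooncake_master command above.
    for name, port in {"rpc": 50051, "http_metadata": 8080}.items():
        state = "open" if port_is_open("127.0.0.1", port) else "closed"
        print(f"{name} port {port}: {state}")
```

For a multi-node setup, replace `127.0.0.1` with the primary node's IP so the check runs against the same address you put in the stage config.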
+### Step 2: Run Offline Inference with Mooncake
+
+Use the provided Mooncake stage config [`bagel_multiconnector.yaml`](https://github.com/vllm-project/vllm-omni/tree/main/vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml). Before launching, update the `metadata_server` and `master` addresses in the YAML to match your Mooncake master node's IP (use `127.0.0.1` for single-node testing).
+
+```bash
+cd examples/offline_inference/bagel
+
+# Text to Image with Mooncake
+python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
+                  --modality text2img \
+                  --prompts "A cute cat" \
+                  --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
+
+# Image to Text with Mooncake
+python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
+                  --modality img2text \
+                  --image-path /path/to/image.jpg \
+                  --prompts "Describe this image" \
+                  --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
+
+# Text to Text with Mooncake
+python end2end.py --model ByteDance-Seed/BAGEL-7B-MoT \
+                  --modality text2text \
+                  --prompts "What is the capital of France?" \
+                  --stage-configs-path ../../../vllm_omni/model_executor/stage_configs/bagel_multiconnector.yaml
+```
+
+For more details on the Mooncake connector and multi-node setup, see the [Mooncake Store Connector documentation](https://github.com/vllm-project/vllm-omni/tree/main/docs/design/feature/omni_connectors/mooncake_store_connector.md).
+
+------
+
 ## FAQ
 
 - If you encounter an error about the backend of librosa, try to install ffmpeg with the command below.

@@ -0,0 +1,156 @@
+# GLM-Image Multistage End-to-End Inference
+
+Source <https://github.com/vllm-project/vllm-omni/tree/main/examples/offline_inference/glm_image>.
+
+
+This example demonstrates how to run GLM-Image with the vLLM-Omni multistage architecture.
+
+## Architecture
+
+GLM-Image uses a 2-stage pipeline:
+
+```
+┌─────────────────────────────────────────────────────────────┐
+│                     GLM-Image Pipeline                       │
+├─────────────────────────────────────────────────────────────┤
+│                                                             │
+│  Stage 0 (AR Model)              Stage 1 (Diffusion)        │
+│  ┌─────────────────┐            ┌─────────────────────┐     │
+│  │ vLLM-optimized  │            │  GlmImagePipeline   │     │
+│  │ GlmImageFor     │  prior     │  ┌───────────────┐  │     │
+│  │ Conditional     │──tokens───►│  │ DiT Denoiser  │  │     │
+│  │ Generation      │            │  └───────────────┘  │     │
+│  │ (9B AR model)   │            │         │          │     │
+│  └─────────────────┘            │         ▼          │     │
+│         ▲                       │  ┌───────────────┐  │     │
+│         │                       │  │  VAE Decode   │──┼──► Image
+│    Text/Image                   │  └───────────────┘  │     │
+│      Input                      └─────────────────────┘     │
+│                                                             │
+└─────────────────────────────────────────────────────────────┘
+```
+
+## Features
+
+- **vLLM-optimized AR**: Uses PagedAttention and tensor parallelism for faster prior token generation
+- **Flexible deployment**: AR and Diffusion stages can run on different GPUs
+- **Text-to-Image**: Generate images from text descriptions
+- **Image-to-Image**: Edit existing images with text prompts
+
+## Usage
+
+### Text-to-Image
+
+```bash
+python end2end.py \
+    --model-path /path/to/glm-image \
+    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
+    --prompt "A beautiful sunset over the ocean with sailing boats" \
+    --height 1024 \
+    --width 1024 \
+    --output output_t2i.png
+```
+
+### Image-to-Image (Image Editing)
+
+```bash
+python end2end.py \
+    --model-path /path/to/glm-image \
+    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
+    --prompt "Transform this scene into a winter wonderland" \
+    --image input.png \
+    --output output_i2i.png
+```
+
+### With Custom Parameters
+
+```bash
+python end2end.py \
+    --model-path /path/to/glm-image \
+    --config-path ../../vllm_omni/model_executor/stage_configs/glm_image.yaml \
+    --prompt "A photorealistic cat sitting on a window sill" \
+    --height 1024 \
+    --width 1024 \
+    --num-inference-steps 50 \
+    --guidance-scale 1.5 \
+    --seed 42 \
+    --output output.png
+```
+
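When scripting several runs, the commands above can be assembled programmatically. The following is a minimal sketch: `build_glm_image_command` is a hypothetical helper, and it only emits flags that appear in the usage examples above. It does not validate values or call into vLLM-Omni.

```python
import shlex


def build_glm_image_command(model_path, config_path, prompt, output,
                            image=None, **options):
    """Build the argument list for end2end.py from the flags shown above.

    Keyword options map to flags by replacing underscores with hyphens,
    e.g. num_inference_steps=50 becomes `--num-inference-steps 50`.
    """
    cmd = [
        "python", "end2end.py",
        "--model-path", model_path,
        "--config-path", config_path,
        "--prompt", prompt,
    ]
    if image is not None:
        # Passing an input image selects the image-to-image path.
        cmd += ["--image", image]
    for name, value in options.items():
        cmd += [f"--{name.replace('_', '-')}", str(value)]
    cmd += ["--output", output]
    return cmd


cmd = build_glm_image_command(
    "/path/to/glm-image",
    "../../vllm_omni/model_executor/stage_configs/glm_image.yaml",
    "A photorealistic cat sitting on a window sill",
    "output.png",
    height=1024, width=1024, num_inference_steps=50,
    guidance_scale=1.5, seed=42,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the example directory. Building a list instead of a single string avoids shell-quoting problems with prompts that contain spaces.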
+## Shell Scripts
+
+### Run Text-to-Image
+
+```bash
+./run_t2i.sh
+```
+
+### Run Image-to-Image
+
+```bash
+./run_i2i.sh --image /path/to/input.png
+```
+
+## Stage Configuration
+
+The stage config (`glm_image.yaml`) defines:
+
+- **Stage 0 (AR)**: Uses `GPUARWorker` with vLLM engine
+
+  - Model: `GlmImageForConditionalGeneration`
+  - Output: `token_ids` (prior tokens)
+
+- **Stage 1 (Diffusion)**: Uses diffusion engine
+  - Model: `GlmImagePipeline`
+  - Output: Generated image
+
+See `vllm_omni/model_executor/stage_configs/glm_image.yaml` for the full configuration.
+
+## Comparison with Single-Stage
+
+| Aspect      | Single-Stage (transformers) | Multistage (vLLM)   |
+| ----------- | --------------------------- | ------------------- |
+| AR Model    | transformers native         | vLLM PagedAttention |
+| Memory      | Higher (no KV cache opt)    | Lower (optimized)   |
+| Throughput  | Lower                       | Higher              |
+| Flexibility | Single GPU                  | Multi-GPU support   |
+
+## Troubleshooting
+
+### OOM Error
+
+Try reducing memory usage:
+
+```yaml
+# In glm_image.yaml, adjust:
+gpu_memory_utilization: 0.5  # Reduce from 0.6
+```
+
+### Slow Initialization
+
+The first run loads model weights, so initialization can be slow; subsequent runs are faster. On slow storage, increase the stage initialization timeout:
+
+```bash
+--stage-init-timeout 900  # Increase timeout for slow storage
+```
+
+## Requirements
+
+- vLLM-Omni with GLM-Image support
+- CUDA-capable GPU (recommended: H100/A100 with 80GB)
+- GLM-Image model weights
+
+## Example materials
+
+??? abstract "end2end.py"
+    ``````py
+    --8<-- "examples/offline_inference/glm_image/end2end.py"
+    ``````
+??? abstract "run_i2i.sh"
+    ``````sh
+    --8<-- "examples/offline_inference/glm_image/run_i2i.sh"
+    ``````
+??? abstract "run_t2i.sh"
+    ``````sh
+    --8<-- "examples/offline_inference/glm_image/run_t2i.sh"
+    ``````
@@ -69,6 +69,11 @@ Key arguments:
 - `--vae-use-tiling`: Enable VAE tiling for memory optimization.
 - `--cfg-parallel-size`: set it to 2 to enable CFG Parallel. See more examples in [`user_guide`](https://github.com/vllm-project/vllm-omni/tree/main/docs/user_guide/diffusion/parallelism_acceleration.md#cfg-parallel).
 - `--enable-cpu-offload`: enable CPU offloading for diffusion models.
+- `--use-hsdp`: Enable Hybrid Sharded Data Parallel to shard model weights across GPUs.
+- `--hsdp-shard-size`: Number of GPUs to shard model weights across within each replica group. The default, `-1`, auto-calculates it as `world_size / replicate_size`.
+- `--hsdp-replicate-size`: Number of replica groups for HSDP. Each replica holds a full sharded copy. The default, `1`, means pure sharding (no replication).
+
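The auto-calculation described for `--hsdp-shard-size` can be illustrated with a small worked example. This is a sketch of the stated rule (`world_size / replicate_size`), not the project's implementation. The function name is hypothetical, and the divisibility check is an added assumption.

```python
def resolve_hsdp_shard_size(world_size: int, replicate_size: int = 1,
                            shard_size: int = -1) -> int:
    """Resolve the effective shard size following the rule described above."""
    if shard_size != -1:
        # An explicit value is returned unchanged.
        return shard_size
    # Assumption: the GPUs must split evenly into replica groups.
    if replicate_size < 1 or world_size % replicate_size != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by "
            f"replicate_size={replicate_size}"
        )
    return world_size // replicate_size


print(resolve_hsdp_shard_size(8))                    # 8: pure sharding
print(resolve_hsdp_shard_size(8, replicate_size=2))  # 4: two groups of four
```

With 8 GPUs and the defaults, all 8 GPUs form one shard group. With `replicate_size=2`, each of the two replica groups shards the weights across 4 GPUs.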
+
 
 > ℹ️ If you encounter OOM errors, try using `--vae-use-slicing` and `--vae-use-tiling` to reduce memory usage.
 

@@ -90,13 +90,43 @@ Examples:
 python end2end.py --query-type Base --mode-tag icl
 ```
 
+## Streaming Mode
+
+Add `--streaming` to stream audio chunks progressively via `AsyncOmni` (requires `async_chunk: true` in the stage config):
+
+```bash
+python end2end.py --query-type CustomVoice --streaming --output-dir /tmp/out_stream
+```
+
+Each 25-frame Code2Wav chunk is logged as it arrives. The final WAV file is written once generation
+completes. This demonstrates that audio data is available progressively rather than only at the end.
+
+> **Note:** Streaming uses `AsyncOmni` internally. The non-streaming path (`Omni`) is unchanged.
+
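The progressive delivery described above follows the usual async-generator consumption pattern. The sketch below illustrates that pattern only. It does not call `AsyncOmni`, whose API is not shown here: `fake_chunk_stream` is a hypothetical stand-in that yields frame counts instead of audio, and only the 25-frame chunk size is taken from the text.

```python
import asyncio

FRAMES_PER_CHUNK = 25  # chunk size quoted in the text above


async def fake_chunk_stream(total_frames: int):
    """Hypothetical stand-in for a streaming source; yields frames per chunk."""
    sent = 0
    while sent < total_frames:
        n = min(FRAMES_PER_CHUNK, total_frames - sent)
        await asyncio.sleep(0)  # hand control back to the event loop
        sent += n
        yield n


async def consume(total_frames: int) -> list:
    """Handle each chunk as it arrives, then return all chunk sizes."""
    chunks = []
    async for n in fake_chunk_stream(total_frames):
        # A real consumer would log, play, or forward the audio chunk here.
        chunks.append(n)
    return chunks


print(asyncio.run(consume(60)))  # [25, 25, 10]
```

The consumer sees the first chunk before the source has finished producing the rest, which is the property the streaming mode is meant to demonstrate.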
+## Batched Decoding
+
+The Code2Wav stage (stage 1) supports batched decoding, where multiple requests are decoded in a single forward pass through the SpeechTokenizer. To use it, provide a stage config with `max_batch_size > 1` and pass multiple prompts via `--txt-prompts` with a matching `--batch-size`.
+
+```bash
+python end2end.py --query-type CustomVoice \
+    --txt-prompts benchmark_prompts.txt \
+    --batch-size 4 \
+    --stage-configs-path vllm_omni/model_executor/stage_configs/qwen3_tts_batch.yaml
+```
+
+**Important:** `--batch-size` must match a CUDA graph capture size (1, 2, 4, 8, 16...) because the Talker's code predictor KV cache is sized to `max_num_seqs`, and CUDA graphs pad the batch to the next capture size. Both stages need `max_batch_size >= batch_size` in the stage config for batching to take effect. Raising `max_batch_size` on stage 1 alone does not help: stage 1 can only batch chunks from requests that are in flight simultaneously, which requires stage 0 to process multiple requests concurrently as well.
+
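The two constraints above can be checked before launching a batched run. This is a minimal sketch with hypothetical helper names. The capture-size tuple contains only the values listed above; the source list continues past 16, so extend it to match your configuration.

```python
# Capture sizes as listed above. The source list continues beyond 16.
LISTED_CAPTURE_SIZES = (1, 2, 4, 8, 16)


def padded_batch_size(batch_size, capture_sizes=LISTED_CAPTURE_SIZES):
    """Return the smallest capture size >= batch_size, or None if none fits."""
    for size in sorted(capture_sizes):
        if size >= batch_size:
            return size
    return None


def check_batch_config(batch_size, stage_max_batch_sizes,
                       capture_sizes=LISTED_CAPTURE_SIZES):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    padded = padded_batch_size(batch_size, capture_sizes)
    if padded != batch_size:
        problems.append(
            f"batch size {batch_size} is not a capture size (next: {padded})"
        )
    # Every stage needs max_batch_size >= batch_size for batching to work.
    for idx, max_bs in enumerate(stage_max_batch_sizes):
        if max_bs < batch_size:
            problems.append(
                f"stage {idx}: max_batch_size {max_bs} < batch size {batch_size}"
            )
    return problems


print(check_batch_config(4, [4, 4]))  # []
print(check_batch_config(3, [1, 4]))
```

The second call reports two problems: 3 is not a listed capture size, and stage 0 is capped below the requested batch size.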
 ## Notes
 
 - The script uses the model paths embedded in `end2end.py`. Update them if your local cache path differs.
-- Use `--output-dir` (preferred) or `--output-wav` to change the output folder.
+- Use `--output-dir` to change the output folder.
 
 ## Example materials
 
+??? abstract "benchmark_prompts.txt"
+    ``````txt
+    --8<-- "examples/offline_inference/qwen3_tts/benchmark_prompts.txt"
+    ``````
 ??? abstract "end2end.py"
     ``````py
     --8<-- "examples/offline_inference/qwen3_tts/end2end.py"