From 877d61eee05647f32a98765d6ddcbf9d249ac198 Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Mon, 20 Apr 2026 13:27:00 +0800
Subject: [PATCH 1/9] docs(recipes): add GLM-Image recipe

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 154 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 154 insertions(+)
 create mode 100644 recipes/GLM/GLM-Image.md

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
new file mode 100644
index 00000000000..cd8963234b3
--- /dev/null
+++ b/recipes/GLM/GLM-Image.md
@@ -0,0 +1,154 @@
+# GLM-Image for text-to-image and image editing on 2× or 1× A800 80GB
+
+## Summary
+
+- Vendor: Z.ai 
+- Model: `zai-org/GLM-Image`
+- Task: Text-to-image (T2I) and image-to-image / editing (I2I)
+- Mode: Online serving with the OpenAI-compatible API
+- Maintainer: Community
+
+## When to use this recipe
+
+Use this recipe when you want a known-good starting point for serving
+`zai-org/GLM-Image` with vLLM-Omni on **two 80 GB NVIDIA A800** GPUs (Ampere-class,
+same default layout as the upstream **2×A100 80GB** example: Stage 0 AR on GPU 0,
+Stage 1 diffusion on GPU 1) and validate the deployment with the existing
+`examples/online_serving/glm_image` clients. For **one** A800 80 GB GPU, follow
+the **1× A800 80GB** section below (custom stage YAML required).
+
+## References
+
+- Upstream or canonical docs:
+  [`docs/user_guide/examples/online_serving/glm_image.md`](../../docs/user_guide/examples/online_serving/glm_image.md)
+- Related example under `examples/`:
+  [`examples/online_serving/glm_image/README.md`](../../examples/online_serving/glm_image/README.md)
+- Related issue or discussion:
+  [#2888](https://github.com/vllm-project/vllm-omni/pull/2888)
+
+## Hardware Support
+
+This recipe documents **dual-GPU** and **single-GPU** CUDA layouts on A800 80 GB
+for the same software stack. Add more platforms (for example ROCm / NPU) as
+community validation lands.
+
+## GPU
+
+### 2× A800 80GB
+
+#### Environment
+
+These versions were taken from a working **editable** install: activate `vllm-omni/.venv` (or your equivalent), then align `pip` / Git with the rows below when reproducing this recipe.
+
+- OS: Linux
+- Python: 3.12
+- Driver / runtime: NVIDIA CUDA stack with **two** A800 80 GB GPUs visible (set `CUDA_VISIBLE_DEVICES` on your host if needed)
+- vLLM: **0.19.0**
+- vLLM-Omni: **0.19.0rc2.dev138+g38d5f2d53** (editable install from this repo; Git **`38d5f2d5`**, `git describe` ≈ **`v0.19.0rc1-138-g38d5f2d5`**)
+- Transformers: **5.5.4** (same `.venv` as above; required so `glm_image` configs load for Stage 0)
+
+#### Command
+
+Start the server from the repository root:
+
+```bash
+vllm serve zai-org/GLM-Image --omni --port 8091
+```
+
+To use the bundled stage config explicitly (same default as above):
+
+```bash
+vllm serve zai-org/GLM-Image \
+  --omni \
+  --port 8091 \
+  --stage-configs-path vllm_omni/model_executor/stage_configs/glm_image.yaml
+```
+
+#### Verification
+
+Run one of the existing example clients after the server is ready:
+
+```bash
+python examples/online_serving/glm_image/openai_chat_client.py \
+  --prompt "A cute cat sitting on a window sill" \
+  --output glm_image_output.png \
+  --server http://localhost:8091
+```
+After the command finishes, check for the output files:
+
+```bash
+ls glm_image_output.png
+```
+
+#### Notes
+
+- Memory usage: Roughly **~18 GiB + KV** on Stage 0 (AR) and **~20 GiB** on Stage 1 (DiT+VAE) per the user guide; two 80 GB cards match the default split.
+- Key flags: `--omni` is required; `--stage-configs-path` is optional unless you use a custom YAML (for example single-GPU). 
+- Keep **Transformers ≥ 5.5.1** (this recipe used **5.5.4**) so `glm_image` configs resolve; otherwise Stage 0 can fail at `ModelConfig` validation.
+- Known limitations: This starter recipe follows the dual-GPU online path documented under `examples/online_serving/glm_image`. The first request may be slower due to warmup.
+- Generation time: ~62s end-to-end on 2× A800 80GB (50 inference steps, 1024×1024, chat completions client).
+
+### 1× A800 80GB
+
+Default `glm_image.yaml` pins Stage 0 to GPU **0** and Stage 1 to GPU **1**.
+On a single card, both stages must use the **same** device id.
+
+#### Environment
+
+Same software stack as **2× A800 80GB** (Python **3.12**, vLLM **0.19.0**,
+vLLM-Omni **0.19.0rc2.dev138+g38d5f2d53**, Transformers **5.5.4**), but only **one**
+A800 80 GB GPU visible (often `CUDA_VISIBLE_DEVICES=0`).
+
+#### Command
+
+1. Copy the stock stage file and point **Stage 1** at the same GPU as Stage 0:
+
+```bash
+cp vllm_omni/model_executor/stage_configs/glm_image.yaml \
+  vllm_omni/model_executor/stage_configs/glm_image_single_gpu.yaml
+```
+
+In `glm_image_single_gpu.yaml`, Stage 0 already has `runtime.devices: "0"`.
+Under **Stage 1** (`stage_id: 1`), change only the device line from `"1"` to `"0"`:
+
+```yaml
+  - stage_id: 1
+    stage_type: diffusion
+    runtime:
+      process: true
+      devices: "0"   # was "1" in the default dual-GPU file
+      requires_multimodal_data: true
+```
+
+2. Start the server with your file:
+
+```bash
+vllm serve zai-org/GLM-Image \
+  --omni \
+  --port 8091 \
+  --stage-configs-path vllm_omni/model_executor/stage_configs/glm_image_single_gpu.yaml
+```
+
+If you hit **OOM**, lower Stage 0 `engine_args.gpu_memory_utilization` in the same
+YAML (for example from `0.6` to `0.5` or `0.45`) and retry; see the
+[GLM-Image user guide FAQ](../../docs/user_guide/examples/online_serving/glm_image.md#faq).
+
+#### Verification
+
+Same commands as **2× A800 80GB** (Python client and `curl` smoke test); only
+the server startup line changes because of `--stage-configs-path`.
+
+#### Notes
+
+- Memory usage: AR and diffusion **time-share** one 80 GB GPU; peak usage is
+  higher than the dual-GPU split. The user-guide ballpark (~48 GiB + KV for AR,
+  ~22 GiB for DiT+VAE) ~72G for inference in total
+- Key flags: **`--stage-configs-path`** is **required** for single-GPU; the
+  default bundle still targets two GPUs.
+- Keep Transformers **≥ 5.5.1** (here
+  **5.5.4**) for `glm_image` support.
+- Known limitations: Stages no longer run on separate devices in parallel;
+  throughput differs from the 2× recipe. 
+- Generation time: ~62s end-to-end on 2× A800 80GB (50 inference steps, 1024×1024, chat completions client).
+
+

From 6c2e34e99f1f1e637b152dad27f8d26e993071b4 Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Mon, 20 Apr 2026 13:38:12 +0800
Subject: [PATCH 2/9] docs(recipes): link GLM-Image recipe from recipes README

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/README.md | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/recipes/README.md b/recipes/README.md
index 01ecc41f185..539db67df2f 100644
--- a/recipes/README.md
+++ b/recipes/README.md
@@ -30,6 +30,9 @@ recipes/
 - [`Wan-AI/Wan2.2-I2V.md`](./Wan-AI/Wan2.2-I2V.md): image-to-video serving
   recipe for Wan2.2 14B on `8x Ascend NPU (A2/A3)`
 
+- [`GLM/GLM-Image.md`](./GLM/GLM-Image.md): online serving recipe for
+  image generation on `1x A800 80GB` and `2x A800 80GB`
+
 Within a single recipe file, include different hardware support sections such
 as `GPU`, `ROCm`, and `NPU`, and add concrete tested configurations like
 `1x A100 80GB` or `2x L40S` inside those sections when applicable.

From 91280ddb0df62df5b4f2ee9b819286ee6dd38a1c Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Mon, 20 Apr 2026 13:51:00 +0800
Subject: [PATCH 3/9] docs(recipes): update GLM-Image model id and tidy
 whitespace in recipe

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 12 +++++-------
 1 file changed, 5 insertions(+), 7 deletions(-)

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index cd8963234b3..175529312ad 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -2,8 +2,8 @@
 
 ## Summary
 
-- Vendor: Z.ai 
-- Model: `zai-org/GLM-Image`
+- Vendor: Z.ai
+- Model: `GLM/GLM-Image`
 - Task: Text-to-image (T2I) and image-to-image / editing (I2I)
 - Mode: Online serving with the OpenAI-compatible API
 - Maintainer: Community
@@ -11,7 +11,7 @@
 ## When to use this recipe
 
 Use this recipe when you want a known-good starting point for serving
-`zai-org/GLM-Image` with vLLM-Omni on **two 80 GB NVIDIA A800** GPUs (Ampere-class,
+`GLM/GLM-Image` with vLLM-Omni on **two 80 GB NVIDIA A800** GPUs (Ampere-class,
 same default layout as the upstream **2×A100 80GB** example: Stage 0 AR on GPU 0,
 Stage 1 diffusion on GPU 1) and validate the deployment with the existing
 `examples/online_serving/glm_image` clients. For **one** A800 80 GB GPU, follow
@@ -83,7 +83,7 @@ ls glm_image_output.png
 #### Notes
 
 - Memory usage: Roughly **~18 GiB + KV** on Stage 0 (AR) and **~20 GiB** on Stage 1 (DiT+VAE) per the user guide; two 80 GB cards match the default split.
-- Key flags: `--omni` is required; `--stage-configs-path` is optional unless you use a custom YAML (for example single-GPU). 
+- Key flags: `--omni` is required; `--stage-configs-path` is optional unless you use a custom YAML (for example single-GPU).
 - Keep **Transformers ≥ 5.5.1** (this recipe used **5.5.4**) so `glm_image` configs resolve; otherwise Stage 0 can fail at `ModelConfig` validation.
 - Known limitations: This starter recipe follows the dual-GPU online path documented under `examples/online_serving/glm_image`. The first request may be slower due to warmup.
 - Generation time: ~62s end-to-end on 2× A800 80GB (50 inference steps, 1024×1024, chat completions client).
@@ -148,7 +148,5 @@ the server startup line changes because of `--stage-configs-path`.
 - Keep Transformers **≥ 5.5.1** (here
   **5.5.4**) for `glm_image` support.
 - Known limitations: Stages no longer run on separate devices in parallel;
-  throughput differs from the 2× recipe. 
+  throughput differs from the 2× recipe.
 - Generation time: ~62s end-to-end on 2× A800 80GB (50 inference steps, 1024×1024, chat completions client).
-
-

From 47bfea4038f57751b544f8018ec9e536eab9289d Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Mon, 20 Apr 2026 20:36:18 +0800
Subject: [PATCH 4/9] docs: add GLM-Image E2E metrics and stage timing

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 22 +++++++++++++++++++---
 1 file changed, 19 insertions(+), 3 deletions(-)
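
Reviewer note (not part of the commit): the derived fields in the metrics
table added by this patch can be re-checked from the raw values. The sketch
below is only an arithmetic check. The variable names are illustrative, and
the patch does not say what the time outside the two stages covers.

```python
# Raw values copied from the metrics table in this patch.
e2e_wall_time_ms = 61_148.679
e2e_total_tokens = 1_300
stage_0_wall_time_ms = 24_708.760  # Stage 0 (AR)
stage_1_wall_time_ms = 33_787.442  # Stage 1 (diffusion)

# Throughput in tokens per second: tokens divided by wall time in seconds.
tokens_per_s = e2e_total_tokens / (e2e_wall_time_ms / 1000.0)

# Wall time not attributed to either stage (its meaning is not stated in the patch).
unattributed_ms = e2e_wall_time_ms - stage_0_wall_time_ms - stage_1_wall_time_ms

print(f"tokens/s: {tokens_per_s:.3f}")
print(f"unattributed: {unattributed_ms:.3f} ms")
```

The throughput agrees with the table's `e2e_avg_tokens_per_s` of 21.260, and
roughly 2.65 s of the wall time falls outside the two stage timers.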

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index 175529312ad..bb4900c24ff 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -80,13 +80,29 @@ After the command finishes, check for the output files:
 ls glm_image_output.png
 ```
 
+#### Sample end-to-end metrics
+
+One representative **offline** GLM-Image E2E run on this recipe’s **2× A800 80GB**.
+Overall summary from the run’s metrics. Rough wall-time split: **Stage 0 (AR)** ~**25 s**,
+**Stage 1 (diffusion)** ~**34 s** (see `e2e_stage_*_wall_time_ms` below).
+
+| Field | Value |
+| --- | ---: |
+| e2e_requests | 1 |
+| e2e_wall_time_ms | 61,148.679 |
+| e2e_total_tokens | 1,300 |
+| e2e_avg_time_per_request_ms | 61,148.679 |
+| e2e_avg_tokens_per_s | 21.260 |
+| e2e_stage_0_wall_time_ms | 24,708.760 |
+| e2e_stage_1_wall_time_ms | 33,787.442 |
+
 #### Notes
 
 - Memory usage: Roughly **~18 GiB + KV** on Stage 0 (AR) and **~20 GiB** on Stage 1 (DiT+VAE) per the user guide; two 80 GB cards match the default split.
 - Key flags: `--omni` is required; `--stage-configs-path` is optional unless you use a custom YAML (for example single-GPU).
 - Keep **Transformers ≥ 5.5.1** (this recipe used **5.5.4**) so `glm_image` configs resolve; otherwise Stage 0 can fail at `ModelConfig` validation.
 - Known limitations: This starter recipe follows the dual-GPU online path documented under `examples/online_serving/glm_image`. The first request may be slower due to warmup.
-- Generation time: ~62s end-to-end on 2× A800 80GB (50 inference steps, 1024×1024, chat completions client).
+- Generation time: about **61 s** wall time end-to-end for the sample above (50 inference steps, 1024×1024).
 
 ### 1× A800 80GB
 
@@ -135,7 +151,7 @@ YAML (for example from `0.6` to `0.5` or `0.45`) and retry; see the
 
 #### Verification
 
-Same commands as **2× A800 80GB** (Python client and `curl` smoke test); only
+Same commands as **2× A800 80GB** ; only
 the server startup line changes because of `--stage-configs-path`.
 
 #### Notes
@@ -149,4 +165,4 @@ the server startup line changes because of `--stage-configs-path`.
   **5.5.4**) for `glm_image` support.
 - Known limitations: Stages no longer run on separate devices in parallel;
   throughput differs from the 2× recipe.
-- Generation time: ~62s end-to-end on 2× A800 80GB (50 inference steps, 1024×1024, chat completions client).
+- Generation time: ~62s end-to-end on 1× A800 80GB (50 inference steps, 1024×1024, chat completions client).

From 40fc45afc6fbb9ead2c60307f9eb3ee22285124e Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Tue, 21 Apr 2026 16:52:47 +0800
Subject: [PATCH 5/9] docs(recipes): point GLM-Image stage config at vllm_omni/deploy

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index bb4900c24ff..50bb5b8c22f 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -61,7 +61,7 @@ To use the bundled stage config explicitly (same default as above):
 vllm serve zai-org/GLM-Image \
   --omni \
   --port 8091 \
-  --stage-configs-path vllm_omni/model_executor/stage_configs/glm_image.yaml
+  --stage-configs-path vllm_omni/deploy/glm_image.yaml
 ```
 
 #### Verification
@@ -120,8 +120,8 @@ A800 80 GB GPU visible (often `CUDA_VISIBLE_DEVICES=0`).
 1. Copy the stock stage file and point **Stage 1** at the same GPU as Stage 0:
 
 ```bash
-cp vllm_omni/model_executor/stage_configs/glm_image.yaml \
-  vllm_omni/model_executor/stage_configs/glm_image_single_gpu.yaml
+cp vllm_omni/deploy/glm_image.yaml \
+  vllm_omni/deploy/glm_image_single_gpu.yaml
 ```
 
 In `glm_image_single_gpu.yaml`, Stage 0 already has `runtime.devices: "0"`.

From aa295bed3cc77cd63b4742a34ef722a0a342cd5c Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Wed, 22 Apr 2026 09:36:22 +0800
Subject: [PATCH 6/9] fix: update remaining stage config path to vllm_omni/deploy

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index 50bb5b8c22f..32dbb8875a7 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -142,7 +142,7 @@ Under **Stage 1** (`stage_id: 1`), change only the device line from `"1"` to `"0
 vllm serve zai-org/GLM-Image \
   --omni \
   --port 8091 \
-  --stage-configs-path vllm_omni/model_executor/stage_configs/glm_image_single_gpu.yaml
+  --stage-configs-path vllm_omni/deploy/glm_image_single_gpu.yaml
 ```
 
 If you hit **OOM**, lower Stage 0 `engine_args.gpu_memory_utilization` in the same

From 47409d1a99569a6845ecb5dc055f3e02b504eac1 Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Wed, 22 Apr 2026 15:32:29 +0800
Subject: [PATCH 7/9] fix: use --deploy-config in server start commands

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index 32dbb8875a7..962152b144a 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -61,7 +61,7 @@ To use the bundled stage config explicitly (same default as above):
 vllm serve zai-org/GLM-Image \
   --omni \
   --port 8091 \
-  --stage-configs-path vllm_omni/deploy/glm_image.yaml
+  --deploy-config vllm_omni/deploy/glm_image.yaml
 ```
 
 #### Verification
@@ -142,7 +142,7 @@ Under **Stage 1** (`stage_id: 1`), change only the device line from `"1"` to `"0
 vllm serve zai-org/GLM-Image \
   --omni \
   --port 8091 \
-  --stage-configs-path vllm_omni/deploy/glm_image_single_gpu.yaml
+  --deploy-config vllm_omni/deploy/glm_image_single_gpu.yaml
 ```
 
 If you hit **OOM**, lower Stage 0 `engine_args.gpu_memory_utilization` in the same

From 21abb4bdcfc5fa1b43a49f748830492b749bacdd Mon Sep 17 00:00:00 2001
From: nainiu258 <cperfect02@163.com>
Date: Tue, 28 Apr 2026 18:38:37 +0800
Subject: [PATCH 8/9] docs: drop 1× A800 section and verify GLM-Image with curl

Signed-off-by: nainiu258 <cperfect02@163.com>
---
 recipes/GLM/GLM-Image.md | 96 +++++++++-------------------------------
 1 file changed, 20 insertions(+), 76 deletions(-)
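
Reviewer note (not part of the commit): the `jq | cut | base64 -d` tail of the
curl verification command in this patch can also be done in Python. This is a
minimal sketch, assuming the response has the shape the `jq` filter implies
(`choices[0].message.content[0].image_url.url` holding a base64 data URL).
The helper names `extract_image_bytes` and `save_image` are hypothetical and
do not exist in vLLM-Omni.

```python
import base64
from pathlib import Path


def extract_image_bytes(response: dict) -> bytes:
    """Return the decoded image bytes from a chat-completions style response."""
    # Same field path as the jq filter in the curl example (assumed response shape).
    url = response["choices"][0]["message"]["content"][0]["image_url"]["url"]
    # A data URL looks like "data:image/png;base64,<payload>";
    # keep everything after the first comma and decode it.
    _, _, payload = url.partition(",")
    return base64.b64decode(payload)


def save_image(response: dict, path: str) -> int:
    """Write the decoded image to `path` and return the number of bytes written."""
    return Path(path).write_bytes(extract_image_bytes(response))
```

To use it, parse the server's JSON reply into a `dict` (for example with
`json.loads`) and call `save_image(reply, "land.png")`.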

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index 962152b144a..525f395619a 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -1,10 +1,10 @@
-# GLM-Image for text-to-image and image editing on 2× or 1× A800 80GB
+# GLM-Image for text-to-image and image editing on 2× A800 80GB
 
 ## Summary
 
 - Vendor: Z.ai
 - Model: `GLM/GLM-Image`
-- Task: Text-to-image (T2I) and image-to-image / editing (I2I)
+- Task: Text-to-image (T2I) and image-to-image
 - Mode: Online serving with the OpenAI-compatible API
 - Maintainer: Community
 
@@ -14,21 +14,18 @@ Use this recipe when you want a known-good starting point for serving
 `GLM/GLM-Image` with vLLM-Omni on **two 80 GB NVIDIA A800** GPUs (Ampere-class,
 same default layout as the upstream **2×A100 80GB** example: Stage 0 AR on GPU 0,
 Stage 1 diffusion on GPU 1) and validate the deployment with the existing
-`examples/online_serving/glm_image` clients. For **one** A800 80 GB GPU, follow
-the **1× A800 80GB** section below (custom stage YAML required).
+`examples/online_serving/glm_image` clients.
 
 ## References
 
 - Upstream or canonical docs:
   [`docs/user_guide/examples/online_serving/glm_image.md`](../../docs/user_guide/examples/online_serving/glm_image.md)
-- Related example under `examples/`:
-  [`examples/online_serving/glm_image/README.md`](../../examples/online_serving/glm_image/README.md)
 - Related issue or discussion:
   [#2888](https://github.com/vllm-project/vllm-omni/pull/2888)
 
 ## Hardware Support
 
-This recipe documents **dual-GPU** and **single-GPU** CUDA layouts on A800 80 GB
+This recipe documents **dual-GPU** CUDA layouts on A800 80 GB
 for the same software stack. Add more platforms (for example ROCm / NPU) as
 community validation lands.
 
@@ -69,15 +66,25 @@ vllm serve zai-org/GLM-Image \
 Run one of the existing example clients after the server is ready:
 
 ```bash
-python examples/online_serving/glm_image/openai_chat_client.py \
-  --prompt "A cute cat sitting on a window sill" \
-  --output glm_image_output.png \
-  --server http://localhost:8091
+curl -s http://localhost:8091/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      {"role": "user", "content": "A beautiful landscape painting"}
+    ],
+    "extra_body": {
+      "height": 1920,
+      "width": 1920,
+      "num_inference_steps": 50,
+      "true_cfg_scale": 1.5,
+      "seed": 42
+    }
+  }' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > land.png
 ```
 After the command finishes, check for the output files:
 
 ```bash
-ls glm_image_output.png
+ls land.png
 ```
 
 #### Sample end-to-end metrics
@@ -98,71 +105,8 @@ Overall summary from the run’s metrics. Rough wall-time split: **Stage 0 (AR)*
 
 #### Notes
 
-- Memory usage: Roughly **~18 GiB + KV** on Stage 0 (AR) and **~20 GiB** on Stage 1 (DiT+VAE) per the user guide; two 80 GB cards match the default split.
+- Memory usage: Roughly **~38 GiB + KV** on Stage 0 (AR) and **~20 GiB** on Stage 1 (DiT+VAE) per the user guide; two 80 GB cards match the default split.
 - Key flags: `--omni` is required; `--stage-configs-path` is optional unless you use a custom YAML (for example single-GPU).
 - Keep **Transformers ≥ 5.5.1** (this recipe used **5.5.4**) so `glm_image` configs resolve; otherwise Stage 0 can fail at `ModelConfig` validation.
 - Known limitations: This starter recipe follows the dual-GPU online path documented under `examples/online_serving/glm_image`. The first request may be slower due to warmup.
 - Generation time: about **61 s** wall time end-to-end for the sample above (50 inference steps, 1024×1024).
-
-### 1× A800 80GB
-
-Default `glm_image.yaml` pins Stage 0 to GPU **0** and Stage 1 to GPU **1**.
-On a single card, both stages must use the **same** device id.
-
-#### Environment
-
-Same software stack as **2× A800 80GB** (Python **3.12**, vLLM **0.19.0**,
-vLLM-Omni **0.19.0rc2.dev138+g38d5f2d53**, Transformers **5.5.4**), but only **one**
-A800 80 GB GPU visible (often `CUDA_VISIBLE_DEVICES=0`).
-
-#### Command
-
-1. Copy the stock stage file and point **Stage 1** at the same GPU as Stage 0:
-
-```bash
-cp vllm_omni/deploy/glm_image.yaml \
-  vllm_omni/deploy/glm_image_single_gpu.yaml
-```
-
-In `glm_image_single_gpu.yaml`, Stage 0 already has `runtime.devices: "0"`.
-Under **Stage 1** (`stage_id: 1`), change only the device line from `"1"` to `"0"`:
-
-```yaml
-  - stage_id: 1
-    stage_type: diffusion
-    runtime:
-      process: true
-      devices: "0"   # was "1" in the default dual-GPU file
-      requires_multimodal_data: true
-```
-
-2. Start the server with your file:
-
-```bash
-vllm serve zai-org/GLM-Image \
-  --omni \
-  --port 8091 \
-  --deploy-config vllm_omni/deploy/glm_image_single_gpu.yaml
-```
-
-If you hit **OOM**, lower Stage 0 `engine_args.gpu_memory_utilization` in the same
-YAML (for example from `0.6` to `0.5` or `0.45`) and retry; see the
-[GLM-Image user guide FAQ](../../docs/user_guide/examples/online_serving/glm_image.md#faq).
-
-#### Verification
-
-Same commands as **2× A800 80GB** ; only
-the server startup line changes because of `--stage-configs-path`.
-
-#### Notes
-
-- Memory usage: AR and diffusion **time-share** one 80 GB GPU; peak usage is
-  higher than the dual-GPU split. The user-guide ballpark (~48 GiB + KV for AR,
-  ~22 GiB for DiT+VAE) ~72G for inference in total
-- Key flags: **`--stage-configs-path`** is **required** for single-GPU; the
-  default bundle still targets two GPUs.
-- Keep Transformers **≥ 5.5.1** (here
-  **5.5.4**) for `glm_image` support.
-- Known limitations: Stages no longer run on separate devices in parallel;
-  throughput differs from the 2× recipe.
-- Generation time: ~62s end-to-end on 1× A800 80GB (50 inference steps, 1024×1024, chat completions client).

From 2e33b11faf9013a5347da029cc8fada260cb4bd6 Mon Sep 17 00:00:00 2001
From: nainiu258 <101917677+nainiu258@users.noreply.github.com>
Date: Thu, 28 May 2026 19:10:27 +0800
Subject: [PATCH 9/9] Apply suggestion from @hsliuustc0106

Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Signed-off-by: nainiu258 <101917677+nainiu258@users.noreply.github.com>
---
 recipes/GLM/GLM-Image.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/recipes/GLM/GLM-Image.md b/recipes/GLM/GLM-Image.md
index 525f395619a..bd11558c92d 100644
--- a/recipes/GLM/GLM-Image.md
+++ b/recipes/GLM/GLM-Image.md
@@ -1,4 +1,4 @@
-# GLM-Image for text-to-image and image editing on 2× A800 80GB
+# GLM-Image for text-to-image and image editing
 
 ## Summary