Skip to content
Merged
112 changes: 112 additions & 0 deletions recipes/GLM/GLM-Image.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# GLM-Image for text-to-image and image editing

## Summary

- Vendor: Z.ai
- Model: `GLM/GLM-Image`
- Task: Text-to-image (T2I) and image-to-image
- Mode: Online serving with the OpenAI-compatible API
- Maintainer: Community

## When to use this recipe

Use this recipe when you want a known-good starting point for serving
`GLM/GLM-Image` with vLLM-Omni on **two 80 GB NVIDIA A800** GPUs (Ampere-class,
same default layout as the upstream **2×A100 80GB** example: Stage 0 AR on GPU 0,
Stage 1 diffusion on GPU 1) and validate the deployment with the existing
`examples/online_serving/glm_image` clients.

## References

- Upstream or canonical docs:
[`docs/user_guide/examples/online_serving/glm_image.md`](../../docs/user_guide/examples/online_serving/glm_image.md)
- Related issue or discussion:
[#2888](https://github.com/vllm-project/vllm-omni/pull/2888)

## Hardware Support

This recipe documents **dual-GPU** CUDA layouts on A800 80 GB
for the same software stack. Add more platforms (for example ROCm / NPU) as
community validation lands.

## GPU

### 2× A800 80GB

#### Environment

These versions were taken from a working **editable** install: activate `vllm-omni/.venv` (or your equivalent), then align `pip` / Git with the rows below when reproducing this recipe.

- OS: Linux
- Python: 3.12
- Driver / runtime: NVIDIA CUDA stack with **two** A800 80 GB GPUs visible (set `CUDA_VISIBLE_DEVICES` on your host if needed)
- vLLM: **0.19.0**
- vLLM-Omni: **0.19.0rc2.dev138+g38d5f2d53** (editable install from this repo; Git **`38d5f2d5`**, `git describe` ≈ **`v0.19.0rc1-138-g38d5f2d5`**)
- Transformers: **5.5.4** (same `.venv` as above; required so `glm_image` configs load for Stage 0)

#### Command

Start the server from the repository root:

```bash
vllm serve zai-org/GLM-Image --omni --port 8091
```

To use the bundled stage config explicitly (same default as above):

```bash
vllm serve zai-org/GLM-Image \
--omni \
--port 8091 \
--deploy-config vllm_omni/deploy/glm_image.yaml
```

#### Verification

Run one of the existing example clients after the server is ready:

```bash
curl -s http://172.18.69.133:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "A beautiful landscape painting"}
],
"extra_body": {
"height": 1920,
"width": 1920,
"num_inference_steps": 50,
"true_cfg_scale": 1.5,
"seed": 42
}
}' | jq -r '.choices[0].message.content[0].image_url.url' | cut -d',' -f2- | base64 -d > land.png
```
After the command finishes, check for the output files:

```bash
ls output.png
```

#### Sample end-to-end metrics

One representative **offline** GLM-Image E2E run on this recipe’s **2× A800 80GB**.
Overall summary from the run’s metrics. Rough wall-time split: **Stage 0 (AR)** ~**25 s**,
**Stage 1 (diffusion)** ~**34 s** (see `e2e_stage_*_wall_time_ms` below).

| Field | Value |
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is some problem with this metrics, cc @bjf-frz could you fix it asap?

| --- | ---: |
| e2e_requests | 1 |
| e2e_wall_time_ms | 61,148.679 |
| e2e_total_tokens | 1,300 |
| e2e_avg_time_per_request_ms | 61,148.679 |
| e2e_avg_tokens_per_s | 21.260 |
| e2e_stage_0_wall_time_ms | 24,708.760 |
| e2e_stage_1_wall_time_ms | 33,787.442 |

#### Notes

- Memory usage: Roughly **~38 GiB + KV** on Stage 0 (AR) and **~20 GiB** on Stage 1 (DiT+VAE) per the user guide; two 80 GB cards match the default split.
- Key flags: `--omni` is required; `--stage-configs-path` is optional unless you use a custom YAML (for example single-GPU).
- Keep **Transformers ≥ 5.5.1** (this recipe used **5.5.4**) so `glm_image` configs resolve; otherwise Stage 0 can fail at `ModelConfig` validation.
- Known limitations: This starter recipe follows the dual-GPU online path documented under `examples/online_serving/glm_image`. The first request may be slower due to warmup.
- Generation time: about **61 s** wall time end-to-end for the sample above (50 inference steps, 1024×1024).
3 changes: 3 additions & 0 deletions recipes/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -30,6 +30,9 @@ recipes/
- [`Wan-AI/Wan2.2-I2V.md`](./Wan-AI/Wan2.2-I2V.md): image-to-video serving
recipe for Wan2.2 14B on `8x Ascend NPU (A2/A3)`

- [`GLM/GLM-Image.md`](./GLM/GLM-Image.md):online serving recipe for
image generation on `1x A800 80GB` and `2x A800 80GB`

Within a single recipe file, include different hardware support sections such
as `GPU`, `ROCm`, and `NPU`, and add concrete tested configurations like
`1x A100 80GB` or `2x L40S` inside those sections when applicable.
Expand Down
Loading