[codex] Optimize Z-Image packed QKV by BBuf · Pull Request #24117 · sgl-project/sglang

BBuf · 2026-04-30T03:13:56Z

Summary

This PR enables the existing packed QKV projection path for Z-Image when loading the standard BF16 checkpoint, not only the Nunchaku quantized path.

Map checkpoint attention.to_q/to_k/to_v.weight tensors into attention.to_qkv.weight during load.
Always instantiate ZImageAttention.to_qkv so the non-quantized path also uses one merged QKV projection.
No benchmark images or media files are committed; the generated outputs are attached below as GitHub user-attachments.
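
To illustrate the load-time mapping described above, here is a minimal pure-Python sketch. It is not the loader code from this PR. The helper names (`STACKED_PARAMS`, `remap_name`, `pack_state_dict`, `linear`), the q, k, v concatenation order, and the toy list-of-rows weights are all assumptions made for illustration. It only shows why three separate projections can be replaced by one merged projection whose output is split afterwards.

```python
# Hypothetical sketch, not the PR's implementation.
# Each entry: (packed parameter name, checkpoint parameter name, shard id).
STACKED_PARAMS = [
    ("attention.to_qkv", "attention.to_q", "q"),
    ("attention.to_qkv", "attention.to_k", "k"),
    ("attention.to_qkv", "attention.to_v", "v"),
]
SHARD_ORDER = ("q", "k", "v")  # assumed row order inside the packed weight


def remap_name(ckpt_name):
    """Return (packed_name, shard_id), or (ckpt_name, None) for other tensors."""
    for packed, separate, shard_id in STACKED_PARAMS:
        # Match on a dotted boundary so "to_q" does not match "to_qkv".
        if (separate + ".") in ckpt_name:
            return ckpt_name.replace(separate + ".", packed + ".", 1), shard_id
    return ckpt_name, None


def pack_state_dict(ckpt):
    """Merge separate Q/K/V weights (lists of rows) into one packed weight."""
    shards, packed = {}, {}
    for name, weight in ckpt.items():
        new_name, shard_id = remap_name(name)
        if shard_id is None:
            packed[name] = weight  # unrelated tensors pass through unchanged
        else:
            shards.setdefault(new_name, {})[shard_id] = weight
    for new_name, parts in shards.items():
        # Concatenate along the output (row) dimension in q, k, v order.
        packed[new_name] = [row for sid in SHARD_ORDER for row in parts[sid]]
    return packed


def linear(weight, x):
    """Compute y = W @ x for a weight stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]


# Toy checkpoint with 2x2 weights; the tensor names are illustrative only.
ckpt = {
    "layers.0.attention.to_q.weight": [[1, 0], [0, 1]],
    "layers.0.attention.to_k.weight": [[2, 0], [0, 2]],
    "layers.0.attention.to_v.weight": [[0, 3], [3, 0]],
    "layers.0.other.weight": [[1, 1], [1, 1]],
}
packed = pack_state_dict(ckpt)
x = [5, 7]
qkv = linear(packed["layers.0.attention.to_qkv.weight"], x)  # one projection
q, k, v = qkv[0:2], qkv[2:4], qkv[4:6]  # split the merged output
```

Splitting the merged output gives the same `q`, `k`, and `v` as applying the three original weights one at a time, so the merged path replaces three matrix multiplies with one.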

Benchmark

Environment:

Host/container: radixark03, sglang-diffusion-bbuf
GPU: H200, CUDA_VISIBLE_DEVICES=7
Baseline: origin/main at 2d2be5d7b247626c3259fe33c145d0281db6ac4f
Tuned: this PR at 75d923c25
Native backend check: the benchmark logs contain none of the diffusers fallback lines "Falling back to diffusers backend", "Using diffusers backend", or "Loaded diffusers pipeline".

Benchmark-only note: on this H200 host, current origin/main hits an ABI mismatch with the installed FA3 build (flash_attn_varlen_func() got an unexpected keyword argument 'out'). I applied the same local FA3 compatibility shim to both benchmark worktrees only; that shim is intentionally not included in this PR.
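
The shim itself is not part of this PR and is not reproduced here. As a rough illustration of the general technique, assuming the mismatch is only that a caller passes a keyword the installed function does not accept, a generic wrapper could drop that keyword before the call. `drop_unsupported_kwargs` and `old_varlen_attention` are hypothetical names; the stand-in function below is not the real FA3 API.

```python
import functools
import inspect


def drop_unsupported_kwargs(func, names=("out",)):
    """Wrap func so the listed keyword arguments are dropped if func rejects them."""
    params = inspect.signature(func).parameters
    accepts_var_kw = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    unsupported = [n for n in names if n not in params and not accepts_var_kw]
    if not unsupported:
        return func  # nothing to adapt

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        for name in unsupported:
            kwargs.pop(name, None)  # silently discard the unsupported keyword
        return func(*args, **kwargs)

    return wrapper


# Hypothetical stand-in for an older kernel entry point without an `out` argument.
def old_varlen_attention(q, k, v):
    return [qi * ki + vi for qi, ki, vi in zip(q, k, v)]


patched = drop_unsupported_kwargs(old_varlen_attention)
result = patched([1, 2], [3, 4], [5, 6], out=None)  # `out` is dropped
```

A wrapper like this means any preallocated `out` buffer is never filled, so the caller has to use the return value. `inspect.signature` can also raise `ValueError` for some compiled functions, which a real shim would need to handle.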

Skill preset command used for both runs, with only the worktree and --label differing:

CUDA_VISIBLE_DEVICES=7 FLASHINFER_DISABLE_VERSION_CHECK=1 PYTHONPATH=python \
python3 /root/.codex/skills/sglang-diffusion-benchmark-profile/scripts/bench_diffusion_denoise.py \
  --model zimage \
  --label <zimage_qkv_pr_base_fa3shim_03|zimage_qkv_pr_tuned_fa3shim_03> \
  --output-dir outputs/diffusion_benchmarks

Expanded preset command:

sglang generate \
  --model-path=Tongyi-MAI/Z-Image-Turbo \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=9 \
  --guidance-scale=4.0 \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path <json>

Run	Denoise latency	E2E latency	Peak memory	Delta
Baseline	1.2818s	1.5609s	20.8GB	-
Tuned	0.8288s	1.0739s	20.6GB	-35.3% denoise / -31.2% E2E

compare_perf.py reports:

E2E: 1560.94 ms -> 1073.93 ms, -487.01 ms (-31.2%)
DenoisingStage: 1281.80 ms -> 828.80 ms, -453.00 ms (-35.3%)
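
The deltas above follow directly from the two latency measurements. The short sketch below reproduces the arithmetic from the numbers reported here; it is not the actual `compare_perf.py` script, and the `delta` helper is a hypothetical name.

```python
def delta(baseline_ms, tuned_ms):
    """Return (absolute change in ms, relative change in percent)."""
    diff = tuned_ms - baseline_ms
    return diff, 100.0 * diff / baseline_ms


# Latencies in milliseconds, copied from the report above.
measurements = [
    ("E2E", 1560.94, 1073.93),
    ("DenoisingStage", 1281.80, 828.80),
]

for stage, base, tuned in measurements:
    diff, pct = delta(base, tuned)
    print(f"{stage}: {base:.2f} ms -> {tuned:.2f} ms, {diff:+.2f} ms ({pct:+.1f}%)")
```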

Raw metric files on the H200 run:

Baseline: /tmp/sglang_zimage_pr_base/outputs/diffusion_benchmarks/zimage_zimage_qkv_pr_base_fa3shim_03.json
Tuned: /tmp/sglang_zimage_pr_tuned/outputs/diffusion_benchmarks/zimage_zimage_qkv_pr_tuned_fa3shim_03.json

Image Comparison

Baseline, generated by the same preset command:

Tuned, generated by the same preset command:

Fallback links:

Validation

git diff --check
python3 -m py_compile python/sglang/multimodal_gen/configs/models/dits/zimage.py python/sglang/multimodal_gen/runtime/models/dits/zimage.py
bench_diffusion_denoise.py --model zimage baseline + tuned on H200
native backend log gate via grep for diffusers fallback strings
compare_perf.py on the two benchmark JSON files
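
The native backend log gate was run with grep. A roughly equivalent check in Python could look like the sketch below. The three marker strings come from the benchmark notes above; the helper name `find_fallback_lines` and the sample log text are made up for illustration.

```python
# Marker strings that would indicate the diffusers fallback path was used.
FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_fallback_lines(log_text):
    """Return every log line that contains one of the fallback markers."""
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in FALLBACK_MARKERS)
    ]


# Illustrative log content only; not taken from the real benchmark logs.
sample_log = "INFO starting denoise stage\nINFO finished 9 steps\n"
offending = find_fallback_lines(sample_log)
gate_passed = not offending  # the gate passes when no marker line is found
```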

gemini-code-assist · 2026-04-30T03:13:59Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!


BBuf · 2026-04-30T03:30:14Z

/tag-and-rerun-ci

BBuf · 2026-05-02T02:17:12Z

/tag-and-rerun-ci

BBuf · 2026-05-03T08:57:53Z

/tag-and-rerun-ci

BBuf · 2026-05-04T01:12:10Z

/tag-and-rerun-ci

* main: (894 commits)
  [Bug Fix] Fix RunAI streamer: corrupted weights, missing quant init, and broken URIs for multimodal models (sgl-project#22715)
  [Kernel] Deprecate DeepGemm in sgl kernel and apply custom wheel sgl-deep-gemm (sgl-project#24268)
  propagate pytest exit code from test __main__ entries (sgl-project#24487)
  [R3] Avoid implicit CUDA sync in routed experts DP slicing (sgl-project#24550)
  Add ChatCompletionRequest-style support to /v1/tokenize (sgl-project#23981)
  Support Triton MLA FP8 KV cache (sgl-project#20479)
  [diffusion] chore: align LTX-2 with official (sgl-project#24313)
  Expand support matrix for pypi wheel release (sgl-project#24565)
  [codex] Optimize Z-Image packed QKV (sgl-project#24117)
  [Misc] Fix breaking weight checker test (sgl-project#24553)
  [LoRA] Fix qkv_proj LoRA buffer sizing when tp_size > num_key_value_heads (sgl-project#24420)
  ci: bump test_mimo_models.py est_time 330 → 610 (sgl-project#24551)
  [CI] Temporarily disable marco/mcdse-2b-v1 in test_embedding_models (sgl-project#24279)
  Improve metrics, observability, and PD deploy tooling (sgl-project#24521)
  Fix diffusion fallback guards and validation (sgl-project#23335)
  [PD] Prevent update_status to Failed from cleared entries (sgl-project#24539)
  [CP] Register KV cache allgather buffer with symmetric memory (sgl-project#24040)
  Support getting checksums in weight checker (sgl-project#24537)
  Refactor buffer patterns in weight checker (sgl-project#24538)
  Add unit and end-to-end tests for weight checker (sgl-project#24536)
  ...

# Conflicts:
#   python/sglang/srt/managers/scheduler.py
#   python/sglang/srt/model_executor/model_runner.py

Optimize Z-Image packed QKV

75d923c

github-actions Bot added the diffusion SGLang Diffusion label Apr 30, 2026

BBuf marked this pull request as ready for review April 30, 2026 03:29

BBuf requested review from mickqian, ping1jing2, yhyang201 and yingluosanqian as code owners April 30, 2026 03:29

github-actions Bot added the run-ci label Apr 30, 2026

mickqian approved these changes Apr 30, 2026

View reviewed changes

Merge branch 'main' into codex/zimage-packed-qkv

47d17c6

BBuf and others added 2 commits May 2, 2026 21:09

Merge remote-tracking branch 'origin/main' into codex/zimage-packed-qkv

b122f5b

Merge branch 'main' into codex/zimage-packed-qkv

cf4cb4f

Merge remote-tracking branch 'origin/main' into update-pr-24117

547f918

BBuf and others added 4 commits May 5, 2026 16:28

Merge branch 'main' into codex/zimage-packed-qkv

52f869f

Merge branch 'main' into codex/zimage-packed-qkv

dee0b3b

Merge branch 'main' into codex/zimage-packed-qkv

29b6618

Fix Z-Image fused QKV scale and LoRA loading

70757c7

BBuf merged commit a9a8b20 into main May 6, 2026
101 of 112 checks passed

BBuf deleted the codex/zimage-packed-qkv branch May 6, 2026 23:51

LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026

[codex] Optimize Z-Image packed QKV (sgl-project#24117)

04bd7de

LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026

[codex] Optimize Z-Image packed QKV (sgl-project#24117)

10bbd9d

BBuf merged 9 commits into main from codex/zimage-packed-qkv
