[codex] Optimize Z-Image packed QKV #24117

Merged: BBuf merged 9 commits into main from codex/zimage-packed-qkv on May 6, 2026

@BBuf BBuf commented Apr 30, 2026

Summary

This PR enables the existing packed QKV projection path for Z-Image when loading the standard BF16 checkpoint, not only the Nunchaku quantized path.

  • Map checkpoint attention.to_q/to_k/to_v.weight tensors into attention.to_qkv.weight during load.
  • Always instantiate ZImageAttention.to_qkv so the non-quantized path also uses one merged QKV projection.
  • No benchmark images or media files are committed; the generated outputs are attached below as GitHub user-attachments.
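As a rough illustration of the first two bullets, the sketch below packs separate Q/K/V weights into one `to_qkv` weight and checks that a single merged projection gives the same outputs as three separate ones. It is not the PR's loader code: `pack_qkv`, `linear`, the `layers.0` prefix, the `to_out` key, and the use of nested lists in place of tensors are all assumptions made for the example.

```python
# Hypothetical sketch: nested lists stand in for weight tensors
# (each inner list is one output row of a linear layer's weight).
QKV_SOURCES = ("to_q", "to_k", "to_v")  # projection names from the PR summary


def pack_qkv(state_dict):
    """Return a new dict in which <prefix>.to_q/to_k/to_v.weight entries are
    concatenated (in q, k, v order) into <prefix>.to_qkv.weight."""
    packed = {}
    pending = {}  # prefix -> {"to_q": rows, "to_k": rows, "to_v": rows}
    for name, rows in state_dict.items():
        parts = name.split(".")
        is_qkv = (
            len(parts) >= 3
            and parts[-1] == "weight"
            and parts[-2] in QKV_SOURCES
            and parts[-3] == "attention"
        )
        if is_qkv:
            prefix = ".".join(parts[:-2])
            pending.setdefault(prefix, {})[parts[-2]] = rows
        else:
            packed[name] = rows  # unrelated tensors pass through unchanged
    for prefix, group in pending.items():
        missing = [s for s in QKV_SOURCES if s not in group]
        if missing:
            raise KeyError(f"{prefix}: missing {missing}")
        # Stacking rows concatenates along the output-feature dimension.
        packed[f"{prefix}.to_qkv.weight"] = (
            group["to_q"] + group["to_k"] + group["to_v"]
        )
    return packed


def linear(x, weight):
    """Compute y = x @ weight.T for a single input vector x."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


# Toy checkpoint with made-up key prefixes and 2x2 weights.
checkpoint = {
    "layers.0.attention.to_q.weight": [[1, 0], [0, 1]],
    "layers.0.attention.to_k.weight": [[2, 0], [0, 2]],
    "layers.0.attention.to_v.weight": [[1, 1], [1, -1]],
    "layers.0.attention.to_out.weight": [[1, 0], [0, 1]],
}
packed = pack_qkv(checkpoint)

x = [3, 4]
qkv = linear(x, packed["layers.0.attention.to_qkv.weight"])
q, k, v = qkv[0:2], qkv[2:4], qkv[4:6]  # split the merged output back apart
```

The point of the merged layout is that one larger matrix multiply replaces three smaller ones, while slicing the output recovers exactly the same Q, K, and V.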

Benchmark

Environment:

  • Host/container: radixark03, sglang-diffusion-bbuf
  • GPU: H200, CUDA_VISIBLE_DEVICES=7
  • Baseline: origin/main at 2d2be5d7b247626c3259fe33c145d0281db6ac4f
  • Tuned: this PR at 75d923c25
  • Native backend check: benchmark logs contain no "Falling back to diffusers backend", "Using diffusers backend", or "Loaded diffusers pipeline" lines.
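The native backend check can be approximated with a small log scan. This is only a sketch of the idea: the three marker strings come from the bullet above, while `find_fallback_lines` and the sample log lines are hypothetical and not part of the benchmark scripts.

```python
# Marker strings taken from the PR description; any hit means the run
# did not stay on the native backend.
FALLBACK_MARKERS = (
    "Falling back to diffusers backend",
    "Using diffusers backend",
    "Loaded diffusers pipeline",
)


def find_fallback_lines(log_text):
    """Return (line_number, line) pairs that contain any fallback marker."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if any(marker in line for marker in FALLBACK_MARKERS):
            hits.append((number, line))
    return hits


# Made-up log excerpts, for illustration only.
clean_log = "INFO starting denoise\nINFO step 1/9\nINFO done"
bad_log = "INFO starting\nWARN Falling back to diffusers backend\nINFO done"

clean_hits = find_fallback_lines(clean_log)
bad_hits = find_fallback_lines(bad_log)
```

An empty result corresponds to the gate passing; a benchmark wrapper could fail the run whenever the list is non-empty.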

Benchmark-only note: current origin/main hits an FA3 ABI mismatch with the FlashAttention build installed on the H200 host (flash_attn_varlen_func() got an unexpected keyword argument 'out'). I applied the same local FA3 compatibility shim to both benchmark worktrees only; the shim is intentionally not included in this PR.

Skill preset command used for both runs, with only the worktree and --label differing:

CUDA_VISIBLE_DEVICES=7 FLASHINFER_DISABLE_VERSION_CHECK=1 PYTHONPATH=python \
python3 /root/.codex/skills/sglang-diffusion-benchmark-profile/scripts/bench_diffusion_denoise.py \
  --model zimage \
  --label <zimage_qkv_pr_base_fa3shim_03|zimage_qkv_pr_tuned_fa3shim_03> \
  --output-dir outputs/diffusion_benchmarks

Expanded preset command:

sglang generate \
  --model-path=Tongyi-MAI/Z-Image-Turbo \
  --prompt="A futuristic cyberpunk city at night, neon lights reflecting on wet streets" \
  --log-level=info \
  --seed=42 \
  --width=1024 \
  --height=1024 \
  --num-inference-steps=9 \
  --guidance-scale=4.0 \
  --save-output \
  --warmup \
  --enable-torch-compile \
  --perf-dump-path <json>

| Run | Denoise latency | E2E latency | Peak memory | Delta |
| --- | --- | --- | --- | --- |
| Baseline | 1.2818s | 1.5609s | 20.8GB | - |
| Tuned | 0.8288s | 1.0739s | 20.6GB | -35.3% denoise / -31.2% E2E |

compare_perf.py reports:

  • E2E: 1560.94 ms -> 1073.93 ms, -487.01 ms (-31.2%)
  • DenoisingStage: 1281.80 ms -> 828.80 ms, -453.00 ms (-35.3%)
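The reported deltas follow directly from the millisecond figures above. The snippet below only reproduces that arithmetic; it is not the `compare_perf.py` implementation, and the `delta` helper is a name chosen for this example.

```python
def delta(baseline_ms, tuned_ms):
    """Return (absolute change in ms, relative change in percent)."""
    diff = tuned_ms - baseline_ms
    return diff, 100.0 * diff / baseline_ms


# Figures copied from the compare_perf.py summary in this PR.
e2e_diff, e2e_pct = delta(1560.94, 1073.93)
denoise_diff, denoise_pct = delta(1281.80, 828.80)

print(f"E2E: {e2e_diff:.2f} ms ({e2e_pct:.1f}%)")
print(f"DenoisingStage: {denoise_diff:.2f} ms ({denoise_pct:.1f}%)")
```

Both percentages are relative to the baseline, so a negative value means the tuned run is faster.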

Raw metric files on the H200 run:

  • Baseline: /tmp/sglang_zimage_pr_base/outputs/diffusion_benchmarks/zimage_zimage_qkv_pr_base_fa3shim_03.json
  • Tuned: /tmp/sglang_zimage_pr_tuned/outputs/diffusion_benchmarks/zimage_zimage_qkv_pr_tuned_fa3shim_03.json

Image Comparison

Baseline, generated by the same preset command:

zimage_base.png

Tuned, generated by the same preset command:

zimage_tuned.png

Fallback links:

Validation

  • git diff --check
  • python3 -m py_compile python/sglang/multimodal_gen/configs/models/dits/zimage.py python/sglang/multimodal_gen/runtime/models/dits/zimage.py
  • bench_diffusion_denoise.py --model zimage baseline + tuned on H200
  • native backend log gate via grep for diffusers fallback strings
  • compare_perf.py on the two benchmark JSON files

@gemini-code-assist
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@github-actions github-actions bot added the diffusion (SGLang Diffusion) label Apr 30, 2026
@BBuf BBuf marked this pull request as ready for review April 30, 2026 03:29

BBuf commented Apr 30, 2026

/tag-and-rerun-ci

BBuf commented May 2, 2026

/tag-and-rerun-ci

BBuf commented May 3, 2026

/tag-and-rerun-ci

BBuf commented May 4, 2026

/tag-and-rerun-ci

@BBuf BBuf merged commit a9a8b20 into main May 6, 2026
101 of 112 checks passed
@BBuf BBuf deleted the codex/zimage-packed-qkv branch May 6, 2026 23:51
ltcs11 added a commit to ltcs11/sglang that referenced this pull request May 7, 2026
LLThomas pushed a commit to LLThomas/sglang that referenced this pull request May 8, 2026
LucQueen pushed a commit to LucQueen/sglang that referenced this pull request May 12, 2026