[AMD] Use aiter CK layernorm2d for LayerNorm to reduce NSA indexer kernel launches by 1am9trash · Pull Request #22424 · sgl-project/sglang

1am9trash · 2026-04-09T06:23:16Z

Motivation

The current LayerNorm on HIP uses the torch implementation and triggers an extra dtype cast at both entry and exit. This results in 3 kernels per LayerNorm call (cast -> layernorm -> cast), hurting the performance of operations like k_norm() in the GLM-5-FP8 NSA indexer.

Modifications

In LayerNorm.forward_hip(), use the aiter CK kernel layernorm2d_fwd() when the dtype is bf16 or fp16. For other dtypes, fall back to the original torch code path.
Change k_norm dtype in the NSA indexer from fp32 to bf16 when aiter is enabled, so it can take the CK kernel path.

Accuracy Tests

LayerNorm unit test:

Command: python -m pytest python/sglang/test/test_layernorm.py::TestLayerNorm -v
Result: all 384 subtests passed

Model test:

GLM-5-FP8 on MI355 GSM8k (TP8): 0.946

Speed Tests and Profiling

GLM-5-FP8 server command on MI355:

export SGLANG_ROCM_FUSED_DECODE_MLA=0
export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export SAFETENSORS_FAST_GPU=1
python3 -m sglang.launch_server \
  --model-path GLM-5-FP8 \
  --tp 8 --port 9000 --trust-remote-code \
  --tool-call-parser glm47 --reasoning-parser glm45 \
  --mem-fraction-static 0.85 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}' \
  --nsa-prefill-backend tilelang --nsa-decode-backend tilelang --disable-radix-cache \
  --kv-cache-dtype fp8_e4m3

Benchmark on MI355X TP8, concurrency 4/8/16/32/64 averaged (baseline: sglang PR #22258 + aiter PR #2575):

ISL/OSL 1k/1k: Throughput +1.4%, TPOT -0.9%
ISL/OSL 8k/1k: Throughput +1.2%, TPOT -2.8%

Per-layer profiling:

Time: ~12us -> ~4us
Kernel: 3 -> 1

Checklist

Format your code according to the Format code with pre-commit.
Add unit tests according to the Run and add unit tests.
Update documentation according to Write documentations.
Provide accuracy and speed benchmark results according to Test the accuracy and Benchmark the speed.
Follow the SGLang code style guidance.

Review and Merge Process

Ping Merge Oncalls to start the process. See the PR Merge Process.
Get approvals from CODEOWNERS and other reviewers.
Trigger CI tests with comments or contact authorized users to do so.
- Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

gemini-code-assist · 2026-04-09T06:23:21Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

1am9trash and others added 4 commits April 8, 2026 12:16

Use ck layernorm kernel instead of torch implementation

ac9c70e

Use bf16 for LayerNorm in the NSA indexer when aiter is enabled

59eeff5

Merge branch 'sgl-project:main' into use-aiter-ck-layernorm

8b1eafd

Merge branch 'sgl-project:main' into use-aiter-ck-layernorm

e43ee59

1am9trash requested review from BBuf, Edwardf0t1, Fridge003, HaiShaw, Ying1123, ch-wan, hlu1, hubertlu-tw, ispobock, kkHuang-amd and merrymercy as code owners April 9, 2026 06:23

HaiShaw approved these changes Apr 9, 2026

View reviewed changes

HaiShaw merged commit 628df31 into sgl-project:main Apr 9, 2026
54 of 62 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMD] Use aiter CK layernorm2d for LayerNorm to reduce NSA indexer kernel launches#22424

[AMD] Use aiter CK layernorm2d for LayerNorm to reduce NSA indexer kernel launches#22424
HaiShaw merged 4 commits intosgl-project:mainfrom
1am9trash:use-aiter-ck-layernorm

1am9trash commented Apr 9, 2026 •

edited by HaiShaw

Loading

Uh oh!

gemini-code-assist bot commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

1am9trash commented Apr 9, 2026 • edited by HaiShaw Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Accuracy Tests

Speed Tests and Profiling

Checklist

Review and Merge Process

Uh oh!

gemini-code-assist bot commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

1am9trash commented Apr 9, 2026 •

edited by HaiShaw

Loading