Enable trtllm_mha as gemma4 default attn backend. by wenscarl · Pull Request #25006 · sgl-project/sglang

wenscarl · 2026-05-11T19:45:23Z

Summary

Enable trtllm_mha as the default attention backend for Gemma4 on SM100.

When --attention-backend is not specified for Gemma4ForConditionalGeneration, SGLang now selects:

trtllm_mha on SM100
triton otherwise

This keeps the existing non-SM100 behavior unchanged while enabling the Blackwell-optimized MHA backend for Gemma4 by default.
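The selection rule above can be sketched as follows. This is a minimal illustration, not the code changed in this PR: `pick_default_attention_backend` and the `is_sm100` argument are hypothetical names, and only the architecture string and the two backend names come from the description.

```python
from typing import Optional

GEMMA4_ARCH = "Gemma4ForConditionalGeneration"


def pick_default_attention_backend(
    architecture: str,
    is_sm100: bool,
    user_choice: Optional[str] = None,
) -> Optional[str]:
    """Hypothetical helper mirroring the rule described in this PR."""
    # The default only applies when --attention-backend was not specified.
    if user_choice is not None:
        return user_choice
    # Only Gemma4 is covered by this sketch; other models are out of scope.
    if architecture != GEMMA4_ARCH:
        return None
    # SM100 (Blackwell) gets trtllm_mha; everything else keeps triton.
    return "trtllm_mha" if is_sm100 else "triton"
```

Returning `None` for other architectures is only a placeholder for "this rule does not apply" and says nothing about how SGLang resolves their defaults.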

Benchmark

Same server flags otherwise, comparing triton vs trtllm_mha.

Server

sglang serve --model-path google/gemma-4-31B-it \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --mem-fraction-static 0.9 \
    --host 0.0.0.0 --port 30000 --tp-size 4

Benchmark Commands

Latency, text:

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 10 --max-concurrency 1

Latency, image:

python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 10 --max-concurrency 1

Throughput, text:

python3 -m sglang.bench_serving --backend sglang \
  --host 0.0.0.0 --port 30000 \
  --dataset-name random --num-prompts 1000 --max-concurrency 100

Throughput, image:

python3 -m sglang.bench_serving --backend sglang-oai-chat \
  --host 0.0.0.0 --port 30000 \
  --dataset-name image --image-count 2 --image-resolution 720p \
  --random-input-len 128 --random-output-len 1024 \
  --num-prompts 1000 --max-concurrency 100

Note: the throughput-image benchmark had 999/1000 successful requests with trtllm_mha; one request was silently dropped with no client-side error logged, possibly due to a request abort.

Latency

concurrency=1, 10 prompts

Metric	triton	trtllm_mha	Delta
Text Duration (s)	37.29	32.75	-12.2%
Text Output tok/s	113.2	128.9	+13.9%
Text Mean TTFT (ms)	72.05	67.04	-7.0%
Text Mean TPOT (ms)	8.55	7.60	-11.1%
Text Median ITL (ms)	8.84	7.63	-13.7%
Image Duration (s)	38.45	33.76	-12.2%
Image Output tok/s	109.8	125.0	+13.9%
Image Mean TTFT (ms)	182.55	179.90	-1.5%
Image Mean TPOT (ms)	8.62	7.57	-12.2%

Throughput

concurrency=100, 1000 prompts

Metric	triton	trtllm_mha	Delta
Text Duration (s)	144.45	117.92	-18.4%
Text Req/s	6.92	8.48	+22.5%
Text Output tok/s	3536.6	4332.3	+22.5%
Text Total tok/s	7086.9	8681.5	+22.5%
Text Mean E2E (ms)	13794	11221	-18.7%
Text Mean TPOT (ms)	27.02	21.88	-19.0%
Text P99 TPOT (ms)	37.47	29.89	-20.2%
Image Successful	1000	999	-1 req
Image Duration (s)	249.45	228.88	-8.2%
Image Req/s	4.01	4.36	+8.7%
Image Output tok/s	2048.0	2231.6	+9.0%
Image Mean E2E (ms)	24286	22351	-8.0%
Image Mean TPOT (ms)	46.02	42.10	-8.5%
Image Mean TTFT (ms)	1270.5	1317.7	+3.7%
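For readers who want to check the Delta column, it appears to be the relative change from the triton value to the trtllm_mha value. A small sketch, where `relative_delta` is a hypothetical helper and the inputs are copied from the throughput table:

```python
def relative_delta(baseline: float, candidate: float) -> str:
    """Format the change from baseline to candidate as a signed percentage."""
    change = (candidate - baseline) / baseline * 100.0
    return f"{change:+.1f}%"


# Values taken from the throughput table (triton vs trtllm_mha).
text_duration_delta = relative_delta(144.45, 117.92)  # Text Duration (s)
text_req_rate_delta = relative_delta(6.92, 8.48)      # Text Req/s
print(text_duration_delta, text_req_rate_delta)
```

Negative values are improvements for duration and latency metrics, while positive values are improvements for throughput metrics.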

CI States

Latest PR Test (Base): Run #25998857394
Latest PR Test (Extra): ⚠️ Not enabled — add run-ci-extra label to opt in.

gemini-code-assist · 2026-05-11T19:45:27Z

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

wenscarl · 2026-05-11T19:45:45Z

cc. @nvpohanh

kpham-sgl

Thank you for the contribution! Glad to know flashinfer supports headdim=512 now.

sglang/python/sglang/srt/layers/attention/triton_backend.py

Lines 906 to 913 in c7e53e6

    if k is None and v is None:
        pool = forward_batch.token_to_kv_pool
        cache_loc = forward_batch.out_cache_loc
        if isinstance(pool, SWAKVPool) and pool.layers_mapping[layer.layer_id][1]:
            cache_loc = pool.translate_loc_from_full_to_swa(cache_loc)
        k_buffer, v_buffer = pool.get_kv_buffer(layer.layer_id)
        k = k_buffer[cache_loc]
        v = v_buffer[cache_loc]

For the E2B and E4B variants with KV cache reuse, we need this extra KV cache retrieval path. If you have time, can you figure out how to add a similar path to the flashinfer backend? If not, can you restrict this change to the bigger models (31B and 26B-A4B) only?

pyc96 · 2026-05-11T21:02:04Z

Curious, does the current SGL flashinfer version support it? And does it work with an NVFP4 ckpt?

kpham-sgl · 2026-05-11T21:10:21Z

Curious, does the current SGL flashinfer version support it? And does it work with an NVFP4 ckpt?

@wenscarl what flashinfer version was the trtllm_mha support for headdim=512 added in?

wenscarl · 2026-05-11T22:43:23Z

@wenscarl what flashinfer version was the trtllm_mha support for headdim=512 added in?

v0.6.10.post1
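A version gate for this requirement could look like the sketch below. It is an assumption-laden illustration, not something in SGLang or flashinfer: `parse_version` and `supports_headdim_512` are hypothetical helpers, and the parser only handles plain `X.Y.Z` and `X.Y.Z.postN` strings (no rc or dev releases).

```python
import re
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int, int, int]:
    """Parse 'X.Y.Z' or 'X.Y.Z.postN', optionally prefixed with 'v'."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)(?:\.post(\d+))?", text.strip())
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, patch, post = match.groups()
    # A release without a post segment sorts before any .postN of it.
    return int(major), int(minor), int(patch), int(post or 0)


# Minimum flashinfer version named in this thread.
MIN_FLASHINFER = parse_version("v0.6.10.post1")


def supports_headdim_512(installed: str) -> bool:
    """Hypothetical check: is the installed version at least v0.6.10.post1?"""
    return parse_version(installed) >= MIN_FLASHINFER
```

In practice the installed version string could be read with `importlib.metadata.version`, assuming the correct distribution name for the flashinfer package is known.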

wenscarl · 2026-05-11T22:58:46Z

For the E2B and E4B variants with KV cache reuse, we need this extra KV cache retrieval path. If you have time, can you figure out how to add a similar path to the flashinfer backend? If not, can you restrict this change to the bigger models (31B and 26B-A4B) only?

trtllm_mha actually works for E2B/E4B as-is — verified with a run on E2B and it passes.
trtllm_mha is structurally different — the trtllm-gen kernel reads K/V directly from the paged KV cache via
page_table + get_kv_buffer(layer.layer_id), never as a separate per-token window. And the KV-share redirection is
already done in the model at gemma4_causal.py:325-327:

self.attn = RadixAttention(
    ...
    layer_id=(self.kv_shared_layer_index if self.is_kv_shared_layer else self.layer_id),
    ...
)
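To illustrate why the `layer_id` redirection is enough, here is a toy model of the idea. `ToyKVPool` and `ToyLayer` are hypothetical stand-ins, not SGLang classes; the only thing borrowed from the snippet above is the conditional that maps a KV-shared layer to the layer it shares with.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


class ToyKVPool:
    """Toy stand-in for a KV pool: one (K, V) buffer pair per layer id."""

    def __init__(self, num_layers: int) -> None:
        self._buffers: Dict[int, Tuple[List[float], List[float]]] = {
            i: ([], []) for i in range(num_layers)
        }

    def get_kv_buffer(self, layer_id: int) -> Tuple[List[float], List[float]]:
        return self._buffers[layer_id]


@dataclass
class ToyLayer:
    layer_id: int
    kv_shared_layer_index: Optional[int] = None

    @property
    def attn_layer_id(self) -> int:
        # Mirrors the quoted conditional: a KV-shared layer registers its
        # attention under the id of the layer whose KV cache it reuses.
        if self.kv_shared_layer_index is not None:
            return self.kv_shared_layer_index
        return self.layer_id


pool = ToyKVPool(num_layers=4)
producer = ToyLayer(layer_id=1)
consumer = ToyLayer(layer_id=3, kv_shared_layer_index=1)

# The producer layer writes K/V once.
k_buf, v_buf = pool.get_kv_buffer(producer.attn_layer_id)
k_buf.append(0.5)
v_buf.append(-0.5)

# The consumer layer reads the same buffers through the redirected id.
shared_k, shared_v = pool.get_kv_buffer(consumer.attn_layer_id)
```

Because the lookup key is already redirected when the layer is constructed, any backend that reads K/V through `get_kv_buffer(layer.layer_id)` sees the shared cache without a separate retrieval path.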

kpham-sgl · 2026-05-11T23:20:15Z

For the E2B and E4B variants with KV cache reuse, we need this extra KV cache retrieval path. If you have time, can you figure out how to add a similar path to the flashinfer backend? If not, can you restrict this change to the bigger models (31B and 26B-A4B) only?

trtllm_mha actually works for E2B/E4B as-is — verified with a run on E2B and it passes. trtllm_mha is structurally different — the trtllm-gen kernel reads K/V directly from the paged KV cache via page_table + get_kv_buffer(layer.layer_id), never as a separate per-token window. And the KV-share redirection is already done in the model at gemma4_causal.py:325-327:
self.attn = RadixAttention(
    ...
    layer_id=(self.kv_shared_layer_index if self.is_kv_shared_layer else self.layer_id),
    ...
)

ohh right. Thanks for pointing this out!

kpham-sgl · 2026-05-11T23:24:12Z

/tag-and-rerun-ci

kpham-sgl · 2026-05-11T23:27:44Z

Will merge once we upgrade flashinfer to v0.6.10.post1

kpham-sgl · 2026-05-12T22:45:03Z

/rerun-failed-ci

kpham-sgl · 2026-05-17T18:18:56Z

Accuracy checks in #25461 (comment)

Enable trtllm_mha as gemma4 default attn backend.

c2ce9b4

wenscarl marked this pull request as ready for review May 11, 2026 19:45

kpham-sgl self-assigned this May 11, 2026

kpham-sgl reviewed May 11, 2026

wenscarl requested a review from kpham-sgl May 11, 2026 23:02

kpham-sgl approved these changes May 11, 2026

github-actions Bot added the run-ci label May 11, 2026

Merge branch 'main' into Gemma4_trtllm_attn

bc4c5f7

kpham-sgl mentioned this pull request May 17, 2026

[Spec] Add trtllm_mha support for Gemma 4 MTP draft attention backend #25545

Open

5 tasks

kpham-sgl merged commit c67b287 into sgl-project:main May 17, 2026
113 of 124 checks passed

kpham-sgl mentioned this pull request May 17, 2026

Respect user override for Gemma4 attention backend #25547

Merged

4 tasks
