Enable trtllm_mha as gemma4 default attn backend. #25006
cc. @nvpohanh |
Thank you for the contribution! Glad to know flashinfer supports headdim=512 now.
sglang/python/sglang/srt/layers/attention/triton_backend.py
Lines 906 to 913 in c7e53e6
For the E2B and E4B variants with KV cache reuse, we need this extra KV cache retrieval path. If you have time, can you figure out how to add a similar path to the flashinfer backend? If not, can you guard this change to the bigger models (31B and 26B-A4B) only?
|
Curious: does the current SGL flashinfer version support it? And does it work with the NVFP4 ckpt? |
@wenscarl what flashinfer version was the |
|
trtllm_mha actually works for E2B/E4B as-is — verified with a run on E2B and it passes. |
ohh right. Thanks for pointing this out! |
|
/tag-and-rerun-ci |
|
Will merge once we upgrade flashinfer to v0.6.10.post1 |
|
/rerun-failed-ci |
|
Accuracy checks in #25461 (comment) |
Summary
Enable trtllm_mha as the default attention backend for Gemma4 on SM100.
When --attention-backend is not specified for Gemma4ForConditionalGeneration, SGLang now selects:
- trtllm_mha on SM100
- triton otherwise
This keeps the existing non-SM100 behavior unchanged while enabling the Blackwell-optimized MHA backend for Gemma4 by default.
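The selection rule above can be sketched roughly as follows. This is an illustration only, not the code in this PR: the function name `pick_gemma4_attention_backend`, its parameters, and the choice to treat any compute capability with major version 10 as SM100 are all assumptions made for the example.

```python
from typing import Optional, Tuple

GEMMA4_ARCH = "Gemma4ForConditionalGeneration"


def pick_gemma4_attention_backend(
    architecture: str,
    user_choice: Optional[str],
    compute_capability: Tuple[int, int],
) -> Optional[str]:
    """Hypothetical helper mirroring the default-selection rule in the summary.

    Returns the backend name, or None when this rule does not apply
    (i.e. the model is not Gemma4 and the user gave no explicit choice).
    """
    # An explicit --attention-backend always wins.
    if user_choice is not None:
        return user_choice
    # The new default only concerns Gemma4.
    if architecture != GEMMA4_ARCH:
        return None
    # Assumption: SM100 corresponds to compute capability major version 10.
    major, _minor = compute_capability
    return "trtllm_mha" if major == 10 else "triton"


# Usage: Gemma4 with no explicit backend, on an SM100 and an SM90 device.
sm100_default = pick_gemma4_attention_backend(GEMMA4_ARCH, None, (10, 0))
sm90_default = pick_gemma4_attention_backend(GEMMA4_ARCH, None, (9, 0))
```

Keeping the explicit user choice as the first check is what leaves existing deployments that pass --attention-backend unaffected.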
Benchmark
Same server flags otherwise, comparing triton vs trtllm_mha.
Server
sglang serve --model-path google/gemma-4-31B-it \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --mem-fraction-static 0.9 \
  --host 0.0.0.0 --port 30000 --tp-size 4
Benchmark Commands
Latency, text:
Latency, image:
Throughput, text:
Throughput, image:
Note: the throughput-image benchmark had 999/1000 successful requests with trtllm_mha; one request was silently dropped with no client-side error logged, possibly due to a request abort.
Latency
concurrency=1, 10 prompts
Throughput
concurrency=100, 1000 prompts
CI States
Latest PR Test (Base): Run #25998857394
Latest PR Test (Extra): ⚠️ Not enabled; add the run-ci-extra label to opt in.