fix: fallback to triton for attention-sink models (flashinfer unsupported) #23139
Merged
hnyls2002 merged 3 commits into sgl-project:main on Apr 21, 2026
Conversation
Contributor
Code Review
This pull request introduces a mechanism to detect attention sinks in models and ensures the attention backend defaults to Triton when sinks are present, as FlashInfer currently lacks support for them. The review feedback highlights that the detection logic may need to include additional hybrid SWA architectures and suggests implementing validation to prevent users from manually selecting FlashInfer when attention sinks are active, which would lead to incorrect results.
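For example, the suggested guard could be a startup-time check along these lines (illustrative only; this helper name is invented and is not code from this PR):

```python
def validate_attention_backend_choice(attention_backend: str, model_config) -> None:
    # Fail fast on an explicit flashinfer selection for sink models,
    # rather than silently computing incorrect attention outputs.
    if attention_backend == "flashinfer" and model_config.has_attention_sinks:
        raise ValueError(
            "FlashInfer does not support attention sinks; "
            "use --attention-backend triton for this model."
        )
```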
JIACHENG135 reviewed on Apr 18, 2026
Collaborator
/tag-and-rerun-ci
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026
…rted) (sgl-project#23139) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Motivation
The FlashInfer attention backend does not accept a `sinks` kwarg, but hybrid-SWA models like `GptOssForCausalLM` and `MiMoV2FlashForCausalLM` (when `add_{swa,full}_attention_sink_bias=True`) pass per-head sink scalars to the attention call. Two auto-selection paths used to fall back to `flashinfer` without considering this, which could leave sink models on an unsupported backend.
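To make the mismatch concrete, here is a toy illustration; the function names and shapes are invented for this sketch and are not the real sglang or FlashInfer APIs. The sink mechanism shown (a per-head logit appended to the softmax and then dropped) follows the gpt-oss style:

```python
import math
import torch

def flashinfer_like_forward(q, k, v):
    # Stand-in for a backend whose attention call has no `sinks` parameter.
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def triton_like_forward(q, k, v, sinks=None):
    # Stand-in for a backend that folds a per-head sink logit into softmax.
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])  # [H, L, L]
    if sinks is None:
        return torch.softmax(scores, dim=-1) @ v
    sink_col = sinks.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    scores = torch.cat([scores, sink_col], dim=-1)   # one extra sink slot
    probs = torch.softmax(scores, dim=-1)[..., :-1]  # drop the sink slot
    return probs @ v

heads, seq, dim = 8, 4, 16
q, k, v = (torch.randn(heads, seq, dim) for _ in range(3))
sinks = torch.zeros(heads)  # one learned sink scalar per head

try:
    flashinfer_like_forward(q, k, v, sinks=sinks)  # TypeError: no `sinks` kwarg
except TypeError:
    out = triton_like_forward(q, k, v, sinks=sinks)
```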
Modifications

- Added `ModelConfig.has_attention_sinks`, derived in `_derive_hybrid_model` via a new `_detect_attention_sinks()` helper (sketched below):
  - `GptOssForCausalLM` always has sinks
  - `MiMoV2FlashForCausalLM` / `MiMoV2MTP` have sinks only when `add_swa_attention_sink_bias` or `add_full_attention_sink_bias` is set on `hf_text_config`
- In `ServerArgs._get_default_attn_backend`, the non-MLA fallback branch now picks `triton` instead of `flashinfer` when `model_config.has_attention_sinks` is true
- In `ServerArgs._resolve_io_decode_attention_compatibility`, the hicache kernel-io + FA3 avoidance path applies the same check before setting `decode_attention_backend`
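A minimal sketch of the detection helper described above (the exact signature and attribute names in sglang may differ):

```python
def _detect_attention_sinks(architectures: list[str], hf_text_config) -> bool:
    """Return True if the model passes per-head sink scalars to attention.

    Sketch of the logic listed above, not the verbatim sglang implementation.
    """
    if "GptOssForCausalLM" in architectures:
        return True  # gpt-oss always uses attention sinks
    if {"MiMoV2FlashForCausalLM", "MiMoV2MTP"} & set(architectures):
        # MiMo-V2 has sinks only when one of the sink-bias flags is set.
        return getattr(hf_text_config, "add_swa_attention_sink_bias", False) or getattr(
            hf_text_config, "add_full_attention_sink_bias", False
        )
    return False
```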
Scope of Behavior Change

Only the implicit auto-selection paths are affected (see the sketch below):

- No explicit `--attention-backend` given, non-Hopper / non-Blackwell GPU
- `io_backend=kernel` with FA3 as the effective decode backend

Explicitly passing `--attention-backend ...` is unchanged.
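For illustration, the first path corresponds to a fallback branch roughly like this; the surrounding branches are elided and the real diff may differ:

```python
# Hypothetical sketch of ServerArgs._get_default_attn_backend after this PR;
# only the non-MLA fallback branch described above is shown.
def _get_default_attn_backend(self, model_config) -> str:
    # ... MLA and Hopper/Blackwell-specific branches elided ...
    if model_config.has_attention_sinks:
        # flashinfer cannot accept the `sinks` kwarg, so sink models
        # must land on triton instead.
        return "triton"
    return "flashinfer"
```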
Checklist

Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci