fix: fallback to triton for attention-sink models (flashinfer unsupported) #23139
Merged
hnyls2002 merged 3 commits into sgl-project:main on Apr 21, 2026
Conversation
Contributor
Code Review
This pull request introduces a mechanism to detect attention sinks in models and ensures the attention backend defaults to Triton when sinks are present, as FlashInfer currently lacks support for them. The review feedback highlights that the detection logic may need to include additional hybrid SWA architectures and suggests implementing validation to prevent users from manually selecting FlashInfer when attention sinks are active, which would lead to incorrect results.
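For example, the suggested guard could be a startup-time check along these lines (illustrative only; this helper name is invented and is not code from this PR):

```python
def validate_attention_backend_choice(attention_backend: str, model_config) -> None:
    # Fail fast on an explicit flashinfer selection for sink models,
    # rather than silently computing incorrect attention outputs.
    if attention_backend == "flashinfer" and model_config.has_attention_sinks:
        raise ValueError(
            "FlashInfer does not support attention sinks; "
            "use --attention-backend triton for this model."
        )
```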
JIACHENG135 reviewed on Apr 18, 2026
Collaborator
/tag-and-rerun-ci
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026
…rted) (sgl-project#23139) Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com>
Motivation
The FlashInfer attention backend does not accept a `sinks` kwarg, but hybrid-SWA models like `GptOssForCausalLM` and `MiMoV2FlashForCausalLM` (when `add_{swa,full}_attention_sink_bias=True`) pass per-head sink scalars to the attention call. Two auto-selection paths used to fall back to `flashinfer` without considering this, which could leave sink models on an unsupported backend.
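To make the mismatch concrete, here is a toy illustration; the function names and shapes are invented for this sketch and are not the real sglang or FlashInfer APIs. The sink mechanism shown (a per-head logit appended to the softmax and then dropped) follows the gpt-oss style:

```python
import math
import torch

def flashinfer_like_forward(q, k, v):
    # Stand-in for a backend whose attention call has no `sinks` parameter.
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])
    return torch.softmax(scores, dim=-1) @ v

def triton_like_forward(q, k, v, sinks=None):
    # Stand-in for a backend that folds a per-head sink logit into softmax.
    scores = q @ k.transpose(-1, -2) / math.sqrt(q.shape[-1])  # [H, L, L]
    if sinks is None:
        return torch.softmax(scores, dim=-1) @ v
    sink_col = sinks.view(-1, 1, 1).expand(-1, scores.shape[1], 1)
    scores = torch.cat([scores, sink_col], dim=-1)   # one extra sink slot
    probs = torch.softmax(scores, dim=-1)[..., :-1]  # drop the sink slot
    return probs @ v

heads, seq, dim = 8, 4, 16
q, k, v = (torch.randn(heads, seq, dim) for _ in range(3))
sinks = torch.zeros(heads)  # one learned sink scalar per head

try:
    flashinfer_like_forward(q, k, v, sinks=sinks)  # TypeError: no `sinks` kwarg
except TypeError:
    out = triton_like_forward(q, k, v, sinks=sinks)
```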
Modifications

- Added `ModelConfig.has_attention_sinks`, derived in `_derive_hybrid_model` via a new `_detect_attention_sinks()` helper (sketched below):
  - `GptOssForCausalLM` always has sinks
  - `MiMoV2FlashForCausalLM` / `MiMoV2MTP` have sinks only when `add_swa_attention_sink_bias` or `add_full_attention_sink_bias` is set on `hf_text_config`
- In `ServerArgs._get_default_attn_backend`, the non-MLA fallback branch now picks `triton` instead of `flashinfer` when `model_config.has_attention_sinks` is true
- In `ServerArgs._resolve_io_decode_attention_compatibility`, the hicache kernel-io + FA3 avoidance path applies the same check before setting `decode_attention_backend`
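A minimal sketch of the detection helper described above (the exact signature and attribute names in sglang may differ):

```python
def _detect_attention_sinks(architectures: list[str], hf_text_config) -> bool:
    """Return True if the model passes per-head sink scalars to attention.

    Sketch of the logic listed above, not the verbatim sglang implementation.
    """
    if "GptOssForCausalLM" in architectures:
        return True  # gpt-oss always uses attention sinks
    if {"MiMoV2FlashForCausalLM", "MiMoV2MTP"} & set(architectures):
        # MiMo-V2 has sinks only when one of the sink-bias flags is set.
        return getattr(hf_text_config, "add_swa_attention_sink_bias", False) or getattr(
            hf_text_config, "add_full_attention_sink_bias", False
        )
    return False
```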
Scope of Behavior Change

Only the implicit auto-selection paths are affected (see the sketch below):

- No explicit `--attention-backend` given, non-Hopper / non-Blackwell GPU
- `io_backend=kernel` with FA3 as the effective decode backend

Explicitly passing `--attention-backend ...` is unchanged.
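For illustration, the first path corresponds to a fallback branch roughly like this; the surrounding branches are elided and the real diff may differ:

```python
# Hypothetical sketch of ServerArgs._get_default_attn_backend after this PR;
# only the non-MLA fallback branch described above is shown.
def _get_default_attn_backend(self, model_config) -> str:
    # ... MLA and Hopper/Blackwell-specific branches elided ...
    if model_config.has_attention_sinks:
        # flashinfer cannot accept the `sinks` kwarg, so sink models
        # must land on triton instead.
        return "triton"
    return "flashinfer"
```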
Checklist

Review and Merge Process
/tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci