fix: fallback to triton for attention-sink models (flashinfer unsupported)#23139

Merged
hnyls2002 merged 3 commits into sgl-project:main from alphabetc1:fix/sink-model-decode-backend on Apr 21, 2026

Conversation

@alphabetc1 (Collaborator) commented Apr 18, 2026

Motivation

The FlashInfer attention backend does not accept a sinks kwarg, but hybrid-SWA models like GptOssForCausalLM and MiMoV2FlashForCausalLM (when add_{swa,full}_attention_sink_bias=True) pass per-head sink scalars to the attention call. Two auto-selection paths used to fall back to flashinfer without considering this, which could leave sink models on an unsupported backend.
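
As context for why silently routing to flashinfer is a correctness problem rather than just a missing feature, here is a minimal conceptual sketch of how per-head sink logits enter the attention math. This is an illustration only, not the sglang backend interface; the function name and tensor layout below are hypothetical. Each head owns one learnable scalar that joins the softmax row but contributes no value, so a backend that ignores the sinks kwarg computes a different attention distribution for every token.

```python
import torch

def attention_with_sinks(q, k, v, sinks):
    # q, k, v: [num_heads, seq_len, head_dim]; sinks: [num_heads] learnable logits.
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("hqd,hkd->hqk", q, k) * scale             # [H, Q, K]
    sink_col = sinks.view(-1, 1, 1).expand(-1, scores.shape[1], 1)  # [H, Q, 1]
    # The sink logit is softmaxed together with the token scores...
    probs = torch.softmax(torch.cat([scores, sink_col], dim=-1), dim=-1)
    # ...but has no value vector, so its column is simply dropped afterwards.
    return torch.einsum("hqk,hkd->hqd", probs[..., :-1], v)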

Modifications

  • Add ModelConfig.has_attention_sinks, derived in _derive_hybrid_model via a new _detect_attention_sinks() helper:
    • GptOssForCausalLM always has sinks
    • MiMoV2FlashForCausalLM / MiMoV2MTP have sinks only when add_swa_attention_sink_bias or add_full_attention_sink_bias is set on hf_text_config
  • In ServerArgs._get_default_attn_backend, the non-MLA fallback branch now picks triton instead of flashinfer when model_config.has_attention_sinks is true
  • In ServerArgs._resolve_io_decode_attention_compatibility, the hicache kernel-io + FA3 avoidance path applies the same check before setting decode_attention_backend (a sketch of the detection helper and the fallback check follows this list)
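
A rough sketch of the two pieces above. The identifiers taken from this PR (GptOssForCausalLM, MiMoV2FlashForCausalLM, MiMoV2MTP, the add_*_attention_sink_bias flags, has_attention_sinks) are real; the function signatures, the architectures lookup, and the _pick_non_mla_fallback_backend wrapper are illustrative assumptions, not the merged code:

```python
def _detect_attention_sinks(hf_config, hf_text_config) -> bool:
    # Called from _derive_hybrid_model; argument shapes are assumed here.
    architectures = getattr(hf_config, "architectures", None) or []
    if "GptOssForCausalLM" in architectures:
        return True  # GPT-OSS always carries per-head sink logits
    if any(a in architectures for a in ("MiMoV2FlashForCausalLM", "MiMoV2MTP")):
        # MiMo-V2 only uses sinks when one of the bias flags is set
        return getattr(hf_text_config, "add_swa_attention_sink_bias", False) or getattr(
            hf_text_config, "add_full_attention_sink_bias", False
        )
    return False


def _pick_non_mla_fallback_backend(model_config) -> str:
    # Mirrors the non-MLA fallback branch in ServerArgs._get_default_attn_backend
    # and the hicache kernel-io/FA3 path; explicit --attention-backend bypasses this.
    if model_config.has_attention_sinks:
        return "triton"  # flashinfer does not accept the sinks kwarg
    return "flashinfer"
```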

Scope of Behavior Change

Only the implicit auto-selection paths are affected:

| Scenario | Before | After |
| --- | --- | --- |
| Sink model, no --attention-backend, non-Hopper / non-Blackwell GPU | flashinfer (unsupported) | triton |
| Sink model, hicache io_backend=kernel with FA3 effective decode | flashinfer (unsupported) | triton |
| Non-sink models, all configurations | unchanged | unchanged |
| Explicit --attention-backend ... | unchanged | unchanged |

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.


@gemini-code-assist (Bot) left a comment

Code Review

This pull request introduces a mechanism to detect attention sinks in models and ensures the attention backend defaults to Triton when sinks are present, as FlashInfer currently lacks support for them. The review feedback highlights that the detection logic may need to include additional hybrid SWA architectures and suggests implementing validation to prevent users from manually selecting FlashInfer when attention sinks are active, which would lead to incorrect results.

  • Comment thread: python/sglang/srt/configs/model_config.py
  • Comment thread: python/sglang/srt/server_args.py
  • Comment thread: python/sglang/srt/server_args.py
@hnyls2002 (Collaborator) commented:

/tag-and-rerun-ci

@hnyls2002 changed the title from "fix: use triton decode for sink models" to "fix: fallback to triton for attention-sink models (flashinfer unsupported)" on Apr 21, 2026
@hnyls2002 merged commit e3782d0 into sgl-project:main on Apr 21, 2026
23 of 60 checks passed
@alphabetc1 deleted the fix/sink-model-decode-backend branch on April 22, 2026
zhangying098 pushed a commit to zhangying098/sglang that referenced this pull request on Apr 23, 2026:
fix: fallback to triton for attention-sink models (flashinfer unsupported) (sgl-project#23139)
Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com>
Co-authored-by: hnyls2002 <lsyincs@gmail.com>