Add registration API for external linear attention backend #21983

Merged
merrymercy merged 7 commits into sgl-project:main from charlotte12l:lxy/hybrid-model-registry on Apr 7, 2026

Conversation

@charlotte12l charlotte12l commented Apr 2, 2026

Motivation

SGLang currently hardcodes hybrid model support (GDN, KDA, Mamba2, Lightning) across 5 core files via isinstance checks and architecture name lists. Adding a new linear attention hybrid model requires modifying all 5 files:

  • model_runner.py: mambaish_config property
  • attention_registry.py: backend dispatch in attn_backend_wrapper
  • scheduler.py: is_hybrid_ssm flag for MambaRadixCache
  • server_args.py: _handle_model_specific_adjustments
  • triton_backend.py: v_head_dim check

This makes it difficult for external or experimental linear attention models (e.g., custom KDA variants) to integrate with SGLang without forking the core. A registration API lets external models self-register and plug into all 5 integration points with zero modifications to SGLang source.

Modifications

New file: python/sglang/srt/configs/linear_attn_model_registry.py

  • LinearAttnModelSpec dataclass — captures config class, backend class name, arch names, and cache behavior flags
  • register_linear_attn_model() — appends a spec to the global registry (called at import time)
  • get_linear_attn_config() — isinstance-based lookup, returns (spec, resolved_config) or None
  • get_linear_attn_spec_by_arch() — arch name lookup for server_args dispatch
  • import_backend_class() — lazy import of the backend class from a fully-qualified dotted name
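
The pieces listed above can be pictured with a minimal, self-contained sketch. This is not the code in linear_attn_model_registry.py: the function and field names come from this description, while the function bodies, the default values, the module-level `_REGISTRY` list, and the `text_config` attribute used for unwrapping are assumptions made for illustration.

```python
import importlib
from dataclasses import dataclass, field
from typing import Any, List, Optional, Tuple


@dataclass
class LinearAttnModelSpec:
    config_class: type
    backend_class_name: str  # fully-qualified dotted name, imported lazily
    arch_names: List[str] = field(default_factory=list)
    uses_mamba_radix_cache: bool = False  # default value is an assumption
    unwrap_text_config: bool = False


_REGISTRY: List[LinearAttnModelSpec] = []  # hypothetical name for the global list


def register_linear_attn_model(spec: LinearAttnModelSpec) -> None:
    """Append a spec to the global registry (intended to run at import time)."""
    _REGISTRY.append(spec)


def get_linear_attn_config(hf_config: Any) -> Optional[Tuple[LinearAttnModelSpec, Any]]:
    """Return (spec, resolved_config) for the first spec whose class matches."""
    for spec in _REGISTRY:
        candidate = hf_config
        if spec.unwrap_text_config:
            # Assumption: a wrapper config exposes the inner one as `text_config`.
            candidate = getattr(hf_config, "text_config", hf_config)
        if isinstance(candidate, spec.config_class):
            return spec, candidate
    return None


def get_linear_attn_spec_by_arch(arch: str) -> Optional[LinearAttnModelSpec]:
    """Look a spec up by architecture name."""
    for spec in _REGISTRY:
        if arch in spec.arch_names:
            return spec
    return None


def import_backend_class(dotted_name: str) -> type:
    """Import `pkg.module.ClassName` only when the backend is actually needed."""
    module_name, _, class_name = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), class_name)


# Toy usage: a stdlib class stands in for a real attention backend.
class ToyConfig:
    pass


register_linear_attn_model(
    LinearAttnModelSpec(ToyConfig, "collections.OrderedDict", ["ToyForCausalLM"])
)
spec, cfg = get_linear_attn_config(ToyConfig())
backend_cls = import_backend_class(spec.backend_class_name)
```

Storing the backend as a dotted string rather than a class object means registering a model does not import the backend module; the import cost is paid only when a matching model is actually served.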

Integration (all purely additive, existing behavior unchanged):

  • model_runner.py: new linear_attn_model_spec property; mambaish_config falls back to registry after existing hardcoded checks
  • attention_registry.py: attn_backend_wrapper falls back to registry when no hardcoded backend matches
  • scheduler.py: is_hybrid_ssm checks registry for uses_mamba_radix_cache
  • server_args.py: _handle_model_specific_adjustments calls _handle_mamba_radix_cache for registered models
  • triton_backend.py: v_head_dim check includes linear_attn_model_spec
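
The "hardcoded checks first, registry as fallback" ordering shared by these integration points can be sketched roughly as follows. Everything here is hypothetical: `BuiltinHybridConfig`, `ExternalConfig`, `resolve_mambaish_config`, and the two lookup functions are stand-ins, with `registry_lookup` playing the role of get_linear_attn_config(). The actual code in model_runner.py may be structured differently.

```python
class BuiltinHybridConfig:
    """Stand-in for a config type SGLang already handles (hypothetical)."""


class ExternalConfig:
    """Stand-in for a config type supplied by an external package (hypothetical)."""


def resolve_mambaish_config(hf_config, registry_lookup):
    # 1. Existing hardcoded checks run first and keep their behavior.
    if isinstance(hf_config, BuiltinHybridConfig):
        return hf_config
    # 2. Only when nothing built in matched is the registry consulted.
    match = registry_lookup(hf_config)
    if match is not None:
        _spec, resolved_config = match
        return resolved_config
    return None


def empty_registry(hf_config):
    # Models the default state: nothing registered, so nothing ever matches.
    return None


def registry_with_external(hf_config):
    # Models a registry holding one externally registered config type.
    if isinstance(hf_config, ExternalConfig):
        return ("external-spec", hf_config)
    return None


builtin, external = BuiltinHybridConfig(), ExternalConfig()
```

With `empty_registry`, a built-in config resolves exactly as it would without the fallback branch and an unknown config still resolves to `None`; only a registered external config takes the new path.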

Usage:

from sglang.srt.configs.linear_attn_model_registry import (
    register_linear_attn_model, LinearAttnModelSpec,
)

register_linear_attn_model(LinearAttnModelSpec(
    config_class=MyLinearAttnConfig,
    backend_class_name="sglang.srt.layers.attention.linear.kda_backend.KDAAttnBackend",
    arch_names=["MyModelForCausalLM"],
    uses_mamba_radix_cache=True,
))

Accuracy Tests

Unit tests covering all registry functions (11 tests, all passing):

test_register_and_lookup_by_config ... ok
test_lookup_no_match ... ok
test_lookup_empty_registry ... ok
test_unwrap_text_config ... ok
test_unwrap_text_config_no_match ... ok
test_lookup_by_arch ... ok
test_lookup_by_arch_empty_registry ... ok
test_multiple_registrations ... ok
test_first_match_wins ... ok
test_import_backend_class ... ok
test_spec_defaults ... ok
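
For readers unfamiliar with the cases named above, here is a rough idea of what two of them might check. This sketch does not reproduce the PR's test file: `Spec`, `lookup`, and the config classes are hypothetical stand-ins, and the "earliest registration wins" reading of test_first_match_wins is inferred from its name.

```python
import unittest
from dataclasses import dataclass


@dataclass
class Spec:  # hypothetical stand-in for LinearAttnModelSpec
    config_class: type
    backend_class_name: str


class BaseConfig:
    pass


class DerivedConfig(BaseConfig):
    pass


def lookup(registry, config):
    """isinstance-based scan in registration order (assumed semantics)."""
    for spec in registry:
        if isinstance(config, spec.config_class):
            return spec
    return None


class RegistryLookupTest(unittest.TestCase):
    def test_lookup_empty_registry(self):
        self.assertIsNone(lookup([], BaseConfig()))

    def test_first_match_wins(self):
        # DerivedConfig is an instance of both classes; the earlier spec is returned.
        registry = [Spec(BaseConfig, "pkg.First"), Spec(DerivedConfig, "pkg.Second")]
        found = lookup(registry, DerivedConfig())
        self.assertEqual(found.backend_class_name, "pkg.First")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(RegistryLookupTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

One consequence of isinstance-based matching in registration order is that a spec for a base config class shadows later specs for its subclasses, so more specific config classes would need to be registered first.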

No accuracy regression is expected, because the registry is purely additive. All existing hardcoded checks execute first; the registry is only consulted as a fallback when no existing model matches. When the registry is empty (the default), all code paths are identical to before this change.

Checklist

Review and Merge Process

  1. Ping Merge Oncalls to start the process. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • Common commands include /tag-and-rerun-ci, /tag-run-ci-label, /rerun-failed-ci
  4. After green CI and required approvals, ask Merge Oncalls or people with Write permission to merge the PR.

@charlotte12l charlotte12l force-pushed the lxy/hybrid-model-registry branch 2 times, most recently from b1161f2 to 8492037 Compare April 2, 2026 23:40
unwrap_text_config: bool = False

# If True, asserts the model is not used with MLA backends.
mla_incompatible: bool = False

when is this used?

Comment thread python/sglang/srt/layers/attention/attention_registry.py Outdated
@merrymercy

/tag-and-rerun-ci

@github-actions github-actions Bot added the run-ci label Apr 3, 2026
@charlotte12l charlotte12l force-pushed the lxy/hybrid-model-registry branch from 8492037 to 46e8768 Compare April 3, 2026 05:41
SGLang currently hardcodes hybrid model support (GDN, KDA, Mamba2,
Lightning) across 5 files via isinstance checks and architecture name
lists. Adding a new linear attention hybrid model requires modifying
all 5 core files.

This adds a `register_linear_attn_model()` API so external models can
self-register without modifying SGLang source:

    from sglang.srt.configs.linear_attn_model_registry import (
        register_linear_attn_model, LinearAttnModelSpec,
    )
    register_linear_attn_model(LinearAttnModelSpec(
        config_class=MyConfig,
        backend_class_name="sglang.srt...KDAAttnBackend",
        arch_names=["MyModelForCausalLM"],
        uses_mamba_radix_cache=True,
    ))

All 5 integration points now check the registry as a fallback after
existing hardcoded checks, so this is purely additive with zero
behavior change for existing models:

- model_runner.py: mambaish_config + new linear_attn_model_spec property
- attention_registry.py: backend dispatch fallback
- scheduler.py: is_hybrid_ssm fallback
- server_args.py: _handle_model_specific_adjustments fallback
- triton_backend.py: v_head_dim check fallback

Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
@charlotte12l charlotte12l force-pushed the lxy/hybrid-model-registry branch from 46e8768 to 3258b3e Compare April 3, 2026 05:48
Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
@charlotte12l charlotte12l force-pushed the lxy/hybrid-model-registry branch from 3258b3e to 83d132d Compare April 3, 2026 06:13
@charlotte12l charlotte12l changed the title from "[Hybrid] Add registration API for external linear attention models" to "Add registration API for external linear attention backend" Apr 5, 2026
Remove the redundant mla_incompatible field from LinearAttnModelSpec
and its associated assertion in attention_registry.py, as requested
in PR sgl-project#21983 review.

Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
@charlotte12l

@merrymercy I've addressed the comments; those timeout errors are not related...

charlotte12l and others added 2 commits April 6, 2026 11:58
Follow-up to removing the mla_incompatible field from
LinearAttnModelSpec per reviewer feedback.

Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
@merrymercy

/tag-and-rerun-ci

@merrymercy merrymercy merged commit 98f38b1 into sgl-project:main Apr 7, 2026
134 of 145 checks passed
yhyang201 pushed a commit to yhyang201/sglang that referenced this pull request Apr 22, 2026
…ct#21983)

Signed-off-by: Xingyu Liu <charlotteliu12x@gmail.com>
caitengwei pushed a commit to caitengwei/sglang that referenced this pull request Jun 1, 2026

2 participants