[Feature] Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile by adarshxs · Pull Request #19163 · sgl-project/sglang

adarshxs · 2026-02-22T20:01:42Z

Motivation

Adds a generic modeling backend that uses HF transformers models directly via AutoModel.from_config(), enabling any model with a tp_plan, pp_plan and custom attention support to run on SGLang without a dedicated model implementation.

Modifications

Architecture

Mixin-based design that composes capabilities:
• TransformersBase - core class: meta-device init, recursive module replacement (Linear -> TP, RMSNorm -> fused), attention instance creation, PP support, weight loading via AutoWeightsLoader
• CausalMixin - LM head + logits processor
• EmbeddingMixin - pooling for embedding models
• MoEMixin - auto-detects experts modules and replaces with TransformersFusedMoE (custom-op backed, fused kernels, EPLB recording)
• MultiModalMixin - vision/audio/video encoder dispatch, M-RoPE, token_type_ids propagation

Concrete classes cover all combinations: TransformersForCausalLM, TransformersMoEForCausalLM, TransformersMultiModalForCausalLM, TransformersMultiModalMoEForCausalLM, plus embedding and sequence classification variants.
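
As a rough illustration of how the mixins could compose, here is a minimal sketch in plain Python. Only the class names come from the description above. The `capabilities` method, its return values, and the base-class ordering are hypothetical and are not taken from the PR.

```python
# Illustrative stand-ins only: method names and bodies are hypothetical,
# not the PR's implementation.
class TransformersBase:
    def capabilities(self) -> list[str]:
        return ["base"]


class CausalMixin:
    def capabilities(self) -> list[str]:
        return super().capabilities() + ["lm_head"]


class MoEMixin:
    def capabilities(self) -> list[str]:
        return super().capabilities() + ["fused_moe"]


class MultiModalMixin:
    def capabilities(self) -> list[str]:
        return super().capabilities() + ["multimodal"]


# Mixins are listed first so their methods wrap the base class via the MRO.
class TransformersForCausalLM(CausalMixin, TransformersBase):
    pass


class TransformersMultiModalMoEForCausalLM(
    MultiModalMixin, MoEMixin, CausalMixin, TransformersBase
):
    pass


print(TransformersForCausalLM().capabilities())
print(TransformersMultiModalMoEForCausalLM().capabilities())
```

Each mixin calls `super()`, so adding a capability is a matter of adding one more base class rather than writing a new combined class body.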

Testing

Ran the following tests on two setups (1xH100 and 4xA100):

Model	Config	Status
Qwen3-0.6B	TP=1, TP=2, TP=1 + Torch Compile, TP=2 + Torch Compile	PASS
Qwen3-30B-A3B (MoE, 128 experts)	TP=1, TP=2	PASS
Gemma3-4B-IT (VLM)	TP=4, TP=4 + Torch Compile, text+image	PASS
Qwen3-VL-2B (VLM)	TP=1, TP=1 + Torch Compile, text	PASS

#	Category	Model	Architecture	Config
1	Non-MoE LLM Tests	Qwen3-0.6B	TransformersForCausalLM	TP=1
2	Non-MoE LLM Tests	Qwen3-0.6B	TransformersForCausalLM	TP=2
3	Non-MoE LLM Tests	Qwen3-0.6B	TransformersForCausalLM	TP=1+compile
4	Non-MoE LLM Tests	Qwen3-0.6B	TransformersForCausalLM	TP=2+compile
5	Non-MoE LLM Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	Single GPU
6	Non-MoE LLM Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	TP=2
7	Non-MoE LLM Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	TP=4
8	MoE LLM Tests	Qwen3-30B-A3B (128 experts)	TransformersMoEForCausalLM	TP=1
9	MoE LLM Tests	Qwen3-30B-A3B (128 experts)	TransformersMoEForCausalLM	TP=2
10	MoE LLM Tests	Qwen3-30B-A3B (128 experts)	TransformersMoEForCausalLM	EP=2 (tp=2, ep=2)
11	MoE LLM Tests	Qwen3-30B-A3B (128 experts)	TransformersMoEForCausalLM	TP=4
12	MoE LLM Tests	Qwen3-30B-A3B (128 experts)	TransformersMoEForCausalLM	EP=4 (tp=4, ep=4)
13	VLM Tests	Qwen2.5-VL-7B-Instruct	TransformersMultiModalForCausalLM	Single GPU, text+image
14	VLM Tests	Qwen3-VL-2B	TransformersMultiModalForCausalLM	TP=1, text
15	VLM Tests	Qwen3-VL-2B	TransformersMultiModalForCausalLM	TP=1+compile, text
16	VLM Tests	Gemma3-4B-IT	TransformersForCausalLM	Single GPU, text
17	VLM Tests	Gemma3-4B-IT	TransformersMultiModalForCausalLM	Single GPU, text
18	VLM Tests	Gemma3-4B-IT	TransformersMultiModalForCausalLM	TP=4
19	VLM Tests	Gemma3-4B-IT	TransformersMultiModalForCausalLM	TP=4+compile
20	VLM Tests	Gemma3-4B-IT	TransformersMultiModalForCausalLM	text+image
21	VLM Tests	Gemma3-4B-IT	TransformersMultiModalForCausalLM	Single GPU, image
22	Pipeline Parallel Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	PP=2
23	Pipeline Parallel Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	PP=4
24	Data Parallel Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	DP=2
25	Data Parallel Tests	Qwen2.5-7B-Instruct	TransformersForCausalLM	DP=4


…orch.compile support

gemini-code-assist · 2026-02-22T20:02:07Z

Summary of Changes

Hello @adarshxs, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances SGLang's model compatibility by introducing a generic backend that directly integrates with Hugging Face Transformers models. The new architecture supports advanced features like tensor and pipeline parallelism, Mixture-of-Experts, and multimodal capabilities, allowing for broader model deployment and more efficient execution, particularly for complex models like VLMs and MoE architectures. This change aims to reduce the effort required to onboard new models and improve performance through optimized operations and torch.compile integration.

Highlights

Generic Transformers Backend: Introduced a new generic modeling backend that directly utilizes Hugging Face AutoModel.from_config() to support a wider range of models without requiring dedicated SGLang implementations.
Mixin-Based Architecture: Implemented a modular, mixin-based design (TransformersBase, CausalMixin, EmbeddingMixin, MoEMixin, MultiModalMixin) to compose capabilities like tensor parallelism (TP), pipeline parallelism (PP), Mixture-of-Experts (MoE), and Vision-Language Models (VLMs).
MoE and Multimodal Support: Added robust support for MoE models by replacing expert modules with TransformersFusedMoE and enhanced multimodal processing through a new TransformersAutoMultimodalProcessor that leverages Hugging Face's native processing for tokenization and feature extraction, including M-RoPE position handling.
Torch Compile Integration: Integrated torch.compile with compatibility checks, specifically disabling it for models with dynamic RoPE scaling to ensure stable graph compilation.
Improved Weight Loading and Module Replacement: Refactored weight loading with a new AutoWeightsLoader for greater flexibility and introduced recursive module replacement for linear layers (TP) and RMSNorm layers (fused) within the Transformers backend.
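
The torch.compile compatibility check mentioned above could look roughly like the sketch below. The function name `rope_allows_compile` is hypothetical, and the check in the PR may inspect other fields. The only assumption about Hugging Face configs is that `rope_scaling` is a dict whose scaling kind is stored under `rope_type` or the older `type` key.

```python
from typing import Any, Mapping, Optional


def rope_allows_compile(rope_scaling: Optional[Mapping[str, Any]]) -> bool:
    """Hypothetical check: refuse compilation when RoPE scaling is dynamic."""
    if not rope_scaling:
        return True  # no scaling configured, nothing data-dependent
    # HF configs have used both "rope_type" and the older "type" key.
    rope_type = rope_scaling.get("rope_type", rope_scaling.get("type"))
    return rope_type != "dynamic"


print(rope_allows_compile(None))
print(rope_allows_compile({"rope_type": "dynamic", "factor": 2.0}))
```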


Changelog

python/sglang/srt/configs/model_config.py
- Updated multimodal disabling logic to be conditional on the model implementation, allowing multimodal features for the new Transformers backend.
- Expanded multimodal model detection to include models with vision or audio sub-configurations.
python/sglang/srt/managers/io_struct.py
- Added an optional token_type_ids field to TokenizedGenerateReqInput to support multimodal models that require it.
python/sglang/srt/managers/multimodal_processor.py
- Introduced TransformersAutoMultimodalProcessor to handle generic multimodal processing for the new Transformers backend.
python/sglang/srt/managers/scheduler.py
- Propagated token_type_ids from received requests to the tokenized object.
- Made the pad_input_ids_func call conditional, allowing it to be None for certain backends.
python/sglang/srt/managers/tokenizer_manager.py
- Extracted and flattened token_type_ids from multimodal inputs for propagation.
python/sglang/srt/model_executor/model_runner.py
- Disabled piecewise CUDA graph for the Transformers backend, recommending torch.compile instead.
- Added a check to disable torch.compile if the Transformers backend model is not compatible (e.g., due to dynamic RoPE scaling).
python/sglang/srt/model_loader/utils.py
- Added helper functions _is_moe_model, _is_sequence_classification_model, and _get_transformers_backend_arch to dynamically determine model architecture types.
- Modified resolve_transformers_arch to dynamically select the appropriate Transformers backend class based on model characteristics (e.g., MoE, multimodal, causal LM, embedding, sequence classification).
- Improved compatibility checks and warning messages during model class resolution.
python/sglang/srt/models/transformers.py
- Refactored the TransformersForCausalLM class into a flexible mixin-based hierarchy (TransformersBase, CausalMixin, EmbeddingMixin, MoEMixin, MultiModalMixin) to support various model types and features.
- Implemented TransformersBase for core functionalities including meta-device initialization, recursive module replacement (Linear, RMSNorm), attention instance creation, and pipeline parallel support.
- Introduced TransformersFusedMoE to wrap SGLang's native MoE implementation, handling expert parallelism and weight loading for MoE models.
- Added can_enable_torch_compile function to determine torch.compile compatibility, specifically addressing dynamic RoPE scaling.
- Implemented pipeline parallelism logic, including PPMissingLayer to handle skipped layers on non-participating ranks.
- Updated sglang_flash_attention_forward to align with the new architecture and parameters.
- Expanded the EntryClass list to include all new mixin combinations for causal LMs, embedding models, and sequence classification models with optional MoE and multimodal support.
python/sglang/srt/models/utils.py
- Added the __or__ operator to WeightsMapper for easier combination of weight mapping rules.
- Introduced AutoWeightsLoader to provide a more robust and configurable mechanism for loading model weights, including options to skip or ignore specific parameters.
python/sglang/srt/multimodal/processors/transformers_auto.py
- Added TransformersAutoMultimodalProcessor to generically process multimodal inputs by directly applying Hugging Face processors, handling token expansion, M-RoPE positions, and token_type_ids.
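
To make the `__or__` composition of weight-mapping rules concrete, here is a small sketch. `PrefixMapper` and its `prefixes` field are hypothetical stand-ins, not the PR's `WeightsMapper`. The sketch assumes a mapper is just a prefix-rewrite table and that right-hand rules win on conflict.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PrefixMapper:
    """Hypothetical stand-in for a weight-name mapper; not the PR's class."""

    prefixes: dict[str, str] = field(default_factory=dict)

    def __or__(self, other: "PrefixMapper") -> "PrefixMapper":
        # Right-hand rules win on conflict, mirroring dict union semantics.
        return PrefixMapper({**self.prefixes, **other.prefixes})

    def apply(self, name: str) -> str:
        # Rewrite the first matching prefix; leave other names untouched.
        for old, new in self.prefixes.items():
            if name.startswith(old):
                return new + name[len(old):]
        return name


base = PrefixMapper({"lm_head.": "head."})
extra = PrefixMapper({"visual.": "vision_tower."})
combined = base | extra
print(combined.apply("visual.blocks.0.weight"))
```

Returning a new mapper from `|` keeps both operands unchanged, so a base class and a mixin can each contribute rules without mutating shared state.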


gemini-code-assist

Code Review

This pull request introduces a generic modeling backend for SGLang that leverages Hugging Face transformers models directly. It supports Tensor Parallelism (TP), Pipeline Parallelism (PP), Mixture of Experts (MoE), and Vision-Language Models (VLMs) through a mixin-based architecture. Key components include TransformersBase for core logic like meta-device initialization and module replacement, and specialized mixins for causal, embedding, and multimodal capabilities. The implementation also integrates torch.compile support and provides a generic multimodal processor. Overall, the design is modular and significantly expands SGLang's model compatibility.

python/sglang/srt/models/transformers.py

gemini-code-assist · 2026-02-22T20:07:44Z

python/sglang/srt/models/transformers.py

-        """
+    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
+        loaded: set[str] = set()
+        param_dict = dict(self.named_parameters())


Calling dict(self.named_parameters()) inside load_weights can be expensive, especially since this method is called for every weight shard during model loading. It is better to cache this dictionary in __init__ or lazily upon the first call to load_weights.

References

Avoid redundant computations in performance-critical paths like weight loading.
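
A minimal sketch of the lazy-caching idea follows, using a plain-Python stand-in so it runs without PyTorch. `TinyModel`, `_param_cache`, and the `scans` counter are hypothetical, and floats play the role of tensors.

```python
from typing import Iterable, Iterator, Optional


class TinyModel:
    """Stand-in for an nn.Module; floats play the role of tensors."""

    def __init__(self) -> None:
        self._params = {"embed.weight": 0.0, "norm.weight": 0.0}
        self._param_cache: Optional[dict[str, float]] = None
        self.scans = 0  # counts how often named_parameters() is walked

    def named_parameters(self) -> Iterator[tuple[str, float]]:
        self.scans += 1
        yield from self._params.items()

    def load_weights(self, weights: Iterable[tuple[str, float]]) -> set[str]:
        if self._param_cache is None:
            # Build the lookup once instead of on every call.
            self._param_cache = dict(self.named_parameters())
        loaded: set[str] = set()
        for name, value in weights:
            if name in self._param_cache:
                self._params[name] = value
                loaded.add(name)
        return loaded


model = TinyModel()
model.load_weights([("embed.weight", 1.0)])
model.load_weights([("norm.weight", 2.0)])
print(model.scans)
```

If parameters are added or replaced after the first call, the cache would need to be reset, so this pattern fits best when module replacement has finished before loading starts.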

gemini-code-assist · 2026-02-22T20:07:44Z

python/sglang/srt/models/transformers.py

+        """Materialize any parameters still on the meta device."""
+        for name, param in module.named_parameters(recurse=False):
+            if param.device == torch.device("meta"):
+                new_param = nn.Parameter(


The device is hardcoded to "cuda" when materializing parameters from the meta device. This will cause failures on other supported hardware backends (e.g., xpu, npu, cpu). Consider using the current default device or passing the target device to the model constructor.

gemini-code-assist · 2026-02-22T20:07:44Z

python/sglang/srt/models/transformers.py

+                output = output[0]
+        elif hasattr(output, "last_hidden_state"):
+            output = output.last_hidden_state
+        elif isinstance(output, dict):


The logic next(v for v in output.values() if torch.is_tensor(v)) is non-deterministic and risky if the dictionary contains multiple tensors (e.g., both pooler_output and hidden_states). It is safer to look for specific known keys used by Hugging Face multimodal encoders.
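
One way to follow this suggestion is to look up explicit keys in a fixed priority order and fail loudly otherwise. The sketch below is an assumption-laden illustration: `select_encoder_output` and `KNOWN_KEYS` are hypothetical names, and the key list would need to match the encoders actually supported.

```python
from typing import Any, Mapping, Sequence

# Assumed priority order; adjust to the encoders you actually support.
KNOWN_KEYS: Sequence[str] = ("last_hidden_state", "pooler_output")


def select_encoder_output(
    output: Mapping[str, Any], keys: Sequence[str] = KNOWN_KEYS
) -> Any:
    """Return the first known key present, instead of the first tensor found."""
    for key in keys:
        value = output.get(key)
        if value is not None:
            return value
    raise KeyError(f"none of {list(keys)} found in output keys {sorted(output)}")


# Strings stand in for tensors here.
print(select_encoder_output({"pooler_output": "p", "last_hidden_state": "h"}))
```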

gemini-code-assist · 2026-02-22T20:07:44Z

python/sglang/srt/models/utils.py

+            (weight_name.split(".", 1), weight_data)
+            for weight_name, weight_data in weights
+        )
+        for prefix, group in itertools.groupby(weights_by_parts, key=lambda x: x[0][0]):


itertools.groupby only groups consecutive items with the same key. If the weights iterable is not sorted by name, this will result in multiple calls to _load_module for the same submodule with partial weight sets, which is inefficient and may break submodules that expect all weights at once. Ensure the weights are sorted by name before grouping.

Suggested change

-        for prefix, group in itertools.groupby(weights_by_parts, key=lambda x: x[0][0]):
+        for prefix, group in itertools.groupby(sorted(weights_by_parts, key=lambda x: x[0][0]), key=lambda x: x[0][0]):
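
The grouping behaviour described in this comment can be reproduced with a few made-up weight names. The names and the `first_part` helper are illustrative only.

```python
import itertools

names = ["model.a", "lm_head.w", "model.b"]  # "model" entries are not adjacent


def first_part(name: str) -> str:
    return name.split(".", 1)[0]


# Without sorting, "model" shows up as two separate groups.
unsorted_groups = [
    (key, list(group)) for key, group in itertools.groupby(names, key=first_part)
]
# Sorting by the same key first yields one group per prefix.
sorted_groups = [
    (key, list(group))
    for key, group in itertools.groupby(sorted(names, key=first_part), key=first_part)
]

print(unsorted_groups)
print(sorted_groups)
```

Note that `sorted` consumes the whole iterable before grouping starts, so with a lazy weight iterator this trades streaming for correctness. Whether that is acceptable depends on how the weights are produced.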

python/sglang/srt/model_executor/model_runner.py

python/sglang/srt/multimodal/processors/transformers_auto.py

yuan-luo · 2026-03-08T08:26:24Z

Could you specify the usage commands for these new features and paste the accuracy results?

adarshxs · 2026-03-08T15:02:48Z

Usage Instructions

Basic Usage (Transformers Backend)

Use the transformers modeling backend explicitly:

python -m sglang.launch_server \
  --model-path Qwen/Qwen3-4B \
  --model-impl transformers \
  --host 0.0.0.0 --port 30000

or

sglang serve \
  --model-path Qwen/Qwen3-4B \
  --model-impl transformers

Supported `--model-impl` Values

Value	Behavior
`auto` (default)	Use SGLang native implementation if available, fall back to transformers
`sglang`	SGLang native implementation (error if not available)
`transformers`	HF transformers backend
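
The table above can be read as a small decision function. The sketch below is a hypothetical helper that mirrors the documented behaviour. It is not SGLang's actual resolver, and `native_available` is an assumed input.

```python
def resolve_model_impl(model_impl: str, native_available: bool) -> str:
    """Hypothetical helper mirroring the table; not SGLang's actual resolver."""
    if model_impl == "auto":
        # Prefer the native implementation, fall back to transformers.
        return "sglang" if native_available else "transformers"
    if model_impl == "sglang":
        if not native_available:
            raise ValueError("no native SGLang implementation for this model")
        return "sglang"
    if model_impl == "transformers":
        return "transformers"
    raise ValueError(f"unknown --model-impl value: {model_impl!r}")


print(resolve_model_impl("auto", native_available=False))
```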

Accuracy Eval

Model: Qwen/Qwen3-4B
Hardware: 1x H100 NVL

Benchmark	SGLang Native	Transformers Backend
MGSM-EN (n=250)	0.916	0.940
MMLU (n=250)	0.720	0.744
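
For a rough sense of how much weight the differences in the table can bear, the sketch below computes a two-proportion z-score for each benchmark. It treats the two runs as independent samples of 250 items each, which is an assumption. If both backends were scored on the same questions, a paired comparison would be more appropriate.

```python
import math


def two_proportion_z(p1: float, p2: float, n: int) -> float:
    """z-score for the difference of two accuracies, each measured on n items."""
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    return (p2 - p1) / se


scores = [("MGSM-EN", 0.916, 0.940), ("MMLU", 0.720, 0.744)]
z_scores = {
    name: two_proportion_z(native, hf, n=250) for name, native, hf in scores
}
for name, z in z_scores.items():
    print(f"{name}: z = {z:.2f}")
```

Under these assumptions both z-scores stay below 1.96 in absolute value, so at n=250 the gaps are consistent with sampling noise rather than a real accuracy difference between the backends.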

yuan-luo · 2026-03-09T11:30:42Z

/tag-and-rerun-ci

yuan-luo · 2026-03-13T06:56:29Z

/tag-and-rerun-ci

python/sglang/srt/model_executor/model_runner.py

python/sglang/srt/model_loader/utils.py

test/registered/models/test_transformers_backend_eval.py

…lang into transformers-backend

yuan-luo · 2026-03-18T02:40:39Z

/rerun-failed-ci

yuan-luo · 2026-03-18T07:27:16Z

@JustinTong0323 Could you please take another look? LGTM now.

JustinTong0323

PR Review Summary

Reviewed with specialized agents focusing on: code quality, silent failures, test coverage, and code simplification. Overall this is a well-structured feature addition with a clean mixin-based architecture — but there are several issues worth addressing before merge.

Critical: 3 | High: 5 | Simplification suggestions: 5

See inline comments for details on each issue.

Test Coverage

The PR adds ~2074 lines of implementation but only 43 lines of tests (a single GSM8K eval). Key untested areas include:

_is_moe_model / _get_transformers_backend_arch routing logic (pure functions, trivially unit-testable)
AutoWeightsLoader weight dispatch
TransformersFusedMoE expert replacement
MultiModalMixin + TransformersAutoMultimodalProcessor
Pipeline parallel layer slicing
Embedding/classification model classes
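
As an illustration of how cheaply the routing logic could be covered, here is a table-driven test sketch. `pick_backend_arch` is a hypothetical stand-in with an assumed signature. The real `_get_transformers_backend_arch` may take a config object and handle embedding and classification models as well.

```python
def pick_backend_arch(is_moe: bool, is_multimodal: bool) -> str:
    """Hypothetical stand-in for the routing helper (causal LM case only)."""
    parts = ["Transformers"]
    if is_multimodal:
        parts.append("MultiModal")
    if is_moe:
        parts.append("MoE")
    parts.append("ForCausalLM")
    return "".join(parts)


# (is_moe, is_multimodal) -> expected class name from the PR description.
CASES = {
    (False, False): "TransformersForCausalLM",
    (True, False): "TransformersMoEForCausalLM",
    (False, True): "TransformersMultiModalForCausalLM",
    (True, True): "TransformersMultiModalMoEForCausalLM",
}

for (is_moe, is_multimodal), expected in CASES.items():
    assert pick_backend_arch(is_moe, is_multimodal) == expected
print("all routing cases pass")
```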

Simplification Opportunities

Consolidate duplicated helpers: _first_attr (in transformers_auto.py) and _getattr_first (in transformers.py) are identical. Same for _uses_mrope / _uses_mrope_positions.
Extract tensor-list flattening helper from _to_tensor_output — same logic appears twice in the same method.
Simplify replace_rms_norm_class Gemma detection — instantiating a dummy module to check zero weights is fragile; checking class name would be simpler.
Split TransformersBase.__init__ (~100+ lines) into focused setup methods for readability.

Strengths

Clean mixin-based architecture composing TP, PP, MoE, VLM capabilities
Elegant WeightsMapper composition via __or__ and __init_subclass__
Comprehensive manual testing across 25 configurations
Proper hidden_states.clone() in MoE custom op to prevent mutation

python/sglang/srt/models/utils.py

python/sglang/srt/models/transformers.py

python/sglang/srt/managers/multimodal_processor.py

python/sglang/srt/models/transformers.py

python/sglang/srt/model_loader/utils.py

yuan-luo · 2026-03-19T14:25:00Z

/rerun-failed-ci

adarshxs · 2026-03-20T04:16:43Z

/rerun-failed-ci

JustinTong0323 · 2026-03-22T01:57:10Z

/rerun-failed-ci

yuan-luo · 2026-03-30T14:19:41Z

please resolve the conflict.

adarshxs · 2026-03-31T07:13:28Z

/rerun-failed-ci

adarshxs · 2026-04-02T06:05:39Z

/rerun-failed-ci

resolve issues in the NPU test architecture (sgl-project#20751) * [diffusion][CI]: Add individual component accuracy CI for diffusion models (sgl-project#18709) Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> * [Feature] JIT rmsnorm update (with claude) (sgl-project#21834) * [Diffusion][NPU] add ring sp performance benchmark page in npu (sgl-project#21811) * fix(MiMo-V2-Flash): add mimo reasoning parser (sgl-project#21414) * [diffusion] hardware: support FA3 attention backend on MUSA (attn backend, 14/N) (sgl-project#18648) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Mick <mickjagger19@icloud.com> * fix: pre-init tokenizer_manager to avoid AttributeError in shutdown (sgl-project#21824) * [FlashInver v0.6.7] Integrate flashinfer_trtllm mxfp8 gemm (sgl-project#21576) * [Misc] Add network timeout to eval dataset downloads (sgl-project#21873) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [refactor] Clean up duplicate flashinfer trtllm moe code (sgl-project#21233) * [DSA] Support trtllm sparse mla kernel for prefill batches (sgl-project#21783) * [Disagg] GPU staging buffer with dynamic ring allocator for heterogeneous TP KV transfer (sgl-project#19890) * Add merge prohibition policy during CI maintenance mode (sgl-project#21882) * [Misc] Fix comparator e2e tests: add polars dep + fix dp-attention test (sgl-project#21804) Co-authored-by: Alison Shao <alison.shao@mac.lan> * revert: remove TTL-based hard pin from HiRadixCache (sgl-project#21884) * Unify GSM8K eval path to Chat API for regression CI readiness (sgl-project#21667) * [HiCache] fix: Clone host indices to avoid memory leak (sgl-project#21624) Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> * [HiCache & PD]Fixed detailed cache hit breakdown in PD scenarios. 
(sgl-project#21764) * [CI] Add Llama 3.1 8B Instruct FP4 CI test on SM120 (sgl-project#20648) * [CI] Add Per-Tensor, Blockwise FP8 Tests on SM120 (sgl-project#20717) Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> * Allow /rerun-test to checkout fork PR branch for trusted users (sgl-project#21890) * Direct model loading from object storage with Runai Model Streamer (sgl-project#17948) Signed-off-by: Noa Neria <noa@run.ai> * fix pcg torch dynamo recompile in mxfp8 Triton path (sgl-project#21888) Co-authored-by: Hanlin Bi <hanlinbi@umich.edu> * chore: bump mooncake version to 0.3.10.post1 (sgl-project#21844) * [VLM] Add VLM TP=4 per-commit CI test and improve MMMU eval prompt/parser (sgl-project#21841) * fix(ci): update est_time for 57 tests based on runtime analysis (sgl-project#21896) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [CI] Increase multimodal server test timeout from 60 to 90 minutes (sgl-project#21897) * [CI] Remove crashing Kimi K2.5 EAGLE3/MTP variants, keep TP8 and TP8+DP8 (sgl-project#21898) * [diffusion] CI: add initial nvfp4 ci test for b200 (sgl-project#21767) Co-authored-by: Mick <mickjagger19@icloud.com> * Migrate all callers from /get_server_info to /server_info (sgl-project#21463) * Support PP key for file backend (sgl-project#21901) * Enable multi-thread weight loading by default (sgl-project#20289) * Skip Go stdlib and NVIDIA tool CVEs in Trivy scan (sgl-project#21905) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Fuse temperature + softmax in sampling for decode speedup (sgl-project#20501) * Multi tool streaming fix (sgl-project#20004) * Return HTTP 400 for streaming validation errors (sgl-project#21900) * [Spec][Ngram] 4/N: Remove `max_match_window_size` and `min_match_window_size`, matching all suffixes of the Trie (sgl-project#21225) * Fix ngram doc for speculative_num_draft_tokens default (sgl-project#21910) * [NVIDIA] Enable fp8 flashinfer_trtllm_routed MoE for 
MiniMax-M2.5 (sgl-project#20394) * scheduler: add prefill-only update in merge batch (sgl-project#21840) * [DSA] Set trtllm kernels as nsa default for Blackwell (sgl-project#21914) * Revert "Rollback flashmla to older version [1/2]" (sgl-project#21922) * test: add manual init test for mooncake transfer engine (sgl-project#21842) Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com> * Fix spec v2 + logprob when max_num_token is set (sgl-project#20799) * Migrate ngram corpus from torch cpp_extension to TVM FFI jit_kernel (sgl-project#21920) Co-authored-by: DarkSharpness <2040703891@qq.com> * [NPU] Support GLM-4.7-Flash on NPU (sgl-project#21408) * [CI] Fix gpu deps import in cpu test (sgl-project#21950) * [Parallel State Refactor 1/n] Remove stream of PyNCCL (sgl-project#20866) * [diffusion] chore: fix stage profiler for multi-stage denoising (sgl-project#21955) * [CI] [Tracing] Add ci for tracing and fix bugs (sgl-project#21740) * Remove logging for subprocess watchdog start (sgl-project#21968) * [4/n] Support gpt oss 20b lora (sgl-project#21570) * [MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine) (sgl-project#17985) Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com> * [Feature] Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile (sgl-project#19163) * [CI] Remove stale Ascend suite entries from test/srt/run_suite.py (sgl-project#21978) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Skip broken AutoModel mapping entries when resolving Llava submodules (sgl-project#21892) * [CI] Add timeouts to Slack upload urlopen and WebClient (sgl-project#21903) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Diffusion][NPU] Add support for MOVA (sgl-project#21633) Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com> * Remove maxItems=1 restriction when tool_choice is specified (sgl-project#20208) * [Feature] NVFP4 Marlin fallback for non-Blackwell GPUs (SM75+) 
(sgl-project#19652) * [PP] qwen3 vl skip layer id for pp (sgl-project#19135) * [VLM] Enable per-image MM splitting by default and remove MULTI_IMAGES modality (sgl-project#21899) * [Bugfix] Fix incorrect dp-attention parallel info in bench_one_batch (sgl-project#21519) * Revert "[MUSA][9/N] Add FA3 attention backend support through MATE (MUSA AI Tensor Engine)" (sgl-project#22002) * [NPU] Optimized the wording in the npu docs (sgl-project#21998) * [Parallel State Refactor 2/n] Unify code path of AMD deterministic all reduce (sgl-project#20871) * [AMD] Resolve the performance degression when launch server with "--enable-aiter-allreduce-fusion" (sgl-project#21947) Co-authored-by: wunhuang <wunhuang@amd.com> * chore: bump sgl-kernel version to 0.4.1 (sgl-project#21447) Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> * [Workflow] Avoid triggering nightly tests in kernel bump workflow (sgl-project#22010) * [Workflow] Fix kernel release jobs skipped on push events (sgl-project#22011) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [PD]: Add support for HiSparse to directly transfer the cache from Prefill to Decode DRAM. 
(sgl-project#21591) Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> * [Misc] Update CI permission (sgl-project#22014) * [ROCM][RL] Shuffle Weight In-Place to Preserve Parameter Attributes (sgl-project#21825) * [CI] Fix duplicate job names that bypass branch protection (sgl-project#22001) * fix: remove duplicate words in comments (sgl-project#22007) * [PD] Tiny register info field cleanup for mooncake backend (sgl-project#22016) * [NPU] optimize glm4.7 (sgl-project#19246) * [AMD] Enable FP8 KV cache and FP8 attention kernel for NSA on MI300/MI355 with TileLang backend (sgl-project#21511) * [AMD] Add MiniMax-M2.5 nightly perf benchmarks for MI30x and MI35x (sgl-project#21524) --------- Signed-off-by: Vladislav Nosivskoy <vladnosiv@gmail.com> Signed-off-by: P V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com> Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Signed-off-by: Shangming Cai <csmthu@gmail.com> Signed-off-by: Chang Su <chang.s.su@oracle.com> Signed-off-by: Noa Neria <noa@run.ai> Co-authored-by: Bingxu Chen <bingxche@amd.com> Co-authored-by: Xiaoyu Zhang <35585791+BBuf@users.noreply.github.com> Co-authored-by: yang1002378395-cmyk <yang1002378395@gmail.com> Co-authored-by: Mick <mickjagger19@icloud.com> Co-authored-by: Bi Xue <bi@thinkingmachines.ai> Co-authored-by: huangtingwei <141888744+huangtingwei9988@users.noreply.github.com> Co-authored-by: Lianmin Zheng <lianminzheng@gmail.com> Co-authored-by: Baizhou Zhang <sobereddiezhang@gmail.com> Co-authored-by: Muqi Li <muqi1029@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Qiaolin Yu <liin1211@outlook.com> Co-authored-by: narutolhy <582909902@qq.com> Co-authored-by: Ethan (Yusheng) Su <yushengsu.thu@gmail.com> Co-authored-by: zhangxiaolei <zhangxiaolei.666@bytedance.com> Co-authored-by: Vladislav 
Nosivskoy <vladnosiv@gmail.com> Co-authored-by: Trevor Morris <tmorris@nvidia.com> Co-authored-by: Eitan Turok <150733043+eitanturok@users.noreply.github.com> Co-authored-by: Fengyuan Yu <Yuandao151112@163.com> Co-authored-by: Fengyuan Yu <15fengyuan@gmail.com> Co-authored-by: Adarsh Shirawalmath <114558126+adarshxs@users.noreply.github.com> Co-authored-by: Yuhao Yang <47235274+yhyang201@users.noreply.github.com> Co-authored-by: Liangsheng Yin <hnyls2002@gmail.com> Co-authored-by: Jianying <53503712+jianyingzhu@users.noreply.github.com> Co-authored-by: jianyingzhu <joeyzhu@nvidia.com> Co-authored-by: Kangyan-Zhou <zky314343421@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jacob0226 <jacchang@amd.com> Co-authored-by: Aditya Sharma <89210949+adityavaid@users.noreply.github.com> Co-authored-by: Yuan Luo <yuan.luo@hotmail.com> Co-authored-by: Xinyuan Tong <115166877+JustinTong0323@users.noreply.github.com> Co-authored-by: Артем Савкин <58187114+OrangeRedeng@users.noreply.github.com> Co-authored-by: Tamir Baydasov <41994229+TamirBaydasov@users.noreply.github.com> Co-authored-by: Shu Wang <shuw@nvidia.com> Co-authored-by: eigen <52445717+yyihuang@users.noreply.github.com> Co-authored-by: Avery Huang <averyh@nvidia.com> Co-authored-by: jacky.cheng <yichiche@amd.com> Co-authored-by: HaiShaw <hixiao@gmail.com> Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com> Co-authored-by: Shangming Cai <csmthu@gmail.com> Co-authored-by: Junrong Lin <33685709+ocss884@users.noreply.github.com> Co-authored-by: Simon (Jiyou) Li <Simon-Li@users.noreply.github.com> Co-authored-by: 继优 <jiyou.ljy@alibaba-inc.com> Co-authored-by: shuwenn <47200617+alphabetc1@users.noreply.github.com> Co-authored-by: psaab <ps@meta.com> Co-authored-by: hnyls2002 <lsyincs@gmail.com> Co-authored-by: Hanlin Bi <52993433+wolfcomos@users.noreply.github.com> Co-authored-by: wili <98001977+wili-65535@users.noreply.github.com> Co-authored-by: saatwiknagpal 
<saatwiknagpal@gmail.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: wan4ch <wan4ch@gmail.com> Co-authored-by: Feng Su <sufeng@linux.alibaba.com> Co-authored-by: Ying Sheng <sqy1415@gmail.com> Co-authored-by: Polisetty V R K Jyothendra Varma <polisetty.v.r.k.jyothendra.varma@intel.com> Co-authored-by: Ziang Li <ziangli@umich.edu> Co-authored-by: Aishwarya Ramasethu <56765596+aramasethu@users.noreply.github.com> Co-authored-by: Ma Mingfei <mingfei.ma@intel.com> Co-authored-by: blzheng <beilei.zheng@intel.com> Co-authored-by: kk <43161300+kkHuang-amd@users.noreply.github.com> Co-authored-by: wunhuang <wunhuang@amd.com> Co-authored-by: Michelle Wu <michellewu351@gmail.com> Co-authored-by: wuxue (C) <w00964934@china.huawei.com> Co-authored-by: Hubert Lu <55214931+hubertlu-tw@users.noreply.github.com> Co-authored-by: strgrb <zhangkaihong.zkh@antgroup.com> Co-authored-by: LiYomi <106872109+LiYomi@users.noreply.github.com> Co-authored-by: mengxiancheng03 <mengxiancheng03@kuaishou.com> Co-authored-by: GXIN <37653830+gxxx-hum@users.noreply.github.com> Co-authored-by: 高鑫 <gaoxin@gaoxindeMacBook-Pro.local> Co-authored-by: heziiop <q_m_p@qq.com> Co-authored-by: xieminghe1 <141820649+xieminghe1@users.noreply.github.com> Co-authored-by: undefined <zhouchen.arrebol@jd.com> Co-authored-by: Makcum888e <79456407+Makcum888e@users.noreply.github.com> Co-authored-by: yuefeng Wu <33725817+ChefWu551@users.noreply.github.com> Co-authored-by: Yuxuan Zhang <2448370773@qq.com> Co-authored-by: Vedant V Jhaveri <vedantjh2@gmail.com> Co-authored-by: ronnie_zheng <zl19940307@163.com> Co-authored-by: Zhai Feiyue <80079571+ZhaiFeiyue@users.noreply.github.com> Co-authored-by: jhchouuu <jiahzhou@amd.com> Co-authored-by: Alison Shao <54658187+alisonshao@users.noreply.github.com> Co-authored-by: Alison Shao <alison.shao@MacBook-Pro-D2W773R9CD.local> Co-authored-by: DarkSharpness <76582120+DarkSharpness@users.noreply.github.com> Co-authored-by: 
Alison Shao <alison.shao@Mac.attlocal.net> Co-authored-by: Lewis <63569348+TTThanos@users.noreply.github.com> Co-authored-by: 百麒 <yaozhong.lyz@alibaba-inc.com> Co-authored-by: Jincong Chen <jincong.cjc@ant-intl.com> Co-authored-by: xiazhahe <86939755+xiazhahe@users.noreply.github.com> Co-authored-by: Thomas Wang <thomawan@amd.com> Co-authored-by: Ke Bao <ispobaoke@gmail.com> Co-authored-by: xiaoqi <xq25478@qq.com> Co-authored-by: xiaoqi.31 <xiaoqi.31@jd.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com> Co-authored-by: weireweire <weiliangl@nvidia.com> Co-authored-by: Weiliangl User <weiliangl@login-node.hosted.internal> Co-authored-by: JD <jaedon.guo@gmail.com> Co-authored-by: Zhangheng <hzh0425@apache.org> Co-authored-by: Michael <13900043+michaelzhang-ai@users.noreply.github.com> Co-authored-by: Yilong Zhao <74357408+happierpig@users.noreply.github.com> Co-authored-by: Johnsonms <lizhaofu@gmail.com> Co-authored-by: Brayden Zhong <b8zhong@uwaterloo.ca> Co-authored-by: Chang Su <chang.s.su@oracle.com> Co-authored-by: KnightLTC <56717110+KnightLTC@users.noreply.github.com> Co-authored-by: Douglas Yang <dyang@college.harvard.edu> Co-authored-by: Karan Bansal <karanb192@users.noreply.github.com> Co-authored-by: karanb192 <karan@example.com> Co-authored-by: R0CKSTAR <yeahdongcn@gmail.com> Co-authored-by: sglang-bot <sglangbot@gmail.com> Co-authored-by: sglang-bot <sglang-bot@users.noreply.github.com> Co-authored-by: Xinyuan Tong <xinyuantong.cs@gmail.com> Co-authored-by: sbeurnier <sbeurnier@together.ai> Co-authored-by: YC Yen-Ching Tseng <yctseng@amd.com> Co-authored-by: Wenyao Gao <105094497+edwingao28@users.noreply.github.com> Co-authored-by: Alex Nails <alex.nails@radixark.ai> Co-authored-by: khalilzhk <khalilzhk@gmail.com> Co-authored-by: Zhiqiang Xie <xiezhq@stanford.edu> Co-authored-by: yudian0504 <138860534+yudian0504@users.noreply.github.com> Co-authored-by: yunkchen <chenyunkuo.cyk@alibaba-inc.com> Co-authored-by: wduan-hai <wduan@humansand.ai> 
Co-authored-by: amote-i <49533125+amote-i@users.noreply.github.com> Co-authored-by: Cherry_ming <136634645@qq.com> Co-authored-by: Ratish P <114130421+Ratish1@users.noreply.github.com> Co-authored-by: YAMY <74099316+YAMY1234@users.noreply.github.com> Co-authored-by: Alison Shao <alison.shao@mac.lan> Co-authored-by: ishandhanani <82981111+ishandhanani@users.noreply.github.com> Co-authored-by: Derek Yu <81697272+DerekY2@users.noreply.github.com> Co-authored-by: Noa Neria <noa@run.ai> Co-authored-by: Hanlin Bi <hanlinbi@umich.edu> Co-authored-by: Prozac614 <dwt614707404@163.com> Co-authored-by: David Cheung <d7cheung@gmail.com> Co-authored-by: Mook <68294499+Godmook@users.noreply.github.com> Co-authored-by: Khoa Pham <khoa.pham@radixark.ai> Co-authored-by: foraxe <73625538+foraxe@users.noreply.github.com> Co-authored-by: yunzhi <ningyunxiao.nyx@antgroup.com> Co-authored-by: DarkSharpness <2040703891@qq.com> Co-authored-by: Todobe <43903496+Todobe@users.noreply.github.com> Co-authored-by: ori <39351881+froststeam@users.noreply.github.com> Co-authored-by: Thomas <zs033@qq.com> Co-authored-by: zhangshuai (S) <z00836796@china.huawei.com> Co-authored-by: lviy <142899752+lviy@users.noreply.github.com> Co-authored-by: Tingwei Huang <huangtingwei9988@gmail.com> Co-authored-by: Yuzhen Zhou <82826991+zyzshishui@users.noreply.github.com> Co-authored-by: Ricardo-M-L <69202550+Ricardo-M-L@users.noreply.github.com> Co-authored-by: Kelon <kelonlu@163.com> Co-authored-by: cen121212 <luochen23@huawei.com>

Add transformers modeling backend with TP, PP, MoE, multimodal, and torch.compile support

c5472cd

adarshxs requested review from Fridge003, JustinTong0323, Ying1123, hnyls2002, ispobock, merrymercy, mickqian, xiezhq-hermann and yhyang201 as code owners February 22, 2026 20:01

github-actions bot added the Multi-modal (multi-modal language model) label Feb 22, 2026

adarshxs changed the title ~~Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile~~ [Feature] Stronger transformers modeling backend with TP, PP, MoE, VLMs, and torch compile Feb 22, 2026

gemini-code-assist bot reviewed Feb 22, 2026

Merge branch 'main' into transformers-backend

031580e

JustinTong0323 self-assigned this Feb 25, 2026

JustinTong0323 reviewed Feb 25, 2026

python/sglang/srt/model_executor/model_runner.py (outdated)

JustinTong0323 reviewed Feb 25, 2026

python/sglang/srt/model_executor/model_runner.py (outdated)

fix

7e1d1d5

JustinTong0323 reviewed Mar 1, 2026

python/sglang/srt/multimodal/processors/transformers_auto.py (outdated)

adarshxs added 2 commits March 3, 2026 14:07

fix offset calculation

65a2b96

fix conflict

25d2906

adarshxs assigned yuan-luo Mar 3, 2026

github-actions bot added the run-ci label Mar 9, 2026

yuan-luo reviewed Mar 13, 2026

python/sglang/srt/model_executor/model_runner.py (outdated)

yuan-luo reviewed Mar 13, 2026

python/sglang/srt/model_loader/utils.py (outdated)

Merge branch 'main' into transformers-backend

bbb852e

yuan-luo reviewed Mar 17, 2026

test/registered/models/test_transformers_backend_eval.py (outdated)

adarshxs and others added 3 commits March 17, 2026 09:37

add compile server args to ci test

a3527ac

Merge branch 'transformers-backend' of https://github.com/adarshxs/sglang into transformers-backend

432fde2

Merge branch 'main' into transformers-backend

629eb1c

JustinTong0323 reviewed Mar 18, 2026

python/sglang/srt/models/transformers.py

JustinTong0323 reviewed Mar 18, 2026

python/sglang/srt/models/transformers.py

JustinTong0323 reviewed Mar 18, 2026

python/sglang/srt/model_loader/utils.py

address comments

819366a

adarshxs requested review from ByronHsu and ShangmingCai as code owners March 19, 2026 10:08

fix conflict

eb9367d

Merge branch 'main' into transformers-backend

96cc693

adarshxs added 3 commits March 30, 2026 20:10

Merge origin/main into transformers-backend

d492862

Merge sgl-project/main into transformers-backend

b9fea2f

Merge branch 'main' into transformers-backend

66d87c5

Merge branch 'main' into transformers-backend

20c5094

Fridge003 added 2 commits April 2, 2026 00:31

Merge branch 'main' into transformers-backend

ba2c38d

Merge branch 'main' into transformers-backend

eabd3ea

Fridge003 merged commit 34ddf13 into sgl-project:main Apr 2, 2026
158 of 169 checks passed

	for prefix, group in itertools.groupby(weights_by_parts, key=lambda x: x[0][0]):
	for prefix, group in itertools.groupby(sorted(weights_by_parts, key=lambda x: x[0][0]), key=lambda x: x[0][0]):
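The two lines above read as a before/after review suggestion: `itertools.groupby` only merges consecutive items with equal keys, so grouping weights by their first name component is only complete when the input is sorted by that same key. The sketch below illustrates the difference. The contents of `weights_by_parts` and the helper names are hypothetical stand-ins, not taken from the PR.

```python
import itertools

# Hypothetical stand-in for weights_by_parts: (name_parts, tensor) pairs,
# deliberately ordered so that the "model" prefix is not contiguous.
weights_by_parts = [
    (("model", "embed_tokens"), "w0"),
    (("lm_head", "weight"), "w1"),
    (("model", "layers"), "w2"),
]


def first_part(item):
    """Return the first component of the weight name (the grouping key)."""
    return item[0][0]


def group_by_prefix(items, presort):
    """Group items by prefix, optionally sorting by the same key first."""
    if presort:
        items = sorted(items, key=first_part)
    return [
        (prefix, [value for _, value in group])
        for prefix, group in itertools.groupby(items, key=first_part)
    ]


# Without sorting, "model" shows up as two separate groups.
unsorted_groups = group_by_prefix(weights_by_parts, presort=False)
# With sorting, each prefix appears exactly once.
sorted_groups = group_by_prefix(weights_by_parts, presort=True)

print(unsorted_groups)
print(sorted_groups)
```

Because Python's `sorted` is stable, entries that share a prefix keep their original relative order after the sort.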

Conversation

adarshxs commented Feb 22, 2026

Motivation

Modifications

Testing

Checklist

Review Process

gemini-code-assist bot commented Feb 22, 2026

Summary of Changes

Highlights

Footnotes

gemini-code-assist bot left a comment

Code Review

gemini-code-assist bot Feb 22, 2026

gemini-code-assist bot Feb 22, 2026

gemini-code-assist bot Feb 22, 2026

gemini-code-assist bot Feb 22, 2026

yuan-luo commented Mar 8, 2026 (edited)

adarshxs commented Mar 8, 2026

Usage Instructions

Basic Usage (Transformers Backend)

Supported --model-impl Values

Accuracy Eval

yuan-luo commented Mar 9, 2026

yuan-luo commented Mar 13, 2026

yuan-luo commented Mar 18, 2026

yuan-luo commented Mar 18, 2026

JustinTong0323 left a comment

PR Review Summary

Test Coverage

Simplification Opportunities

Strengths

yuan-luo commented Mar 19, 2026

adarshxs commented Mar 20, 2026

JustinTong0323 commented Mar 22, 2026

yuan-luo commented Mar 30, 2026

adarshxs commented Mar 31, 2026

adarshxs commented Apr 2, 2026

yuan-luo commented Mar 8, 2026 (edited)

Supported `--model-impl` Values