
[MODEL] Adding Support for Qwen3.5 Models#18489

Merged
mickqian merged 48 commits into sgl-project:main from zju-stu-lizheng:dev/qwen3.5
Feb 9, 2026

Conversation


zju-stu-lizheng (Contributor) commented Feb 9, 2026

Purpose

This PR adds model support for the upcoming Qwen3.5 models, including both dense and MoE variants.

Special thanks to @cao1zhg, @yizhang2077, and @attack204 for their reviews, and to @hnyls2002 and @mickqian from the SGLang team for their valuable support.

Reference HF implementation - huggingface/transformers#43830

Main Changes

  1. Model Support: Added support for two new model classes:
     • Qwen3_5MoeForConditionalGeneration
     • Qwen3_5ForConditionalGeneration
  2. Attention Layer: Introduced a new linear attention layer, Qwen3_5GatedDeltaNet, to optimize the fix_qkvz reordering operation.

Launch Server

Hardware setup: 8 × H200 GPUs

Without MTP

TP_SIZE=8

python -m sglang.launch_server \
    --model $vlmoe \
    --tp-size ${TP_SIZE} \
    --enable-multimodal \
    --max-mamba-cache-size 128 \
    --max-running-requests 128 \
    --chunked-prefill-size 2048 \
    --mamba-ssm-dtype bfloat16

With MTP

Add the following arguments to the above command:

    --speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
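
For convenience, the two snippets above combine into a single MTP-enabled launch command (same assumptions as before: $vlmoe points at the local model path, 8 × H200):

```shell
TP_SIZE=8

python -m sglang.launch_server \
    --model $vlmoe \
    --tp-size ${TP_SIZE} \
    --enable-multimodal \
    --max-mamba-cache-size 128 \
    --max-running-requests 128 \
    --chunked-prefill-size 2048 \
    --mamba-ssm-dtype bfloat16 \
    --speculative-algo NEXTN \
    --speculative-num-steps 3 \
    --speculative-eagle-topk 1 \
    --speculative-num-draft-tokens 4
```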

Check our cookbook for detailed deployment instructions: https://cookbook.sglang.io/autoregressive/Qwen/Qwen3.5

瑀澈 added 30 commits January 22, 2026 00:15
use rope_parameters rather than rope_scaling
1. multimodal inputs for mtp
2. dense model tie embedding for mtp
1. rename Qwen3_vl_next to Qwen3_5 in config
2. merge qwen3_5.py and qwen3_5_moe.py
2. rename Qwen3_5LLMModel to Qwen3_5ForCausalLM
gemini-code-assist (bot) commented:

Summary of Changes

Hello @zju-stu-lizheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces support for the Qwen3.5 series of models, encompassing both dense and MoE variants. It includes a new attention layer, configuration updates, and code modifications to ensure seamless integration and optimized performance within the existing framework. The changes enhance the system's ability to handle advanced language models and improve overall processing efficiency.

Highlights

  • Model Support: Adds support for Qwen3.5 models, including both dense and MoE variants, by introducing new model classes: Qwen3_5MoeForConditionalGeneration and Qwen3_5ForConditionalGeneration.
  • Attention Layer: Implements a new linear attention layer Qwen3_5GatedDeltaNet to optimize the fix_qkvz reordering operation, enhancing attention mechanisms.
  • Configuration Files: Introduces new configuration files (qwen3_5.py) and modifies existing ones (__init__.py, model_config.py) to integrate Qwen3.5 models into the system.
  • Code Modifications: Updates several files to ensure compatibility and proper handling of the new Qwen3.5 model architectures, including changes to logits processing, rotary embeddings, and multimodal utilities.


Changelog
  • benchmark/kernels/fused_moe_triton/common_utils.py
    • Added 'Qwen3_5MoeForConditionalGeneration' to the list of supported MoE models.
  • python/sglang/srt/configs/__init__.py
    • Imported and exposed Qwen3_5Config and Qwen3_5MoeConfig.
  • python/sglang/srt/configs/model_config.py
    • Modified draft model configuration to include Qwen3_5 architectures.
    • Added Qwen3_5 architectures to the list of generation models.
  • python/sglang/srt/configs/qwen3_5.py
    • Added new configuration files for Qwen3_5 and Qwen3_5Moe models, defining their architecture and parameters.
  • python/sglang/srt/layers/logits_processor.py
    • Added mm_input_embeds to LogitsProcessorOutput and LogitsMetadata to handle multimodal inputs.
  • python/sglang/srt/layers/rotary_embedding.py
    • Modified get_rope_index to support Qwen3_5 models when using video grid THW.
  • python/sglang/srt/managers/mm_utils.py
    • Modified general_mm_embed_routine to handle mm_input_embeds in forward_batch.
  • python/sglang/srt/model_executor/forward_batch_info.py
    • Added mm_input_embeds to ForwardBatch to support multimodal inputs.
  • python/sglang/srt/model_executor/model_runner.py
    • Imported Qwen3_5Config and Qwen3_5MoeConfig.
    • Modified hybrid_gdn_config to include Qwen3_5 configurations.
    • Modified model_is_mrope to check for rope_parameters or rope_scaling.
  • python/sglang/srt/models/qwen3_5.py
    • Added new model classes Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM, including attention and decoder layers.
  • python/sglang/srt/models/qwen3_5_mtp.py
    • Added Qwen3_5ForCausalLMMTP to support Multi Token Prediction.
  • python/sglang/srt/models/qwen3_next.py
    • Modified rope_scaling to check for rope_parameters or rope_scaling.
  • python/sglang/srt/multimodal/processors/qwen_vl.py
    • Added support for Qwen3_5 models in QwenVLImageProcessor.
    • Modified process_mm_data_async to handle video metadata for Qwen3_5 models.
  • python/sglang/srt/server_args.py
    • Modified _handle_model_specific_adjustments to include Qwen3_5 models.
  • python/sglang/srt/speculative/eagle_worker.py
    • Modified forward_draft_extend to include mm_input_embeds.
  • python/sglang/srt/utils/common.py
    • Modified load_video to support video file as list, tuple, torch.Tensor, or np.ndarray.
  • python/sglang/srt/utils/hf_transformers_utils.py
    • Imported Qwen3_5Config and Qwen3_5MoeConfig.

gemini-code-assist bot left a comment:

Code Review

This pull request adds support for the new Qwen3.5 models, including both dense and MoE variants. The changes are extensive, introducing new model and configuration files, and updating various parts of the system to accommodate these new models. The implementation appears to be well-integrated with the existing codebase. My review focuses on a few opportunities for code simplification and correctness improvements.

Comment on lines +429 to +432
if hasattr(config, "rope_parameters"):
    self.rope_scaling = getattr(config, "rope_parameters", None)
else:
    self.rope_scaling = getattr(config, "rope_scaling", None)

medium

This logic for setting self.rope_scaling can be simplified to a single line. Using getattr with a default value is more concise and readable.

        self.rope_scaling = getattr(config, "rope_parameters", getattr(config, "rope_scaling", None))
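
The fallback behavior of that one-liner can be checked in isolation with stand-in config objects (a standalone sketch; SimpleNamespace stands in for the HF config class, and the attribute values are invented for illustration):

```python
from types import SimpleNamespace

# Stand-in configs: one with the new-style attribute, one with the
# old-style attribute, and one with neither.
new_cfg = SimpleNamespace(rope_parameters={"rope_type": "yarn", "factor": 4.0})
old_cfg = SimpleNamespace(rope_scaling={"rope_type": "linear", "factor": 2.0})
bare_cfg = SimpleNamespace()

def resolve_rope_scaling(config):
    # Prefer rope_parameters; fall back to rope_scaling; else None.
    return getattr(config, "rope_parameters", getattr(config, "rope_scaling", None))

print(resolve_rope_scaling(new_cfg))   # {'rope_type': 'yarn', 'factor': 4.0}
print(resolve_rope_scaling(old_cfg))   # {'rope_type': 'linear', 'factor': 2.0}
print(resolve_rope_scaling(bare_cfg))  # None
```

Note the nested getattr is evaluated eagerly, but since both branches are plain attribute lookups with defaults, the result matches the original if/else.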

        forward_batch=forward_batch,
    )
else:
    raise ("not implementation for other mtp layers[self.num_mtp_layers > 1]")

medium

Raising a string literal does not raise a proper exception. It's better to raise a specific exception type like NotImplementedError for unimplemented features.

Suggested change
raise ("not implementation for other mtp layers[self.num_mtp_layers > 1]")
raise NotImplementedError("not implementation for other mtp layers[self.num_mtp_layers > 1]")
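
The distinction matters at runtime: in Python 3, raising anything that is not a BaseException instance is itself a TypeError, so the intended error message never reaches the caller. A minimal standalone demonstration (not SGLang code):

```python
# Attempting to raise a plain string: Python 3 rejects it with a TypeError
# ("exceptions must derive from BaseException"), masking the real message.
try:
    raise ("not implemented")  # the parentheses just group a string literal
except TypeError as exc:
    print(type(exc).__name__)  # TypeError

# The correct pattern: raise a dedicated exception type with the message.
try:
    raise NotImplementedError("MTP with num_mtp_layers > 1 is not implemented")
except NotImplementedError as exc:
    print(exc)  # MTP with num_mtp_layers > 1 is not implemented
```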

Comment on lines +620 to +623
if "rope_parameters" in config:
    self.rope_scaling = getattr(config, "rope_parameters", None)
else:
    self.rope_scaling = getattr(config, "rope_scaling", None)

medium

This logic for setting self.rope_scaling can be simplified. Using getattr with a default value is more concise. Also, using hasattr is generally preferred over the in operator for checking attributes on an object.

        self.rope_scaling = getattr(config, "rope_parameters", getattr(config, "rope_scaling", None))

mickqian (Collaborator) commented Feb 9, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Feb 9, 2026
@ispobock ispobock mentioned this pull request Feb 9, 2026
2 tasks
mickqian (Collaborator) commented Feb 9, 2026

/rerun-failed-ci

2 similar comments

@mickqian mickqian merged commit 27c4476 into sgl-project:main Feb 9, 2026
260 of 287 checks passed
mickqian (Collaborator) commented Feb 9, 2026

/rerun-failed-ci

8 similar comments

Johnsonms pushed a commit to Johnsonms/sglang that referenced this pull request Feb 14, 2026
Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>
Huixxi (Contributor) commented Feb 25, 2026

May I ask why Qwen3.5's MoE sparse block reuses Qwen2's rather than Qwen3's? @zju-stu-lizheng

cao1zhg (Contributor) commented Feb 26, 2026

> May I ask why Qwen3.5's MoE sparse block reuses Qwen2's rather than Qwen3's?

Because of the shared_expert.

Cppowboy commented Mar 3, 2026

Which tool call parser should be used for the Qwen3.5 series models?

Huixxi (Contributor) commented Mar 4, 2026

> Which tool call parser should be used for the Qwen3.5 series models?

It seems to be qwen3_coder, according to https://www.modelscope.cn/models/Qwen/Qwen3.5-35B-A3B

riZZZhik commented Mar 4, 2026

Hello,
Why do half of the docs suggest setting --speculative-algorithm to EAGLE (e.g., the cookbook), while the other half suggest NEXTN (e.g., this and other PR descriptions)?

Which one is correct?

riZZZhik commented Mar 4, 2026

And why is expert parallelism not suggested for this model? Is it not supported yet, or does it perform worse?

magicYang1573 pushed a commit to magicYang1573/sglang that referenced this pull request Mar 9, 2026
Co-authored-by: 瑀澈 <yuche.lz@alibaba-inc.com>