
[Feature] Add Qwen3.5 Model Support for DFlash#19952

Open
EanWang211123 wants to merge 58 commits into sgl-project:main from EanWang211123:dflash-qwen3_5

Conversation


@EanWang211123 EanWang211123 commented Mar 5, 2026

Motivation

Add DFlash speculative decoding support for Qwen3.5 models, following the implementation approach in #16818.

Modifications

Similar to #18387:

  • In dflash_worker.py, saved the global server_args before creating the draft worker and restored it afterward
  • Added a set_dflash_layers_to_capture interface for better layer management
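
A minimal sketch of the save/restore pattern described above (names such as `build_draft_worker`, `global_state`, and `make_worker` are illustrative stand-ins, not the actual SGLang symbols):

```python
from types import SimpleNamespace

def build_draft_worker(global_state, draft_server_args, make_worker):
    # Save the current global server_args, swap in the draft config while
    # the draft worker initializes, then restore the original unconditionally.
    saved = global_state.server_args
    global_state.server_args = draft_server_args
    try:
        return make_worker()
    finally:
        global_state.server_args = saved

# Demo with stand-in objects: the worker sees the draft args during
# construction, and the global state is restored afterwards.
state = SimpleNamespace(server_args="target-args")
worker = build_draft_worker(state, "draft-args", lambda: state.server_args)
```

The try/finally guarantees the target configuration is restored even if draft-worker construction raises partway through.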

Tests

Successfully trained and tested with the Qwen3.5-27B model using DFlash in SpecForge:

  • Dataset: eaglechat
  • Training: 6 epochs on 8k data points

Test Environment:

Test Command:

SGLANG_DISABLE_CUDNN_CHECK=1 \
CUDA_VISIBLE_DEVICES=4,5,6,7 \
python benchmark_sglang.py \
--tp-size 4 \
--target-model /models/Qwen/Qwen3___5-27B/ \
--draft-model /draft-model-path/ \
--concurrencies 1,8,32 \
--dataset-name humaneval \
--attention-backends fa3 \
--max-running-requests 32

Performance Results

Baseline Output (tok/s)

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| tok/s | 103.61 | 361.46 | 1,860.84 |

DFLASH Output (tok/s)

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| tok/s | 207.19 | 684.67 | 2,115.39 |

Speedup (DFLASH / Baseline)

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| speedup | 2.000x | 1.894x | 1.137x |
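
The speedup row is simply the element-wise ratio of the two throughput tables above; a quick sanity check:

```python
# Ratio of DFlash to baseline throughput at each concurrency level,
# using the numbers reported in the tables above.
baseline = {1: 103.61, 4: 361.46, 32: 1860.84}
dflash = {1: 207.19, 4: 684.67, 32: 2115.39}
speedup = {c: round(dflash[c] / baseline[c], 3) for c in baseline}
print(speedup)  # {1: 2.0, 4: 1.894, 32: 1.137}
```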

DFLASH Acceptance Length

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| acceptance length | 3.233 | 3.216 | 3.219 |

Summary

The implementation shows significant performance improvements with DFlash enabled, achieving up to 2x speedup for single concurrent requests and maintaining good acceleration across different concurrency levels.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@EanWang211123 EanWang211123 changed the title [Feature]Add Qwen3.5 Model Support for DFlash [Feature] Add Qwen3.5 Model Support for DFlash Mar 5, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates DFLASH speculative decoding into the SGLang framework, primarily targeting Qwen3.5 models to enhance inference performance. It introduces a dedicated DFLASH draft model, optimizes KV cache operations with a fused Triton kernel, and updates the core speculative decoding logic to manage DFLASH-specific requirements and ensure compatibility with existing features. The changes aim to provide substantial speedups for model generation.

Highlights

  • DFLASH Speculative Decoding Support: Introduced comprehensive support for DFLASH speculative decoding, enabling significant performance improvements for model inference.
  • Qwen3.5 Model Integration: Added specific model definitions and configurations to support Qwen3.5 models with DFLASH, including handling auxiliary hidden states and KV cache scaling.
  • Optimized KV Cache Materialization: Implemented a fused Triton kernel for efficient KV cache materialization in the DFLASH draft model, combining RMSNorm and RoPE operations.
  • Enhanced Speculative Decoding Framework: Integrated DFLASH into the SGLang speculative decoding framework, including new worker logic, updated CUDA graph capture, and compatibility checks for sampling parameters.
  • Performance Benchmarking: Provided benchmark results demonstrating up to 2x speedup for single concurrent requests and good acceleration across various concurrency levels using DFLASH.


Changelog
  • benchmark/dflash/bench_dflash_gsm8k_sweep.py
    • Added a new benchmark script to evaluate DFLASH performance against a baseline on the GSM8K dataset across various concurrency and tensor parallelism configurations.
  • python/sglang/srt/layers/attention/flashinfer_backend.py
    • Modified custom mask handling in init_forward_metadata_capture_cuda_graph to prevent FlashInfer from misinterpreting zero buffers as real masks for DFLASH draft models.
    • Adjusted causal attention logic in forward_extend to correctly handle ENCODER_ONLY attention types used by DFLASH.
  • python/sglang/srt/managers/scheduler.py
    • Added checks to handle_generate_request to abort DFLASH requests that attempt to use return_logprob or grammar-constrained decoding, as these features are not yet supported.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Updated CUDA graph capture logic to include DFLASH, allowing TARGET_VERIFY mode for draft workers and handling input embeddings.
    • Integrated set_dflash_layers_to_capture for models supporting DFLASH auxiliary hidden state capture.
  • python/sglang/srt/model_executor/model_runner.py
    • Introduced DFLASH-specific initialization, including parsing draft model configuration and resolving target layers for auxiliary hidden state capture.
    • Updated _should_run_flashinfer_autotune and _dummy_run to account for DFLASH speculative algorithm.
  • python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
    • Modified profile_max_num_token to scale KV cell size per token for DFLASH, accounting for both target and draft model layers.
  • python/sglang/srt/models/dflash.py
    • Added a new DFLASH draft model implementation, including DFlashAttention, DFlashMLP, and DFlashDecoderLayer modules.
    • Implemented kv_proj_only and project_target_hidden methods for efficient KV materialization and hidden state projection.
  • python/sglang/srt/models/gpt_oss.py
    • Added get_input_embeddings method and set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/llama.py
    • Added set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3.py
    • Added set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3_5.py
    • Removed is_last_layer parameter from Qwen3DecoderLayer and Qwen3VLChatDecoderLayer initializers.
    • Added layers_to_capture attribute to Qwen3Model for DFLASH/EAGLE3 hidden-state capture.
    • Modified forward method of Qwen3Model to return auxiliary hidden states when captured.
    • Added set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3_moe.py
    • Added set_dflash_layers_to_capture to Qwen3MoeDecoder and Qwen3MoeForCausalLM for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3_next.py
    • Added set_dflash_layers_to_capture to Qwen3NextDecoder and Qwen3NextForCausalLM for DFLASH auxiliary hidden state capture.
    • Added get_input_embeddings method to Qwen3NextForCausalLM.
  • python/sglang/srt/models/qwen3_vl.py
    • Adjusted forward method of Qwen3VLChatForCausalLM to correctly handle and pass aux_hidden_states to the logits processor.
  • python/sglang/srt/server_args.py
    • Added speculative_dflash_block_size argument to ServerArgs.
    • Implemented DFLASH-specific validation and defaulting logic for speculative decoding parameters, including speculative_num_steps, speculative_eagle_topk, and speculative_num_draft_tokens.
    • Disabled overlap scheduling and mixed chunked prefill for DFLASH speculative decoding.
  • python/sglang/srt/speculative/dflash_info.py
    • Added DFlashDraftInput and DFlashVerifyInput dataclasses to manage DFLASH speculative decoding state.
    • Implemented _compute_paged_keep_slots for managing KV cache allocation in paged mode.
    • Provided prepare_for_verify and verify methods for DFLASH verification logic, including handling greedy and non-greedy sampling.
  • python/sglang/srt/speculative/dflash_utils.py
    • Added utility functions for DFLASH, including scale_kv_cell_size_per_token_for_dflash for KV cache sizing, resolve_dflash_verify_mask_policy for attention mask handling, and compute_dflash_accept_len_and_bonus for verification.
    • Introduced parse_dflash_draft_config to extract DFLASH-specific configurations from HF model configs.
    • Implemented build_target_layer_ids for selecting target layers for hidden state capture.
  • python/sglang/srt/speculative/dflash_worker.py
    • Added a new DFlashWorker class to manage the DFLASH speculative decoding process, including initializing a separate draft model runner.
    • Implemented _prepare_for_speculative_decoding to handle draft generation and verification steps.
    • Provided _greedy_sample_from_vocab_parallel_head for TP-safe greedy sampling and _append_target_hidden_to_draft_kv for materializing hidden states into the draft KV cache.
    • Integrated fused KV materialization using a Triton kernel for performance optimization.
  • python/sglang/srt/speculative/spec_info.py
    • Added DFLASH to the SpeculativeAlgorithm enum.
    • Updated is_draft_input and is_verify_input methods to include DFLASH types.
    • Configured create_worker to return DFlashWorker for DFLASH algorithm and enforced no overlap scheduling.
  • python/sglang/srt/speculative/triton_ops/init.py
    • Added __init__.py to export FusedKVMaterializeHelper.
  • python/sglang/srt/speculative/triton_ops/fused_kv_materialize.py
    • Added a new Triton kernel (_fused_norm_rope_kernel) and a FusedKVMaterializeHelper class for fused RMSNorm + RoPE operations during DFLASH KV materialization.
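
For intuition, the math the fused kernel implements (per the changelog entry above: RMSNorm on the draft key head followed by rotary embedding) can be sketched as a NumPy reference. This is an illustrative sketch only, not the Triton code; the head size, eps, and RoPE base are assumptions, and the real kernel fuses both steps into a single pass during KV materialization:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: scale x by the reciprocal root-mean-square of its last dim
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def rope(x, positions, base=10000.0):
    # Rotary embedding: rotate consecutive (even, odd) feature pairs by a
    # position-dependent angle; pure rotation, so per-row norms are preserved
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Fused materialization = RMSNorm on the key head, then RoPE
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 64))
materialized = rope(rmsnorm(keys, np.ones(64)), positions=np.arange(4))
```

Fusing the two steps avoids writing the normalized keys back to memory between kernels, which is the point of the Triton helper.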

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces DFlash speculative decoding support for Qwen3.5 models, which is a significant feature enhancement. The implementation is comprehensive, adding new core components for DFlash (dflash_worker.py, dflash.py, etc.), a new benchmark script, and integrating DFlash support across various parts of the system including the scheduler, model runner, and attention backends. The code quality is high, with robust configuration handling and optimizations like fused Triton kernels. My review includes one suggestion for a minor refactoring to improve code maintainability.

Comment on lines +1583 to +1601
if self.spec_algorithm.is_dflash() and req.return_logprob:
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support return_logprob yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if self.spec_algorithm.is_dflash() and (
    req.sampling_params.json_schema is not None
    or req.sampling_params.regex is not None
    or req.sampling_params.ebnf is not None
    or req.sampling_params.structural_tag is not None
):
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support grammar-constrained decoding yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
Contributor


Severity: medium

The two if blocks for checking DFLASH unsupported features are repetitive. They can be combined to reduce code duplication and improve readability.

Suggested change

if self.spec_algorithm.is_dflash():
    unsupported_reason = None
    if req.return_logprob:
        unsupported_reason = "return_logprob"
    elif (
        req.sampling_params.json_schema is not None
        or req.sampling_params.regex is not None
        or req.sampling_params.ebnf is not None
        or req.sampling_params.structural_tag is not None
    ):
        unsupported_reason = "grammar-constrained decoding"
    if unsupported_reason:
        req.set_finish_with_abort(
            f"DFLASH speculative decoding does not support {unsupported_reason} yet."
        )
        self.init_req_max_new_tokens(req)
        self._add_request_to_queue(req)
        return

@liyucheng09

Hi @EanWang211123 (Yiheng), impressive results. I tried to reproduce the accept-len results, but failed with eaglechat & SpecForge.

Does your accept len include the 1 token produced at the end of target extend?

@EanWang211123
Author

Hi @EanWang211123 (Yiheng), impressive results. I tried to reproduce the accept-len results, but failed with eaglechat & SpecForge.

Does your accept len include the 1 token produced at the end of target extend?

@liyucheng09 I think so. I only tried training with a small amount of data for verification.
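
For reference on the question above: under typical greedy verification, the reported acceptance length is the matched draft prefix plus the one bonus token the target emits at the end of its extend pass, so it is always at least 1. A simplified sketch (not the actual compute_dflash_accept_len_and_bonus implementation):

```python
def greedy_accept_len(draft_tokens, target_tokens):
    # target_tokens[i] is the target model's greedy choice at draft position i
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    # +1: the target always contributes one token of its own (the "bonus"
    # token), so acceptance length is >= 1 even when no draft token matches
    return matched + 1

print(greedy_accept_len([5, 7, 9], [5, 7, 2]))  # 3: two matches + bonus
```

Whether the reported "accept len" counts that bonus token explains roughly a +1 offset between different measurement conventions.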

@liyucheng09

@EanWang211123 how's the accept len in the sglang log? I tried to reproduce your results but my number is rather low.
[screenshot: accept-len log]

@EanWang211123
Author

@EanWang211123 how's the accept len in the sglang log? I tried to reproduce your results but my number is rather low. [screenshot: accept-len log]

@liyucheng09 My training dataset was created by sampling the EagleChat subset (English 4k + Chinese 4k), and the model was trained using SGLang as the training backend.
May I ask whether your training setup, test set, and other configurations are the same as mine?

@moehanabi

moehanabi commented Apr 3, 2026

Update on 4.7:
The reason seems to be that this version of sglang outputs repeatedly, which causes the draft tokens not to be accepted. It works well after merging the newest main commits.

Update on 4.8:
I have created a PR at EanWang211123#2, merged from the v0.5.10 tag & fixed memory_pool_config.

Original reply:
Hi! Thanks for your great work!
For Qwen3 30B-A3B:
I got the same low accept-len result with my own 150k training data.
I also found the first accept len was high (still lower than EAGLE3), but it dropped on the following requests, even though the test data items are all similar (and work well with EAGLE3).
[screenshot: accept len]

Using TP=2 to ensure enough KV cache space:

[screenshot: accept len]

For Qwen3.5 35B-A3B:
I got accept len = 1.0. Maybe something went wrong; I'm still checking.
