
[Feature] Add Qwen3.5 Model Support for DFlash#19952

Open
EanWang211123 wants to merge 58 commits into sgl-project:main from EanWang211123:dflash-qwen3_5

Conversation


@EanWang211123 EanWang211123 commented Mar 5, 2026

Motivation

Add DFlash speculative decoding support for Qwen3.5 models, following the implementation approach in #16818.

Modifications

Similar to #18387:

  • In dflash_worker.py, saved the global server_args before creating the draft worker and restored it afterward
  • Added a set_dflash_layers_to_capture interface for better layer management
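
A minimal sketch of the save/restore pattern described above (names such as `build_draft_worker`, `global_state`, and `make_worker` are illustrative stand-ins, not the actual SGLang symbols):

```python
from types import SimpleNamespace

def build_draft_worker(global_state, draft_server_args, make_worker):
    # Save the current global server_args, swap in the draft config while
    # the draft worker initializes, then restore the original unconditionally.
    saved = global_state.server_args
    global_state.server_args = draft_server_args
    try:
        return make_worker()
    finally:
        global_state.server_args = saved

# Demo with stand-in objects: the worker sees the draft args during
# construction, and the global state is restored afterwards.
state = SimpleNamespace(server_args="target-args")
worker = build_draft_worker(state, "draft-args", lambda: state.server_args)
```

The try/finally guarantees the target configuration is restored even if draft-worker construction raises partway through.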

Tests

Successfully trained and tested with the Qwen3.5-27B model using DFlash in SpecForge:

  • Dataset: eaglechat
  • Training: 6 epochs on 8k data points

Test Environment:

Test Command:

SGLANG_DISABLE_CUDNN_CHECK=1 \
CUDA_VISIBLE_DEVICES=4,5,6,7 \
python benchmark_sglang.py \
--tp-size 4 \
--target-model /models/Qwen/Qwen3___5-27B/ \
--draft-model /draft-model-path/ \
--concurrencies 1,8,32 \
--dataset-name humaneval \
--attention-backends fa3 \
--max-running-requests 32

Performance Results

Baseline Output (tok/s)

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| tok/s | 103.61 | 361.46 | 1,860.84 |

DFLASH Output (tok/s)

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| tok/s | 207.19 | 684.67 | 2,115.39 |

Speedup (DFLASH / Baseline)

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| speedup | 2.000x | 1.894x | 1.137x |
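
The speedup row is simply the element-wise ratio of the two throughput tables above; a quick sanity check:

```python
# Ratio of DFlash to baseline throughput at each concurrency level,
# using the numbers reported in the tables above.
baseline = {1: 103.61, 4: 361.46, 32: 1860.84}
dflash = {1: 207.19, 4: 684.67, 32: 2115.39}
speedup = {c: round(dflash[c] / baseline[c], 3) for c in baseline}
print(speedup)  # {1: 2.0, 4: 1.894, 32: 1.137}
```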

DFLASH Acceptance Length

| concurrency | 1 | 4 | 32 |
| --- | --- | --- | --- |
| acceptance length | 3.233 | 3.216 | 3.219 |

Summary

The implementation shows significant performance improvements with DFlash enabled, achieving up to 2x speedup for single concurrent requests and maintaining good acceleration across different concurrency levels.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@EanWang211123 EanWang211123 changed the title [Feature]Add Qwen3.5 Model Support for DFlash [Feature] Add Qwen3.5 Model Support for DFlash Mar 5, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates DFLASH speculative decoding into the SGLang framework, primarily targeting Qwen3.5 models to enhance inference performance. It introduces a dedicated DFLASH draft model, optimizes KV cache operations with a fused Triton kernel, and updates the core speculative decoding logic to manage DFLASH-specific requirements and ensure compatibility with existing features. The changes aim to provide substantial speedups for model generation.

Highlights

  • DFLASH Speculative Decoding Support: Introduced comprehensive support for DFLASH speculative decoding, enabling significant performance improvements for model inference.
  • Qwen3.5 Model Integration: Added specific model definitions and configurations to support Qwen3.5 models with DFLASH, including handling auxiliary hidden states and KV cache scaling.
  • Optimized KV Cache Materialization: Implemented a fused Triton kernel for efficient KV cache materialization in the DFLASH draft model, combining RMSNorm and RoPE operations.
  • Enhanced Speculative Decoding Framework: Integrated DFLASH into the SGLang speculative decoding framework, including new worker logic, updated CUDA graph capture, and compatibility checks for sampling parameters.
  • Performance Benchmarking: Provided benchmark results demonstrating up to 2x speedup for single concurrent requests and good acceleration across various concurrency levels using DFLASH.


Changelog
  • benchmark/dflash/bench_dflash_gsm8k_sweep.py
    • Added a new benchmark script to evaluate DFLASH performance against a baseline on the GSM8K dataset across various concurrency and tensor parallelism configurations.
  • python/sglang/srt/layers/attention/flashinfer_backend.py
    • Modified custom mask handling in init_forward_metadata_capture_cuda_graph to prevent FlashInfer from misinterpreting zero buffers as real masks for DFLASH draft models.
    • Adjusted causal attention logic in forward_extend to correctly handle ENCODER_ONLY attention types used by DFLASH.
  • python/sglang/srt/managers/scheduler.py
    • Added checks to handle_generate_request to abort DFLASH requests that attempt to use return_logprob or grammar-constrained decoding, as these features are not yet supported.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Updated CUDA graph capture logic to include DFLASH, allowing TARGET_VERIFY mode for draft workers and handling input embeddings.
    • Integrated set_dflash_layers_to_capture for models supporting DFLASH auxiliary hidden state capture.
  • python/sglang/srt/model_executor/model_runner.py
    • Introduced DFLASH-specific initialization, including parsing draft model configuration and resolving target layers for auxiliary hidden state capture.
    • Updated _should_run_flashinfer_autotune and _dummy_run to account for DFLASH speculative algorithm.
  • python/sglang/srt/model_executor/model_runner_kv_cache_mixin.py
    • Modified profile_max_num_token to scale KV cell size per token for DFLASH, accounting for both target and draft model layers.
  • python/sglang/srt/models/dflash.py
    • Added a new DFLASH draft model implementation, including DFlashAttention, DFlashMLP, and DFlashDecoderLayer modules.
    • Implemented kv_proj_only and project_target_hidden methods for efficient KV materialization and hidden state projection.
  • python/sglang/srt/models/gpt_oss.py
    • Added get_input_embeddings method and set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/llama.py
    • Added set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3.py
    • Added set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3_5.py
    • Removed is_last_layer parameter from Qwen3DecoderLayer and Qwen3VLChatDecoderLayer initializers.
    • Added layers_to_capture attribute to Qwen3Model for DFLASH/EAGLE3 hidden-state capture.
    • Modified forward method of Qwen3Model to return auxiliary hidden states when captured.
    • Added set_dflash_layers_to_capture for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3_moe.py
    • Added set_dflash_layers_to_capture to Qwen3MoeDecoder and Qwen3MoeForCausalLM for DFLASH auxiliary hidden state capture.
  • python/sglang/srt/models/qwen3_next.py
    • Added set_dflash_layers_to_capture to Qwen3NextDecoder and Qwen3NextForCausalLM for DFLASH auxiliary hidden state capture.
    • Added get_input_embeddings method to Qwen3NextForCausalLM.
  • python/sglang/srt/models/qwen3_vl.py
    • Adjusted forward method of Qwen3VLChatForCausalLM to correctly handle and pass aux_hidden_states to the logits processor.
  • python/sglang/srt/server_args.py
    • Added speculative_dflash_block_size argument to ServerArgs.
    • Implemented DFLASH-specific validation and defaulting logic for speculative decoding parameters, including speculative_num_steps, speculative_eagle_topk, and speculative_num_draft_tokens.
    • Disabled overlap scheduling and mixed chunked prefill for DFLASH speculative decoding.
  • python/sglang/srt/speculative/dflash_info.py
    • Added DFlashDraftInput and DFlashVerifyInput dataclasses to manage DFLASH speculative decoding state.
    • Implemented _compute_paged_keep_slots for managing KV cache allocation in paged mode.
    • Provided prepare_for_verify and verify methods for DFLASH verification logic, including handling greedy and non-greedy sampling.
  • python/sglang/srt/speculative/dflash_utils.py
    • Added utility functions for DFLASH, including scale_kv_cell_size_per_token_for_dflash for KV cache sizing, resolve_dflash_verify_mask_policy for attention mask handling, and compute_dflash_accept_len_and_bonus for verification.
    • Introduced parse_dflash_draft_config to extract DFLASH-specific configurations from HF model configs.
    • Implemented build_target_layer_ids for selecting target layers for hidden state capture.
  • python/sglang/srt/speculative/dflash_worker.py
    • Added a new DFlashWorker class to manage the DFLASH speculative decoding process, including initializing a separate draft model runner.
    • Implemented _prepare_for_speculative_decoding to handle draft generation and verification steps.
    • Provided _greedy_sample_from_vocab_parallel_head for TP-safe greedy sampling and _append_target_hidden_to_draft_kv for materializing hidden states into the draft KV cache.
    • Integrated fused KV materialization using a Triton kernel for performance optimization.
  • python/sglang/srt/speculative/spec_info.py
    • Added DFLASH to the SpeculativeAlgorithm enum.
    • Updated is_draft_input and is_verify_input methods to include DFLASH types.
    • Configured create_worker to return DFlashWorker for DFLASH algorithm and enforced no overlap scheduling.
  • python/sglang/srt/speculative/triton_ops/init.py
    • Added __init__.py to export FusedKVMaterializeHelper.
  • python/sglang/srt/speculative/triton_ops/fused_kv_materialize.py
    • Added a new Triton kernel (_fused_norm_rope_kernel) and a FusedKVMaterializeHelper class for fused RMSNorm + RoPE operations during DFLASH KV materialization.
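
For intuition, the math the fused kernel implements (per the changelog entry above: RMSNorm on the draft key head followed by rotary embedding) can be sketched as a NumPy reference. This is an illustrative sketch only, not the Triton code; the head size, eps, and RoPE base are assumptions, and the real kernel fuses both steps into a single pass during KV materialization:

```python
import numpy as np

def rmsnorm(x, weight, eps=1e-6):
    # RMSNorm: scale x by the reciprocal root-mean-square of its last dim
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def rope(x, positions, base=10000.0):
    # Rotary embedding: rotate consecutive (even, odd) feature pairs by a
    # position-dependent angle; pure rotation, so per-row norms are preserved
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = positions[:, None] * inv_freq[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Fused materialization = RMSNorm on the key head, then RoPE
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 64))
materialized = rope(rmsnorm(keys, np.ones(64)), positions=np.arange(4))
```

Fusing the two steps avoids writing the normalized keys back to memory between kernels, which is the point of the Triton helper.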

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces DFlash speculative decoding support for Qwen3.5 models, which is a significant feature enhancement. The implementation is comprehensive, adding new core components for DFlash (dflash_worker.py, dflash.py, etc.), a new benchmark script, and integrating DFlash support across various parts of the system including the scheduler, model runner, and attention backends. The code quality is high, with robust configuration handling and optimizations like fused Triton kernels. My review includes one suggestion for a minor refactoring to improve code maintainability.

Comment on lines +1583 to +1601
if self.spec_algorithm.is_dflash() and req.return_logprob:
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support return_logprob yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if self.spec_algorithm.is_dflash() and (
    req.sampling_params.json_schema is not None
    or req.sampling_params.regex is not None
    or req.sampling_params.ebnf is not None
    or req.sampling_params.structural_tag is not None
):
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support grammar-constrained decoding yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
Contributor


Severity: medium

The two if blocks for checking DFLASH unsupported features are repetitive. They can be combined to reduce code duplication and improve readability.

Suggested change

if self.spec_algorithm.is_dflash():
    unsupported_reason = None
    if req.return_logprob:
        unsupported_reason = "return_logprob"
    elif (
        req.sampling_params.json_schema is not None
        or req.sampling_params.regex is not None
        or req.sampling_params.ebnf is not None
        or req.sampling_params.structural_tag is not None
    ):
        unsupported_reason = "grammar-constrained decoding"
    if unsupported_reason:
        req.set_finish_with_abort(
            f"DFLASH speculative decoding does not support {unsupported_reason} yet."
        )
        self.init_req_max_new_tokens(req)
        self._add_request_to_queue(req)
        return

@liyucheng09

Hi @EanWang211123 (Yiheng), impressive results. I tried to reproduce the accept-len results, but failed with eaglechat & SpecForge.

Does your accept len include the 1 token produced at the end of target extend?

@EanWang211123
Author

Hi @EanWang211123 (Yiheng), impressive results. I tried to reproduce the accept-len results, but failed with eaglechat & SpecForge.

Does your accept len include the 1 token produced at the end of target extend?

@liyucheng09 I think so. I only tried training with a small amount of data for verification.
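
For reference on the question above: under typical greedy verification, the reported acceptance length is the matched draft prefix plus the one bonus token the target emits at the end of its extend pass, so it is always at least 1. A simplified sketch (not the actual compute_dflash_accept_len_and_bonus implementation):

```python
def greedy_accept_len(draft_tokens, target_tokens):
    # target_tokens[i] is the target model's greedy choice at draft position i
    matched = 0
    for d, t in zip(draft_tokens, target_tokens):
        if d != t:
            break
        matched += 1
    # +1: the target always contributes one token of its own (the "bonus"
    # token), so acceptance length is >= 1 even when no draft token matches
    return matched + 1

print(greedy_accept_len([5, 7, 9], [5, 7, 2]))  # 3: two matches + bonus
```

Whether the reported "accept len" counts that bonus token explains roughly a +1 offset between different measurement conventions.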

@liyucheng09

@EanWang211123 how's the accept len in the sglang log? I tried to reproduce your results but my number is rather low.
[screenshot: accept-len log]

@EanWang211123
Author

@EanWang211123 how's the accept len in the sglang log? I tried to reproduce your results but my number is rather low. [screenshot: accept-len log]

@liyucheng09 My training dataset was created by sampling the EagleChat subset (English 4k + Chinese 4k), and the model was trained using SGLang as the training backend.
May I ask whether your training setup, test set, and other configurations are the same as mine?

@moehanabi

moehanabi commented Apr 3, 2026

Update on 4.7:
The reason seems to be that this version of sglang outputs repeatedly, which causes the draft tokens not to be accepted. It works well after merging the newest main commits.

Update on 4.8:
I have created a PR at EanWang211123#2, merged from the v0.5.10 tag & fixed memory_pool_config.

Original reply:
Hi! Thanks for your great work!
For Qwen3 30B-A3B:
I got the same low accept-len result with my own 150k training data.
I also found the first accept len was high (still lower than EAGLE3), but it dropped on the following requests, even though the test data items are all similar (and work well with EAGLE3).
[screenshot: accept len]

Using TP=2 to ensure enough KV cache space:

[screenshot: accept len]

For Qwen3.5 35B-A3B:
I got accept len = 1.0. Maybe something went wrong; I'm still checking.
