[Feature] Add DFlash Speculative Decoding Support for Qwen3-VL Model#18387

Open
EanWang211123 wants to merge 54 commits into sgl-project:main from EanWang211123:vlm-dflash-test

Conversation


@EanWang211123 EanWang211123 commented Feb 7, 2026

Motivation

This PR adds DFlash speculative decoding support for the Qwen3-VL model. It depends on #16818.

DFlash Speculative Decoding:

  1. The DFlash draft model only requires the target model's hidden states as input. This allows a DFlash draft model adapted for a corresponding target base model to be used with a multimodal version of that target model (e.g., Qwen3-VL-8B-Instruct + Qwen3-8B-Dflash-b16).
  2. Thanks to DFlash's generalization capability, no additional training on multimodal data was needed during testing to reach an average acceptance length of over 2 tokens.
  3. SpecForge issue for VL model DFlash adaptation: [Feature] [RFC] DFlash Training Adaptation for Qwen VL Models SpecForge#461
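As a rough illustration of the acceptance-length metric reported below, the sketch computes the mean number of tokens emitted per target-model verification step, assuming each step emits the accepted draft tokens plus one token from the target model itself (a common convention; SGLang's exact accounting may differ):

```python
def mean_acceptance_length(accepted_per_step):
    """Toy metric: average tokens emitted per target-model forward pass.

    Assumes each verification step emits the accepted draft tokens
    plus one token from the target model (the exact bookkeeping in
    SGLang may differ slightly).
    """
    steps = len(accepted_per_step)
    total_tokens = sum(accepted_per_step) + steps  # +1 token per step
    return total_tokens / steps

# e.g. 3 verification steps accepting 2, 1, and 3 draft tokens
print(mean_acceptance_length([2, 1, 3]))  # 3.0
```

An acceptance length of 2.8 (as measured below) thus means roughly 1.8 draft tokens are accepted per verification step on average.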

Modifications

New Files

  • benchmark/dflash/bench_dflash_mmstar.py: MMStar benchmark script that reports throughput and acceptance length.

Changed Files

  • python/sglang/srt/models/qwen3_vl.py: Added the set_dflash_layers_to_capture interface.

Key Features

Multimodal Adaptation:
Follows the standalone-style multimodal speculative decoding adaptation approach (e.g., Qwen3-8B-VL + Qwen3-0.6B), using the same MRoPE adaptation logic.
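For context, MRoPE assigns each token a 3-D (temporal, height, width) position id. A toy sketch of the scheme (offsets simplified; not the exact Qwen3-VL logic): text tokens advance all three axes together, while image patch tokens keep a fixed temporal id and index the patch grid on the height/width axes.

```python
# Simplified illustration of MRoPE-style 3-D position ids for a single
# h x w image patch grid appended after n_text text tokens.
# The real Qwen3-VL offsets and multi-image handling differ.
def mrope_positions(n_text, h, w):
    pos = [(i, i, i) for i in range(n_text)]  # text: all axes move together
    for r in range(h):                        # image patches: fixed temporal id,
        for c in range(w):                    # grid coordinates on h/w axes
            pos.append((n_text, n_text + r, n_text + c))
    return pos

print(mrope_positions(2, 2, 3)[:4])
```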

Restore global server_args after DFlash worker initialization to prevent SHM feature decoding failure:
When launching DFlash speculative decoding with Qwen3-VL (tp_size=2), the first image request triggers TypeError: object supporting the buffer API required. Running the VLM alone, or with other speculative decoding methods, works fine.

Root Cause

In a single-node SGLang deployment, the tokenizer process transfers image feature tensors to the scheduler via shared memory (SHM). The sender wraps the data in ShmPointerMMData, and the receiver unwraps it with unwrap_shm_features, which consults the global server_args.skip_tokenizer_init to decide whether unwrapping is needed:

def unwrap_shm_features(obj):
    if ... or get_global_server_args().skip_tokenizer_init:
        return obj  # Skip unwrapping

When initializing DFlash's draft worker, the code deep-copies server_args and sets skip_tokenizer_init=True (text-only draft models need no tokenizer). During ModelRunner.__init__, the draft worker then calls set_global_server_args_for_scheduler(draft_server_args), overwriting the global server_args with the draft version.
As a result, the tokenizer wraps the features correctly, but the scheduler, reading the polluted global, skips unwrapping and passes the raw ShmPointerMMData object straight to hashlib.sha256(), raising the TypeError.
Other speculative decoding methods such as EAGLE are unaffected: they pass the original server_args through directly (no deepcopy, no modification), so the global stays intact.
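The failure pattern is easy to reproduce outside SGLang. In this toy sketch (GLOBAL_ARGS, set_global_args, and unwrap are illustrative stand-ins, not SGLang's API), a deep-copied config with a flipped flag is installed globally and silently changes the behavior of an unrelated code path:

```python
import copy

GLOBAL_ARGS = None  # stand-in for the global server_args

def set_global_args(args):
    global GLOBAL_ARGS
    GLOBAL_ARGS = args

def unwrap(obj):
    # Mirrors unwrap_shm_features: skip unwrapping when the global
    # config claims the tokenizer was never initialized.
    if GLOBAL_ARGS["skip_tokenizer_init"]:
        return obj          # wrapper leaks through untouched
    return obj["payload"]   # proper unwrap

# Target worker installs its args.
set_global_args({"skip_tokenizer_init": False})

# Draft worker deep-copies, flips the flag, then overwrites the global.
draft_args = copy.deepcopy(GLOBAL_ARGS)
draft_args["skip_tokenizer_init"] = True
set_global_args(draft_args)

wrapped = {"payload": b"image-features"}
print(unwrap(wrapped))  # the dict wrapper, not b"image-features"
```

The downstream consumer then receives the wrapper object where it expected raw bytes, which is exactly what hashlib.sha256() rejects in the real failure.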

Fix

In dflash_worker.py, save the global server_args before creating the draft worker and restore it afterward:

# Save the target worker's global server_args before draft construction
saved_server_args = get_global_server_args()
self.draft_worker = TpModelWorker(server_args=draft_server_args, ...)
# Restore the original args so the scheduler sees the target configuration
set_global_server_args_for_scheduler(saved_server_args)
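The same save/restore pattern can also be written as a context manager so the restore happens even if draft-worker construction raises. This is a generic sketch (preserve_global and the get_global/set_global pair are hypothetical stand-ins, not SGLang functions):

```python
from contextlib import contextmanager

_GLOBAL = {"skip_tokenizer_init": False}  # stand-in for global server_args

def get_global():
    return _GLOBAL

def set_global(args):
    global _GLOBAL
    _GLOBAL = args

@contextmanager
def preserve_global():
    saved = get_global()
    try:
        yield
    finally:
        set_global(saved)  # restore even if the body raises

with preserve_global():
    set_global({"skip_tokenizer_init": True})  # draft-worker init happens here
print(get_global()["skip_tokenizer_init"])  # False: original args restored
```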

Risk Assessment

  • Changes are isolated within dflash_worker.py and do not affect other speculative decoding methods
  • Global variables remain properly set during draft worker initialization, so the draft's own initialization logic is unaffected
  • The restored object is the same one previously set by the target worker, preserving all modified fields (e.g., use_mla_backend)
  • The entire operation completes before the scheduler event loop starts, eliminating concurrency risks

Tests

Environment: 4090D
Models: Qwen3-VL-8B-Instruct, Qwen3-8B-DFlash-b16
Test Dataset: MMStar

Test Commands

# Baseline (no speculative decoding)
SGLANG_DISABLE_CUDNN_CHECK=1 \
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
--model-path /models/Qwen3-VL-8B-Instruct/ \
--tp-size 2 \
--dtype bfloat16 \
--mem-fraction-static 0.65 \
--cuda-graph-max-bs 32 --context-length 40960 --port 30000

# With DFlash speculative decoding
SGLANG_DISABLE_CUDNN_CHECK=1 \
CUDA_VISIBLE_DEVICES=0,1 \
python -m sglang.launch_server \
--model-path /models/Qwen3-VL-8B-Instruct \
--speculative-algorithm DFLASH \
--speculative-draft-model-path /models/Qwen3-8B-DFlash-b16 \
--tp-size 2 \
--dtype bfloat16 \
--mem-fraction-static 0.65 \
--cuda-graph-max-bs 32 --context-length 40960

# Run benchmark
python benchmark/dflash/bench_dflash_mmstar.py --port 30000 \
--dataset-path /datasets/mmstar \
--num-samples 10 --concurrency 1  \
--max-completion-tokens 2048 --temperature 0.0

Test Results

Concurrency = 1:

| Metric             | DFlash | Baseline |
| ------------------ | ------ | -------- |
| Throughput (tok/s) | 45.48  | 23.60    |
| Acceptance Length  | 2.8    | N/A      |

DFlash throughput is roughly 1.93x the baseline at concurrency 1.

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces support for DFlash speculative decoding, which is a significant feature. The changes are extensive, touching many parts of the system from server configuration and model execution to attention backends and model implementations. The implementation appears robust and well-integrated with the existing speculative decoding framework.

Key changes include:

  • A new DFlashWorker and associated data structures (DFlashDraftInput, DFlashVerifyInput) to manage the DFlash-specific workflow.
  • A new dflash.py model implementation for the draft model, which correctly omits embedding and LM head layers.
  • Modifications to attention backends (flashinfer, trtllm_mha) to support DFlash's requirements, including a critical correctness fix in the trtllm_mha backend.
  • Integration with CUDA graph capture for performance.
  • New benchmark scripts for validation.

The code is well-structured, and the changes are generally clear and well-commented. I have one suggestion for improving the exception handling in the server argument parsing logic to make it more robust. Overall, this is a high-quality contribution.

@EanWang211123 EanWang211123 changed the title [Feature] Add DFlash Speculative Decoding Support for Qwen-VL Model [Feature] Add DFlash Speculative Decoding Support for Qwen3-VL Model Mar 4, 2026
@EanWang211123 EanWang211123 deleted the vlm-dflash-test branch March 13, 2026 08:45
@EanWang211123 EanWang211123 restored the vlm-dflash-test branch March 13, 2026 08:56
@EanWang211123 EanWang211123 reopened this Mar 24, 2026
@EanWang211123 EanWang211123 requested a review from hzh0425 as a code owner March 24, 2026 09:14