[Feature] Add DFlash Speculative Decoding Support for Qwen3-VL Model#18387
Open
EanWang211123 wants to merge 54 commits into sgl-project:main from
Conversation
Contributor
Code Review
This pull request introduces support for DFlash speculative decoding, which is a significant feature. The changes are extensive, touching many parts of the system from server configuration and model execution to attention backends and model implementations. The implementation appears robust and well-integrated with the existing speculative decoding framework.
Key changes include:
- A new `DFlashWorker` and associated data structures (`DFlashDraftInput`, `DFlashVerifyInput`) to manage the DFlash-specific workflow.
- A new `dflash.py` model implementation for the draft model, which correctly omits embedding and LM head layers.
- Modifications to attention backends (`flashinfer`, `trtllm_mha`) to support DFlash's requirements, including a critical correctness fix in the `trtllm_mha` backend.
- Integration with CUDA graph capture for performance.
- New benchmark scripts for validation.
The code is well-structured, and the changes are generally clear and well-commented. I have one suggestion for improving the exception handling in the server argument parsing logic to make it more robust. Overall, this is a high-quality contribution.
Motivation
This PR adds DFlash speculative decoding support for the Qwen3-VL model. It depends on #16818.
DFlash Speculative Decoding:
Modifications
New Files
- `benchmark/dflash/bench_dflash_mmstar.py`: MMStar benchmark; outputs throughput and acceptance length.

Changed Files
- `python/sglang/srt/models/qwen3_vl.py`: Added the `set_dflash_layers_to_capture` interface.

Key Features
Multimodal Adaptation:
Follows the standalone-style multimodal speculative decoding adaptation approach (e.g., Qwen3-8B-VL + Qwen3-0.6B), using the same MRoPE adaptation logic.
Restore global `server_args` after DFlash worker initialization to prevent SHM feature decoding failure:

When launching DFlash speculative decoding with Qwen3-VL (`tp_size=2`), the first image request triggers `TypeError: object supporting the buffer API required`. The VLM running alone, or with other speculative decoding methods, works fine.

Root Cause
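The error class itself is easy to reproduce in isolation: `hashlib.sha256` only accepts bytes-like objects, so hashing any non-buffer wrapper object fails with exactly this message. In the sketch below, `ShmPointerStandIn` is a hypothetical stand-in for an un-unwrapped `ShmPointerMMData`, not sglang's actual class:

```python
# hashlib.sha256 requires a bytes-like object; passing an arbitrary
# wrapper object (a stand-in for the un-unwrapped ShmPointerMMData)
# raises the TypeError described in this PR.
import hashlib

class ShmPointerStandIn:  # hypothetical stand-in, not sglang's class
    pass

try:
    hashlib.sha256(ShmPointerStandIn())
    raised = False
except TypeError as exc:
    raised = True
    message = str(exc)  # "object supporting the buffer API required"

# Hashing the unwrapped bytes-like payload works fine:
ok_digest = hashlib.sha256(b"image-features").hexdigest()
```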
In single-node SGLang deployment, the tokenizer process transfers image feature tensors to the scheduler via shared memory (SHM). The sender wraps data with `ShmPointerMMData`, and the receiver unwraps it using `unwrap_shm_features`, which depends on the global `server_args.skip_tokenizer_init` to determine whether unwrapping is needed.

When initializing DFlash's draft worker, it deepcopies `server_args` and sets `skip_tokenizer_init=True` (since text-only draft models don't require a tokenizer). During `ModelRunner.__init__`, the draft worker calls `set_global_server_args_for_scheduler(draft_server_args)`, overwriting the global `server_args` with the draft version.

As a result, the tokenizer properly wraps the features, but the scheduler skips unwrapping because of the polluted global variable, and passes the raw `ShmPointerMMData` object directly to `hashlib.sha256()`, causing the TypeError.

Other speculative decoding methods like EAGLE remain unaffected because they pass the original `server_args` directly (without deepcopy or modification), so the global variables remain unchanged.

Fix
In `dflash_worker.py`, save the global `server_args` before creating the draft worker and restore it afterward.

Risk Assessment
- Changes are confined to `dflash_worker.py` and do not affect other speculative decoding methods (`use_mla_backend`)

Tests
Environment: 4090D
Models: Qwen3-VL-8B-Instruct, Qwen3-8B-DFlash-b16
Test Dataset: MMStar
Test Commands
Test Results
Concurrency = 1:
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`