[Feature] Add Qwen3.5 Model Support for DFlash #19952
Open
EanWang211123 wants to merge 58 commits into sgl-project:main from EanWang211123:dflash-qwen3_5
58 commits
10e563f (dcw02) starting dflash impl
289e748 (dcw02) fix verify mismatch
f1efc03 (dcw02) add gsm8k bench
e807216 (dcw02) support more backends, investigate accuracy
99e140a (dcw02) native sglang backend
2c64b0e (dcw02) remove hf backend
f1a4262 (dcw02) dflash support flashinfer
2c5b346 (dcw02) remove manual management of dflash kv pool
6a38e63 (dcw02) add cuda graph
40a81af (dcw02) add cuda graph to draft worker
510bf0c (dcw02) update test
c54f336 (dcw02) fix flashinfer backend
8c8ee9c (dcw02) initial radix cache support
0edea3f (dcw02) tp_size > 1 support
f23555b (dcw02) add optional dflash_config for overrides, add --speculative-dflash-bl…
63c0b9a (dcw02) fix OOMs with default settings
9309764 (dcw02) clean up
644ab29 (dcw02) clean up dflash load_weights
ff6876a (dcw02) attention selection logic
32c3dd0 (dcw02) Merge remote-tracking branch 'upstream/main' into dflash
d808ac9 (dcw02) decouple context feature count K from draft num layers
e589ac1 (dcw02) clean up naming
074efb2 (dcw02) performance optimizations
fcc9bf7 (dcw02) skip Q, fused mlp
a79264f (dcw02) reuse buffers for decode
ad5adbf (dcw02) optimize greedy sampling
37fc3f1 (dcw02) preallocate for tp>1
72cbd9d (dcw02) more buffers for tp>1
5a577a3 (dcw02) dflash gsm8k benchmark sweep
d968532 (dcw02) fix benchmark
3e4177d (dcw02) use device tensors for ctx_lens/draft_seq_lens, vectorize kv append a…
b5b4bd6 (dcw02) precommit fixes
ed8b16d (xiaomin-D) feat(dflash): add fused KV materialization kernel and optimize D2H
f2a6dbc (dcw02) add support for qwen3_moe
117352d (dcw02) support dflash_config.mask_token_id
5ba316c (dcw02) add llama3.1 support and fix config block_size logic
5f8d0ec (dcw02) Merge pull request #15 from yilian49/pr16818
189f177 (dcw02) guards for fused path
0841db6 (dcw02) add support for gpt oss
56477b9 (dcw02) clean up
9c0242d (dcw02) Merge upstream sgl-project/main -> dflash and fix conflicts
d9c68a1 (dcw02) add qwen3-coder-next support (mamba)
7e189bd (dcw02) add page size > 1 support
7a739f8 (dcw02) non greedy
8a6bec9 (dcw02) Merge branch 'main' of github.com:sgl-project/sglang into dflash
0fe389d (dcw02) rope rotation support
a134f0a (dcw02) clean up schedule_batch.py
f62e5de (dcw02) fix auto memory oom, cleanup
2cc5f07 (dcw02) clean up
760870c (dcw02) Merge branch 'sgl-project:main' into dflash
3b66746 (dcw02) Merge branch 'main' of github.com:sgl-project/sglang into dflash
6913603 (dcw02) Merge branch 'main' of github.com:sgl-project/sglang into dflash
26441b8 (dcw02) initial fa4 support to dflash, clean up benchmarking script
74814de (dcw02) clean up
e493353 (dcw02) only run baseline once
619b59c (EanWang211123) [feat] add qwen3-5 dflash support
a4b1e1e (EanWang211123) [fix] fix offset of layer-ids
5106aa5 (EanWang211123) Merge branch 'main' into dflash-qwen3_5
The two `if` blocks that check for features DFLASH does not support are repetitive. Combining them into one check would reduce code duplication and improve readability.
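The reviewer's suggestion can be sketched as follows. The actual checks in this PR are not visible on this page, so `HypotheticalArgs`, its fields, and `check_dflash_unsupported` are hypothetical placeholders, not sglang's real `ServerArgs` API. The idea is to list the unsupported options once and test them in a single pass, instead of repeating one `if` block per feature.

```python
from dataclasses import dataclass


@dataclass
class HypotheticalArgs:
    # Hypothetical stand-ins for the options the two original checks inspected.
    speculative_algorithm: str = "DFLASH"
    enable_feature_a: bool = False
    enable_feature_b: bool = False


def check_dflash_unsupported(args: HypotheticalArgs) -> None:
    """Reject option combinations that DFLASH does not support.

    The unsupported features are collected in one mapping and checked
    together, replacing two near-identical `if` blocks.
    """
    if args.speculative_algorithm != "DFLASH":
        return  # The restriction only applies when DFLASH is selected.
    unsupported = {
        "enable_feature_a": args.enable_feature_a,
        "enable_feature_b": args.enable_feature_b,
    }
    # Keep every enabled feature so the error reports all conflicts at once.
    enabled = [name for name, on in unsupported.items() if on]
    if enabled:
        raise ValueError("DFLASH does not support: " + ", ".join(enabled))
```

Adding another unsupported feature then means adding one dictionary entry rather than another copy of the `if` block, and the error message names every conflicting option instead of only the first one hit.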