DeepSeek V4 support on SM12x with Triton sparse MLA fallback #40899
Closed
Commits (70)
030c1ec feat: support deepseek v4 (zyongye)
30faa6a chore: pass mypy (ivanium)
1df2a80 fix: update cuda requirements (ivanium)
8779f9d fix: config (ivanium)
0ff736a Integrate MegaMoE kernel (#232) (WoosukKwon)
8188e4a Prototype SM120 DeepSeek V4 reference attention (jasl)
3856a3f Allow DeepGEMM to build for SM120 with CUDA 13 (jasl)
5241faf Split SM120 sparse attention reference into LSE merge stages (jasl)
4034d5d Add SM120 FP8 indexer logits fallback (jasl)
359e334 Register SM120 reference attention env vars (jasl)
34b6717 Pin DeepGEMM SM120 prototype dependency (jasl)
dd18dc2 Prototype DeepSeek V4 pipeline parallelism (jasl)
ca911d7 Generalize sparse MLA reference fallback controls (jasl)
0d45c12 Let sparse MLA dump control override legacy alias (jasl)
b15fe88 Avoid pinning DeepGEMM SM120 fork (jasl)
88ace9e Keep DeepGEMM SM120 prototype pin (jasl)
bf42dc8 Add DeepSeek V4 sparse MLA reference tests (jasl)
916b19f Extract sparse MLA reference helpers (jasl)
7674619 Move sparse MLA prefill reference into helper (jasl)
ba0771a Share sparse MLA fallback env handling (jasl)
63ac6e7 Use workspace for DeepSeek V4 einsum output (jasl)
f2bde65 Add sparse MLA env helper tests (jasl)
7b38dd3 Use Triton for sparse MLA sink merge (jasl)
d78fc65 Use Triton for sparse MLA subset accumulation (jasl)
f21b837 Fuse fp8_ds_mla sparse MLA decode accumulation (jasl)
1ac48e4 Fuse fp8_ds_mla paged SWA decode accumulation (jasl)
f28fe76 Fuse fp8_ds_mla SWA-only decode fallback (jasl)
c918470 Use Triton indexed accumulation for sparse MLA prefill (jasl)
9afedca Fix sparse MLA ruff import ordering (jasl)
68f3236 Fuse sparse MLA finish with sink merge (jasl)
59172ac Use multi-head fp8 sparse MLA accumulation (jasl)
eeeeee8 Optimize SM12x sparse MLA decode kernels (jasl)
1894bda Fix sparse MLA padded-head state launches (jasl)
4ddd9e8 Handle padded sparse MLA output heads (jasl)
58f4ee5 Accept padded sparse MLA attention sinks (jasl)
a75f327 Drop stale sparse MLA dummy workspace reservation (jasl)
6249f9e Stabilize SM12x DeepSeek V4 sparse MLA fallback (jasl)
fad559c Allow opt-in cudagraphs for SM12x sparse MLA (jasl)
617788a Fuse SM12x sparse MLA decode fallback (jasl)
f6302ab Fix sparse MLA env test import order after refresh (jasl)
026e6cb Update sparse MLA env default test after refresh (jasl)
f681984 Update DeepGEMM SM120 pin for HC kernel (jasl)
3cca90b Skip unsupported FlashInfer sparse MLA tests on SM12x (jasl)
4c0983e Tune SM12x sparse MLA decode head grouping (jasl)
3fe9199 Fix DeepSeek V4 FP8 einsum config on SM12x (jasl)
227b15e Default SM12x sparse MLA runtime knobs (jasl)
53be759 Reject CUTLASS block FP8 scaled MM on SM12x (jasl)
6ea0d61 Add SM12x Triton FP8 einsum for DeepSeek V4 (jasl)
7b158e1 Bump FlashInfer CUDA packages to 0.6.9 (jasl)
4a08ccb Add sparse MLA head block tuning (jasl)
eab717d Bump DeepGEMM SM120 reference (jasl)
b037165 Add DeepGEMM SM120 paged MQA toggle (jasl)
6d6d6f7 Update DeepGEMM SM120 pin (jasl)
cc06fe8 temporary disable persistent topk for 1024 (zyongye)
0ad4dea Support dummy loading (WoosukKwon)
0e8b532 free up unused weights (WoosukKwon)
310ac26 Fix DeepSeek V4 MegaMoE test fixture (jasl)
6130a6d Address DeepSeek V4 review nits (jasl)
b56e53c Tune SM12x sparse MLA decode head grouping (jasl)
5958f37 Use short-row topk on SM120 indexer (jasl)
43ce641 [Kernel] Marlin MoE: include SM 12.x in default arch list
3e3a30a [Kernel] Tune default fp8 block-scaled Triton config for M<=8 decode
8573555 Guard low-M FP8 Triton stages on ROCm (jasl)
4d5335d Speed up SM12x sparse MLA decode with matmul path (jasl)
681a817 Reduce SM12x sparse MLA decode KV staging (jasl)
3a2dd99 Fuse SM12x sparse MLA decode mask build (jasl)
9ffda0c Extend SM12x low-M FP8 block config (jasl)
b7a70b9 Reduce SM12x long-prefill sparse MLA memory (jasl)
6652949 Clean up SM12x sparse MLA review issues (jasl)
8d0ebb7 Restore SM12x sparse MLA MTP decode fallback (jasl)
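Commit 5241faf above splits the SM120 sparse attention reference into stages whose partial results are combined via their log-sum-exp (LSE) values, and 7b38dd3/68f3236 merge such partials in Triton. As a hedged illustration only (not the PR's actual kernel code), here is a minimal pure-Python sketch of the standard LSE-merge identity those stages rely on: attention computed independently over two KV chunks can be recombined exactly using each chunk's LSE.

```python
import math


def chunk_attend(scores, values):
    """Attend over one KV chunk of scalar values.

    Returns (partial_output, lse), where partial_output is
    softmax(scores) @ values computed locally over the chunk, and
    lse is the chunk's log-sum-exp of scores (max-shifted for stability).
    """
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps)
    out = sum(e * v for e, v in zip(exps, values)) / denom
    return out, m + math.log(denom)


def lse_merge(o1, lse1, o2, lse2):
    """Merge two partial attention outputs using their LSEs.

    The merged weights exp(lse_i - lse) sum to 1, so the result equals
    attention computed over the concatenated chunks in one pass.
    """
    m = max(lse1, lse2)
    lse = m + math.log(math.exp(lse1 - m) + math.exp(lse2 - m))
    return math.exp(lse1 - lse) * o1 + math.exp(lse2 - lse) * o2, lse
```

Running full attention over four scores and merging two half-chunk results gives the same output, which is what lets the reference path process long KV in stages.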
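Several commits (359e334, ba0771a, f2bde65) centralize env-var handling for the sparse MLA fallback. The real helper names and variable names in the PR are not shown here; as a sketch under that assumption, a shared boolean-env parser of the kind such commits typically extract might look like this, with `SPARSE_MLA_EXAMPLE_FLAG` a purely hypothetical variable name:

```python
import os


def env_flag(name: str, default: bool = False) -> bool:
    """Parse a boolean environment variable.

    Unset -> default; otherwise '1', 'true', or 'yes' (any case,
    surrounding whitespace ignored) enable the flag, anything else
    disables it.
    """
    val = os.environ.get(name)
    if val is None:
        return default
    return val.strip().lower() in ("1", "true", "yes")
```

Centralizing parsing like this keeps the fallback toggles consistent across the decode, prefill, and test code paths instead of each site re-implementing its own string comparison.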
Do we also need 12.1a here for DGX Spark?
12.0f covers the entire 12.x family.
I see it in the log. I was wondering whether it's a problem with the environment on my side; I'm not sure, but I would expect to see 121f there :) Okay, I'll continue testing...
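The exchange above is about compute-capability families: a family-wide target such as 12.0f covers every SM 12.x device, so the per-device decision reduces to checking the capability's major version. As a small hedged sketch (on a live system the `(major, minor)` tuple would come from e.g. `torch.cuda.get_device_capability()`; the helper name here is illustrative, not from the PR):

```python
def is_sm12x(capability: tuple) -> bool:
    """Return True for any SM 12.x device, e.g. (12, 0) or (12, 1).

    A family-wide build target (12.0f) applies to all of these, so only
    the major version matters for the dispatch decision.
    """
    major, _minor = capability
    return major == 12
```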