-
Notifications
You must be signed in to change notification settings - Fork 15.7k
CUDA: experimental native mxfp4 support for blackwell #17906
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+426
−18
Merged
Changes from all commits
Commits
Show all changes
15 commits
Select commit
Hold shift + click to select a range
4bc93a6
CUDA: experimental native mxfp4 support for blackwell
38648a4
optimize load_tiles
f8d20c5
optimize quantize_mxfp4
ae617ac
cleanup
b0e3c62
first pass review: formatting
9ebe043
use interleaved layout for mma
d7edade
mmq: add assert for size
9746071
use __nv_fp4x4_e2m1
d6cb832
use iter_k as 512, cleanup
3fca15c
Use 1200 as blackwell instead of 1000
41329d3
address review comments
b391991
mmq: fix stride
404054c
quantize.cu: use reference impl of e8m0 scale
5d3780d
address review comments
dc04da5
add 120f-virtual + minor fixes
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we start adding Blackwell-specific features we should start adding Blackwell to the virtual architectures that are being built for a non-native build. To my knowledge the FP4 instructions are forward-compatible so we should build the lowest possible virtual architecture for it.
We may also consider building the real architectures for convenience but this is a different matter. The tradeoff is that users won't have to do JIT compilation on the first run but additional CPU time is needed for the builds, particularly in our CI. I am not managing the CI so input from @ggerganov would be appreciated on this matter.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to understand what are the implications of more native, virtual and real CUDA builds and will try to provide input. Atm, I'm not familiar with the specifics and the tradeoffs to comment on this topic. I think we can decide this later, not necessary for this PR, correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The FP4 matrix multiply instructions are not forward compatible and are very likely to change with future architectures. See 9.7.14.5.14. Multiply-and-Accumulate Instruction: mma of the PTX ISA manual.
Note that how things are done currently already ships with very broken performance on Hopper and datacenter Blackwell today. Maybe the path forward is to wait for cuTile C++ to ship, that should help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TL;DR on ISA compatibility: sm_120 has the baseline forward compatible feature set, sm_120a PTX is only compatible with compute_120 and sm_120f PTX is compatible with compute_12x.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ggerganov which real architectures to ship can be decided on independently of this PR, it's only relevant for conveience of users vs. convenience of our CI.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd recommend going with 120f (whether virtual + real or only virtual is up to debate), unless we need ISA specific to
12xaversions (which I presume we don't). Also, we should bump our forward compatible PTX to 120