UPSTREAM PR #17906: CUDA: experimental native mxfp4 support for blackwell [WIP] #511
Conversation
Explore the complete analysis inside Version Insights.

Performance Analysis Summary - PR #511

Analysis Scope: Comparison of version c61be74f-dd72-460d-860d-190c12fbb769 against baseline 0ccf3297-383d-42ee-a160-41f01785828a for the llama.cpp project.

Summary

This PR introduces experimental NVIDIA Blackwell native FP4 tensor core support for MXFP4 quantization. The implementation adds 455 lines across 7 CUDA files, targeting compute capability 100 with CUDA 12.8+. Performance analysis shows zero measurable impact on compiled binaries, with power consumption remaining stable across all components. The changes are conditionally compiled and only activate on Blackwell hardware, leaving existing inference paths unaffected.

Key Findings

Performance Metrics Analysis

No function-level performance data was available for comparison between the two versions. The summary report returned no functions meeting the specified filters, indicating that the changes have not yet been compiled into the analyzed binaries or that the analysis predates the code modifications.

Power Consumption Analysis

Power consumption measurements across all 16 binaries show negligible to zero change.

Binaries with Measurable Change:

Unchanged Binaries:

Code Implementation Analysis

The PR implements Blackwell-specific matrix multiplication paths.

New Components:

Dispatch Logic:

Impact on Inference Performance

Tokens Per Second: No impact on current inference performance. The tokenization and inference functions (llama_decode, llama_encode, llama_tokenize) show no response-time or throughput changes in the analyzed binaries. The MXFP4 implementation is hardware-gated and does not execute on the analyzed system configuration.

Impacted Functions: None in the current analysis. The conditional compilation guards prevent MXFP4 code paths from affecting existing inference pipelines on non-Blackwell hardware.

Binary-Level Changes

The stable power consumption across all binaries indicates that the code changes have not altered the compiled instruction sequences for the analyzed build configuration. This is consistent with the conditional compilation approach, where Blackwell-specific code is excluded when targeting older compute capabilities.
Force-pushed: b29e20d to 0e7b989
Explore the complete analysis inside Version Insights.

Performance Analysis Summary - PR #511

Analysis Scope: CUDA experimental native MXFP4 support for the Blackwell architecture across 7 modified files (425 additions, 15 deletions).

Overview

Performance analysis shows effectively zero measurable impact across all binaries in the current build. Power consumption changes range from 0.0% to 0.001%, with the largest absolute change being a 1.52 nJ reduction in build.bin.llama-run. No function-level performance data was available, indicating either identical binary outputs or changes not yet active in the measured configuration.

Code Changes

The PR introduces Blackwell GPU (compute capability 10.0) support for native FP4 tensor cores.

The implementation maintains backward compatibility by falling back to the existing Q8_1 path on non-Blackwell hardware.

Key Findings

Power Consumption:

Inference Impact:

Technical Context: The implementation reduces shared memory use per value by roughly 47% (0.56 vs 1.06 bytes per value) and introduces E8M0 block scaling for the tensor cores. Changes are confined to the ggml-cuda module, with no modifications to core llama inference functions.
Force-pushed: 4733ac4 to 18c8a27
Force-pushed: c39aef9 to a014a6b
Mirrored from ggml-org/llama.cpp#17906
Currently WIP, trying to add native fp4 support for Blackwell and beyond. To compile, -DCMAKE_CUDA_ARCHITECTURES="120a" is required.

Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4 and int4) which advertises 2x throughput compared to int8 tensor cores. However, at the moment this PR is actually ~10% slower than master. The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; however, PPL tests are okay with this change (though correctness issues are not ruled out).

TODO: