
UPSTREAM PR #17906: CUDA: experimental native mxfp4 support for blackwell [WIP]#511

Open
loci-dev wants to merge 13 commits into main from upstream-PR17906-branch_am17an-mxfp4

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17906

Currently WIP; trying to add native FP4 support for Blackwell and beyond. To compile, -DCMAKE_CUDA_ARCHITECTURES="120a" is required.

Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4, and int4) which advertises 2x throughput compared to int8 tensor cores. However, at the moment this PR is actually ~10% slower than master. The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; PPL tests are okay with this change, though that does not rule out correctness issues.

TODO:

  • Figure out why it is slower
  • Address NMSE error between q8_0 and mxfp4

@loci-review

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #511

Analysis Scope: Comparison of version c61be74f-dd72-460d-860d-190c12fbb769 against baseline 0ccf3297-383d-42ee-a160-41f01785828a for llama.cpp project.


Summary

This PR introduces experimental NVIDIA Blackwell native FP4 tensor core support for MXFP4 quantization. The implementation adds 455 lines across 7 CUDA files, targeting compute capability 10.0 with CUDA 12.8+. Performance analysis shows zero measurable impact on compiled binaries, with power consumption remaining stable across all components. The changes are conditionally compiled and only activate on Blackwell hardware, leaving existing inference paths unaffected.

Key Findings

Performance Metrics Analysis

No function-level performance data was available for comparison between the two versions. The summary report returned no functions meeting the specified filters, indicating the changes have not yet been compiled into the analyzed binaries or the analysis predates the code modifications.

Power Consumption Analysis

Power consumption measurements across all 16 binaries show negligible to zero change:

Binaries with Measurable Change:

  • build.bin.llama-tts: 254782 nJ → 254780 nJ (2 nJ reduction)
  • build.bin.llama-run: 220435 nJ → 220435 nJ (negligible)
  • build.bin.llama-cvector-generator: 250359 nJ → 250359 nJ (negligible)
  • build.bin.libllama.so: 195276 nJ → 195276 nJ (negligible)

Unchanged Binaries:
All GGML core libraries (libggml-base.so, libggml-cpu.so, libggml.so), utility binaries (llama-bench, llama-quantize, llama-tokenize), and multimodal CLI tools show zero power consumption change.

Code Implementation Analysis

The PR implements Blackwell-specific matrix multiplication paths:

New Components:

  • block_fp4_mmq structure: 68 bytes per block (50% reduction vs block_q8_1_mmq)
  • quantize_mmq_mxfp4 kernel: Converts FP32 activations to 4-bit E2M1 format with E8M0 scaling
  • vec_dot_mxfp4_mxfp4_mma: Utilizes m16n8k64 tensor core instruction
  • mma_block_scaled: PTX wrapper for native FP4 matrix multiplication

Dispatch Logic:
Runtime checks via blackwell_mma_available(cc) ensure MXFP4 path activates only on compute capability 100 hardware. Non-Blackwell systems continue using existing Q8_1 quantization paths without modification.

Impact on Inference Performance

Tokens Per Second: No impact on current inference performance. The tokenization and inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes in the analyzed binaries. The MXFP4 implementation is hardware-gated and does not execute on the analyzed system configuration.

Impacted Functions: None in the current analysis. The conditional compilation guards prevent MXFP4 code paths from affecting existing inference pipelines on non-Blackwell hardware.

Binary-Level Changes

The stable power consumption across all binaries indicates the code changes have not altered the compiled instruction sequences for the analyzed build configuration. This is consistent with the conditional compilation approach where Blackwell-specific code is excluded when targeting older compute capabilities.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from b29e20d to 0e7b989 Compare December 10, 2025 16:11
@loci-review

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #511

Analysis Scope: CUDA experimental native MXFP4 support for Blackwell architecture across 7 modified files (425 additions, 15 deletions).

Overview

Performance analysis shows effectively zero measurable impact across all binaries in the current build. Power consumption changes range from 0.0% to 0.001%, with the largest absolute change being 1.52 nJ reduction in build.bin.llama-run. No function-level performance data was available, indicating either identical binary outputs or changes not yet active in the measured configuration.

Code Changes

The PR introduces Blackwell GPU (compute capability 10.0) support for native FP4 tensor cores, adding:

  • New quantization path: quantize_mmq_mxfp4_cuda for activation quantization to MXFP4 format
  • Tensor core instruction: mma.sync.aligned.kind::mxf4.block_scale for m16n8k64 operations
  • FP4 E2M1 conversion: ggml_cuda_float_to_fp4_e2m1 with 8-value lookup table
  • Data structure: block_fp4_mmq (72 bytes per 128 values vs 136 bytes for Q8_1)
  • Conditional compilation guards: BLACKWELL_MMA_AVAILABLE with runtime detection

Implementation maintains backward compatibility through fallback to existing Q8_1 path on non-Blackwell hardware.

Key Findings

Power Consumption:

  • build.bin.llama-run: -1.52 nJ (-0.001%)
  • build.bin.libllama.so: -0.16 nJ (-0.0%)
  • build.bin.llama-tts: -0.77 nJ (-0.0%)
  • All other binaries: 0.0% change

Inference Impact:
No functions in the tokenization/inference pipeline (llama_decode, llama_encode, llama_tokenize) show measurable changes. The CUDA-specific modifications are isolated to GPU matrix multiplication paths and do not affect CPU-based inference measured in this analysis. Tokens per second remains unchanged.

Technical Context:
The PR author reports a ~10% regression in separate testing, which suggests the Blackwell-specific code paths require compilation with -DCMAKE_CUDA_ARCHITECTURES="120a" and Blackwell hardware to activate. The measured build was likely compiled without these flags, leaving the new code paths inactive and producing zero performance impact.

The implementation improves shared-memory efficiency by roughly 47% (0.56 vs 1.06 bytes per value) and introduces E8M0 block scaling for the tensor cores. Changes are confined to the ggml-cuda module, with no modifications to core llama inference functions.

@loci-dev loci-dev force-pushed the main branch 23 times, most recently from 4733ac4 to 18c8a27 Compare December 13, 2025 16:09
@loci-dev loci-dev force-pushed the main branch 17 times, most recently from c39aef9 to a014a6b Compare December 18, 2025 08:12