
UPSTREAM PR #17906: CUDA: experimental native mxfp4 support for blackwell [WIP]#511

Open
loci-dev wants to merge 13 commits into main from upstream-PR17906-branch_am17an-mxfp4

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17906

Currently WIP; trying to add native FP4 support for Blackwell and beyond. To compile, -DCMAKE_CUDA_ARCHITECTURES="120a" is required.

Blackwell has an m16n8k64 instruction for 4-bit types (mxfp4, nvfp4, and int4) which advertises 2x throughput compared to int8 tensor cores. However, at the moment this PR is actually ~10% slower than master. The other issue is that we quantize activations to mxfp4 instead of q8, which leads to failures in test-backend-ops; PPL tests are okay with this change, though that does not rule out correctness issues.

TODO:

  • Figure out why it is slower
  • Address NMSE error between q8_0 and mxfp4

@loci-review

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #511

Analysis Scope: Comparison of version c61be74f-dd72-460d-860d-190c12fbb769 against baseline 0ccf3297-383d-42ee-a160-41f01785828a for llama.cpp project.


Summary

This PR introduces experimental NVIDIA Blackwell native FP4 tensor core support for MXFP4 quantization. The implementation adds 455 lines across 7 CUDA files, targeting compute capability 10.0 with CUDA 12.8+. Performance analysis shows zero measurable impact on compiled binaries, with power consumption remaining stable across all components. The changes are conditionally compiled and only activate on Blackwell hardware, leaving existing inference paths unaffected.

Key Findings

Performance Metrics Analysis

No function-level performance data was available for comparison between the two versions. The summary report returned no functions meeting the specified filters, indicating the changes have not yet been compiled into the analyzed binaries or the analysis predates the code modifications.

Power Consumption Analysis

Power consumption measurements across all 16 binaries show negligible to zero change:

Binaries with Measurable Change:

  • build.bin.llama-tts: 254782 nJ → 254780 nJ (2 nJ reduction)
  • build.bin.llama-run: 220435 nJ → 220435 nJ (negligible)
  • build.bin.llama-cvector-generator: 250359 nJ → 250359 nJ (negligible)
  • build.bin.libllama.so: 195276 nJ → 195276 nJ (negligible)

Unchanged Binaries:
All GGML core libraries (libggml-base.so, libggml-cpu.so, libggml.so), utility binaries (llama-bench, llama-quantize, llama-tokenize), and multimodal CLI tools show zero power consumption change.

Code Implementation Analysis

The PR implements Blackwell-specific matrix multiplication paths:

New Components:

  • block_fp4_mmq structure: 68 bytes per block (50% reduction vs block_q8_1_mmq)
  • quantize_mmq_mxfp4 kernel: Converts FP32 activations to 4-bit E2M1 format with E8M0 scaling
  • vec_dot_mxfp4_mxfp4_mma: Utilizes m16n8k64 tensor core instruction
  • mma_block_scaled: PTX wrapper for native FP4 matrix multiplication

Dispatch Logic:
Runtime checks via blackwell_mma_available(cc) ensure MXFP4 path activates only on compute capability 100 hardware. Non-Blackwell systems continue using existing Q8_1 quantization paths without modification.

Impact on Inference Performance

Tokens Per Second: No impact on current inference performance. The tokenization and inference functions (llama_decode, llama_encode, llama_tokenize) show no response time or throughput changes in the analyzed binaries. The MXFP4 implementation is hardware-gated and does not execute on the analyzed system configuration.

Impacted Functions: None in the current analysis. The conditional compilation guards prevent MXFP4 code paths from affecting existing inference pipelines on non-Blackwell hardware.

Binary-Level Changes

The stable power consumption across all binaries indicates the code changes have not altered the compiled instruction sequences for the analyzed build configuration. This is consistent with the conditional compilation approach where Blackwell-specific code is excluded when targeting older compute capabilities.

@loci-dev loci-dev force-pushed the main branch 2 times, most recently from b29e20d to 0e7b989 Compare December 10, 2025 16:11
@loci-review

loci-review bot commented Dec 10, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #511

Analysis Scope: CUDA experimental native MXFP4 support for Blackwell architecture across 7 modified files (425 additions, 15 deletions).

Overview

Performance analysis shows effectively zero measurable impact across all binaries in the current build. Power consumption changes range from 0.0% to 0.001%, with the largest absolute change being 1.52 nJ reduction in build.bin.llama-run. No function-level performance data was available, indicating either identical binary outputs or changes not yet active in the measured configuration.

Code Changes

The PR introduces Blackwell GPU (compute capability 10.0) support for native FP4 tensor cores, adding:

  • New quantization path: quantize_mmq_mxfp4_cuda for activation quantization to MXFP4 format
  • Tensor core instruction: mma.sync.aligned.kind::mxf4.block_scale for m16n8k64 operations
  • FP4 E2M1 conversion: ggml_cuda_float_to_fp4_e2m1 with 8-value lookup table
  • Data structure: block_fp4_mmq (72 bytes per 128 values vs 136 bytes for Q8_1)
  • Conditional compilation guards: BLACKWELL_MMA_AVAILABLE with runtime detection

Implementation maintains backward compatibility through fallback to existing Q8_1 path on non-Blackwell hardware.

Key Findings

Power Consumption:

  • build.bin.llama-run: -1.52 nJ (-0.001%)
  • build.bin.libllama.so: -0.16 nJ (-0.0%)
  • build.bin.llama-tts: -0.77 nJ (-0.0%)
  • All other binaries: 0.0% change

Inference Impact:
No functions in the tokenization/inference pipeline (llama_decode, llama_encode, llama_tokenize) show measurable changes. The CUDA-specific modifications are isolated to GPU matrix multiplication paths and do not affect CPU-based inference measured in this analysis. Tokens per second remains unchanged.

Technical Context:
The PR author reports a ~10% regression in separate testing, which suggests the Blackwell-specific code paths require compilation with -DCMAKE_CUDA_ARCHITECTURES="120a" and Blackwell hardware to activate. The measured build was likely compiled without these flags, leaving the new code paths inactive and producing zero performance impact.

The implementation improves shared-memory efficiency by roughly 47% (0.56 vs 1.06 bytes per value) and introduces E8M0 block scaling for the tensor cores. Changes are confined to the ggml-cuda module, with no modifications to core llama inference functions.

@loci-dev loci-dev force-pushed the main branch 23 times, most recently from 4733ac4 to 18c8a27 Compare December 13, 2025 16:09
@loci-dev loci-dev force-pushed the main branch 17 times, most recently from c39aef9 to a014a6b Compare December 18, 2025 08:12