
UPSTREAM PR #17817: HIP: fix RDNA3 FP16/BF16 matrix multiplication#467

Open
loci-dev wants to merge 1 commit into main from upstream-PR17817-branch_JohannesGaessler-hip-fix-rdna3-mmf

Conversation

@loci-dev commented Dec 6, 2025

Mirrored from ggml-org/llama.cpp#17817

Fixes ggml-org/llama.cpp#17797 by simply adding an explicit RDNA4 requirement to MMF. @jiachengjason as outlined in https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#pull-requests-for-contributors--collaborators , please test changes to the CUDA/HIP backend for correctness using test-backend-ops.

loci-review bot commented Dec 6, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #467

Overview

PR #467 introduces a hardware-specific correctness fix for AMD RDNA3 GPUs in the HIP backend, restricting FP16/BF16 WMMA operations to the RDNA4 architecture only. The change modifies ggml_cuda_should_use_mmf() in ggml/src/ggml-cuda/mmf.cu, adding two lines and removing two.

Performance Impact Assessment

Function-Level Analysis:
No function-level performance data available for the specified version comparison. The summary report returned no measurable changes in response time or throughput metrics across analyzed functions.

Power Consumption Analysis:
All 16 binaries show 0.0% change in estimated power consumption between versions:

  • build.bin.libllama.so: 193,963 nJ (no change)
  • build.bin.llama-tts: 253,823 nJ (no change)
  • build.bin.llama-cvector-generator: 249,105 nJ (no change)
  • build.bin.llama-run: 218,706 nJ (no change)
  • All other binaries: unchanged

Inference Performance:
No impact detected on core inference functions (llama_decode, llama_encode, llama_tokenize). The change affects GPU kernel selection logic within the GGML backend, specifically for AMD RDNA3 hardware. Since the analysis shows 0.0% power consumption change and no function-level metric variations, tokens per second remains unaffected in the measured test environment.

Hardware-Specific Behavior:
The modification adds an architecture check GGML_CUDA_CC_IS_RDNA4(cc) to AMD WMMA availability conditions for FP16 and BF16 types. This prevents RDNA3 GPUs from using WMMA instructions that produce incorrect results, forcing fallback to alternative matrix multiplication implementations. The change is isolated to AMD GPU code paths and does not affect NVIDIA GPU execution or CPU backend operations.

Conclusion:
The analysis indicates functional equivalence between versions with no measurable performance differences. The code change addresses a correctness issue on specific AMD hardware without impacting the measured test environment, which likely uses NVIDIA GPUs or CPU-only execution where this modification has no effect.

@loci-dev force-pushed the main branch 26 times, most recently from 6d9272a to 4ca17fb on December 9, 2025 at 10:10
@loci-dev force-pushed the main branch 30 times, most recently from adf9533 to 7103504 on December 14, 2025 at 14:07
