
UPSTREAM PR #17817: HIP: fix RDNA3 FP16/BF16 matrix multiplication#467

Open
loci-dev wants to merge 1 commit into main from upstream-PR17817-branch_JohannesGaessler-hip-fix-rdna3-mmf

Conversation

@loci-dev commented Dec 6, 2025

Mirrored from ggml-org/llama.cpp#17817

Fixes ggml-org/llama.cpp#17797 by simply adding an explicit RDNA4 requirement to MMF. @jiachengjason as outlined in https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md#pull-requests-for-contributors--collaborators , please test changes to the CUDA/HIP backend for correctness using test-backend-ops.

loci-review bot commented Dec 6, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #467

Overview

PR #467 introduces a hardware-specific correctness fix for AMD RDNA3 GPUs in the HIP backend, restricting FP16/BF16 WMMA operations to the RDNA4 architecture only. The change modifies ggml_cuda_should_use_mmf() in ggml/src/ggml-cuda/mmf.cu, adding two lines and removing two.

Performance Impact Assessment

Function-Level Analysis:
No function-level performance data available for the specified version comparison. The summary report returned no measurable changes in response time or throughput metrics across analyzed functions.

Power Consumption Analysis:
All 16 binaries show 0.0% change in estimated power consumption between versions:

  • build.bin.libllama.so: 193,963 nJ (no change)
  • build.bin.llama-tts: 253,823 nJ (no change)
  • build.bin.llama-cvector-generator: 249,105 nJ (no change)
  • build.bin.llama-run: 218,706 nJ (no change)
  • All other binaries: unchanged

Inference Performance:
No impact detected on core inference functions (llama_decode, llama_encode, llama_tokenize). The change affects GPU kernel selection logic within the GGML backend, specifically for AMD RDNA3 hardware. Since the analysis shows 0.0% power consumption change and no function-level metric variations, tokens per second remains unaffected in the measured test environment.

Hardware-Specific Behavior:
The modification adds an architecture check GGML_CUDA_CC_IS_RDNA4(cc) to AMD WMMA availability conditions for FP16 and BF16 types. This prevents RDNA3 GPUs from using WMMA instructions that produce incorrect results, forcing fallback to alternative matrix multiplication implementations. The change is isolated to AMD GPU code paths and does not affect NVIDIA GPU execution or CPU backend operations.

Conclusion:
The analysis indicates functional equivalence between versions with no measurable performance differences. The code change addresses a correctness issue on specific AMD hardware without impacting the measured test environment, which likely uses NVIDIA GPUs or CPU-only execution where this modification has no effect.

@loci-dev force-pushed the main branch 26 times, most recently from 6d9272a to 4ca17fb on December 9, 2025 at 10:10
@loci-dev force-pushed the main branch 30 times, most recently from adf9533 to 7103504 on December 14, 2025 at 14:07
