
UPSTREAM PR #17502: HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 #326

Open
loci-dev wants to merge 2 commits into main from
upstream-PR17502-branch_jiachengjason-feat/jiachengjason/enable_mmq_kernels_for_RDNA4

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17502

  1. Patched the failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) when enabling WMMA on RDNA4 (verified that all test cases pass when running ./build/bin/test-backend-ops test -o MUL_MAT).

  2. Quick cleanup of mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162.

for ggml-org/llama.cpp#17156

@loci-review

loci-review bot commented Nov 25, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary: PR #326

Analysis Scope: Comparing version 6b1d7254-fefa-44ba-8e76-256501ca6ef9 against baseline aab9b31c-ad35-48ba-b9fe-4c0fd3dc2df2

Condition Assessment: Condition 1 applies - no measurable performance changes detected.

PR #326 introduces AMD RDNA4 WMMA support through two targeted code changes in CUDA kernel files. Analysis of 15 performance-critical functions across 16 binaries shows zero measurable impact on Response Time, Throughput Time, and power consumption. All functions report is_modified: false, indicating the changes are conditionally compiled and inactive in the analyzed build configuration.

Code Changes:

  • ggml/src/ggml-cuda/mma.cuh: Added type-based branching for FP16/BF16 tile loading using ggml_cuda_memcpy_1
  • ggml/src/ggml-cuda/mmq.cuh: Added amd_wmma_available(cc) to shared memory calculation
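The type-based branching described above can be sketched host-side as follows. This is a minimal illustration, not the actual kernel code: the struct definitions, the load_tile_element helper, and the std::memcpy-backed ggml_cuda_memcpy_1 stand-in are all assumptions made for demonstration; the real ggml_cuda_memcpy_1 is a device-side helper in mma.cuh.

```cpp
#include <cstring>
#include <cstdint>
#include <cassert>

// Hypothetical host-side stand-ins for the CUDA vector types
// (the real types are half2 and __nv_bfloat162 / their HIP equivalents).
struct half2_t     { uint16_t x, y; };
struct bfloat162_t { uint16_t x, y; };

// Sketch of ggml_cuda_memcpy_1: a fixed-size copy of nbytes, modeled
// here with std::memcpy; in the kernel this is a device helper.
template <int nbytes>
static void ggml_cuda_memcpy_1(void * dst, const void * src) {
    std::memcpy(dst, src, nbytes);
}

// Type-based branch when loading a tile element: 4-byte FP16/BF16
// pairs go through the byte-copy path, other types are assigned directly.
template <typename T>
static void load_tile_element(T * dst, const T * src) {
    if constexpr (sizeof(T) == 4) {
        ggml_cuda_memcpy_1<4>(dst, src);  // half2 / bfloat162 path
    } else {
        *dst = *src;                      // fallback for other element types
    }
}
```

The branch is resolved at compile time via if constexpr, so the fallback path adds no runtime cost for the FP16/BF16 instantiations.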

Performance Metrics:

  • Power consumption: 12 of 16 binaries unchanged; 4 binaries show sub-nanojoule deltas
  • Core inference functions unchanged: llama_decode (44,752,296 ns), llama_encode (11,253,996 ns), llama_tokenize (899,199 ns)
  • Tokens per second: No impact - inference functions show identical response times

Conclusion: Changes are architecture-specific correctness fixes with no runtime impact on current build. RDNA4-targeted builds would benefit from enabled WMMA functionality.

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from 89ba2e9 to e4a4e1d Compare November 30, 2025 00:39
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 47d1dc9 to 297c352 Compare December 4, 2025 11:09
