
UPSTREAM PR #17502: HIP: Patch failed testcase in WMMA-MMQ kernels for RDNA 4 #326

Open
loci-dev wants to merge 2 commits into main from
upstream-PR17502-branch_jiachengjason-feat/jiachengjason/enable_mmq_kernels_for_RDNA4

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17502

  1. Patched the failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) when enabling WMMA on RDNA4 (verified that all test cases pass when running ./build/bin/test-backend-ops test -o MUL_MAT).

  2. Quick cleanup of mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162.

for ggml-org/llama.cpp#17156

@loci-review

loci-review bot commented Nov 25, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary: PR #326

Analysis Scope: Comparing version 6b1d7254-fefa-44ba-8e76-256501ca6ef9 against baseline aab9b31c-ad35-48ba-b9fe-4c0fd3dc2df2

Condition Assessment: Condition 1 applies - no measurable performance changes detected.

PR #326 introduces AMD RDNA4 WMMA support through two targeted code changes in CUDA kernel files. Analysis of 15 performance-critical functions across 16 binaries shows zero measurable impact on Response Time, Throughput Time, and power consumption. All functions report is_modified: false, indicating the changes are conditionally compiled and inactive in the analyzed build configuration.

Code Changes:

  • ggml/src/ggml-cuda/mma.cuh: Added type-based branching for FP16/BF16 tile loading using ggml_cuda_memcpy_1
  • ggml/src/ggml-cuda/mmq.cuh: Added amd_wmma_available(cc) to shared memory calculation
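The type-based branching described above can be sketched host-side as follows. This is a minimal illustration, not the actual kernel code: the struct definitions, the load_tile_element helper, and the std::memcpy-backed ggml_cuda_memcpy_1 stand-in are all assumptions made for demonstration; the real ggml_cuda_memcpy_1 is a device-side helper in mma.cuh.

```cpp
#include <cstring>
#include <cstdint>
#include <cassert>

// Hypothetical host-side stand-ins for the CUDA vector types
// (the real types are half2 and __nv_bfloat162 / their HIP equivalents).
struct half2_t     { uint16_t x, y; };
struct bfloat162_t { uint16_t x, y; };

// Sketch of ggml_cuda_memcpy_1: a fixed-size copy of nbytes, modeled
// here with std::memcpy; in the kernel this is a device helper.
template <int nbytes>
static void ggml_cuda_memcpy_1(void * dst, const void * src) {
    std::memcpy(dst, src, nbytes);
}

// Type-based branch when loading a tile element: 4-byte FP16/BF16
// pairs go through the byte-copy path, other types are assigned directly.
template <typename T>
static void load_tile_element(T * dst, const T * src) {
    if constexpr (sizeof(T) == 4) {
        ggml_cuda_memcpy_1<4>(dst, src);  // half2 / bfloat162 path
    } else {
        *dst = *src;                      // fallback for other element types
    }
}
```

The branch is resolved at compile time via if constexpr, so the fallback path adds no runtime cost for the FP16/BF16 instantiations.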

Performance Metrics:

  • Power consumption: 12 of 16 binaries unchanged; 4 binaries show sub-nanojoule deltas
  • Core inference functions unchanged: llama_decode (44,752,296 ns), llama_encode (11,253,996 ns), llama_tokenize (899,199 ns)
  • Tokens per second: No impact - inference functions show identical response times

Conclusion: Changes are architecture-specific correctness fixes with no runtime impact on current build. RDNA4-targeted builds would benefit from enabled WMMA functionality.

@loci-dev loci-dev force-pushed the main branch 26 times, most recently from 89ba2e9 to e4a4e1d Compare November 30, 2025 00:39
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 47d1dc9 to 297c352 Compare December 4, 2025 11:09
