UPSTREAM PR #17990: HIP: Refactor mma for RDNA and CDNA #548

Open

loci-dev wants to merge 7 commits into main from upstream-PR17990-branch_zhang-hui-yulo-refactor_mma_for_rdna

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17990

Refactor mma.cuh for RDNA and CDNA, clean up the row-major and column-major matrix handling for future development such as FlashAttention (FA), and add a dual matrix type for RDNA3.
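To make the row-major vs. column-major distinction concrete, here is a minimal sketch of a layout-tagged tile whose index mapping is selected at compile time. All names here (`tile`, `DATA_LAYOUT_I_MAJOR`, `DATA_LAYOUT_J_MAJOR`) are hypothetical plain-C++ stand-ins, not the actual mma.cuh code:

```cpp
// Minimal sketch (hypothetical names, not the actual mma.cuh code) of a
// layout-tagged tile: the layout enum decides how a flat element index
// maps to (i, j) coordinates, resolved at compile time.
#include <cstdio>

enum data_layout {
    DATA_LAYOUT_I_MAJOR, // consecutive elements walk along a row
    DATA_LAYOUT_J_MAJOR, // consecutive elements walk down a column
};

template <int I, int J, data_layout layout>
struct tile {
    static constexpr int ne = I * J;

    // Row coordinate of flat element index l.
    static constexpr int get_i(int l) {
        return layout == DATA_LAYOUT_I_MAJOR ? l / J : l % I;
    }
    // Column coordinate of flat element index l.
    static constexpr int get_j(int l) {
        return layout == DATA_LAYOUT_I_MAJOR ? l % J : l / I;
    }
};

int main() {
    using tile_a = tile<16, 16, DATA_LAYOUT_I_MAJOR>;
    using tile_b = tile<16, 16, DATA_LAYOUT_J_MAJOR>;
    // The same flat index lands on transposed coordinates in the two layouts.
    std::printf("I-major: (%d, %d)\n", tile_a::get_i(18), tile_a::get_j(18)); // (1, 2)
    std::printf("J-major: (%d, %d)\n", tile_b::get_i(18), tile_b::get_j(18)); // (2, 1)
    return 0;
}
```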

CDNA isn't tested as I don't have a CDNA GPU. @JohannesGaessler, could you help run a quick test on your MI GPU? Honestly, I will probably also need your coding help to fix any bugs on CDNA. Thank you.

@loci-review

loci-review bot commented Dec 13, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #548

Overview

PR #548 refactors the AMD GPU matrix multiply-accumulate (MMA) infrastructure for the RDNA3, RDNA4, and CDNA architectures. The changes introduce a data-layout abstraction that replaces architecture-specific conditional compilation with template-based dispatch. The analysis shows zero performance impact, as the compared versions (50ec7e39 vs 7d958d88) produce functionally identical binaries.

Performance Metrics

All analyzed functions show 0% change in both response time and throughput:

  • main (llama-run): 216,113,040 ns response time, 283 ns throughput - no change
  • llama_decode: 733,766 ns response time, 70 ns throughput - no change
  • llama_new_context_with_model: 1,380,883 ns response time, 53 ns throughput - no change
  • llama_sampler_sample: 13,658 ns response time, 309 ns throughput - no change
  • ggml_backend_sched_graph_compute: 186,157 ns response time, 22 ns throughput - no change

Power consumption analysis across 16 binaries shows negligible variation (< 0.001%), with the largest change being +1.35 nJ in llama-run.

Code Changes

The PR modifies GPU-specific matrix operation primitives in three files (mma.cuh, mmf.cuh, mmq.cuh). Key changes include:

  • New data layout enumerations (DATA_LAYOUT_J_MAJOR, DATA_LAYOUT_I_MAJOR_DUAL) for AMD architecture requirements
  • Corrected 16x16 tile indexing (swapped get_i/get_j implementations)
  • Unified memory loading logic with layout-aware dispatch
  • Template-parameterized MMA functions for compile-time layout selection

These changes affect AMD GPU code paths only and do not modify CPU inference or tokenization functions.
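As an illustration of the layout-aware dispatch listed above, here is a minimal host-side sketch. The names `load_tile` and `DATA_LAYOUT_I_MAJOR` are assumptions for illustration; only `DATA_LAYOUT_J_MAJOR` and `DATA_LAYOUT_I_MAJOR_DUAL` are named in the PR, and the real code lives in device headers (mma.cuh/mmq.cuh), so this is a simplification rather than the actual implementation:

```cpp
// Minimal sketch (hypothetical names; not the actual mmq.cuh/mmf.cuh code) of
// layout-aware loading: the data layout is a template parameter, so the branch
// is resolved at compile time instead of via architecture-specific #ifdefs.
#include <cstdio>

enum data_layout { DATA_LAYOUT_I_MAJOR, DATA_LAYOUT_J_MAJOR, DATA_LAYOUT_I_MAJOR_DUAL };

template <data_layout layout>
void load_tile(float *dst, const float *src, int stride, int rows, int cols) {
    if constexpr (layout == DATA_LAYOUT_J_MAJOR) {
        // Column-major destination: consecutive dst elements walk down a column.
        for (int j = 0; j < cols; ++j)
            for (int i = 0; i < rows; ++i)
                dst[j*rows + i] = src[i*stride + j];
    } else {
        // Row-major (I-major) destination. A "dual" variant on RDNA3 would
        // additionally replicate the tile for the second half of the wave
        // (omitted here).
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                dst[i*cols + j] = src[i*stride + j];
    }
}

int main() {
    float src[16*16], a[16*16], b[16*16];
    for (int k = 0; k < 16*16; ++k) src[k] = (float) k;
    load_tile<DATA_LAYOUT_I_MAJOR>(a, src, 16, 16, 16);
    load_tile<DATA_LAYOUT_J_MAJOR>(b, src, 16, 16, 16);
    std::printf("a[1] = %.0f, b[1] = %.0f\n", a[1], b[1]); // 1 vs 16
    return 0;
}
```

Resolving the layout as a template parameter keeps a single code path while letting the compiler eliminate the untaken branch, which is the stated goal of replacing the architecture-specific conditional compilation.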

Inference Impact

Tokens per second: No impact. The tokenization and inference functions (llama_decode, llama_encode, llama_tokenize) show zero response-time changes. In the reference model, a 2 ms llama_decode slowdown corresponds to roughly a 7% tokens/sec reduction; since the measured change here is 0 ns, no inference performance degradation occurs.
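Spelled out with the figures above (the 7% / 2 ms sensitivity is the reference model used by this analysis, not a measurement from this PR):

$$\text{tokens/sec reduction} \approx \frac{\Delta t_{\text{llama\_decode}}}{2\,\text{ms}} \times 7\% = \frac{0\,\text{ns}}{2\,\text{ms}} \times 7\% = 0\%$$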

Impacted functions for tokens/sec: None - llama_decode, llama_encode, and llama_tokenize maintain identical performance.

Power consumption: All binaries show stable power profiles with changes below measurement noise. Impacted binaries: build.bin.llama-run (+0.001%), build.bin.libllama.so (-0.0%), all others (0.0%).

loci-dev force-pushed the main branch 20 times, most recently from 799183f to 26e8fe3, on December 16, 2025 07:11
loci-dev force-pushed the main branch 19 times, most recently from 048ad94 to 6c1fde6, on February 3, 2026 13:32
loci-dev force-pushed the main branch 8 times, most recently from 823244c to bab7d39, on February 19, 2026 02:17
loci-dev force-pushed the main branch 3 times, most recently from 9ea4a65 to c001e9f, on February 22, 2026 02:17