UPSTREAM PR #17990: HIP: Refactor mma for RDNA and CDNA #548

Open

loci-dev wants to merge 7 commits into main from upstream-PR17990-branch_zhang-hui-yulo-refactor_mma_for_rdna

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17990

Refactor mma.cuh for RDNA and CDNA, clean up the row-major and column-major matrix handling for future development such as FlashAttention (FA), and add a dual matrix type for RDNA3.
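To make the row-major vs. column-major distinction concrete, here is a minimal sketch of a layout-tagged tile whose index mapping is selected at compile time. All names here (`tile`, `DATA_LAYOUT_I_MAJOR`, `DATA_LAYOUT_J_MAJOR`) are hypothetical plain-C++ stand-ins, not the actual mma.cuh code:

```cpp
// Minimal sketch (hypothetical names, not the actual mma.cuh code) of a
// layout-tagged tile: the layout enum decides how a flat element index
// maps to (i, j) coordinates, resolved at compile time.
#include <cstdio>

enum data_layout {
    DATA_LAYOUT_I_MAJOR, // consecutive elements walk along a row
    DATA_LAYOUT_J_MAJOR, // consecutive elements walk down a column
};

template <int I, int J, data_layout layout>
struct tile {
    static constexpr int ne = I * J;

    // Row coordinate of flat element index l.
    static constexpr int get_i(int l) {
        return layout == DATA_LAYOUT_I_MAJOR ? l / J : l % I;
    }
    // Column coordinate of flat element index l.
    static constexpr int get_j(int l) {
        return layout == DATA_LAYOUT_I_MAJOR ? l % J : l / I;
    }
};

int main() {
    using tile_a = tile<16, 16, DATA_LAYOUT_I_MAJOR>;
    using tile_b = tile<16, 16, DATA_LAYOUT_J_MAJOR>;
    // The same flat index lands on transposed coordinates in the two layouts.
    std::printf("I-major: (%d, %d)\n", tile_a::get_i(18), tile_a::get_j(18)); // (1, 2)
    std::printf("J-major: (%d, %d)\n", tile_b::get_i(18), tile_b::get_j(18)); // (2, 1)
    return 0;
}
```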

CDNA isn't tested as I don't have a CDNA GPU. @JohannesGaessler, could you help run a quick test on your MI GPU? Honestly, I will probably also need your coding help to fix any bugs on CDNA. Thank you.

@loci-review

loci-review bot commented Dec 13, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #548

Overview

PR #548 refactors the AMD GPU matrix multiply-accumulate (MMA) infrastructure for the RDNA3, RDNA4, and CDNA architectures. The changes introduce a data-layout abstraction that replaces architecture-specific conditional compilation with template-based dispatch. The analysis shows zero performance impact, as the compared versions (50ec7e39 vs 7d958d88) produce functionally identical binaries.

Performance Metrics

All analyzed functions show 0% change in both response time and throughput:

  • main (llama-run): 216,113,040 ns response time, 283 ns throughput - no change
  • llama_decode: 733,766 ns response time, 70 ns throughput - no change
  • llama_new_context_with_model: 1,380,883 ns response time, 53 ns throughput - no change
  • llama_sampler_sample: 13,658 ns response time, 309 ns throughput - no change
  • ggml_backend_sched_graph_compute: 186,157 ns response time, 22 ns throughput - no change

Power consumption analysis across 16 binaries shows negligible variation (< 0.001%), with the largest change being +1.35 nJ in llama-run.

Code Changes

The PR modifies GPU-specific matrix operation primitives in three files (mma.cuh, mmf.cuh, mmq.cuh). Key changes include:

  • New data layout enumerations (DATA_LAYOUT_J_MAJOR, DATA_LAYOUT_I_MAJOR_DUAL) for AMD architecture requirements
  • Corrected 16x16 tile indexing (swapped get_i/get_j implementations)
  • Unified memory loading logic with layout-aware dispatch
  • Template-parameterized MMA functions for compile-time layout selection

These changes affect AMD GPU code paths only and do not modify CPU inference or tokenization functions.
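As an illustration of the layout-aware dispatch listed above, here is a minimal host-side sketch. The names `load_tile` and `DATA_LAYOUT_I_MAJOR` are assumptions for illustration; only `DATA_LAYOUT_J_MAJOR` and `DATA_LAYOUT_I_MAJOR_DUAL` are named in the PR, and the real code lives in device headers (mma.cuh/mmq.cuh), so this is a simplification rather than the actual implementation:

```cpp
// Minimal sketch (hypothetical names; not the actual mmq.cuh/mmf.cuh code) of
// layout-aware loading: the data layout is a template parameter, so the branch
// is resolved at compile time instead of via architecture-specific #ifdefs.
#include <cstdio>

enum data_layout { DATA_LAYOUT_I_MAJOR, DATA_LAYOUT_J_MAJOR, DATA_LAYOUT_I_MAJOR_DUAL };

template <data_layout layout>
void load_tile(float *dst, const float *src, int stride, int rows, int cols) {
    if constexpr (layout == DATA_LAYOUT_J_MAJOR) {
        // Column-major destination: consecutive dst elements walk down a column.
        for (int j = 0; j < cols; ++j)
            for (int i = 0; i < rows; ++i)
                dst[j*rows + i] = src[i*stride + j];
    } else {
        // Row-major (I-major) destination. A "dual" variant on RDNA3 would
        // additionally replicate the tile for the second half of the wave
        // (omitted here).
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < cols; ++j)
                dst[i*cols + j] = src[i*stride + j];
    }
}

int main() {
    float src[16*16], a[16*16], b[16*16];
    for (int k = 0; k < 16*16; ++k) src[k] = (float) k;
    load_tile<DATA_LAYOUT_I_MAJOR>(a, src, 16, 16, 16);
    load_tile<DATA_LAYOUT_J_MAJOR>(b, src, 16, 16, 16);
    std::printf("a[1] = %.0f, b[1] = %.0f\n", a[1], b[1]); // 1 vs 16
    return 0;
}
```

Resolving the layout as a template parameter keeps a single code path while letting the compiler eliminate the untaken branch, which is the stated goal of replacing the architecture-specific conditional compilation.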

Inference Impact

Tokens per second: No impact. The tokenization and inference functions (llama_decode, llama_encode, llama_tokenize) show zero response-time changes. In the reference model, a 2 ms llama_decode slowdown corresponds to roughly a 7% tokens/sec reduction; since the measured change here is 0 ns, no inference performance degradation occurs.
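Spelled out with the figures above (the 7% / 2 ms sensitivity is the reference model used by this analysis, not a measurement from this PR):

$$\text{tokens/sec reduction} \approx \frac{\Delta t_{\text{llama\_decode}}}{2\,\text{ms}} \times 7\% = \frac{0\,\text{ns}}{2\,\text{ms}} \times 7\% = 0\%$$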

Impacted functions for tokens/sec: None - llama_decode, llama_encode, and llama_tokenize maintain identical performance.

Power consumption: All binaries show stable power profiles with changes below measurement noise. Impacted binaries: build.bin.llama-run (+0.001%), build.bin.libllama.so (-0.0%), all others (0.0%).

loci-dev force-pushed the main branch 20 times, most recently from 799183f to 26e8fe3, on December 16, 2025 07:11
loci-dev force-pushed the main branch 19 times, most recently from 048ad94 to 6c1fde6, on February 3, 2026 13:32
loci-dev force-pushed the main branch 8 times, most recently from 823244c to bab7d39, on February 19, 2026 02:17
loci-dev force-pushed the main branch 3 times, most recently from 9ea4a65 to c001e9f, on February 22, 2026 02:17