Skip to content

UPSTREAM PR #17156: HIP: WMMA-MMQ kernels for RDNA 4#163

Open
DajanaV wants to merge 14 commits intomainfrom
upstream-PR17156-branch_jiachengjason-feat/jiachengjason/enable_mmq_kernels_for_RDNA4
Open

UPSTREAM PR #17156: HIP: WMMA-MMQ kernels for RDNA 4#163
DajanaV wants to merge 14 commits intomainfrom
upstream-PR17156-branch_jiachengjason-feat/jiachengjason/enable_mmq_kernels_for_RDNA4

Conversation

@DajanaV
Copy link
Collaborator

@DajanaV DajanaV commented Nov 10, 2025

Mirrored from ggml-org/llama.cpp#17156

Enabled WMMA-MMQ kernels for RDNA 4 architecture on AMD GPUs

Following similar approach to ggml-org/llama.cpp#14624

Using ./build/bin/llama-bench to collect the following performance results

Performance results with ggml/llama.cpp master commit up to/includes 5b180c3

Build command for the following performance results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_CUDA_FORCE_MMQ=OFF -DGGML_HIP_UMA=OFF -DGGML_HIP_ROCWMMA_FATTN=OFF -DGPU_TARGETS="gfx1201" -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 32

image

Build command for the following performance results:
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" cmake -S . -B build -DGGML_HIP=ON -DGGML_HIP_UMA=OFF -DGGML_HIP_ROCWMMA_FATTN=ON -DGPU_TARGETS=gfx1201 -DGGML_HIP_GRAPHS=OFF -DLLAMA_CURL=OFF -DGGML_CUDA_FORCE_CUBLAS=OFF -DCMAKE_BUILD_TYPE=Release && cmake --build build --config Release -- -j 32

image

@loci-review
Copy link

loci-review bot commented Nov 10, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: WMMA-MMQ Kernels for RDNA 4

Overview

Analysis of PR #163 shows minimal performance impact on CPU inference paths, with changes focused on GPU acceleration for AMD RDNA 4 architecture. The highest measured performance changes are within statistical noise levels and unrelated to the core GPU optimizations introduced.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::pow function (-0.066%, 107.60 ns vs 107.68 ns baseline)
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, 23.52 ns vs 23.56 ns baseline)
  • Core inference functions unaffected: No changes detected in llama_decode, llama_encode, or llama_tokenize
  • Token throughput impact: Zero impact on CPU inference performance as critical tokenization/inference functions show no measurable changes

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete elimination (-100%, binary removed/disabled)
  • Core binaries (libllama.so, llama-run, llama-tts): Negligible changes (<0.001%)
  • Overall power consumption remains stable across inference binaries

Code Analysis:

  • Flame Graph: std::pow shows simple two-level execution (107 ns total, 7 ns PLT overhead) with no complexity changes
  • CFG Comparison: Identical control flow and assembly code between versions, confirming performance variations are environmental rather than code-related
  • GitHub Review: Well-structured GPU optimization adding WMMA support for RDNA 4 without affecting CPU paths

Impact Assessment:
The PR successfully adds RDNA 4 WMMA acceleration through conditional compilation (#if defined(AMD_WMMA_AVAILABLE)), ensuring zero impact on non-RDNA 4 systems. Changes span GPU kernel implementations in ggml/src/ggml-cuda/ without touching core CPU inference logic.

Actionable Recommendations:

  • Monitor RDNA 4 GPU performance validation to ensure WMMA kernels deliver expected improvements
  • Verify build system changes don't introduce configuration complexity for non-AMD builds

The analysis confirms this is a targeted GPU optimization with no CPU performance regressions and proper architectural isolation between GPU and CPU code paths.

2 similar comments
@loci-review
Copy link

loci-review bot commented Nov 10, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: WMMA-MMQ Kernels for RDNA 4

Overview

Analysis of PR #163 shows minimal performance impact on CPU inference paths, with changes focused on GPU acceleration for AMD RDNA 4 architecture. The highest measured performance changes are within statistical noise levels and unrelated to the core GPU optimizations introduced.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::pow function (-0.066%, 107.60 ns vs 107.68 ns baseline)
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, 23.52 ns vs 23.56 ns baseline)
  • Core inference functions unaffected: No changes detected in llama_decode, llama_encode, or llama_tokenize
  • Token throughput impact: Zero impact on CPU inference performance as critical tokenization/inference functions show no measurable changes

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete elimination (-100%, binary removed/disabled)
  • Core binaries (libllama.so, llama-run, llama-tts): Negligible changes (<0.001%)
  • Overall power consumption remains stable across inference binaries

Code Analysis:

  • Flame Graph: std::pow shows simple two-level execution (107 ns total, 7 ns PLT overhead) with no complexity changes
  • CFG Comparison: Identical control flow and assembly code between versions, confirming performance variations are environmental rather than code-related
  • GitHub Review: Well-structured GPU optimization adding WMMA support for RDNA 4 without affecting CPU paths

Impact Assessment:
The PR successfully adds RDNA 4 WMMA acceleration through conditional compilation (#if defined(AMD_WMMA_AVAILABLE)), ensuring zero impact on non-RDNA 4 systems. Changes span GPU kernel implementations in ggml/src/ggml-cuda/ without touching core CPU inference logic.

Actionable Recommendations:

  • Monitor RDNA 4 GPU performance validation to ensure WMMA kernels deliver expected improvements
  • Verify build system changes don't introduce configuration complexity for non-AMD builds

The analysis confirms this is a targeted GPU optimization with no CPU performance regressions and proper architectural isolation between GPU and CPU code paths.

@loci-review
Copy link

loci-review bot commented Nov 10, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: WMMA-MMQ Kernels for RDNA 4

Overview

Analysis of PR #163 shows minimal performance impact on CPU inference paths, with changes focused on GPU acceleration for AMD RDNA 4 architecture. The highest measured performance changes are within statistical noise levels and unrelated to the core GPU optimizations introduced.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::pow function (-0.066%, 107.60 ns vs 107.68 ns baseline)
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, 23.52 ns vs 23.56 ns baseline)
  • Core inference functions unaffected: No changes detected in llama_decode, llama_encode, or llama_tokenize
  • Token throughput impact: Zero impact on CPU inference performance as critical tokenization/inference functions show no measurable changes

Power Consumption Analysis:

  • build.bin.llama-cvector-generator: Complete elimination (-100%, binary removed/disabled)
  • Core binaries (libllama.so, llama-run, llama-tts): Negligible changes (<0.001%)
  • Overall power consumption remains stable across inference binaries

Code Analysis:

  • Flame Graph: std::pow shows simple two-level execution (107 ns total, 7 ns PLT overhead) with no complexity changes
  • CFG Comparison: Identical control flow and assembly code between versions, confirming performance variations are environmental rather than code-related
  • GitHub Review: Well-structured GPU optimization adding WMMA support for RDNA 4 without affecting CPU paths

Impact Assessment:
The PR successfully adds RDNA 4 WMMA acceleration through conditional compilation (#if defined(AMD_WMMA_AVAILABLE)), ensuring zero impact on non-RDNA 4 systems. Changes span GPU kernel implementations in ggml/src/ggml-cuda/ without touching core CPU inference logic.

Actionable Recommendations:

  • Monitor RDNA 4 GPU performance validation to ensure WMMA kernels deliver expected improvements
  • Verify build system changes don't introduce configuration complexity for non-AMD builds

The analysis confirms this is a targeted GPU optimization with no CPU performance regressions and proper architectural isolation between GPU and CPU code paths.

@loci-review
Copy link

loci-review bot commented Nov 11, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 97b1911f compared to baseline 05d8e46a reveals minimal performance variations with no meaningful impact on core inference functionality. The changes primarily involve GPU optimization infrastructure for AMD RDNA 4 architecture through WMMA kernel enablement.

Key Findings

Performance Metrics:

  • Highest Response Time change: llm_graph_input_out_ids::can_reuse() with -0.096% improvement (65.164 ns → 65.101 ns)
  • Highest Throughput change: std::make_unique<llm_graph_input_pos_bucket>() with +0.117% degradation (104.328 ns → 104.450 ns)
  • Neither function affects core inference paths (llama_decode, llama_encode, llama_tokenize)

Core Function Impact:
No changes detected in critical inference functions. The measured variations occur in graph optimization utilities that do not directly impact token processing throughput. Based on the reference model performance (7% tokens/second reduction per 2ms llama_decode slowdown), the observed nanosecond-level changes have negligible impact on inference performance.

Power Consumption Analysis:
Minimal power consumption changes across all binaries:

  • build.bin.libllama.so: 0.0005% increase (280,852.58 nJ → 280,853.99 nJ)
  • build.bin.llama-run: 0.001% increase (268,045.53 nJ → 268,046.99 nJ)
  • All other binaries show zero measurable change

Flame Graph and CFG Analysis:
The can_reuse() function exhibits identical assembly code between versions with a simple linear execution pattern (single basic block, 20 instructions). The 0.063 ns improvement represents measurement variance rather than algorithmic optimization, as confirmed by identical control flow graphs and instruction sequences.

Code Review Insights:
The GitHub PR introduces WMMA-MMQ kernel support for AMD RDNA 4 GPUs, adding 428 lines focused on GPU acceleration. The implementation includes proper conditional compilation guards and maintains backward compatibility. No functional regressions identified in the GPU optimization code.

Conclusion:
The analysis reveals no performance impact on core inference functionality. Observed variations fall within measurement precision limits and do not affect token processing throughput.

@DajanaV DajanaV force-pushed the main branch 22 times, most recently from b50c0de to 8798da8 Compare November 13, 2025 20:09
@loci-review
Copy link

loci-review bot commented Nov 23, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #163

Assessment

Condition 1 applies: No meaningful performance changes detected between versions.

The analysis shows zero measurable performance impact across all 16 binaries in the LLaMA.cpp project. Power consumption analysis reveals negligible variations: libllama.so (+0.57 nJ, +0.0003%), llama-cvector-generator (-1.29 nJ, -0.0005%), and llama-run (-0.17 nJ, -0.0001%). All other binaries show 0.0% change. No function-level performance data is available, indicating the versions are functionally identical from a performance perspective.

Code Changes Analysis

This PR implements AMD RDNA 4 GPU support by enabling WMMA (Wave Matrix Multiply-Accumulate) instructions for hardware-accelerated quantized matrix operations. The changes are entirely architecture-specific and preprocessor-guarded, affecting only RDNA 4 GPU execution paths:

Key Modifications:

  1. ggml-cuda/mma.cuh: Reorganized WMMA tile structures, added __builtin_amdgcn_wmma_i32_16x16x16_iu8_w32_gfx12 intrinsics, and implemented optimized 16x4/16x8 tile loading
  2. ggml-cuda/mmq.cu: Enabled MMQ kernel selection for RDNA 4 (GGML_CUDA_CC_IS_RDNA4)
  3. ggml-cuda/mmq.cuh: Added WMMA code paths for all quantization formats (Q4_0, Q4_1, Q5_0, Q5_1, Q8_0, Q2_K-Q6_K, IQ variants) with 16x4 tile operations

Performance Implications:

  • For RDNA 4 users: Expected 2-5x throughput improvement for quantized inference operations when GPU offloading is enabled
  • For all other platforms: Zero impact; changes are compile-time conditional
  • Core inference functions (llama_decode, llama_encode, llama_tokenize): Unaffected; changes are GPU backend-specific
  • Tokens per second: No change for CPU execution; potential significant gains for RDNA 4 GPU users

Conclusion:

The PR successfully adds RDNA 4 hardware acceleration without affecting existing functionality or performance on other platforms. The zero-delta in performance metrics confirms proper isolation of architecture-specific code paths.

@loci-review
Copy link

loci-review bot commented Nov 23, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #163 - WMMA-MMQ Kernels for RDNA 4

Assessment

No performance impact detected. The comparison between versions 16755d51-8712-48e2-af58-431ce36c40cd and d7e39df9-d9c9-48ca-9a8c-4ccb5d357361 shows zero measurable performance regression across all analyzed functions and binaries. All changes are within measurement noise (< 0.001%).

Code Changes Analysis

This PR enables WMMA (Wave Matrix Multiply-Accumulate) kernels for AMD RDNA 4 GPUs by:

  1. Architectural Separation: Distinguishes WMMA (RDNA4, 32-wide waves) from MFMA (CDNA/RDNA3, 64-wide waves) in ggml/src/ggml-cuda/mma.cuh
  2. Kernel Enablement: Modifies ggml_cuda_should_use_mmq() in mmq.cu to enable MMQ kernels for RDNA4
  3. Quantization Support: Adds WMMA implementations for all quantization formats (Q2_K through Q8_1) in mmq.cuh using 16x4 tiles

Performance Metrics:

  • llama_decode: +428 ns response time (+0.001%), 0 ns throughput change
  • llama_tokenize: +7 ns response time (+0.001%), 0 ns throughput change
  • Power consumption: No change across all binaries (< 0.0001% variation)

Flame Graph & CFG Analysis:

  • Control flow graphs are structurally identical between versions
  • No added branches, instructions, or memory access pattern changes
  • Assembly code functionally equivalent (only register allocation differences)

Code Review Findings:

  • Changes are additive (new RDNA4 code path)
  • Existing MFMA/CDNA/RDNA3 paths unchanged
  • Hardware-specific optimizations using gfx12 WMMA intrinsics
  • Comprehensive quantization format coverage maintained

Conclusion:
The PR introduces RDNA4-specific optimizations without affecting existing hardware paths. Performance improvements will manifest only on RDNA4 GPUs when executing quantized model inference. No regressions detected for current hardware configurations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants