UPSTREAM PR #18157: HIP: fix compile error on windows and merge I_MAJOR_DUAL to I_MAJOR_MIRRORED#611

Open
loci-dev wants to merge 1 commit into main from
upstream-PR18157-branch_zhang-hui-yulo-fix_windows_for_rdna

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18157

Fixes a compile error on Windows. Tests passed on ROCm 7.1.1 (Linux, 9070 XT), HIP 6.4.2 (Windows, 7900 XTX), and CUDA 12.9 (Windows, 3080).

Also merges I_MAJOR_DUAL into I_MAJOR_MIRRORED. @JohannesGaessler, could you run a quick test on Volta? Thank you.

@loci-review

loci-review bot commented Dec 18, 2025

Explore the complete analysis in Version Insights.

Performance Analysis Summary: PR #611

Analysis Scope: Single file modification (ggml/src/ggml-cuda/mma.cuh) consolidating HIP/CUDA matrix tile data layouts for RDNA3 and Volta architectures.

Performance Impact: No measurable changes detected. Power consumption analysis shows only nanojoule-scale variations across the affected binaries (build.bin.libllama.so: -0.20 nJ, build.bin.llama-cvector-generator: -1.24 nJ, build.bin.llama-run: -0.05 nJ, build.bin.llama-tts: +0.92 nJ); all other binaries are unchanged. Function-level analysis confirms zero throughput-time changes and negligible response-time variations (+480 ns in main, +1 ns in llama_decode).

Code Changes: Refactoring consolidates DATA_LAYOUT_I_MAJOR_DUAL into DATA_LAYOUT_I_MAJOR_MIRRORED, unifying memory layout handling for AMD RDNA3 and NVIDIA Volta tensor core operations. Changes include template specialization updates for half2 and nv_bfloat162 types, architecture-specific conditional compilation, and Windows HIP compilation fixes. No modifications to computational kernels or MMA instruction sequences.

Inference Impact: None. The llama_decode function shows a +1 ns response-time change, a relative change of roughly 0.0001% that falls within measurement noise. Using the reference model (smollm:135m on a 12th Gen Intel i7-1255U), where a 2 ms llama_decode degradation causes a 7% tokens/second reduction, the observed 1 ns change translates to a 0.0000035% tokens/second impact, effectively zero. No tokenization or inference functions (llama_decode, llama_encode, llama_tokenize) show meaningful performance changes.

@loci-dev force-pushed the main branch 27 times, most recently from ac107ae to f002844 on December 21, 2025 at 19:06.
@loci-dev force-pushed the main branch 30 times, most recently from 1946e3d to de06f84 on December 28, 2025 at 08:11.