ggml-cuda : reduce redundant expf calls in GDN KDA kernel #20295
Closed
simonxluo wants to merge 5 commits into ggml-org:master from
Conversation
ggml-cuda: gdn use shared mem for HIP

This PR optimizes GDN operations for AMD GPUs (HIP) by using shared memory instead of registers, improving performance on HIP/ROCm platforms.

Main changes:
- Use shared memory for state data on the HIP platform
- Fix assertion in mamba-base.cpp (n_embd vs d_state)
- Remove obsolete map_developer_role_to_system function
- Enhance chat template tests
Pre-compute expf(g_t[i]) values once and reuse them across multiple loops in the KDA (Key-Dependent Activation) branch. This reduces expf calls from 2*S_v to S_v per token (e.g., from 256 to 128 calls for S_v=128).

AI assistance disclosure: Claude Code was used to identify the redundant expf() calls and suggest the pre-computation optimization strategy. The implementation was manually reviewed and tested on AMD AI Max 395 (ROCm 7.2).

Tested on Qwen3.5-35B-A3B-heretic-v2-Q8_0:
- Prefill (64000 tokens): 666.6 → 674.8 t/s (+1.23%)
- Decode (256 tokens): 31.72 → 31.65 t/s (-0.25%, within noise)

Fixes #
Summary
Pre-compute `expf(g_t[i])` values once and reuse them across multiple loops in the KDA (Key-Dependent Activation) branch of the GDN kernel.

Changes
- Pre-compute `expf(g_t[i])` values before the main computation
- Replace repeated `expf()` calls with the pre-computed array `g_exp[]`
- Reduce `expf` calls from `2*S_v` to `S_v` per token (e.g., from 256 to 128 calls for S_v=128)

Performance
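The change above can be sketched in host-side C++ (a simplified illustration, not the actual CUDA kernel code; the names `g_t`, `g_exp`, and `S_v` mirror the PR description, and the two consumer loops are hypothetical stand-ins for the kernel's passes):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the optimization: compute expf(g_t[i]) once into g_exp[],
// instead of re-evaluating it in each loop that needs the gate value.
// This takes the per-token expf count from 2*S_v down to S_v.
std::vector<float> precompute_gate_exp(const std::vector<float>& g_t) {
    std::vector<float> g_exp(g_t.size());
    for (size_t i = 0; i < g_t.size(); ++i) {
        g_exp[i] = expf(g_t[i]);  // one expf per element, reused below
    }
    return g_exp;
}

// Hypothetical consumer: two passes that previously each called
// expf(g_t[i]) themselves now read the pre-computed array.
float decay_then_accumulate(const std::vector<float>& state,
                            const std::vector<float>& g_exp) {
    float acc = 0.0f;
    for (size_t i = 0; i < state.size(); ++i) {
        acc += state[i] * g_exp[i];  // pass 1: apply gating to the state
    }
    for (size_t i = 0; i < state.size(); ++i) {
        acc += g_exp[i];             // pass 2: reuse the same gate values
    }
    return acc;
}
```

The result is bit-identical to calling `expf` in each loop, since the same values are computed either way; only the call count changes.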
Tested on AMD AI Max 395 (gfx1151, ROCm 7.2) with Qwen3.5-35B-A3B-heretic-v2-Q8_0:
The decode variation (-0.25%) is within measurement noise. The prefill improvement (+1.23%) is consistent and beneficial for long-context scenarios.
Testing
`--no-mmap -fa 1 -c 8192 -ngl 999`

Notes
AI Assistance Disclosure
Claude Code was used to identify the redundant `expf()` calls and suggest the pre-computation optimization strategy. The implementation was manually reviewed and tested.

Fixes #