
ggml-cuda : reduce redundant expf calls in GDN KDA kernel#20295

Closed

simonxluo wants to merge 5 commits into ggml-org:master from simonxluo:ggml-cuda-gdn-expf-opt


Conversation


@simonxluo simonxluo commented Mar 9, 2026

Summary

Pre-compute expf(g_t[i]) values once and reuse them across multiple loops in the KDA (Kimi Delta Attention) branch of the GDN (Gated DeltaNet) kernel.

Changes

  • Add pre-computation loop for expf(g_t[i]) values before main computation
  • Replace redundant expf() calls with pre-computed array g_exp[]
  • Reduces expf calls from 2*S_v to S_v per token (e.g., from 256 to 128 calls for S_v=128)
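The transformation can be sketched as follows. This is a minimal host-side C++ illustration of the pattern, not the actual kernel code; `apply_gdn_sketch`, the `state`/`v` inputs, and the two consuming loops are simplified stand-ins for the real KDA branch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the optimization: hoist expf(g_t[i]) into a precomputed
// array so each value is evaluated once instead of once per consuming loop.
std::vector<float> apply_gdn_sketch(const std::vector<float>& g_t,
                                    const std::vector<float>& state,
                                    const std::vector<float>& v) {
    const size_t S_v = g_t.size();

    // Before: each loop below called expf(g_t[i]) itself (2*S_v calls total).
    // After: precompute once (S_v calls) and reuse the cached values.
    std::vector<float> g_exp(S_v);
    for (size_t i = 0; i < S_v; ++i) {
        g_exp[i] = expf(g_t[i]);
    }

    std::vector<float> out(S_v);
    // First consumer: decay the running state by the gate.
    for (size_t i = 0; i < S_v; ++i) {
        out[i] = state[i] * g_exp[i];
    }
    // Second consumer: accumulate the gated value term.
    for (size_t i = 0; i < S_v; ++i) {
        out[i] += v[i] * g_exp[i];
    }
    return out;
}
```

Since `expf` is a relatively expensive transcendental call, trading it for one extra array (registers or shared memory in the kernel) is a net win when the values are reused.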

Performance

Tested on AMD AI Max 395 (gfx1151, ROCm 7.2) with Qwen3.5-35B-A3B-heretic-v2-Q8_0:

| Phase   | Tokens | Before (t/s) | After (t/s) | Change |
|---------|--------|--------------|-------------|--------|
| Prefill | 64000  | 666.604      | 674.82      | +1.23% |
| Decode  | 256    | 31.724       | 31.645      | -0.25% |

The decode variation (-0.25%) is within measurement noise. The prefill improvement (+1.23%) is consistent and beneficial for long-context scenarios.

Testing

  • Model: Qwen3.5-35B-A3B-heretic-v2-Q8_0
  • GPU: AMD AI Max 395 (gfx1151)
  • ROCm: 7.2
  • Parameters: --no-mmap -fa 1 -c 8192 -ngl 999
  • Verified numerical correctness: output unchanged from baseline

Notes

  • This is a pure optimization with no algorithmic changes
  • Low risk: the computation is mathematically equivalent to the original
  • Particularly beneficial for long-sequence prefilling

AI Assistance Disclosure

Claude Code was used to identify the redundant expf() calls and suggest the pre-computation optimization strategy. The implementation was manually reviewed and tested.

Fixes #

am17an and others added 5 commits March 9, 2026 10:49
ggml-cuda: gdn use shared mem for HIP

This PR optimizes GDN operations for AMD GPUs (HIP) by using shared memory
instead of registers, improving performance on HIP/ROCm platforms.

Main changes:
- Use shared memory for state data in HIP platform
- Fix assertion in mamba-base.cpp (n_embd vs d_state)
- Remove obsolete map_developer_role_to_system function
- Enhance chat template tests
…red"

This reverts commit ef56a76, reversing
changes made to f76565d.
Pre-compute expf(g_t[i]) values once and reuse across multiple loops
in the KDA (Kimi Delta Attention) branch. This reduces expf calls
from 2*S_v to S_v per token (e.g., from 256 to 128 calls for S_v=128).

AI assistance disclosure: Claude Code was used to identify the redundant
expf() calls and suggest the pre-computation optimization strategy. The
implementation was manually reviewed and tested on AMD AI Max 395 (ROCm 7.2).

Tested on Qwen3.5-35B-A3B-heretic-v2-Q8_0:
- Prefill (64000 tokens): 666.6 → 674.8 t/s (+1.23%)
- Decode (256 tokens): 31.72 → 31.65 t/s (-0.25%, within noise)

Fixes #
@simonxluo simonxluo marked this pull request as draft March 9, 2026 14:57
@simonxluo simonxluo marked this pull request as ready for review March 9, 2026 14:57
@simonxluo simonxluo closed this Mar 9, 2026
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Mar 9, 2026