
ggml-cuda : reduce redundant expf calls in GDN KDA kernel#20295

Closed

simonxluo wants to merge 5 commits into ggml-org:master from simonxluo:ggml-cuda-gdn-expf-opt


Conversation


@simonxluo simonxluo commented Mar 9, 2026

Summary

Pre-compute expf(g_t[i]) values once and reuse them across multiple loops in the KDA (Kimi Delta Attention) branch of the GDN (Gated DeltaNet) kernel.

Changes

  • Add pre-computation loop for expf(g_t[i]) values before main computation
  • Replace redundant expf() calls with pre-computed array g_exp[]
  • Reduces expf calls from 2*S_v to S_v per token (e.g., from 256 to 128 calls for S_v=128)
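The transformation can be sketched as follows. This is a minimal host-side C++ illustration of the pattern, not the actual kernel code; `apply_gdn_sketch`, the `state`/`v` inputs, and the two consuming loops are simplified stand-ins for the real KDA branch:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of the optimization: hoist expf(g_t[i]) into a precomputed
// array so each value is evaluated once instead of once per consuming loop.
std::vector<float> apply_gdn_sketch(const std::vector<float>& g_t,
                                    const std::vector<float>& state,
                                    const std::vector<float>& v) {
    const size_t S_v = g_t.size();

    // Before: each loop below called expf(g_t[i]) itself (2*S_v calls total).
    // After: precompute once (S_v calls) and reuse the cached values.
    std::vector<float> g_exp(S_v);
    for (size_t i = 0; i < S_v; ++i) {
        g_exp[i] = expf(g_t[i]);
    }

    std::vector<float> out(S_v);
    // First consumer: decay the running state by the gate.
    for (size_t i = 0; i < S_v; ++i) {
        out[i] = state[i] * g_exp[i];
    }
    // Second consumer: accumulate the gated value term.
    for (size_t i = 0; i < S_v; ++i) {
        out[i] += v[i] * g_exp[i];
    }
    return out;
}
```

Since `expf` is a relatively expensive transcendental call, trading it for one extra array (registers or shared memory in the kernel) is a net win when the values are reused.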

Performance

Tested on AMD AI Max 395 (gfx1151, ROCm 7.2) with Qwen3.5-35B-A3B-heretic-v2-Q8_0:

| Phase   | Tokens | Before (t/s) | After (t/s) | Change |
|---------|--------|--------------|-------------|--------|
| Prefill | 64000  | 666.604      | 674.82      | +1.23% |
| Decode  | 256    | 31.724       | 31.645      | -0.25% |

The decode variation (-0.25%) is within measurement noise. The prefill improvement (+1.23%) is consistent and beneficial for long-context scenarios.

Testing

  • Model: Qwen3.5-35B-A3B-heretic-v2-Q8_0
  • GPU: AMD AI Max 395 (gfx1151)
  • ROCm: 7.2
  • Parameters: --no-mmap -fa 1 -c 8192 -ngl 999
  • Verified numerical correctness: output unchanged from baseline

Notes

  • This is a pure optimization with no algorithmic changes
  • Low risk: the computation is mathematically equivalent to the original
  • Particularly beneficial for long-sequence prefilling

AI Assistance Disclosure

Claude Code was used to identify the redundant expf() calls and suggest the pre-computation optimization strategy. The implementation was manually reviewed and tested.

Fixes #

am17an and others added 5 commits March 9, 2026 10:49
ggml-cuda: gdn use shared mem for HIP

This PR optimizes GDN operations for AMD GPUs (HIP) by using shared memory
instead of registers, improving performance on HIP/ROCm platforms.

Main changes:
- Use shared memory for state data in HIP platform
- Fix assertion in mamba-base.cpp (n_embd vs d_state)
- Remove obsolete map_developer_role_to_system function
- Enhance chat template tests
…red"

This reverts commit ef56a76, reversing
changes made to f76565d.
Pre-compute expf(g_t[i]) values once and reuse across multiple loops
in the KDA (Kimi Delta Attention) branch. This reduces expf calls
from 2*S_v to S_v per token (e.g., from 256 to 128 calls for S_v=128).

AI assistance disclosure: Claude Code was used to identify the redundant
expf() calls and suggest the pre-computation optimization strategy. The
implementation was manually reviewed and tested on AMD AI Max 395 (ROCm 7.2).

Tested on Qwen3.5-35B-A3B-heretic-v2-Q8_0:
- Prefill (64000 tokens): 666.6 → 674.8 t/s (+1.23%)
- Decode (256 tokens): 31.72 → 31.65 t/s (-0.25%, within noise)

Fixes #
@simonxluo simonxluo marked this pull request as draft March 9, 2026 14:57
@simonxluo simonxluo marked this pull request as ready for review March 9, 2026 14:57
@simonxluo simonxluo closed this Mar 9, 2026
@github-actions github-actions bot added Nvidia GPU Issues specific to Nvidia GPUs ggml changes relating to the ggml tensor library for machine learning labels Mar 9, 2026