hip: bypass memory pool for flash attention f16 temp buffers#22094
hip: bypass memory pool for flash attention f16 temp buffers#22094TheTom wants to merge 1 commit into
Conversation
|
Hi @TheTom, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
bccd1a0 to
341192b
Compare
341192b to
f4a6fdb
Compare
f4a6fdb to
30c3c23
Compare
Independent Verification: OOM Bug Confirmed, Fix VerifiedHardware: AMD Radeon RX 9070 XT (gfx1201, 16GB VRAM, VMM: no) Model: Phi-3.1-mini-128k-instruct Q4_K_M (2.23 GiB, 3.82B params) Baseline: e365e65 (master) OOM Reproduction (master, unfixed)
q8_0 KV should use less memory than f16, yet it OOMs first. This confirms the bug: pool-retained f16 temp buffers from quantized KV dequant push total VRAM past the limit. OOM Fix (PR #22094 applied)
Fix works. q8_0 now survives the same depth that killed it on master. Full Benchmark: Master (unfixed)f16 KV:
q8_0 KV:
Full Benchmark: With Fix (PR #22094)f16 KV:
q8_0 KV:
Performance Comparison at Matching Depth Pointspp512 (no depth):
tg128 (no depth):
pp512 @ d32768:
tg128 @ d32768:
* Note on variance: The large swings in d32768 numbers (especially decode) and the q8_0 pp512 no-depth difference suggest heavy VRAM pressure effects on master. When VRAM is near capacity, the HIP runtime thrashes. This explains the extremely low decode numbers on master (4.54 t/s f16 decode @ d32768). The fix, by releasing temp buffers, reduces VRAM pressure and performance stabilizes. The "improvement" at d32768 is really the fix eliminating thrashing. Control Test: Qwen2.5-7B Q4_K_M (4 GQA KV heads)This model has only 4 KV heads (GQA), so the FA temp buffer is small. OOM was not triggered. Performance is stable between master and fix.
No regression. f16 unaffected. q8_0 decode stable. Verdict
Fix resolves the reported issue with minimal code change (27 lines, HIP-only). Reproducer# master (e365e65) — shows the bug
llama-bench -m phi-3.1-mini-128k-instruct-Q4_K_M.gguf -ngl 99 -fa 1 -ctk q8_0 -ctv q8_0 -p 512 -n 128 -d 0,32768,40000 -r 1
llama-bench -m phi-3.1-mini-128k-instruct-Q4_K_M.gguf -ngl 99 -fa 1 -ctk f16 -ctv f16 -p 512 -n 128 -d 0,32768,40000 -r 1
# PR #22094 (30c3c23) — shows the fix
# same commands, q8_0 no longer OOMs |
30c3c23 to
0b05974
Compare
|
Additional finding: tested whether enabling HIP VMM could serve as an alternative fix. Built upstream master (e365e65) with This confirms the Tracked upstream at ROCm/rocm-systems#2516. |
The legacy memory pool (ggml_cuda_pool_leg) retains peak-sized allocations permanently. For quantized KV flash attention, the f16 dequant temp buffers (K_f16, V_f16) stay allocated in the pool after use, consuming more VRAM than the KV compression saves. This causes quantized KV (q8_0, q4_0) to OOM before f16 at equivalent context lengths on HIP/ROCm where VMM is unavailable. Root cause: ggml_cuda_pool_leg::free() stores buffers in buffer_pool[] for reuse and never calls cudaFree. On CUDA with VMM the OS can reclaim unused virtual memory. On HIP without VMM (all consumer RDNA 3/4 GPUs), the pool permanently consumes peak VRAM. Fix: on HIP, allocate f16 temp buffers with cudaMalloc and free with cudaFree (via RAII wrapper) instead of the pool. Memory is released after the FA kernel completes via cudaStreamSynchronize. Trade-off: one cudaStreamSynchronize per FA call (~5% overhead at 32K). Impact: CUDA/Metal unaffected (#ifdef GGML_USE_HIP only). Confirmed: gfx1100 (RX 7900 XT), gfx1201 (RX 9070 XT) Fixes: ggml-org#22107
0b05974 to
53fd02f
Compare
For Reviewers:
No overlap between them. #21119 is in draft and #21452 has been waiting on review for 2 weeks. Happy to prioritize review order if that helps. |
|
Not a resonable solution, the allocator should be improved instead, there is a pr open for that. |
|
Which PR are you referring to? Happy to test it. This fix exists because VMM is broken on consumer RDNA 3/4 today. I tested with Until VMM works on consumer cards, the legacy pool is the only path, and it leaks f16 temp buffers. This PR fixes that for users hitting OOM today. If a better allocator fix lands upstream, this becomes a no-op to remove. |
|
According to the llama.cpp AI usage policy:
|
|
@IMbackK Added crash logs here that @manocharahul requested: ROCm/rocm-systems#2516 (comment) Is #22155 the PR you are referring to? I tested it on the same hardware (RX 9070 XT, gfx1201, 16GB, HIP SDK 7.1). Both PRs fix the OOM. However, #22155 is 3x slower at depth because the flush-retry triggers on every FA call:
#22155 flushes the entire pool and retries on every layer's FA allocation at high context. This PR prevents the pool from retaining those buffers in the first place, so no OOM occurs and no flush is needed. The two approaches are complementary, not competing. #22155 is a general safety net for any pool OOM. This PR targets the specific root cause for FA temp buffers. Both could coexist. I believe both can be applied as is since FA buffers never enter the pool (my fix) so no OOM, if any other large allocation causes pool OOM, #22155 catches it as a safety net. Thoughts? |
|
I am happy to help test with my 9070XT if helpful. I can say for sure that TheTom's solution works for me as-is. Obviously we want the core ggml_cuda_pool_leg allocator itself to be smart enough to release memory properly (which is what that other PR #22155 attempts to do). Architecturally, correct, fixing the root allocator is cleaner code than adding exceptions. However, as TheTom pointed out #22155 is a brute-force approach: it literally flushes the entire memory pool and retries the allocation if it fails. Hence the massive 3x performance drop at 40k tokens. While the upstream approach is architecturally pure, that kind of performance regression renders deep-context workflows practically unusable for end-users on AMD hardware. I'd love to see a solution that fixes the allocator without sacrificing the prefill performance we get in this PR. Always happy to test. Thanks! |
|
I honestly struggle to understand the decision to close this PR. It frequently feels like the real-world needs of AMD consumer GPU users are not being taken seriously. While I respect the maintainers' desire for a clean and architecturally flawless codebase, in practical engineering, working is sometimes far more important than perfect. We desperately need a functional way to run these models on our hardware today. By rejecting this implementation, AMD users are once again left waiting indefinitely for an idealized solution that might take months, while a perfectly usable fix is simply discarded. I sincerely hope the team can reconsider the actual pain points of the end-users. A practical, working solution right now is much more valuable to the community than waiting for a perfect one that doesn't exist yet. |
|
Because this is a hack, the solution here is not to ignore the allocator for this one case but to instead imporove the allocator. If the performance cost of fully flushing the allocator it to high you could for instance try reclaming only enough segments for the allocation to suceed and See if that improves performance. Over all this whole thing will be resolved in rocm 7.3 because there the vmm allocator will probubly Start to work given devolpments in clr. |
|
Refactored and opened a new PR with cleaner structure: #22185 Moved the inline struct to a reusable ggml_cuda_direct_alloc in common.cuh that mirrors the pool_alloc interface. Same fix, same test results. |
On HIP/ROCm, bypass the memory pool for flash attention f16 temp buffers to prevent OOM with quantized KV cache types.
Overview
The legacy memory pool (
ggml_cuda_pool_leg) retains peak-sized allocations permanently. During flash attention with quantized KV types (q4_0, q8_0), the f16 dequant temp buffers (K_f16,V_f16) grow proportional to KV cache length. After use, the pool retains them at peak size — consuming more VRAM than the KV compression saves.On CUDA with VMM, the OS can reclaim unused virtual memory. On HIP, VMM is broken since ROCm 7.0 (ROCm/rocm-systems#2516, open 4+ months), so all consumer RDNA 3/4 GPUs fall back to the legacy pool where allocations are never freed.
This causes quantized KV to OOM before f16 at equivalent context lengths on stock llama.cpp.
Fix
Replace pool-based f16 temp buffer allocation with a RAII struct using raw
cudaMalloc/cudaFreeon HIP. Memory is released after the FA kernel completes viacudaStreamSynchronize.Single file change:
ggml/src/ggml-cuda/fattn-common.cuh, scoped to#ifdef GGML_USE_HIP. Zero impact on CUDA or Metal.Reproducer
Stock llama.cpp, no modifications needed:
Confirmed hardware
Performance impact
Tested on RX 9070 XT, Qwen2.5-1.5B Q8_0:
The ~5% prefill overhead is from
cudaStreamSynchronizein the RAII destructor. Can be eliminated withhipFreeAsync(stream-ordered free) in future.Fixes #22107
Mentions ROCm/rocm-systems#2516
Requirements