fix(trtllm_mha): clamp page_table to k_cache page range to prevent SWA crash #5
Open
pyc96 wants to merge 1 commit into
…A crash
Prevents the deterministic CUDA Warp Illegal Address crash in
'fmhaSm100fKernel_*SlidingOrChunkedCausal*' that triggers under
Gemma-4 + --attention-backend trtllm_mha + MTP + summarization
workloads at ~85% SWA pool utilization (see
crash_repro/TRIAGE_REPORT.md).
Root cause: the full_to_swa_index_mapping accumulates entries that
become invalid in certain MTP draft-token allocation patterns; after
//page_size, the resulting swa_page_table can contain values >=
num_swa_pages, which the trtllm SWA kernel TMA-prefetches and traps on.
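To make the failure mode concrete, here is a minimal pure-Python sketch of how one stale mapping entry turns into an out-of-range page index after the `// page_size` step. The mapping contents, `page_size`, and `num_swa_pages` are made-up illustration values, and `swa_pages_for` / `out_of_range` are hypothetical helpers, not SGLang functions.

```python
def swa_pages_for(full_token_indices, full_to_swa, page_size):
    """Translate full-pool token indices to SWA page indices."""
    return [full_to_swa[i] // page_size for i in full_token_indices]


def out_of_range(pages, num_swa_pages):
    """Return (position, page) pairs that fall outside the SWA page range."""
    return [(pos, p) for pos, p in enumerate(pages) if not 0 <= p < num_swa_pages]


page_size = 16
num_swa_pages = 4  # valid SWA token slots are 0..63

# Hypothetical mapping: entry 3 is stale and points past the SWA pool.
full_to_swa = {0: 0, 1: 17, 2: 63, 3: 200}

pages = swa_pages_for([0, 1, 2, 3], full_to_swa, page_size)
bad = out_of_range(pages, num_swa_pages)
print(pages, bad)
```

Only the last entry is flagged; a single such index in the table is enough for the kernel to prefetch from an address outside the SWA k_cache.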
Fix: clamp page_table values to [0, k_cache.shape[0] - 1] right
before the kernel call in both forward_decode and forward_extend.
Applies to BOTH the regular page_table and swa_page_table paths.
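The clamp itself can be sketched in a few lines. This is an illustration on nested Python lists, assuming a hypothetical helper name `clamp_page_table`; the actual patch is not shown on this page and presumably operates on the page_table tensor directly (for example with a single `torch.clamp` call bounded by `k_cache.shape[0] - 1`).

```python
def clamp_page_table(page_table, num_pages):
    """Return a copy of page_table with every index forced into [0, num_pages - 1]."""
    hi = num_pages - 1
    return [[min(max(p, 0), hi) for p in row] for row in page_table]


# Toy page table: 12 and 4 are past the end of a 4-page cache, -1 is below it.
page_table = [[0, 3, 12], [2, -1, 4]]
clamped = clamp_page_table(page_table, num_pages=4)
print(clamped)
```

Note that 12 and 4 both collapse onto page 3, the last valid page. The kernel no longer traps, but those positions now read another page's K/V.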
Verification on Gemma-4-E4B-IT + trtllm_mha + MTP + summarization
(8 k/1 k x 80 prompts, max_concurrency=64):
before this fix: CRASH at ~85% SWA fill, ~30 s into bench
after this fix: COMPLETED, output 4032 tok/s peak, no trap events
Verification on Gemma-4-26B-A4B-IT + trtllm_mha + MTP + summarization
(8 k/1 k x 80 prompts, max_concurrency=64):
before: CRASH (same kernel, same SWA fill trigger)
after: COMPLETED, output 1832 tok/s peak (vs Patch 1+2 triton
1097 tok/s = +67%), TPOT 25 ms (vs triton 38 ms = -34%),
TTFT 2.9 s (vs triton 8.8 s = -67%)
MMLU @ 500 questions on 26B with this fix: 0.718 (vs Patch 2 baseline
0.706, vLLM 0.710) -- within noise, no regression.
KNOWN LIMITATION: accept length drops vs triton backend (1.69 vs 2.76
on 26B summarization). Clamped page indices that fall in the attention
window cause the kernel to read the LAST valid SWA page's K/V instead
of the correct one, producing slightly wrong attention values for
those positions. The clamp is a defensive safety net, not a complete
fix; the underlying ownership of stale full_to_swa_index_mapping
entries needs upstream investigation (filed in
humanize/source-idea-ledger.md as Patch E). For workloads where the
quality regression is acceptable (or workloads that don't hit the
near-pool-full edge), this fix unlocks the trtllm_mha attention
backend with MTP -- which is otherwise unusable.
Cost: one clamp() per kernel call (a few microseconds, no measurable
perf impact).
See crash_repro/TRIAGE_REPORT.md.
Co-authored-by: Claude
pyc96 added a commit that referenced this pull request on May 28, 2026
Root-cause fix for the SWA-aware page_table OOB that crashed
trtllm_mha + MTP + hybrid-SWA models (Gemma-4 26B-A4B-IT, E4B-IT).
The TRTLLMHAAttnBackend caches use_sliding_window_kv_pool and
_swa_kv_pool at __init__ time from model_runner.token_to_kv_pool.
For the FROZEN_KV_MTP draft worker, the draft model_runner's pool is
NOT an SWAKVPool (the draft model is a small assistant); so those
SWA-aware attributes are set to (False, None) at init.
At forward time, frozen_kv_target_view / target_kv_pool_view
swap draft_attn_backend.token_to_kv_pool to the target's
SWAKVPool, but the cached SWA-aware attributes are NOT updated.
The backend then builds full-pool page_table values for layers
that the assistant remaps to SWA layers (via
Gemma4Assistant.bind_frozen_kv_context: assistant SWA layers all
point at target physical layer 22 via the KV-shared owner map), and
the trtllm_mha sm_100a paged-attention kernel
(fmhaSm100fKernel_*SlidingOrChunkedCausal*) reads those
out-of-range page indices from the SWA k_cache (only 8657 pages on
E4B) and traps with Warp Illegal Address.
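The stale-attribute problem described above can be reproduced with toy classes. This is a sketch only: `SWAPool`, `PlainPool`, and `Backend` are stand-ins invented for illustration, and only the three attribute names are taken from the description.

```python
class SWAPool:
    """Stand-in for a KV pool with a separate sliding-window region."""


class PlainPool:
    """Stand-in for the draft model's non-SWA KV pool."""


class Backend:
    def __init__(self, pool):
        self.token_to_kv_pool = pool
        # Derived state is computed once at init and cached.
        self.use_sliding_window_kv_pool = isinstance(pool, SWAPool)
        self._swa_kv_pool = pool if self.use_sliding_window_kv_pool else None


backend = Backend(PlainPool())        # draft worker init: non-SWA pool
backend.token_to_kv_pool = SWAPool()  # forward-time swap to the target's pool

# The pool changed, but the cached SWA-aware attributes did not.
stale = (backend.use_sliding_window_kv_pool, backend._swa_kv_pool)
print(stale)
```

The printed pair matches the `use_sliding_window_kv_pool=False`, `_swa_kv_pool is None` state reported in the debug output.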
Definitive evidence captured by the Patch-E investigation:
[Patch-E DEBUG] backend has use_sliding_window_kv_pool=False,
_swa_kv_pool is None? True,
layer_id=22, layer.sliding_window_size=512
The fix has three parts:
1. frozen_kv_mtp_utils.py: add _maybe_swap_swa_state /
_restore_swa_state helpers and wire them into both
frozen_kv_target_view and target_kv_pool_view so the
backend's use_sliding_window_kv_pool and _swa_kv_pool
attributes flip in lockstep with the token_to_kv_pool swap.
2. trtllm_mha_backend.py: add self.model_has_sliding_window
computed from model_runner.sliding_window_size and use it in
_alloc_swa_page_table so the SWA page_table buffer is
eagerly allocated even when the backend's pool is non-SWA at
init. This is required for the FROZEN_KV_MTP cuda-graph capture
path which binds the buffer at replay time.
3. frozen_kv_mtp_cuda_graph_runner.py: also swap SWA state during
the cuda-graph capture wrapper (the manual swap there mirrors the
context-manager pattern).
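The lockstep swap in parts 1 and 3 follows the usual save, swap, restore shape of a context manager. Below is a minimal sketch of that pattern using toy classes; `swapped_pool_view` is a hypothetical name chosen to avoid implying it is the real `target_kv_pool_view`, and the real helpers may differ in detail.

```python
from contextlib import contextmanager


class SWAPool:
    """Stand-in for the target model's sliding-window KV pool."""


class PlainPool:
    """Stand-in for the draft model's non-SWA KV pool."""


class Backend:
    def __init__(self, pool):
        self.token_to_kv_pool = pool
        self.use_sliding_window_kv_pool = isinstance(pool, SWAPool)
        self._swa_kv_pool = pool if self.use_sliding_window_kv_pool else None


@contextmanager
def swapped_pool_view(backend, target_pool):
    """Swap the pool and its derived SWA state together, then restore all three."""
    saved = (
        backend.token_to_kv_pool,
        backend.use_sliding_window_kv_pool,
        backend._swa_kv_pool,
    )
    is_swa = isinstance(target_pool, SWAPool)
    backend.token_to_kv_pool = target_pool
    backend.use_sliding_window_kv_pool = is_swa
    backend._swa_kv_pool = target_pool if is_swa else None
    try:
        yield backend
    finally:
        # Restore even if the forward pass raises.
        (
            backend.token_to_kv_pool,
            backend.use_sliding_window_kv_pool,
            backend._swa_kv_pool,
        ) = saved


draft_backend = Backend(PlainPool())
target = SWAPool()
with swapped_pool_view(draft_backend, target) as b:
    inside = (b.use_sliding_window_kv_pool, b._swa_kv_pool is target)
after = (draft_backend.use_sliding_window_kv_pool, draft_backend._swa_kv_pool)
print(inside, after)
```

Restoring in a `finally` block keeps the draft backend's original state intact on error paths, which is the property a manual swap (as in the cuda-graph capture wrapper) has to reproduce by hand.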
Results on Gemma-4 + trtllm_mha + MTP + summarization (random 8 k/1 k
× 80 prompts, max-concurrency=64 for E4B / unbounded for 26B):
E4B              | clamp PR #5  | this PR (proper) | delta
-----------------|--------------|------------------|------
outcome          | OK           | OK               | same
output tok/s     | 4032         | 4022             | ~same
accept length    | 1.61         | **2.13**         | +32%
total throughput | 31.5 k tok/s | 36.2 k tok/s     | +15%
median TPOT (ms) | 12.16        | 9.99             | -18%

26B                | clamp PR #5  | this PR (proper) | delta
-------------------|--------------|------------------|------
outcome            | OK           | OK               | same
output tok/s       | 1832         | 2503             | +37%
accept length      | 1.67         | **2.84**         | +70%
total throughput   | 16.5 k tok/s | 22.5 k tok/s     | +37%
median TPOT (ms)   | 24.97        | 20.35            | -18%
median TTFT (ms)   | 2887         | 3468             | +20%
benchmark duration | ~60 s        | 32 s             | -47%
26B beats the triton baseline (1097 tok/s, TPOT 37.87 ms, accept 2.76)
by +128% output throughput, -46% TPOT, and +3% accept length. MMLU @
500 questions: 0.716 (vs triton baseline 0.706, vLLM 0.710) -- within
sampling noise.
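The three deltas against the triton baseline follow directly from the 26B figures quoted above. A small check, where `pct_delta` is a helper written for this illustration:

```python
def pct_delta(new, base):
    """Relative change of new vs base, as a whole-number percentage."""
    return round((new / base - 1) * 100)


deltas = {
    "output tok/s": pct_delta(2503, 1097),
    "median TPOT": pct_delta(20.35, 37.87),
    "accept length": pct_delta(2.84, 2.76),
}
print(deltas)
```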
26B chat 1000/1000: TTFT 510 ms (vs vLLM 880 ms), TPOT 8.72 ms (vs
vLLM 8.46 ms), accept 2.89 (vs vLLM 2.80).
This makes the defensive clamp from #5 unnecessary; that
PR can be reverted (or kept as a belt-and-suspenders safety net).
Co-authored-by: Claude
Summary
Defensive clamp that prevents the deterministic CUDA Warp Illegal
Address crash in fmhaSm100fKernel_*SlidingOrChunkedCausal* when running
Gemma-4 + --attention-backend trtllm_mha + MTP + summarization
workloads.
Stacked on #3 (debug trap). Staged on pyc96/sglang only.
What this fixes
Without this PR, the same workload that the bounds trap (#3)
catches in 30 s on E4B crashes the SGLang server with
cudaErrorIllegalAddress and SIGQUIT. With this PR, the same workload
completes cleanly.
How
In trtllm_mha_backend.py::forward_decode and forward_extend, right
after _get_layer_page_table and before the flashinfer kernel call,
clamp every page index to [0, k_cache.shape[0] - 1]. Three lines per
call site, two call sites.
Benchmark — Gemma-4-E4B-IT, trtllm_mha, MTP, summarization 8 k/1 k × 80
Benchmark — Gemma-4-26B-A4B-IT, trtllm_mha, MTP, summarization 8 k/1 k × 80
This unlocks the trtllm_mha attention backend for Gemma-4 MTP, which is
otherwise unusable.
Quality — MMLU @ 500 questions (Gemma-4-26B-A4B-IT, seed 0, temp 0)
Within MMLU sampling noise; no regression.
Known limitation
Accept length drops from 2.76 (triton) → 1.69 (trtllm_mha + this PR)
on 26B summarization. Investigation:
Clamped page indices that fall in the attention window cause the kernel
to read the last valid SWA page's K/V instead of the correct one,
producing slightly wrong attention values, which lowers MTP draft
acceptance.
The clamp is a defensive safety net, not a complete fix; the
underlying off-by-one in either full_to_swa_index_mapping or the SWA
paged allocator's edge cases needs upstream investigation (filed as
Patch E in humanize/source-idea-ledger.md).
For workloads where the lower acceptance is acceptable (the ~50 %
throughput improvement still significantly beats the triton baseline),
this fix is a net win.
Cost
One clamp() per kernel call. Few microseconds per forward. No
measurable performance impact.
Tests
No new unit test — the test is the reproducer:
Both reproduce no-crash behavior with this PR applied.
CI States
Latest PR Test (Base): ❌ Missing run-ci label -- add it to run CI tests.
Latest PR Test (Extra): ❌ Blocked -- run-ci is required first.