Fix: Cuda Graph + HiCache + Speculative Decoding Working Together were giving Cuda Illegal memory access error. #19177
PratikNarola1 wants to merge 3 commits into sgl-project:main
Conversation
…dation

- Add memory_relocation_version tracking to HiCacheController (incremented only on actual host↔device memory relocations, not normal alloc/free)
- Propagate hicache_memory_version through ScheduleBatch → ModelWorkerBatch → ForwardBatch for CUDA graph invalidation checks
- Restore version check in all CUDA graph runners (CudaGraphRunner, PiecewiseCudaGraphRunner, EAGLEDraftCudaGraphRunner, EAGLEDraftExtendCudaGraphRunner, MultiLayerEagleDraftExtendCudaGraphRunner)
- Add hicache_consumer_index checks to disable graphs during active loading
- Add last_loc bounds clamping in eagle_info.py prepare_for_verify
- Add HiCache version-aware last_loc recomputation after memory relocation
- Remove GPU→CPU syncing debug logs from allocator, cuda_graph_runner, and piecewise_cuda_graph_runner (partial cleanup, more to follow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…wise runner

Remove [EAGLE verify] entry/freeing logs with .min().item()/.max().item() calls and a stale comment in the piecewise_cuda_graph_runner replay method.
Summary of Changes

This pull request resolves critical CUDA illegal memory access errors encountered when using CUDA Graphs, HiCache, and Speculative Decoding together. The changes introduce robust mechanisms to manage asynchronous memory operations and relocations within the hierarchical cache, ensuring CUDA graphs are disabled when memory states are unstable. Additionally, it refines speculative decoding logic to handle HiCache-induced memory changes and switches the draft attention backend for GLM-5 to FlashInfer to circumvent a specific CUDA graph bug in the NSA backend.
Code Review
This pull request introduces a comprehensive set of fixes to address CUDA illegal memory access errors arising from the interaction of CUDA Graphs, HiCache, and speculative decoding (EAGLE). The changes are well-reasoned and robust. Key improvements include adding guards to disable CUDA graphs during HiCache loading, introducing a memory_relocation_version for tracking HiCache memory changes, and adding safety checks in the EAGLE speculative decoding path. My review has identified a potential off-by-one error in a boundary check and a couple of comments that could be clarified for better maintainability. Overall, this is a solid contribution that tackles a complex concurrency issue.
```python
# Clamp last_loc to valid KV cache range (no GPU sync)
total_slots = batch.token_to_kv_pool_allocator.size
last_loc = last_loc.clamp(0, total_slots)
```
There seems to be a potential off-by-one error here. If total_slots is the size of the allocator, the valid indices are in the range [0, total_slots - 1]. torch.clamp(min, max) is inclusive, so clamp(0, total_slots) allows the value total_slots, which is an out-of-bounds index. This could lead to an illegal memory access. It should probably be clamped to total_slots - 1.
```diff
- last_loc = last_loc.clamp(0, total_slots)
+ last_loc = last_loc.clamp(0, total_slots - 1)
```
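The boundary behavior behind the suggested fix can be checked in plain Python: like `torch.clamp`, the helper below is inclusive on both ends, so clamping to `total_slots` still lets an out-of-bounds index through (names here are illustrative, not from the PR):

```python
def clamp(value, lo, hi):
    # Inclusive on both ends, matching torch.clamp semantics.
    return max(lo, min(value, hi))

total_slots = 1024  # allocator size; valid slot indices are 0..1023

# Buggy upper bound: an index equal to total_slots survives the clamp.
assert clamp(1024, 0, total_slots) == 1024  # out of bounds!

# Corrected bound: every value lands inside 0..total_slots - 1.
assert clamp(1024, 0, total_slots - 1) == 1023
assert clamp(-5, 0, total_slots - 1) == 0
```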
```python
    chunked_req=self.chunked_req,
)
# Always set memory relocation version for CUDA graph invalidation
# This tracks all KV cache free operations (request completions, evictions)
```
The comment here is a bit misleading. hicache_memory_version tracks HiCache-specific memory relocations (like host-to-device loads or device evictions), not all KV cache free operations. This distinction is important as kv_free_version (from allocator.py) is meant for tracking all normal free operations. To improve clarity, I suggest updating the comment.
```diff
- # This tracks all KV cache free operations (request completions, evictions)
+ # This tracks HiCache memory relocations (e.g., host->device loads, device evictions).
```
```python
batch.prepare_for_decode()

# Always set memory relocation version for CUDA graph invalidation
# This tracks all KV cache free operations (request completions, evictions)
```
Similar to the prefill path, this comment is a bit misleading. hicache_memory_version tracks HiCache-specific memory relocations, not all free operations. Updating it for clarity would be beneficial for future maintenance.
```diff
- # This tracks all KV cache free operations (request completions, evictions)
+ # This tracks HiCache memory relocations (e.g., host->device loads, device evictions).
```
Does this fix solve errors like this one?
```text
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[2] NCCL INFO ncclCommInitRankConfig comm 0x761b5200 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 43000 commId 0x5230d054fc1eb700 - Init COMPLETE
terminate called after throwing an instance of 'c10::DistBackendError'
  what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
```
Yes @chenkaiyue, I had faced the same error as well, and yes, it should fix it. Could you try out my branch or cherry-pick the commits and see if it fixes your errors?
@PratikNarola1 OK~ Did you perform stress testing and long-duration testing? I found that this error might only appear once every few hours.
Yes @chenkaiyue, my forked branch has been running on 4 nodes, each with 8x H200s, serving GLM-5 for almost 2 weeks. I only raised the PR after testing it for a week. I have performed load testing with 128 concurrent requests, multi-turn conversations that build up over time, as well as running the complete 200k examples of UltraChat for almost 24 hours with 150 concurrent requests. It hasn't failed so far 😆
I cherry-picked this PR onto v0.5.9, and it happened again:
```text
I0303 23:04:30.822786 792429 real_client.cpp:1807] Time taken for batch_get_into: 1785us, read store: 0us, with memory key count: 256, offload key count: 0
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
terminate called after throwing an instance of 'c10::DistBackendError'
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
Fatal Python error: Aborted
Thread 0x00007ef3cbfff6c0 (most recent call first):
Thread 0x00007eef1bffe6c0 (most recent call first):
Thread 0x00007ef39ffff6c0 (most recent call first):
Thread 0x00007f3411ffe6c0 (most recent call first):
Thread 0x00007f3415fff6c0 (most recent call first):
Thread 0x00007f3babfff6c0 (most recent call first):
Thread 0x00007f56356be300 (most recent call first):
Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, zmq.backend.cython._zmq, PIL._imaging, sentencepiece._sentencepiece, yaml._yaml, regex._regex, markupsafe._speedups, cuda_utils, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._pcg64, numpy.random._generator, numpy.random._mt19937, numpy.random._philox, numpy.random._sfc64, numpy.random.mtrand, _cffi_backend, _cyutility, scipy._cyutility, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._batched_linalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update,
I0303 23:04:30.834414 792445 real_client.cpp:1807] Time taken for batch_get_into: 1554us, read store: 0us, with memory key count: 256, offload key count: 0
```
I use M2.5. @PratikNarola1 It seems to be related to the synchronization of eviction.
Change 1: HiCache Loading Guard for CUDA Graphs
Files: scheduler.py, schedule_batch.py, forward_batch_info.py, cache_controller.py, hiradix_cache.py
What: Propagated hicache_consumer_index through the full batch pipeline (ScheduleBatch → ModelWorkerBatch → ForwardBatch) and added checks in all CUDA graph runners to disable graphs when consumer_index >= 0.
Also added is_loading_in_progress() to catch async loads still running from a previous batch.
Why: When HiCache loads KV cache from host memory, it allocates new device pages and copies data asynchronously on a separate CUDA stream. During this window, page indices in req_to_token are being updated. If a CUDA graph replays and reads half-updated page tables → illegal memory access.
Why wasn't it done already: The upstream code literally had a TODO comment: # todo (zhiqiang): disable cuda graph execution if hicache loading triggered. The original hicache_consumer_index field existed but was only set during prefill batches, not decode batches. We extended it to decode batches AND added the is_loading_in_progress() check.
Why other models don't need this: Models without --enable-hierarchical-cache always have consumer_index = -1, so this check is a no-op.
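A minimal sketch of this guard, assuming simplified stand-ins for the real classes (the actual checks live inside sglang's CUDA graph runners and cache controller, with more state than shown here):

```python
from dataclasses import dataclass


@dataclass
class ForwardBatch:
    # -1 means no HiCache load is associated with this batch.
    hicache_consumer_index: int = -1


class CacheController:
    def __init__(self):
        self._loading = False

    def is_loading_in_progress(self) -> bool:
        # True while an async host->device load started by a
        # previous batch is still running on the side stream.
        return self._loading


def cuda_graph_can_run(batch: ForwardBatch, cache: CacheController) -> bool:
    # Disable graph replay while HiCache may be mutating page tables.
    if batch.hicache_consumer_index >= 0:
        return False
    if cache.is_loading_in_progress():
        return False
    return True
```

Without --enable-hierarchical-cache, `hicache_consumer_index` stays -1 and no load is ever in progress, so the guard never fires, matching the "no-op for other models" claim above.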
Change 2: Memory Relocation Version Tracking
Files: cache_controller.py, base_prefix_cache.py, radix_cache.py, hiradix_cache.py, allocator.py
What: Added memory_relocation_version counter to HiCacheController. Incremented only when HiCache actually relocates memory (host→device load, or device eviction). Separate from kv_free_version which increments on every normal free.
Why: We needed to distinguish "normal KV cache free/alloc" (happens every decode step, safe for CUDA graphs) from "HiCache memory relocation" (changes logical page index mapping). This distinction is what allows the last_loc recomputation in eagle_info.py to know when to refresh stale indices.
Why wasn't it done already: sglang had no version tracking for HiCache relocations at all. The EAGLE code path assumed page indices are stable between operations — which is true without HiCache. HiCache + EAGLE is a very new combination.
Why other models don't need this: RadixCache (non-hierarchical) returns get_memory_relocation_version() = 0 always. Only HiRadixCache has a real counter.
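The two-counter design can be sketched as follows; class and method names follow the PR text, but the bodies are illustrative rather than the real implementations:

```python
class RadixCache:
    """Non-hierarchical cache: page indices never relocate."""

    def get_memory_relocation_version(self) -> int:
        return 0  # constant, so version checks never invalidate graphs


class HiRadixCache:
    """Hierarchical cache whose pages can move between host and device."""

    def __init__(self):
        self.memory_relocation_version = 0
        self.kv_free_version = 0

    def free(self):
        # Normal per-step free: safe for CUDA graphs, so it must NOT
        # bump the relocation version (that was the original bug that
        # kept can_run() permanently False).
        self.kv_free_version += 1

    def relocate(self):
        # Host->device load or device eviction: the logical page index
        # mapping changed, so cached indices may now be stale.
        self.memory_relocation_version += 1

    def get_memory_relocation_version(self) -> int:
        return self.memory_relocation_version
```

A consumer captures the version before an operation and compares it afterwards; a mismatch means any page indices read earlier must be refreshed.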
Change 3: last_loc Bounds Clamping + Version-Aware Recomputation
File: eagle_info.py
What: In prepare_for_verify(): clamp last_loc to the valid KV cache slot range, and recompute it from req_to_token when the HiCache memory relocation version has changed since last_loc was first read.
Why: last_loc reads req_to_token[req_pool_indices, prefix_lens-1] — the last KV cache slot per request. If HiCache evicts/relocates pages between computing last_loc and using it, the value could point to a freed slot. The allocation itself can trigger eviction (need more device pages → evict old ones).
Why wasn't it done already: Without HiCache, page indices in req_to_token are stable between operations. HiCache introduces asynchronous memory relocation that can invalidate indices mid-operation. This race condition is specific to the alloc → use gap in EAGLE's verify path.
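The version-aware refresh plus clamp can be sketched torch-free (the real code indexes req_to_token tensors on the GPU; all names here are illustrative):

```python
def refresh_last_loc(req_to_token, req_pool_indices, prefix_lens,
                     last_loc, captured_version, current_version,
                     total_slots):
    # If HiCache relocated memory since last_loc was captured, the
    # cached slot indices may now point at freed slots: recompute
    # them from the (updated) req_to_token mapping.
    if current_version != captured_version:
        last_loc = [req_to_token[r][p - 1]
                    for r, p in zip(req_pool_indices, prefix_lens)]
    # Clamp into the valid slot range [0, total_slots - 1].
    return [max(0, min(loc, total_slots - 1)) for loc in last_loc]
```

The clamp is a last-resort safety net; the recomputation is what actually repairs indices invalidated in the alloc → use gap described above.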
Change 4: Draft Attention Backend → FlashInfer
File: glm5.sh (config only: --speculative-draft-attention-backend flashinfer)
What: Use FlashInfer instead of NSA for the EAGLE draft model's attention. Target model still uses NSA.
Why: GLM-5's NSA backend creates NativeSparseAttnMultiStepBackend (2 NSA sub-backends for 3-step draft). The CUDA graph replay path for this multi-step NSA configuration has a latent bug causing illegal memory access. This bug was always present but masked because CUDA graphs were always disabled: the original code used a kv_free_version check that incremented on every free(), so can_run() always returned False. When we fixed CUDA graphs to actually run, we exposed this bug.
Rather than debugging the NSA CUDA graph kernel (deep in the NSA attention implementation), we switch the draft model to FlashInferMLAMultiStepDraftBackend, which avoids the buggy multi-step NSA CUDA graph replay path entirely.
Why wasn't it done already: The --speculative-draft-attention-backend flag exists but is rarely used. Most deployments use the same backend for draft and target. The NSA + EAGLE CUDA graph path was effectively dead code: it existed in the codebase but was never reached because can_run() always returned False.
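As the PR notes, this change is config-only in glm5.sh. A hedged fragment showing where the flag sits in a launch command; only --speculative-draft-attention-backend is taken from the PR, and the surrounding flags and the checkpoint path are placeholders:

```shell
# Draft model attention -> FlashInfer; the target model keeps NSA.
python -m sglang.launch_server \
    --model-path <glm-5-checkpoint> \
    --speculative-algorithm EAGLE \
    --speculative-draft-attention-backend flashinfer \
    --enable-hierarchical-cache
```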