
Fix: CUDA Graph + HiCache + Speculative Decoding working together were giving a CUDA illegal memory access error. #19177

Open
PratikNarola1 wants to merge 3 commits into sgl-project:main from juspay:glm5-fix

Conversation

@PratikNarola1

Change 1: HiCache Loading Guard for CUDA Graphs

Files: scheduler.py, schedule_batch.py, forward_batch_info.py, cache_controller.py, hiradix_cache.py

What: Propagated hicache_consumer_index through the full batch pipeline (ScheduleBatch → ModelWorkerBatch → ForwardBatch) and added checks in all CUDA graph runners to disable graphs when consumer_index >= 0.
Also added is_loading_in_progress() to catch async loads still running from a previous batch.

Why: When HiCache loads KV cache from host memory, it allocates new device pages and copies data asynchronously on a separate CUDA stream. During this window, page indices in req_to_token are being updated. If a CUDA graph replays and reads half-updated page tables, the result is an illegal memory access.

Why wasn't it done already: The upstream code literally had a TODO comment: # todo (zhiqiang): disable cuda graph execution if hicache loading triggered. The original hicache_consumer_index field existed but was only set during prefill batches, not decode batches. We extended it to decode batches AND added the is_loading_in_progress() check.

Why other models don't need this: Models without --enable-hierarchical-cache always have consumer_index = -1, so this check is a no-op.
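The guard described in Change 1 can be sketched as follows. This is a hypothetical reduction, not the actual sglang code: only the field and method names (hicache_consumer_index, is_hicache_loading_in_progress, can_run) mirror the PR, while the classes themselves are simplified stand-ins.

```python
# Simplified sketch of the HiCache loading guard (hypothetical stand-ins for
# the real CudaGraphRunner / ForwardBatch / HiRadixCache in sglang).

class ForwardBatch:
    def __init__(self, batch_size, hicache_consumer_index=-1):
        self.batch_size = batch_size
        # >= 0 means a HiCache host->device load is associated with this batch
        self.hicache_consumer_index = hicache_consumer_index

class FakeHiRadixCache:
    def __init__(self, loading=False):
        self._loading = loading
    def is_hicache_loading_in_progress(self):
        # In the PR this catches async loads still running from a previous batch
        return self._loading

class CudaGraphRunner:
    def __init__(self, max_graph_bs, tree_cache=None):
        self.max_graph_bs = max_graph_bs
        self.tree_cache = tree_cache  # HiRadixCache, or None without --enable-hierarchical-cache

    def can_run(self, fb: ForwardBatch) -> bool:
        # Disable graph replay while HiCache is (or may still be) relocating
        # pages: replaying against a half-updated req_to_token table is what
        # causes the illegal memory access.
        if fb.hicache_consumer_index >= 0:
            return False
        if self.tree_cache is not None and self.tree_cache.is_hicache_loading_in_progress():
            return False
        return fb.batch_size <= self.max_graph_bs

runner = CudaGraphRunner(max_graph_bs=160, tree_cache=FakeHiRadixCache(loading=False))
print(runner.can_run(ForwardBatch(8)))                            # graphs allowed
print(runner.can_run(ForwardBatch(8, hicache_consumer_index=3)))  # load pending, fall back to eager
```

Without hierarchical caching the consumer index stays -1 and the loading check is never reached, which is why other models are unaffected.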


Change 2: Memory Relocation Version Tracking

Files: cache_controller.py, base_prefix_cache.py, radix_cache.py, hiradix_cache.py, allocator.py

What: Added memory_relocation_version counter to HiCacheController. Incremented only when HiCache actually relocates memory (host→device load, or device eviction). Separate from kv_free_version which increments on every normal free.

Why: We needed to distinguish "normal KV cache free/alloc" (happens every decode step, safe for CUDA graphs) from "HiCache memory relocation" (changes logical page index mapping). This distinction is what allows
the last_loc recomputation in eagle_info.py to know when to refresh stale indices.

Why wasn't it done already: sglang had no version tracking for HiCache relocations at all. The EAGLE code path assumed page indices are stable between operations — which is true without HiCache. HiCache + EAGLE is a very new combination.

Why other models don't need this: RadixCache (non-hierarchical) returns get_memory_relocation_version() = 0 always. Only HiRadixCache has a real counter.
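The two-counter distinction can be sketched as below. This is a hypothetical simplification under the names used in the PR; the real counters live in cache_controller.py, radix_cache.py, and hiradix_cache.py.

```python
# Sketch of the two-counter design: kv_free_version moves on every normal
# free, memory_relocation_version only on actual host<->device relocation.

class HiCacheController:
    def __init__(self):
        self.memory_relocation_version = 0  # bumps only when pages are relocated
        self.kv_free_version = 0            # bumps on every normal free

    def free(self, indices):
        self.kv_free_version += 1           # no page remap: safe for CUDA graphs

    def load_host_to_device(self, keys):
        self.memory_relocation_version += 1 # logical page index mapping changed

    def evict_device(self, num_pages):
        self.memory_relocation_version += 1

class RadixCache:
    # Non-hierarchical cache: indices never relocate, so the version is constant.
    def get_memory_relocation_version(self) -> int:
        return 0

class HiRadixCache(RadixCache):
    def __init__(self, controller: HiCacheController):
        self.cache_controller = controller
    def get_memory_relocation_version(self) -> int:
        return self.cache_controller.memory_relocation_version

ctrl = HiCacheController()
cache = HiRadixCache(ctrl)
ctrl.free([1, 2, 3])             # normal free: relocation version unchanged
ctrl.load_host_to_device(["k"])  # relocation: version bumps
print(cache.get_memory_relocation_version())  # 1
```

A caller that snapshots the relocation version before an operation can then tell whether any HiCache relocation happened during it, while ignoring the routine frees of every decode step.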


Change 3: last_loc Bounds Clamping + Version-Aware Recomputation

File: eagle_info.py

What: In prepare_for_verify():

  1. last_loc.clamp(0, total_slots) — prevents out-of-bounds access, no GPU sync
  2. Check memory_relocation_version before and after alloc_paged_token_slots_extend(). If version changed (HiCache relocated during allocation), recompute last_loc from fresh req_to_token data.

Why: last_loc reads req_to_token[req_pool_indices, prefix_lens-1] — the last KV cache slot per request. If HiCache evicts/relocates pages between computing last_loc and using it, the value could point to a freed slot. The allocation itself can trigger eviction (need more device pages → evict old ones).

Why wasn't it done already: Without HiCache, page indices in req_to_token are stable between operations. HiCache introduces asynchronous memory relocation that can invalidate indices mid-operation. This race condition is specific to the alloc → use gap in EAGLE's verify path.
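The alloc-to-use gap check can be sketched in plain Python as follows. prepare_last_loc, ToyCache, and alloc_with_relocation are illustrative stand-ins, assuming the version-counter interface described in Change 2; the real code in eagle_info.py operates on torch tensors.

```python
# Hedged sketch of the version-aware last_loc recomputation around the
# allocation in prepare_for_verify (pure-Python stand-in, not the real code).

def prepare_last_loc(req_to_token, req_pool_indices, prefix_lens,
                     total_slots, cache, alloc_fn):
    # Last KV slot per request, clamped into the valid index range.
    last_loc = [min(max(req_to_token[r][l - 1], 0), total_slots - 1)
                for r, l in zip(req_pool_indices, prefix_lens)]

    before = cache.get_memory_relocation_version()
    out_cache_loc = alloc_fn()          # may trigger HiCache eviction/relocation
    after = cache.get_memory_relocation_version()

    if after != before:
        # HiCache relocated pages during the allocation, so the cached
        # last_loc may point at freed slots: recompute from the fresh table.
        last_loc = [req_to_token[r][l - 1]
                    for r, l in zip(req_pool_indices, prefix_lens)]
    return last_loc, out_cache_loc

class ToyCache:
    def __init__(self):
        self.version = 0
    def get_memory_relocation_version(self):
        return self.version

req_to_token = {0: [10, 11, 12]}
cache = ToyCache()

def alloc_with_relocation():
    cache.version += 1               # simulate an eviction during allocation
    req_to_token[0] = [40, 41, 42]   # pages moved: old indices are stale
    return [43]

last_loc, _ = prepare_last_loc(req_to_token, [0], [3], 64,
                               cache, alloc_with_relocation)
print(last_loc)  # [42], refreshed after relocation instead of the stale 12
```

When no relocation happens the version is unchanged and the pre-computed (clamped) last_loc is used as-is, so the common path pays no extra GPU synchronization.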


Change 4: Draft Attention Backend → FlashInfer

File: glm5.sh (config only: --speculative-draft-attention-backend flashinfer)

What: Use FlashInfer instead of NSA for the EAGLE draft model's attention. Target model still uses NSA.

Why: GLM-5's NSA backend creates NativeSparseAttnMultiStepBackend (2 NSA sub-backends for 3-step draft). The CUDA graph replay path for this multi-step NSA configuration has a latent bug causing illegal memory access. This bug was always present but masked because CUDA graphs were always disabled — the original code used a kv_free_version check that incremented on every free(), so can_run() always returned False. When
we fixed CUDA graphs to actually run, we exposed this bug.

Rather than debugging the NSA CUDA graph kernel (deep in the NSA attention implementation), we switch the draft model to FlashInferMLAMultiStepDraftBackend which:

  • Is automatically created for MLA models (GLM-5 uses MLA architecture)
  • Supports page_size=64 and topk=1 (both used by GLM-5)
  • Has a battle-tested CUDA graph implementation
  • Shares the same MLA KV cache format, so draft and target are compatible

Why wasn't it done already: The --speculative-draft-attention-backend flag exists but is rarely used. Most deployments use the same backend for draft and target. The NSA + EAGLE CUDA graph path was effectively
dead code — it existed in the codebase but was never reached because can_run() always returned False.
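For reference, the backend override looks roughly like this in a launch script. This is an illustrative config fragment, not the actual glm5.sh: the model path and the surrounding flags are placeholders, and only the final flag is the change described above.

```shell
# Illustrative launch command; everything except the last flag is a placeholder.
python -m sglang.launch_server \
  --model-path <path-to-glm5> \
  --enable-hierarchical-cache \
  --speculative-algorithm EAGLE \
  --speculative-draft-attention-backend flashinfer
```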

PratikNarola1 and others added 2 commits February 18, 2026 19:50
…dation

- Add memory_relocation_version tracking to HiCacheController (incremented
  only on actual host↔device memory relocations, not normal alloc/free)
- Propagate hicache_memory_version through ScheduleBatch → ModelWorkerBatch
  → ForwardBatch for CUDA graph invalidation checks
- Restore version check in all CUDA graph runners (CudaGraphRunner,
  PiecewiseCudaGraphRunner, EAGLEDraftCudaGraphRunner,
  EAGLEDraftExtendCudaGraphRunner, MultiLayerEagleDraftExtendCudaGraphRunner)
- Add hicache_consumer_index checks to disable graphs during active loading
- Add last_loc bounds clamping in eagle_info.py prepare_for_verify
- Add HiCache version-aware last_loc recomputation after memory relocation
- Remove GPU→CPU syncing debug logs from allocator, cuda_graph_runner, and
  piecewise_cuda_graph_runner (partial cleanup, more to follow)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…wise runner

Remove [EAGLE verify] entry/freeing logs with .min().item()/.max().item()
calls and stale comment in piecewise_cuda_graph_runner replay method.
@gemini-code-assist
Contributor

Summary of Changes

Hello @PratikNarola1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves critical CUDA illegal memory access errors encountered when using CUDA Graphs, HiCache, and Speculative Decoding together. The changes introduce robust mechanisms to manage asynchronous memory operations and relocations within the hierarchical cache, ensuring CUDA graphs are disabled when memory states are unstable. Additionally, it refines speculative decoding logic to handle HiCache-induced memory changes and switches the draft attention backend for GLM-5 to FlashInfer to circumvent a specific CUDA graph bug in the NSA backend.

Highlights

  • HiCache Loading Guard for CUDA Graphs: Propagated hicache_consumer_index through the batch pipeline and added checks in CUDA graph runners to disable graphs when consumer_index >= 0 or is_loading_in_progress() is true. This prevents illegal memory access during asynchronous KV cache loading.
  • Memory Relocation Version Tracking: Introduced a memory_relocation_version counter in HiCacheController, incremented only when HiCache actually relocates memory (host→device load or device eviction). This distinguishes "normal KV cache free/alloc" from "HiCache memory relocation" for CUDA graph invalidation.
  • last_loc Bounds Clamping + Version-Aware Recomputation: Implemented last_loc.clamp(0, total_slots) to prevent out-of-bounds access and added a check for memory_relocation_version changes before and after alloc_paged_token_slots_extend(). If the version changes, last_loc is recomputed to address stale indices caused by HiCache memory relocation.
  • Draft Attention Backend Switch to FlashInfer: Switched the EAGLE draft model's attention backend from NSA to FlashInfer for GLM-5. This bypasses a latent bug in NSA's CUDA graph replay path for multi-step draft configurations, leveraging FlashInfer's robust CUDA graph implementation.


Changelog
  • python/sglang/srt/managers/cache_controller.py
    • Added is_loading_in_progress method to check for active HiCache loading.
    • Introduced memory_relocation_version counter and incremented it during load and evict_device operations.
    • Added get_memory_relocation_version method.
  • python/sglang/srt/managers/schedule_batch.py
    • Added hicache_memory_version and hicache_consumer_index fields to ScheduleBatch.
    • Propagated these HiCache-related fields during batch merging and conversion to ModelWorkerBatch.
  • python/sglang/srt/managers/scheduler.py
    • Updated _get_new_batch_prefill_raw and update_running_batch to set hicache_memory_version and hicache_consumer_index based on HiCache loading status.
  • python/sglang/srt/mem_cache/allocator.py
    • Added logging import and logger instance.
    • Introduced kv_free_version counter, incremented on free_group_end and free operations.
    • Added a warning message for failed memory allocations.
  • python/sglang/srt/mem_cache/base_prefix_cache.py
    • Added a default get_memory_relocation_version method returning 0.
  • python/sglang/srt/mem_cache/hiradix_cache.py
    • Implemented is_hicache_loading_in_progress and get_memory_relocation_version methods, delegating to the cache controller.
  • python/sglang/srt/mem_cache/radix_cache.py
    • Added get_memory_relocation_version method, always returning 0 for non-hierarchical cache.
  • python/sglang/srt/model_executor/cuda_graph_runner.py
    • Modified can_run to include a check for hicache_consumer_index to disable CUDA graphs when HiCache loading is active.
  • python/sglang/srt/model_executor/forward_batch_info.py
    • Added logging import and logger instance.
    • Included hicache_consumer_index and hicache_memory_version in ForwardBatch and propagated them during initialization.
  • python/sglang/srt/model_executor/piecewise_cuda_graph_runner.py
    • Updated can_run to disable CUDA graphs if hicache_consumer_index indicates active HiCache loading.
    • Simplified the return logic for can_run.
  • python/sglang/srt/speculative/eagle_draft_cuda_graph_runner.py
    • Added logging import and logger instance.
    • Adjusted can_run to consider hicache_consumer_index for disabling CUDA graphs.
  • python/sglang/srt/speculative/eagle_draft_extend_cuda_graph_runner.py
    • Added logging import and logger instance.
    • Modified can_run to disable CUDA graphs if hicache_consumer_index indicates active HiCache loading.
  • python/sglang/srt/speculative/eagle_info.py
    • Added logic to store memory_relocation_version before and after allocation.
    • Recomputed last_loc if memory_relocation_version changed, addressing stale indices.
    • Clamped last_loc to valid KV cache range.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive set of fixes to address CUDA illegal memory access errors arising from the interaction of CUDA Graphs, HiCache, and speculative decoding (EAGLE). The changes are well-reasoned and robust. Key improvements include adding guards to disable CUDA graphs during HiCache loading, introducing a memory_relocation_version for tracking HiCache memory changes, and adding safety checks in the EAGLE speculative decoding path. My review has identified a potential off-by-one error in a boundary check and a couple of comments that could be clarified for better maintainability. Overall, this is a solid contribution that tackles a complex concurrency issue.


# Clamp last_loc to valid KV cache range (no GPU sync)
total_slots = batch.token_to_kv_pool_allocator.size
last_loc = last_loc.clamp(0, total_slots)
Contributor


high

There seems to be a potential off-by-one error here. If total_slots is the size of the allocator, the valid indices are in the range [0, total_slots - 1]. torch.clamp(min, max) is inclusive, so clamp(0, total_slots) allows the value total_slots, which is an out-of-bounds index. This could lead to an illegal memory access. It should probably be clamped to total_slots - 1.

Suggested change
last_loc = last_loc.clamp(0, total_slots)
last_loc = last_loc.clamp(0, total_slots - 1)
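Both clamp bounds are inclusive in PyTorch, which is the crux of the comment above. A plain-Python stand-in (not the actual torch call) makes the boundary behavior easy to check:

```python
# Pure-Python stand-in for tensor.clamp(min, max): both bounds are inclusive,
# so clamp(0, total_slots) can still yield total_slots, one past the last
# valid index.

def clamp(value, lo, hi):
    return max(lo, min(value, hi))

total_slots = 8
valid = range(total_slots)              # legal indices: 0 .. total_slots - 1

print(clamp(9, 0, total_slots))         # 8 -> out of bounds
print(clamp(9, 0, total_slots - 1))     # 7 -> last valid index
print(clamp(9, 0, total_slots - 1) in valid)  # True
```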

chunked_req=self.chunked_req,
)
# Always set memory relocation version for CUDA graph invalidation
# This tracks all KV cache free operations (request completions, evictions)
Contributor


medium

The comment here is a bit misleading. hicache_memory_version tracks HiCache-specific memory relocations (like host-to-device loads or device evictions), not all KV cache free operations. This distinction is important as kv_free_version (from allocator.py) is meant for tracking all normal free operations. To improve clarity, I suggest updating the comment.

Suggested change
# This tracks all KV cache free operations (request completions, evictions)
# This tracks HiCache memory relocations (e.g., host->device loads, device evictions).

batch.prepare_for_decode()

# Always set memory relocation version for CUDA graph invalidation
# This tracks all KV cache free operations (request completions, evictions)
Contributor


medium

Similar to the prefill path, this comment is a bit misleading. hicache_memory_version tracks HiCache-specific memory relocations, not all free operations. Updating it for clarity would be beneficial for future maintenance.

Suggested change
# This tracks all KV cache free operations (request completions, evictions)
# This tracks HiCache memory relocations (e.g., host->device loads, device evictions).

@hzh0425 hzh0425 self-assigned this Feb 24, 2026
@xiezhq-hermann xiezhq-hermann self-assigned this Feb 24, 2026
@xiezhq-hermann xiezhq-hermann added the hicache Hierarchical Caching for SGLang label Feb 24, 2026
@huangtingwei9988 huangtingwei9988 self-assigned this Feb 24, 2026
@stmatengss stmatengss self-assigned this Feb 26, 2026
@chenkaiyue

chenkaiyue commented Feb 28, 2026

Does this fix solve errors like this one?

[2026-01-30 08:51:27] INFO: 10.0.3.4:48642 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-01-30 08:51:28 TP0] Prefill batch, #new-seq: 3, #new-token: 8192, #cached-token: 146, token usage: 0.15, #running-req: 1, #queue-req: 0,
[2026-01-30 08:51:29 TP0] Prefill batch, #new-seq: 1, #new-token: 1039, #cached-token: 0, token usage: 0.16, #running-req: 3, #queue-req: 0,
[rank2]:[E130 08:51:31.131460991 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for 'cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

[2] NCCL INFO ncclCommInitRankConfig comm 0x761b5200 rank 2 nranks 8 cudaDev 2 nvmlDev 2 busId 43000 commId 0x5230d054fc1eb700 - Init COMPLETE
[2] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 8 total 1.24 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 0.01, topo 1.03, graphs 0.01, connections 0.19, rest 0.00)
Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f128bd91b80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0x11fb7 (0x7f1306166fb7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f128cc92fe0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f128cca26b8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f128cca68b9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f128cca882f in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xecdb4 (0x7f1485b8ddb4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7f14886fcaa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7f1488789c6c in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
[2026-01-30 08:51:31 TP2] Scheduler hit an exception: Traceback (most recent call last):
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2974, in run_scheduler_process
scheduler.event_loop_overlap()
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1133, in event_loop_overlap
pop_and_process()
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1104, in pop_and_process
self.process_batch_result(tmp_batch, tmp_result)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2425, in process_batch_result
self.process_batch_result_prefill(batch, result)
File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler_output_processor_mixin.py", line 94, in process_batch_result_prefill
result.copy_done.synchronize()
File "/usr/local/lib/python3.12/dist-packages/torch/cuda/streams.py", line 231, in synchronize
super().synchronize()
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for 'cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

what(): [PG ID 2 PG GUID 3 Rank 2] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for 'cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

@PratikNarola1
Author

Yes @chenkaiyue, I had faced the same error as well, and yes, it should fix it. Could you try out my branch or cherry-pick the commits and see if it fixes your errors?

@chenkaiyue

Yes @chenkaiyue I had faced same error as well. and yes it should fix it. Could you try out my branch or cherry pick commits and see if it fixes your errors?

@PratikNarola1 OK~ Did you perform stress testing and long-duration testing? I found that this error might only appear once every few hours.

@PratikNarola1
Author

Yes @chenkaiyue. My forked branch has been running GLM-5 on 4 nodes, each with 8x H200s, for almost 2 weeks; I raised the PR only after testing it for a week. I have performed load testing with 128 concurrent requests, multi-turn conversations that build up over time, as well as the complete 200k examples of UltraChat running almost 24 hours with 150 concurrent requests. It hasn't failed so far 😆

@chenkaiyue

chenkaiyue commented Mar 4, 2026

I cherry-picked this PR onto v0.5.9, and it happened again:

I0303 23:04:30.812590 792429 real_client.cpp:1807] Time taken for batch_get_into: 1522us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.812747 792445 real_client.cpp:1807] Time taken for batch_get_into: 1421us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.814308 792435 real_client.cpp:1807] Time taken for batch_get_into: 2353us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.814478 792429 real_client.cpp:1807] Time taken for batch_get_into: 1531us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.814577 792445 real_client.cpp:1807] Time taken for batch_get_into: 1552us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.814955 792447 real_client.cpp:1807] Time taken for batch_get_into: 2043us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.816399 792429 real_client.cpp:1807] Time taken for batch_get_into: 1572us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.816470 792445 real_client.cpp:1807] Time taken for batch_get_into: 1619us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.818128 792445 real_client.cpp:1807] Time taken for batch_get_into: 1379us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.818199 792435 real_client.cpp:1807] Time taken for batch_get_into: 3482us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.818585 792429 real_client.cpp:1807] Time taken for batch_get_into: 1847us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.818720 792447 real_client.cpp:1807] Time taken for batch_get_into: 3415us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.819808 792445 real_client.cpp:1807] Time taken for batch_get_into: 1410us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.820375 792435 real_client.cpp:1807] Time taken for batch_get_into: 1764us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.820674 792429 real_client.cpp:1807] Time taken for batch_get_into: 1756us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.821660 792445 real_client.cpp:1807] Time taken for batch_get_into: 1577us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.822131 792447 real_client.cpp:1807] Time taken for batch_get_into: 3025us, read store: 0us, with memory key count: 256, offload key count: 0
[2026-03-03 23:04:30 TP1] Scheduler hit an exception: Traceback (most recent call last):
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 3194, in run_scheduler_process
    scheduler.event_loop_overlap()
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1154, in event_loop_overlap
    batch = self.get_next_batch_to_run()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1924, in get_next_batch_to_run
    new_batch = self.get_new_batch_prefill()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 1968, in get_new_batch_prefill
    ret = self._get_new_batch_prefill_raw(
  File "/sgl-workspace/sglang/python/sglang/srt/managers/scheduler.py", line 2181, in _get_new_batch_prefill_raw
    new_batch.prepare_for_extend()
  File "/sgl-workspace/sglang/python/sglang/srt/managers/schedule_batch.py", line 1505, in prepare_for_extend
    out_cache_loc, req_pool_indices_tensor, req_pool_indices = alloc_for_extend(
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/common.py", line 377, in alloc_for_extend
    write_cache_indices(
  File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/common.py", line 92, in write_cache_indices
    prefix_pointers = torch.tensor(
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
Search for 'cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I0303 23:04:30.822786 792429 real_client.cpp:1807] Time taken for batch_get_into: 1785us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.822919 792435 real_client.cpp:1807] Time taken for batch_get_into: 2174us, read store: 0us, with memory key count: 256, offload key count: 0
[2026-03-03 23:04:30] SIGQUIT received. signum=None, frame=None. It usually means one child failed.
I0303 23:04:30.823367 792445 real_client.cpp:1807] Time taken for batch_get_into: 1435us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.824891 792435 real_client.cpp:1807] Time taken for batch_get_into: 1592us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.824900 792429 real_client.cpp:1807] Time taken for batch_get_into: 1788us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.825328 792445 real_client.cpp:1807] Time taken for batch_get_into: 1682us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.826064 792447 real_client.cpp:1807] Time taken for batch_get_into: 3467us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.826938 792435 real_client.cpp:1807] Time taken for batch_get_into: 1656us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.827037 792429 real_client.cpp:1807] Time taken for batch_get_into: 1807us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.827523 792445 real_client.cpp:1807] Time taken for batch_get_into: 1915us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.828977 792435 real_client.cpp:1807] Time taken for batch_get_into: 1645us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.829147 792429 real_client.cpp:1807] Time taken for batch_get_into: 1772us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.829236 792445 real_client.cpp:1807] Time taken for batch_get_into: 1427us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.829807 792447 real_client.cpp:1807] Time taken for batch_get_into: 1672us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.830878 792445 real_client.cpp:1807] Time taken for batch_get_into: 1371us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.831256 792429 real_client.cpp:1807] Time taken for batch_get_into: 1774us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.831593 792435 real_client.cpp:1807] Time taken for batch_get_into: 2186us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.832234 792447 real_client.cpp:1807] Time taken for batch_get_into: 2058us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.832581 792445 real_client.cpp:1807] Time taken for batch_get_into: 1428us, read store: 0us, with memory key count: 256, offload key count: 0
[rank1]:[E303 23:04:30.974417380 ProcessGroupNCCL.cpp:2057] [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for 'cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f5438d91b80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0x11fb7 (0x7f54b3166fb7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f5439c92fe0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f5439ca26b8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f5439ca68b9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f5439ca882f in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xecdb4 (0x7f5632bb1db4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7f563575baa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7f56357e8c6c in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of 'c10::DistBackendError'
I0303 23:04:30.833293 792429 real_client.cpp:1807] Time taken for batch_get_into: 1702us, read store: 0us, with memory key count: 256, offload key count: 0
what(): [PG ID 2 PG GUID 3 Rank 1] Process group watchdog thread terminated with exception: CUDA error: an illegal memory access was encountered
Search for `cudaErrorIllegalAddress` in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information. CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1. Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at /pytorch/c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f5438d91b80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0x11fb7 (0x7f54b3166fb7 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10d::ProcessGroupNCCL::WorkNCCL::finishedGPUExecutionInternal() const + 0x50 (0x7f5439c92fe0 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::WorkNCCL::isCompleted() + 0x68 (0x7f5439ca26b8 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #4: c10d::ProcessGroupNCCL::Watchdog::runLoop() + 0x969 (0x7f5439ca68b9 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #5: c10d::ProcessGroupNCCL::Watchdog::run() + 0xdf (0x7f5439ca882f in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #6: + 0xecdb4 (0x7f5632bb1db4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #7: + 0x9caa4 (0x7f563575baa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #8: + 0x129c6c (0x7f56357e8c6c in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from run at /pytorch/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:2063 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits, std::allocator >) + 0x80 (0x7f5438d91b80 in /usr/local/lib/python3.12/dist-packages/torch/lib/libc10.so)
frame #1: + 0xe69b51 (0x7f5439c7eb51 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0x951271 (0x7f5439766271 in /usr/local/lib/python3.12/dist-packages/torch/lib/libtorch_cuda.so)
frame #3: + 0xecdb4 (0x7f5632bb1db4 in /usr/lib/x86_64-linux-gnu/libstdc++.so.6)
frame #4: + 0x9caa4 (0x7f563575baa4 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #5: + 0x129c6c (0x7f56357e8c6c in /usr/lib/x86_64-linux-gnu/libc.so.6)

Fatal Python error: Aborted

Thread 0x00007ef3cbfff6c0 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 783 in _put_batch_zero_copy_impl
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 599 in batch_set_v1
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 1045 in _page_set_zero_copy
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 1075 in _page_backup
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 1101 in backup_thread_func
File "/usr/lib/python3.12/threading.py", line 1010 in run
File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007eef1bffe6c0 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 788 in _get_batch_zero_copy_impl
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/storage/mooncake_store/mooncake_store.py", line 547 in batch_get_v1
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 823 in _page_get_zero_copy
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 870 in _page_transfer
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 891 in prefetch_io_aux_func
File "/usr/lib/python3.12/threading.py", line 1010 in run
File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007ef39ffff6c0 (most recent call first):
File "/sgl-workspace/sglang/python/sglang/srt/utils/watchdog.py", line 145 in _watchdog_once
File "/sgl-workspace/sglang/python/sglang/srt/utils/watchdog.py", line 125 in _watchdog_thread
File "/usr/lib/python3.12/threading.py", line 1010 in run
File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007f3411ffe6c0 (most recent call first):
File "/usr/lib/python3.12/threading.py", line 359 in wait
File "/usr/lib/python3.12/threading.py", line 655 in wait
File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007f3415fff6c0 (most recent call first):
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 73 in _recv_msg
File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_worker/subproc_pool.py", line 228 in _read_thread
File "/usr/lib/python3.12/threading.py", line 1010 in run
File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007f3babfff6c0 (most recent call first):
File "/usr/lib/python3.12/threading.py", line 359 in wait
File "/usr/lib/python3.12/threading.py", line 655 in wait
File "/usr/local/lib/python3.12/dist-packages/tqdm/_monitor.py", line 60 in run
File "/usr/lib/python3.12/threading.py", line 1073 in _bootstrap_inner
File "/usr/lib/python3.12/threading.py", line 1030 in _bootstrap

Thread 0x00007f56356be300 (most recent call first):
File "/usr/lib/python3.12/threading.py", line 1167 in _wait_for_tstate_lock
File "/usr/lib/python3.12/threading.py", line 1151 in join
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 400 in _stop_storage_threads
File "/sgl-workspace/sglang/python/sglang/srt/managers/cache_controller.py", line 539 in detach_storage_backend
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/hiradix_cache.py", line 361 in detach_storage_backend
File "/sgl-workspace/sglang/python/sglang/srt/mem_cache/hiradix_cache.py", line 179 in shutdown

Extension modules: numpy._core._multiarray_umath, numpy.linalg._umath_linalg, pybase64._pybase64, charset_normalizer.md, requests.packages.charset_normalizer.md, requests.packages.chardet.md, multidict._multidict, yarl._quoting_c, propcache._helpers_c, aiohttp._http_writer, aiohttp._http_parser, aiohttp._websocket.mask, aiohttp._websocket.reader_c, frozenlist._frozenlist, torch._C, torch._C._dynamo.autograd_compiler, torch._C._dynamo.eval_frame, torch._C._dynamo.guards, torch._C._dynamo.utils, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, psutil._psutil_linux, zmq.backend.cython._zmq, PIL._imaging, sentencepiece._sentencepiece, yaml._yaml, regex._regex, markupsafe._speedups, cuda_utils, PIL._imagingft, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._pcg64, numpy.random._generator, numpy.random._mt19937, numpy.random._philox, numpy.random._sfc64, numpy.random.mtrand, _cffi_backend, _cyutility, scipy._cyutility, scipy._lib._ccallback_c, scipy.linalg._fblas, scipy.linalg._flapack, scipy.linalg.cython_lapack, scipy.linalg._cythonized_array_utils, scipy.linalg._solve_toeplitz, scipy.linalg._batched_linalg, scipy.linalg._decomp_lu_cython, scipy.linalg._matfuncs_schur_sqrtm, scipy.linalg._matfuncs_expm, scipy.linalg._linalg_pythran, scipy.linalg.cython_blas, scipy.linalg._decomp_update, I0303 23:04:30.834414 792445 real_client.cpp:1807] Time taken for batch_get_into: 1554us, read store: 0us, with memory key count: 256, offload key count: 0
scipy.sparse._sparsetools, _csparsetools, scipy.sparse._csparsetools, scipy.sparse.linalg._dsolve._superlu, scipy.sparse.linalg._eigen.arpack._arpacklib, scipy.sparse.linalg._propack, scipy.optimize._group_columns, scipy._lib.messagestream, scipy.optimize._trlib._trlib, scipy.optimize._lbfgsb, _moduleTNC, scipy.optimize._moduleTNC, scipy.optimize._slsqplib, scipy.optimize._minpack, scipy.optimize._lsq.givens_elimination, scipy.optimize._zeros, scipy._lib._uarray._uarray, scipy.special._ufuncs_cxx, scipy.special._ellip_harm_2, scipy.special._special_ufuncs, scipy.special._gufuncs, scipy.special._ufuncs, scipy.special._specfun, scipy.special._comb, scipy.linalg._decomp_interpolative, scipy.optimize._bglu_dense, scipy.optimize._lsap, scipy.spatial._ckdtree, scipy.spatial._qhull, scipy.spatial._voronoi, scipy.spatial._hausdorff, scipy.spatial._distance_wrap, scipy.spatial.transform._rotation_cy, scipy.spatial.transform._rigid_transform_cy, scipy.optimize._direct, setproctitle._setproctitle, cuda.bindings._bindings.cydriver, cuda.bindings.cydriver, cuda.bindings.driver, tvm_ffi.core, cuda.bindings._bindings.cyruntime_ptds, cuda.bindings._bindings.cyruntime, cuda.bindings.cyruntime, cuda.bindings.runtime, cuda.bindings._bindings.cynvrtc, cuda.bindings.cynvrtc, cuda.bindings.nvrtc, msgspec._core, grpc._cython.cygrpc, google._upb._message, cuda.cudart, cuda.nvrtc, __triton_launcher (total: 112)
I0303 23:04:30.835356 792429 real_client.cpp:1807] Time taken for batch_get_into: 1723us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.835371 792435 real_client.cpp:1807] Time taken for batch_get_into: 3388us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.838027 792429 real_client.cpp:1807] Time taken for batch_get_into: 2249us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.838120 792445 real_client.cpp:1807] Time taken for batch_get_into: 3444us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.838400 792435 real_client.cpp:1807] Time taken for batch_get_into: 2547us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.839972 792429 real_client.cpp:1807] Time taken for batch_get_into: 1604us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.840166 792445 real_client.cpp:1807] Time taken for batch_get_into: 1725us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.840598 792435 real_client.cpp:1807] Time taken for batch_get_into: 1771us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.856478 792435 real_client.cpp:1807] Time taken for batch_get_into: 1517us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.857472 792445 real_client.cpp:1807] Time taken for batch_get_into: 1561us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.870467 792445 real_client.cpp:1807] Time taken for batch_get_into: 2529us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.871104 792429 real_client.cpp:1807] Time taken for batch_get_into: 1667us, read store: 0us, with memory key count: 256, offload key count: 0
I0303 23:04:30.872181 792435 real_client.cpp:1807] Time taken for batch_get_into: 1395us, read store: 0us, with


chenkaiyue commented Mar 4, 2026

```shell
python3 -m sglang.launch_server \
  --model-path /dev/shm/$MODEL_NAME \
  --served-model-name $MODEL_NAME \
  --tp 4 \
  --tool-call-parser minimax-m2 \
  --reasoning-parser minimax-append-think \
  --log-requests-level 2 \
  --log-requests \
  --enable-metrics \
  --allow-auto-truncate \
  --enable-request-time-stats-logging \
  --mem-fraction 0.85 \
  --enable-cache-report \
  --collect-tokens-histogram \
  --trust-remote-code \
  --host 0.0.0.0 \
  --port $MASTER_PORT $EXTRA_ARGS \
  --enable-hierarchical-cache \
  --hicache-ratio 1.9 \
  --hicache-write-policy write_through \
  --hicache-storage-prefetch-policy best_effort \
  --hicache-storage-backend mooncake
```

I use M2.5. @PratikNarola1, it seems to be related to the synchronization of eviction.


Labels

hicache (Hierarchical Caching for SGLang), high priority
