
Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads#10788

Merged
hnyls2002 merged 16 commits into sgl-project:main from YAMY1234:fix-rope-illegal-memory-access-10713
Oct 19, 2025

Conversation


@YAMY1234 YAMY1234 commented Sep 23, 2025


Background

Running SGLang with EAGLE speculative decoding and long sequences occasionally triggers

RuntimeError: BatchQKApplyRotaryPosIdsCosSinCache failed with error code an illegal memory access was encountered

Root cause analysis shows that the crash occurs whenever a generated position_id exceeds the length of the RoPE cos_sin_cache tensor, which is initialized to max_position_embeddings.

  • A mismatch between the target and draft model’s max_position_embeddings increases the likelihood, but the fundamental issue is that the cache can be too short for the positions actually reached at runtime, especially with speculative multi-step decoding and batching.

What This PR Does

  • Thread-Safe Dynamic Expansion

    • Adds _ensure_cos_sin_cache_length() to RotaryEmbedding for on-demand, thread-safe cache growth.
  • Device-Aware Initialization

    • Builds the initial cos_sin_cache directly on the current CUDA device, avoiding CPU-to-GPU copy overhead.
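A minimal sketch of how such an expansion helper can work. The class layout, helper names, and 128-token chunk size below are illustrative assumptions, not the exact PR code; the double-checked locking pattern is what makes growth safe under concurrent callers:

```python
import threading
import torch

_CHUNK = 128  # assumed alignment: grow in 128-position chunks to limit reallocations


class RotaryEmbedding:
    def __init__(self, head_dim: int, base: float, max_positions: int,
                 device: str = "cpu"):
        self.head_dim = head_dim
        self.base = base
        self._lock = threading.Lock()
        self.cos_sin_cache = self._compute_cos_sin_cache(max_positions, device)

    def _compute_cos_sin_cache(self, num_positions: int, device) -> torch.Tensor:
        # Standard RoPE frequencies; build directly on the target device.
        inv_freq = 1.0 / (self.base ** (
            torch.arange(0, self.head_dim, 2, dtype=torch.float32, device=device)
            / self.head_dim))
        t = torch.arange(num_positions, dtype=torch.float32, device=device)
        freqs = torch.outer(t, inv_freq)
        return torch.cat([freqs.cos(), freqs.sin()], dim=-1)

    def _ensure_cos_sin_cache_length(self, needed_len: int) -> None:
        # Fast path: no lock when the cache is already long enough.
        if self.cos_sin_cache.shape[0] >= needed_len:
            return
        with self._lock:
            if self.cos_sin_cache.shape[0] >= needed_len:
                return  # another thread grew it first
            # Round up to the chunk size so growth stays infrequent.
            new_len = ((needed_len + _CHUNK - 1) // _CHUNK) * _CHUNK
            self.cos_sin_cache = self._compute_cos_sin_cache(
                new_len, self.cos_sin_cache.device)
```

The check-lock-recheck ordering means the common case (cache already big enough) pays no locking cost, while concurrent growers never shrink or double-allocate the cache.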

Expansion Policy

The recommended pre-expansion formula (handled in the model runner before graph capture) is:

reserve_len = max(target_max_pos, draft_max_pos)
            + speculative_num_steps * speculative_num_draft_tokens
            + safety_margin  # (~1k)

This covers:

  • Target/draft model declared maxima,
  • Multi-step speculative decoding,
  • A configurable buffer for future changes.
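The formula above can be written as a small helper. The function name and default margin here are illustrative assumptions; the actual config field names in the model runner may differ:

```python
def compute_reserve_len(target_max_pos: int, draft_max_pos: int,
                        speculative_num_steps: int,
                        speculative_num_draft_tokens: int,
                        safety_margin: int = 1024) -> int:
    # Larger of the two declared maxima, plus the extra positions that
    # multi-step speculative decoding can generate, plus a ~1k buffer.
    return (max(target_max_pos, draft_max_pos)
            + speculative_num_steps * speculative_num_draft_tokens
            + safety_margin)


# Example: target 32768, draft 4096, 5 steps x 8 draft tokens, 1k margin
# -> 32768 + 40 + 1024 = 33832
```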

Results & Verification

  • Stress-tested with 1,200+ input tokens, speculative_num_steps=5, and high concurrency.
  • No illegal memory access observed.
  • Single, one-time memory allocation; no runtime re-allocations during inference.
  • Negligible performance impact: graph capture proceeds without synchronization penalties.

Why This Matters

Without this fix, any under-provisioned RoPE cache breaks under long or speculative workloads.
This PR makes SGLang robust regardless of model mismatch and future-proofs it against larger context windows.


In short:

Before: RoPE cache length fixed at model config → runtime overflow possible.
After: RoPE cache auto-expands safely and transparently → no illegal memory access.


Closes: #10713 (and related long-sequence RoPE overflow issues)


@gemini-code-assist

Summary of Changes


This pull request addresses a critical "RuntimeError" in SGLang's EAGLE speculative decoding, which occurred when "position_id" exceeded the fixed-size RoPE "cos_sin_cache". The core solution involves making the RoPE cache dynamically expandable and thread-safe, ensuring it can grow to accommodate longer sequences and speculative decoding steps without crashing. This change significantly enhances the robustness of SGLang, making it resilient to model "max_position_embeddings" mismatches and future-proofing it for larger context windows.

Highlights

  • Dynamic RoPE Cache Expansion: Implemented a thread-safe mechanism to dynamically expand the RoPE "cos_sin_cache" on demand, preventing out-of-bounds errors during long-sequence and speculative decoding workloads. Cache growth is aligned to 128 tokens to optimize reallocations.
  • CUDA-Graph Safe Position Handling: Enhanced "forward_cuda" to ensure "positions" are "int64" and introduced optional debug assertions for position ID bounds checking, which are active only outside CUDA Graph capture to avoid host synchronization issues.
  • Device-Aware Cache Initialization: Modified the "_compute_cos_sin_cache" method to initialize the cache directly on the current CUDA device, eliminating CPU-GPU copy overhead during setup.
  • Proactive Cache Pre-expansion: Added a pre-expansion logic in the model runner to reserve sufficient RoPE cache length before CUDA Graph capture, accounting for base context, speculative decoding steps, and a safety margin, ensuring robustness against various workload configurations.
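The CUDA-graph-safe debug check described in the second highlight might look like the sketch below. The function name and `debug` flag are assumptions for illustration; the key point is that the device-to-host sync implied by `.item()` must be skipped while a CUDA graph is being captured:

```python
import torch


def check_positions(positions: torch.Tensor, cache_len: int,
                    debug: bool = False) -> torch.Tensor:
    # Kernels index cos_sin_cache with int64 positions.
    positions = positions.to(torch.int64)
    # positions.max().item() forces a device->host synchronization, which is
    # illegal during CUDA graph capture, so the bounds check only runs when
    # we are not capturing (and trivially on CPU tensors).
    if debug and not (positions.is_cuda
                      and torch.cuda.is_current_stream_capturing()):
        max_pos = int(positions.max().item())
        assert max_pos < cache_len, (
            f"position id {max_pos} >= cos_sin_cache length {cache_len}")
    return positions
```

Gating the check this way keeps the hot path allocation-free and sync-free under graph capture, while still catching out-of-bounds position ids in debug runs.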

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust mechanism for dynamically expanding the RoPE cache, effectively preventing out-of-bounds errors during long-sequence workloads and speculative decoding. The core changes, including thread-safe cache expansion, pre-allocation at model load, and CUDA-graph-safe debug checks, are well-implemented. My review focuses on improving configurability, design, and robustness. I've identified a hardcoded length limit that could undermine the pre-allocation goal, a design choice for locking that could be improved for better concurrency, a magic number that should be a constant for clarity, and an overly broad exception catch that could mask bugs.

@YAMY1234 YAMY1234 force-pushed the fix-rope-illegal-memory-access-10713 branch from 0922bc3 to 8d26dee on September 23, 2025 18:28
@JustinTong0323 JustinTong0323 added the ready-to-merge The PR is ready to merge after the CI is green. label Sep 25, 2025
@hnyls2002 hnyls2002 merged commit 80407b0 into sgl-project:main Oct 19, 2025
5 of 44 checks passed
YAMY1234 added a commit to YAMY1234/sglang that referenced this pull request Oct 20, 2025
