
Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads#10788

Merged
hnyls2002 merged 16 commits into sgl-project:main from YAMY1234:fix-rope-illegal-memory-access-10713
Oct 19, 2025

Conversation


@YAMY1234 YAMY1234 commented Sep 23, 2025


Background

Running SGLang with EAGLE speculative decoding and long sequences occasionally triggers

RuntimeError: BatchQKApplyRotaryPosIdsCosSinCache failed with error code an illegal memory access was encountered

Root cause analysis shows that the crash occurs whenever a generated position_id exceeds the length of the RoPE cos_sin_cache tensor, which is initialized to max_position_embeddings.

  • A mismatch between the target and draft model’s max_position_embeddings increases the likelihood, but the fundamental issue is that the cache can be too short for the positions actually reached at runtime, especially with speculative multi-step decoding and batching.

What This PR Does

  • Thread-Safe Dynamic Expansion

    • Adds _ensure_cos_sin_cache_length() to RotaryEmbedding for on-demand, thread-safe cache growth.
  • Device-Aware Initialization

    • Builds the initial cos_sin_cache directly on the current CUDA device, avoiding CPU-to-GPU copy overhead.
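A minimal sketch of how such an expansion helper can work. The class layout, helper names, and 128-token chunk size below are illustrative assumptions, not the exact PR code; the double-checked locking pattern is what makes growth safe under concurrent callers:

```python
import threading
import torch

_CHUNK = 128  # assumed alignment: grow in 128-position chunks to limit reallocations


class RotaryEmbedding:
    def __init__(self, head_dim: int, base: float, max_positions: int,
                 device: str = "cpu"):
        self.head_dim = head_dim
        self.base = base
        self._lock = threading.Lock()
        self.cos_sin_cache = self._compute_cos_sin_cache(max_positions, device)

    def _compute_cos_sin_cache(self, num_positions: int, device) -> torch.Tensor:
        # Standard RoPE frequencies; build directly on the target device.
        inv_freq = 1.0 / (self.base ** (
            torch.arange(0, self.head_dim, 2, dtype=torch.float32, device=device)
            / self.head_dim))
        t = torch.arange(num_positions, dtype=torch.float32, device=device)
        freqs = torch.outer(t, inv_freq)
        return torch.cat([freqs.cos(), freqs.sin()], dim=-1)

    def _ensure_cos_sin_cache_length(self, needed_len: int) -> None:
        # Fast path: no lock when the cache is already long enough.
        if self.cos_sin_cache.shape[0] >= needed_len:
            return
        with self._lock:
            if self.cos_sin_cache.shape[0] >= needed_len:
                return  # another thread grew it first
            # Round up to the chunk size so growth stays infrequent.
            new_len = ((needed_len + _CHUNK - 1) // _CHUNK) * _CHUNK
            self.cos_sin_cache = self._compute_cos_sin_cache(
                new_len, self.cos_sin_cache.device)
```

The check-lock-recheck ordering means the common case (cache already big enough) pays no locking cost, while concurrent growers never shrink or double-allocate the cache.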

Expansion Policy

The recommended pre-expansion formula (handled in the model runner before graph capture) is:

reserve_len = max(target_max_pos, draft_max_pos)
            + speculative_num_steps * speculative_num_draft_tokens
            + safety_margin  # (~1k)

This covers:

  • Target/draft model declared maxima,
  • Multi-step speculative decoding,
  • A configurable buffer for future changes.
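The formula above can be written as a small helper. The function name and default margin here are illustrative assumptions; the actual config field names in the model runner may differ:

```python
def compute_reserve_len(target_max_pos: int, draft_max_pos: int,
                        speculative_num_steps: int,
                        speculative_num_draft_tokens: int,
                        safety_margin: int = 1024) -> int:
    # Larger of the two declared maxima, plus the extra positions that
    # multi-step speculative decoding can generate, plus a ~1k buffer.
    return (max(target_max_pos, draft_max_pos)
            + speculative_num_steps * speculative_num_draft_tokens
            + safety_margin)


# Example: target 32768, draft 4096, 5 steps x 8 draft tokens, 1k margin
# -> 32768 + 40 + 1024 = 33832
```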

Results & Verification

  • Stress-tested with 1,200+ input tokens, speculative_num_steps=5, and high concurrency.
  • No illegal memory access observed.
  • Single, one-time memory allocation; no runtime re-allocations during inference.
  • Negligible performance impact: graph capture proceeds without synchronization penalties.

Why This Matters

Without this fix, any under-provisioned RoPE cache breaks under long or speculative workloads.
This PR makes SGLang robust regardless of model mismatch and future-proofs it against larger context windows.


In short:

Before: RoPE cache length fixed at model config → runtime overflow possible.
After: RoPE cache auto-expands safely and transparently → no illegal memory access.


Closes: #10713 (and related long-sequence RoPE overflow issues)


@gemini-code-assist

Summary of Changes


This pull request addresses a critical "RuntimeError" in SGLang's EAGLE speculative decoding, which occurred when "position_id" exceeded the fixed-size RoPE "cos_sin_cache". The core solution involves making the RoPE cache dynamically expandable and thread-safe, ensuring it can grow to accommodate longer sequences and speculative decoding steps without crashing. This change significantly enhances the robustness of SGLang, making it resilient to model "max_position_embeddings" mismatches and future-proofing it for larger context windows.

Highlights

  • Dynamic RoPE Cache Expansion: Implemented a thread-safe mechanism to dynamically expand the RoPE "cos_sin_cache" on demand, preventing out-of-bounds errors during long-sequence and speculative decoding workloads. Cache growth is aligned to 128 tokens to optimize reallocations.
  • CUDA-Graph Safe Position Handling: Enhanced "forward_cuda" to ensure "positions" are "int64" and introduced optional debug assertions for position ID bounds checking, which are active only outside CUDA Graph capture to avoid host synchronization issues.
  • Device-Aware Cache Initialization: Modified the "_compute_cos_sin_cache" method to initialize the cache directly on the current CUDA device, eliminating CPU-GPU copy overhead during setup.
  • Proactive Cache Pre-expansion: Added a pre-expansion logic in the model runner to reserve sufficient RoPE cache length before CUDA Graph capture, accounting for base context, speculative decoding steps, and a safety margin, ensuring robustness against various workload configurations.
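The CUDA-graph-safe debug check described in the second highlight might look like the sketch below. The function name and `debug` flag are assumptions for illustration; the key point is that the device-to-host sync implied by `.item()` must be skipped while a CUDA graph is being captured:

```python
import torch


def check_positions(positions: torch.Tensor, cache_len: int,
                    debug: bool = False) -> torch.Tensor:
    # Kernels index cos_sin_cache with int64 positions.
    positions = positions.to(torch.int64)
    # positions.max().item() forces a device->host synchronization, which is
    # illegal during CUDA graph capture, so the bounds check only runs when
    # we are not capturing (and trivially on CPU tensors).
    if debug and not (positions.is_cuda
                      and torch.cuda.is_current_stream_capturing()):
        max_pos = int(positions.max().item())
        assert max_pos < cache_len, (
            f"position id {max_pos} >= cos_sin_cache length {cache_len}")
    return positions
```

Gating the check this way keeps the hot path allocation-free and sync-free under graph capture, while still catching out-of-bounds position ids in debug runs.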

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a robust mechanism for dynamically expanding the RoPE cache, effectively preventing out-of-bounds errors during long-sequence workloads and speculative decoding. The core changes, including thread-safe cache expansion, pre-allocation at model load, and CUDA-graph-safe debug checks, are well-implemented. My review focuses on improving configurability, design, and robustness. I've identified a hardcoded length limit that could undermine the pre-allocation goal, a design choice for locking that could be improved for better concurrency, a magic number that should be a constant for clarity, and an overly broad exception catch that could mask bugs.

@YAMY1234 YAMY1234 force-pushed the fix-rope-illegal-memory-access-10713 branch from 0922bc3 to 8d26dee on September 23, 2025 18:28
@JustinTong0323 JustinTong0323 added the ready-to-merge The PR is ready to merge after the CI is green. label Sep 25, 2025
@hnyls2002 hnyls2002 merged commit 80407b0 into sgl-project:main Oct 19, 2025
5 of 44 checks passed
YAMY1234 added a commit to YAMY1234/sglang that referenced this pull request Oct 20, 2025
