Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads #10788
Conversation
Summary of Changes

Hello @YAMY1234, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request addresses a critical RuntimeError in SGLang's EAGLE speculative decoding, which occurred when a position_id exceeded the fixed-size RoPE cos_sin_cache. The core solution makes the RoPE cache dynamically expandable and thread-safe, so it can grow to accommodate longer sequences and speculative decoding steps without crashing. This change significantly enhances the robustness of SGLang, making it resilient to model max_position_embeddings mismatches and future-proofing it for larger context windows.
Code Review
This pull request introduces a robust mechanism for dynamically expanding the RoPE cache, effectively preventing out-of-bounds errors during long-sequence workloads and speculative decoding. The core changes, including thread-safe cache expansion, pre-allocation at model load, and CUDA-graph-safe debug checks, are well-implemented. My review focuses on improving configurability, design, and robustness. I've identified a hardcoded length limit that could undermine the pre-allocation goal, a design choice for locking that could be improved for better concurrency, a magic number that should be a constant for clarity, and an overly broad exception catch that could mask bugs.
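The reviewer's concurrency point about locking is commonly addressed with double-checked locking: skip the lock entirely on the fast path, and re-check the condition after acquiring it. A minimal sketch of that pattern, using a hypothetical `CacheHolder` stand-in rather than SGLang's actual `RotaryEmbedding` class:

```python
import threading


class CacheHolder:
    """Hypothetical stand-in for a RoPE cache owner; not SGLang's actual class."""

    def __init__(self, initial_len: int = 8):
        self._lock = threading.Lock()
        self.cache_len = initial_len  # stands in for len(cos_sin_cache)

    def ensure_length(self, needed: int) -> None:
        # Fast path: steady-state decoding skips the lock entirely
        # when the cache is already long enough.
        if needed <= self.cache_len:
            return
        with self._lock:
            # Re-check under the lock: another thread may have grown
            # the cache between our first check and lock acquisition.
            if needed <= self.cache_len:
                return
            # Grow with headroom (here 2x) to amortize repeated expansions.
            self.cache_len = max(needed, 2 * self.cache_len)
```

The benefit is that readers in the common case never contend on the lock; only the rare expansion path serializes.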
Title
Fix: Dynamic RoPE Cache Expansion to Prevent Position-ID Out-of-Bounds in EAGLE + Long-Sequence Workloads
Background
Running SGLang with EAGLE speculative decoding and long sequences occasionally triggers a RuntimeError during the RoPE cache lookup.

Root cause analysis shows that the crash occurs whenever a generated position_id exceeds the length of the RoPE cos_sin_cache tensor, which is initialized to max_position_embeddings. A mismatched max_position_embeddings increases the likelihood, but the fundamental issue is that the cache can be too short for real runtime positions, especially with speculative multi-step decoding and batching.

What This PR Does
Thread-Safe Dynamic Expansion
Adds _ensure_cos_sin_cache_length() to RotaryEmbedding for on-demand, thread-safe cache growth.

Device-Aware Initialization
Allocates cos_sin_cache directly on the current CUDA device to prevent CPU–GPU copy overhead.

Expansion Policy
The recommended pre-expansion formula (handled in the model runner before graph capture) covers long sequences as well as the extra positions generated by speculative decoding steps.
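The formula itself does not survive in this excerpt; the sketch below is a hypothetical sizing rule, where `recommended_rope_len` and its `safety_margin` parameter are illustrative assumptions, not the PR's actual code:

```python
def recommended_rope_len(context_len: int,
                         max_position_embeddings: int,
                         speculative_num_steps: int,
                         safety_margin: int = 64) -> int:
    """Hypothetical pre-expansion sizing for the RoPE cos/sin cache."""
    # Start from whichever limit is larger: the served context length
    # or the model's declared max_position_embeddings.
    base = max(context_len, max_position_embeddings)
    # Add headroom for speculative draft tokens that can run a few
    # positions past the last verified token, plus a small safety margin.
    return base + speculative_num_steps + safety_margin


# e.g. a 32k serving context with 5-step EAGLE drafting:
print(recommended_rope_len(32768, 4096, 5))
```

Sizing once before CUDA graph capture matters because graphs bake in tensor addresses; growing the cache mid-capture would invalidate the captured graph.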
Results & Verification
Verified under long-sequence workloads with speculative_num_steps=5 and high concurrency.

Why This Matters
Any under-provisioned RoPE cache will still break under long or speculative workloads.
This PR makes SGLang robust regardless of model mismatch and future-proofs against larger context windows.
In short: the RoPE cache now grows on demand to cover real runtime positions instead of crashing.
Closes: #10713 (and related long-sequence RoPE overflow issues)