…hanics
- CudaGraphManager.capture() now handles iteration, warmup, and capture
- Subclasses provide callbacks that set up forward context
- Move EagleCudaGraphManager to spec_decode/eagle/cudagraph.py
- Rename forward_fn to generate_fn for Eagle

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
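The split described above follows a template-method shape: the base manager owns the warmup/capture loop, and subclasses supply the callback. A toy sketch of that structure (invented names such as `CudaGraphManagerSketch`; this is not the PR's actual code):

```python
# Hypothetical sketch of the template-method split: the base class drives
# warmup iterations and the final "capture", while a subclass provides the
# generate_fn callback that sets up its own forward context.
from typing import Callable

class CudaGraphManagerSketch:
    def __init__(self, num_warmup_iters: int = 3) -> None:
        self.num_warmup_iters = num_warmup_iters
        self.captured: dict[int, str] = {}

    def capture(self, batch_size: int, generate_fn: Callable[[int], str]) -> None:
        """Run warmup iterations, then record the callback's work as the 'graph'."""
        for _ in range(self.num_warmup_iters):
            generate_fn(batch_size)  # warmup runs; outputs discarded
        self.captured[batch_size] = generate_fn(batch_size)  # stand-in for capture

class EagleManagerSketch(CudaGraphManagerSketch):
    def generate_fn(self, batch_size: int) -> str:
        # Subclass callback: a real Eagle manager would set up its
        # forward context here before running the draft model.
        return f"eagle_forward(bs={batch_size})"

mgr = EagleManagerSketch()
mgr.capture(4, mgr.generate_fn)
print(mgr.captured[4])  # eagle_forward(bs=4)
```

The renamed `generate_fn` callback is the only model-specific piece the base class sees, which is what decouples graph management from Eagle's details.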
- Remove PIECEWISE handling from EagleCudaGraphManager
- Move set_forward_context into run_model (like main)
- generate_draft takes params directly instead of using forward context
- Remove unused instance variables from CudaGraphManager
- Remove dead code in _init_candidates

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
The base CudaGraphManager uses a shared global pool for all cudagraphs. When Eagle and the main model share the same pool, their internal allocations (e.g., gumbel_sample temporaries like local_argmax/local_max) can overlap in memory. This causes memory corruption during cudagraph replay, leading to incorrect draft token sampling and broken verification.

Symptoms:
- Abnormally high acceptance rates (e.g., 76% instead of 62% at pos0)
- Low accuracy (46% instead of 78% on GSM8K)
- GPU-specific (appeared on H200 but not B200 due to allocation patterns)

Fix: Create a dedicated cudagraph pool for Eagle, matching main branch behavior where each cudagraph manager has its own pool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
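The aliasing failure mode can be sketched with a toy bump-allocator model of a graph pool (purely illustrative; `ToyGraphPool` and its methods are invented here and are not vLLM's or PyTorch's actual allocator):

```python
# Toy model of why one shared cudagraph pool can alias two managers'
# buffers: the allocator treats a finished capture's temporaries as
# reusable, so a second capture lands on the same addresses -- which is
# only safe if the two graphs are never both replayed.
class ToyGraphPool:
    _next_base = 0

    def __init__(self) -> None:
        # Each pool owns a disjoint 1 MiB slab of a toy address space.
        self.base = ToyGraphPool._next_base
        ToyGraphPool._next_base += 1 << 20
        self.cursor = self.base

    def capture(self, size: int) -> tuple[int, int]:
        """Reserve `size` bytes of temporaries for one graph capture."""
        start = self.cursor
        self.cursor += size
        return (start, self.cursor)

    def free_for_reuse(self) -> None:
        """Allocator marks captured temporaries reusable by the next capture."""
        self.cursor = self.base


def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    return a[0] < b[1] and b[0] < a[1]


# Shared pool: Eagle's temporaries land on the main model's addresses,
# so replaying one graph clobbers the other's buffers.
shared = ToyGraphPool()
main_buf = shared.capture(64)
shared.free_for_reuse()
eagle_buf = shared.capture(64)
print(overlaps(main_buf, eagle_buf))  # True

# Dedicated pools (the fix): ranges are disjoint by construction.
main_buf = ToyGraphPool().capture(64)
eagle_buf = ToyGraphPool().capture(64)
print(overlaps(main_buf, eagle_buf))  # False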
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Code Review
This pull request introduces a significant and well-executed refactoring of the CUDA graph dispatch logic. The core of the changes is in `vllm/v1/worker/gpu/cudagraph_utils.py`, where the `CudaGraphManager` is reworked to be more extensible and robust.
Key improvements include:
- Introduction of `BatchExecutionDescriptor` to encapsulate batch shape information for CUDA graph matching.
- A new dispatch mechanism that uses pre-computed candidate graphs for different batch sizes, making the dispatch logic cleaner and more efficient.
- Refactoring of the graph capture logic into a generic `capture` method that takes a factory function for the forward pass, decoupling the graph manager from model-specific details.
- Introduction of `ModelCudaGraphManager` to handle model-specific aspects like hidden states, improving separation of concerns.
- Updates to data parallelism utilities (`dp_utils.py`) to synchronize `BatchExecutionDescriptor` across ranks.
- Consistent handling of padding for tokens and requests across various components, including `block_table.py`, `model_runner.py`, and model states.
- Refactoring of `EagleCudaGraphManager` to inherit from the new `CudaGraphManager`, simplifying its implementation significantly.
Overall, these changes make the CUDA graph handling more modular, easier to understand, and more extensible for future features. The code quality is high, and the new design is a clear improvement over the previous implementation. I have not found any critical or high-severity issues in this pull request.
Forked from #35959 since the PR doesn't allow editing 😅