
[MRV2] Extensible CG dispatch rework#36541

Closed
WoosukKwon wants to merge 37 commits into main from lwilkinson/mrv2-cg-dispatch

Conversation

@WoosukKwon (Collaborator)
Forked from #35959 since the PR doesn't allow editing 😅

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…hanics

- CudaGraphManager.capture() now handles iteration, warmup, and capture
- Subclasses provide callbacks that set up forward context
- Move EagleCudaGraphManager to spec_decode/eagle/cudagraph.py
- Rename forward_fn to generate_fn for Eagle
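
The reworked control flow described above can be sketched roughly as follows. This is a hypothetical illustration, not vLLM's actual API: class and method names (`CudaGraphManagerSketch`, `make_generate_fn`, the warmup count) are invented to show the shape of the callback-based design, and plain Python callables stand in for real CUDA graph capture.

```python
# Sketch: the base manager owns the iteration/warmup/capture loop; subclasses
# only supply a callback that sets up the forward context and returns the
# function to capture. All names here are illustrative.

class CudaGraphManagerSketch:
    def __init__(self, batch_sizes, num_warmup=2):
        self.batch_sizes = batch_sizes
        self.num_warmup = num_warmup
        self.graphs = {}  # batch_size -> captured callable (stand-in for a CUDA graph)

    def capture(self, make_forward_fn):
        # Base class drives the loop; subclass logic lives in make_forward_fn.
        for bs in sorted(self.batch_sizes, reverse=True):
            forward_fn = make_forward_fn(bs)   # subclass sets up forward context here
            for _ in range(self.num_warmup):   # warmup runs precede capture
                forward_fn()
            self.graphs[bs] = forward_fn       # real code would capture into a CUDA graph

    def replay(self, batch_size):
        return self.graphs[batch_size]()


class EagleManagerSketch(CudaGraphManagerSketch):
    def make_generate_fn(self, batch_size):
        # Eagle renames forward_fn to generate_fn; context setup would go here.
        def generate_fn():
            return ("draft", batch_size)
        return generate_fn


mgr = EagleManagerSketch(batch_sizes=[1, 2, 4])
mgr.capture(mgr.make_generate_fn)
print(mgr.replay(2))  # -> ('draft', 2)
```

The point of the split is that warmup/capture mechanics are written once in the base class, while Eagle and the main model differ only in the callback they hand in.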

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
- Remove PIECEWISE handling from EagleCudaGraphManager
- Move set_forward_context into run_model (like main)
- generate_draft takes params directly instead of using forward context
- Remove unused instance variables from CudaGraphManager
- Remove dead code in _init_candidates
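
The "params directly instead of forward context" change can be illustrated with a toy example. Everything here is hypothetical (the real `generate_draft` signature is not shown in this page); the sketch only contrasts ambient-context reads with explicit parameters.

```python
# Illustrative only: explicit parameters make the data dependency visible and
# remove reliance on hidden global state set up elsewhere.

_FORWARD_CONTEXT = {}  # stand-in for the old implicit forward-context global

def generate_draft_old():
    # Old style: reaches into module-level context populated by the caller.
    return _FORWARD_CONTEXT["num_tokens"] * 2

def generate_draft(num_tokens):
    # New style: the caller supplies the parameter directly.
    return num_tokens * 2

_FORWARD_CONTEXT["num_tokens"] = 8
print(generate_draft_old() == generate_draft(8))  # -> True
```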

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
LucasWilkinson and others added 7 commits March 6, 2026 00:01
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
The base CudaGraphManager uses a shared global pool for all cudagraphs.
When Eagle and the main model share the same pool, their internal
allocations (e.g., gumbel_sample temporaries like local_argmax/local_max)
can overlap in memory. This causes memory corruption during cudagraph
replay, leading to incorrect draft token sampling and broken verification.

Symptoms:
- Abnormally high acceptance rates (e.g., 76% instead of 62% at pos0)
- Low accuracy (46% instead of 78% on GSM8K)
- GPU-specific (appeared on H200 but not B200 due to allocation patterns)

Fix: Create a dedicated cudagraph pool for Eagle, matching main branch
behavior where each cudagraph manager has its own pool.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
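
The aliasing hazard in the commit message above can be modeled in miniature. This is not real CUDA code — it is a pure-Python caricature of a graph memory pool whose freed blocks are reused by the next capture in the same pool, which is the mechanism by which two graphs' temporaries can end up overlapping.

```python
# Toy model of the shared-pool hazard: temporaries freed at the end of one
# graph's capture are handed out again to the next graph captured in the same
# pool, so the two graphs alias the same memory on replay.

import itertools

_pool_ids = itertools.count()

class Pool:
    def __init__(self):
        self.id = next(_pool_ids)
        self.free_blocks = []

    def alloc(self):
        # Reuse a freed block if one exists in this pool (this is the hazard).
        return self.free_blocks.pop() if self.free_blocks else (self.id, object())

    def free(self, block):
        self.free_blocks.append(block)

def capture_graph(pool):
    tmp = pool.alloc()   # temporary used during the graph's captured work
    pool.free(tmp)       # freed at capture end; the block returns to the pool
    return tmp           # but the captured graph still touches tmp on replay

shared = Pool()
g1 = capture_graph(shared)
g2 = capture_graph(shared)
print(g1 is g2)  # True: both graphs alias the same block -> corruption on replay

eagle_pool, main_pool = Pool(), Pool()
print(capture_graph(main_pool) is capture_graph(eagle_pool))  # False: no aliasing
```

This is why giving Eagle its own pool (as the fix does) removes the corruption: blocks freed during the main model's capture can never be recycled into Eagle's graphs.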
@gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request introduces a significant and well-executed refactoring of the CUDA graph dispatch logic. The core of the changes is in vllm/v1/worker/gpu/cudagraph_utils.py, where the CudaGraphManager is reworked to be more extensible and robust.

Key improvements include:

  • Introduction of BatchExecutionDescriptor to encapsulate batch shape information for CUDA graph matching.
  • A new dispatch mechanism that uses pre-computed candidate graphs for different batch sizes, making the dispatch logic cleaner and more efficient.
  • Refactoring of the graph capture logic into a generic capture method that takes a factory function for the forward pass, decoupling the graph manager from model-specific details.
  • Introduction of ModelCudaGraphManager to handle model-specific aspects like hidden states, improving separation of concerns.
  • Updates to data parallelism utilities (dp_utils.py) to synchronize BatchExecutionDescriptor across ranks.
  • Consistent handling of padding for tokens and requests across various components, including block_table.py, model_runner.py, and model states.
  • Refactoring of EagleCudaGraphManager to inherit from the new CudaGraphManager, simplifying its implementation significantly.

Overall, these changes make the CUDA graph handling more modular, easier to understand, and more extensible for future features. The code quality is high, and the new design is a clear improvement over the previous implementation. I have not found any critical or high-severity issues in this pull request.
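
A descriptor-keyed dispatch of the kind the review describes could look roughly like this. The names below (`BatchExecutionDescriptor` fields, `Dispatcher`) are assumptions for illustration, not the actual vLLM implementation: the sketch shows a hashable descriptor for batch shape plus padding up to the nearest pre-captured candidate size.

```python
# Hedged sketch: dispatch pads the incoming batch up to the smallest
# pre-captured graph size that fits, falling back to eager (None) otherwise.

from bisect import bisect_left
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchExecutionDescriptor:
    num_tokens: int
    uniform_decode: bool  # e.g. whether every request contributes one token

class Dispatcher:
    def __init__(self, candidate_sizes):
        self.sizes = sorted(candidate_sizes)  # pre-captured graph batch sizes

    def dispatch(self, desc):
        i = bisect_left(self.sizes, desc.num_tokens)
        return self.sizes[i] if i < len(self.sizes) else None

d = Dispatcher([1, 2, 4, 8])
print(d.dispatch(BatchExecutionDescriptor(3, True)))  # -> 4
print(d.dispatch(BatchExecutionDescriptor(9, True)))  # -> None
```

Making the descriptor frozen (hashable) is what lets pre-computed candidates be keyed and compared cheaply at dispatch time, and a serializable descriptor is also what the dp_utils change would synchronize across data-parallel ranks.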

@WoosukKwon WoosukKwon closed this Mar 9, 2026
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Mar 9, 2026
@WoosukKwon WoosukKwon deleted the lwilkinson/mrv2-cg-dispatch branch March 9, 2026 20:59
2 participants