[cudagraphs] Refactor cudagraph capture loop #32946

LucasWilkinson merged 2 commits into vllm-project:main from
Conversation
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Code Review
The pull request refactors the CUDA graph capture loop by centralizing the logic for deciding which graphs to capture into the CudagraphDispatcher. This significantly cleans up the capture_model method in gpu_model_runner.py. New test cases were added to verify that the dispatcher's get_capture_descs method behaves as expected. However, a critical issue was identified in how uniform_decode is determined during CUDA graph capture, which could lead to incorrect graph configurations.
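To make the review's point concrete, here is a minimal, hypothetical sketch of the idea it describes: the dispatcher, rather than the model runner, enumerates the batch descriptors to capture. The BatchDescriptor fields and the internals of get_capture_descs below are illustrative assumptions, not the actual vLLM implementation; in particular, the uniform_decode rule shown is a simplified stand-in for the logic the review flags as subtle.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BatchDescriptor:
    num_tokens: int
    uniform_decode: bool  # True when every request decodes the same number of tokens

class CudagraphDispatcher:
    """Hypothetical simplified dispatcher: owns the list of capture sizes."""

    def __init__(self, capture_sizes, max_query_len=1):
        self.capture_sizes = sorted(capture_sizes)
        self.max_query_len = max_query_len

    def get_capture_descs(self):
        # Centralized in one place: enumerate every (size, mode) the
        # runner should capture, instead of rebuilding this list ad hoc.
        descs = []
        for n in self.capture_sizes:
            # Illustrative rule only: treat the batch as uniform decode
            # when it divides evenly into max_query_len-token requests.
            uniform = n % self.max_query_len == 0
            descs.append(BatchDescriptor(num_tokens=n, uniform_decode=uniform))
        return descs
```

With a single source of truth, the capture loop and the runtime dispatch key can no longer drift apart, which is the duplication the reviewers call out below.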
ProExpertProg left a comment
Nice, didn't realize we had logic for different keys in two places
  # We skip EPLB here since we don't want to record dummy metrics
- for num_tokens, activate_lora in compilation_cases:
+ for batch_desc in batch_descriptors:
+     num_tokens = batch_desc.num_tokens
I feel like we're moving closer and closer to passing BatchDescriptor to dummy run directly...
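For reference, the shape of the refactored loop the diff above arrives at could be sketched as follows. This is a hypothetical simplification: capture_model and dummy_run here are stand-ins for the real gpu_model_runner.py methods, and the runner simply consumes whatever descriptors the dispatcher hands it.

```python
def capture_model(dispatcher, dummy_run):
    """Iterate dispatcher-provided descriptors and capture a graph per size.

    `dispatcher` must expose get_capture_descs(); `dummy_run` is a
    callable taking num_tokens (both assumptions for this sketch).
    """
    captured = []
    for batch_desc in dispatcher.get_capture_descs():
        num_tokens = batch_desc.num_tokens
        dummy_run(num_tokens)  # warm up / capture at this batch size
        captured.append(batch_desc)
    return captured
```

The comment's suggestion follows naturally from this shape: since the loop already holds a full BatchDescriptor, passing it to the dummy run directly (instead of unpacking num_tokens) would be a small further step.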
Signed-off-by: 陈建华 <1647430658@qq.com>
Refactor the cudagraph capture loop to pave the way for different PIECEWISE and FULL capture sizes and for dynamic spec-decode sizes.