[Core] Avoid seq_lens_cpu GPU->CPU sync #40654
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Code Review
This pull request introduces seq_lens_cpu_upper_bound to CommonAttentionMetadata to provide a CPU-side upper bound for sequence lengths, aiming to eliminate blocking GPU-to-CPU synchronizations during model execution, particularly in speculative decoding scenarios. The field is integrated across various attention backends and model runners to facilitate kernel dispatch and workspace sizing using optimistic bounds. A review comment identifies a critical issue in vllm/v1/spec_decode/eagle.py, where subtracting num_rejected_tokens from the CPU upper bound could trigger a synchronization or lead to device mismatches, potentially undermining the performance benefits of the change.
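The idea summarized above can be illustrated with a minimal, pure-Python sketch. It is not vLLM's implementation: `SeqLenBounds`, `advance`, and `workspace_blocks` are hypothetical names introduced here, and the sketch assumes the bound is advanced by the number of scheduled tokens per request without ever subtracting rejections, since the rejection counts are only known on the GPU.

```python
from dataclasses import dataclass


@dataclass
class SeqLenBounds:
    """Hypothetical CPU-side bookkeeping; not vLLM's actual data structure."""

    upper_bound: list[int]  # one optimistic sequence length per request

    def advance(self, num_scheduled_tokens: list[int]) -> None:
        # Optimistically assume every scheduled token (draft tokens included)
        # is accepted. Rejection counts live on the GPU, so they are not
        # subtracted here; reading them would force a device-to-host sync.
        self.upper_bound = [
            bound + n for bound, n in zip(self.upper_bound, num_scheduled_tokens)
        ]

    def max_seq_len(self) -> int:
        # Safe input for kernel dispatch decisions that need a maximum length.
        return max(self.upper_bound, default=0)


def workspace_blocks(bounds: SeqLenBounds, block_size: int) -> int:
    """Size a workspace from the bound alone (ceiling division per request)."""
    return sum(-(-bound // block_size) for bound in bounds.upper_bound)


# Two requests with 10 and 7 tokens so far, each scheduled for 4 more tokens.
bounds = SeqLenBounds(upper_bound=[10, 7])
bounds.advance([4, 4])

# Suppose the GPU later rejects 2 draft tokens for the first request.
true_seq_lens = [14 - 2, 11 - 0]

# The bound may overshoot, but it never undershoots the true lengths.
assert all(t <= b for t, b in zip(true_seq_lens, bounds.upper_bound))
```

Under these assumptions, anything sized from the bound is at least as large as what the true lengths require, which is why an over-estimate is acceptable for workspace sizing and dispatch. The review concern quoted above is about the opposite direction: tightening the bound with `num_rejected_tokens` would mix a device-side value into a CPU-side field.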
@njhill we already have something similar here in V1 with
Why is it modifying the vLLM V1 specdec implementation? Is this change not intended to be isolated to MRV2?
@benchislett the primary motivation is eliminating the current prefill CPU sync for DS3.2. This generalizes the v1-specific optimistic_seq_lens_cpu via an explicit field in
Also aiming to avoid use of
@LucasWilkinson @MatthewBonanni for review |
# Conflicts: # vllm/v1/spec_decode/eagle.py Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Adrian <info@zzit.ch>
Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: hongbolv <33214277+hongbolv@users.noreply.github.com>
With help from claude