fix(worker): optimize swap_states to copy only active token prefixes (#34733)
njhill merged 6 commits into vllm-project:main
Conversation
Signed-off-by: Philip Ottesen <phiott256@gmail.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Code Review
The pull request successfully optimizes the swap_states and condense methods in InputBatch by limiting the data movement to the active token prefix of each request. This change significantly reduces the overhead of reordering requests in the batch, especially when the maximum model length is large. The introduction of the _get_active_token_count helper method centralizes the logic for determining the active range of tokens, including speculative tokens. The performance benchmarks provided in the description confirm a substantial reduction in execution time for these operations. The implementation is correct and maintains consistency with the existing metadata management.
LucasWilkinson
left a comment
Thanks for the contribution! Running CI.
@LucasWilkinson Looks like we had some intermittent CI failures. Pulling in
…llm-project#34733) Signed-off-by: Philip Ottesen <phiott256@gmail.com> Signed-off-by: Ifta Khairul Alam Adil <ikaadil007@gmail.com>
…llm-project#34733) Signed-off-by: Philip Ottesen <phiott256@gmail.com>
…llm-project#34733) Signed-off-by: Philip Ottesen <phiott256@gmail.com> Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
…llm-project#34733) Signed-off-by: Philip Ottesen <phiott256@gmail.com> Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Purpose
This PR optimizes `InputBatch.swap_states()` in `vllm/v1/worker/gpu_input_batch.py` by swapping only the active token prefix instead of full `max_model_len` rows.

Fixes #34731.
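The same prefix-limiting applies to `condense()`, which compacts the batch after requests finish. The sketch below is hypothetical — the function signature and bookkeeping names are assumptions, not vLLM's exact code — but it shows the shape of the optimization: when the last occupied row is moved into a freed slot, only its active prefix needs to be copied.

```python
import numpy as np

# Hypothetical sketch of a prefix-limited condense step; names are
# illustrative assumptions, not vLLM's exact implementation.
def condense(token_ids_cpu: np.ndarray, num_tokens: list,
             empty_index: int, last_index: int) -> None:
    # Move the last occupied request into the freed slot, copying only
    # its active token prefix rather than the full max_model_len row.
    n = num_tokens[last_index]
    token_ids_cpu[empty_index, :n] = token_ids_cpu[last_index, :n]
    num_tokens[empty_index] = num_tokens[last_index]

buf = np.zeros((4, 1024), dtype=np.int64)
counts = [10, 0, 0, 5]
buf[3, :5] = 7
condense(buf, counts, 1, 3)  # request at slot 3 fills the hole at slot 1
print(counts[1], int(buf[1, 0]))  # → 5 7
```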
Changes
- Added `_get_active_token_count` for `i1` and `i2`
- `token_ids_cpu[..., :max_active_token_count]`
- `is_token_ids[..., :max_active_token_count]`

Test Plan
Test Result
Performance
Benchmark script
Seeing ~25ms saved with this limited benchmarking script
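The PR's benchmark script is collapsed above and not reproduced here; a minimal timing sketch of the effect being measured (illustrative only — buffer size and prefix length are assumptions, not the PR's actual parameters) might look like:

```python
import time
import numpy as np

# Illustrative micro-benchmark, not the PR's elided script: compare
# swapping full max_model_len rows against swapping only a short
# active prefix.
max_model_len = 262144
buf = np.random.randint(0, 32000, size=(2, max_model_len), dtype=np.int64)
active = 256  # assumed active prefix length

def time_swaps(n: int, iters: int = 1000) -> float:
    start = time.perf_counter()
    for _ in range(iters):
        tmp = buf[0, :n].copy()
        buf[0, :n] = buf[1, :n]
        buf[1, :n] = tmp
    return time.perf_counter() - start

full = time_swaps(max_model_len)
prefix = time_swaps(active)
print(f"full-row: {full:.4f}s  prefix-only: {prefix:.4f}s")
```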
lm_eval
main | This PR
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.