perf(v1): optimize InputBatch.swap_states by swapping active token prefixes #35018
VedantMadane wants to merge 4 commits into vllm-project:main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of tests runs to quickly catch errors. You can ask your reviewers to trigger select CI tests on top of that subset. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the `ready` label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
Force-pushed from 7487bf0 to 4304bcc (compare)
Code Review
This pull request optimizes InputBatch.swap_states by only swapping active token prefixes, which is a great performance improvement. However, I've found a critical bug in the implementation of swap_states in vllm/v1/worker/gpu_input_batch.py. The calculation of token lengths happens after some metadata has been swapped, leading to incorrect lengths being used for slicing, which can cause data corruption or out-of-bounds access. I've provided a suggestion to fix this. The other changes in vllm/v1/attention/backends/triton_attn.py look like correct refactoring.
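A minimal sketch of the ordering issue the review describes, using hypothetical field names rather than the actual vLLM code: the prefix lengths must be read from the per-request token counts before any metadata is swapped, otherwise the slice bounds describe the already-swapped state.

```python
import numpy as np

def swap_states_prefix(token_ids_cpu: np.ndarray, num_tokens: np.ndarray,
                       i1: int, i2: int) -> None:
    """Swap only the active token prefixes of rows i1 and i2.

    Hypothetical helper illustrating the fix: capture both lengths
    BEFORE mutating any metadata, then slice with the larger length
    so both rows' active tokens are fully exchanged.
    """
    # Read lengths first. Swapping num_tokens before this point would
    # make the slice bounds wrong -- the bug flagged in the review.
    n1, n2 = int(num_tokens[i1]), int(num_tokens[i2])
    n = max(n1, n2)
    tmp = token_ids_cpu[i1, :n].copy()
    token_ids_cpu[i1, :n] = token_ids_cpu[i2, :n]
    token_ids_cpu[i2, :n] = tmp
    # Metadata is swapped only after the prefix copy.
    num_tokens[i1], num_tokens[i2] = n2, n1
```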
Hey @VedantMadane, thanks for taking a look at the issue. I already have a PR up addressing
Force-pushed from 80c0a50 to b55b1c2 (compare)
Hi @VedantMadane, the pre-commit checks have failed. Please run:

```
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch. For future commits, the installed hooks will run automatically before each commit.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 980a1c4 to 1b4c99d (compare)
Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
…efixes Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: Vedant Madane <6527493+VedantMadane@users.noreply.github.com>
Force-pushed from 1b4c99d to 37bbcbf (compare)
Addresses #34731.
Summary
This PR optimizes InputBatch.swap_states() in vllm/v1/worker/gpu_input_batch.py by only swapping the active token prefix of token_ids_cpu and is_token_ids instead of copying the entire rows.
Changes
Impact
Reduces memory copy overhead during request reordering in the V1 engine, especially beneficial when max_model_len is large (e.g., 32k or 128k).
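A rough illustration of why this helps, using assumed shapes rather than a vLLM benchmark: with max_model_len = 131072 but only a few hundred active tokens per row, a full-row swap moves hundreds of times more data than a prefix swap.

```python
import numpy as np
import timeit

max_model_len = 131_072  # assumed large context length
token_ids = np.zeros((2, max_model_len), dtype=np.int32)
active = 256  # assumed active prefix length per request

def swap_full() -> None:
    # Old behavior: copy entire rows regardless of how many
    # tokens are actually in use.
    tmp = token_ids[0].copy()
    token_ids[0] = token_ids[1]
    token_ids[1] = tmp

def swap_prefix() -> None:
    # New behavior: copy only the active token prefix.
    tmp = token_ids[0, :active].copy()
    token_ids[0, :active] = token_ids[1, :active]
    token_ids[1, :active] = tmp

full = timeit.timeit(swap_full, number=1000)
pref = timeit.timeit(swap_prefix, number=1000)
print(f"full-row: {full:.4f}s  prefix: {pref:.4f}s")
```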