[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint#6475
yiz-liu merged 5 commits into vllm-project:main from
Conversation
…IA/TND constraint (vllm-project#6459)" This reverts commit 5b0a6bc. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Updates the condition to pad the query start location based on a mismatch between padded and unpadded token counts, clarifying handling for Graph Mode and SP scenarios. Improves accuracy of padding application. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello @yiz-liu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses and resolves an issue where `query_start_loc` could violate the FIA/TND constraint.
Code Review
This pull request addresses a constraint for FIA/TND execution in full graph modes by padding query_start_loc. The changes are well-structured, centralizing the padding logic into a new _pad_query_start_loc_for_fia helper function and adding an assertion to verify the constraint. The PR also includes new test cases for this functionality. I've found one potential issue in _dummy_run where an incorrect parameter is passed during attention metadata creation, which could affect graph capture. I've also provided suggestions for the PR title and summary to align with the repository's style guide.
Suggested PR Title:
[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint

Suggested PR Summary:
### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit 5b0a6bcfe9eca595bbcd064363596553b6bbd1fe)" and fixes a check in `model_runner_v1`.
This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes.
We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Test cases added.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Restrict the padding of query start locations to specific scenarios, such as Graph Mode and Sequence Parallelism, to prevent unpredictable side effects from an overly broad condition. Additionally, remove a strict assertion related to kernel constraints to accommodate the more targeted padding logic. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
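The narrowed condition described in this commit can be sketched as follows. This is a hedged illustration only: the flag names (`use_graph_mode`, `use_sequence_parallel`) and the function name are assumptions for clarity, not the model runner's actual attributes.

```python
def should_pad_query_start_loc(num_actual_tokens: int,
                               num_padded_tokens: int,
                               use_graph_mode: bool,
                               use_sequence_parallel: bool) -> bool:
    """Decide whether query_start_loc needs a trailing pad entry.

    Padding is applied only when the batch was actually padded (token-count
    mismatch) AND we are in a scenario that requires the full FIA/TND
    cumulative-length layout (Graph Mode or Sequence Parallelism), to avoid
    side effects from an overly broad condition.
    """
    if num_actual_tokens == num_padded_tokens:
        return False  # no padding was applied; nothing to fix up
    return use_graph_mode or use_sequence_parallel
```

This keeps eager-mode, unpadded batches on the unmodified fast path while still satisfying the kernel constraint where it matters.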
After discussion, we may have to create an RFC for this, as it's vital for other features.
Restricts sequence parallelism logic to scenarios where Multi-head Latent Attention is disabled and context parallelism is not in use. This avoids unsupported padding or graph execution paths for these specific model configurations. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…to qwen3next_rebase: merged 'main' of https://github.com/vllm-project/vllm-ascend (59 commits), including [ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (vllm-project#6475).
…vllm-project#6475)

### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint vllm-project#6459 (commit 5b0a6bc)" and fixes a check in `model_runner_v1`.

**A key change is that we remove the strict assertion in the latest commit: it turns out MLA + PIECEWISE slices during computation, making the assertion unnecessary and prone to false alarms.**

This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, and copies the updated buffer to the device, which prevents kernel mismatches or failures and ensures correct shapes for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn't seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit 5b0a6bc)" and fixes a check in
`model_runner_v1`.

A key change is that we remove the strict assertion in the latest commit: it turns out MLA + PIECEWISE slices during computation, making the assertion unnecessary and prone to false alarms.
This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, and copies the updated buffer to the device, which prevents kernel mismatches or failures and ensures correct shapes for FIA/TND execution in full graph modes.
We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn't seem achievable at the moment.

Does this PR introduce any user-facing change?
None.
How was this patch tested?
Test cases added.