[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint#6475
yiz-liu merged 5 commits into vllm-project:main from
Conversation
…IA/TND constraint (vllm-project#6459)" This reverts commit 5b0a6bc. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Updates the condition to pad the query start location based on a mismatch between padded and unpadded token counts, clarifying handling for Graph Mode and SP scenarios. Improves accuracy of padding application. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Summary of Changes

Hello @yiz-liu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request addresses and resolves an issue where `query_start_loc` could violate the FIA/TND constraint.
Code Review
This pull request addresses a constraint for FIA/TND execution in full graph modes by padding query_start_loc. The changes are well-structured, centralizing the padding logic into a new _pad_query_start_loc_for_fia helper function and adding an assertion to verify the constraint. The PR also includes new test cases for this functionality. I've found one potential issue in _dummy_run where an incorrect parameter is passed during attention metadata creation, which could affect graph capture. I've also provided suggestions for the PR title and summary to align with the repository's style guide.
Suggested PR Title:
[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint

Suggested PR Summary:
### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit 5b0a6bcfe9eca595bbcd064363596553b6bbd1fe)" and fixes a check in `model_runner_v1`.
This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes.
We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment.
### Does this PR introduce _any_ user-facing change?
None.
### How was this patch tested?
Test cases added.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Restrict the padding of query start locations to specific scenarios, such as Graph Mode and Sequence Parallelism, to prevent unpredictable side effects from an overly broad condition. Additionally, remove a strict assertion related to kernel constraints to accommodate the more targeted padding logic. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
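The narrowed condition described in this commit can be sketched as follows. This is a hedged illustration only: the flag names (`use_graph_mode`, `use_sequence_parallel`) and the function name are assumptions for clarity, not the model runner's actual attributes.

```python
def should_pad_query_start_loc(num_actual_tokens: int,
                               num_padded_tokens: int,
                               use_graph_mode: bool,
                               use_sequence_parallel: bool) -> bool:
    """Decide whether query_start_loc needs a trailing pad entry.

    Padding is applied only when the batch was actually padded (token-count
    mismatch) AND we are in a scenario that requires the full FIA/TND
    cumulative-length layout (Graph Mode or Sequence Parallelism), to avoid
    side effects from an overly broad condition.
    """
    if num_actual_tokens == num_padded_tokens:
        return False  # no padding was applied; nothing to fix up
    return use_graph_mode or use_sequence_parallel
```

This keeps eager-mode, unpadded batches on the unmodified fast path while still satisfying the kernel constraint where it matters.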
After discussion, we may have to create an RFC for this, as it's vital for other features.
Restricts sequence parallelism logic to scenarios where Multi-head Latent Attention is disabled and context parallelism is not in use. This avoids unsupported padding or graph execution paths for these specific model configurations. Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
…to qwen3next_rebase: merged 'main' of https://github.com/vllm-project/vllm-ascend (59 commits), including [ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (vllm-project#6475).
…vllm-project#6475)

### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint vllm-project#6459 (commit 5b0a6bc)" and fixes a check in `model_runner_v1`.

**A key change is that we remove the strict assertion in the latest commit: it turns out MLA + PIECEWISE slices during computation, making the assertion unnecessary and prone to false alarms.**

This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, and copies the updated buffer to the device, which prevents kernel mismatches or failures and ensures correct shapes for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn't seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

- vLLM version: v0.14.1
- vLLM main: vllm-project/vllm@dc917cc

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit 5b0a6bc)" and fixes a check in
`model_runner_v1`.

A key change is that we remove the strict assertion in the latest commit: it turns out MLA + PIECEWISE slices during computation, making the assertion unnecessary and prone to false alarms.
This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, and copies the updated buffer to the device, which prevents kernel mismatches or failures and ensures correct shapes for FIA/TND execution in full graph modes.
We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn't work because it must run after padding. While I'd prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn't seem achievable at the moment.

Does this PR introduce any user-facing change?
None.
How was this patch tested?
Test cases added.