[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint#6475

Merged
yiz-liu merged 5 commits into vllm-project:main from yiz-liu:fix-full
Feb 4, 2026
Conversation

@yiz-liu yiz-liu commented Feb 2, 2026

What this PR does / why we need it?

This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit 5b0a6bc)" and fixes a check in model_runner_v1.

A key change is that we remove the strict assertion added in the latest commit: it turns out MLA + PIECEWISE slices tensors during computation, which makes the assertion unnecessary and causes it to raise only false alarms.

This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates the ad-hoc padding into a single helper, and copies the updated buffer to the device. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes.

We currently place this helper in execute_model. My original design was to include it in _prepare_inputs, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment.
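As a rough illustration, the consolidated helper behaves like the sketch below. This is a hypothetical, simplified stand-in, not the actual implementation: the real helper operates on pre-allocated device tensors inside the model runner, while this version uses plain Python lists to show only the padding rule.

```python
def pad_query_start_loc(query_start_loc: list[int],
                        num_padded_tokens: int) -> list[int]:
    """Pad query_start_loc so its last entry matches the padded token count.

    FIA/TND kernels in full graph mode expect the final cumulative query
    length (actual_seq_lengths_q[-1]) to equal the padded number of tokens.
    """
    num_actual_tokens = query_start_loc[-1]
    if num_padded_tokens == num_actual_tokens:
        # Uniform batch: graph padding added no extra tokens, nothing to do.
        return query_start_loc
    # Mixed batch: attribute all padding tokens to one appended dummy request.
    return query_start_loc + [num_padded_tokens]
```

For example, a batch with cumulative starts `[0, 3, 7]` padded to 16 tokens becomes `[0, 3, 7, 16]`, so the dummy request owns tokens 7 through 15.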

Does this PR introduce any user-facing change?

None.

How was this patch tested?

Test cases added.

…IA/TND constraint (vllm-project#6459)"

This reverts commit 5b0a6bc.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Updates the condition to pad the query start location based on a mismatch
between padded and unpadded token counts, clarifying handling for
Graph Mode and SP scenarios. Improves accuracy of padding application.
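The refined condition reads roughly like this. It is a sketch under assumed flag names (`full_graph_mode`, `sequence_parallel` are placeholders; the actual flags in `model_runner_v1` may differ):

```python
def should_pad_query_start_loc(num_actual_tokens: int,
                               num_padded_tokens: int,
                               full_graph_mode: bool,
                               sequence_parallel: bool) -> bool:
    # Pad only when padding actually changed the token count AND we are in
    # a mode whose kernels consume the padded layout (full graph mode or SP).
    return (num_padded_tokens != num_actual_tokens
            and (full_graph_mode or sequence_parallel))
```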

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu marked this pull request as ready for review February 2, 2026 02:31

github-actions bot commented Feb 2, 2026

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling in the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@gemini-code-assist

Summary of Changes

Hello @yiz-liu, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses and resolves an issue where query_start_loc padding could lead to kernel mismatches or failures during FIA/TND execution in full graph modes. It centralizes the padding logic into a new helper function, adjusts buffer allocations, and adds a runtime assertion to ensure critical layout constraints are met. These changes enhance the stability and correctness of model execution in specific graph compilation modes.

Highlights

  • Refactored Padding Logic: The ad-hoc padding logic for query_start_loc has been consolidated into a new dedicated helper function, _pad_query_start_loc_for_fia, improving code organization and maintainability.
  • Enhanced query_start_loc Buffer: The query_start_loc buffer size has been increased by two to reserve extra space for padding, specifically to satisfy FIA/TND constraints in full graph modes.
  • Uniform and Mixed Batch Handling: The new padding helper function now explicitly handles both uniform and mixed batches, including inserting a dummy request for mixed batches to ensure correct behavior.
  • FIA Kernel Constraint Assertion: A new assertion has been added in _build_attention_metadata to verify that the first dimension of hidden_states equals the last element of actual_seq_lengths_q, enforcing a critical FIA kernel constraint.
  • New End-to-End Tests: New test cases (CASE_QWEN_FULL, CASE_DS_FULL) and a corresponding test function (test_full_res_consistency) have been introduced to validate the fix for full graph mode execution.
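The kernel constraint from the fourth highlight can be expressed as a one-line check. This is a hypothetical simplification of the assertion in `_build_attention_metadata`; `num_hidden_rows` stands in for `hidden_states.shape[0]`:

```python
def check_fia_layout(num_hidden_rows: int,
                     actual_seq_lengths_q: list[int]) -> None:
    # FIA requires the flattened token dimension of hidden_states to line up
    # with the last cumulative query length; a mismatch means the kernel
    # would consume a different number of tokens than the model produced.
    assert num_hidden_rows == actual_seq_lengths_q[-1], (
        f"hidden_states has {num_hidden_rows} rows, but "
        f"actual_seq_lengths_q ends at {actual_seq_lengths_q[-1]}")
```

(Note that, per the author's later comment, this strict form was ultimately removed because MLA + PIECEWISE slices tensors during computation.)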



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses a constraint for FIA/TND execution in full graph modes by padding query_start_loc. The changes are well-structured, centralizing the padding logic into a new _pad_query_start_loc_for_fia helper function and adding an assertion to verify the constraint. The PR also includes new test cases for this functionality. I've found one potential issue in _dummy_run where an incorrect parameter is passed during attention metadata creation, which could affect graph capture. I've also provided suggestions for the PR title and summary to align with the repository's style guide.

Suggested PR Title:

[ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint

Suggested PR Summary:

### What this PR does / why we need it?
This PR reverts "[ModelRunner] Revert [Fix] Pads query_start_loc to satisfy FIA/TND constraint #6459 (commit 5b0a6bcfe9eca595bbcd064363596553b6bbd1fe)" and fixes a check in `model_runner_v1`.

This handles both uniform and mixed batches (by inserting a dummy request for mixed batches), consolidates ad-hoc padding into a single helper, copies the updated buffer to the device, and asserts the layout constraint before building the attention metadata. Together, these changes prevent kernel mismatches or failures and ensure correct shapes for FIA/TND execution in full graph modes.

We currently place this helper in `execute_model`. My original design was to include it in `_prepare_inputs`, but that doesn’t work because it must run after padding. While I’d prefer to minimize the impact and reuse as much of the base class as possible in the future, it doesn’t seem achievable at the moment.

### Does this PR introduce _any_ user-facing change?
None.

### How was this patch tested?
Test cases added.

@yiz-liu yiz-liu changed the title [Fix] Pads query_start_loc to satisfy FIA/TND constraint [ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint Feb 2, 2026
Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu added and then removed the ready (read for review) and ready-for-test (start test by label for PR) labels Feb 2, 2026
Restrict the padding of query start locations to specific scenarios, such as Graph Mode and Sequence Parallelism, to prevent unpredictable side effects from an overly broad condition.

Additionally, remove a strict assertion related to kernel constraints to accommodate the more targeted padding logic.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu requested a review from zzzzwwjj February 3, 2026 03:21
@yiz-liu yiz-liu added the ready (read for review) and ready-for-test (start test by label for PR) labels Feb 3, 2026

yiz-liu commented Feb 3, 2026

After discussion, we may have to create an RFC for this, as it's vital for other features.

@yiz-liu yiz-liu removed the ready (read for review) and ready-for-test (start test by label for PR) labels Feb 4, 2026
Restricts sequence parallelism logic to scenarios where Multi-head Latent Attention is disabled and context parallelism is not in use. This avoids unsupported padding or graph execution paths for these specific model configurations.

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
@yiz-liu yiz-liu merged commit 2ee4f23 into vllm-project:main Feb 4, 2026
16 checks passed
845473182 pushed a commit to 845473182/vllm-ascend that referenced this pull request Feb 6, 2026
…to qwen3next_rebase

* 'main' of https://github.com/vllm-project/vllm-ascend: (59 commits)
  [Feat.]: 310p support MOE models (vllm-project#6530)
  [Doc] backport 0.13.0 release note (vllm-project#6584)
  [CI] Update UT CANN version to 8.5.0 for main branch (vllm-project#6564)
  [CI] Change A2 runner (vllm-project#6557)
  [Bugfix] Fix the incorrect use of the output parameter in _forward_fia_slidingwindow (vllm-project#6469)
  [main2main] upgrade vllm main 0202 (vllm-project#6560)
  [CI][npugraph_ex]Fix npugraph ex e2e test (vllm-project#6553)
  [Feature]KV pool supports sparse attention (vllm-project#6339)
  [bugfix]Fix accuracy issue in PCP/DCP with speculative decoding (vllm-project#6491)
  perf: adaptive block size selection in linear_persistent kernel (vllm-project#6537)
  [ModelRunner][Fix] Pads query_start_loc to satisfy FIA/TND constraint (vllm-project#6475)
  [Bugfix]Fix of Pooling Code and Update of Pooling Usage Guide (vllm-project#6126)
  [Fusion] Add rmsnorm dynamic quant fusion pass (vllm-project#6274)
  [Bugfix] Synchronize only the current stream to avoid device sync (vllm-project#6432)
  [CI] Add long and short prompt tests for DeepSeek-V3.2 (vllm-project#6499)
  [Refactor] MLP weight prefetch to consistency with MoE Model's prefetching in terms of code and usage (vllm-project#6442)
  [bugfix][npugraph_ex]duplicate pattern issue (vllm-project#6513)
  [bugfix][npugraph_ex]add the extra check for allreduce rmsnorm fusion pass (vllm-project#6430)
  [Quant] GLM4.7-Flash Support W8A8 (vllm-project#6492)
  [Nightly][BugFix] Remove kv_cache nz test case for test_mla_preprocess_nq.py (vllm-project#6505)
  ...
@yiz-liu yiz-liu deleted the fix-full branch February 11, 2026 02:22
chenchuw886 pushed a commit to chenchuw886/vllm-ascend that referenced this pull request Feb 12, 2026
…vllm-project#6475)


- vLLM version: v0.14.1
- vLLM main:
vllm-project/vllm@dc917cc

---------

Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: momochenchuw <chenchuw@huawei.com>
@wangxiyuan wangxiyuan mentioned this pull request Feb 24, 2026
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Feb 28, 2026
…vllm-project#6475)


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
maoxx241 pushed a commit to maoxx241/vllm-ascend that referenced this pull request Mar 2, 2026
…vllm-project#6475)


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
ZRJ026 pushed a commit to ZRJ026/vllm-ascend that referenced this pull request Mar 4, 2026
…vllm-project#6475)


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
Signed-off-by: zrj026 <zhangrunjiang026@gmail.com>
LCAIZJ pushed a commit to LCAIZJ/vllm-ascend that referenced this pull request Mar 7, 2026
…vllm-project#6475)


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>
jiangyunfan1 pushed a commit to jiangyunfan1/vllm-ascend that referenced this pull request Apr 9, 2026
…vllm-project#6475)


Signed-off-by: Yizhou Liu <liu_yizhou@outlook.com>