[HMA] Move hybrid blk_size to update_block_size_for_backend to fix "attn supported block size is not 16" issue #37467
Conversation
Code Review
This pull request updates the KV cache page size unification logic to handle models with non-integral page size ratios between layers. It replaces the use of max() with math.lcm() to determine the unified page size, which is a more robust approach for this scenario. The changes also correctly propagate this scaling to page_size_padded if it is present in the cache specification. While this change is correct, I've identified a potential issue where the calculated LCM could become excessively large, leading to high memory consumption. I've added a high-severity comment with a suggestion to add a safeguard. An unrelated change for XPU support is also included in vllm/model_executor/layers/fused_moe/layer.py.
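The `max()` → `math.lcm()` change described above can be sketched roughly as follows (a minimal illustration with a hypothetical helper name, not the actual vLLM code):

```python
import math

# Minimal sketch of the unification change described above (hypothetical
# helper, not the actual vLLM code): the unified KV cache page size becomes
# the LCM of all layers' page sizes instead of the max, so every layer's
# page size divides the unified size evenly.
def unify_page_size(page_sizes: list[int]) -> int:
    return math.lcm(*page_sizes)

# With max(), a non-integral ratio breaks divisibility:
# 540672 / 32768 = 16.5, but lcm(540672, 32768) = 1081344 works for both.
unified = unify_page_size([540672, 32768])
assert unified % 540672 == 0 and unified % 32768 == 0
```

This also shows the reviewer's concern: the LCM of page sizes with a large coprime factor can grow much bigger than either input, hence the suggested safeguard.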
Hi @xuechendi, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch.
There is another fix at #37429, and I think that one covers more cases?
NickLucche left a comment
2 things off the top of my head
- lm_eval scores are lower than they should be for this model (iirc we're at 80+ for nemotronH @ZhanqiuHu )
- what's the setup that is leading to this assert? I don't recall seeing this issue before.
In particular
has two page_size for linear_attn and full_attn
this is only true when kernel_block_size is required, and then again in that case the page_size is divided by the logical/physical ratio (duplicating num_blocks accordingly).
Let's look deeper into the cause of this.
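The logical/physical split mentioned above can be illustrated with a small sketch (illustrative numbers only, not the actual vLLM code):

```python
# Sketch of the logical/physical block split described above: when the
# kernel requires a smaller block_size, the page size is divided by the
# logical/physical ratio and num_blocks is multiplied by the same ratio,
# so the total KV cache size is unchanged. Numbers are illustrative only.
logical_block_size = 64
kernel_block_size = 16
ratio = logical_block_size // kernel_block_size  # 4

page_size, num_blocks = 32768, 1000
physical_page_size = page_size // ratio   # page size shrinks by the ratio
physical_num_blocks = num_blocks * ratio  # block count grows by the ratio

# Total bytes covered by the cache are preserved by the split.
assert physical_page_size * physical_num_blocks == page_size * num_blocks
```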
Oh, I am testing with an Intel GPU. Just tested for accuracy; I am following the config here: https://github.com/vllm-project/vllm/blob/main/.buildkite/lm-eval-harness/configs/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml. If using the full set (no limit) and without multiturn, strict-match acc gets to 0.837. Wondering if you're testing with lm-eval-harness/configs/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16.yaml or using a different gsm8k setting?
Yes, without
@xuechendi btw was this function called: |
Any suggestion?
something like
@xuechendi I found that on the XPU platform, the final block size is 64-divisible when using
I did the same test for Triton, and since Triton should not limit block_size to 64, the logic here sets it to 16. Not sure why you're seeing misalignment; did you set triton_attn block_size=64 somewhere in your code?
TRITON_ATTN has no requirement that the block size be divisible by 64, but GDN attention computes on the KV cache, so it still needs a block size divisible by 64.
Can you share more details? Is that specific to the current xpu_kv_cache impl? If it only affects GDN, maybe we can fix it in the Qwen3.5 PR, since I am not able to test it right now?
Since #33657 hasn't landed yet, I think this PR is fine as is for now. If GDN truly requires block size 64 for all backends on XPU, that change can be made as part of #33657.
…issue (blk_size=64,tp=4) (vllm-project#37467) Signed-off-by: Chendi Xue <chendi.xue@intel.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Chendi.Xue <chendi.xue@intel.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Signed-off-by: neweyes <328719365@qq.com>
…xtral, MoE and Granite regressions (#1311)
## Summary
This PR fixes a set of regressions introduced by recent upstream changes and observed in vLLM-Gaudi hourly validation. The branch now includes:
- Pixtral HPUAttention projection path fix,
- MoE dispatch and method override alignment updates for fused MoE and compressed tensors,
- unit test updates to match the new MoE runner API usage,
- a fix for hybrid model page size alignment for Granite 4.0-H.
## Related upstream PRs that introduced the regressions
- vllm-project/vllm#37234
- vllm-project/vllm#35153
- vllm-project/vllm#36963
- vllm-project/vllm#38960
- vllm-project/vllm#35326
- vllm-project/vllm#37467
--------
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>

Purpose
--
Issue description:
Testing nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 with default block_size = 64:
- if FA block_size = 16 w/ TP4 => PASS: 540672 / 8192 = 66
- if FA block_size = 64 w/ TP4 => ERROR!!! SSM_page_size will not be evenly divided by FA_page_size: 540672 / 32768 = 16.5
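The failing divisibility can be checked directly (numbers taken from the description above):

```python
# SSM page size vs. full-attention page size from the report above.
ssm_page_size = 540672
fa_page_size_bs16 = 8192    # attn page size with block_size = 16
fa_page_size_bs64 = 32768   # attn page size with block_size = 64 (XPU default)

assert ssm_page_size % fa_page_size_bs16 == 0   # 540672 / 8192 = 66 -> PASS
assert ssm_page_size % fa_page_size_bs64 != 0   # 540672 / 32768 = 16.5 -> ERROR
```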
--
Root Cause:
Default hybrid model block_size alignment happens before platform.check_and_update_config. The sequence is:
platform init => cache_config init => model_config init (hybrid model block_size alignment) => __post_init__ => platform.check_and_update_config
Because of this sequence, the current alignment for hybrid models is calculated based on block_size = 16, while the XPU block_size is only updated to 64 later in platform.check_and_update_config, which causes the misalignment issue.
--
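The ordering problem described under Root Cause can be simulated with a tiny sketch (all names hypothetical; this is not the actual vLLM config code):

```python
# Simulate the init order described above: hybrid alignment runs first,
# using the default block_size (16); the platform override to 64 comes later.
cfg = {"block_size": 16}

def model_config_init(cfg):
    # hybrid model block_size alignment happens here, based on whatever
    # block_size is currently set
    cfg["aligned_against"] = cfg["block_size"]

def check_and_update_config(cfg):
    # the XPU platform raises block_size to 64 only at this later step
    cfg["block_size"] = 64

model_config_init(cfg)
check_and_update_config(cfg)

# The alignment was computed against 16, but the effective block_size is 64.
assert cfg["aligned_against"] == 16 and cfg["block_size"] == 64
```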
Solution:
After discussion with reviewers in this PR, the suggestion is to update the block_size in platform.update_block_size_for_backend for XPU.
Including:
--
Test Plan - re-tested on 03/23 with the latest fix
validation after fix
case 0 - Hybrid model
case 1 - full attn model