
[v0.21.0] Fix accuracy issue in minimax_m2 with TP > 1 #1505

Closed
skavulya wants to merge 29 commits into vllm-project:releases/v0.21.0 from skavulya:skavulya/minimax2_accuracy

Conversation

@skavulya
Contributor

Fix the accuracy of MiniMax M2 for tensor parallel size > 1. The reduce is handled in FusedMoE after #1377, and `reduce_results=False` was dropped in #1444.

Output without this PR:

curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/mnt/weka/data/llm-d-models-pv/MiniMaxAI-MiniMax-M2.7",
"messages": [
{"role": "user", "content": [{"type": "text", "text": "Write a quick sort algorithm in python"}]}
], "max_tokens": 200
}'
{"id":"chatcmpl-8eb68aec66d7f527","object":"chat.completion","created":1778891236,"prompt_routed_experts":null,"model":"/mnt/weka/data/llm-d-models-pv/MiniMaxAI-MiniMax-M2.7","choices":[{"index":0,"message":{"role":"assistant","content":"I hadnet me find a programme2/apto/c- 241?.o. no (the operation.yb-b\n> ыйо, not change this;~~ I think_colour =="light pink";}) in...\n**The These must be not} was\n and \n\n):\n\nI('key=ельблиматš micrac / 1)2rasm_0.2 → add__2dict_eagle/tabString/im不过是 \list-ofchf_one \nCompute_with_prt_init: (New Tool Pro)\n-Main%-day_ ** [B1] : {nb_z0'];\n--own-traor: with: =: use 0.096-10_l_`this col0: 26;```\n</t_lN-蔓音频四文アنتストu+002:htt 도 원책임.(↑): The thought_dirty_s","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.1rc1.dev276+g54f548e9e-tp4-ep-614b7488","usage":{"prompt_tokens":45,"total_tokens":245,"completion_tokens":200,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Output with this PR:

{"id":"chatcmpl-b79acb2e48acc5d0","object":"chat.completion","created":1778891747,"prompt_routed_experts":null,"model":"/mnt/weka/data/llm-d-models-pv/MiniMaxAI-MiniMax-M2.7","choices":[{"index":0,"message":{"role":"assistant","content":"We are going to write a quick sort algorithm in Python.\n We will define a function quicksort that takes a list as input.\n We will choose a pivot (commonly the last element, but we can also choose a random element or the middle).\n We will partition the list into two parts: elements less than the pivot and elements greater than the pivot.\n Then we recursively sort the two parts and combine them with the pivot in between.\n\n However, note that the problem asks for a quick sort algorithm, so we'll implement the standard in-place quick sort.\n\n Steps:\n 1. If the list has length 0 or 1, it is already sorted.\n 2. Otherwise, select a pivot (we'll use the last element for simplicity).\n 3. Partition the list into two sublists: left (elements less than pivot) and right (elements greater than or equal to pivot).\n 4. Return the sorted left part, then the pivot, then the sorted right part.\n\n Alternatively, we","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.1rc1.dev276+g54f548e9e-tp4-ep-614b7488","usage":{"prompt_tokens":45,"total_tokens":245,"completion_tokens":200,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

hsubramony and others added 29 commits May 15, 2026 09:31
When mrope_interleaved is enabled, HPUMRotaryEmbedding was still using
the non-interleaved split/concat section mapping for cos/sin.
This produced incorrect rotary channel ordering for multimodal MRoPE
inputs and could cause sample-level mismatches against upstream vLLM
behavior.
Use apply_interleaved_rope for the interleaved branch, and preserve the
existing split/concat logic for non-interleaved layouts.

Signed-off-by: Harish Subramony <harish.subramony@intel.com>
Co-authored-by: Jimin Ha <jimin.ha@intel.com>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Seunghyuk Park (shepark) <separk@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…project#1264) (vllm-project#1401)

Bug 1 (hpu_async_scheduler): clamp num_external_computed_tokens to 0 in
_update_requests_with_invalid_blocks() override. When OOM causes block
invalidation the affected-token span can exceed the externally-computed
prefix, incorrectly driving num_external_computed_tokens negative.

Bug 2 (hpu_async_scheduler): fix stale num_cached_tokens after
preemption. After OOM preemption and requeue a request restarts from
num_computed_tokens=0; the OffloadingConnector may assign new external
cache hits leaving num_cached_tokens inconsistent (<
num_external_computed_tokens). A schedule() post-processing pass detects
and corrects this.

Bug 2b (utils): clamp PromptTokenStats.get_by_source() to 0 via
monkey-patch. During the brief inconsistency window the Prometheus
counter would crash with "Counters can only be incremented by
non-negative amounts".

Bug 3 (hpu_model_runner): fix tensor shape mismatch [N,1] vs [N,M] in
the async scheduling path of _create_decode_input_data when a
spec-decode request has num_tokens > 1.

Bug 4 (hpu_model_runner): prevent Habana workspace OOM triggered by
OffloadingConnector requeuing a decode request with many scheduled
tokens. Route multi-token non-spec-decode requests through the prefill
bucket path (which handles large context correctly) instead of the
decode bucket path (which has no prepared bucket for
batch_size=N*blocks, causing JIT recompile with a 107 GiB workspace
allocation).
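
The clamp described in Bug 1 can be sketched as follows. This is a minimal illustration, assuming the invalidated span is simply subtracted from the externally computed prefix; the function and argument names are hypothetical and are not taken from `hpu_async_scheduler`.

```python
def clamp_external_computed_tokens(num_external_computed_tokens: int,
                                   num_affected_tokens: int) -> int:
    """Subtract the invalidated span without letting the result go negative.

    Hypothetical helper: only the clamp-to-zero idea comes from the
    commit message, the signature is an assumption.
    """
    return max(0, num_external_computed_tokens - num_affected_tokens)


# The affected span (256) exceeds the external prefix (128): clamp to 0.
oom_case = clamp_external_computed_tokens(128, 256)
# The affected span fits inside the external prefix: plain subtraction.
normal_case = clamp_external_computed_tokens(256, 128)
```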

    Co-authored-by: GitHub Copilot

---------

---------

Signed-off-by: Harish Subramony <harish.subramony@intel.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kamil Kaczor <kamil.kaczor@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
vllm-project#1433 fixed a Qwen3.5
accuracy regression that was only detected
when the prompt bucket batch size is large. Adding
VLLM_PROMPT_BS_BUCKET_MAX=32 to the CI test covers that case.
Also tighten the passing threshold to better catch future regressions.

Signed-off-by: Seunghyuk Park <separk@habana.ai>
Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com>
Co-authored-by: Libin Tang <libin.tang@intel.com>
…1447)

## Fixes

Two bugs introduced by vllm-project#1122 (commit f24f3f9):

### 1. IndexError when using file-based bucketing (GAUDISW-248587)
When `VLLM_BUCKETING_FROM_FILE` is used (e.g. GraniteMoeHybrid model),
`ctx_range` is passed as an empty list to `generate_buckets()`. The
`num_ctx_tokens_less_or_equal_batched_max_model_len` filter accessed
`ctx_range[0]` unconditionally, causing `IndexError: list index out of
range`.

**Fix**: Safe access with fallback to 0 when `ctx_range` is empty.
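
A minimal sketch of the safe access, assuming the filter only needs the first element of `ctx_range`. The helper name is hypothetical; the real filter lives inside `generate_buckets()` and may look different.

```python
from typing import Sequence


def first_ctx_or_zero(ctx_range: Sequence[int]) -> int:
    """Return ctx_range[0], or 0 when file-based bucketing passes an empty list."""
    # Indexing an empty list raises IndexError, so check emptiness first.
    return ctx_range[0] if ctx_range else 0


empty_case = first_ctx_or_zero([])
normal_case = first_ctx_or_zero([4, 8, 16])
```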

### 2. Contiguous PA decode buckets incorrectly filtered
(GAUDISW-248598)
The ctx filter was applied to contiguous PA decode buckets, incorrectly
dropping valid buckets. For example, with `max_model_len=2048`,
`block_size=256`, `max_num_seqs=256`, bucket `(256, 1, 2112)` was
filtered because `2112 > ceil(2048/256)*256 = 2048`, but 2112 is a valid
user-configured `VLLM_DECODE_BLOCK_BUCKET_MAX`.

**Fix**: Remove the ctx filter from contiguous PA decode buckets. For
contiguous PA, the block range is already bounded by `max_blocks` in the
bucketing strategies.
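
The arithmetic behind the dropped bucket can be checked with a short sketch. It assumes the second `256` in `ceil(2048/256)*256` is the batch size (`max_num_seqs`), which the text above does not state explicitly; the variable names are illustrative only.

```python
import math

max_model_len = 2048
block_size = 256
max_num_seqs = 256      # assumed to be the multiplier in the formula above
bucket_blocks = 2112    # third element of the bucket (256, 1, 2112)

# Limit the removed filter compared against.
ctx_limit = math.ceil(max_model_len / block_size) * max_num_seqs

# True means the old filter dropped the bucket.
was_filtered = bucket_blocks > ctx_limit
```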

## Tests
- Added `test_file_buckets_with_empty_ctx_range_no_crash` — reproduces
the server.log IndexError
- Added `test_contiguous_pa_decode_buckets_not_filtered_by_ctx` —
reproduces the std_out.txt issue
- Narrowed `test_decode_buckets_satisfy_ctx_filter` to non-contiguous PA
only
- Updated docstrings

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
…ject#1449)

Upstream vllm commit 5536fc0c0 changed MambaSpec.mamba_type from str to
MambaAttentionBackendEnum. The hybrid cache allocation in
hpu_model_runner.py still compared against str literals, causing GDN
layers to fall through to the Mamba2 shared-buffer path. This created
mixed-dtype views (bf16 conv_state+fp32 ssm_state) on the same storage,
triggering an aot_autograd assertion error during compilation.

Use a module-level _GDN_MAMBA_TYPES tuple that includes both enum values
and string literals for backward compatibility with older upstream
versions.
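
Why the string comparison silently failed, and how a mixed tuple avoids it, can be sketched as follows. The enum class, its members, and the string values are hypothetical stand-ins, not the upstream `MambaAttentionBackendEnum` definition.

```python
from enum import Enum


class MambaBackend(Enum):
    """Hypothetical stand-in for the upstream enum."""
    GDN = "gdn_attention"
    MAMBA2 = "mamba2"


# Accept the enum member (newer upstream) and the plain string (older upstream).
GDN_TYPES = (MambaBackend.GDN, "gdn_attention")


def is_gdn(mamba_type) -> bool:
    """Membership test that works for both representations."""
    return mamba_type in GDN_TYPES


# A plain Enum member never compares equal to its string value,
# which is why the old `== "..."` checks fell through.
enum_equals_str = MambaBackend.GDN == "gdn_attention"
```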

---------

Signed-off-by: Seunghyuk Park <separk@habana.ai>
…ror on HPU (vllm-project#1412)

## Summary

Upstream vLLM decorates `batched_count_greater_than` with
`@torch.compile(dynamic=True)`, which causes Habana's `recipe_compiler`
to raise `TypeError: Cannot convert symbols to int` when processing
symbolic shapes. Additionally, `mark_unbacked` in the caller
(`gather_logprobs`) prevents `dynamic=False` from being a viable
alternative.

## Fix

Replace with a plain (uncompiled) version of the same function. The
patching is deferred to `load_general_plugins` time via a hook on
`vllm.plugins.load_general_plugins`, because importing
`vllm.v1.sample.sampler` during early plugin registration triggers a
heavy import chain that interferes with platform initialisation.

## Why deferred patching?

- Importing `vllm.v1.sample.sampler` during `apply()` (called from
`register()`) triggers a heavy import chain that resets platform
detection, causing `Device string must not be empty`.
- The patching hooks into `load_general_plugins` which runs in every
process (parent + EngineCore subprocess) after the platform is ready.
- `sampler.py` uses `from ... import batched_count_greater_than` which
creates a module-level global resolved via `LOAD_GLOBAL` at call time,
so patching the module attribute works.
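
The last point can be demonstrated with two synthetic modules. This is only a sketch: `ops` and `sampler` are hypothetical stand-ins built with `types.ModuleType`, not the real vLLM modules, and the function names are invented for illustration.

```python
import types

# Stand-in for the module that defines the (compiled) helper.
ops = types.ModuleType("ops")
exec("def count_greater_than(xs, t):\n    return 'compiled'\n", ops.__dict__)

# Stand-in for sampler.py: `from ops import count_greater_than` binds the
# function as a module-level global of `sampler`.
sampler = types.ModuleType("sampler")
sampler.count_greater_than = ops.count_greater_than
exec("def gather(xs, t):\n    return count_greater_than(xs, t)\n", sampler.__dict__)


def plain_count_greater_than(xs, t):
    """Plain (uncompiled) replacement."""
    return sum(1 for x in xs if x > t)


before = sampler.gather([1, 5, 9], 4)

# Patch the attribute on the importing module; `gather` looks the name up
# in its module globals on every call, so it picks up the replacement.
sampler.count_greater_than = plain_count_greater_than
after = sampler.gather([1, 5, 9], 4)
```

Patching `ops.count_greater_than` instead would not affect `sampler.gather`, because `sampler` holds its own binding of the name.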

## Testing

- `test_skip_tokenizer_initialization` PASSES
- `test_engine_args` (3 tests) PASS
- Inference with `logprobs=5` produces correct output

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
…vllm-project#1441)

## Problem

DeepSeek R1 (671B) crashes during warmup on G3 with FP8 quantization
(GAUDISW-248418).

Two error manifestations:
- `RuntimeError: Incompatible input shapes, broadcast not possible.
Tensor1 Size: 7168 30720 Tensor2 Size: 256 1`
- `RuntimeError: Attempting to broadcast a dimension of length 256 at
-1! Mismatching argument at index 1 had torch.Size([1, 256]); but
expected shape should be broadcastable to [8192, 7168]`

Both crash at `hpu_grouped_topk_router.py:64` during MoE gate
application.

## Root Cause

`_forward_impl` introduces graph breaks via
`_sequence_parallel_context()` (calls `get_forward_context()`). Combined
with double gate application (gate called in `patched_fused_moe_forward`
AND again inside `_forward_impl`), Dynamo miscompiles the graph on HPU
Synapse, causing shape mismatches.

Regression window: Build 254 (good) → Build 260 (broken), introduced by
commit `98863a7` (MoE dynamo recompilation fix).

## Fix

For `dp_size==1` (the common single-node case), bypass `_forward_impl`
entirely and call `_apply_quant_method` + `_maybe_combine` directly.
This:
1. Eliminates graph breaks from `_sequence_parallel_context()` and
`get_forward_context()`
2. Skips the no-op `_maybe_dispatch()` (only needed for dp_size > 1)
3. Prevents double gate application
4. Adds a RuntimeError guard for `pcp_size > 1` (unsupported in fast
path)

The `dp_size > 1` fallback via `_forward_entry` is unchanged.

## Testing

Tested on G3 (8x HL-325L) with DeepSeek R1 671B FP8 TP=8:
- ✅ Prompt warmup: 54/54 items completed (crash site in original bug)
- ✅ Decode warmup: 25/25 items completed
- ✅ End-to-end inference: valid completions returned

Fixes: GAUDISW-248418

---------

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…m-project#1454)

Revert the decode bucket filter introduced in f24f3f9 that drops buckets
with batched contexts larger than batched max_model_len as it is
functionally duplicate to
[correct_for_max_model_len](https://github.com/vllm-project/vllm-gaudi/blob/e5b23b22af2a32fb572df8b3c75758ba3df1795f/vllm_gaudi/extension/bucketing/common.py#L442).

## Changes:
- Remove the `num_ctx_tokens_less_or_equal_batched_max_model_len` filter
function from `generate_buckets()`
- Revert `filters_map` decode filters to pre-f24f3f9 state (`True: []`,
`False: [batch_size_smaller_than_blocks]`)
- Remove corresponding tests
(`test_exponential_decode_block_limit_uncapped`,
`test_decode_buckets_satisfy_ctx_filter`)

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Co-authored-by: Kamil Kaczor <kamil.kaczor@intel.com>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…project#1434)

## Problem

For hybrid models (e.g., Qwen3.5-35B-A3B), decode buckets warmed during
startup are later reported as "not warmed-up" during inference. This
causes every decode step to fall back to the `_check_config` warning
path and potentially suboptimal performance.

## Root Cause

Two related issues:

### 1. `initialize_kv_cache` overwrites `block_size` with inflated
KV-manager page size

Lines added in main (not present in v0.19.0) in `initialize_kv_cache`:
```python
self.block_size = self.vllm_config.cache_config.block_size
self.bucketing_manager.block_size = self.block_size
```

For hybrid models, `HybridAttentionMambaModelConfig` sets
`cache_config.block_size` to a large aligned page size (e.g., 1152 for
Qwen3.5 with Mamba layers). This overwrites `self.block_size` from 128
to 1152 **after** the HPU platform's `check_and_update_config` had
already reset it to 128.

This causes `generate_buckets()` to produce decode buckets at 1152-token
granularity (max ~10,260 blocks), while `_create_decode_input_data`
computes `num_blocks` using `attn_block_size=128` (max ~92,160 blocks).
The runtime values exceed warmed buckets, triggering "not warmed-up"
warnings.

### 2. `_prepare_dummy_scenario` used wrong block_size for decode

The decode dummy sequence generation used `self.block_size` instead of
`self.attn_block_size`, causing a mismatch with
`_create_decode_input_data` which uses `self.attn_block_size`.

## Fix

1. **Remove the `block_size` overwrite in `initialize_kv_cache`** -
These lines must not be present because `self.block_size` is already set
correctly during `__init__` and must remain at 128 (the HPU kernel block
size) for proper bucket generation. The KV-manager page size (1152) is a
separate concept used for memory allocation, not for bucketing.

2. **Use `self.attn_block_size` in `_prepare_dummy_scenario`** for
decode sequences, matching what `_create_decode_input_data` uses.

## Verification

- Tested on Gaudi3 (HL-325) with Qwen/Qwen3.5-35B-A3B, TP=2, EP=2
- 247 prompt + 117 decode buckets warmed successfully
- Decode bucket range: 1 to 21,858 blocks (correct, using 128-token
granularity)
- Multiple inference requests completed with **zero** "not warmed-up"
warnings
- Server log (537 lines) contains no `_check_config` or warmup mismatch
warnings

## Why v0.19.0 worked

The `initialize_kv_cache` method in v0.19.0 did **not** have the
`self.block_size = self.vllm_config.cache_config.block_size` lines, so
`block_size` stayed at 128 throughout the lifecycle.

Signed-off-by: Agata Dobrzyniewicz <agata.dobrzyniewicz@intel.com>
Qwen3Next uses a hybrid GDN+attention architecture that requires
separate KV cache groups for GDN vs standard attention layers. Add it to
the mamba_like_arch list so maybe_set_mamba_kv_cache_groups_ids() sets
up the cache groups correctly.

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (vllm-project#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing
`notify_kv_transfer_request_rejected` abstract method (upstream PR
vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read
`enforce_eager` from model card config (with env var override), enabling
per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and
`enable_eplb` parameter from `patched_create_fused_moe_router` after
upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has
been removed — the root cause (mamba_type str-vs-Enum comparison in
hybrid cache allocation) is properly fixed by vllm-project#1449, which should merge
first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed
tensors).

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
1) added in vllm-project#1453
16 is supported for testing/smaller models; 128 is the standard HPU
kernel block size; 528 is required for Granite 4.0-H
(granitemoehybrid) without prefix caching (16-token FA alignment),
768 with prefix caching (chunk-aligned).

2) _patch_hf3fs_mock_client_for_cpu_only
The upstream mock client unconditionally calls
``torch.cuda.current_stream().wait_event(event)`` in ``batch_write``.
In environments where PyTorch is not compiled with CUDA, that path throws
and the method returns ``-1`` for writes, causing connector unit tests to
fail. This patch keeps the same behavior but skips CUDA synchronization
when CUDA is unavailable.

---------

Signed-off-by: Harish Subramony <harish.subramony@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
This pull request updates the `.github/workflows/pre-merge.yaml`
workflow configuration to add a `timeout-minutes: 720` (12 hours) limit
to all jobs. This change ensures that no individual job in the pre-merge
workflow can run indefinitely, which helps prevent stuck or runaway jobs
in CI and improves overall pipeline reliability.

**CI/CD Workflow Improvements:**

* Added `timeout-minutes: 720` to all jobs in
`.github/workflows/pre-merge.yaml` to enforce a 12-hour maximum runtime
per job. This applies to jobs such as `retrieve_head_sha`, `gatekeeper`,
`discover_runner`, `discover_tests`, `discover_calibration_tests`, test
execution jobs, and finalization/cleanup jobs.

No other logic or behavior changes were made—this is a
configuration-only update to improve CI robustness.

Signed-off-by: Bartosz Myrcha <bartosz.myrcha@intel.com>
…floading_connector test flush assertion for load transfers (vllm-project#1468)

Upstream vLLM PR vllm-project/vllm#42611 ("Flush all pending jobs on
last step") changed `get_flushed_transfers()` to return both store and
load flushes. The vllm-gaudi copy of the offloading_connector unit tests
assumed only store flushes, causing:

1. `AssertionError` in `utils.py` `_parse_transfers`
(`isinstance(src_spec, GPULoadStoreSpec)` assert fails on load
flushes)
2. `flushed_gpu_block_indexes` mismatch in `test_scheduler` tests

**Fix**: Mirror the upstream change: replace the assert with an
`if/else` handling both store and load flush types, and add
`expected_flushed_gpu_block_indexes` to affected tests.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Bartosz Myrcha <bartosz.myrcha@intel.com>
…llm-project#1473)

## Summary

Adds `environment: approved-workflow` to every job that consumes
`secrets.HF_TOKEN` across the three CI workflows. Together with the
existing approval gate in `pre-merge-trigger.yaml` (`environment:
pre-merge-approval`, added in vllm-project#1471), this completes the two-layer
protection model:

```
PR opened
  -> pre-merge-trigger `gate` job: pauses for required reviewer (approval #1)
  -> on approval, pre-merge.yaml is dispatched
  -> downstream secret-using jobs resolve HF_TOKEN from the
     `approved-workflow` environment (no second per-job approval)
```

## Why

With `HF_TOKEN` previously at repo-secret scope, any matrix entry of any
e2e/test job had direct access the moment CI started. The recent
malicious fork PR exfiltrated it via an auto-discovered `run_*`
function. After this change, the token is only released from a GitHub
Environment that a maintainer-controlled deployment-branch rule
restricts to `main` / `releases/**`, and only after the upstream gate
has approved the dispatch.

We deliberately add the environment only on jobs that actually use the
secret (15 jobs). Helper jobs (`gatekeeper`, `discover_*`, `retrieve_*`,
`pre-commit`, `post-comment`, `cleanup_*`, `build_nixl_dockerfile`,
`check_dockerfile_changes`, `prepare-release-branch`,
`summarize_and_notify`, `setup_and_build`,
`store_last_stable_vllm_commit`) do not touch HF_TOKEN and are not
modified, to avoid pointless extra gate evaluations.

## Affected jobs (15)

- `pre-merge.yaml`: `hpu_unit_tests`, `hpu_pd_tests`, `hpu_perf_tests`,
`hpu_dp_tests`, `e2e`, `calibration_tests`
- `hourly-ci.yaml`: `run_unit_tests`, `e2e`, `run_data_parallel_test`,
`run_pd_disaggregate_test`
- `create-release-branch.yaml`: `run_unit_tests`, `e2e`,
`run_data_parallel_test`, `run_pd_disaggregate_test`,
`run_hpu_perf_tests`

## Diff

+15 lines, 0 deletions. Each touched job gets exactly one new line:
`environment: approved-workflow`, inserted immediately after `runs-on:`.

## Required repo configuration (before this PR can be merged safely)

1. Settings → Environments → create environment **`approved-workflow`**.
2. Add **`HF_TOKEN`** as an environment secret (the rotated value).
3. **No required reviewers** on this environment (the upstream
`pre-merge-approval` gate already enforces approval; adding reviewers
here would prompt once per job).
4. **Deployment branches and tags**: Selected branches → `main`,
`releases/**`. Prevents a fork PR from claiming the environment from a
non-trusted ref.
5. **Delete** `HF_TOKEN` from repository-level secrets so the
environment value is the only source.

## Testing

Validated end-to-end against `bmyrcha/vllm-gaudi` first using a benign
fork PR. With the two environments configured as above, the gate paused
as expected, jobs received the secret after approval without a second
prompt, and a deliberately mis-authored downstream PR could not reach
the secret.

Close-cross-ref: builds on vllm-project#1471.

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
…namicNTKScalingRotaryEmbedding and HPUCompressedTensorsConfig (vllm-project#1479)

## Root cause
Upstream vLLM at SHA 0a54df28 introduced two API changes that broke
vllm-gaudi:
1. PR vllm-project/vllm#41277 added a required `max_trained_positions`
parameter to `DynamicNTKScalingRotaryEmbedding.__init__()`, causing the
unit test to fail with TypeError.
2. PR vllm-project/vllm#43144 removed `sparsity_scheme_map` and
`sparsity_ignore_list` from `CompressedTensorsConfig.__init__()`,
causing `HPUCompressedTensorsConfig` instantiation to fail during e2e
tests.

## Upstream PR
vllm-project/vllm#41277
Added max_trained_positions to DynamicNTKScalingRotaryEmbedding

vllm-project/vllm#43144
Removed sparsity parameters from CompressedTensorsConfig

## Fix
1. Add `max_trained_positions` parameter to the rotary embedding unit
test.
2. Remove stale `sparsity_scheme_map` and `sparsity_ignore_list` from
HPUCompressedTensorsConfig init signature and super() call, plus the
unused SparsityCompressionConfig import.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…fast path (vllm-project#1469)

PR vllm-project#1441 added an _hpu_gate_ref fallback in the dp_size==1 fast path
that unconditionally re-invoked a runner-owned gate, overwriting
router_logits supplied by the caller. For SharedFusedMoE models
(Qwen3 MoE, ernie45, ...) the block's mlp.gate(...) has already
produced router_logits and _sync_shared_moe_gates sets
runner.gate=None post-INC; the cached _hpu_gate_ref still points at
the pre-INC module and produced shape/dtype mismatches under fp8.

Only invoke the runner-owned gate when the caller did not provide
router_logits, preserving the DeepSeek R1 internal-router fast path
from vllm-project#1441.
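
The guard can be sketched as follows. All names here (`resolve_router_logits`, `runner_gate`) are hypothetical and the tensors are replaced by plain lists; the sketch only illustrates the "caller-supplied logits win" rule, not the actual fast-path code.

```python
from typing import Callable, Optional, Sequence


def resolve_router_logits(
    hidden_states: Sequence[float],
    router_logits: Optional[Sequence[float]],
    runner_gate: Optional[Callable[[Sequence[float]], Sequence[float]]],
) -> Sequence[float]:
    """Use caller-supplied logits; fall back to the runner-owned gate otherwise."""
    if router_logits is not None:
        # The block's own gate already ran; do not overwrite its output.
        return router_logits
    if runner_gate is None:
        raise ValueError("no router_logits and no runner-owned gate available")
    # Internal-router case: the runner owns the gate.
    return runner_gate(hidden_states)


calls = []


def fake_gate(hidden):
    """Records that it ran and returns dummy logits."""
    calls.append("gate")
    return [0.0 for _ in hidden]


supplied = resolve_router_logits([1.0, 2.0], [0.3, 0.7], fake_gate)
calls_after_supplied = list(calls)
computed = resolve_router_logits([1.0, 2.0], None, fake_gate)
```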

---------

Signed-off-by: Iryna Boiko <iboiko@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…project#1465)

Move prompt_token_ids to self.device in selective sampling metadata
creation for both skip_copy paths.
This keeps prompt and output penalty masks on the same device and
prevents runtime device mismatch errors during
repetition/presence/frequency penalty application.

Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
…sizes (vllm-project#1485)

## Problem

For hybrid models like Qwen3.5 (GDN + attention),
`_align_hybrid_block_size()` sets `block_size=640` (unified KV-cache
page for mamba/attention alignment), while HPU kernels use
`attn_block_size=128`.

The decode bucket generation (introduced by f24f3f9) uses the formula:
```
max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450
```

But the runtime decode path (`_create_decode_input_data`) computes
`num_blocks` using `attn_block_size=128`, producing values up to
`ceil(262144/128) * 45 = 92160`.

This causes hundreds of **"Configuration was not warmed-up"** warnings
and costly HPU graph recompilation on every decode step.
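
The two block counts quoted above can be reproduced with a short sketch, using the `max_model_len` and `max_num_seqs` values from the formula; the function name is illustrative only.

```python
import math

max_model_len = 262144
max_num_seqs = 45


def max_decode_blocks(block_size: int) -> int:
    """ceil(max_model_len / block_size) * max_num_seqs, as in the formula above."""
    return math.ceil(max_model_len / block_size) * max_num_seqs


warmed_limit = max_decode_blocks(640)   # what bucket generation covered
runtime_limit = max_decode_blocks(128)  # what the decode path can request
```

The runtime limit is roughly five times the warmed limit, so most large decode steps land outside the warmed buckets.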

## Root Cause

Two different block_size semantics coexist:
- `self.block_size = 640`: KV-cache management page size (unified for
hybrid mamba/attention)
- `self.attn_block_size = 128`: HPU attention kernel page size (what
hardware actually uses)

Decode bucket generation used `block_size` but should use
`attn_block_size` to match the runtime.

## Fix

Temporarily scope `bucketing_manager.block_size` to `attn_block_size`
during decode bucket generation in `warmup_model()`, then restore the
original value so prompt fallback paths remain unaffected.
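
One way to express "temporarily scope, then restore" is a context manager. This is a sketch under assumptions: `scoped_block_size` is a hypothetical helper and `SimpleNamespace` stands in for the bucketing manager; the actual change in `warmup_model()` may simply assign and restore the attribute inline.

```python
from contextlib import contextmanager
from types import SimpleNamespace


@contextmanager
def scoped_block_size(manager, block_size: int):
    """Override manager.block_size inside the `with` block, then restore it."""
    original = manager.block_size
    manager.block_size = block_size
    try:
        yield manager
    finally:
        # Restore even if bucket generation raises.
        manager.block_size = original


# Stand-in for the bucketing manager with the unified KV-cache page size.
manager = SimpleNamespace(block_size=640)

with scoped_block_size(manager, 128):
    inside = manager.block_size   # value seen by decode bucket generation
outside = manager.block_size      # value seen by prompt fallback paths
```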

## Testing

- Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4,
max_model_len=262144, max_num_seqs=45)
- Decode buckets now correctly cover runtime num_blocks range
- No more "Configuration was not warmed-up" warnings during serving

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n_linear_attn import path after upstream mamba refactor (vllm-project#1496)

## Root cause

Upstream vLLM PR vllm-project/vllm#41126 (commit 7e1b45a092) refactored
`vllm.model_executor.layers.mamba.gdn_linear_attn.GatedDeltaNetAttention`
into a `gdn/` subpackage:
`vllm.model_executor.layers.mamba.gdn.qwen_gdn_linear_attn.QwenGatedDeltaNetAttention`.

This broke `vllm_gaudi/models/qwen3_5.py` which imported from the old
path.

## Fix

Updated 6 lines in `vllm_gaudi/models/qwen3_5.py`:
- Changed import path from `gdn_linear_attn` to
`gdn.qwen_gdn_linear_attn`
- Updated class reference from `GatedDeltaNetAttention` to
`QwenGatedDeltaNetAttention`

## Upstream compatibility

Pinned to vLLM SHA: `b06813e87207e15b133e903d641e03f237d85b17`

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
vllm-project#1482)

…d models (vllm-project#1413)"

This reverts commit 808dbfa.

Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
@skavulya skavulya requested a review from wpyszka as a code owner May 28, 2026 17:30
Copilot AI review requested due to automatic review settings May 28, 2026 17:30
@skavulya skavulya requested a review from PatrykWo as a code owner May 28, 2026 17:30
@skavulya skavulya had a problem deploying to pre-merge-approval May 28, 2026 17:30 — with GitHub Actions Error
@skavulya skavulya closed this May 28, 2026
Contributor

Copilot AI left a comment


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR improves HPU runtime robustness around KV-offload/preemption, hybrid-model decode bucketing/warmup behavior, and a few compatibility patches for upstream API/behavior changes.

Changes:

  • Add scheduling/bookkeeping fixes for KV-offload preemption and guard metrics against negative prompt-token counter increments.
  • Fix hybrid-model decode bucketing & warmup logic (attn_block_size vs KV page size), and add regression tests for bucket coverage.
  • Add/adjust several monkey-patches (MoE runner gate ownership, hf3fs mock client CPU-only behavior, sampler op workaround) and minor model/operator updates.

Reviewed changes

Copilot reviewed 30 out of 31 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| vllm_gaudi/v1/worker/hpu_worker.py | Centralizes GDN mamba-type detection via a shared constant. |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Adds decode reordering for multi-token catch-up, hybrid warmup fixes, and MoE gate/dedup adjustments. |
| vllm_gaudi/v1/worker/hpu_input_batch.py | Ensures selective sampling prompt-token IDs are on-device for penalty computation. |
| vllm_gaudi/v1/core/sched/hpu_async_scheduler.py | Adds HPU overrides for cached-token staleness and invalid block bookkeeping. |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Expands supported kernel block sizes (adds 16). |
| vllm_gaudi/utils.py | Monkey-patches PromptTokenStats to clamp negative counter increments. |
| vllm_gaudi/patches.py | Adds CPU-only-safe hf3fs mock client patch and defers sampler op patching to plugin load time. |
| vllm_gaudi/ops/hpu_rotary_embedding.py | Supports interleaved mRoPE path via upstream helper. |
| vllm_gaudi/ops/hpu_fused_moe.py | Changes dp_size==1 fast path and updates MoE router factory patching. |
| vllm_gaudi/ops/hpu_compressed_tensors.py | Removes sparsity-related args/types from compressed tensors path. |
| vllm_gaudi/models/qwen3_5.py | Switches Qwen GDN attention import path and patches upstream symbols accordingly. |
| vllm_gaudi/models/minimax_m2.py | Removes TP all-reduce usage from a MiniMax MoE forward path. |
| vllm_gaudi/extension/bucketing/common.py | Simplifies decode-bucket filters, affecting bucket validity constraints. |
| vllm_gaudi/entrypoints/openai/multi_model_api_server.py | Adds KV-transfer rejection notification passthrough + formatting tweaks. |
| tests/unit_tests/worker/test_ensure_multi_token_decodes_last.py | Adds unit tests for decode-region reordering helper. |
| tests/unit_tests/test_decode_bucket_hybrid.py | Adds regression tests for hybrid decode bucket generation & warmup scenarios. |
| tests/unit_tests/test_bucketing.py | Updates/trim decode cfg test descriptions and removes some prior bucket filter tests. |
| tests/unit_tests/ops/test_hpu_rotary_embedding.py | Adds max_trained_positions for rotary embedding test config. |
| tests/unit_tests/lora/test_llm_with_multi_loras.py | Removes HF token dependency from LoRA test setup. |
| tests/unit_tests/lora/test_llama_tp.py | Removes HF token dependency from LoRA TP test setup. |
| tests/unit_tests/kv_offload/offloading_connector/utils.py | Handles both store-flush and load-flush cases when parsing transfers. |
| tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py | Updates expectations for async scheduling flush timing and adds invariant checks. |
| tests/models/language/generation/test_common.py | Refactors config/env parsing and improves formatting for readability. |
| tests/full_tests/model_cards/qwen3.5-35b-a3b.yaml | Updates expected metric value. |
| tests/full_tests/ci_e2e_discoverable_tests.sh | Sets prompt BS bucket max for the Qwen3.5 GSM8K e2e test. |
| requirements.txt | Removes ray and transformers requirements from this file. |
| README.md | Pins torchaudio to the local torch version for CPU wheel install guidance. |
| .github/workflows/pre-merge.yaml | Adds long timeouts and requires an environment for several jobs. |
| .github/workflows/pre-merge-trigger.yaml | Adds an explicit approval gate environment before triggering pre-merge. |
| .github/workflows/hourly-ci.yaml | Requires an environment for hourly CI execution jobs. |
| .github/workflows/create-release-branch.yaml | Requires an environment for release-branch CI execution jobs. |

# eplb parameters
enable_eplb: bool = False,
eplb_state: EplbLayerState = EMPTY_EPLB_STATE,
eplb_state: EplbLayerState | None = None,
Comment on lines +458 to +459
True: [],
False: [batch_size_smaller_than_blocks],
Comment thread vllm_gaudi/patches.py
Comment on lines +200 to +206
_original_load_general = _plugins_mod.load_general_plugins

def _load_general_with_hpu_patches():
    _original_load_general()
    _patch_batched_count_greater_than()

_plugins_mod.load_general_plugins = _load_general_with_hpu_patches
Comment thread vllm_gaudi/utils.py
Comment on lines +299 to +306
_stats_get_by_source_orig = _stats_module.PromptTokenStats.get_by_source


def _hpu_get_by_source(self, source: str) -> int:
    return max(0, _stats_get_by_source_orig(self, source))


_stats_module.PromptTokenStats.get_by_source = _hpu_get_by_source
return [input[i] if i is not None else v for i in indices]


def ensure_multi_token_decodes_last(b: InputBatch, scheduled_tokens: Mapping[str, int]) -> None:
Comment on lines +452 to +457
num_reqs = b.num_reqs
decode_end = num_reqs
for i in range(num_reqs):
    if b.num_computed_tokens_cpu[i] < b.num_prompt_tokens[i]:
        decode_end = i
        break
Comment on lines +28 to +35
output = super().schedule()
for request in self.running:
    # vLLM Request no longer exposes num_cached_tokens on newer
    # branches. Keep the old fix only when the field exists.
    if (hasattr(request, "num_cached_tokens")
            and request.num_cached_tokens < request.num_external_computed_tokens):
        request.num_cached_tokens = request.num_computed_tokens
return output

import pytest
import torch
import habana_frameworks.torch # noqa: F401
max_num_reqs=max(len(reqs), 1),
max_model_len=1024,
max_num_batched_tokens=1024,
device=torch.device("hpu"),
@mergify

mergify Bot commented May 28, 2026

⚠️ The sha of the head commit of this PR conflicts with #1451. Mergify cannot evaluate rules on this PR. Once #1451 is merged or closed, Mergify will resume processing this PR. ⚠️
