[v0.21.0] Fix accuracy issue in minimax_m2 with TP > 1#1505
Closed
skavulya wants to merge 29 commits into
When mrope_interleaved is enabled, HPUMRotaryEmbedding was still using the non-interleaved split/concat section mapping for cos/sin. This produced incorrect rotary channel ordering for multimodal MRoPE inputs and could cause sample-level mismatches against upstream vLLM behavior. Use apply_interleaved_rope for the interleaved branch, and preserve the existing split/concat logic for non-interleaved layouts. Signed-off-by: Harish Subramony <harish.subramony@intel.com> Co-authored-by: Jimin Ha <jimin.ha@intel.com> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com> Co-authored-by: Seunghyuk Park (shepark) <separk@habana.ai>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…project#1264) (vllm-project#1401)

Bug 1 (hpu_async_scheduler): clamp num_external_computed_tokens to 0 in _update_requests_with_invalid_blocks() override. When OOM causes block invalidation the affected-token span can exceed the externally-computed prefix, incorrectly driving num_external_computed_tokens negative.

Bug 2 (hpu_async_scheduler): fix stale num_cached_tokens after preemption. After OOM preemption and requeue a request restarts from num_computed_tokens=0; the OffloadingConnector may assign new external cache hits leaving num_cached_tokens inconsistent (< num_external_computed_tokens). A schedule() post-processing pass detects and corrects this.

Bug 2b (utils): clamp PromptTokenStats.get_by_source() to 0 via monkey-patch. During the brief inconsistency window the Prometheus counter would crash with "Counters can only be incremented by non-negative amounts".

Bug 3 (hpu_model_runner): fix tensor shape mismatch [N,1] vs [N,M] in the async scheduling path of _create_decode_input_data when a spec-decode request has num_tokens > 1.

Bug 4 (hpu_model_runner): prevent Habana workspace OOM triggered by OffloadingConnector requeuing a decode request with many scheduled tokens. Route multi-token non-spec-decode requests through the prefill bucket path (which handles large context correctly) instead of the decode bucket path (which has no prepared bucket for batch_size=N*blocks, causing JIT recompile with a 107 GiB workspace allocation).

Co-authored-by: GitHub Copilot

---------

Signed-off-by: Harish Subramony <harish.subramony@intel.com>
Signed-off-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Artur Fierka <artur.fierka@intel.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Kamil Kaczor <kamil.kaczor@intel.com>
Signed-off-by: Youlei Yang <youlei.yang@intel.com>
vllm-project#1433 fixed a Qwen3.5 accuracy regression that was only detected when the prompt bucket batch size is large. Adding VLLM_PROMPT_BS_BUCKET_MAX=32 to the CI test covers that case. Also tighten the passing threshold to better catch future regressions. Signed-off-by: Seunghyuk Park <separk@habana.ai> Co-authored-by: Agata Dobrzyniewicz <160237065+adobrzyn@users.noreply.github.com> Co-authored-by: Libin Tang <libin.tang@intel.com>
…1447)

## Fixes

Two bugs introduced by vllm-project#1122 (commit f24f3f9):

### 1. IndexError when using file-based bucketing (GAUDISW-248587)

When `VLLM_BUCKETING_FROM_FILE` is used (e.g. GraniteMoeHybrid model), `ctx_range` is passed as an empty list to `generate_buckets()`. The `num_ctx_tokens_less_or_equal_batched_max_model_len` filter accessed `ctx_range[0]` unconditionally, causing `IndexError: list index out of range`.

**Fix**: Safe access with fallback to 0 when `ctx_range` is empty.

### 2. Contiguous PA decode buckets incorrectly filtered (GAUDISW-248598)

The ctx filter was applied to contiguous PA decode buckets, incorrectly dropping valid buckets. For example, with `max_model_len=2048`, `block_size=256`, `max_num_seqs=256`, bucket `(256, 1, 2112)` was filtered because `2112 > ceil(2048/256)*256 = 2048`, but 2112 is a valid user-configured `VLLM_DECODE_BLOCK_BUCKET_MAX`.

**Fix**: Remove the ctx filter from contiguous PA decode buckets. For contiguous PA, the block range is already bounded by `max_blocks` in the bucketing strategies.

## Tests

- Added `test_file_buckets_with_empty_ctx_range_no_crash`: reproduces the server.log IndexError
- Added `test_contiguous_pa_decode_buckets_not_filtered_by_ctx`: reproduces the std_out.txt issue
- Narrowed `test_decode_buckets_satisfy_ctx_filter` to non-contiguous PA only
- Updated docstrings

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
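The two fixes above can be illustrated with a minimal sketch. The helper names `first_ctx_or_zero` and `batched_ctx_limit` are hypothetical and are not taken from `vllm_gaudi`; only the empty-list fallback and the `ceil(2048/256)*256` arithmetic come from the commit message.

```python
import math


def first_ctx_or_zero(ctx_range):
    """Return the first context value, falling back to 0 for an empty range.

    Hypothetical helper: it mirrors the "safe access with fallback to 0"
    described above instead of indexing ctx_range[0] unconditionally.
    """
    return ctx_range[0] if ctx_range else 0


def batched_ctx_limit(max_model_len, block_size):
    """Upper bound used by the removed filter: ceil(len / block) * block."""
    return math.ceil(max_model_len / block_size) * block_size


# File-based bucketing passes an empty ctx_range; no IndexError is raised.
empty_case = first_ctx_or_zero([])

# The example from the commit message: 2112 exceeds the computed limit,
# which is why the bucket (256, 1, 2112) was dropped by the filter.
limit = batched_ctx_limit(2048, 256)
was_filtered = 2112 > limit

print(empty_case, limit, was_filtered)
```

The sketch only reproduces the condition that caused the bucket to be dropped; the actual fix removes the filter for contiguous PA rather than changing the bound.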
…ject#1449) Upstream vllm commit 5536fc0c0 changed MambaSpec.mamba_type from str to MambaAttentionBackendEnum. The hybrid cache allocation in hpu_model_runner.py still compared against str literals, causing GDN layers to fall through to the Mamba2 shared-buffer path. This created mixed-dtype views (bf16 conv_state+fp32 ssm_state) on the same storage, triggering an aot_autograd assertion error during compilation. Use a module-level _GDN_MAMBA_TYPES tuple that includes both enum values and string literals for backward compatibility with older upstream versions. --------- Signed-off-by: Seunghyuk Park <separk@habana.ai>
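A small sketch of why comparing an enum-typed field against string literals falls through, and how a tuple holding both forms keeps the check working across versions. `FakeBackendEnum`, its member names, and `is_gdn` are hypothetical stand-ins; the real `MambaAttentionBackendEnum` members and the contents of `_GDN_MAMBA_TYPES` are not shown in this PR text.

```python
from enum import Enum


class FakeBackendEnum(Enum):
    """Hypothetical stand-in for the upstream backend enum."""
    GDN_ATTN = "gdn_attention"
    MAMBA2 = "mamba2"


# Accept both the enum member (newer upstream) and the plain string
# (older upstream), in the spirit of the module-level tuple described above.
GDN_TYPES = (FakeBackendEnum.GDN_ATTN, "gdn_attention")


def is_gdn(mamba_type):
    """Return True when mamba_type denotes a GDN layer in either form."""
    return mamba_type in GDN_TYPES


# A plain Enum member never compares equal to its string value, which is
# why a string-only comparison silently took the wrong branch.
naive_match = FakeBackendEnum.GDN_ATTN == "gdn_attention"

print(naive_match, is_gdn(FakeBackendEnum.GDN_ATTN), is_gdn("gdn_attention"))
```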
…ror on HPU (vllm-project#1412)

## Summary

Upstream vLLM decorates `batched_count_greater_than` with `@torch.compile(dynamic=True)`, which causes Habana's `recipe_compiler` to raise `TypeError: Cannot convert symbols to int` when processing symbolic shapes. Additionally, `mark_unbacked` in the caller (`gather_logprobs`) prevents `dynamic=False` from being a viable alternative.

## Fix

Replace with a plain (uncompiled) version of the same function. The patching is deferred to `load_general_plugins` time via a hook on `vllm.plugins.load_general_plugins`, because importing `vllm.v1.sample.sampler` during early plugin registration triggers a heavy import chain that interferes with platform initialisation.

## Why deferred patching?

- Importing `vllm.v1.sample.sampler` during `apply()` (called from `register()`) triggers a heavy import chain that resets platform detection, causing `Device string must not be empty`.
- The patching hooks into `load_general_plugins` which runs in every process (parent + EngineCore subprocess) after the platform is ready.
- `sampler.py` uses `from ... import batched_count_greater_than` which creates a module-level global resolved via `LOAD_GLOBAL` at call time, so patching the module attribute works.

## Testing

- `test_skip_tokenizer_initialization` PASSES
- `test_engine_args` (3 tests) PASS
- Inference with `logprobs=5` produces correct output

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
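The deferred-patching idea can be sketched without vLLM. Everything below is a stand-in: `fake_sampler` imitates a module whose function calls a module-level global, and `load_general_plugins` imitates the hook point. The sketch only demonstrates that a global looked up at call time picks up a replacement assigned to the module attribute later.

```python
import types

# Hypothetical stand-in for a sampler module: gather() calls a
# module-level global, which Python resolves at call time.
fake_sampler = types.ModuleType("fake_sampler")
exec(
    "def count_greater_than(x, threshold):\n"
    "    return 'compiled'\n"
    "\n"
    "def gather(x):\n"
    "    return count_greater_than(x, 0)\n",
    fake_sampler.__dict__,
)


def plain_count_greater_than(x, threshold):
    """Uncompiled replacement (placeholder body for illustration)."""
    return "plain"


def load_general_plugins():
    """Stand-in for the plugin loader that runs once the platform is ready."""
    return "plugins loaded"


_original_load = load_general_plugins


def load_general_plugins_with_patch():
    """Run the original loader first, then swap the module attribute."""
    result = _original_load()
    fake_sampler.count_greater_than = plain_count_greater_than
    return result


before = fake_sampler.gather([1, 2, 3])
load_general_plugins_with_patch()
after = fake_sampler.gather([1, 2, 3])
print(before, after)
```

Because the replacement is installed only inside the wrapped loader, nothing in this sketch touches the sampler module until the loader runs, which is the property the commit relies on.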
…vllm-project#1441)

## Problem

DeepSeek R1 (671B) crashes during warmup on G3 with FP8 quantization (GAUDISW-248418). Two error manifestations:

- `RuntimeError: Incompatible input shapes, broadcast not possible. Tensor1 Size: 7168 30720 Tensor2 Size: 256 1`
- `RuntimeError: Attempting to broadcast a dimension of length 256 at -1! Mismatching argument at index 1 had torch.Size([1, 256]); but expected shape should be broadcastable to [8192, 7168]`

Both crash at `hpu_grouped_topk_router.py:64` during MoE gate application.

## Root Cause

`_forward_impl` introduces graph breaks via `_sequence_parallel_context()` (calls `get_forward_context()`). Combined with double gate application (gate called in `patched_fused_moe_forward` AND again inside `_forward_impl`), Dynamo miscompiles the graph on HPU Synapse, causing shape mismatches.

Regression window: Build 254 (good) → Build 260 (broken), introduced by commit `98863a7` (MoE dynamo recompilation fix).

## Fix

For `dp_size==1` (the common single-node case), bypass `_forward_impl` entirely and call `_apply_quant_method` + `_maybe_combine` directly. This:

1. Eliminates graph breaks from `_sequence_parallel_context()` and `get_forward_context()`
2. Skips the no-op `_maybe_dispatch()` (only needed for dp_size > 1)
3. Prevents double gate application
4. Adds a RuntimeError guard for `pcp_size > 1` (unsupported in fast path)

The `dp_size > 1` fallback via `_forward_entry` is unchanged.

## Testing

Tested on G3 (8x HL-325L) with DeepSeek R1 671B FP8 TP=8:

- ✅ Prompt warmup: 54/54 items completed (crash site in original bug)
- ✅ Decode warmup: 25/25 items completed
- ✅ End-to-end inference: valid completions returned

Fixes: GAUDISW-248418

---------

Signed-off-by: Kamil Kaczor <kamil.kaczor@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
…m-project#1454)

Revert the decode bucket filter introduced in f24f3f9 that drops buckets with batched contexts larger than batched max_model_len, as it functionally duplicates [correct_for_max_model_len](https://github.com/vllm-project/vllm-gaudi/blob/e5b23b22af2a32fb572df8b3c75758ba3df1795f/vllm_gaudi/extension/bucketing/common.py#L442).

## Changes:

- Remove the `num_ctx_tokens_less_or_equal_batched_max_model_len` filter function from `generate_buckets()`
- Revert `filters_map` decode filters to pre-f24f3f9 state (`True: []`, `False: [batch_size_smaller_than_blocks]`)
- Remove corresponding tests (`test_exponential_decode_block_limit_uncapped`, `test_decode_buckets_satisfy_ctx_filter`)

Signed-off-by: Youlei Yang <youlei.yang@intel.com>

---------

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Co-authored-by: Kamil Kaczor <kamil.kaczor@intel.com>
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…project#1434)

## Problem

For hybrid models (e.g., Qwen3.5-35B-A3B), decode buckets warmed during startup are later reported as "not warmed-up" during inference. This causes every decode step to fall back to the `_check_config` warning path and potentially suboptimal performance.

## Root Cause

Two related issues:

### 1. `initialize_kv_cache` overwrites `block_size` with inflated KV-manager page size

Lines added in main (not present in v0.19.0) in `initialize_kv_cache`:

```python
self.block_size = self.vllm_config.cache_config.block_size
self.bucketing_manager.block_size = self.block_size
```

For hybrid models, `HybridAttentionMambaModelConfig` sets `cache_config.block_size` to a large aligned page size (e.g., 1152 for Qwen3.5 with Mamba layers). This overwrites `self.block_size` from 128 to 1152 **after** the HPU platform's `check_and_update_config` had already reset it to 128.

This causes `generate_buckets()` to produce decode buckets at 1152-token granularity (max ~10,260 blocks), while `_create_decode_input_data` computes `num_blocks` using `attn_block_size=128` (max ~92,160 blocks). The runtime values exceed warmed buckets, triggering "not warmed-up" warnings.

### 2. `_prepare_dummy_scenario` used wrong block_size for decode

The decode dummy sequence generation used `self.block_size` instead of `self.attn_block_size`, causing a mismatch with `_create_decode_input_data` which uses `self.attn_block_size`.

## Fix

1. **Remove the `block_size` overwrite in `initialize_kv_cache`** - These lines must not be present because `self.block_size` is already set correctly during `__init__` and must remain at 128 (the HPU kernel block size) for proper bucket generation. The KV-manager page size (1152) is a separate concept used for memory allocation, not for bucketing.
2. **Use `self.attn_block_size` in `_prepare_dummy_scenario`** for decode sequences, matching what `_create_decode_input_data` uses.

## Verification

- Tested on Gaudi3 (HL-325) with Qwen/Qwen3.5-35B-A3B, TP=2, EP=2
- 247 prompt + 117 decode buckets warmed successfully
- Decode bucket range: 1 to 21,858 blocks (correct, using 128-token granularity)
- Multiple inference requests completed with **zero** "not warmed-up" warnings
- Server log (537 lines) contains no `_check_config` or warmup mismatch warnings

## Why v0.19.0 worked

The `initialize_kv_cache` method in v0.19.0 did **not** have the `self.block_size = self.vllm_config.cache_config.block_size` lines, so `block_size` stayed at 128 throughout the lifecycle.

Signed-off-by: Agata Dobrzyniewicz <agata.dobrzyniewicz@intel.com>
Qwen3Next uses a hybrid GDN+attention architecture that requires separate KV cache groups for GDN vs standard attention layers. Add it to the mamba_like_arch list so maybe_set_mamba_kv_cache_groups_ids() sets up the cache groups correctly. Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com>
…ltiModelEngineClient, Qwen3.5 compilation, and EPLB refactoring (vllm-project#1436)

Fix upstream regressions affecting hourly CI:

1. **MultiModelEngineClient**: Added missing `notify_kv_transfer_request_rejected` abstract method (upstream PR vllm-project/vllm#41269)
2. **Qwen3.5 test harness**: Updated `test_common.py` to read `enforce_eager` from model card config (with env var override), enabling per-model compilation control
3. **EPLB refactoring**: Removed `EMPTY_EPLB_STATE` import and `enable_eplb` parameter from `patched_create_fused_moe_router` after upstream MoE refactor (upstream PR vllm-project/vllm#41055)

Note: The `enforce_eager: true` workaround for Qwen3.5 compilation has been removed. The root cause (mamba_type str-vs-Enum comparison in hybrid cache allocation) is properly fixed by vllm-project#1449, which should merge first.

Verified on HPU: unit tests pass on Gaudi 3 (MoE, FP8, compressed tensors).

---------

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Pawel Olejniczak <pawelx.olejniczak@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
1) Block sizes (added in vllm-project#1453): 16 is supported for testing/smaller models; 128 is the standard HPU kernel block size; 528 is required for Granite 4.0-H (granitemoehybrid) without prefix caching (16-token FA alignment), 768 with prefix caching (chunk-aligned).

2) _patch_hf3fs_mock_client_for_cpu_only: the upstream mock client unconditionally calls ``torch.cuda.current_stream().wait_event(event)`` in ``batch_write``. In environments where PyTorch is not compiled with CUDA, that path throws and the method returns ``-1`` for writes, causing connector unit tests to fail. This patch keeps the same behavior but skips CUDA synchronization when CUDA is unavailable.

---------

Signed-off-by: Harish Subramony <harish.subramony@intel.com>
Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
This pull request updates the `.github/workflows/pre-merge.yaml` workflow configuration to add a `timeout-minutes: 720` (12 hours) limit to all jobs. This change ensures that no individual job in the pre-merge workflow can run indefinitely, which helps prevent stuck or runaway jobs in CI and improves overall pipeline reliability. **CI/CD Workflow Improvements:** * Added `timeout-minutes: 720` to all jobs in `.github/workflows/pre-merge.yaml` to enforce a 12-hour maximum runtime per job. This applies to jobs such as `retrieve_head_sha`, `gatekeeper`, `discover_runner`, `discover_tests`, `discover_calibration_tests`, test execution jobs, and finalization/cleanup jobs. No other logic or behavior changes were made—this is a configuration-only update to improve CI robustness. Signed-off-by: Bartosz Myrcha <bartosz.myrcha@intel.com>
…ments (vllm-project#1445) Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…floading_connector test flush assertion for load transfers (vllm-project#1468)

Upstream vLLM PR vllm-project/vllm#42611 ("Flush all pending jobs on last step") changed `get_flushed_transfers()` to return both store and load flushes. The vllm-gaudi copy of the offloading_connector unit tests assumed only store flushes, causing:

1. `AssertionError` in `utils.py` `_parse_transfers` (`isinstance(src_spec, GPULoadStoreSpec)` assert fails on load flushes)
2. `flushed_gpu_block_indexes` mismatch in `test_scheduler` tests

**Fix**: Mirror the upstream change: replace the assert with an `if/else` handling both store and load flush types, and add `expected_flushed_gpu_block_indexes` to affected tests.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Signed-off-by: Bartosz Myrcha <bartosz.myrcha@intel.com>
…llm-project#1473)

## Summary

Adds `environment: approved-workflow` to every job that consumes `secrets.HF_TOKEN` across the three CI workflows. Together with the existing approval gate in `pre-merge-trigger.yaml` (`environment: pre-merge-approval`, added in vllm-project#1471), this completes the two-layer protection model:

```
PR opened
  -> pre-merge-trigger `gate` job: pauses for required reviewer (approval vllm-project#1)
  -> on approval, pre-merge.yaml is dispatched
  -> downstream secret-using jobs resolve HF_TOKEN from the `approved-workflow` environment (no second per-job approval)
```

## Why

With `HF_TOKEN` previously at repo-secret scope, any matrix entry of any e2e/test job had direct access the moment CI started. The recent malicious fork PR exfiltrated it via an auto-discovered `run_*` function. After this change, the token is only released from a GitHub Environment that a maintainer-controlled deployment-branch rule restricts to `main` / `releases/**`, and only after the upstream gate has approved the dispatch.

We deliberately add the environment only on jobs that actually use the secret (15 jobs). Helper jobs (`gatekeeper`, `discover_*`, `retrieve_*`, `pre-commit`, `post-comment`, `cleanup_*`, `build_nixl_dockerfile`, `check_dockerfile_changes`, `prepare-release-branch`, `summarize_and_notify`, `setup_and_build`, `store_last_stable_vllm_commit`) do not touch HF_TOKEN and are not modified, to avoid pointless extra gate evaluations.

## Affected jobs (15)

- `pre-merge.yaml`: `hpu_unit_tests`, `hpu_pd_tests`, `hpu_perf_tests`, `hpu_dp_tests`, `e2e`, `calibration_tests`
- `hourly-ci.yaml`: `run_unit_tests`, `e2e`, `run_data_parallel_test`, `run_pd_disaggregate_test`
- `create-release-branch.yaml`: `run_unit_tests`, `e2e`, `run_data_parallel_test`, `run_pd_disaggregate_test`, `run_hpu_perf_tests`

## Diff

+15 lines, 0 deletions. Each touched job gets exactly one new line: `environment: approved-workflow`, inserted immediately after `runs-on:`.

## Required repo configuration (before this PR can be merged safely)

1. Settings → Environments → create environment **`approved-workflow`**.
2. Add **`HF_TOKEN`** as an environment secret (the rotated value).
3. **No required reviewers** on this environment (the upstream `pre-merge-approval` gate already enforces approval; adding reviewers here would prompt once per job).
4. **Deployment branches and tags**: Selected branches → `main`, `releases/**`. Prevents a fork PR from claiming the environment from a non-trusted ref.
5. **Delete** `HF_TOKEN` from repository-level secrets so the environment value is the only source.

## Testing

Validated end-to-end against `bmyrcha/vllm-gaudi` first using a benign fork PR. With the two environments configured as above, the gate paused as expected, jobs received the secret after approval without a second prompt, and a deliberately mis-authored downstream PR could not reach the secret.

Close-cross-ref: builds on vllm-project#1471.

Signed-off-by: Agata Dobrzyniewicz <adobrzyniewicz@habana.ai>
…namicNTKScalingRotaryEmbedding and HPUCompressedTensorsConfig (vllm-project#1479)

## Root cause

Upstream vLLM at SHA 0a54df28 introduced two API changes that broke vllm-gaudi:

1. PR vllm-project/vllm#41277 added a required `max_trained_positions` parameter to `DynamicNTKScalingRotaryEmbedding.__init__()`, causing the unit test to fail with TypeError.
2. PR vllm-project/vllm#43144 removed `sparsity_scheme_map` and `sparsity_ignore_list` from `CompressedTensorsConfig.__init__()`, causing `HPUCompressedTensorsConfig` instantiation to fail during e2e tests.

## Upstream PR

- vllm-project/vllm#41277: Added max_trained_positions to DynamicNTKScalingRotaryEmbedding
- vllm-project/vllm#43144: Removed sparsity parameters from CompressedTensorsConfig

## Fix

1. Add `max_trained_positions` parameter to the rotary embedding unit test.
2. Remove stale `sparsity_scheme_map` and `sparsity_ignore_list` from HPUCompressedTensorsConfig init signature and super() call, plus the unused SparsityCompressionConfig import.

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…fast path (vllm-project#1469) PR vllm-project#1441 added an _hpu_gate_ref fallback in the dp_size==1 fast path that unconditionally re-invoked a runner-owned gate, overwriting router_logits supplied by the caller. For SharedFusedMoE models (Qwen3 MoE, ernie45, ...) the block's mlp.gate(...) has already produced router_logits and _sync_shared_moe_gates sets runner.gate=None post-INC; the cached _hpu_gate_ref still points at the pre-INC module and produced shape/dtype mismatches under fp8. Only invoke the runner-owned gate when the caller did not provide router_logits, preserving the DeepSeek R1 internal-router fast path from vllm-project#1441. --------- Signed-off-by: Iryna Boiko <iboiko@habana.ai>
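The rule described above (only fall back to the runner-owned gate when the caller did not pass `router_logits`) can be sketched as follows. `resolve_router_logits` and its arguments are hypothetical names for illustration, and raising when neither source is available is an assumption of this sketch, not behavior confirmed by the PR.

```python
def resolve_router_logits(hidden_states, router_logits=None, gate=None):
    """Prefer caller-supplied router_logits; use the gate only as a fallback.

    Hypothetical helper. Raising on a missing gate is this sketch's choice.
    """
    if router_logits is not None:
        # The block's own gate already ran; do not overwrite its result.
        return router_logits
    if gate is None:
        raise ValueError("no router_logits supplied and no gate available")
    return gate(hidden_states)


# Caller already computed logits: a stale gate must not be invoked.
stale_gate_calls = []


def stale_gate(x):
    stale_gate_calls.append(x)
    return "stale"


kept = resolve_router_logits("h", router_logits="from_caller", gate=stale_gate)

# Internal-router case: no logits supplied, so the gate is used.
computed = resolve_router_logits("h", gate=lambda x: "from_gate")

print(kept, computed, len(stale_gate_calls))
```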
Signed-off-by: Iryna Boiko <iboiko@habana.ai>
…project#1465) Move prompt_token_ids to self.device in selective sampling metadata creation for both skip_copy paths. This keeps prompt and output penalty masks on the same device and prevents runtime device mismatch errors during repetition/presence/frequency penalty application. Signed-off-by: Yeonsil Yoon <yeon.sil.yoon@intel.com> Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
…sizes (vllm-project#1485)

## Problem

For hybrid models like Qwen3.5 (GDN + attention), `_align_hybrid_block_size()` sets `block_size=640` (unified KV-cache page for mamba/attention alignment), while HPU kernels use `attn_block_size=128`. The decode bucket generation (introduced by f24f3f9) uses the formula:

```
max_decode_blocks = ceil(max_model_len / block_size) * max_num_seqs
                  = ceil(262144 / 640) * 45 = 18450
```

But the runtime decode path (`_create_decode_input_data`) computes `num_blocks` using `attn_block_size=128`, producing values up to `ceil(262144/128) * 45 = 92160`. This causes hundreds of **"Configuration was not warmed-up"** warnings and costly HPU graph recompilation on every decode step.

## Root Cause

Two different block_size semantics coexist:

- `self.block_size = 640`: KV-cache management page size (unified for hybrid mamba/attention)
- `self.attn_block_size = 128`: HPU attention kernel page size (what hardware actually uses)

Decode bucket generation used `block_size` but should use `attn_block_size` to match the runtime.

## Fix

Temporarily scope `bucketing_manager.block_size` to `attn_block_size` during decode bucket generation in `warmup_model()`, then restore the original value so prompt fallback paths remain unaffected.

## Testing

- Verified with Qwen3.5-35B-A3B on 4x Gaudi3 (TP=4, max_model_len=262144, max_num_seqs=45)
- Decode buckets now correctly cover runtime num_blocks range
- No more "Configuration was not warmed-up" warnings during serving

Signed-off-by: Youlei Yang <youlei.yang@intel.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
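The arithmetic and the "temporarily scope, then restore" fix can be checked with a short sketch. `FakeBucketingManager` and `scoped_block_size` are hypothetical; the PR text does not show how `warmup_model()` actually performs the swap, so the context manager here is just one way to express it.

```python
import math
from contextlib import contextmanager


def max_decode_blocks(max_model_len, block_size, max_num_seqs):
    """Formula quoted in the commit message."""
    return math.ceil(max_model_len / block_size) * max_num_seqs


class FakeBucketingManager:
    """Hypothetical stand-in holding only the attribute under discussion."""

    def __init__(self, block_size):
        self.block_size = block_size


@contextmanager
def scoped_block_size(manager, block_size):
    """Temporarily override manager.block_size, restoring it on exit."""
    original = manager.block_size
    manager.block_size = block_size
    try:
        yield manager
    finally:
        manager.block_size = original


warmed = max_decode_blocks(262144, 640, 45)   # bucket limit with page size 640
runtime = max_decode_blocks(262144, 128, 45)  # runtime limit with kernel size 128

manager = FakeBucketingManager(block_size=640)
with scoped_block_size(manager, 128):
    inside = max_decode_blocks(262144, manager.block_size, 45)

print(warmed, runtime, inside, manager.block_size)
```

Restoring in a `finally` clause keeps the original value even if bucket generation raises, which matters because the prompt fallback paths still depend on it.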
…n_linear_attn import path after upstream mamba refactor (vllm-project#1496)

## Root cause

Upstream vLLM PR vllm-project/vllm#41126 (commit 7e1b45a092) refactored `vllm.model_executor.layers.mamba.gdn_linear_attn.GatedDeltaNetAttention` into a `gdn/` subpackage: `vllm.model_executor.layers.mamba.gdn.qwen_gdn_linear_attn.QwenGatedDeltaNetAttention`. This broke `vllm_gaudi/models/qwen3_5.py` which imported from the old path.

## Fix

Updated 6 lines in `vllm_gaudi/models/qwen3_5.py`:

- Changed import path from `gdn_linear_attn` to `gdn.qwen_gdn_linear_attn`
- Updated class reference from `GatedDeltaNetAttention` to `QwenGatedDeltaNetAttention`

## Upstream compatibility

Pinned to vLLM SHA: `b06813e87207e15b133e903d641e03f237d85b17`

Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
vllm-project#1482) …d models (vllm-project#1413)" This reverts commit 808dbfa. Signed-off-by: Radoslaw Smyrek <radoslawx.smyrek@intel.com> Co-authored-by: Iryna Boiko <iryna.boiko@intel.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
Signed-off-by: Soila Kavulya <soila.p.kavulya@intel.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR improves HPU runtime robustness around KV-offload/preemption, hybrid-model decode bucketing/warmup behavior, and a few compatibility patches for upstream API/behavior changes.
Changes:
- Add scheduling/bookkeeping fixes for KV-offload preemption and guard metrics against negative prompt-token counter increments.
- Fix hybrid-model decode bucketing & warmup logic (attn_block_size vs KV page size), and add regression tests for bucket coverage.
- Add/adjust several monkey-patches (MoE runner gate ownership, hf3fs mock client CPU-only behavior, sampler op workaround) and minor model/operator updates.
Reviewed changes
Copilot reviewed 30 out of 31 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| vllm_gaudi/v1/worker/hpu_worker.py | Centralizes GDN mamba-type detection via a shared constant. |
| vllm_gaudi/v1/worker/hpu_model_runner.py | Adds decode reordering for multi-token catch-up, hybrid warmup fixes, and MoE gate/dedup adjustments. |
| vllm_gaudi/v1/worker/hpu_input_batch.py | Ensures selective sampling prompt-token IDs are on-device for penalty computation. |
| vllm_gaudi/v1/core/sched/hpu_async_scheduler.py | Adds HPU overrides for cached-token staleness and invalid block bookkeeping. |
| vllm_gaudi/v1/attention/backends/hpu_attn.py | Expands supported kernel block sizes (adds 16). |
| vllm_gaudi/utils.py | Monkey-patches PromptTokenStats to clamp negative counter increments. |
| vllm_gaudi/patches.py | Adds CPU-only-safe hf3fs mock client patch and defers sampler op patching to plugin load time. |
| vllm_gaudi/ops/hpu_rotary_embedding.py | Supports interleaved mRoPE path via upstream helper. |
| vllm_gaudi/ops/hpu_fused_moe.py | Changes dp_size==1 fast path and updates MoE router factory patching. |
| vllm_gaudi/ops/hpu_compressed_tensors.py | Removes sparsity-related args/types from compressed tensors path. |
| vllm_gaudi/models/qwen3_5.py | Switches Qwen GDN attention import path and patches upstream symbols accordingly. |
| vllm_gaudi/models/minimax_m2.py | Removes TP all-reduce usage from a MiniMax MoE forward path. |
| vllm_gaudi/extension/bucketing/common.py | Simplifies decode-bucket filters, affecting bucket validity constraints. |
| vllm_gaudi/entrypoints/openai/multi_model_api_server.py | Adds KV-transfer rejection notification passthrough + formatting tweaks. |
| tests/unit_tests/worker/test_ensure_multi_token_decodes_last.py | Adds unit tests for decode-region reordering helper. |
| tests/unit_tests/test_decode_bucket_hybrid.py | Adds regression tests for hybrid decode bucket generation & warmup scenarios. |
| tests/unit_tests/test_bucketing.py | Updates/trim decode cfg test descriptions and removes some prior bucket filter tests. |
| tests/unit_tests/ops/test_hpu_rotary_embedding.py | Adds max_trained_positions for rotary embedding test config. |
| tests/unit_tests/lora/test_llm_with_multi_loras.py | Removes HF token dependency from LoRA test setup. |
| tests/unit_tests/lora/test_llama_tp.py | Removes HF token dependency from LoRA TP test setup. |
| tests/unit_tests/kv_offload/offloading_connector/utils.py | Handles both store-flush and load-flush cases when parsing transfers. |
| tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py | Updates expectations for async scheduling flush timing and adds invariant checks. |
| tests/models/language/generation/test_common.py | Refactors config/env parsing and improves formatting for readability. |
| tests/full_tests/model_cards/qwen3.5-35b-a3b.yaml | Updates expected metric value. |
| tests/full_tests/ci_e2e_discoverable_tests.sh | Sets prompt BS bucket max for the Qwen3.5 GSM8K e2e test. |
| requirements.txt | Removes ray and transformers requirements from this file. |
| README.md | Pins torchaudio to the local torch version for CPU wheel install guidance. |
| .github/workflows/pre-merge.yaml | Adds long timeouts and requires an environment for several jobs. |
| .github/workflows/pre-merge-trigger.yaml | Adds an explicit approval gate environment before triggering pre-merge. |
| .github/workflows/hourly-ci.yaml | Requires an environment for hourly CI execution jobs. |
| .github/workflows/create-release-branch.yaml | Requires an environment for release-branch CI execution jobs. |

    # eplb parameters
    enable_eplb: bool = False,
    eplb_state: EplbLayerState = EMPTY_EPLB_STATE,
    eplb_state: EplbLayerState | None = None,

Comment on lines +458 to +459

    True: [],
    False: [batch_size_smaller_than_blocks],

Comment on lines +200 to +206

    _original_load_general = _plugins_mod.load_general_plugins


    def _load_general_with_hpu_patches():
        _original_load_general()
        _patch_batched_count_greater_than()


    _plugins_mod.load_general_plugins = _load_general_with_hpu_patches

Comment on lines +299 to +306

    _stats_get_by_source_orig = _stats_module.PromptTokenStats.get_by_source


    def _hpu_get_by_source(self, source: str) -> int:
        return max(0, _stats_get_by_source_orig(self, source))


    _stats_module.PromptTokenStats.get_by_source = _hpu_get_by_source


        return [input[i] if i is not None else v for i in indices]


    def ensure_multi_token_decodes_last(b: InputBatch, scheduled_tokens: Mapping[str, int]) -> None:

Comment on lines +452 to +457

    num_reqs = b.num_reqs
    decode_end = num_reqs
    for i in range(num_reqs):
        if b.num_computed_tokens_cpu[i] < b.num_prompt_tokens[i]:
            decode_end = i
            break

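The loop in the snippet above locates the end of the decode region. A self-contained sketch over plain lists follows; `find_decode_end` is a hypothetical name, and the assumption that decode requests are ordered before any request still in prefill is inferred from the snippet, not stated in the PR.

```python
def find_decode_end(num_computed_tokens, num_prompt_tokens):
    """Return the index of the first request still processing its prompt.

    A request counts as prefill while it has computed fewer tokens than its
    prompt length. If no such request exists, every request is a decode and
    the region ends at len(num_computed_tokens).
    """
    decode_end = len(num_computed_tokens)
    pairs = zip(num_computed_tokens, num_prompt_tokens)
    for i, (computed, prompt) in enumerate(pairs):
        if computed < prompt:
            decode_end = i
            break
    return decode_end


# Two decodes followed by one request that is still in prefill.
mixed = find_decode_end([10, 8, 3], [10, 8, 6])

# Every request has finished its prompt, so the region spans the batch.
all_decode = find_decode_end([10, 8], [10, 8])

print(mixed, all_decode)
```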
Comment on lines +28 to +35

    output = super().schedule()
    for request in self.running:
        # vLLM Request no longer exposes num_cached_tokens on newer
        # branches. Keep the old fix only when the field exists.
        if (hasattr(request, "num_cached_tokens")
                and request.num_cached_tokens < request.num_external_computed_tokens):
            request.num_cached_tokens = request.num_computed_tokens
    return output


    import pytest
    import torch
    import habana_frameworks.torch  # noqa: F401

    max_num_reqs=max(len(reqs), 1),
    max_model_len=1024,
    max_num_batched_tokens=1024,
    device=torch.device("hpu"),

Fix MiniMax-M2 accuracy for tensor-parallel size > 1. After #1377 the reduction is handled inside FusedMoE, and `reduce_results=False` was dropped in #1444, so this PR removes the TP all-reduce from the MiniMax MoE forward path.
Output without this PR:
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "/mnt/weka/data/llm-d-models-pv/MiniMaxAI-MiniMax-M2.7",
"messages": [
{"role": "user", "content": [{"type": "text", "text": "Write a quick sort algorithm in python"}]}
], "max_tokens": 200
}'
{"id":"chatcmpl-8eb68aec66d7f527","object":"chat.completion","created":1778891236,"prompt_routed_experts":null,"model":"/mnt/weka/data/llm-d-models-pv/MiniMaxAI-MiniMax-M2.7","choices":[{"index":0,"message":{"role":"assistant","content":"I hadnet me find a programme2/apto/c- 241?.o. no (the operation.yb-b\n> ыйо, not change this;~~ I think_colour =="light pink";}) in...\n**The These must be not} was\n and \n\n):\n\nI('key=ельблиматš micrac / 1)2rasm_0.2 → add__2dict_eagle/tabString/im不过是 \list-ofchf_one \nCompute_with_prt_init: (New Tool Pro)\n-Main%-day_ ** [B1] : {nb_z0'];\n--own-traor: with: =: use 0.096-10_l_`this col0: 26;```\n</t_lN-蔓音频四文アنتストu+002:htt 도 원책임.(↑): The thought_dirty_s","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.1rc1.dev276+g54f548e9e-tp4-ep-614b7488","usage":{"prompt_tokens":45,"total_tokens":245,"completion_tokens":200,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}
Output with this PR:
{"id":"chatcmpl-b79acb2e48acc5d0","object":"chat.completion","created":1778891747,"prompt_routed_experts":null,"model":"/mnt/weka/data/llm-d-models-pv/MiniMaxAI-MiniMax-M2.7","choices":[{"index":0,"message":{"role":"assistant","content":"We are going to write a quick sort algorithm in Python.\n We will define a function quicksort that takes a list as input.\n We will choose a pivot (commonly the last element, but we can also choose a random element or the middle).\n We will partition the list into two parts: elements less than the pivot and elements greater than the pivot.\n Then we recursively sort the two parts and combine them with the pivot in between.\n\n However, note that the problem asks for a quick sort algorithm, so we'll implement the standard in-place quick sort.\n\n Steps:\n 1. If the list has length 0 or 1, it is already sorted.\n 2. Otherwise, select a pivot (we'll use the last element for simplicity).\n 3. Partition the list into two sublists: left (elements less than pivot) and right (elements greater than or equal to pivot).\n 4. Return the sorted left part, then the pivot, then the sorted right part.\n\n Alternatively, we","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null,"routed_experts":null}],"service_tier":null,"system_fingerprint":"vllm-0.20.1rc1.dev276+g54f548e9e-tp4-ep-614b7488","usage":{"prompt_tokens":45,"total_tokens":245,"completion_tokens":200,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}