[FIX_FOR_VLLM_CUSTOM=3975eb6de6ea914b9d7b27fd517e0c971ddeb6fc] Fix upstream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5 #1377
Conversation
Pull request overview
Updates vllm-gaudi’s KV transfer/offload integration to match upstream vLLM API changes (TpKVTopology → TransferTopology and the OffloadingManager.lookup signature change), resolving import-time failures and fixing broken offloading unit tests.
Changes:
- Update NIXL connector imports/monkey-patches and topology construction to use `TransferTopology`.
- Adjust heterogeneous HPU NIXL connector logic to use new `TransferTopology` fields/methods.
- Update offloading scheduler tests to accommodate upstream `lookup` behavior changes.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `vllm_gaudi/distributed/kv_transfer/kv_connector/v1/hpu_nixl_connector.py` | Renames the topology import and updates the HPU-specific `__post_init__` monkey-patch to target `TransferTopology`. |
| `vllm_gaudi/distributed/kv_transfer/kv_connector/v1/hetero_hpu_nixl_connector.py` | Replaces `TpKVTopology` with `TransferTopology` and updates downstream usages (layout flags, block size ratio). |
| `tests/unit_tests/kv_offload/offloading_connector/utils.py` | Exposes `connector_scheduler` on the test runner for updated scheduler test control. |
| `tests/unit_tests/kv_offload/offloading_connector/test_scheduler.py` | Updates assertions and stubbing strategy to align with upstream offloading `lookup` changes. |
  if self.is_mla and self._cross_layers_blocks:
-     logger.warning("[HPU] TpKVTopology: overriding false-positive _cross_layers_blocks=True "
+     logger.warning("[HPU] TransferTopology: overriding false-positive _cross_layers_blocks=True "
                     "for MLA model. HPU get_kv_cache_shape() returns 3-D tensors, causing "
                     "the dim-count heuristic to misfire. Forcing _cross_layers_blocks=False.")
      self._cross_layers_blocks = False
This monkey-patch relies on the private attribute _cross_layers_blocks of TransferTopology. Since TransferTopology is coming from upstream vLLM and was recently renamed/refactored, this is brittle and can fail at import time if the internal attribute changes. Prefer using a public API (e.g., cross_layers_blocks if it’s settable) or guard the override with hasattr/getattr so the patch degrades safely when upstream internals change.
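One way to realize that suggestion, as a minimal sketch (the helper name and the `topo` argument are illustrative, not the actual patch code):

```python
import logging

logger = logging.getLogger(__name__)

def override_cross_layers_blocks(topo) -> None:
    """Hypothetical helper: only touch the private flag if it still exists upstream."""
    if not hasattr(topo, "_cross_layers_blocks"):
        # Upstream internals changed; degrade safely instead of raising AttributeError.
        logger.warning("[HPU] TransferTopology has no _cross_layers_blocks attribute; "
                       "skipping MLA override.")
        return
    if getattr(topo, "is_mla", False) and topo._cross_layers_blocks:
        logger.warning("[HPU] TransferTopology: overriding false-positive "
                       "_cross_layers_blocks=True for MLA model.")
        topo._cross_layers_blocks = False
```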
  runner.scheduler.reset_prefix_cache()
  runner.new_request(token_ids=[0] * offloaded_block_size)
  runner.manager.prepare_store.side_effect = (lambda block_hashes, req_context: generate_store_output([]))
- runner.manager.lookup.return_value = 1
+ runner.connector_scheduler._maximal_prefix_lookup = lambda key, req_context: 1
  runner.run(decoded_tokens=[EOS_TOKEN_ID], expected_loaded_gpu_block_indexes=(0, 1, 2))
These tests now stub the scheduler by overwriting _maximal_prefix_lookup directly. Because this is a private implementation detail, it makes the tests brittle to upstream refactors and bypasses coverage of the new OffloadingManager.lookup(key: OffloadKey) -> bool behavior. Consider mocking manager.lookup with a side_effect that returns True/False based on the provided OffloadKey (or providing a small helper/fake manager) so the test exercises the real prefix-lookup logic.
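A minimal sketch of that approach, assuming `runner.manager` is a `MagicMock` (as in these tests) and that `lookup()` now answers per `OffloadKey`; `stored_keys` and the fake's signature are illustrative placeholders:

```python
# Populate with the OffloadKeys the test considers already offloaded; the real key
# type/fields are upstream details not shown here.
stored_keys: set = set()

def fake_lookup(key, *args, **kwargs):
    # Answer per key so the scheduler's real maximal-prefix search runs:
    # it should stop extending the matched prefix at the first missing key.
    return key in stored_keys

runner.manager.lookup.side_effect = fake_lookup
```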
  runner.run(decoded_tokens=[EOS_TOKEN_ID])
- runner.manager.lookup.assert_called()
- assert len(list(runner.manager.lookup.call_args.args[0])) == 1
+ runner.manager.lookup.assert_called_once()
The assertion was weakened from validating the lookup input (previously checking the iterable length) to only checking lookup() was called once. To keep coverage of the signature change, consider asserting that lookup was called with an OffloadKey (and, if applicable, that req_context is passed via args/kwargs) so the test will fail if the old iterable-based call is accidentally reintroduced.
Suggested change:
runner.manager.lookup.assert_called_once()
lookup_args, lookup_kwargs = runner.manager.lookup.call_args
assert lookup_args
assert isinstance(lookup_args[0], OffloadKey)
if "req_context" in lookup_kwargs:
    assert lookup_kwargs["req_context"] is not None
else:
    assert len(lookup_args) >= 2
    assert lookup_args[1] is not None
self.transfer_topo = TransferTopology(
    tp_rank=self.tp_rank,
    tp_size=self.world_size,
    block_size=self.block_size,
    engine_id=self.engine_id,
TransferTopology is initialized with block_size=self.block_size here, but self.block_size can be mutated later in register_kv_caches() when the physical kernel block size differs (and self._block_size[self.engine_id] is updated). If TransferTopology.block_size_ratio() uses its stored local block_size, it can become stale and produce incorrect ratios/mappings during transfers. Consider constructing TransferTopology after the final effective block size is known, or updating/recreating self.transfer_topo whenever self.block_size is adjusted.
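A sketch of one possible remedy, assuming `TransferTopology` is imported as in the connector module; `_refresh_transfer_topology` is a hypothetical helper, not code from this PR:

```python
def _refresh_transfer_topology(self) -> None:
    # Rebuild the topology from the connector's current state so block_size_ratio()
    # and block mappings reflect the final, effective block size.
    self.transfer_topo = TransferTopology(
        tp_rank=self.tp_rank,
        tp_size=self.world_size,
        block_size=self.block_size,
        engine_id=self.engine_id,
    )

# ...then at the end of register_kv_caches(), after self.block_size is finalized:
#     self._refresh_transfer_topology()
```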
…KVTopology rename to TransferTopology and update offloading tests Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…and forward_dispatch rename (upstream PR #35949) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…(forward_entry, maybe_reduce_output)
Upstream PR #35949 (commit 726efe177b) renamed:
- forward_entry -> _forward_entry (now stored as a private attribute)
- _maybe_reduce_output -> _maybe_reduce_final_output
Update patched_fused_moe_forward in vllm_gaudi/ops/hpu_fused_moe.py to use the new method names. Without this, DP MoE paths and several e2e tests fail with AttributeError on DefaultMoERunner. Follow-up to the previous fix in this branch, which addressed the reduce_results removal and the forward_dispatch rename from the same upstream PR.
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
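For illustration only (the commit above simply adopts the new names), a version-tolerant lookup could bridge both sides of the rename; the helper below is hypothetical, not part of this PR:

```python
def resolve_runner_method(runner, new_name: str, old_name: str):
    # Prefer the post-PR-#35949 private name, fall back to the older public one.
    fn = getattr(runner, new_name, None) or getattr(runner, old_name, None)
    if fn is None:
        raise AttributeError(f"MoE runner exposes neither {new_name} nor {old_name}")
    return fn

# Usage inside a patched forward (names as described in the commit message):
#   forward_entry = resolve_runner_method(runner, "_forward_entry", "forward_entry")
#   reduce_final = resolve_runner_method(runner, "_maybe_reduce_final_output",
#                                        "_maybe_reduce_output")
```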
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…ve double TP reduction in MoE models
- Update hpu_fused_moe.py imports: DefaultMoERunner/moe_runner_base removed in upstream PR #40560 (combined into MoERunner in moe_runner.py)
- Remove incorrect shared_expert tuple indexing in qwen3_next.py (SharedFusedMoE returns a combined tensor, not a tuple)
- Remove double tensor_model_parallel_all_reduce in qwen3_moe.py and qwen3_next.py (the MoE runner handles TP reduction internally after PR #35949)
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…Base from moe_runner_base At vLLM b47840019e, MoERunnerBase and get_layer_from_name live in moe_runner_base.py (not moe_runner.py). MoERunner in moe_runner.py is just an ABC without __init__ or forward. Fix imports and monkey-patch target to use MoERunnerBase directly. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Align to the vllm change (PR #30566) Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
This reverts commit 9148d75. Since the vllm commit 3975eb6de6 is reverted, this is no longer needed. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
Remove hardcoded add_bos_token=True from lm-eval model_args. In transformers v4, GPT2TokenizerFast silently ignored add_bos_token=True. In transformers v5, GPT2Tokenizer respects it and prepends the BOS token (id=0, <|end_of_text|> for granite) to every prompt. This causes the model to see an end-of-text signal at position 0, producing degenerate outputs (repetition loops) and 0.0 accuracy on GSM8K CoT benchmarks. The parameter was effectively a no-op with v4 and is not needed — granite tokenizer has add_bos_token=False by default. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
…l._api Upstream vLLM commit 3975eb6de6 gates NIXL imports on is_cuda(), falling back to rixl._api for non-CUDA platforms. HPU needs nixl._api (same as CUDA). This adds a monkey-patch in register_utils() and fixes the hetero_hpu_nixl_connector to import from vllm.distributed.nixl_utils instead of hardcoding nixl._api. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
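A minimal sketch of the idea described in that commit, assuming `vllm.distributed.nixl_utils` exposes a module-level `NixlWrapper` as the PR description states; the exact attributes re-pointed by the real patch may differ:

```python
def patch_nixl_utils_for_hpu() -> None:
    """Re-point the NIXL wrapper at nixl._api on HPU (the CUDA path), since the
    upstream is_cuda() gate would otherwise fall back to the ROCm-only rixl._api."""
    import vllm.distributed.nixl_utils as nixl_utils
    try:
        from nixl._api import nixl_agent
    except ImportError:
        # NIXL not installed: leave upstream behavior untouched.
        return
    nixl_utils.NixlWrapper = nixl_agent
```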
In transformers v5.6.x, LlamaTokenizerFast was unified into LlamaTokenizer which does not apply the ByteLevel BPE decoder declared in tokenizer.json. This causes DeepSeek-V2-Lite-Chat decoding to strip all spaces (Ġ chars not converted back), producing garbled output and 0.0 accuracy on GSM8K. transformers v5.7.0 resolves this by returning TokenizersBackend from AutoTokenizer.from_pretrained(), which correctly applies the ByteLevel decoder. Exclude 5.6.* in requirements.txt to prevent installation of the broken versions. Verified on Gaudi2 pod: gsm8k accuracy 0.65 (expected 0.66, within tolerance) with transformers 5.7.0 and no code changes. Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
✅ CI Passed: All checks passed successfully against the following vllm commit:
Summary

Compatibility fixes for the vLLM bump to 3975eb6de6. Addresses breakages from multiple upstream PRs affecting the NIXL connectors, the MoE runner refactor, offloading tests, Qwen3 MoE models, and the transformers v5 upgrade.

Root Cause

- NIXL import gate: Upstream PR "nixl refactor [2/N]: unify TpKVTopology + HeteroTPTransferConfig into TransferTopology" (vllm#39529, commit cc3993b05d) moved NIXL imports to `vllm/distributed/nixl_utils.py` and changed the platform gate from `if not is_rocm()` to `if is_cuda()`. HPU is neither CUDA nor ROCm, so it falls into the `else` branch, tries `rixl._api` (ROCm-only), fails, sets `NixlWrapper = None`, and raises `RuntimeError("NIXL is not available")`.
- TpKVTopology rename: The same upstream PR #39529 unified `TpKVTopology` + `HeteroTPTransferConfig` into `TransferTopology`, breaking vllm-gaudi NIXL connector imports.
- Offloading tests: Upstream PR "[kv_offload+HMA][4/N]: Support sliding window lookup" (vllm#36645) changed the `OffloadingManager.lookup()` API.
- MoE runner refactor: Upstream PR "[MoE Refactor] Move the shared/fused expert output sum into MoERunnerBase" (vllm#35949, commit 726efe177b) moved the reduce logic into `MoERunnerBase`, removing `reduce_results` and renaming `forward_dispatch` → `_forward_dispatch`, `forward_entry` → `_forward_entry`, `_maybe_reduce_output` → `_maybe_reduce_final_output`. A follow-up PR moved `MoERunnerBase` and `get_layer_from_name` to `moe_runner_base.py`.
- Qwen3 MoE: `SharedFusedMoE` returns a combined tensor (not a tuple), and the MoE runner now handles TP reduction internally, causing a double reduce in `qwen3_moe.py`/`qwen3_next.py`.
- Transformers v5, granite tokenizer: Upstream PR "Update to transformers v5" (vllm#30566) updated transformers to allow v5. `GPT2Tokenizer` in v5 now respects `add_bos_token=True` (silently ignored in v4), causing degenerate outputs and 0.0 GSM8K accuracy on granite models.
- Transformers v5.6.x, DeepSeek-V2-Lite tokenizer: In transformers v5.6.x, `LlamaTokenizerFast` was unified into `LlamaTokenizer`, which does not apply the ByteLevel BPE decoder declared in `tokenizer.json`. DeepSeek-V2-Lite-Chat decoding strips all spaces (Ġ chars are not converted back), producing garbled output and 0.0 accuracy on GSM8K. Fixed natively in transformers v5.7.0.

Fix
- Add `patch_nixl_utils_for_hpu()` in `register_utils()` to monkey-patch `vllm.distributed.nixl_utils` so it imports from `nixl._api` instead of `rixl._api` on HPU. Update `hetero_hpu_nixl_connector.py` to import from `vllm.distributed.nixl_utils` instead of hardcoding `nixl._api`.
- Replace `runner.manager.lookup.return_value` stubbing with `connector_scheduler._maximal_prefix_lookup` in the offloading scheduler tests.
- Update MoE runner imports (`MoERunnerBase` from `moe_runner_base`) and method names (`_forward_dispatch`, `_forward_entry`, `_maybe_reduce_final_output`); remove dead `reduce_results`/`reduce_output()`.
- Remove `add_bos_token=True` from lm-eval model_args to fix the GSM8K accuracy regression.
- Exclude `transformers 5.6.*` in `requirements.txt` to prevent installation of versions with broken ByteLevel BPE tokenizer decoding. Verified on Gaudi2: GSM8K accuracy 0.65 (expected 0.66, within tolerance) with transformers 5.7.0.
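For reference, the exclusion described in the last item can be expressed as a PEP 440 specifier; the actual `requirements.txt` line may carry additional pins, so this fragment is only illustrative:

```
# allow transformers v5 but skip the 5.6.x series with broken ByteLevel BPE decoding
transformers!=5.6.*
```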