nixl refactor [2/N]: unify TpKVTopology + HeteroTPTransferConfig into TransferTopology #39529
Conversation
Documentation preview: https://vllm--39529.org.readthedocs.build/en/39529/

This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request refactors the NIXL connector by modularizing its components into a new package structure and introducing a TransferTopology class to centralize transfer geometry logic. My review identified three critical issues: a potential NameError in TransferTopology when is_mamba is true, an incorrect registration of Mamba KV cache tensors that would exclude SSM state, and a missing import for compute_mamba_phys_ratio in the worker module.
```python
if tensor_shape is not None:
    self._cross_layers_blocks = len(tensor_shape) == len(kv_cache_shape) + 1
```
If is_mamba is True, kv_cache_shape is never defined (see line 741). This will cause a NameError here if tensor_shape is provided. Since hybrid SSM models do not yet support cross-layer layout (as noted in connector.py), this logic should be guarded by not is_mamba to avoid accessing the undefined variable.
```diff
-if tensor_shape is not None:
+if not is_mamba and tensor_shape is not None:
     self._cross_layers_blocks = len(tensor_shape) == len(kv_cache_shape) + 1
```
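To make the failure mode concrete, here is a minimal, self-contained repro of the pattern (simplified names, not the PR's actual code): `kv_cache_shape` is only bound on the non-Mamba path, so an unguarded use raises a `NameError` (concretely an `UnboundLocalError`), while the short-circuit `not is_mamba and ...` guard never evaluates the undefined name.

```python
def detect_cross_layers(is_mamba: bool, tensor_shape, guarded: bool):
    """Simplified repro: kv_cache_shape is only bound on the non-Mamba path."""
    if not is_mamba:
        kv_cache_shape = (2, 8, 16, 4, 64)  # placeholder shape
    if guarded:
        # Short-circuit: the undefined name is never evaluated for Mamba.
        if not is_mamba and tensor_shape is not None:
            return len(tensor_shape) == len(kv_cache_shape) + 1
        return None
    # Unguarded: NameError when is_mamba=True and tensor_shape is given.
    if tensor_shape is not None:
        return len(tensor_shape) == len(kv_cache_shape) + 1
    return None
```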
```python
assert self.transfer_topo is not None
transfer_topo = self.transfer_topo
physical_blocks_per_logical = (
    compute_mamba_phys_ratio(
```
Force-pushed from 6988b62 to 0a6585a
NickLucche left a comment
I think you did not delete TpKVTopology, which in turn leads to changes to the mooncake and test files. We should be able to just replace it safely though, right?
I left some comments for now, will continue asap
```python
remote_block_size=nixl_agent_meta.block_size,
remote_block_len=nixl_agent_meta.block_lens[0],
remote_physical_blocks_per_logical=physical_blocks_per_logical,
local_block_len=(self.block_len_per_layer[0] if self._has_mamba else 0),
```
I think this sort of logic is already handled in the `register_remote_engine` function, so you can just do:
```diff
-local_block_len=(self.block_len_per_layer[0] if self._has_mamba else 0),
+local_block_len=self.block_len_per_layer[0],
```
Force-pushed from e0a17dd to d11d865
NickLucche left a comment
Thanks for the work @ZhanqiuHu!
Looking good, I only left a few minor comments to address.
Getting CI rolling in the meantime
```python
def _physical_head_range(tp_size: int, num_heads: int, rank: int) -> range:
    """Physical KV head range stored in a rank's KV cache tensor.
```
I think this method could be moved into TransferTopology; it's not used anywhere else.
```python
kv_cache_shape = tuple(kv_cache_shape[i] for i in kv_cache_stride_order)
```

```python
# ============================================================
# Engine registration (new)
```
```diff
-# Engine registration (new)
+# Engine registration
```
```python
    *,
    local_block_len: int = 0,
) -> EngineTransferInfo:
    """Register a remote engine, replacing scattered worker dicts.
```
nit:
| """Register a remote engine, replacing scattered worker dicts. | |
| """Register a remote engine, unifying worker dicts state. |
```python
logger.info(
    "Mamba transfer plan: %s",
    transfer_topo.describe_mamba(engine_id),
)
```
I think we should have a single describe method here and describe the instance regardless (not under a has_mamba guard), figuring out inside the method whether we're describing Mamba or not.
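One way to realize this suggestion, sketched with trimmed-down stand-ins for the PR's types (field sets and message formats here are assumptions, not the PR's code): a single `describe()` that branches on the stored info type instead of guarding at call sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineTransferInfo:
    remote_tp_size: int
    remote_block_size: int


@dataclass(frozen=True)
class MambaEngineTransferInfo(EngineTransferInfo):
    remote_num_fa_reads: int
    remote_num_mamba_reads: int


class TransferTopology:
    def __init__(self) -> None:
        self._engines: dict[str, EngineTransferInfo] = {}

    def describe(self, engine_id: str) -> str:
        """Single describe(): branch on the stored info type, not at call sites."""
        info = self._engines[engine_id]
        if isinstance(info, MambaEngineTransferInfo):
            return (f"{engine_id}: mamba plan, fa_reads={info.remote_num_fa_reads}, "
                    f"mamba_reads={info.remote_num_mamba_reads}")
        return (f"{engine_id}: attention plan, tp={info.remote_tp_size}, "
                f"block_size={info.remote_block_size}")
```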
```python
mamba_info.remote_all_source_ranks,
mamba_info.remote_num_fa_reads,
mamba_info.remote_fa_descriptor_bytes,
mamba_info.remote_num_mamba_reads,
```
nit: is this ~kinda like a .describe?
Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
…ransferTopology

Replace scattered per-engine dicts (_tp_size, _block_size) and separate HeteroTPTransferConfig construction with unified TransferTopology that stores per-engine facts atomically via register_remote_engine().

Key changes in worker.py:
- kv_topo → transfer_topo (TransferTopology type)
- TpKVTopology constructor → TransferTopology (explicit int params)
- register_remote_engine() replaces _tp_size/_block_size dict updates and HeteroTPTransferConfig instantiation (idempotent, like original)
- _from_engine_id() calls → get_engine_info() + raw-int method calls
- transfer_cfg.method() → transfer_topo.method(engine_id)
- _build_fa_remote_for_mamba takes TransferTopology + engine_id
- Dead code removed: _transfer_configs, _tp_size, _block_size dicts, HeteroTPTransferConfig/TpKVTopology imports

Key changes in utils.py:
- HeteroTPTransferConfig deleted (~300 lines, zero consumers)
- TransferTopology: added idempotent registration, get_engine_info
- Unused `field` import removed

TpKVTopology retained for mooncake_connector.py and tests.

Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
Force-pushed from 8b5fc45 to e4153f3
… TransferTopology (vllm-project#39529) Signed-off-by: Zhanqiu Hu <zhu@redhat.com>
… TransferTopology (vllm-project#39529) Signed-off-by: Zhanqiu Hu <zhu@redhat.com> Signed-off-by: Avinash Singh <avinashsingh.rcoem@gmail.com>
…stream breakages: NIXL connector, TpKVTopology rename, MoE refactor, transformers v5 (#1377)

## Summary

Compatibility fixes for vLLM bump to `3975eb6de6`. Addresses breakages from multiple upstream PRs affecting NIXL connectors, MoE runner refactor, offloading tests, Qwen3 MoE models, and transformers v5 upgrade.

## Root Cause

1. **NIXL import gate** — Upstream PR vllm-project/vllm#39529 (commit `cc3993b05d`) moved NIXL imports to `vllm/distributed/nixl_utils.py` and changed the platform gate from `if not is_rocm()` to `if is_cuda()`. HPU is neither CUDA nor ROCm, so it falls into the `else` branch → tries `rixl._api` (ROCm-only) → fails → `NixlWrapper = None` → `RuntimeError("NIXL is not available")`.
2. **TpKVTopology rename** — Same upstream PR #39529 unified `TpKVTopology` + `HeteroTPTransferConfig` into `TransferTopology`, breaking vllm-gaudi NIXL connector imports.
3. **Offloading tests** — Upstream PR vllm-project/vllm#36645 changed the `OffloadingManager.lookup()` API.
4. **MoE runner refactor** — Upstream PR vllm-project/vllm#35949 (commit `726efe177b`) moved reduce logic into `MoERunnerBase`, removing `reduce_results`, renaming `forward_dispatch` → `_forward_dispatch`, `forward_entry` → `_forward_entry`, `_maybe_reduce_output` → `_maybe_reduce_final_output`. A follow-up PR moved `MoERunnerBase` and `get_layer_from_name` to `moe_runner_base.py`.
5. **Qwen3 MoE** — `SharedFusedMoE` returns a combined tensor (not a tuple), and the MoE runner now handles TP reduction internally, causing a double reduce in `qwen3_moe.py` / `qwen3_next.py`.
6. **Transformers v5 — granite tokenizer** — Upstream PR vllm-project/vllm#30566 updated transformers to allow v5. GPT2Tokenizer in v5 now respects `add_bos_token=True` (silently ignored in v4), causing degenerate outputs and 0.0 GSM8K accuracy on granite models.
7. **Transformers v5.6.x — DeepSeek-V2-Lite tokenizer** — In transformers v5.6.x, `LlamaTokenizerFast` was unified into `LlamaTokenizer`, which does not apply the ByteLevel BPE decoder declared in `tokenizer.json`. DeepSeek-V2-Lite-Chat's tokenizer decoding strips all spaces (Ġ chars not converted back), producing garbled output and 0.0 accuracy on GSM8K. Fixed natively in transformers v5.7.0.

## Fix

1. **NIXL import patch**: Add `patch_nixl_utils_for_hpu()` in `register_utils()` to monkey-patch `vllm.distributed.nixl_utils` — imports from `nixl._api` instead of `rixl._api` on HPU. Update `hetero_hpu_nixl_connector.py` to import from `vllm.distributed.nixl_utils` instead of hardcoded `nixl._api`.
2. **TpKVTopology → TransferTopology**: Rename in NIXL connector imports and monkey-patches.
3. **Offloading tests**: Replace `runner.manager.lookup.return_value` with `connector_scheduler._maximal_prefix_lookup`.
4. **MoE refactor**: Update imports (`MoERunnerBase` from `moe_runner_base`), method names (`_forward_dispatch`, `_forward_entry`, `_maybe_reduce_final_output`), remove dead `reduce_results` / `reduce_output()`.
5. **Qwen3 MoE**: Remove incorrect shared_expert tuple indexing and double TP reduction.
6. **Transformers v5 — granite**: Remove hardcoded `add_bos_token=True` from lm-eval model_args to fix the GSM8K accuracy regression.
7. **Transformers v5.6.x — DeepSeek-V2-Lite**: Exclude `transformers 5.6.*` in `requirements.txt` to prevent installation of versions with broken ByteLevel BPE tokenizer decoding.

Verified on Gaudi2: gsm8k accuracy 0.65 (expected 0.66, within tolerance) with transformers 5.7.0.

---------
Signed-off-by: Paweł Olejniczak <pawelx.olejniczak@intel.com>
… TransferTopology (vllm-project#39529) Signed-off-by: Zhanqiu Hu <zhu@redhat.com> Signed-off-by: Adrian <info@zzit.ch>
Summary
- `TransferTopology` class that unifies `TpKVTopology` and `HeteroTPTransferConfig` into a single source of truth
- `EngineTransferInfo` / `MambaEngineTransferInfo` frozen dataclasses for per-remote-engine transfer state
- `worker.py` refactored to use `TransferTopology`, removing redundant `_tp_size`, `_block_size`, `_transfer_configs` dicts
- `HeteroTPTransferConfig` class deleted entirely

Depends on #39354
New: `EngineTransferInfo` (frozen dataclass)

Stores per-remote-engine transfer facts, computed once at handshake:

- `remote_tp_size`
- `remote_block_len`
- `remote_block_size`
- `remote_physical_blocks_per_logical`

New: `MambaEngineTransferInfo` (inherits `EngineTransferInfo`)

Extends with Mamba-hybrid transfer geometry:

- `remote_fa_source_ranks`
- `remote_all_source_ranks`
- `remote_num_fa_reads`
- `remote_num_mamba_reads`
- `remote_fa_descriptor_bytes`
- `is_remote_replicated`
- `remote_physical_heads`
New: `TransferTopology` (replaces `TpKVTopology` + `HeteroTPTransferConfig`)

Local info (one copy, set at init):

- `tp_rank`, `tp_size`, `block_size`
- `engine_id`
- `is_mla`, `is_mamba`, `total_num_kv_heads`
- `local_physical_heads` = `max(1, total_num_kv_heads // tp_size)`
- `is_kv_layout_blocks_first`, `cross_layers_blocks`, `split_k_and_v`

Remote info (dict, populated via `register_remote_engine()`):

- `_engines: dict[EngineId, EngineTransferInfo]`
- `_fa_source_sets`, `_fa_source_indices`

Methods (unified from both old classes):

- `tp_ratio()`, `block_size_ratio()`, `handshake_target_ranks()`, `target_remote_ranks()`, `is_kv_replicated()`, `replicates_kv_cache()`, `get_transfer_cache_regions()`
- `should_skip_fa()`, `fa_head_slot()`, `fa_rank_offset()`, `needs_split_handles()`, `compute_split_handle_data()`, `filter_block_ids_for_rank()`, `describe_mamba()`
- `register_remote_engine()`, `get_engine_info()`, `_get_mamba_info()`, `_build_mamba_info()`

`worker.py` migration:

- `self.kv_topo: TpKVTopology` → `self.transfer_topo: TransferTopology`
- `self._tp_size: dict` → `EngineTransferInfo.remote_tp_size`
- `self._block_size: dict` → `EngineTransferInfo.remote_block_size`
- `self._transfer_configs: dict` → `MambaEngineTransferInfo`
- `HeteroTPTransferConfig()` → `register_remote_engine()` call
- `compute_mamba_phys_ratio()` → `compute_physical_blocks_per_logical()`
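As a rough sketch of the registration flow described above (field set trimmed and the derivation formulas invented for illustration — real signatures may differ): `register_remote_engine()` computes per-engine facts once and stores them atomically in a frozen dataclass, `get_engine_info()` replaces the per-dict lookups, and registration is idempotent like the original dict updates were.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EngineTransferInfo:
    """Per-remote-engine transfer facts, computed once at handshake."""
    remote_tp_size: int
    remote_block_size: int
    remote_block_len: int
    remote_physical_blocks_per_logical: int


class TransferTopology:
    def __init__(self, tp_size: int, block_size: int) -> None:
        self.tp_size = tp_size
        self.block_size = block_size
        self._engines: dict[str, EngineTransferInfo] = {}

    def register_remote_engine(self, engine_id: str, remote_tp_size: int,
                               remote_block_size: int,
                               remote_block_len: int) -> EngineTransferInfo:
        # Idempotent: re-registering the same engine returns the stored info.
        if engine_id in self._engines:
            return self._engines[engine_id]
        info = EngineTransferInfo(
            remote_tp_size=remote_tp_size,
            remote_block_size=remote_block_size,
            remote_block_len=remote_block_len,
            # Derived facts computed once here instead of at every call site
            # (illustrative formula, not the PR's actual derivation).
            remote_physical_blocks_per_logical=max(
                1, self.block_size // remote_block_size),
        )
        self._engines[engine_id] = info
        return info

    def get_engine_info(self, engine_id: str) -> EngineTransferInfo:
        return self._engines[engine_id]

    def tp_ratio(self, remote_tp_size: int) -> int:
        # Heterogeneous-TP ratio between local and remote worlds.
        return max(1, self.tp_size // remote_tp_size)
```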
Deleted:

- `HeteroTPTransferConfig` class (entire class, ~300 lines in `utils.py`)

What's NOT changed (kept for now):

- `TpKVTopology` still exists — used by `mooncake_connector.py` and tests
- `_physical_blocks_per_logical` dict still in `worker.py` (duplicates info for local engine)
- `TransferTopology` (Phase 2 will extract into policy)

Change Log
Annotated diff: utils.py

Annotated Diff: Full new code in utils.py

File: `vllm/distributed/kv_transfer/kv_connector/utils.py`
Branch: `nixl-tpkv-transferconfig-unification`
Base: PR #39354 (`9cd664152`)

All changes are pure additions appended after line 940. No existing code modified.
Legend (on each line or block):

- `[NEW]` — New code, no existing equivalent
- `[COPIED]` — Logic identical to existing code
- `[RENAMED]` — Same logic, variable/field names changed (`d_`/`p_` → `local_`/`remote_`)
- `[SIG CHANGE]` — Signature changed but body logic identical

Each section shows the original source being compared to.
Note: `is_kv_replicated` and `replicates_kv_cache` currently take `remote_engine_id` but only do a single lookup (`remote_tp_size > total_num_kv_heads`). They could be reverted to take `remote_tp_size: int` for consistency with `tp_ratio` / `block_size_ratio`. Left as-is for now — can be cleaned up in a follow-up.
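To illustrate the trade-off in the note, both candidate signatures reduce to the same comparison (a sketch assuming the semantics stated above — KV is replicated when the remote TP world has more ranks than there are KV heads):

```python
def is_kv_replicated_by_tp(remote_tp_size: int, total_num_kv_heads: int) -> bool:
    # Raw-int form, consistent with tp_ratio / block_size_ratio.
    return remote_tp_size > total_num_kv_heads


def is_kv_replicated_by_engine(engines: dict[str, int],
                               remote_engine_id: str,
                               total_num_kv_heads: int) -> bool:
    # Engine-id form: one dict lookup, then the same comparison.
    return engines[remote_engine_id] > total_num_kv_heads
```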
EngineTransferInfo (lines 946–962)
Source: `[NEW]` — fields previously scattered as dicts in worker.py

MambaEngineTransferInfo (lines 965–994)
Source: `[RENAMED]` from `HeteroTPTransferConfig` derived fields (lines 580–609)
Fields moved out:
Old fields NOT carried over:

TransferTopology class + __init__ (lines 1000–1075)
Source: `__init__` params from `TpKVTopology` class attrs (lines 323–337); layout detection from `TpKVTopology.__post_init__` (lines 339–378)

register_remote_engine (lines 1081–1133)
Source: `[NEW]` — replaces scattered dict updates in worker.py

Layout properties (lines 1139–1152)
Source: `[COPIED]` from `TpKVTopology` (lines 380–401)

tp_ratio (lines 1158–1175)
Source: `[COPIED]` from `TpKVTopology.tp_ratio` (lines 403–427)

block_size_ratio (lines 1177–1183)
Source: `[COPIED]` from `TpKVTopology.block_size_ratio` (lines 429–440)

is_kv_replicated (lines 1185–1189)
Source: `[COPIED]` from `TpKVTopology.is_kv_replicated` (lines 456–464)

replicates_kv_cache (lines 1191–1193)
Source: `[COPIED]` from `TpKVTopology.replicates_kv_cache` (lines 466–468)

handshake_target_ranks (lines 1195–1205)
Source: `[NEW]` — extracted from `TpKVTopology.get_target_remote_ranks` (lines 470–485)

target_remote_ranks (lines 1207–1224)
Source: `[SIG CHANGE]` from `TpKVTopology.get_target_remote_ranks` (lines 470–485), with `[NEW]` Mamba branch

get_transfer_cache_regions (lines 1226–1249)
Source: `[COPIED]` from `TpKVTopology.get_transfer_cache_regions` (lines 494–516)

_get_mamba_info (lines 1258–1268)
Source: `[NEW]` — helper to type-check engine info lookup

should_skip_fa (lines 1270–1272)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.should_skip_fa` (lines 733–735)

fa_head_slot (lines 1274–1292)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.fa_head_slot` (lines 737–752)

fa_rank_offset (lines 1294–1315)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.fa_rank_offset` (lines 754–770)

needs_split_handles (lines 1317–1329)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.needs_split_handles` property (lines 772–779)

compute_split_handle_data (lines 1331–1360)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.compute_split_handle_data` (lines 781–810)

filter_block_ids_for_rank (lines 1362–1385)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.filter_block_ids_for_rank` (lines 812–834)

describe_mamba (lines 1387–1403)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.describe` (lines 836–848)

_build_mamba_info (lines 1410–1531)
Source: `[SIG CHANGE]` from `HeteroTPTransferConfig.__post_init__` (lines 611–687) + `HeteroTPTransferConfig._validate` (lines 689–729)

Annotated diff: worker.py
Annotated Diff: worker.py — All Changes (Consolidated)
File: `vllm/distributed/kv_transfer/kv_connector/v1/nixl/worker.py`
Branch: `nixl-tpkv-transferconfig-unification`
Base: PR #39354 (`9cd664152`)

All changes are in-place modifications to migrate `worker.py` from `TpKVTopology` + `HeteroTPTransferConfig` to the unified `TransferTopology`.

Legend:

- `[RENAME]` — Variable/method name change, no logic change
- `[REPLACE]` — Swapped to new API, equivalent behavior
- `[DELETE]` — Code removed (consolidated into TransferTopology)
- `[NEW]` — New code not in original

1. Imports
2. Instance variable declarations (`__init__`)

3. Dead write in `_setup_hma` removed

4. Handshake: `get_target_remote_ranks` → `handshake_target_ranks`

5. `register_kv_caches`: `TpKVTopology` → `TransferTopology` construction

6. All `self.kv_topo` → `self.transfer_topo` renames

Every `self.kv_topo` reference is renamed to `self.transfer_topo`. Every local `kv_topo = self.kv_topo` is renamed to `transfer_topo = self.transfer_topo`.

Affected locations (all `[RENAME]`, no logic change):

- `register_kv_caches`: `self.kv_topo` → `self.transfer_topo`; `self.kv_topo._cross_layers_blocks` → `self.transfer_topo._cross_layers_blocks`; `self.kv_topo.is_kv_layout_blocks_first` → `self.transfer_topo.is_kv_layout_blocks_first`
- `_validate_remote_agent_handshake`: `self.kv_topo` → `self.transfer_topo`
- `build_remote_descs`: `kv_topo` local var → `transfer_topo`
- `_build_fa_remote_for_mamba`: `kv_topo` param → `transfer_topo`
- `sync_recved_kv_to_device`: `self.kv_topo` → `self.transfer_topo`
- `get_finished`: `self.kv_topo` → `self.transfer_topo`
- `_get_new_notifs`: `self.kv_topo` → `self.transfer_topo`
- `_read_blocks_for_req`: `self.kv_topo` → `self.transfer_topo`
- `_read_blocks`: `self.kv_topo` → `self.transfer_topo`
- `get_backend_aware_kv_block_len`: `self.kv_topo` → `self.transfer_topo`

7. Remote engine registration replaces scattered dict updates
Also, the `_physical_blocks_per_logical[engine_id]` assignment was moved up to right after `register_remote_engine`:

8. `_validate_remote_agent_handshake`: dict lookups → `get_engine_info`

Same pattern for `is_kv_replicated` and `replicates_kv_cache` — they still take `engine_id` (they need Mamba-specific lookup internally):

9. `build_remote_descs`: `block_size_ratio` lookup change

10. Mamba split handles: `transfer_cfg.method()` → `transfer_topo.method(eid)`

Logging also updated to use `MambaEngineTransferInfo` fields:

11. `_build_fa_remote_for_mamba` signature change

Body changes:

Call site also updated:
12. `get_finished`: `block_size_ratio_from_engine_id` → raw int

13. `_get_new_notifs`: `kv_topo.tp_ratio()` rename only

14. `_read_blocks_for_req`: multiple API changes

15. `_read_blocks`: same pattern

Summary
- `[RENAME]`: `kv_topo` → `transfer_topo`, `info` → `remote_info`, `_mamba_phys_ratio` → `_physical_blocks_per_logical`, `compute_mamba_phys_ratio` → `compute_physical_blocks_per_logical`
- `[REPLACE]` / `[DELETE]`: `_transfer_configs`, `_tp_size`, `_block_size` dicts, `HeteroTPTransferConfig` construction, dead `_block_size` write
- `[NEW]`: `register_remote_engine()` call block, `remote_engine_id` param in `_build_fa_remote_for_mamba`

Net lines: roughly −20 (removed boilerplate dicts + `HeteroTPTransferConfig` construction, replaced with a single `register_remote_engine` call).

Logic changes: Zero. All computation paths are identical — only the API surface changed (where data is stored and how it's accessed).
Test plan