[PD][Nixl] Add support for hybrid SSM-FA models #36687
NickLucche merged 22 commits into vllm-project:main
Conversation
Code Review
This pull request introduces comprehensive support for hybrid SSM-FA models, a significant and complex feature. While the changes span test configurations, core KV connector logic, and scheduler behavior, a critical security vulnerability was identified in the scheduler's handling of invalid KV cache blocks for HMA-enabled requests. Specifically, the validation logic fails to check all relevant KV cache groups, which could lead to the use of uninitialized memory and potential PII leakage. Additionally, several debug print statements and potential logic errors in the Nixl connector were observed and should be addressed to ensure production readiness.
```python
register_remote_blocks(blocks_data, mamba=False)
if self._is_mamba:
    assert self.num_descs == len(blocks_data)
```
This assertion is critical for ensuring that the number of descriptors (self.num_descs) matches the actual number of blocks being registered (len(blocks_data)). An inconsistency here could lead to memory corruption or incorrect KV cache transfers. It's important to verify that self.num_descs is always accurately calculated to reflect all registered blocks, including those for Mamba layers and any logical duplications for K/V splits.
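To make the bookkeeping concrete, here is a toy sketch (names are illustrative, not the actual connector code; in the real connector any K/V split must already be reflected in `blocks_data` itself before the count is taken):

```python
def build_blocks_data(base_addrs, num_blocks, block_len, device_id=0):
    """One (start_addr, length, device_id) descriptor per block per region."""
    return [
        (base + b * block_len, block_len, device_id)
        for base in base_addrs
        for b in range(num_blocks)
    ]

# Two registered regions (e.g. separate K and V) x 4 blocks -> 8 descriptors.
blocks_data = build_blocks_data([0x1000, 0x9000], num_blocks=4, block_len=0x100)
num_descs = len(blocks_data)
assert num_descs == 2 * 4  # the count the assert above protects
```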
vllm/v1/core/sched/scheduler.py
Outdated
```diff
  all_req_block_ids = (
      (block_id for group in req_block_ids for block_id in group)
      if is_hma
      else req_block_ids[0]
  )
  req_num_computed_blocks = (
      req_num_computed_tokens + self.block_size - 1
  ) // self.block_size
- for idx, block_id in zip(range(req_num_computed_blocks), req_block_ids):
+ for idx, block_id in enumerate(all_req_block_ids):
+     if idx >= req_num_computed_blocks:
+         break
```
The scheduler's logic for identifying requests affected by invalid KV cache blocks is incomplete when Hybrid Memory Allocator (HMA) is used. The code flattens all KV cache groups into a single list but only iterates through the first req_num_computed_blocks elements. In HMA mode, this typically corresponds to the blocks in the Full Attention group, causing the validation to skip blocks in other groups (e.g., Sliding Window). If a block in these skipped groups failed to load from a remote source, the scheduler will fail to detect it, potentially leading the model runner to use uninitialized or stale GPU memory, which could result in PII leakage or incorrect model outputs.
Modify the validation loop to ensure all blocks in all KV cache groups that are relevant to the computed tokens are checked against the invalid_block_ids set. Since HMA does not support partial recovery, any invalid block in any group should trigger a full eviction and recomputation for the request.
Suggested change:

```python
if is_hma:
    all_req_block_ids = [
        block_id for group in req_block_ids for block_id in group
    ]
    req_num_computed_blocks = len(all_req_block_ids)
else:
    all_req_block_ids = req_block_ids[0]
    req_num_computed_blocks = (
        req_num_computed_tokens + self.block_size - 1
    ) // self.block_size
for idx, block_id in enumerate(all_req_block_ids):
    if idx >= req_num_computed_blocks:
        break
```
```python
print(f"{self.vllm_config.cache_config.mamba_page_size_padded=}\n\n")
# block size: 400, the one from the FA spec
print(f"block size: {self.block_size}\n\n")
print("NUM_BLOCKS: ", self.num_blocks, "\n\n", flush=True)
```
```python
if tensor_size_bytes is None:
    tensor_size_bytes = curr_tensor_size_bytes
```

```python
print(f"{layer_name=}, {[v.shape for v in cache_list]}")
```
```python
local_block_len = self.get_backend_aware_kv_block_len(
    layer_idx=i, first_split=True, mamba_view=mamba
)
print(f"Add agent {i=}, {local_block_len=}\n", flush=True)
```
```python
layer_spec.page_size_bytes
if isinstance(layer_spec, MambaSpec)
else layer_spec.page_size_bytes
// self._physical_blocks_per_logical_kv_block
```
The block_len_per_layer list is populated conditionally for non-Mamba specs and then truncated based on seen_base_addresses. This approach can be fragile. If the order or count of seen_base_addresses does not perfectly align with the non-Mamba layers for which block_len_per_layer was intended, it could lead to incorrect block length assignments. Consider ensuring a more robust mapping or initialization of block_len_per_layer that directly corresponds to the registered regions.
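A hypothetical sketch of that suggestion, keying block lengths by each region's base address so list ordering cannot desynchronize (the `Region` type and all names here are illustrative, not the actual connector attributes):

```python
from dataclasses import dataclass

@dataclass
class Region:
    base_addr: int
    page_size_bytes: int
    is_mamba: bool

def block_len_by_region(regions, physical_blocks_per_logical_block: int):
    """Map each registered non-Mamba region's base address to its block
    length, so lookups never depend on the order of a parallel list."""
    return {
        r.base_addr: r.page_size_bytes // physical_blocks_per_logical_block
        for r in regions
        if not r.is_mamba  # mamba layers keep their own state-size bookkeeping
    }

regions = [Region(0x1000, 4096, False), Region(0x2000, 8192, True)]
lens = block_len_by_region(regions, physical_blocks_per_logical_block=2)
```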
Tested the hybrid SSM P/D disaggregation on 2× H100 with
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 629d263 to 33b40a3
Hi @NickLucche, the pre-commit checks have failed. Please run:

```shell
uv pip install pre-commit
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
```shell
  "DP_EP=1 GPU_MEMORY_UTILIZATION=0.8 PREFILLER_TP_SIZE=2 DECODER_TP_SIZE=2 MODEL_NAMES=deepseek-ai/deepseek-vl2-tiny" # MLA+P-TP2, D-DPEP=2 (TP=1)
)
hybrid_ssm_configs=(
  "ENABLE_HMA_FLAG=1 GPU_MEMORY_UTILIZATION=0.8 MODEL_NAMES=nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 VLLM_SERVE_EXTRA_ARGS=--max-model-len,8192,--trust-remote-code"
```
I hope this fits on CI
```python
if self._is_mamba:
    assert self._is_hma_required
    mamba_spec = next(
```
I could probably wrap this bit to reduce bloat
```python
page_size = (
    layer_spec.page_size_bytes
    if isinstance(layer_spec, MambaSpec)
    else layer_spec.page_size_bytes
    // self._physical_blocks_per_logical_kv_block
)
num_blocks = (
    self._logical_num_blocks
    if isinstance(layer_spec, MambaSpec)
    else self.num_blocks
)
# `page_size` accounts for physical blocks, st KVCache is always
# [`num_blocks` * `page_size`]
if not isinstance(layer_spec, MambaSpec):
    self.block_len_per_layer.append(page_size)
curr_tensor_size_bytes = num_blocks * page_size
if tensor_size_bytes is None:
    tensor_size_bytes = curr_tensor_size_bytes
```
This logic has been moved from the inner loop to here, extending the work to rely solely on `KVCacheConfig` rather than on tensor views.
```python
blocks_data: list[tuple[int, int, int]] = []
local_base_addresses = self.kv_caches_base_addr[self.engine_id][self.tp_rank]

def register_blocks(blocks_data: list[tuple[int, int, int]], mamba: bool):
```
Wrapping the whole block in a function to re-use for the mamba descriptors, appended at the end.
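A minimal sketch of that wrapper pattern (all names here are illustrative, not the actual connector code):

```python
def build_descriptors(attn_bases, mamba_bases, num_blocks, attn_len, mamba_len):
    """One inner helper builds (addr, len, device_id) tuples for a set of
    regions; it is called once for the attention regions and once for the
    mamba regions, so the mamba descriptors land at the end in stable order."""
    blocks_data: list[tuple[int, int, int]] = []

    def register_blocks(bases, block_len):
        for base in bases:
            for b in range(num_blocks):
                blocks_data.append((base + b * block_len, block_len, 0))

    register_blocks(attn_bases, attn_len)    # FA K/V regions first
    register_blocks(mamba_bases, mamba_len)  # mamba state regions appended last
    return blocks_data
```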
```python
# local mapped:| 0| 1| 2| 3| 4| 5| 6| 7| 8| 9|10|11|12|13|14|15|
assert self.kv_topo is not None
block_size_ratio = self.kv_topo.block_size_ratio_from_engine_id(engine_id)
kv_topo = self.kv_topo
```
mypy was complaining
I see the scheduler changes related to failure recovery.
tdoublep left a comment:
Some initial comments (haven't finished reading it all yet).
```python
"deepseek-ai/deepseek-vl2-tiny": 0.19,
"deepseek-ai/DeepSeek-V2-Lite-Chat": 0.65,
"google/gemma-3-4b-it": 0.74,
"nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8": 0.84,
```
I can switch to granite
```python
if not isinstance(layer_spec, MambaSpec):
    self.block_len_per_layer.append(page_size)
```
Maybe I'm missing something, but where does `self.block_len_per_layer` get populated for the Mamba layers?
Good point, there's a comment where I define the var:

```python
# Enable different block lengths for different layers *only* when MLA is used.
# This is not used for SSM layers, which use the counterpart `mamba_ssm_size`.
```

Let me know if that should be expanded. Basically, `UniformTypeKVCacheSpecs` can allow for different page sizes; currently that is only used for the dsv32 Indexer afaik @heheda12345.
```diff
  # we just mock num_blocks to 1 for the dimension check below.
- self._is_kv_layout_blocks_first = (
+ # Hybrid SSM models assume a single blocks_first layout
+ self._is_kv_layout_blocks_first = self.is_mamba or (
```
I wonder if _is_kv_layout_blocks_first could be a property of the attention backend rather than needing to compute it here?
Great point!
We can address it in a scoped PR
```python
# Regular case: backends like FA register K/V in separate regions
return cache if self.split_k_and_v else [cache]
```
Again, this is probably just my lack of familiarity with this part of the code, but how does returning the tensor vs. the tensor wrapped in a list relate to registering the K/V separately?
Because in the original `register_kv_cache` code we iterate over the returned value, as in `for cache in cache_list:`.
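To illustrate the point: the caller iterates over the returned value, so returning the bare tensor versus a one-element list changes what each loop step registers. A stand-in using plain lists in place of torch tensors (illustrative only):

```python
def split_cache(cache, split_k_and_v: bool):
    # `cache` stands in for a KV tensor shaped [2, num_blocks, ...]:
    # index 0 is K, index 1 is V. Returning `cache` lets the caller's
    # `for cache in cache_list` loop register K and V as separate regions;
    # wrapping it as [cache] registers the fused tensor as a single region.
    return cache if split_k_and_v else [cache]

kv = ["K_region", "V_region"]            # stand-in for the stacked KV tensor
assert list(split_cache(kv, True)) == ["K_region", "V_region"]     # two regions
assert list(split_cache(kv, False)) == [["K_region", "V_region"]]  # one region
```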
start refactoring
address kernel block size mismatch by handling 2 num_blocks
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
/gemini review
Code Review
This pull request adds support for hybrid SSM-FA models to the NixlConnector, which is a significant feature enhancement. The changes are extensive, touching upon KV cache registration, descriptor management, and metadata handling to accommodate the specific requirements of Mamba-based models alongside traditional attention mechanisms. The addition of comprehensive unit tests is commendable. I've identified a critical issue related to the calculation of page sizes for Mamba layers in the presence of a kernel block size mismatch, which could lead to incorrect behavior. The logic is unnecessarily complex and error-prone. My review includes a suggestion to refactor this for correctness and improved clarity.
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: wendyliu235 <wenjun.liu@intel.com>
Signed-off-by: Monishver Chandrasekaran <monishverchandrasekaran@gmail.com>
Signed-off-by: Vinay Damodaran <vrdn@hey.com>
Signed-off-by: EricccYang <yangyang4991@gmail.com>
For a comprehensive description of the changes proposed here, check out the corresponding RFC #36780.
This PR adds support for hybrid SSM-based models, such as nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-FP8, with NixlConnector, enabling KVCache transfer of both FA and Mamba states in disaggregated setups. Currently it only supports homogeneous TP sizes on both P and D.
Note that we're only transferring the actual mamba states and skipping any padding that may be present, as that might have non-trivial size.
UPDATE:
Re this change: in this PR I am trying to further move away from relying on tensor views, while unifying usage of `kv_cache_config` in code as the single source of truth. This is also necessary for Mamba-like models, in which the tensor (`cache` above) gives the unpadded tensor size; that doesn't reflect `num_blocks * physical_page_size`, as one would need to take the padding into account manually.
Important notes:
Requires `--no-async-scheduling` to run correctly. @ZhanqiuHu and I identified a synchronization issue where states may be transferred in a corrupted form, leading to high variance in evaluations. Will address separately, as that is likely unrelated to SSMs.
Test with:
Enable HMA experimental support with `--no-disable-hybrid-kv-cache-manager`, or check out the unit tests added with this PR.
Results from running consecutive full lm-eval runs with no prefix caching:
TODO