Revert "Enable Cross layers KV cache layout at NIXL Connector (#30207)" #33241
NickLucche merged 2 commits into vllm-project:main
Conversation
Revert "Enable Cross layers KV cache layout at NIXL Connector (vllm-project#30207)". This reverts commit 64e3d67. Signed-off-by: Or Ozeri <oro@il.ibm.com>
Documentation preview: https://vllm--33241.org.readthedocs.build/en/33241/
Force-pushed 3a9f961 to a37185f
Code Review
This pull request reverts the "Cross layers KV cache layout" feature from the NIXL Connector, along with its follow-up. The changes correctly remove the feature's implementation, associated tests, and documentation. The code is reverted to its state before the feature was introduced, which also simplifies some logic by removing lazy initializations in favor of initialization in the constructor. The revert appears to be complete and correct.
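As an aside, the refactor the review mentions (lazy initialization replaced by initialization in the constructor) can be sketched in a few lines. This is illustrative only; the class and attribute names below are made up and are not vLLM's connector code:

```python
# Illustrative sketch of the refactor pattern the review describes:
# lazy initialization vs. initialization in the constructor.
# All names here are hypothetical, not taken from vLLM.


class LazyConnector:
    """Defers creating its handle until first access."""

    def __init__(self) -> None:
        self._handle = None

    @property
    def handle(self) -> object:
        # Lazy path: every access pays an extra branch, and callers must
        # reason about whether initialization has already happened.
        if self._handle is None:
            self._handle = object()
        return self._handle


class EagerConnector:
    """Creates its handle once, up front, in the constructor."""

    def __init__(self) -> None:
        # Eager path: the invariant "handle exists" holds from construction
        # onward, which is the simplification the review points out.
        self.handle = object()


lazy, eager = LazyConnector(), EagerConnector()
print(lazy.handle is lazy.handle)  # True: created once, then cached
print(eager.handle is not None)    # True
```

The eager form trades a small amount of up-front work for simpler invariants, which is why a revert that restores it can also simplify the surrounding logic.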
NickLucche left a comment
Reverting as per discussion on Slack.
Looking forward to getting this feature back in for the next release!
Revert "Enable Cross layers KV cache layout at NIXL Connector (vllm-project#30207)" (vllm-project#33241). Signed-off-by: Or Ozeri <oro@il.ibm.com>. Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
@orozery @liranschour don't we need cross-layers KV cache for performance reasons?
#33339 re-introduced it, but this time it is off by default.
Don't we want to enable it by default?
And I see the following code:

```python
@property
def prefer_cross_layer_blocks(self) -> bool:
    backend = get_current_attn_backend(self._vllm_config)
    if backend.get_name() not in (
        "FLASH_ATTN",
        "FLASHINFER",
    ):
        return False
```

Don't know what the problem is with the other backends.
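For readers following along, here is a minimal, self-contained sketch of that gating pattern. The `get_current_attn_backend` call is replaced by a plain constructor argument, and only the backend names ("FLASH_ATTN", "FLASHINFER") come from the quoted snippet; everything else is hypothetical, not vLLM's actual code:

```python
# Hypothetical sketch of a backend-gated layout preference, modeled on the
# quoted snippet. The class name and constructor are illustrative; only the
# two backend names are taken from the code above.

# Backends that, per the quoted snippet, may use the cross-layer layout.
_CROSS_LAYER_BACKENDS = ("FLASH_ATTN", "FLASHINFER")


class KVCacheLayoutSelector:
    def __init__(self, attn_backend_name: str) -> None:
        self._attn_backend_name = attn_backend_name

    @property
    def prefer_cross_layer_blocks(self) -> bool:
        # Opt into the cross-layer KV cache layout only when the current
        # attention backend is in the allow-list; any other backend falls
        # back to the default per-layer layout.
        return self._attn_backend_name in _CROSS_LAYER_BACKENDS


print(KVCacheLayoutSelector("FLASH_ATTN").prefer_cross_layer_blocks)   # True
print(KVCacheLayoutSelector("TRITON_ATTN").prefer_cross_layer_blocks)  # False
```

An allow-list like this is conservative by design: backends are only opted in once the layout is known to work with them, which matches the off-by-default stance of the re-introduction.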
Eventually yes.
Just wondering: the change was merged 5 weeks ago. Did you get any positive or negative feedback?
Good point.
My concern about this approach is that users aren't aware of the functionality, and even if we tell the users we work with directly about it, we will only reach a minority of the users who could benefit from it.
Fully reverts #30207 and its #33052 follow-up.