
Revert "Enable Cross layers KV cache layout at NIXL Connector (#30207)"#33241

Merged
NickLucche merged 2 commits into vllm-project:main from orozery:revert-nixl-cross-layers
Jan 28, 2026

Conversation

@orozery
Collaborator

@orozery orozery commented Jan 28, 2026

Fully reverts #30207 and its #33052 follow-up.

…roject#30207)"

This reverts commit 64e3d67.

Signed-off-by: Or Ozeri <oro@il.ibm.com>
@mergify

mergify bot commented Jan 28, 2026

Documentation preview: https://vllm--33241.org.readthedocs.build/en/33241/

@mergify mergify bot added the documentation, v1, and kv-connector labels Jan 28, 2026
@orozery orozery force-pushed the revert-nixl-cross-layers branch from 3a9f961 to a37185f on January 28, 2026 at 09:53
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request reverts the "Cross layers KV cache layout" feature from the NIXL Connector, along with its follow-up. The changes correctly remove the feature's implementation, associated tests, and documentation. The code is reverted to its state before the feature was introduced, which also simplifies some logic by removing lazy initializations in favor of initialization in the constructor. The revert appears to be complete and correct.

@khluu khluu added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jan 28, 2026
Collaborator

@NickLucche NickLucche left a comment


Reverting as per discussion on Slack.
Looking forward to getting this feature back in for the next release!

@NickLucche NickLucche enabled auto-merge (squash) January 28, 2026 11:17
@NickLucche NickLucche merged commit 2e8de86 into vllm-project:main Jan 28, 2026
49 checks passed
khluu added a commit that referenced this pull request Jan 28, 2026
…" (#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
apd10 pushed a commit to apd10/vllm that referenced this pull request Jan 31, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
PiratePai pushed a commit to PiratePai/epd_shm that referenced this pull request Feb 3, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Signed-off-by: PiratePai <416932041@qq.com>
Signed-off-by: Pai <416932041@qq.com>
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 15, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 22, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 23, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 26, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

feat(otel): production-ready OpenTelemetry logging (streaming, SGLang-compatible format)

Squashed commits:
- feat: Production-ready OpenTelemetry logging with streaming support
- fix(otel): match SGLang logging format with proper kvlistValue structure

feat(openai): streaming improvements for completions API

- max_tokens: null support - remove asserts in serving_completion.py so
  null max_tokens uses computed max_model_len - prompt_length value

- UTF-8 streaming fix - hold back any delta containing U+FFFD replacement
  char in incremental detokenizer to prevent streaming corrupted multi-byte
  chars like "�️�️" for emojis

- gpt-oss special tokens - default skip_special_tokens=False for harmony
  models in chat completions when caller doesn't explicitly set it, so
  protocol framing tokens are preserved in output

Tested on /v1/completions stream=true endpoint:
- max_tokens:null streams successfully
- stream_options.include_usage returns usage chunks
- emoji/UTF-8 streaming produces clean output (no U+FFFD)
- skip_special_tokens:false accepted without error
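The U+FFFD hold-back described in the second bullet can be sketched as follows. This is an illustrative byte-level version, not vLLM's actual incremental detokenizer API; the class and method names are made up:

```python
REPLACEMENT_CHAR = "\ufffd"

class IncrementalDetokenizer:
    """Sketch: withhold deltas that decode to U+FFFD, which signals a
    multi-byte UTF-8 character split across token boundaries."""

    def __init__(self) -> None:
        self._pending = b""  # bytes held back from previous calls

    def feed(self, chunk: bytes) -> str:
        data = self._pending + chunk
        text = data.decode("utf-8", errors="replace")
        if REPLACEMENT_CHAR in text:
            # Likely an incomplete multi-byte sequence: hold everything
            # back and wait for the next chunk to complete it.
            # (A real implementation would also flush at end-of-stream.)
            self._pending = data
            return ""
        self._pending = b""
        return text
```

Feeding the two halves of a 4-byte emoji yields an empty delta first, then the whole character once the sequence completes, so the client never sees corrupted output.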

feat(kimi): comprehensive Kimi K2 tool call streaming fixes

- Fix same-tool-multiple-times streaming using re.finditer for correct match indexing
- Fix previous_texts[i] accumulation for named tool_choice streaming
- Add Kimi K2 marker detection and extraction helpers
- Handle raw JSON output when reasoning parser returns it as reasoning content
- Add per-tool name tracking (tool_name_sent_arr) for parallel tool calls
- Strip all tool markers from leaked content after section ends
- Add finish-time handling for tools whose names weren't sent during streaming
- Handle string vs dict arguments in remaining args logic
- Update KIMI_K2_VLLM_CHANGES.md with comprehensive porting documentation
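The `re.finditer` fix in the first bullet can be illustrated with a minimal sketch. The marker format below is invented for illustration and is not Kimi K2's real tool-call markup; the point is that `re.search` or `str.find` would return the first occurrence every time, conflating repeated calls to the same tool, while `finditer` yields each match with its own span:

```python
import re

# Hypothetical marker syntax: <tool:NAME>ARGS</tool> (not Kimi K2's real format).
text = (
    "<tool:get_weather>{'city': 'Oslo'}</tool>"
    "<tool:get_weather>{'city': 'Rome'}</tool>"
)

pattern = re.compile(r"<tool:(\w+)>(.*?)</tool>")

# finditer walks the string left to right, so two calls to the same tool
# produce two distinct matches with distinct start offsets.
calls = [(m.group(1), m.start()) for m in pattern.finditer(text)]
```

Each streamed chunk can then be attributed to the correct call by its offset rather than by tool name alone.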

Test results:
- 15/15 parallel tool calls
- 10/10 concurrent streaming requests
- 121/121 full tool calling suite checks
- ~95% edge case tests (failures are model behavior, not bugs)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 27, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 28, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Feb 28, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

@vadiklyutiy
Collaborator

@orozery @liranschour don't we need cross layers kv-cache for performance reasons?

Seems #34158 and #36867 will solve the original issue.

@orozery
Collaborator Author

orozery commented Mar 15, 2026

@orozery @liranschour don't we need cross layers kv-cache for performance reasons?

Seems #34158 and #36867 will solve the original issue.

#33339 re-introduced it, but this time it is off by default.
It can be tested using:

--kv-transfer-config '{..., "kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'

@vadiklyutiy
Collaborator

#33339 re-introduced it, but this time it is off by default. It can be tested using:

--kv-transfer-config '{..., "kv_connector_extra_config": {"enable_cross_layers_blocks": "True"}}'

don't we want to enable it by default?

@vadiklyutiy
Collaborator

and I see the following code

    @property
    def prefer_cross_layer_blocks(self) -> bool:
        backend = get_current_attn_backend(self._vllm_config)
        if backend.get_name() not in (
            "FLASH_ATTN",
            "FLASHINFER",
        ):
            return False

I don't know what the problem with FLASH_ATTN is, but we can enable it for FLASHINFER after the two PRs mentioned above.

@orozery
Collaborator Author

orozery commented Mar 15, 2026

don't we want to enable it by default?

Eventually yes.
I think the plan is to let users test it for a while before making it the default.

@vadiklyutiy
Collaborator

don't we want to enable it by default?

Eventually yes. I think the plan is to let users test it for a while before making it the default.

Just wondering, the change was merged 5 weeks ago. Did you get any positive or negative feedback?

@orozery
Collaborator Author

orozery commented Mar 15, 2026

Just wondering, the change was merged 5 weeks ago. Did you get any positive or negative feedback?

Good point.
@NickLucche we should make sure users are aware of it so we can get feedback.

@vadiklyutiy
Collaborator

My concern about this approach

I think the plan is to let users test it for a while before making it the default.

is that users aren't aware of the functionality, and even if we tell the users we work with directly about it, we will only cover a minority of the users who could benefit from it.

manojkilaru97 pushed a commit to manojkilaru97/vllm that referenced this pull request Mar 19, 2026
…roject#30207)" (vllm-project#33241)

Signed-off-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
(cherry picked from commit 2e8de86)

Labels

documentation: Improvements or additions to documentation
kv-connector
ready: ONLY add when PR is ready to merge/full CI is needed
v1


4 participants