Revert "Enable Cross layers KV cache layout at NIXL Connector (#30207)" #33241
NickLucche merged 2 commits into vllm-project:main
Conversation
Revert "Enable Cross layers KV cache layout at NIXL Connector (vllm-project#30207)". This reverts commit 64e3d67. Signed-off-by: Or Ozeri <oro@il.ibm.com>
Documentation preview: https://vllm--33241.org.readthedocs.build/en/33241/
Force-pushed 3a9f961 to a37185f
Code Review
This pull request reverts the "Cross layers KV cache layout" feature from the NIXL Connector, along with its follow-up. The changes correctly remove the feature's implementation, associated tests, and documentation. The code is reverted to its state before the feature was introduced, which also simplifies some logic by removing lazy initializations in favor of initialization in the constructor. The revert appears to be complete and correct.
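As an aside, the refactor the review mentions (lazy initialization replaced by initialization in the constructor) can be sketched in a few lines. This is illustrative only; the class and attribute names below are made up and are not vLLM's connector code:

```python
# Illustrative sketch of the refactor pattern the review describes:
# lazy initialization vs. initialization in the constructor.
# All names here are hypothetical, not taken from vLLM.


class LazyConnector:
    """Defers creating its handle until first access."""

    def __init__(self) -> None:
        self._handle = None

    @property
    def handle(self) -> object:
        # Lazy path: every access pays an extra branch, and callers must
        # reason about whether initialization has already happened.
        if self._handle is None:
            self._handle = object()
        return self._handle


class EagerConnector:
    """Creates its handle once, up front, in the constructor."""

    def __init__(self) -> None:
        # Eager path: the invariant "handle exists" holds from construction
        # onward, which is the simplification the review points out.
        self.handle = object()


lazy, eager = LazyConnector(), EagerConnector()
print(lazy.handle is lazy.handle)  # True: created once, then cached
print(eager.handle is not None)    # True
```

The eager form trades a small amount of up-front work for simpler invariants, which is why a revert that restores it can also simplify the surrounding logic.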
NickLucche left a comment
Reverting as per discussion on Slack.
Looking forward to getting this feature back in for the next release!
Revert "Enable Cross layers KV cache layout at NIXL Connector (vllm-project#30207)" (vllm-project#33241). Signed-off-by: Or Ozeri <oro@il.ibm.com>. Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
@orozery @liranschour don't we need cross-layers KV cache for performance reasons?
#33339 re-introduced it, but this time it is off by default.
Don't we want to enable it by default?
And I see the following code:

```python
@property
def prefer_cross_layer_blocks(self) -> bool:
    backend = get_current_attn_backend(self._vllm_config)
    if backend.get_name() not in (
        "FLASH_ATTN",
        "FLASHINFER",
    ):
        return False
```

Don't know what the problem is with the other backends.
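For readers following along, here is a minimal, self-contained sketch of that gating pattern. The `get_current_attn_backend` call is replaced by a plain constructor argument, and only the backend names ("FLASH_ATTN", "FLASHINFER") come from the quoted snippet; everything else is hypothetical, not vLLM's actual code:

```python
# Hypothetical sketch of a backend-gated layout preference, modeled on the
# quoted snippet. The class name and constructor are illustrative; only the
# two backend names are taken from the code above.

# Backends that, per the quoted snippet, may use the cross-layer layout.
_CROSS_LAYER_BACKENDS = ("FLASH_ATTN", "FLASHINFER")


class KVCacheLayoutSelector:
    def __init__(self, attn_backend_name: str) -> None:
        self._attn_backend_name = attn_backend_name

    @property
    def prefer_cross_layer_blocks(self) -> bool:
        # Opt into the cross-layer KV cache layout only when the current
        # attention backend is in the allow-list; any other backend falls
        # back to the default per-layer layout.
        return self._attn_backend_name in _CROSS_LAYER_BACKENDS


print(KVCacheLayoutSelector("FLASH_ATTN").prefer_cross_layer_blocks)   # True
print(KVCacheLayoutSelector("TRITON_ATTN").prefer_cross_layer_blocks)  # False
```

An allow-list like this is conservative by design: backends are only opted in once the layout is known to work with them, which matches the off-by-default stance of the re-introduction.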
Eventually yes.
Just wondering: the change was merged 5 weeks ago. Did you get any positive or negative feedback?
Good point.
My concern about this approach is that users aren't aware of the functionality, and even if we tell the users we work with directly about it, we will only reach a minority of the users who could benefit from it.
Fully reverts #30207 and its #33052 follow-up.