Enable Cross layers KV cache layout at NIXL Connector V2#33339
NickLucche merged 94 commits into vllm-project:main
Signed-off-by: Liran Schour <lirans@il.ibm.com>
Co-authored-by: Or Ozeri <or@ozery.com> Signed-off-by: liranschour <liranschour@users.noreply.github.com>
"CROSS_LAYERS_BLOCKS=True GPU_MEMORY_UTILIZATION=0.8 MODEL_NAMES=deepseek-ai/deepseek-vl2-tiny" # MLA case
"CROSS_LAYERS_BLOCKS=True GPU_MEMORY_UTILIZATION=0.8 PREFILLER_TP_SIZE=1 DECODER_TP_SIZE=2 MODEL_NAMES=deepseek-ai/deepseek-vl2-tiny"
"CROSS_LAYERS_BLOCKS=True GPU_MEMORY_UTILIZATION=0.8 PREFILLER_TP_SIZE=2 DECODER_TP_SIZE=1 MODEL_NAMES=deepseek-ai/deepseek-vl2-tiny"
)
You can refactor this to just add CROSS_LAYERS_BLOCKS=True to tp_configs, assuming all of the configurations above are compatible.
else
  echo "CROSS_LAYERS_BLOCKS is not set, skipping --enable-cross-layers runs."
fi
nit: no need to echo out disabled options imo
Removed that echo
if current_platform.device_type != "cpu"
else -2
qq: this is untested on cpu right?
I don't think we need this special case.
We should be able to correctly set block_size_position using test_shape even when running on CPU.
Removed this special case.
block_size_position is now set from kv_cache_shape only.
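The idea of locating the block-size dimension purely from the shape, with no platform check, can be sketched as follows. This is a minimal illustration and not the connector's actual code: `find_block_size_position` and both example shapes are hypothetical, and it assumes the probe block size is chosen so that it cannot be confused with any other dimension.

```python
def find_block_size_position(kv_cache_shape, block_size):
    """Return the negative index of the block-size dimension in a KV cache shape.

    Hypothetical helper: assumes `block_size` appears exactly once in the shape,
    which a test shape built from distinct sentinel values guarantees.
    """
    matches = [i for i, dim in enumerate(kv_cache_shape) if dim == block_size]
    if len(matches) != 1:
        raise ValueError(
            f"block_size={block_size} must appear exactly once in {tuple(kv_cache_shape)}"
        )
    # A negative index stays valid if leading dimensions are added or dropped.
    return matches[0] - len(kv_cache_shape)


# Illustrative shapes only; real backends may order dimensions differently.
blocks_then_tokens = (2, 1234, 16, 8, 256)  # (kv, num_blocks, block_size, heads, head_size)
heads_then_tokens = (2, 1234, 8, 16, 256)   # (kv, num_blocks, heads, block_size, head_size)

pos_a = find_block_size_position(blocks_then_tokens, block_size=16)
pos_b = find_block_size_position(heads_then_tokens, block_size=16)
```

Because the position comes from the shape itself, the same lookup works on any device, which is the point of dropping the CPU branch.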
expected_base_addrs: list[int]
expected_num_entries: int
kv_caches: dict[str, torch.Tensor]
if connector.prefer_cross_layer_blocks:
This assumes that connector.prefer_cross_layer_blocks was correctly parsed from the test's enable_cross_layers parameter.
Can you assert that?
Added an assert for that
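A guard of the kind requested here might look like the sketch below. `FakeConnector` and `check_layout_flag` are hypothetical stand-ins for illustration; only the attribute name `prefer_cross_layer_blocks` and the parameter name `enable_cross_layers` come from the review thread.

```python
from dataclasses import dataclass


@dataclass
class FakeConnector:
    """Hypothetical stand-in for the connector under test."""

    prefer_cross_layer_blocks: bool


def check_layout_flag(connector, enable_cross_layers):
    # Fail fast if the test parameter did not reach the connector, so the
    # layout-specific expectations that follow are not checked against the
    # wrong branch.
    assert connector.prefer_cross_layer_blocks == enable_cross_layers, (
        f"expected prefer_cross_layer_blocks={enable_cross_layers}, "
        f"got {connector.prefer_cross_layer_blocks}"
    )
    return connector.prefer_cross_layer_blocks
```

Placing the assert before the `if` means a misconfigured test fails with a clear message instead of silently taking the other branch.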
…t#33339) Signed-off-by: Liran Schour <lirans@il.ibm.com> Signed-off-by: liranschour <liranschour@users.noreply.github.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>
Purpose
Enable the NIXL Connector to use the new continuous cross-layer KV cache layout described in the RFC and implemented in #27743.
Demonstrates a performance improvement of more than 2x in tok/sec and TTFT, due to a dramatic reduction in fragmentation of transfer buffers.
Tested with P!=D using run_accuracy_test.sh (P=1, D=2).
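To make the fragmentation argument concrete, the sketch below counts how many separate memory regions a transfer has to describe under each layout. It rests on a simplifying assumption that is not stated in the PR: a per-layer layout needs one region per (block, layer), doubled when K and V are stored separately, while a cross-layer layout keeps all layers of a block in one contiguous region. The function name and the numbers are hypothetical, not measurements from this PR.

```python
def regions_per_transfer(num_blocks, num_layers, cross_layer, split_kv=True):
    """Count the contiguous regions needed to move `num_blocks` KV blocks.

    Assumption (illustrative only): per-layer layout -> one region per
    (block, layer), times 2 if K and V live in separate buffers;
    cross-layer layout -> one region per block.
    """
    if cross_layer:
        return num_blocks
    regions_per_layer = 2 if split_kv else 1
    return num_blocks * num_layers * regions_per_layer


# Hypothetical request: 64 blocks on a 32-layer model.
per_layer = regions_per_transfer(num_blocks=64, num_layers=32, cross_layer=False)
cross = regions_per_transfer(num_blocks=64, num_layers=32, cross_layer=True)
reduction = per_layer // cross
```

Under these assumptions the same bytes travel in far fewer, larger regions, which is the effect the description credits for the throughput and TTFT gains.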
Test Plan
Test Result