[Experimental] Prefix Caching Support #1669
Conversation
@DouHappy Can you try with calling
My test script: I found a bug where running on two GPUs is slower than on a single GPU. The prefix's `on_gpu` state is always False before `prepare_inputs()` when using two GPUs, but it works fine on a single GPU. That means `multi_query_cached_kv_attention` is never used when running on multiple GPUs. My latest test also passes on a single GPU, but it takes about 60s on two GPUs.
vllm/worker/worker.py (Outdated)

```python
                prompt_len - 1)
            selected_token_start_idx += max_seq_len

        # set the prefix state
```
When tp > 1, `seq_group_metadata.prefix` here is copied by the Ray workers, so `on_gpu=True` won't work on multiple GPUs.
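The copy problem described above can be illustrated without Ray. This is a minimal sketch (the `Prefix` class here is hypothetical, not vLLM's actual implementation): when tensor parallelism is enabled, the metadata object is serialized and shipped to each worker, so a flag set on one copy is invisible to the others, much like mutating a `copy.deepcopy`.

```python
# Sketch: why an in-place flag on a driver-side object is not seen by
# Ray workers. Ray pickles task arguments; copy.deepcopy stands in for
# that serialization boundary here. The Prefix class is a simplified,
# hypothetical stand-in for vLLM's prefix metadata.
import copy

class Prefix:
    def __init__(self, token_ids):
        self.token_ids = token_ids
        self.on_gpu = False  # set True once the prefix KV cache is materialized

driver_prefix = Prefix([1, 2, 3])

# Simulate shipping the metadata to a worker (Ray would pickle it).
worker_prefix = copy.deepcopy(driver_prefix)
worker_prefix.on_gpu = True   # the worker marks only its own copy

print(worker_prefix.on_gpu)   # True
print(driver_prefix.on_gpu)   # False: the driver never sees the update
```

This is why a shared flag like `on_gpu` must either live on the worker side or be communicated explicitly, rather than mutated on a copied object.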
Thank you for your reply. This helps me a lot.
Co-authored-by: DouHappy <2278958187@qq.com>
zhuohan123
left a comment
Thanks for the great work! Can you also merge with the latest main branch? I will test the PR after the merge.
Consider adding another example just for prefix caching?
Thanks a lot for this great feature! Hi @DouHappy, did you observe any speed improvement afterwards?
Yes, I did observe a speedup. Could you show me your test script? Maybe you forgot the warmup? BTW, I am writing an introduction to prefix caching, but only a Chinese version for now. See vLLM-prefix浅析(System Prompt,大模型推理加速) (a brief analysis of vLLM prefix caching, in Chinese). @franklyd
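The warmup point above matters for any prefix-caching benchmark: the first request pays the one-time cost of building the prefix's KV cache, so timing it skews the result. A minimal illustrative sketch (pure Python, not vLLM; `build_prefix_state` is a hypothetical stand-in for the cached prefix computation):

```python
# Sketch: why skipping warmup skews a speedup measurement.
# The first call pays a one-time cost (simulated with a sleep inside a
# memoized function); only the steady state after warmup should be timed.
import time
from functools import lru_cache

@lru_cache(maxsize=None)
def build_prefix_state(prefix):
    # Stand-in for computing and caching the prefix KV cache (expensive).
    time.sleep(0.05)
    return hash(prefix)

def generate(prefix, suffix):
    state = build_prefix_state(prefix)  # cache hit after warmup
    return state ^ hash(suffix)

prefix = "You are a helpful assistant."

generate(prefix, "warmup request")      # warmup: populates the cache

start = time.perf_counter()
for i in range(100):
    generate(prefix, f"question {i}")
steady = time.perf_counter() - start
print(f"steady-state time for 100 calls: {steady:.4f}s")
```

Without the warmup call, the 0.05s one-time cost would be folded into the measured loop and mask the benefit of the cache.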
zhuohan123
left a comment
LGTM! Thanks for the great work! I pushed some style refactors myself.
Co-authored-by: DouHappy <2278958187@qq.com>
Co-authored-by: Zhuohan Li <zhuohan123@gmail.com>
HaiShaw
left a comment
I think this merge invalidates the FP8 KV cache (#2279).
Look at the kernels in prefix_prefill.py: when the FP8 KV cache is on, K/V and K_cache/V_cache are now different types.
Please let us know the best way to move forward, thanks!
Could you provide a test script for the speedup?
+1
Hi @HaiShaw, Triton doesn't seem to support mixed-precision dot products, so this kernel here fails if the
Hi @AlpinDale, are you using prefix caching with the FP8 KV cache? The PyTorch and Triton versions used by vLLM cannot support the FP8 KV cache. There is more information about prefix caching and the FP8 KV cache in #3234.
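The dtype-mismatch issue discussed above can be sketched in miniature. This is an illustrative pure-Python example, not the Triton kernel: when K/V are stored in a low-precision quantized format while Q stays in higher precision, the two operands must be brought to a common type (e.g. by dequantizing the cache values) before the dot product. The `quantize`/`dequantize` helpers here are crude hypothetical stand-ins for FP8 storage.

```python
# Sketch: a mixed-dtype attention score cannot multiply raw quantized
# cache values against float activations; dequantize first, then
# accumulate in float. Round-to-grid simulates low-precision storage.
def quantize(x, scale=16.0):
    """Crude stand-in for FP8 storage: round onto a coarse grid."""
    return round(x * scale)

def dequantize(q, scale=16.0):
    return q / scale

q_vec = [0.50, -0.25, 0.125]                          # query, "high precision"
k_cache = [quantize(v) for v in [0.40, 0.30, -0.20]]  # stored "FP8" ints

# Correct path: dequantize each cached value before the dot product.
score = sum(q * dequantize(k) for q, k in zip(q_vec, k_cache))
print(score)
```

The small difference between `score` and the exact full-precision dot product (0.1) is the quantization error that FP8 storage trades for memory savings.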
Brilliant feature, thanks!
### What this PR does / why we need it?

Improvements:
- Keep the same file name format as v1: `offline_inference_npu_v0.py`, `offline_inference_npu_v1.py`
- Set `VLLM_USE_V1` = 0/1 explicitly in the Python scripts
- Fix some run errors in `offline_inference_npu_v1.py`, e.g. `deepseekv3-lite-base-latest` does not exist on ModelScope or HF

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- vLLM version: v0.9.2
- vLLM main: vllm-project@baed180

Signed-off-by: xleoken <xleoken@163.com>


add prefix caching support
Section 1 (Basic Functionality):
Todo:
- Automatic Prefix Caching Support -- SGLang RadixAttention
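The RadixAttention idea referenced in the todo can be sketched with a simple trie over token IDs: new requests look up their longest already-cached prefix and reuse its KV cache instead of recomputing it. A minimal, hypothetical sketch (pure Python, greatly simplified; real implementations store radix-compressed edges and KV block handles, and add eviction):

```python
# Sketch of automatic prefix caching via a trie over token IDs.
# insert() records that a prefix's KV cache exists;
# longest_cached_prefix() returns how many leading tokens are reusable.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.has_kv = False  # KV cache materialized for this prefix?

class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, token_ids):
        node = self.root
        for t in token_ids:
            node = node.children.setdefault(t, TrieNode())
            node.has_kv = True

    def longest_cached_prefix(self, token_ids):
        """Number of leading tokens whose KV cache can be reused."""
        node, matched = self.root, 0
        for i, t in enumerate(token_ids):
            child = node.children.get(t)
            if child is None or not child.has_kv:
                break
            node, matched = child, i + 1
        return matched

trie = PrefixTrie()
trie.insert([1, 2, 3, 4])
print(trie.longest_cached_prefix([1, 2, 3, 9]))  # -> 3 tokens reusable
print(trie.longest_cached_prefix([7, 8]))        # -> 0, no shared prefix
```

Compared with the explicit prefix argument in this PR, the trie makes sharing automatic: any two requests with a common prompt prefix share the cached blocks without the caller declaring the prefix up front.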