[Server] Support openai prefix cache #2515
Conversation
@esmeetu I also created a PR (https://github.com/vllm-project/vllm/pull/2516/files).
@Avinash-Raj I see, but there seem to be a few overlapping changes between us. I am also considering whether to add auto computing…
Hi, @Avinash-Raj. I introduced a new param…
I think it's a good point, but I still have mixed feelings about adding all these non-standard options.
Yeah, what you said is a better design, but it may not be related to this feature. Could you create a PR for your refactor?
Yes, probably after the merge of #2488 |
@esmeetu do you encounter an assertion error when using the prefix caching feature? |
@Avinash-Raj No, I didn't test on v0.3.0, but it worked when this PR was submitted.
Why was this PR closed? |
We implemented automatic prefix caching in #2762 and this API is no longer needed. |
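For reference, a minimal sketch of the replacement feature, assuming a vLLM version where automatic prefix caching is exposed via `enable_prefix_caching` on `LLM` (and `--enable-prefix-caching` on the server); no request-level markers are needed:

```python
from vllm import LLM, SamplingParams

# Offline inference: repeated prompt prefixes are cached automatically,
# no prefix_pos / prefix_stop markers in the request.
llm = LLM(model="your-model-name", enable_prefix_caching=True)

shared_prefix = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
)
outputs = llm.generate(
    [shared_prefix + "Hello", shared_prefix + "How are you?"],
    SamplingParams(max_tokens=64),
)
```

The OpenAI-compatible server takes the equivalent `--enable-prefix-caching` flag, and the request body stays standard.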
Adds support for the `prefix_pos` and `prefix_stop` parameters in the API server. `prefix_pos` is the position of the prefix string (its length - 1); `prefix_stop` is the prefix stop string of the prompt.

If we have the prompt

`Below is an instruction that describes a task. Write a response that appropriately completes the request.\nHello.`

we can use the prefix caching feature for the prefix string

`Below is an instruction that describes a task. Write a response that appropriately completes the request.`

So we can set `prefix_stop` to something special like `<|prefix|>`, and the real prompt should be

`Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello.`
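Roughly, the server-side handling could be sketched as below. This is only an illustration, not the actual patch: the helper name `split_prefix` is hypothetical, the position is counted in characters as in the description above, and the real engine would likely want a token index instead.

```python
# Hypothetical sketch: split the prompt at the prefix_stop marker,
# cache everything before it, and strip the marker from the real prompt.
def split_prefix(prompt: str, prefix_stop: str):
    """Return (real_prompt, prefix_pos) for a prompt containing prefix_stop."""
    if prefix_stop and prefix_stop in prompt:
        prefix, rest = prompt.split(prefix_stop, 1)
        # prefix_pos is the index of the last prefix character (length - 1)
        return prefix + rest, len(prefix) - 1
    return prompt, None

prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n<|prefix|>Hello")
real_prompt, prefix_pos = split_prefix(prompt, "<|prefix|>")
```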
Here is an example of how to use it. Bootstrap the OpenAI-compatible server, then send it the chat completion request below:

```json
{
  "model": "your-model-name",
  "messages": [
    {
      "role": "user",
      "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello"
    }
  ],
  "prefix_stop": "<|prefix|>"
}
```

Furthermore, if a model's system prompt already contains a special string such as `<|endoftext|>` that marks the end of the system prompt, this prefix caching feature can be used smoothly without any prompt changes.
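As a concrete client-side illustration, a request like the one above could be sent with plain `requests`; the endpoint path and port below assume the default settings of the OpenAI-compatible server:

```python
import requests

# Standard chat completion payload plus the extra prefix_stop field.
# The <|prefix|> marker delimits the cacheable prefix as described above.
payload = {
    "model": "your-model-name",
    "messages": [
        {
            "role": "user",
            "content": (
                "Below is an instruction that describes a task. "
                "Write a response that appropriately completes the request.\n"
                "<|prefix|>Hello"
            ),
        }
    ],
    "prefix_stop": "<|prefix|>",
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```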