[Server] Support openai prefix cache #2515
Conversation
@esmeetu I also created a PR (https://github.com/vllm-project/vllm/pull/2516/files).
@Avinash-Raj I see, but there seem to be a few overlapping changes between us. I am also considering whether to add auto computing…
Hi, @Avinash-Raj. I introduced a new param…
I think it's a good point, but I still have mixed feelings about adding all these non-standard options.
Yeah, what you said is a better design, but it may not be related to this feature. Could you create a PR for your refactor?
Yes, probably after the merge of #2488 |
@esmeetu do you encounter an assertion error when using the prefix caching feature? |
@Avinash-Raj No, I didn't test on v0.3.0, but it worked when this PR was submitted.
Why was this PR closed? |
We implemented automatic prefix caching in #2762 and this API is no longer needed. |
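For reference, a minimal sketch of the replacement feature, assuming a vLLM version where automatic prefix caching is exposed via `enable_prefix_caching` on `LLM` (and `--enable-prefix-caching` on the server); no request-level markers are needed:

```python
from vllm import LLM, SamplingParams

# Offline inference: repeated prompt prefixes are cached automatically,
# no prefix_pos / prefix_stop markers in the request.
llm = LLM(model="your-model-name", enable_prefix_caching=True)

shared_prefix = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
)
outputs = llm.generate(
    [shared_prefix + "Hello", shared_prefix + "How are you?"],
    SamplingParams(max_tokens=64),
)
```

The OpenAI-compatible server takes the equivalent `--enable-prefix-caching` flag, and the request body stays standard.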
Adds support for the `prefix_pos` and `prefix_stop` parameters in the API server. `prefix_pos` is the position of the prefix string (its length - 1); `prefix_stop` is the prefix stop string of the prompt.

If we have the prompt

`Below is an instruction that describes a task. Write a response that appropriately completes the request.\nHello.`

we can use the prefix caching feature for the prefix string

`Below is an instruction that describes a task. Write a response that appropriately completes the request.`

So we can set `prefix_stop` to something special like `<|prefix|>`, and the real prompt should be

`Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello.`
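Roughly, the server-side handling could be sketched as below. This is only an illustration, not the actual patch: the helper name `split_prefix` is hypothetical, the position is counted in characters as in the description above, and the real engine would likely want a token index instead.

```python
# Hypothetical sketch: split the prompt at the prefix_stop marker,
# cache everything before it, and strip the marker from the real prompt.
def split_prefix(prompt: str, prefix_stop: str):
    """Return (real_prompt, prefix_pos) for a prompt containing prefix_stop."""
    if prefix_stop and prefix_stop in prompt:
        prefix, rest = prompt.split(prefix_stop, 1)
        # prefix_pos is the index of the last prefix character (length - 1)
        return prefix + rest, len(prefix) - 1
    return prompt, None

prompt = ("Below is an instruction that describes a task. "
          "Write a response that appropriately completes the request.\n<|prefix|>Hello")
real_prompt, prefix_pos = split_prefix(prompt, "<|prefix|>")
```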
Here is an example of how to use it. Bootstrap the OpenAI-compatible server, then send it the chat completion request below:

```json
{
  "model": "your-model-name",
  "messages": [
    {
      "role": "user",
      "content": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n<|prefix|>Hello"
    }
  ],
  "prefix_stop": "<|prefix|>"
}
```

Furthermore, if a model's system prompt already contains a special string such as `<|endoftext|>` that marks the end of the system prompt, this prefix caching feature can be used smoothly without any prompt changes.
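As a concrete client-side illustration, a request like the one above could be sent with plain `requests`; the endpoint path and port below assume the default settings of the OpenAI-compatible server:

```python
import requests

# Standard chat completion payload plus the extra prefix_stop field.
# The <|prefix|> marker delimits the cacheable prefix as described above.
payload = {
    "model": "your-model-name",
    "messages": [
        {
            "role": "user",
            "content": (
                "Below is an instruction that describes a task. "
                "Write a response that appropriately completes the request.\n"
                "<|prefix|>Hello"
            ),
        }
    ],
    "prefix_stop": "<|prefix|>",
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```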