Commit 0d003f8
[KVCache] Add implicit KVCache reuse, disable stateful option for chatCompletion (#359)
We introduced the field `stateful` in `chatCompletion()` earlier to allow easier multi-round chatting in #330. However, this is not ideal, since we prefer APIs that are functional in behavior, which gives us various benefits (e.g. better fault tolerance for future use cases). Therefore, in this PR:

- We disable `chatCompletionRequest.stateful` and ask users to maintain the chat history explicitly.
- Instead, we introduce implicit KVCache reuse for multi-round chatting.
- When we detect that a user is doing multi-round chatting, we do not reset the KV cache, so only the new message is prefilled.
- To detect multi-round chatting, we instantiate a `Conversation` instance for each request and compare it with the current internal `Conversation`. If they match, we can safely keep the internal state and prefill only the new input.

To see the behavior, check out `mainMultiroundChat()` in `examples/openai-api/src/openai_api.ts`.

Implementation details:

- We instantiate the `Conversation` object in `ChatModule.prefill()`, since this is where the various workflows meet (streaming, non-streaming, n > 1, etc.).
- The object's state is determined by the system prompt, the message history, and function calling usage.
- Inside `prefill()`, we then compare the two objects with `compareConversationObject()` and reset all internal state if they do not match (see the sketch below).
- Another detail: instead of overriding `conversation.config.system_message`, we add a field `conversation.override_system_message`, keeping `conversation.config` protected.
- We further remove all methods in `ChatModule` that override `this.getPipeline().conversation` by changing `updateConversationWithChatCompletionMessages()` to `getConversationFromChatCompletionRequest()`, keeping things more functional internally.
1 parent 0ca0c58 commit 0d003f8
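To make the mechanism concrete, here is a minimal TypeScript sketch of the detection flow described above. This is NOT the actual WebLLM source: only the names `Conversation`, `compareConversationObject()`, `getConversationFromChatCompletionRequest()`, `override_system_message`, and `prefill()` come from the commit message; every shape, field, and signature below is an illustrative assumption.

type Message = { role: "system" | "user" | "assistant"; content: string };

interface Conversation {
  // Added by this commit: overrides the system prompt without mutating
  // conversation.config, which stays protected.
  override_system_message?: string;
  messages: Message[];          // assumed representation of the history
  useFunctionCalling: boolean;  // assumed field name
}

interface ChatCompletionRequest {
  messages: Message[];
}

// Assumed construction: pull the system prompt out of the request and keep the
// rest as history, so conversation.config is never touched.
function getConversationFromChatCompletionRequest(
  request: ChatCompletionRequest
): Conversation {
  const system = request.messages.find((m) => m.role === "system");
  return {
    override_system_message: system?.content,
    messages: request.messages.filter((m) => m.role !== "system"),
    useFunctionCalling: false, // would be derived from the request's tools
  };
}

// Assumed matching rule: same system prompt, same function-calling usage, and
// identical message history.
function compareConversationObject(a: Conversation, b: Conversation): boolean {
  return (
    a.override_system_message === b.override_system_message &&
    a.useFunctionCalling === b.useFunctionCalling &&
    JSON.stringify(a.messages) === JSON.stringify(b.messages)
  );
}

// Conceptual decision made inside ChatModule.prefill(): compare the internal
// conversation with the one implied by the request, minus the newest user
// input (which is exactly the part that still needs prefilling).
function shouldReuseKVCache(
  internal: Conversation,
  request: ChatCompletionRequest
): boolean {
  const fromRequest = getConversationFromChatCompletionRequest(request);
  fromRequest.messages = fromRequest.messages.slice(0, -1);
  // true  -> keep the KV cache, prefill only the new user message
  // false -> reset all internal state, prefill the whole request
  return compareConversationObject(internal, fromRequest);
}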

File tree

11 files changed: +436 −216 lines


examples/openai-api/src/openai_api.ts

Lines changed: 43 additions & 21 deletions
@@ -94,45 +94,66 @@ async function mainStreaming() {
 }
 
 /**
- * We demonstrate stateful chat completion, where chat history is preserved across requests.
+ * We demonstrate multi-round chatting. Though users are required to maintain the chat history,
+ * internally we compare the provided `messages` with the internal chat history. If they match, we
+ * reuse KVs and hence save computation -- essentially an implicit internal optimization.
  */
-async function mainStateful() {
+async function mainMultiroundChat() {
   const chat: webllm.ChatInterface = new webllm.ChatModule();
-
   chat.setInitProgressCallback((report: webllm.InitProgressReport) => {
     setLabel("init-label", report.text);
   });
 
   await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
 
+  // Round 0
+  const messages: webllm.ChatCompletionMessageParam[] = [
+    {
+      "role": "system",
+      "content": "[INST] <<SYS>>\n\nYou are a helpful, respectful and honest assistant. " +
+        "Be as happy as you can when speaking please.\n<</SYS>>\n\n "
+    },
+    { "role": "user", "content": "Provide me three US states." },
+  ];
+
   const request0: webllm.ChatCompletionRequest = {
-    stateful: true,
-    // stream: true, // works with and without streaming
-    messages: [
-      {
-        "role": "system",
-        "content": "[INST] <<SYS>>\n\nYou are a helpful, respectful and honest assistant. " +
-          "Be as happy as you can when speaking please.\n<</SYS>>\n\n "
-      },
-      { "role": "user", "content": "Provide me three US states." },
-    ],
+    stream: false, // can be streaming, same behavior
+    messages: messages,
   };
 
   const reply0 = await chat.chatCompletion(request0);
+  const replyMessage0 = await chat.getMessage();
   console.log(reply0);
-  console.log(await chat.getMessage());
+  console.log(replyMessage0);
+
+  // Round 1
+  // Append generated response to messages
+  messages.push({ "role": "assistant", "content": replyMessage0 });
+  // Append new user input
+  messages.push({ "role": "user", "content": "Two more please!" });
+  // The line below would cause an internal reset (clear KV cache, etc.) since the history
+  // would no longer match the new request
+  // messages[0].content = "Another system prompt";
 
   const request1: webllm.ChatCompletionRequest = {
-    stateful: true,
-    // stream: true, // works with and without streaming
-    messages: [
-      { "role": "user", "content": "Two more please!" },
-    ],
+    stream: false, // can be streaming, same behavior
+    messages: messages
   };
 
   const reply1 = await chat.chatCompletion(request1);
+  const replyMessage1 = await chat.getMessage();
   console.log(reply1);
-  console.log(await chat.getMessage());
+  console.log(replyMessage1);
+
+  // If multi-round chat was used, request1 should only prefill a small number of tokens
+  const prefillTokens0 = reply0.usage?.prompt_tokens;
+  const prefillTokens1 = reply1.usage?.prompt_tokens;
+  console.log("Request 0 prompt tokens: ", prefillTokens0);
+  console.log("Request 1 prompt tokens: ", prefillTokens1);
+  if (prefillTokens0 === undefined || prefillTokens1 === undefined ||
+      prefillTokens1 > prefillTokens0) {
+    throw Error("Multi-round chat is not triggered as expected.");
+  }
 
   console.log(await chat.runtimeStatsText());
 }

@@ -195,4 +216,5 @@ async function mainFunctionCalling() {
 // Run one of the functions
 // mainNonStreaming();
 // mainStreaming();
-mainFunctionCalling();
+// mainFunctionCalling();
+mainMultiroundChat();
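The same pattern extends to further rounds: keep appending to the same `messages` array and resend it, and prefill stays incremental as long as the history is only appended to. Below is a continuation sketch under the same setup as the diff above (it reuses the `chat`, `messages`, and `replyMessage1` variables from the example; the new user prompt is made up for illustration):

// Round 2: append the previous reply and a new user turn, then resend.
messages.push({ "role": "assistant", "content": replyMessage1 });
messages.push({ "role": "user", "content": "And two more after that!" });

const request2: webllm.ChatCompletionRequest = {
  stream: false,
  messages: messages,
};
const reply2 = await chat.chatCompletion(request2);
console.log(await chat.getMessage());
// As in round 1, reply2.usage?.prompt_tokens should stay small, since only
// the two newly appended messages need to be prefilled.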

examples/web-worker/src/main.ts

Lines changed: 0 additions & 2 deletions
@@ -62,7 +62,6 @@ async function mainOpenAIAPINonStreaming() {
   await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
 
   const request: webllm.ChatCompletionRequest = {
-    // stateful: true, // set this optionally to preserve chat history
     messages: [
       {
         "role": "system",

@@ -102,7 +101,6 @@ async function mainOpenAIAPIStreaming() {
   await chat.reload("Llama-2-7b-chat-hf-q4f32_1");
 
   const request: webllm.ChatCompletionRequest = {
-    // stateful: true, // set this optionally to preserve chat history
     stream: true,
     messages: [
       {