# RFC 001 - Assistant Prefix #8

* Start Date: 2025-06-01
* RFC PR: #8
* RFC Issue: #1
* Status: Draft

# Summary

Assistant prefixes (aka prefills) are a way for a user to supply a preset prefix for a
model to continue generating from.

Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

Example Response:

```
.choices[0].message.content = " cross the road? To get to the other side!"
```

# Motivation

Assistant prefixes are useful for constraining model outputs, branched generation, and
exploring the tree of possible model generations. They are widely used on providers that
support them, such as Anthropic and OpenRouter.

However, there are currently three interfaces for requesting an assistant prefix
(sketched below):

* A trailing (final) assistant message
* `"prefix": true` on a trailing assistant message
* `"continue_final_message": true` on the request body

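For illustration, the same prefix request under each of these three interfaces might look
roughly like the following. These are sketches, not exact provider payloads; details vary
by provider.

```python
# 1. Bare trailing assistant message, no flag (e.g. OpenRouter-style):
body_trailing = {
    "messages": [
        {"role": "user", "content": "Tell me a joke."},
        {"role": "assistant", "content": "Why did the chicken"},
    ],
}

# 2. "prefix": true on the trailing assistant message (Deepseek-style):
body_prefix_flag = {
    "messages": [
        {"role": "user", "content": "Tell me a joke."},
        {"role": "assistant", "content": "Why did the chicken", "prefix": True},
    ],
}

# 3. continue_final_message on the request body (e.g. some open-source servers):
body_continue = {
    "messages": [
        {"role": "user", "content": "Tell me a joke."},
        {"role": "assistant", "content": "Why did the chicken"},
    ],
    "continue_final_message": True,
}
```
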
## Proposal goals

* Develop a standard interface for assistant prefixes.
* Disambiguate requests for assistant prefixes from requests for consecutive
  assistant messages.
* Maintain backwards compatibility with implementations using current interfaces.

# Detailed design

This RFC introduces a flag, `prefix`, on the final message, with three possible states:
`true`, `false`, and unset.

## "prefix": true

If the value of "prefix" on the trailing assistant message is true and assistant
prefixes are supported for the requested model, the server MUST render the message
content to the chat template without a new turn:

Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

> **Reviewer:** What other examples of per-turn extra attributes exist in the ecosystem
> already? While this is a clear solution, tool calls might actually be another example,
> but maybe that is another point for another RFC, since it does seem a bit inconsistent
> and poorly defined.
>
> **Author:** I think it's a decent pattern for clarifying the intent of a message,
> especially given Deepseek already uses it, though I definitely want to avoid a "cat
> came back from Berkeley waving flags" situation :-) We definitely don't want 5
> mandatory flags on every message.

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken
```

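One way a server could produce this kind of rendering is with Hugging Face
`transformers`, whose `apply_chat_template` method (in recent versions) accepts a
`continue_final_message` option. This is only an illustrative sketch of one possible
backend, not a normative implementation; the model name is a placeholder.

```python
from transformers import AutoTokenizer

# Placeholder model name; any model whose chat template supports continuing
# the final message would work similarly.
tokenizer = AutoTokenizer.from_pretrained("my-org/my-chat-model")

messages = [
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken"},
]

# continue_final_message=True renders the trailing assistant message without
# closing its turn, so generation continues directly after "Why did the chicken".
prompt_ids = tokenizer.apply_chat_template(
    messages,
    continue_final_message=True,
    add_generation_prompt=False,
    tokenize=True,
)
```
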
The server MUST respond with only the generated content, not including the prefix:

Example:

```json
"messages": [
  {
    "role": "user",
    "content": "Classify this color as either { type: light } or { type: dark }: blue"
  },
  { "role": "assistant", "content": "```json\n", "prefix": true }
],
"stop": "\n```"
```

Example Response:

```
.choices[0].message.content = '{ "type": "dark" }'
```

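Because the response contains only the newly generated text, a client that wants the
complete assistant message must prepend the prefix itself. A minimal sketch, where
`call_llm` is a stand-in for whatever function sends the request and returns
`.choices[0].message.content`:

```python
def complete_with_prefix(call_llm, user_prompt: str, prefix: str) -> str:
    """Send a prefixed request and return the full assistant message."""
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": prefix, "prefix": True},
    ]
    generated = call_llm(messages)
    # The server returns only the continuation, so re-attach the prefix
    # to reconstruct the complete assistant message.
    return prefix + generated
```

In the classification example above, the reconstructed message would be the opening code
fence plus the generated JSON object.
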
### Token counting

The server MUST count assistant prefix tokens as prompt tokens, not completion tokens,
for the purposes of `max_tokens` and `usage`.

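As a small illustration of this accounting (the token counts below are invented, and an
OpenAI-style response object with a `usage` field is assumed):

```python
def check_prefix_accounting(response) -> None:
    # Suppose the rendered prompt, including the assistant prefix, came to 37
    # tokens and the model generated 12 new tokens. (Invented numbers.)
    assert response.usage.prompt_tokens == 37      # prefix tokens count here...
    assert response.usage.completion_tokens == 12  # ...not here
    # max_tokens likewise bounds only the generated tokens, not the prefix.
```
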
### Errors

If assistant prefixes are not supported for the requested model, the server MUST respond
with `invalid_request_error`.

> **Reviewer:** Ideally a base layer of this spec would define an error categorization,
> including a differentiation between invalid requests and unsupported requests. Right
> now you're just using the undefined term `invalid_request_error`.
>
> **Author:** Agreed, one of the next RFCs should be laying out at least the basic
> definitions for these sorts of API primitives.
>
> **Author:** Issue for errors: #10

| ## "prefix": false | ||
|
|
||
| If the value of "prefix" on the trailing assistant message is false, the server MUST NOT | ||
| render the message as a prefix. The message MUST be rendered as a separate turn. The | ||
| message is not an assistant prefix. | ||
|
|
||
Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken cross the road?", "prefix": false }
]
```

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken cross the road?

<begin_turn><role>assistant</role>
```

### Errors

If the requested model does not support multiple consecutive assistant messages, the
server MUST return an `invalid_request_error`.

| ## "prefix" unset | ||
|
|
||
| If the prefix flag is unset, the result is implementation-defined. The server MAY treat | ||
| the trailing assistant message as an assistant prefix, MAY render it as a new turn, or | ||
| MAY return invalid_request_error. | ||
|
|
||
| The server MAY use non-standard flags (such as `continue_final_message` on the request | ||
| body) to decide how to handle a trailing assistant message with an unset prefix flag. | ||
|
|
||
| *Note: This is left implementation-defined for backwards compatibility.* | ||
|
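To make the three states concrete, here is a rough sketch of how a server might dispatch
on the trailing message's `prefix` flag. All names here (`InvalidRequestError`,
`resolve_prefix_mode`) are illustrative placeholders, and the unset branch is
deliberately implementation-defined:

```python
class InvalidRequestError(Exception):
    """Stand-in for an invalid_request_error API response."""


def resolve_prefix_mode(messages: list[dict], body: dict) -> bool:
    """Return True if the trailing assistant message should be treated as a prefix."""
    last = messages[-1]
    if last.get("role") != "assistant":
        return False  # no trailing assistant message, nothing to decide

    prefix = last.get("prefix")
    if prefix is True:
        return True   # MUST render without a new turn (or error if unsupported)
    if prefix is False:
        return False  # MUST render as a separate assistant turn
    if prefix is not None:
        raise InvalidRequestError('"prefix" must be a boolean')

    # Unset: implementation-defined. This sketch falls back to the non-standard
    # continue_final_message body flag; other servers may choose differently.
    return bool(body.get("continue_final_message", False))
```
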
> **Reviewer:** Open question: how much do we want to consider backwards compatibility
> for current vendors? If we start with cut outs for back compat then we are starting
> with a shaky foundation. If an app is built with this standard in mind, but then a
> standard-compliant vendor (because of this backwards compatibility) can return
> undefined behavior, then I don't think the standard is doing its job.
>
> **Author:** Yeah, that's a good question. My current feeling is that we should enable
> backwards compatibility when we can / if it doesn't adversely impact the standard too
> much, but also be willing to break things when needed. Basically I'm hoping we can do
> a Python 3.x -> 3.x+1 upgrade here, and not a Python 2 -> Python 3 :-) That said, I do
> anticipate we'll need to break some things, like token strings in top_logprobs. I just
> want to be thoughtful about it. My thought on trailing assistant messages here is that
> specifically OpenRouter is huge, and they currently support prefixes via an unmarked
> trailing assistant message. It would be a ton of work on their end to migrate every
> client over to explicitly marking the prefix, and it seems like a minor cut-out to
> allow this to be implementation-defined (which is the status quo right now anyways),
> especially because we can potentially add type checking / warnings in the standard
> client as a client-level check in the future, or even default to `prefix: false`
> instead of unset in the client.
>
> **Reviewer (@domenic):** I agree that leaving this ambiguous is pretty unfortunate.
> This is basically stating that clients which want interoperable outputs will need to
> always include a trailing `prefix` flag.
>
> **Author:** @domenic I agree that is unfortunate, though if we were to standardize the
> behavior of a trailing assistant message without a flag, what would the behavior be?
> I'm actually not sure which would be the better behavior to default to, which may be a
> point in favor of requiring it to be explicit. (Which we should be able to do with
> type hints in the client even if we allow it to be ambiguous in the API.) But I'm not
> sure about this / am open to being convinced otherwise.
>
> **Author:** We could also just require it in the API as well, if we were willing to
> break backwards compatibility. Maybe another middle ground would be upgrading the MAY
> to SHOULD NOT allow a trailing message without a flag?
>
> **Reviewer:** I think that generally for booleans, you want omitting the boolean to
> mean the same as setting it to `false`. But yeah, I see there's tricky constraints
> here about whether you can afford to be strict when creating a new spec that's hoping
> to get adoption. One possible solution (at the cost of complexity) is conformance
> levels. You can have "strict standard completions compliant" which might be hard for
> existing providers to meet the standards of, and "loose standard completions
> compliant" which has a lot more wiggle room and undefined/underdefined behavior to
> allow existing providers to match it. So e.g. OpenRouter or Anthropic could be
> loose-compliant until/unless they're willing to take a breaking change and switch the
> behavior of unflagged trailing assistant messages.
>
> **Author:** I like the idea of conformance levels!

## Non-boolean "prefix" values

Values of "prefix" that are not `true` or `false` are an error. The server MUST respond
with an `invalid_request_error`. In the future, some non-boolean values may be
supported; see *Unresolved questions*.

| ## "prefix" on user messages | ||
|
|
||
| A "prefix" field on a user message is an error. The server MUST respond with an | ||
| `invalid_request_error`. | ||
|
|
||
| ## "prefix" on non-trailing assistant messages | ||
|
|
||
| A "prefix" field on a non-trailing assistant message MUST be ignored. The server MAY | ||
| emit a warning. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reviewing this, I'm a bit unsure why we would have this be ignored, instead of generating invalid_request_error. It seems like a pretty clear developer coding error, and catching those earlier is probably good?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm, good question. My thought was that people may have code like this: messages = [
{ "role": "user", "content": "blah blah blah" },
{ "role": "assistant", "content": "Certainly:", "prefix": True },
]
response_str = call_llm(messages)
messages[-1]["content"] = response
messages.append({ "role": "user", "content": "blah blah blah" })
response2_str = call_llm(messages)And it might be nice to not require them to also clear the prefix flag. But I'm open to changing it if people think it'd be better to make this an error, it's not too onerous I don't think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that's a pretty good reason to keep it as-is, thanks for explaining. Although, did you check what DeepSeek does? |
||
|
|
||
Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  {
    "role": "assistant",
    "content": "Why did the chicken cross the road? To get to the other side!",
    "prefix": true
  },
  { "role": "user", "content": "Good one!" }
]
```

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken cross the road? To get to the other side!

<begin_turn><role>user</role>
Good one!

<begin_turn><role>assistant</role>
```

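Pulling the last three rules together (non-boolean values, user messages, and
non-trailing assistant messages), a server's request validation might look roughly like
the following sketch. `InvalidRequestError` is the same illustrative stand-in used
earlier, and the warning mechanism is a placeholder:

```python
import warnings


class InvalidRequestError(Exception):
    """Stand-in for an invalid_request_error API response."""


def validate_prefix_fields(messages: list[dict]) -> None:
    last_index = len(messages) - 1
    for i, msg in enumerate(messages):
        if "prefix" not in msg:
            continue
        if msg["role"] == "user":
            # "prefix" on a user message is always an error.
            raise InvalidRequestError('"prefix" is not allowed on user messages')
        if not isinstance(msg["prefix"], bool):
            raise InvalidRequestError('"prefix" must be true or false')
        if msg["role"] == "assistant" and i != last_index:
            # Ignored on non-trailing assistant messages; the server MAY warn.
            warnings.warn(f'ignoring "prefix" on non-trailing message at index {i}')
```
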
# Alternatives

## Prefix via a trailing assistant message, without a flag

This approach is currently the most popular; however, it does not allow for consecutive
assistant messages, which many models now support. Additionally, it is ambiguous when
the server does not support assistant prefixes for a model: the user cannot easily tell
whether a trailing assistant message is actually being rendered without a new turn, or
whether the model is simply acting as if it were, for the given prompt.

The unset prefix behavior allows for backwards compatibility with this approach.

## Prefix via a flag on the body, such as `continue_final_message`

This approach is less flexible for users who wish to structure their code such that
sampling parameters are centralized in a single method:

```python
def make_request(client, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        messages=messages,
        ...
    )
    return response.choices[0].message.content


joke = make_request(client, [
    { "role": "user", "content": "Tell me a joke" },
    { "role": "assistant", "content": "Sure! Here's a knock-knock joke:", "prefix": True },
])

humor_level = make_request(client, [
    { "role": "user", "content": f'Rate this joke from 0-10: {joke}' },
    { "role": "assistant", "content": 'I would rate this joke "', "prefix": True },
])
```

The unset prefix behavior allows for backwards compatibility with this approach: servers
may use non-standard flags on the request body to interpret trailing assistant messages
with unset "prefix" flags.

## Different flag names, such as "prefill" or "continue_message"

This RFC attempts to avoid the name "prefill", as it clashes with
[prefill](https://docs.vllm.ai/en/latest/examples/offline_inference/disaggregated_prefill.html?h=prefill)
in the context of inference serving.

Additionally, the Deepseek API
[already uses the "prefix" flag](https://api-docs.deepseek.com/guides/chat_prefix_completion)
for this behavior, so by adopting it we will have compatibility with Deepseek by default.

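For reference, the linked Deepseek guide shows this flag being used through the OpenAI
Python SDK roughly as follows. Details such as the beta base URL and model name are taken
from that guide and may change, so treat this as a sketch rather than a reference:

```python
from openai import OpenAI

# Per the linked guide, chat prefix completion is served from the beta base URL.
client = OpenAI(api_key="<DEEPSEEK_API_KEY>", base_url="https://api.deepseek.com/beta")

messages = [
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken", "prefix": True},
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
)
print(response.choices[0].message.content)  # continuation of the prefix
```
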
# Unresolved questions

## Feature flags

Servers should have a way to signal that a model supports assistant prefixes. This will
be considered in a future feature flag RFC. See issue #7.

## Reasoning prefixes

This RFC does not attempt to standardize how implementations should handle prefixes on
reasoning models. This is a very useful capability, so we should attempt to do so,
either in this RFC or in the future. One possibility would be to add a third value for
"prefix", `"prefix": "reasoning"`, that signals the prefix should be inserted into the
reasoning section.

We should also standardize how non-reasoning prefills work on reasoning models. Perhaps
they should be inserted after the end of reasoning.

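As a purely hypothetical illustration of the `"prefix": "reasoning"` idea (this value is
not part of this RFC's normative text):

```python
# Hypothetical request body: steer the model's reasoning section rather than its answer.
body = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "content": "Let me work through this step by step.",
            "prefix": "reasoning",  # hypothetical value, see above
        },
    ],
}
```
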