diff --git a/rfcs/001-assistant-prefix.md b/rfcs/001-assistant-prefix.md
new file mode 100644
index 0000000..3a4f141
--- /dev/null
+++ b/rfcs/001-assistant-prefix.md
@@ -0,0 +1,248 @@

* Start Date: 2025-06-01
* RFC PR: #8
* RFC Issue: #1
* Status: Draft

# Summary
Assistant prefixes (also known as prefills) are a way for a user to supply a preset
prefix for a model to continue generating from.

Example:
```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

Example Response:
```
.choices[0].message.content = " cross the road? To get to the other side!"
```

# Motivation
Assistant prefixes are useful for constraining model outputs, branched generation, and
exploring the tree of possible model generations. They are widely used on providers that
support them, such as Anthropic and OpenRouter.

However, there are currently three interfaces for requesting an assistant prefix:

* Any trailing (final) assistant message, with no special signifier
* `"prefix": true` on a trailing assistant message
* `"continue_final_message": true` on the request body

## Proposal goals

* Develop a standard interface for assistant prefixes.
* Disambiguate requests for assistant prefixes from requests for consecutive
  assistant messages.
* Maintain backwards compatibility with implementations using current interfaces.

# Detailed design

This RFC introduces a flag, `prefix`, on the trailing assistant message, with three
possible values: `true`, `false`, and unset.

## "prefix": true

If the value of "prefix" on the trailing assistant message is true and assistant
prefixes are supported for the requested model, the server MUST render the message
content into the chat template without starting a new turn:

Example:
```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

Example tokenized sequence for generation:

```
user
Tell me a joke.

assistant
Why did the chicken
```

The server MUST respond with only the generated content, not including the prefix:

Example:
```json
"messages": [
  {
    "role": "user",
    "content": "Classify this color as either { type: light } or { type: dark }: blue"
  },
  { "role": "assistant", "content": "```json\n", "prefix": true }
],
"stop": "\n```"
```

Example Response:
```
.choices[0].message.content = '{ "type": "dark" }'
```

### Token counting
The server MUST count assistant prefix tokens as prompt tokens, not completion tokens,
for the purposes of `max_tokens` and `usage`.

### Errors

If assistant prefixes are not supported for the requested model, the server MUST respond
with an `invalid_request_error`.

## "prefix": false

If the value of "prefix" on the trailing assistant message is false, the server MUST NOT
render the message as a prefix. The message MUST be rendered as a separate turn. The
message is not an assistant prefix.

Example:
```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken cross the road?", "prefix": false }
]
```

Example tokenized sequence for generation:
```
user
Tell me a joke.

assistant
Why did the chicken cross the road?

assistant
```

### Errors

If the requested model does not support multiple consecutive assistant messages, the
server MUST return an `invalid_request_error`.
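
For illustration, a rejected request could produce a response like the following. This
is a non-normative sketch: the error envelope assumes an OpenAI-compatible error object,
which this RFC does not specify, and the message text is illustrative.

```json
{
  "error": {
    "message": "This model does not support multiple consecutive assistant messages.",
    "type": "invalid_request_error",
    "param": "messages",
    "code": null
  }
}
```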

## "prefix" unset

If the prefix flag is unset, the result is implementation-defined. The server MAY treat
the trailing assistant message as an assistant prefix, MAY render it as a new turn, or
MAY return an `invalid_request_error`.

The server MAY use non-standard flags (such as `continue_final_message` on the request
body) to decide how to handle a trailing assistant message with an unset prefix flag.

*Note: This is left implementation-defined for backwards compatibility.*

## Non-boolean "prefix" values

Values of "prefix" that are not `true` or `false` are an error. The server MUST respond
with an `invalid_request_error`. In the future, some non-boolean values may be
supported; see *Unresolved questions*.

## "prefix" on user messages

A "prefix" field on a user message is an error. The server MUST respond with an
`invalid_request_error`.

## "prefix" on non-trailing assistant messages

A "prefix" field on a non-trailing assistant message MUST be ignored. The server MAY
emit a warning.

Example:
```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  {
    "role": "assistant",
    "content": "Why did the chicken cross the road? To get to the other side!",
    "prefix": true
  },
  { "role": "user", "content": "Good one!" }
]
```

Example tokenized sequence for generation:
```
user
Tell me a joke.

assistant
Why did the chicken cross the road? To get to the other side!

user
Good one!

assistant
```

# Alternatives considered

## Prefix via a trailing assistant message, without a flag

This approach is currently the most popular; however, it does not allow for consecutive
assistant messages, which many models now support. Additionally, it is ambiguous when
the server does not support assistant prefixes for a model: the user cannot easily tell
whether a trailing assistant message is actually being rendered without a new turn, or
whether the model is simply acting as if it were, for the given prompt.

The unset prefix behavior allows for backwards compatibility with this approach.

## Prefix via a flag on the body, such as `continue_final_message`

This approach is less flexible for users who wish to structure their code such that
sampling parameters are centralized in a single function, since the per-request flag
would have to be plumbed through every call site instead of being expressed in the
messages themselves:

```python
def make_request(client, messages: list[dict[str, str]]) -> str:
    response = client.chat.completions.create(
        messages=messages,
        # model, temperature, and other sampling parameters centralized here
    )
    return response.choices[0].message.content

joke = make_request(client, [
    { "role": "user", "content": "Tell me a joke" },
    { "role": "assistant", "content": "Sure! Here's a knock-knock joke:", "prefix": True },
])

humor_level = make_request(client, [
    { "role": "user", "content": f'Rate this joke from 0-10: {joke}' },
    { "role": "assistant", "content": 'I would rate this joke "', "prefix": True },
])
```

The unset prefix behavior allows for backwards compatibility with this approach: servers
may use non-standard flags on the request body to interpret trailing assistant messages
with unset "prefix" flags.

## Different flag names, such as "prefill" or "continue_message"

This RFC attempts to avoid the name "prefill", as it clashes with
[prefill](https://docs.vllm.ai/en/latest/examples/offline_inference/disaggregated_prefill.html?h=prefill)
in the context of inference serving.

Additionally, the DeepSeek API
[already uses the "prefix" flag](https://api-docs.deepseek.com/guides/chat_prefix_completion)
for this behavior, so adopting it gives compatibility with DeepSeek by default.
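
Taken together, the cases in *Detailed design* and the backwards-compatibility behavior
above could be handled by a server roughly as follows. This is a non-normative Python
sketch: the function name, its return values, and the capability-flag parameters are
illustrative rather than part of this RFC, and validation of `prefix` on user messages
and non-trailing assistant messages is omitted.

```python
class InvalidRequestError(Exception):
    """Corresponds to the `invalid_request_error` responses described in this RFC."""


def resolve_trailing_assistant(
    messages: list[dict],
    *,
    model_supports_prefix: bool,
    model_supports_consecutive_assistant: bool,
) -> str:
    """Return how a trailing assistant message should be rendered (non-normative sketch).

    Possible results: "continue_turn" (render without starting a new turn),
    "new_turn", or "implementation_defined" (the unset case).
    """
    last = messages[-1]
    if last.get("role") != "assistant":
        return "new_turn"  # no trailing assistant message; nothing special to do

    prefix = last.get("prefix")  # True, False, or unset (None)
    if prefix is True:
        if not model_supports_prefix:
            raise InvalidRequestError("assistant prefixes are not supported for this model")
        return "continue_turn"
    if prefix is False:
        if not model_supports_consecutive_assistant:
            raise InvalidRequestError("consecutive assistant messages are not supported")
        return "new_turn"
    if prefix is None:
        # Unset: implementation-defined; a server MAY consult legacy flags such as
        # `continue_final_message` on the request body at this point.
        return "implementation_defined"
    raise InvalidRequestError('"prefix" must be true or false')
```

A server structured this way can keep any legacy `continue_final_message` handling
behind the implementation-defined branch without affecting requests that set `prefix`
explicitly.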

# Unresolved questions

## Feature flags

Servers should have a way to signal that a model supports assistant prefixes. This will
be considered in a future feature flag RFC. See issue #7.

## Reasoning prefixes

This RFC does not attempt to standardize how implementations should handle prefixes on
reasoning models. This is a very useful capability, so it should be standardized, either
in this RFC or in a future one. One possibility would be to add a third value for
"prefix", `"prefix": "reasoning"`, that signals the prefix should be inserted into the
reasoning section.

We should also standardize how non-reasoning prefills work on reasoning models. Perhaps
they should be inserted after the end of reasoning.
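
For illustration only, a request using the hypothetical `"prefix": "reasoning"` value
floated above might look like the following; this value is not part of the current
proposal:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  {
    "role": "assistant",
    "content": "The user wants a short, family-friendly joke.",
    "prefix": "reasoning"
  }
]
```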