- Start Date: 2025-06-01
- RFC PR: #8
- RFC Issue: #1
- Status: Draft
Summary
Assistant prefixes (also known as prefills) let a user supply a preset prefix from which the model continues generating.
Example:
"messages": [ { "role": "user", "content": "Tell me a joke." }, { "role": "assistant", "content": "Why did the chicken", "prefix": true } ]
Example Response:
.choices[0].message.content = " cross the road? To get to the other side!"
Motivation
Assistant prefixes are useful for constraining model outputs, branched generation, and exploring the tree of possible model generations. They are widely used on providers that support them, such as Anthropic and OpenRouter.
However, there are currently three interfaces for requesting an assistant prefix:
- Any trailing (final) assistant message, with no special signifier
- "prefix": true on a trailing assistant message
- "continue_final_message": true on the request body
Proposal goals
- Develop a standard interface for assistant prefixes.
- Disambiguate requests for assistant prefixes from requests for consecutive assistant messages.
- Maintain backwards compatibility with implementations using current interfaces.
Detailed design
This RFC introduces a flag, "prefix", on the final message, with three possible states: true, false, and unset.
"prefix": true
If the value of "prefix" on the trailing assistant message is true and assistant prefixes are supported for the requested model, the server MUST render the message content to the chat template without a new turn:
Example:
"messages": [ { "role": "user", "content": "Tell me a joke." }, { "role": "assistant", "content": "Why did the chicken", "prefix": true } ]
Example tokenized sequence for generation:
<begin_turn><role>user</role>
Tell me a joke.
<begin_turn><role>assistant</role>
Why did the chicken
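Non-normatively, one way a server could produce this rendering is with HuggingFace transformers' apply_chat_template and its continue_final_message option; whether a given server uses transformers at all is an assumption of this sketch, and the model id is a placeholder:

from transformers import AutoTokenizer

# "some-chat-model" is a placeholder for any model with a chat template.
tokenizer = AutoTokenizer.from_pretrained("some-chat-model")
prompt = tokenizer.apply_chat_template(
    [
        { "role": "user", "content": "Tell me a joke." },
        { "role": "assistant", "content": "Why did the chicken" },
    ],
    tokenize=False,
    # Render the final message without closing its turn, so generation
    # resumes directly from the prefix.
    continue_final_message=True,
)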
The server MUST respond with only the generated content, not including the prefix:
Example:
"messages": [ { "role": "user", "content": "Classify this color as either { type: light } or { type: dark }: blue" }, { "role": "assistant", "content": "```json\n", "prefix": true } ], "stop": "\n```"
Example Response:
.choices[0].message.content = '{ "type": "dark" }'
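Since the prefix is excluded from the response, a client that wants the complete assistant message must prepend it. A minimal sketch, assuming an OpenAI-style Python client whose server accepts the "prefix" field (client construction and model name are placeholders):

prefix = "```json\n"
response = client.chat.completions.create(
    model="some-model",  # placeholder
    messages=[
        { "role": "user",
          "content": "Classify this color as either { type: light } or { type: dark }: blue" },
        { "role": "assistant", "content": prefix, "prefix": True },
    ],
    stop="\n```",
)
# The server returns only the continuation, so reconstruct the full
# assistant message client-side.
full_content = prefix + response.choices[0].message.content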
Token counting
The server MUST count assistant prefix tokens as prompt tokens, not completion tokens, for the purposes of max_tokens and usage.
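For illustration, the JSON-classification example above might report usage like the following; the token counts are invented, but the split is the point:

# Hypothetical counts: the "```json\n" prefix is billed as part of the
# prompt, and max_tokens bounds only the completion_tokens side.
usage = {
    "prompt_tokens": 24,     # user turn + assistant prefix
    "completion_tokens": 9,  # only the generated '{ "type": "dark" }'
    "total_tokens": 33,
}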
Errors
If assistant prefixes are not supported for the requested model, the server MUST respond with invalid_request_error.
"prefix": false
If the value of "prefix" on the trailing assistant message is false, the server MUST NOT render the message as a prefix. The message MUST be rendered as a separate turn. The message is not an assistant prefix.
Example:
"messages": [ { "role": "user", "content": "Tell me a joke." }, { "role": "assistant", "content": "Why did the chicken cross the road?", "prefix": false } ]
Example tokenized sequence for generation:
<begin_turn><role>user</role>
Tell me a joke.
<begin_turn><role>assistant</role>
Why did the chicken cross the road?
<begin_turn><role>assistant</role>
Errors
If the requested model does not support multiple consecutive assistant messages, the server MUST return an invalid_request_error.
"prefix" unset
If the prefix flag is unset, the result is implementation-defined. The server MAY treat the trailing assistant message as an assistant prefix, MAY render it as a new turn, or MAY return invalid_request_error.
The server MAY use non-standard flags (such as continue_final_message on the request body) to decide how to handle a trailing assistant message with an unset prefix flag.
Note: This is left implementation-defined for backwards compatibility.
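A non-normative sketch of this backwards-compatible dispatch, where the continue_final_message fallback is one implementation-defined choice among the behaviors permitted above:

def is_prefix(request: dict) -> bool:
    """Decide whether the trailing assistant message is a prefix."""
    last = request["messages"][-1]
    if last.get("role") != "assistant":
        return False
    flag = last.get("prefix")
    if flag is not None:
        return flag  # an explicit true/false always wins
    # "prefix" unset: implementation-defined. Here we fall back to the
    # non-standard request-body flag for backwards compatibility.
    return bool(request.get("continue_final_message", False))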
Non-boolean "prefix" values
Values of "prefix" that are not true or false are an error. The server MUST respond with an invalid_request_error. In the future, some non-boolean values may be supported, see Unresolved Questions.
"prefix" on user messages
A "prefix" field on a user message is an error. The server MUST respond with an invalid_request_error.
"prefix" on non-trailing assistant messages
A "prefix" field on a non-trailing assistant message MUST be ignored. The server MAY emit a warning.
Example:
"messages": [ { "role": "user", "content": "Tell me a joke." }, { "role": "assistant", "content": "Why did the chicken cross the road? To get to the other side!", "prefix": true }, { "role": "user", "content": "Good one!" } ]
Example tokenized sequence for generation:
<begin_turn><role>user</role>
Tell me a joke.
<begin_turn><role>assistant</role>
Why did the chicken cross the road? To get to the other side!
<begin_turn><role>user</role>
Good one!
<begin_turn><role>assistant</role>
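Taken together, the error rules in this section amount to a validator like the following sketch; InvalidRequestError is a hypothetical stand-in for however a server surfaces invalid_request_error:

import logging

class InvalidRequestError(ValueError):
    """Hypothetical stand-in for an invalid_request_error response."""

def validate_prefix_flags(messages: list[dict]) -> None:
    last = len(messages) - 1
    for i, msg in enumerate(messages):
        if "prefix" not in msg:
            continue
        if msg["role"] == "user":
            # A "prefix" field on a user message is an error.
            raise InvalidRequestError('"prefix" is not allowed on user messages')
        if msg["role"] == "assistant" and i != last:
            # Non-trailing assistant message: the flag MUST be ignored;
            # a warning MAY be emitted.
            logging.warning('ignoring "prefix" on non-trailing message %d', i)
            continue
        if not isinstance(msg["prefix"], bool):
            # Only true and false are currently valid values.
            raise InvalidRequestError('"prefix" must be true or false')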
Alternatives considered
Prefix via a trailing assistant message, without a flag
This approach is currently the most popular; however, it does not allow for consecutive assistant messages, which many models now support. Additionally, it is ambiguous when the server does not support assistant prefixes for a model: the user cannot easily tell whether a trailing assistant message is actually being rendered without a new turn, or whether the model is simply acting as if it were, for the given prompt.
The unset prefix behavior allows for backwards compatibility with this approach.
Prefix via a flag on the body, such as continue_final_message
This approach is less flexible for users who wish to structure their code such that sampling parameters are centralized in a single method:

def make_request(client, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        messages=messages,
        # ... model and centralized sampling parameters elided ...
    )
    return response.choices[0].message.content

joke = make_request(client, [
    { "role": "user", "content": "Tell me a joke" },
    { "role": "assistant", "content": "Sure! Here's a knock-knock joke:", "prefix": True },
])

humor_level = make_request(client, [
    { "role": "user", "content": f'Rate this joke from 0-10: {joke}' },
    { "role": "assistant", "content": 'I would rate this joke "', "prefix": True },
])
The unset prefix behavior allows for backwards compatibility with this approach—servers may use non-standard flags on the request body to interpret trailing assistant messages with unset "prefix" flags.
Different flag names, such as "prefill" or "continue_message"
This RFC attempts to avoid the name "prefill", as it clashes with prefill in the context of inference serving.
Additionally, the DeepSeek API already uses the "prefix" flag for this behavior, so by adopting it we gain compatibility with DeepSeek by default.
Unresolved questions
Feature flags
Servers should have a way to signal that a model supports assistant prefixes. This will be considered in a future feature flag RFC. See issue #7.
Reasoning prefixes
This RFC does not attempt to standardize how implementations should handle prefixes on reasoning models. This is a very useful capability, so we should attempt to do so, either in this RFC or in the future. One possibility would be to add a third value for "prefix", "prefix": "reasoning", that signals the prefix should be inserted into the reasoning section.
We should also standardize how non-reasoning prefills work on reasoning models. Perhaps they should be inserted after the end of reasoning.
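Purely as a hypothetical illustration of that direction (the "reasoning" value is sketched here, not specified by this RFC), such a request might look like:

Example:
"messages": [ { "role": "user", "content": "What is 17 * 24?" }, { "role": "assistant", "content": "Let me work through this step by step.", "prefix": "reasoning" } ]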
Discussion
What other examples of per-turn extra attributes exist in the ecosystem already? While this is a clear solution (as "prefix" is properly an attribute of the turn it is describing), is this a pattern that can be followed throughout the standard? Tool calls might actually be another example, but maybe that is another point for another RFC, since it does seem a bit inconsistent and poorly defined.
I think it's a decent pattern for clarifying the intent of a message, especially given Deepseek already uses it, though I definitely want to avoid a "cat came back from Berkeley waving flags" situation :-) We definitely don't want 5 mandatory flags on every message.