# RFC 001 - Assistant Prefix #8

* Start Date: 2025-06-01
* RFC PR: #8
* RFC Issue: #1
* Status: Draft

# Summary

Assistant prefixes (aka prefills) are a way for a user to supply a preset prefix for a
model to continue generating from.

Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

Example Response:

```
.choices[0].message.content = " cross the road? To get to the other side!"
```

# Motivation

Assistant prefixes are useful for constraining model outputs, branched generation, and
exploring the tree of possible model generations. They are widely used on providers that
support them, such as Anthropic and OpenRouter.

However, there are currently three interfaces for requesting an assistant prefix
(sketched below):

* A trailing (final) assistant message
* `"prefix": true` on a trailing assistant message
* `"continue_final_message": true` on the request body

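For illustration, the same prefix request under each of these three interfaces might look
roughly like the following. These are sketches, not exact provider payloads; details vary
by provider.

```python
# 1. Bare trailing assistant message, no flag (e.g. OpenRouter-style):
body_trailing = {
    "messages": [
        {"role": "user", "content": "Tell me a joke."},
        {"role": "assistant", "content": "Why did the chicken"},
    ],
}

# 2. "prefix": true on the trailing assistant message (Deepseek-style):
body_prefix_flag = {
    "messages": [
        {"role": "user", "content": "Tell me a joke."},
        {"role": "assistant", "content": "Why did the chicken", "prefix": True},
    ],
}

# 3. continue_final_message on the request body (e.g. some open-source servers):
body_continue = {
    "messages": [
        {"role": "user", "content": "Tell me a joke."},
        {"role": "assistant", "content": "Why did the chicken"},
    ],
    "continue_final_message": True,
}
```
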
## Proposal goals

* Develop a standard interface for assistant prefixes.
* Disambiguate requests for assistant prefixes from requests for consecutive
  assistant messages.
* Maintain backwards compatibility with implementations using current interfaces.

# Detailed design

This RFC introduces a flag, `prefix`, on the final message, with three possible states:
`true`, `false`, and unset.

## "prefix": true

If the value of "prefix" on the trailing assistant message is true and assistant
prefixes are supported for the requested model, the server MUST render the message
content to the chat template without a new turn:

Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

> **Reviewer:** What other examples of per-turn extra attributes exist in the ecosystem
> already? While this is a clear solution, tool calls might actually be another example,
> but maybe that is another point for another RFC, since it does seem a bit inconsistent
> and poorly defined.
>
> **Author:** I think it's a decent pattern for clarifying the intent of a message,
> especially given Deepseek already uses it, though I definitely want to avoid a "cat
> came back from Berkeley waving flags" situation :-) We definitely don't want 5
> mandatory flags on every message.

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken
```

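One way a server could produce this kind of rendering is with Hugging Face
`transformers`, whose `apply_chat_template` method (in recent versions) accepts a
`continue_final_message` option. This is only an illustrative sketch of one possible
backend, not a normative implementation; the model name is a placeholder.

```python
from transformers import AutoTokenizer

# Placeholder model name; any model whose chat template supports continuing
# the final message would work similarly.
tokenizer = AutoTokenizer.from_pretrained("my-org/my-chat-model")

messages = [
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken"},
]

# continue_final_message=True renders the trailing assistant message without
# closing its turn, so generation continues directly after "Why did the chicken".
prompt_ids = tokenizer.apply_chat_template(
    messages,
    continue_final_message=True,
    add_generation_prompt=False,
    tokenize=True,
)
```
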
The server MUST respond with only the generated content, not including the prefix:

Example:

```json
"messages": [
  {
    "role": "user",
    "content": "Classify this color as either { type: light } or { type: dark }: blue"
  },
  { "role": "assistant", "content": "```json\n", "prefix": true }
],
"stop": "\n```"
```

Example Response:

```
.choices[0].message.content = '{ "type": "dark" }'
```

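Because the response contains only the newly generated text, a client that wants the
complete assistant message must prepend the prefix itself. A minimal sketch, where
`call_llm` is a stand-in for whatever function sends the request and returns
`.choices[0].message.content`:

```python
def complete_with_prefix(call_llm, user_prompt: str, prefix: str) -> str:
    """Send a prefixed request and return the full assistant message."""
    messages = [
        {"role": "user", "content": user_prompt},
        {"role": "assistant", "content": prefix, "prefix": True},
    ]
    generated = call_llm(messages)
    # The server returns only the continuation, so re-attach the prefix
    # to reconstruct the complete assistant message.
    return prefix + generated
```

In the classification example above, the reconstructed message would be the opening code
fence plus the generated JSON object.
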
### Token counting

The server MUST count assistant prefix tokens as prompt tokens, not completion tokens,
for the purposes of `max_tokens` and `usage`.

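As a small illustration of this accounting (the token counts below are invented, and an
OpenAI-style response object with a `usage` field is assumed):

```python
def check_prefix_accounting(response) -> None:
    # Suppose the rendered prompt, including the assistant prefix, came to 37
    # tokens and the model generated 12 new tokens. (Invented numbers.)
    assert response.usage.prompt_tokens == 37      # prefix tokens count here...
    assert response.usage.completion_tokens == 12  # ...not here
    # max_tokens likewise bounds only the generated tokens, not the prefix.
```
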
### Errors

If assistant prefixes are not supported for the requested model, the server MUST respond
with `invalid_request_error`.

> **Reviewer:** Ideally a base layer of this spec would define an error categorization,
> including a differentiation between invalid requests and unsupported requests. Right
> now you're just using the undefined term `invalid_request_error`.
>
> **Author:** Agreed, one of the next RFCs should be laying out at least the basic
> definitions for these sorts of API primitives.
>
> **Author:** Issue for errors: #10

| ## "prefix": false | ||
|
|
||
| If the value of "prefix" on the trailing assistant message is false, the server MUST NOT | ||
| render the message as a prefix. The message MUST be rendered as a separate turn. The | ||
| message is not an assistant prefix. | ||
|
|
||
Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  { "role": "assistant", "content": "Why did the chicken cross the road?", "prefix": false }
]
```

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken cross the road?

<begin_turn><role>assistant</role>
```

### Errors

If the requested model does not support multiple consecutive assistant messages, the
server MUST return an `invalid_request_error`.

| ## "prefix" unset | ||
|
|
||
| If the prefix flag is unset, the result is implementation-defined. The server MAY treat | ||
| the trailing assistant message as an assistant prefix, MAY render it as a new turn, or | ||
| MAY return invalid_request_error. | ||
|
|
||
| The server MAY use non-standard flags (such as `continue_final_message` on the request | ||
| body) to decide how to handle a trailing assistant message with an unset prefix flag. | ||
|
|
||
| *Note: This is left implementation-defined for backwards compatibility.* | ||
|
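To make the three states concrete, here is a rough sketch of how a server might dispatch
on the trailing message's `prefix` flag. All names here (`InvalidRequestError`,
`resolve_prefix_mode`) are illustrative placeholders, and the unset branch is
deliberately implementation-defined:

```python
class InvalidRequestError(Exception):
    """Stand-in for an invalid_request_error API response."""


def resolve_prefix_mode(messages: list[dict], body: dict) -> bool:
    """Return True if the trailing assistant message should be treated as a prefix."""
    last = messages[-1]
    if last.get("role") != "assistant":
        return False  # no trailing assistant message, nothing to decide

    prefix = last.get("prefix")
    if prefix is True:
        return True   # MUST render without a new turn (or error if unsupported)
    if prefix is False:
        return False  # MUST render as a separate assistant turn
    if prefix is not None:
        raise InvalidRequestError('"prefix" must be a boolean')

    # Unset: implementation-defined. This sketch falls back to the non-standard
    # continue_final_message body flag; other servers may choose differently.
    return bool(body.get("continue_final_message", False))
```
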
> **Reviewer:** Open question: how much do we want to consider backwards compatibility
> for current vendors? If we start with cut outs for back compat then we are starting
> with a shaky foundation. If an app is built with this standard in mind, but then a
> standard-compliant vendor (because of this backwards compatibility) can return
> undefined behavior, then I don't think the standard is doing its job.
>
> **Author:** Yeah, that's a good question. My current feeling is that we should enable
> backwards compatibility when we can / if it doesn't adversely impact the standard too
> much, but also be willing to break things when needed. Basically I'm hoping we can do
> a Python 3.x -> 3.x+1 upgrade here, and not a Python 2 -> Python 3 :-) That said, I do
> anticipate we'll need to break some things, like token strings in top_logprobs. I just
> want to be thoughtful about it. My thought on trailing assistant messages here is that
> specifically OpenRouter is huge, and they currently support prefixes via an unmarked
> trailing assistant message. It would be a ton of work on their end to migrate every
> client over to explicitly marking the prefix, and it seems like a minor cut-out to
> allow this to be implementation-defined (which is the status quo right now anyways),
> especially because we can potentially add type checking / warnings in the standard
> client as a client-level check in the future, or even default to `prefix: false`
> instead of unset in the client.
>
> **Reviewer (@domenic):** I agree that leaving this ambiguous is pretty unfortunate.
> This is basically stating that clients which want interoperable outputs will need to
> always include a trailing `prefix` flag.
>
> **Author:** @domenic I agree that is unfortunate, though if we were to standardize the
> behavior of a trailing assistant message without a flag, what would the behavior be?
> I'm actually not sure which would be the better behavior to default to, which may be a
> point in favor of requiring it to be explicit. (Which we should be able to do with
> type hints in the client even if we allow it to be ambiguous in the API.) But I'm not
> sure about this / am open to being convinced otherwise.
>
> **Author:** We could also just require it in the API as well, if we were willing to
> break backwards compatibility. Maybe another middle ground would be upgrading the MAY
> to SHOULD NOT allow a trailing message without a flag?
>
> **Reviewer:** I think that generally for booleans, you want omitting the boolean to
> mean the same as setting it to `false`. But yeah, I see there's tricky constraints
> here about whether you can afford to be strict when creating a new spec that's hoping
> to get adoption. One possible solution (at the cost of complexity) is conformance
> levels. You can have "strict standard completions compliant" which might be hard for
> existing providers to meet the standards of, and "loose standard completions
> compliant" which has a lot more wiggle room and undefined/underdefined behavior to
> allow existing providers to match it. So e.g. OpenRouter or Anthropic could be
> loose-compliant until/unless they're willing to take a breaking change and switch the
> behavior of unflagged trailing assistant messages.
>
> **Author:** I like the idea of conformance levels!

## Non-boolean "prefix" values

Values of "prefix" that are not `true` or `false` are an error. The server MUST respond
with an `invalid_request_error`. In the future, some non-boolean values may be
supported; see *Unresolved questions*.

| ## "prefix" on user messages | ||
|
|
||
| A "prefix" field on a user message is an error. The server MUST respond with an | ||
| `invalid_request_error`. | ||
|
|
||
| ## "prefix" on non-trailing assistant messages | ||
|
|
||
| A "prefix" field on a non-trailing assistant message MUST be ignored. The server MAY | ||
| emit a warning. | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Reviewing this, I'm a bit unsure why we would have this be ignored, instead of generating invalid_request_error. It seems like a pretty clear developer coding error, and catching those earlier is probably good?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm, good question. My thought was that people may have code like this: messages = [
{ "role": "user", "content": "blah blah blah" },
{ "role": "assistant", "content": "Certainly:", "prefix": True },
]
response_str = call_llm(messages)
messages[-1]["content"] = response
messages.append({ "role": "user", "content": "blah blah blah" })
response2_str = call_llm(messages)And it might be nice to not require them to also clear the prefix flag. But I'm open to changing it if people think it'd be better to make this an error, it's not too onerous I don't think. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think that's a pretty good reason to keep it as-is, thanks for explaining. Although, did you check what DeepSeek does? |
||
|
|
||
Example:

```json
"messages": [
  { "role": "user", "content": "Tell me a joke." },
  {
    "role": "assistant",
    "content": "Why did the chicken cross the road? To get to the other side!",
    "prefix": true
  },
  { "role": "user", "content": "Good one!" }
]
```

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken cross the road? To get to the other side!

<begin_turn><role>user</role>
Good one!

<begin_turn><role>assistant</role>
```

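Pulling the last three rules together (non-boolean values, user messages, and
non-trailing assistant messages), a server's request validation might look roughly like
the following sketch. `InvalidRequestError` is the same illustrative stand-in used
earlier, and the warning mechanism is a placeholder:

```python
import warnings


class InvalidRequestError(Exception):
    """Stand-in for an invalid_request_error API response."""


def validate_prefix_fields(messages: list[dict]) -> None:
    last_index = len(messages) - 1
    for i, msg in enumerate(messages):
        if "prefix" not in msg:
            continue
        if msg["role"] == "user":
            # "prefix" on a user message is always an error.
            raise InvalidRequestError('"prefix" is not allowed on user messages')
        if not isinstance(msg["prefix"], bool):
            raise InvalidRequestError('"prefix" must be true or false')
        if msg["role"] == "assistant" and i != last_index:
            # Ignored on non-trailing assistant messages; the server MAY warn.
            warnings.warn(f'ignoring "prefix" on non-trailing message at index {i}')
```
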
# Alternatives

## Prefix via a trailing assistant message, without a flag

This approach is currently the most popular; however, it does not allow for consecutive
assistant messages, which many models now support. Additionally, it is ambiguous when
the server does not support assistant prefixes for a model: the user cannot easily tell
whether a trailing assistant message is actually being rendered without a new turn, or
whether the model is simply acting as if it were, for the given prompt.

The unset prefix behavior allows for backwards compatibility with this approach.

## Prefix via a flag on the body, such as `continue_final_message`

This approach is less flexible for users who wish to structure their code such that
sampling parameters are centralized in a single method:

```python
def make_request(client, messages: list[dict]) -> str:
    response = client.chat.completions.create(
        messages=messages,
        ...
    )
    return response.choices[0].message.content


joke = make_request(client, [
    { "role": "user", "content": "Tell me a joke" },
    { "role": "assistant", "content": "Sure! Here's a knock-knock joke:", "prefix": True },
])

humor_level = make_request(client, [
    { "role": "user", "content": f'Rate this joke from 0-10: {joke}' },
    { "role": "assistant", "content": 'I would rate this joke "', "prefix": True },
])
```

The unset prefix behavior allows for backwards compatibility with this approach: servers
may use non-standard flags on the request body to interpret trailing assistant messages
with unset "prefix" flags.

## Different flag names, such as "prefill" or "continue_message"

This RFC attempts to avoid the name "prefill", as it clashes with
[prefill](https://docs.vllm.ai/en/latest/examples/offline_inference/disaggregated_prefill.html?h=prefill)
in the context of inference serving.

Additionally, the Deepseek API
[already uses the "prefix" flag](https://api-docs.deepseek.com/guides/chat_prefix_completion)
for this behavior, so by adopting it we will have compatibility with Deepseek by default.

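For reference, the linked Deepseek guide shows this flag being used through the OpenAI
Python SDK roughly as follows. Details such as the beta base URL and model name are taken
from that guide and may change, so treat this as a sketch rather than a reference:

```python
from openai import OpenAI

# Per the linked guide, chat prefix completion is served from the beta base URL.
client = OpenAI(api_key="<DEEPSEEK_API_KEY>", base_url="https://api.deepseek.com/beta")

messages = [
    {"role": "user", "content": "Tell me a joke."},
    {"role": "assistant", "content": "Why did the chicken", "prefix": True},
]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
)
print(response.choices[0].message.content)  # continuation of the prefix
```
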
# Unresolved questions

## Feature flags

Servers should have a way to signal that a model supports assistant prefixes. This will
be considered in a future feature flag RFC. See issue #7.

## Reasoning prefixes

This RFC does not attempt to standardize how implementations should handle prefixes on
reasoning models. This is a very useful capability, so we should attempt to do so,
either in this RFC or in the future. One possibility would be to add a third value for
"prefix", `"prefix": "reasoning"`, that signals the prefix should be inserted into the
reasoning section.

We should also standardize how non-reasoning prefills work on reasoning models. Perhaps
they should be inserted after the end of reasoning.

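As a purely hypothetical illustration of the `"prefix": "reasoning"` idea (this value is
not part of this RFC's normative text):

```python
# Hypothetical request body: steer the model's reasoning section rather than its answer.
body = {
    "messages": [
        {"role": "user", "content": "What is 17 * 24?"},
        {
            "role": "assistant",
            "content": "Let me work through this step by step.",
            "prefix": "reasoning",  # hypothetical value, see above
        },
    ],
}
```
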