# rfcs/001-assistant-prefix.md
* Start Date: 2025-06-01
* RFC PR: #8
* RFC Issue: #1
* Status: Draft

# Summary
Assistant prefixes (also known as prefills) are a way for a user to supply a preset prefix for a
model to continue generating from.

Example:
```json
"messages": [
{ "role": "user", "content": "Tell me a joke." },
{ "role": "assistant", "content": "Why did the chicken", "prefix": true }
]
```

Example Response:
```
.choices[0].message.content = " cross the road? To get to the other side!"
```

# Motivation
Assistant prefixes are useful for constraining model outputs, branched generation, and
exploring the tree of possible model generations. They are widely used on providers that
support them, such as Anthropic and OpenRouter.

However, there are currently three interfaces for requesting an assistant prefix:

* Any trailing (final) assistant message, with no special signifier
* `"prefix": true` on a trailing assistant message
* `"continue_final_message": true` on the request body

## Proposal goals

* Develop a standard interface for assistant prefixes.
* Disambiguate requests for assistant prefixes from requests for consecutive
assistant messages.
* Maintain backwards compatibility with implementations using current interfaces.

# Detailed design

This RFC introduces a flag, `prefix`, on the final message, with three possible states: `true`,
`false`, and unset.

## "prefix": true

If the value of "prefix" on the trailing assistant message is true and assistant
prefixes are supported for the requested model, the server MUST render the message
content into the chat template without starting a new turn:

Example:
```json
"messages": [
{ "role": "user", "content": "Tell me a joke." },
{ "role": "assistant", "content": "Why did the chicken", "prefix": true }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What other examples of per-turn extra attributes exist in the ecosystem already? While this is a clear solution (as prefix is properly an attribute of the turn it is describing), is this a pattern that can be followed throughout the standard?

Tool calls might actually be another example, but maybe that is another point for another RFC, since it does seem a bit inconsistent and poorly defined.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a decent pattern for clarifying the intent of a message, especially given Deepseek already uses it, though I definitely want to avoid a "cat came back from Berkeley waving flags" situation :-) We definitely don't want 5 mandatory flags on every message.

]
```

Example tokenized sequence for generation:

```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken
```

The server MUST respond with only the generated content, not including the prefix:

Example:
```json
"messages": [
{
"role": "user",
"content": "Classify this color as either { type: light } or { type: dark }: blue"
},
{ "role": "assistant", "content": "```json\n", "prefix": true }
],
"stop": "\n```"
```

Example Response:
```
.choices[0].message.content = '{ "type": "dark" }'
```
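
Since the prefix and the `stop` sequence fence the completion, the returned content parses
directly as JSON. A hypothetical client-side snippet (assuming `response` is the parsed
response object, accessed in the `response.choices[0].message.content` style used above):

```python
import json

payload = json.loads(response.choices[0].message.content)
assert payload == {"type": "dark"}
```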

### Token counting
The server MUST count assistant prefix tokens as prompt tokens, not completion tokens,
for the purposes of `max_tokens` and `usage`.
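
For illustration, a hypothetical `usage` object for the joke example in the Summary
(the token counts are invented for this sketch):

```json
"usage": {
  "prompt_tokens": 18,
  "completion_tokens": 11,
  "total_tokens": 29
}
```

Here the tokens for the prefix "Why did the chicken" are counted in `prompt_tokens`;
`completion_tokens` covers only the generated continuation.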

### Errors

If assistant prefixes are not supported for the requested model, the server MUST respond
with `invalid_request_error`.
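
The error response format is not yet defined by this standard (see the discussion below),
but for illustration only, assuming an OpenAI-style error envelope and a hypothetical
model name:

```json
{
  "error": {
    "type": "invalid_request_error",
    "message": "Assistant prefixes are not supported for model 'example-model'."
  }
}
```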
> **Review comment:** Ideally a base layer of this spec would define an error categorization, including a differentiation between invalid requests and unsupported requests. Right now you're just using the undefined term `invalid_request_error`, and I guess that's fine to start out, but it relates to some discussions in the group chat about starting with cutting-edge features vs. starting with the foundations.

> **Author reply:** Agreed, one of the next RFCs should be laying out at least the basic definitions for these sorts of API primitives.

> **Author reply:** Issue for errors: #10


## "prefix": false

If the value of "prefix" on the trailing assistant message is false, the message is not
an assistant prefix: the server MUST NOT render it as a prefix, and MUST render it as a
separate turn.

Example:
```json
"messages": [
{ "role": "user", "content": "Tell me a joke." },
{ "role": "assistant", "content": "Why did the chicken cross the road?", "prefix": false }
]
```

Example tokenized sequence for generation:
```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken cross the road?

<begin_turn><role>assistant</role>
```
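
For concreteness, a minimal sketch of the rendering logic for `"prefix": true` versus
`"prefix": false`, using the illustrative `<begin_turn>`/`<role>` tokens from the
examples above (real chat templates vary by model, and the function name is hypothetical):

```python
def render_for_generation(messages: list[dict]) -> str:
    """Render a message list to a prompt string; illustrative template only."""
    parts = [
        f"<begin_turn><role>{m['role']}</role>\n{m['content']}\n\n"
        for m in messages
    ]
    last = messages[-1]
    if last["role"] == "assistant" and last.get("prefix") is True:
        # Prefix: leave the final assistant turn open so the model
        # continues generating directly from the supplied content.
        return "".join(parts).rstrip("\n")
    # Otherwise, render every message as a closed turn and open a
    # fresh assistant turn for generation.
    parts.append("<begin_turn><role>assistant</role>\n")
    return "".join(parts)
```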

### Errors

If the requested model does not support multiple consecutive assistant messages, the
server MUST return an `invalid_request_error`.

## "prefix" unset

If the prefix flag is unset, the result is implementation-defined. The server MAY treat
the trailing assistant message as an assistant prefix, MAY render it as a new turn, or
MAY return `invalid_request_error`.

The server MAY use non-standard flags (such as `continue_final_message` on the request
body) to decide how to handle a trailing assistant message with an unset prefix flag.

*Note: This is left implementation-defined for backwards compatibility.*
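
For concreteness, one way a server might resolve the three states (a sketch; the
`continue_final_message` fallback is the non-standard flag mentioned above, and the
function name is hypothetical):

```python
def is_prefix_continuation(messages: list[dict], body: dict) -> bool:
    """Decide whether the trailing message should be rendered as a prefix."""
    last = messages[-1]
    if last["role"] != "assistant":
        return False
    prefix = last.get("prefix")  # True, False, or unset (absent)
    if isinstance(prefix, bool):
        return prefix  # true/false behavior is fully specified above
    # Unset: implementation-defined. This server falls back to a
    # non-standard request-body flag for backwards compatibility.
    return bool(body.get("continue_final_message", False))
```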

> **Review comment:** Open Question: How much do we want to consider backwards compatibility for current vendors? If we start with cut-outs for back compat then we are starting with a shaky foundation. If an app is built with this standard in mind, but a standard-compliant vendor (because of this backwards compatibility) can return undefined behavior, then I don't think the standard is doing its job.

> **Author reply (@vgel, Jun 2, 2025):** Yeah, that's a good question. My current feeling is that we should enable backwards compatibility when we can / if it doesn't adversely impact the standard too much, but also be willing to break things when needed. Basically I'm hoping we can do a Python 3.x -> 3.x+1 upgrade here, and not a Python 2 -> Python 3 :-)
>
> That said, I do anticipate we'll need to break some things, like token strings in `top_logprobs`. I just want to be thoughtful about it. My thought on trailing assistant messages here is that OpenRouter specifically is huge, and they currently support prefixes via an unmarked trailing assistant message. It would be a ton of work on their end to migrate every client over to explicitly marking the prefix, and it seems like a minor cut-out to allow this to be implementation-defined (which is the status quo right now anyways), especially because we can potentially add type checking / warnings in the standard client as a client-level check in the future, or even default to `prefix: false` instead of unset in the client.

> **Review comment:** I agree that leaving this ambiguous is pretty unfortunate. This is basically stating that clients which want interoperable outputs will need to always include a trailing `prefix: false` or `prefix: true`.

> **Author reply:** @domenic I agree that is unfortunate, though if we were to standardize the behavior of a trailing assistant message without a flag, what would the behavior be? I'm actually not sure which would be the better behavior to default to, which may be a point in favor of requiring it to be explicit. (Which we should be able to do with type hints in the client even if we allow it to be ambiguous in the API.) But I'm not sure about this / am open to being convinced otherwise.

> **Author reply:** We could also just require it in the API as well, if we were willing to break backwards compatibility. Maybe another middle ground would be upgrading the MAY to a SHOULD NOT (servers SHOULD NOT allow a trailing message without a flag)?

> **Review comment:** I think that generally for booleans, you want omitting the boolean to mean the same as `thatBoolean: false`. So starting a new turn would be best.
>
> But yeah, I see there are tricky constraints here about whether you can afford to be strict when creating a new spec that's hoping to get adoption.
>
> One possible solution (at the cost of complexity) is conformance levels. You can have "strict standard completions compliant", which might be hard for existing providers to meet, and "loose standard completions compliant", which has a lot more wiggle room and undefined/underdefined behavior to allow existing providers to match it. So e.g. OpenRouter or Anthropic could be loose-compliant until/unless they're willing to take a breaking change and switch the prefix-omitted behavior to match `prefix: false`.

> **Author reply:** I like the idea of conformance levels!


## Non-boolean "prefix" values

Values of "prefix" that are not `true` or `false` are an error. The server MUST respond
with an `invalid_request_error`. In the future, some non-boolean values may be
supported, see *Unresolved Questions*.

## "prefix" on user messages

A "prefix" field on a user message is an error. The server MUST respond with an
`invalid_request_error`.

## "prefix" on non-trailing assistant messages

A "prefix" field on a non-trailing assistant message MUST be ignored. The server MAY
emit a warning.
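
Taken together, the three rules above can be checked before templating. A combined
sketch (`InvalidRequestError` and the helper name are hypothetical stand-ins, since this
standard does not yet define error machinery):

```python
import warnings

class InvalidRequestError(Exception):
    """Stand-in for the standard's invalid_request_error."""

def validate_prefix_fields(messages: list[dict]) -> None:
    last_index = len(messages) - 1
    for i, msg in enumerate(messages):
        if "prefix" not in msg:
            continue
        if msg["role"] == "user":
            raise InvalidRequestError('"prefix" is not allowed on user messages')
        if not isinstance(msg["prefix"], bool):
            raise InvalidRequestError('"prefix" must be true or false')
        if i != last_index:
            # Ignored on non-trailing assistant messages; a warning MAY be emitted.
            warnings.warn(f'ignoring "prefix" on non-trailing message at index {i}')
```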
> **Review comment:** Reviewing this, I'm a bit unsure why we would have this be ignored, instead of generating `invalid_request_error`. It seems like a pretty clear developer coding error, and catching those earlier is probably good?

> **Author reply:** Hm, good question. My thought was that people may have code like this:
>
> ```python
> messages = [
>     { "role": "user", "content": "blah blah blah" },
>     { "role": "assistant", "content": "Certainly:", "prefix": True },
> ]
> response_str = call_llm(messages)
> # The response excludes the prefix, so append it to the prefilled content.
> messages[-1]["content"] += response_str
>
> messages.append({ "role": "user", "content": "blah blah blah" })
> response2_str = call_llm(messages)
> ```
>
> And it might be nice to not require them to also clear the prefix flag. But I'm open to changing it if people think it'd be better to make this an error; I don't think it's too onerous.

> **Review comment:** I think that's a pretty good reason to keep it as-is, thanks for explaining.
>
> Although, did you check what DeepSeek does?

Example:
```json
"messages": [
{ "role": "user", "content": "Tell me a joke." },
{
"role": "assistant",
"content": "Why did the chicken cross the road? To get to the other side!",
"prefix": true
},
{ "role": "user", "content": "Good one!" }
]
```

Example tokenized sequence for generation:
```
<begin_turn><role>user</role>
Tell me a joke.

<begin_turn><role>assistant</role>
Why did the chicken cross the road? To get to the other side!

<begin_turn><role>user</role>
Good one!

<begin_turn><role>assistant</role>
```

# Alternatives considered

## Prefix via a trailing assistant message, without a flag

This approach is currently the most popular; however, it does not allow for consecutive
assistant messages, which many models now support. Additionally, it is ambiguous when
the server does not support assistant prefixes for a model: the user cannot easily tell
whether a trailing assistant message is actually being rendered without a new turn, or
whether the model is simply acting as if it were, for the given prompt.

The unset prefix behavior allows for backwards compatibility with this approach.

## Prefix via a flag on the body, such as `continue_final_message`

This approach is less flexible for users who wish to structure their code so that
sampling parameters are centralized in a single method:

```python
def make_request(client, messages: list[dict[str, str]]) -> str:
    response = client.chat.completions.create(
        messages=messages,
        # ... sampling parameters centralized here (model, temperature, etc.) ...
    )
    return response.choices[0].message.content

joke = make_request(client, [
    { "role": "user", "content": "Tell me a joke" },
    { "role": "assistant", "content": "Sure! Here's a knock-knock joke:", "prefix": True },
])

humor_level = make_request(client, [
    { "role": "user", "content": f'Rate this joke from 0-10: {joke}' },
    { "role": "assistant", "content": 'I would rate this joke "', "prefix": True },
])
```

The unset prefix behavior allows for backwards compatibility with this approach—servers
may use non-standard flags on the request body to interpret trailing assistant messages
with unset "prefix" flags.

## Different flag names, such as "prefill" or "continue_message"

This RFC attempts to avoid the name "prefill", as it clashes with
[prefill](https://docs.vllm.ai/en/latest/examples/offline_inference/disaggregated_prefill.html?h=prefill)
in the context of inference serving.

Additionally, the DeepSeek API
[already uses the "prefix" flag](https://api-docs.deepseek.com/guides/chat_prefix_completion)
for this behavior, so by adopting it we will have compatibility with DeepSeek by default.

# Unresolved questions

## Feature flags

Servers should have a way to signal that a model supports assistant prefixes. This will
be considered in a future feature flag RFC. See issue #7.

## Reasoning prefixes

This RFC does not attempt to standardize how implementations should handle prefixes on
reasoning models. This is a very useful capability, so we should attempt to do so,
either in this RFC or in the future. One possibility would be to add a third value for
"prefix", `"prefix": "reasoning"`, that signals the prefix should be inserted into the
reasoning section.
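
For illustration only, such a request might look like this (the `"reasoning"` value is a
sketch of the possibility above, not part of this RFC):

```json
"messages": [
{ "role": "user", "content": "Tell me a joke." },
{
"role": "assistant",
"content": "The user wants a joke. A classic setup would be",
"prefix": "reasoning"
}
]
```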

We should also standardize how non-reasoning prefixes work on reasoning models. Perhaps
they should be inserted after the end of the reasoning section.