
server: fix data race in to_json_anthropic #18283

Merged

ngxson merged 1 commit into ggml-org:master from ngxson:xsn/server_anthropic_fix
Dec 22, 2025

server: fix data race in to_json_anthropic#18283
ngxson merged 1 commit intoggml-org:masterfrom
ngxson:xsn/server_anthropic_fix

Conversation

@ngxson
Contributor

@ngxson ngxson commented Dec 22, 2025

@ngxson ngxson merged commit 3997c78 into ggml-org:master Dec 22, 2025
68 of 71 checks passed
Comment on lines +1156 to 1159
bool text_block_started = false;

if (first) {
    text_block_started = false;
Member

Hm, this logic seems broken - @noname22 PTAL as the author of this code.

Contributor

I'll have a look tomorrow

@noname22
Contributor

The original code used a static bool to track state across multiple calls to to_json_anthropic() during streaming. On the first streaming chunk, it should emit content_block_start (and set text_block_started = true). On subsequent chunks, it should skip content_block_start since the block has already been started.

With this changed to a plain bool, the variable is reset to false on every call, which means that every single streaming chunk will emit a content_block_start event.

I agree that the static bool was a bad solution and problematic with concurrent users/calls, but we can't solve it like this.

Could we maybe store text_block_started as a member variable of the server_task_result_cmpl_partial class (or in whatever struct tracks per-request state), so each request has its own isolated state without sharing it across concurrent requests?
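A minimal sketch of that idea (names are hypothetical; the real llama.cpp structs differ): the flag lives in a per-request struct instead of a function-local static, so each concurrent stream opens its own block exactly once.

```cpp
#include <string>
#include <vector>

// Hypothetical per-request state, standing in for a member on
// server_task_result_cmpl_partial (or whatever struct tracks one request).
struct stream_state {
    bool text_block_started = false;
};

// Sketch of the streaming serializer: content_block_start must be emitted
// exactly once per request, on its first chunk, even when many requests
// stream concurrently. Because the flag lives in per-request state rather
// than a function-local static, requests cannot clobber each other's flag.
std::vector<std::string> to_events(stream_state & st) {
    std::vector<std::string> events;
    if (!st.text_block_started) {
        events.push_back("content_block_start");
        st.text_block_started = true;
    }
    events.push_back("content_block_delta");
    return events;
}
```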

@ngxson
Contributor Author

ngxson commented Dec 23, 2025

then what about this line of code?

bool first = (n_decoded == 1);

@noname22
Contributor

I'm not sure what you mean. Yes, we could add text_block_started to server_task_result_cmpl_partial just like n_decoded.

If you mean to do something like:

if (first && !diff.content_delta.empty()) {
    // emit content_block_start
}

That could work too, but only if the first token's content delta is guaranteed to be non-empty. Otherwise we would never emit the block start.

Is there any situation when models produce empty deltas?
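The edge case being asked about can be sketched like this (a hypothetical chunk model, not the real server types): if the guard depends on first alone, an empty first delta means no chunk ever opens the block.

```cpp
#include <string>

// Hypothetical model of one streaming chunk: n_decoded counts tokens
// decoded so far, content_delta is the text carried by this chunk.
struct chunk {
    int         n_decoded;
    std::string content_delta;
};

// Guard from the snippet above: only the first chunk may open the block,
// and only when it actually carries text. If the first delta is empty,
// no later chunk can open the block, which is the concern raised here.
bool emits_block_start(const chunk & c) {
    const bool first = (c.n_decoded == 1);
    return first && !c.content_delta.empty();
}
```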

@ggerganov
Member

Could we maybe store text_block_started as a member variable of the server_task_result_cmpl_partial class (or in whatever struct tracks per-request state), so each request has its own isolated state without sharing it across concurrent requests?

@noname22 I think the closest logic to this is the server_slot::has_new_line state:

bool has_next_token = true;
bool has_new_line = false;
bool truncated = false;

You can keep track of this in a similar way and populate the server_task_result_cmpl_partial as needed, in a similar way as we populate the server_task_result_cmpl_final here:

res->n_tokens_cached = slot.prompt.n_tokens();
res->has_new_line = slot.has_new_line;
res->stopping_word = slot.stopping_word;

@ngxson
Contributor Author

ngxson commented Jan 5, 2026

@noname22 I don't get why this logic is needed if we already have the same logic as in the OAI path, using the bool first that I pointed out earlier.

As a reminder, the OAI path also sends some "starting" chunks before sending the stream of tokens. I believe you are trying to reinvent the same thing here.

If you really need extra logic to track the response state, use response->update(), which was documented here

@ggerganov
Member

If you really need extra logic to track the response state, use response->update(), which was documented here

@ngxson Should we move all task-related state from server_slot (such as has_new_line) to the task_result_state?

@ngxson
Contributor Author

ngxson commented Jan 5, 2026

Yes, I think it's also useful to move anything related to response formatting (e.g. res_type, oaicompat_cmpl_id, etc.) to task_result_state, as they are currently simply copied from the task to the result each time we create a new result.

The long-term direction would be to also move string-related logic like validate_utf8, slot.generated_text, and maybe even detokenizing to task_result_state, so they will be handled by the HTTP thread. Though I'm not sure if it's worth doing; it could be quite complicated while not improving performance much.

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026


4 participants