server: fix data race in to_json_anthropic #18283
Conversation
```cpp
bool text_block_started = false;
...
if (first) {
    text_block_started = false;
```
Hm, this logic seems broken - @noname22 PTAL as the author of this code.
I'll have a look tomorrow
The original code used a `static bool` to track state across multiple calls to `to_json_anthropic()` during streaming. On the first streaming chunk, it should emit `content_block_start` (and set `text_block_started = true`). On subsequent chunks, it should skip `content_block_start` since the block is already started. With this changed to a plain `bool`, the variable is now reset to `false` on every call, which means every single streaming chunk will emit a `content_block_start` event. I agree that the `static bool` was a bad solution and problematic with concurrent users/calls, but we can't solve it like this. Could we maybe store `text_block_started` as a member variable of the `server_task_result_cmpl_partial` class (or in whatever struct tracks per-request state), so each request has its own isolated state without sharing it across concurrent requests?
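The per-request state being suggested can be sketched roughly as below. This is a minimal illustration, not llama.cpp's actual code: `stream_state` and `to_json_anthropic_chunk` are hypothetical names, and the events are modeled as plain strings. The point is that each request owns its own flag, so concurrent requests cannot clobber each other the way a function-local `static` would.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical per-request state, instead of a function-local static.
struct stream_state {
    bool text_block_started = false; // per-request, not shared
};

// Hypothetical formatter: returns the events to emit for one streaming chunk.
std::vector<std::string> to_json_anthropic_chunk(stream_state & st, const std::string & delta) {
    std::vector<std::string> events;
    if (!st.text_block_started) {
        events.push_back("content_block_start"); // emitted once per request
        st.text_block_started = true;
    }
    events.push_back("content_block_delta:" + delta);
    return events;
}
```

With a `static bool`, a second concurrent request would observe the flag already set by the first request and skip its own `content_block_start`; with per-request state, each request independently gets exactly one start event.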
Then what about this line of code? llama.cpp/tools/server/server-task.cpp, line 1153 in 849d021
I'm not sure what you mean. Yes, we could add `text_block_started` to `server_task_result_cmpl_partial` just like `n_decoded`. If you mean to do something like: That could work too, but only if the first token is guaranteed to be non-empty; otherwise we would fail to generate the block start. Is there any situation where models produce empty deltas?
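The empty-delta concern can be illustrated with a small sketch. This is hypothetical code, not the project's implementation: `emit_events` gates the start event on a `first` flag (first-chunk index) rather than on whether the delta has content, so an empty first delta still opens the block, which is the failure mode being asked about.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: emit the start event based on chunk position, not delta content.
// If the start event were instead gated on a non-empty delta, an empty first
// chunk would mean the block start is never generated.
std::vector<std::string> emit_events(bool first, const std::string & delta) {
    std::vector<std::string> events;
    if (first) {
        events.push_back("content_block_start"); // opens the block even for an empty delta
    }
    if (!delta.empty()) {
        events.push_back("content_block_delta:" + delta);
    }
    return events;
}
```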
@noname22 I think the closest logic to this is the llama.cpp/tools/server/server-context.cpp, lines 111 to 115 in e66ed75. You can keep track of this in a similar way and populate the llama.cpp/tools/server/server-context.cpp, lines 1448 to 1450 in e66ed75.
@noname22 I don't get why this logic is needed if we already had the same logic as on OAI, using the To remind, OAI also sends some "starting" chunks before sending the stream of tokens. I believe you are trying to reinvent the same thing here. If you really need extra logic to track the response state, use
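The OAI-style approach being referenced can be sketched as follows. This is a hypothetical illustration (`stream_response` is not a real llama.cpp function): the "starting" event is emitted once, outside the per-chunk formatter, before the token loop, so no cross-call flag is needed at all.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical: send the start event once, up front, then stream deltas.
// Because the start is emitted outside the per-chunk path, there is no
// shared state to race on between concurrent requests.
std::vector<std::string> stream_response(const std::vector<std::string> & deltas) {
    std::vector<std::string> events;
    events.push_back("content_block_start"); // sent once before streaming
    for (const auto & d : deltas) {
        events.push_back("content_block_delta:" + d);
    }
    events.push_back("content_block_stop"); // closes the block
    return events;
}
```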
Yes, I think it's also useful to move anything related to response formatting (e.g. The long-term direction would be to also move string-related logic like |
Fix #17570 (comment)