server: fix n_cmpl not skipping processing prompt #18663

ngxson merged 10 commits into ggml-org:master from …
Conversation
There is still a problem: when I run this command 2 consecutive times, the server gets stuck in an infinite loop:

```sh
curl -XPOST "localhost:8013/completion" -d '{"model": "fim", "prompt": "hello", "n_cmpl": 3, "n_predict": 1}' -H "Content-Type: application/json"
```

It works for …
I didn't notice this case, thanks for reporting. It should be fixed in the last commit.
hmm, seems like the backend sampling case is affected by this change. Would you mind having a look @ggerganov? Or I can temporarily disable this test case if you want.
Yes, I'll take a look.
yes, sounds good to me
@ngxson I changed the logic to avoid early …
ggerganov left a comment:
Tested the n_cmpl functionality with the latest llama.vscode and it seems to work great now.
```cpp
// note: a slot can also be either a parent or a child
bool is_parent() const {
    return is_processing() && task->n_children > 0;
}
```
This is_processing() check also seemed redundant, so I removed it.
- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
tools/server/server-context.cpp (Outdated)
```diff
 }

-void clear_slot(server_slot & slot, bool allow_processing = false) const {
+static void clear_slot(server_slot & slot, bool allow_processing = false) {
```
I think this function can be removed now, as the logic can be moved to slot.clear(bool allow_processing = false). A static function whose first argument is a class instance can always be converted to a class method.
Hm, I think I broke something in 9ceb268 - server tests are failing locally.
tools/server/server-context.cpp (Outdated)
```cpp
// wait for all children to be launched
if (slot.is_parent()) {
    int n_launched = 0;
    for (auto & other : slots) {
```
I'm a little bit worried that this nested loop will be invoked on each new token of the parent slot. Probably move this inside the slot.state == SLOT_STATE_PROCESSING_PROMPT || slot.state == SLOT_STATE_STARTED branch below, so it only runs during prompt processing?
The idea is that the transition from SLOT_STATE_STARTED to SLOT_STATE_PROCESSING_PROMPT is only permitted if all child slots are launched.
I have a few more changes to improve this logic, but will push them in a follow-up PR because they restructure the if logic and the diff will become too unrelated to the current PR.
Yes, that sounds good to me. After your refactoring, I will attempt to break the transition down into small segments. Actually, I talked about this point earlier via DM:
```cpp
server_slot::run_pre_decode(...) {
    if (state == SLOT_STATE_A) {
        // do transition_A_to_B
        return SLOT_STATE_B;
    }
    ...
}
```

And inside update_slots():

```cpp
for (auto & slot : slots) {
    slot.state = slot.run_pre_decode(batch, ...);
}

llama_decode(batch);

for (auto & slot : slots) {
    slot.state = slot.run_post_decode(batch, ...);
}
```

The main benefit would be that (most) state transitions will be bound to / isolated to one slot, since the transition functions will now be members of the slot.

The biggest benefit of this will be the ability to define an error boundary inside a transition function. So if one slot gets an exception (currently, the grammar system can throw one), we can shut down that single slot instead of letting the server crash.
@ngxson I'm still playing with the … First, the repro:

```sh
# basic FIM server with 4 unified slots
llama-server --host 127.0.0.1 --mmap --port 8013 --alias fim --hf-repo ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF --kv-unified --parallel 4
```

Client sends two requests at the same time with …:

```sh
# put these in a script "test.sh" and run it: "bash test.sh"
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
```

Note that we request explicitly … Also note that we send the …

This fails because the parent task for the second completion gets deferred until slot 0 becomes free. But one of the child tasks of the second completion gets placed in slot 0 instead, preventing the parent task from obtaining the desired slot. So the server enters an infinite loop of the child tasks waiting for their parent.

Any suggestions? I tried various fixes, but the logic seems to always break in some way. The only way that I can think of is to introduce a mechanism that launches the parent and the children simultaneously. I.e., either all get deferred, or all get launched. Otherwise, it is very difficult to handle all the edge cases of multiple parents and children fighting for the slots.
@ggerganov yes, I think we need extra logic inside the task/slot scheduler, aka … I think we may also need a notion of a "locked" or "reserved" slot: the more requests coming in, the less chance we have of enough …
@ggerganov I think I get it... Think of slots as parallel threads: if we want to launch N tasks at the same time, we will need to introduce a notion of a "barrier", the same idea as a "thread barrier". I'm still working on the implementation today; it will be quite some changes.
* server: fix n_cmpl not skipping processing
* fix infinite loop on empty batch
* cont : init child samplers + modify child logic
* cont : cleanup
* cont : improve n_cmpl logic
  - launch the parent task first so it finds the slot with best cache
  - parent task waits for child tasks to be launched
  - when a child task finishes - remove its cache
* cont : remove redundant function
* cont : reduce parent checks
* fix : nullptr task dereference

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Ref: #17775 (comment)

When using … with the -v verbose log, we should now see this line: