
server: fix n_cmpl not skipping processing prompt #18663

Merged
ngxson merged 10 commits into ggml-org:master from ngxson:xsn/fix_n_cmpl
Jan 9, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Jan 7, 2026

Ref: #17775 (comment)

When using -v verbose log, we should now see this line:

slot update_slots: id  0 | task 18 | prompt done, n_tokens = 8, batch.n_tokens = 8
slot update_slots: id  1 | task 21 | waiting for parent slot to complete
slot update_slots: id  2 | task 19 | waiting for parent slot to complete
slot update_slots: id  3 | task 20 | waiting for parent slot to complete

@ngxson ngxson requested a review from ggerganov as a code owner January 7, 2026 12:14
@ngxson ngxson changed the title server: fix n_cmpl not skipping processing server: fix n_cmpl not skipping processing prompt Jan 7, 2026
@ggerganov
Member

There is still a problem: when I run this command 2 consecutive times, the server gets stuck in infinite loop:

curl -XPOST "localhost:8013/completion" -d '{"model": "fim", "prompt": "hello", "n_cmpl": 3, "n_predict": 1}' -H "Content-Type: application/json"

curl -XPOST "localhost:8013/completion" -d '{"model": "fim", "prompt": "hello", "n_cmpl": 3, "n_predict": 1}' -H "Content-Type: application/json"
[64629] 0.04.008.644 D que          post: new task, id = 19041, front = 0
[64629] 0.04.008.645 W srv  update_slots: no tokens to decode
[64629] 0.04.008.645 D que    start_loop: waiting for new tasks
[64629] 0.04.008.645 D que    start_loop: processing new tasks
[64629] 0.04.008.645 D que    start_loop: processing task, id = 19041
[64629] 0.04.008.645 D que    start_loop: update slots
[64629] 0.04.008.645 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.646 D que          post: new task, id = 19042, front = 0
[64629] 0.04.008.646 W srv  update_slots: no tokens to decode
[64629] 0.04.008.646 D que    start_loop: waiting for new tasks
[64629] 0.04.008.646 D que    start_loop: processing new tasks
[64629] 0.04.008.646 D que    start_loop: processing task, id = 19042
[64629] 0.04.008.646 D que    start_loop: update slots
[64629] 0.04.008.647 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.647 D que          post: new task, id = 19043, front = 0
[64629] 0.04.008.647 W srv  update_slots: no tokens to decode
[64629] 0.04.008.647 D que    start_loop: waiting for new tasks
[64629] 0.04.008.647 D que    start_loop: processing new tasks
[64629] 0.04.008.647 D que    start_loop: processing task, id = 19043
[64629] 0.04.008.648 D que    start_loop: update slots
[64629] 0.04.008.648 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.653 D que          post: new task, id = 19044, front = 0
[64629] 0.04.008.653 W srv  update_slots: no tokens to decode
[64629] 0.04.008.653 D que    start_loop: waiting for new tasks
[64629] 0.04.008.653 D que    start_loop: processing new tasks
[64629] 0.04.008.653 D que    start_loop: processing task, id = 19044
[64629] 0.04.008.653 D que    start_loop: update slots
[64629] 0.04.008.654 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.654 D que          post: new task, id = 19045, front = 0
[64629] 0.04.008.654 W srv  update_slots: no tokens to decode
[64629] 0.04.008.654 D que    start_loop: waiting for new tasks
[64629] 0.04.008.654 D que    start_loop: processing new tasks
[64629] 0.04.008.654 D que    start_loop: processing task, id = 19045
[64629] 0.04.008.655 D que    start_loop: update slots
[64629] 0.04.008.655 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.655 D que          post: new task, id = 19046, front = 0
[64629] 0.04.008.655 W srv  update_slots: no tokens to decode
[64629] 0.04.008.655 D que    start_loop: waiting for new tasks
[64629] 0.04.008.656 D que    start_loop: processing new tasks
[64629] 0.04.008.656 D que    start_loop: processing task, id = 19046
[64629] 0.04.008.656 D que    start_loop: update slots
[64629] 0.04.008.656 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.656 D que          post: new task, id = 19047, front = 0
[64629] 0.04.008.657 W srv  update_slots: no tokens to decode
[64629] 0.04.008.657 D que    start_loop: waiting for new tasks
[64629] 0.04.008.657 D que    start_loop: processing new tasks
[64629] 0.04.008.657 D que    start_loop: processing task, id = 19047
[64629] 0.04.008.657 D que    start_loop: update slots
[64629] 0.04.008.658 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.658 D que          post: new task, id = 19048, front = 0
[64629] 0.04.008.658 W srv  update_slots: no tokens to decode

It works for n_predict > 1, but for values less than or equal to 1 it gets stuck.

@ngxson
Contributor Author

ngxson commented Jan 7, 2026

I didn't notice this case, thanks for reporting. It should be fixed in the latest commit.

@ngxson
Contributor Author

ngxson commented Jan 7, 2026

Hmm, it seems the backend sampling case is affected by this change. Would you mind having a look, @ggerganov? Or I can temporarily disable this test case if you want.

@ggerganov
Member

Yes, I'll take a look.

@ggerganov
Member

@ngxson Let's first merge #18700 and then I'll update this PR and merge.

@ngxson
Contributor Author

ngxson commented Jan 8, 2026

yes, sounds good to me

@ggerganov
Member

ggerganov commented Jan 9, 2026

@ngxson I changed the logic to avoid an early return in update_slots(), since that makes it more difficult to follow what is happening. This should be good to merge, I think (waiting for CI).

Member

@ggerganov ggerganov left a comment

Tested the n_cmpl functionality with the latest llama.vscode and it seems to work great now.


// note: a slot can also be either a parent or a child
bool is_parent() const {
    return is_processing() && task->n_children > 0;
}
Member

This is_processing() check also seemed redundant, so I removed it.

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
}

-    void clear_slot(server_slot & slot, bool allow_processing = false) const {
+    static void clear_slot(server_slot & slot, bool allow_processing = false) {
Contributor Author

I think this function can be removed now, as the logic can be moved to slot.clear(bool allow_processing = false).

A static function whose first argument is a class instance can always be converted to a class method.

Member

Hm, I think I broke something in 9ceb268 - server tests are failing locally.

// wait for all children to be launched
if (slot.is_parent()) {
    int n_launched = 0;
    for (auto & other : slots) {
Contributor Author

@ngxson ngxson Jan 9, 2026

I'm a little bit worried that this nested loop will be invoked on each new token of the parent slot. Probably move this inside the slot.state == SLOT_STATE_PROCESSING_PROMPT || slot.state == SLOT_STATE_STARTED branch below, so it only runs during prompt processing?

The idea is that the transition from SLOT_STATE_STARTED to SLOT_STATE_PROCESSING_PROMPT is only permitted once all child slots are launched.

Member

@ggerganov ggerganov Jan 9, 2026

I have a few more changes to improve this logic, but will push them in a follow-up PR because they restructure the if logic and the diff will become too unrelated to the current PR.

Contributor Author

Yes, that sounds good to me. After your refactoring, I will attempt to break the transitions down into small segments. I actually talked about this point earlier via DM:

server_slot::run_pre_decode(...) {
    if (state == SLOT_STATE_A) {
        // do transition_A_to_B
        return SLOT_STATE_B;
    }
    ...
}

And inside update_slots():

for (auto & slot : slots) {
    slot.state = slot.run_pre_decode(batch, ...);
}
llama_decode(batch);
for (auto & slot : slots) {
    slot.state = slot.run_post_decode(batch, ...);
}

The main benefit would be that (most) state transitions become bound to, and isolated within, one slot, since the transition functions will now be slot members.

The biggest win would be the ability to define an error boundary inside a transition function. So if one slot hits an exception (currently, the grammar system can throw one), we can shut down that single slot instead of letting the whole server crash.

@ngxson ngxson merged commit 9ac2693 into ggml-org:master Jan 9, 2026
76 checks passed
@ggerganov
Member

ggerganov commented Jan 11, 2026

@ngxson I'm still playing with the n_cmpl functionality and found a failure case that I can't think of a good way to fix. Wonder if you have any ideas.

First the repro:

# basic FIM server with 4 unified slots
llama-server --host 127.0.0.1 --mmap --port 8013 --alias fim --hf-repo ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF --kv-unified --parallel 4

Client sends two requests at the same time with n_cmpl = 3 and parent forced on id_slot: 0:

# put these in a script "test.sh" and run it: "bash test.sh"
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &

Note that we explicitly request id_slot: 0. The use case is that we want to keep the parent task always on a fixed slot in order to better utilize the prefix cache.

Also note that we send the curl requests in parallel (the & at the end of the commands).

This fails because the parent task for the second completion gets deferred until slot 0 becomes free. But one of the child tasks of the second completion gets placed in slot 0 instead, preventing the parent task from obtaining the desired slot. So the server enters an infinite loop of the child tasks waiting for their parent:

0.03.575.713 D srv  update_slots: run slots completed
0.03.575.713 D que    start_loop: waiting for new tasks
0.03.575.713 D que    start_loop: processing new tasks
0.03.575.713 D que    start_loop: processing task, id = 19
0.03.575.713 D que    start_loop: update slots
0.03.575.713 D srv  update_slots: posting NEXT_RESPONSE
0.03.575.714 D que          post: new task, id = 20, front = 0
0.03.575.714 D slot update_slots: id  0 | task 5 | waiting for parent slot to complete
0.03.575.714 D slot update_slots: id  1 | task 4 | waiting for parent slot to complete
0.03.575.714 D srv  update_slots: decoding batch, n_tokens = 0
0.03.575.714 W srv  update_slots: no tokens to decode

Any suggestions? I tried various fixes, but the logic always seems to break in some way. The only approach I can think of is to introduce a mechanism that launches the parent and the children simultaneously, i.e. either all get deferred or all get launched. Otherwise, it's very difficult to handle all the edge cases of multiple parents and children fighting for the slots.

@ngxson
Contributor Author

ngxson commented Jan 11, 2026

@ggerganov yes, I think we need extra logic inside the task/slot scheduler, aka process_single_task, which will defer the task if it cannot provision enough slots. I'll have a look today.

I think we may also need a notion of a "locked" or "reserved" slot: the more requests are coming in, the less chance we have of ever finding enough free slots for an n_cmpl task (so the task may be deferred forever). The idea would be something like "tell the slot not to accept other tasks when it becomes free". But I'll need to think more about whether this logic is really necessary.

@ngxson
Contributor Author

ngxson commented Jan 12, 2026

@ggerganov I think I get it: think of slots as parallel threads. If we want to launch N tasks at the same time, we will need to introduce a notion of a "barrier", the same idea as a thread barrier.

I'm still working on the implementation today; it will involve quite a few changes.

gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 13, 2026
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
dillon-blake pushed a commit to Boxed-Logic/llama.cpp that referenced this pull request Jan 15, 2026