
server: fix n_cmpl not skipping processing prompt #18663

Merged
ngxson merged 10 commits into ggml-org:master from ngxson:xsn/fix_n_cmpl
Jan 9, 2026

Conversation

@ngxson
Contributor

@ngxson ngxson commented Jan 7, 2026

Ref: #17775 (comment)

When using -v verbose log, we should now see this line:

slot update_slots: id  0 | task 18 | prompt done, n_tokens = 8, batch.n_tokens = 8
slot update_slots: id  1 | task 21 | waiting for parent slot to complete
slot update_slots: id  2 | task 19 | waiting for parent slot to complete
slot update_slots: id  3 | task 20 | waiting for parent slot to complete

@ngxson ngxson requested a review from ggerganov as a code owner January 7, 2026 12:14
@ngxson ngxson changed the title server: fix n_cmpl not skipping processing server: fix n_cmpl not skipping processing prompt Jan 7, 2026
@ggerganov
Member

There is still a problem: when I run this command 2 consecutive times, the server gets stuck in infinite loop:

curl -XPOST "localhost:8013/completion" -d '{"model": "fim", "prompt": "hello", "n_cmpl": 3, "n_predict": 1}' -H "Content-Type: application/json"

curl -XPOST "localhost:8013/completion" -d '{"model": "fim", "prompt": "hello", "n_cmpl": 3, "n_predict": 1}' -H "Content-Type: application/json"
[64629] 0.04.008.644 D que          post: new task, id = 19041, front = 0
[64629] 0.04.008.645 W srv  update_slots: no tokens to decode
[64629] 0.04.008.645 D que    start_loop: waiting for new tasks
[64629] 0.04.008.645 D que    start_loop: processing new tasks
[64629] 0.04.008.645 D que    start_loop: processing task, id = 19041
[64629] 0.04.008.645 D que    start_loop: update slots
[64629] 0.04.008.645 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.646 D que          post: new task, id = 19042, front = 0
[64629] 0.04.008.646 W srv  update_slots: no tokens to decode
[64629] 0.04.008.646 D que    start_loop: waiting for new tasks
[64629] 0.04.008.646 D que    start_loop: processing new tasks
[64629] 0.04.008.646 D que    start_loop: processing task, id = 19042
[64629] 0.04.008.646 D que    start_loop: update slots
[64629] 0.04.008.647 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.647 D que          post: new task, id = 19043, front = 0
[64629] 0.04.008.647 W srv  update_slots: no tokens to decode
[64629] 0.04.008.647 D que    start_loop: waiting for new tasks
[64629] 0.04.008.647 D que    start_loop: processing new tasks
[64629] 0.04.008.647 D que    start_loop: processing task, id = 19043
[64629] 0.04.008.648 D que    start_loop: update slots
[64629] 0.04.008.648 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.653 D que          post: new task, id = 19044, front = 0
[64629] 0.04.008.653 W srv  update_slots: no tokens to decode
[64629] 0.04.008.653 D que    start_loop: waiting for new tasks
[64629] 0.04.008.653 D que    start_loop: processing new tasks
[64629] 0.04.008.653 D que    start_loop: processing task, id = 19044
[64629] 0.04.008.653 D que    start_loop: update slots
[64629] 0.04.008.654 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.654 D que          post: new task, id = 19045, front = 0
[64629] 0.04.008.654 W srv  update_slots: no tokens to decode
[64629] 0.04.008.654 D que    start_loop: waiting for new tasks
[64629] 0.04.008.654 D que    start_loop: processing new tasks
[64629] 0.04.008.654 D que    start_loop: processing task, id = 19045
[64629] 0.04.008.655 D que    start_loop: update slots
[64629] 0.04.008.655 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.655 D que          post: new task, id = 19046, front = 0
[64629] 0.04.008.655 W srv  update_slots: no tokens to decode
[64629] 0.04.008.655 D que    start_loop: waiting for new tasks
[64629] 0.04.008.656 D que    start_loop: processing new tasks
[64629] 0.04.008.656 D que    start_loop: processing task, id = 19046
[64629] 0.04.008.656 D que    start_loop: update slots
[64629] 0.04.008.656 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.656 D que          post: new task, id = 19047, front = 0
[64629] 0.04.008.657 W srv  update_slots: no tokens to decode
[64629] 0.04.008.657 D que    start_loop: waiting for new tasks
[64629] 0.04.008.657 D que    start_loop: processing new tasks
[64629] 0.04.008.657 D que    start_loop: processing task, id = 19047
[64629] 0.04.008.657 D que    start_loop: update slots
[64629] 0.04.008.658 D srv  update_slots: posting NEXT_RESPONSE
[64629] 0.04.008.658 D que          post: new task, id = 19048, front = 0
[64629] 0.04.008.658 W srv  update_slots: no tokens to decode

It works for n_predict > 1, but for values less than or equal to 1 it gets stuck.

@ngxson
Contributor Author

ngxson commented Jan 7, 2026

I didn't notice this case, thanks for reporting. It should be fixed in the latest commit.

@ngxson
Contributor Author

ngxson commented Jan 7, 2026

Hmm, it seems the backend sampling case is affected by this change. Would you mind having a look, @ggerganov? Or I can temporarily disable this test case if you want.

@ggerganov
Member

Yes, I'll take a look.

@ggerganov
Member

@ngxson Let's first merge #18700 and then I'll update this PR and merge.

@ngxson
Contributor Author

ngxson commented Jan 8, 2026

yes, sounds good to me

@ggerganov
Member

ggerganov commented Jan 9, 2026

@ngxson I changed the logic to avoid an early return in update_slots(), since that makes it more difficult to follow what is happening. This should be good to merge, I think (waiting for CI).

Member

@ggerganov ggerganov left a comment

Tested the n_cmpl functionality with the latest llama.vscode and it seems to work great now.


// note: a slot can also be either a parent or a child
bool is_parent() const {
    return is_processing() && task->n_children > 0;
}
Member

This is_processing() check also seemed redundant, so I removed it.

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache
}

-    void clear_slot(server_slot & slot, bool allow_processing = false) const {
+    static void clear_slot(server_slot & slot, bool allow_processing = false) {
Contributor Author

I think this function can be removed now, as the logic can be moved to slot.clear(bool allow_processing = false).

A static function whose first argument is a class instance can always be converted to a class method.

Member

Hm, I think I broke something in 9ceb268 - server tests are failing locally.

// wait for all children to be launched
if (slot.is_parent()) {
    int n_launched = 0;
    for (auto & other : slots) {
Contributor Author

@ngxson ngxson Jan 9, 2026

I'm a little bit worried that this nested loop will be invoked on each new token of the parent slot. Probably move this inside the slot.state == SLOT_STATE_PROCESSING_PROMPT || slot.state == SLOT_STATE_STARTED branch below, so it only runs during prompt processing?

The idea is that the transition from SLOT_STATE_STARTED to SLOT_STATE_PROCESSING_PROMPT is only permitted once all child slots are launched.

Member

@ggerganov ggerganov Jan 9, 2026

I have a few more changes to improve this logic, but will push them in a follow-up PR because they restructure the if logic and the diff will become too unrelated to the current PR.

Contributor Author

Yes, that sounds good to me. After your refactoring, I will attempt to break the transitions down into small segments. I actually talked about this point earlier via DM:

server_slot::run_pre_decode(...) {
    if (state == SLOT_STATE_A) {
        // do transition_A_to_B
        return SLOT_STATE_B;
    }
    ...
}

And inside update_slots():

for (auto & slot : slots) {
    slot.state = slot.run_pre_decode(batch, ...);
}
llama_decode(batch);
for (auto & slot : slots) {
    slot.state = slot.run_post_decode(batch, ...);
}

The main benefit would be that (most) state transitions become bound to, and isolated within, one slot, since the transition functions will now be slot members.

The biggest win would be the ability to define an error boundary inside a transition function. So if one slot hits an exception (currently, the grammar system can throw one), we can shut down that single slot instead of letting the whole server crash.

@ngxson ngxson merged commit 9ac2693 into ggml-org:master Jan 9, 2026
76 checks passed
@ggerganov
Member

ggerganov commented Jan 11, 2026

@ngxson I'm still playing with the n_cmpl functionality and found a failure case that I can't think of a good way to fix. Wonder if you have any ideas.

First the repro:

# basic FIM server with 4 unified slots
llama-server --host 127.0.0.1 --mmap --port 8013 --alias fim --hf-repo ggml-org/Qwen2.5-Coder-0.5B-Q8_0-GGUF --kv-unified --parallel 4

Client sends two requests at the same time with n_cmpl = 3 and parent forced on id_slot: 0:

# put these in a script "test.sh" and run it: "bash test.sh"
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &
curl -XPOST "localhost:8013/completion" -d '{"id_slot": 0, "prompt": "Hello", "n_cmpl": 3, "n_predict": 4}' -H "Content-Type: application/json" &

Note that we explicitly request id_slot: 0. The use case is that we want to keep the parent task always on a fixed slot in order to better utilize the prefix cache.

Also note that we send the curl requests in parallel (the & at the end of the commands).

This fails because the parent task for the second completion gets deferred until slot 0 becomes free. But one of the child tasks of the second completion gets placed in slot 0 instead, preventing the parent task from obtaining the desired slot. So the server enters an infinite loop of the child tasks waiting for their parent:

0.03.575.713 D srv  update_slots: run slots completed
0.03.575.713 D que    start_loop: waiting for new tasks
0.03.575.713 D que    start_loop: processing new tasks
0.03.575.713 D que    start_loop: processing task, id = 19
0.03.575.713 D que    start_loop: update slots
0.03.575.713 D srv  update_slots: posting NEXT_RESPONSE
0.03.575.714 D que          post: new task, id = 20, front = 0
0.03.575.714 D slot update_slots: id  0 | task 5 | waiting for parent slot to complete
0.03.575.714 D slot update_slots: id  1 | task 4 | waiting for parent slot to complete
0.03.575.714 D srv  update_slots: decoding batch, n_tokens = 0
0.03.575.714 W srv  update_slots: no tokens to decode

Any suggestions? I tried various fixes, but the logic always seems to break in some way. The only approach I can think of is to introduce a mechanism that launches the parent and the children simultaneously, i.e. either all get deferred or all get launched. Otherwise, it's very difficult to handle all the edge cases of multiple parents and children fighting for the slots.

@ngxson
Contributor Author

ngxson commented Jan 11, 2026

@ggerganov yes, I think we need extra logic inside the task/slot scheduler, aka process_single_task, which will defer the task if it cannot provision enough slots. I'll have a look today.

I think we may also need a notion of a "locked" or "reserved" slot: the more requests are coming in, the less chance we have of ever finding enough free slots for an n_cmpl task (so the task may be deferred forever). The idea would be something like "tell the slot not to accept other tasks when it becomes free". But I'll need to think more about whether this logic is really necessary.

@ngxson
Contributor Author

ngxson commented Jan 12, 2026

@ggerganov I think I get it: think of slots as parallel threads. If we want to launch N tasks at the same time, we will need to introduce a notion of a "barrier", the same idea as a thread barrier.

I'm still working on the implementation today; it will involve quite a few changes.

gary149 pushed a commit to gary149/llama-agent that referenced this pull request Jan 13, 2026
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
dillon-blake pushed a commit to Boxed-Logic/llama.cpp that referenced this pull request Jan 15, 2026