server: support multiple generations from one prompt (OAI "n" option)#17775

Merged
ngxson merged 11 commits into ggml-org:master from ngxson:xsn/add_n_support
Dec 6, 2025

Conversation

@ngxson
Contributor

@ngxson ngxson commented Dec 5, 2025

Fix #11142


Implementation

The requirement is that the number of slots must be equal to or larger than the number of "n" completion choices.

  1. When a task is created, we create N-1 child tasks and 1 parent task
  2. The parent task is guaranteed to be loaded first (because we push all tasks into the queue under a single lock). Child tasks are also loaded into slots at this point, but their state is set to SLOT_STATE_WAIT_OTHER
  3. We begin processing the parent's prompt
  4. When the parent's prompt processing is done, we gather all children in the SLOT_STATE_WAIT_OTHER state, then copy the parent's state into those slots via llama_memory_seq_cp
    • Note: at this point, if we cannot yet gather all children (e.g. one of the slots is busy with another task), we wait until that task is done and all children are on their slots. This can potentially decrease overall throughput, but makes the implementation easier to understand
  5. Continue to sampling and token generation as usual

TODO:

  • fix "invalid input batch" error
  • do not allow context shifting
  • add OAI output format
  • add tests
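The parent/child flow in steps 1–5 can be sketched in Python. This is illustrative only — the real implementation is C++ in the server code; `Slot`, `schedule`, and the state names here are simplified stand-ins:

```python
# Illustrative sketch of the parent/child slot flow described above.
# Not the actual server code; names are simplified stand-ins.

PROCESSING, WAIT_OTHER, GENERATING = "processing", "wait_other", "generating"

class Slot:
    def __init__(self, seq_id):
        self.seq_id = seq_id
        self.state  = None
        self.kv     = None   # stands in for this sequence's KV-cache contents

def schedule(prompt, n, slots):
    # steps 1-2: one parent + n-1 children, all claimed under one lock
    assert n <= len(slots), "need at least n free slots"
    parent, children = slots[0], slots[1:n]
    parent.state = PROCESSING
    for c in children:
        c.state = WAIT_OTHER
    # step 3: process the prompt once, on the parent only
    parent.kv = ("prompt-kv", prompt)
    # step 4: copy the parent's state into each waiting child
    # (the real code does this with llama_memory_seq_cp)
    for c in children:
        c.kv    = parent.kv
        c.state = GENERATING
    parent.state = GENERATING
    # step 5: all n sequences now sample/generate independently
    return [parent, *children]

slots  = [Slot(i) for i in range(4)]
active = schedule("hello", 3, slots)
assert all(s.kv == ("prompt-kv", "hello") for s in active)
assert slots[3].state is None   # the 4th slot was never touched
```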

@ngxson ngxson marked this pull request as ready for review December 5, 2025 15:40
@ngxson ngxson requested a review from ggerganov as a code owner December 5, 2025 15:40
@ngxson
Contributor Author

ngxson commented Dec 5, 2025

@allozaur @ServeurpersoCom one application of this feature could be offering multiple response choices in the web UI. Kind of a low-priority feature, but I think it could be quite nice to add!

Edit: we could technically also add per-response sampling control, for example one response with temperature=0.0 and another with 1.0; there are many possibilities, but we need to see what the use case is exactly

Example on chatgpt:


@allozaur
Contributor

allozaur commented Dec 5, 2025

> @allozaur @ServeurpersoCom one application of this feature could be offering multiple response choices in the web UI. Kind of a low-priority feature, but I think it could be quite nice to add!
>
> Edit: we could technically also add per-response sampling control, for example one response with temperature=0.0 and another with 1.0; there are many possibilities, but we need to see what the use case is exactly
>
> Example on chatgpt:

Oh, absolutely! I would love to take over this one, maybe still this year?

@ngxson
Contributor Author

ngxson commented Dec 5, 2025

> Oh, absolutely! I would love to take over this one, maybe still this year?

yeah no rush! feel free to start the task as soon as this PR is merged

@github-actions github-actions bot added the python (python script changes) label Dec 5, 2025
@ITankForCAD

This is more of an idea than a desired feature, at least for the moment, but multiple generations from the same prompt would allow for "best-of-n" scenarios. optillm is a good example of this.

@ngxson
Contributor Author

ngxson commented Dec 6, 2025

@ggerganov pinging in case you missed this PR

Member

@ggerganov ggerganov left a comment


Very nice! The implementation is much simpler than I anticipated.

Comment on lines +497 to +508
server_tokens server_tokens::clone() const {
    server_tokens res;
    res.has_mtmd = has_mtmd;
    res.tokens   = tokens;
    for (auto it = map_idx_to_media.begin(); it != map_idx_to_media.end(); ++it) {
        size_t idx = it->first;
        const mtmd::input_chunk_ptr & chunk = it->second;
        res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
    }
    return res;
}

Member

Now that we have this function, I think we can enable host-memory prompt caching with mtmd:

  • Update this code:

// TODO: for some reason we can't copy server_tokens, so we have to do this workaround
auto & cur = states.emplace_back();
cur = {
    /*.tokens      =*/ server_tokens(prompt.tokens.get_text_tokens(), false),
    /*.data        =*/ std::move(state_data),
    /*.checkpoints =*/ prompt.checkpoints,
};

  • Remove this condition:

// TODO: mtmd does not support prompt cache
update_cache = update_cache && (ret->mctx == nullptr);

I haven't tested, but I think the only reason prompt caching didn't work was that we weren't sure how to copy the server_tokens. So it's worth giving it a try after these changes.

Contributor Author

Yes it will be nice to enable RAM cache for mtmd. I created an issue so we can have a look later on: #17821

Comment on lines +1710 to +1715
if (slot.is_parent() || slot.is_child()) {
    send_error(slot, "context shift cannot be used for shared prompt", ERROR_TYPE_SERVER);
    slot.release();
    continue;
}

Member

Hm, what is the reason to not support context shift here?

Contributor Author

@ngxson ngxson Dec 6, 2025

Not quite sure about this, but IIUC llama_kv_cache::seq_add does not have a notion of copy-on-write. For example, if a KV cell is used by 2 sequences, shifting one sequence will also cause the second to be shifted

This is fine if the current (generating) token position is synchronized among all sequences, but we don't have explicit logic to guarantee that this will always happen

Contributor Author

Also, the generation length of each sequence can be different, which can be quite difficult to keep track of

Member

I see, that is correct. The problem is that some of the tokens are shared when we use unified KV cache. It would work with split KV cache, but maybe it's not worth the extra logic branching.

Either way, context shifting is probably something we should remove at some point - it does not have much value with today's models, which have contexts of more than 128k tokens.

        slot.copy_state_to(*child);
        child->state = SLOT_STATE_DONE_PROMPT;
    }
    slot.state = SLOT_STATE_DONE_PROMPT;
Member

Is this line needed?

Contributor Author

removed in ea7f066

Comment on lines +2673 to +2674
states.push_back(child.params.oaicompat_chat_syntax);
tasks.push_back(std::move(child));
Member

I think we should improve this by making tasks and states more closely associated with each other - this currently feels error-prone, because one might forget to update the states when adding a new task.

Does it make sense to have the task_result_state be part of the server_task itself?

Contributor Author

> Does it make sense to have the task_result_state be part of the server_task itself?

The principle is that server_task will be std::move'd to the task queue, and eventually moved to a slot, so it cannot hold the task_result_state, because the state needs to stay in the HTTP thread

What I'm thinking is that we can just let server_response_reader create the state for each task, because currently tasks need to be posted by server_response_reader anyway

Btw, the longer-term plan is to only expose server_response_reader to HTTP handlers, as that API is easier to follow and also safer than managing the server_queue/response directly. WDYT?

Contributor Author

I'll implement this in a follow-up PR

@ggerganov
Member

ggerganov commented Dec 6, 2025

Can we make this work with the /completions and /infill endpoints?

Edit: nvm it works - just use "n_cmpl" instead of "n"

@ngxson
Contributor Author

ngxson commented Dec 6, 2025

btw, for /completions and /infill, I added support for both the n_cmpl and n fields
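Since either field is accepted, a client or proxy may want to normalize the two. A minimal sketch — only the field names `n` and `n_cmpl` come from this PR; the helper `n_completions` and its precedence choice are hypothetical:

```python
# Hypothetical helper: normalize the two equivalent request fields.
# Here the llama.cpp-specific "n_cmpl" takes precedence over the
# OAI-style "n" when both are present (an arbitrary choice for this sketch).
def n_completions(body: dict) -> int:
    return int(body.get("n_cmpl", body.get("n", 1)))

assert n_completions({"n_cmpl": 4}) == 4   # /completions, /infill style
assert n_completions({"n": 2}) == 2        # OAI style
assert n_completions({}) == 1              # default: a single completion
```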

@ngxson ngxson merged commit c42712b into ggml-org:master Dec 6, 2025
64 of 75 checks passed
@jacekpoplawski
Contributor

Do I understand correctly that with this change, instead of sending multiple separate requests with the same prompt, I can now send a single request and it will be faster?

@ServeurpersoCom
Contributor

ServeurpersoCom commented Dec 6, 2025

> Do I understand correctly that with this change, instead of sending multiple separate requests with the same prompt, I can now send a single request and it will be faster?

Try it with -np, --parallel (not tested yet, I'm not sure)

@ngxson
Contributor Author

ngxson commented Dec 6, 2025

> Do I understand correctly that with this change, instead of sending multiple separate requests with the same prompt, I can now send a single request and it will be faster?

Yes, it will be faster - the "n" option allows the prompt to be processed exactly once
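As a back-of-envelope illustration of the saving (made-up token counts, cost measured in tokens evaluated):

```python
# Rough cost model: with "n", the prompt is evaluated once and only the
# generated tokens scale with n; with n separate requests, the prompt is
# evaluated n times. Token counts below are made up for illustration.
def tokens_evaluated(prompt_tokens, gen_tokens, n, shared_prompt):
    prompt_cost = prompt_tokens if shared_prompt else n * prompt_tokens
    return prompt_cost + n * gen_tokens

p, t, n = 2000, 100, 4
separate = tokens_evaluated(p, t, n, shared_prompt=False)  # 4*2000 + 4*100
shared   = tokens_evaluated(p, t, n, shared_prompt=True)   #   2000 + 4*100
assert separate == 8400
assert shared   == 2400
```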

@ServeurpersoCom
Contributor

Great work on this PR!

I can confirm parallel sequences work perfectly. Here's my test setup:

# Server with 4 parallel slots
/root/llama.cpp.pascal/build/bin/llama-server \
  --port 8082 -ngl 999 \
  -ctk q8_0 -ctv q8_0 -fa on --mlock \
  -np 4 -kvu --ctx-size 32768 \
  --models-dir /var/www/ia/models

# Testing n=4 with streaming
curl -N https://www.serveurperso.com/ia/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Raconte une histoire"}],
    "n": 4,
    "stream": true
  }'
  
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"rena"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":1,"delta":{"content":"érie"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":2,"delta":{"content":" ton"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":3,"delta":{"content":"res"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

The implementation correctly:

  • Processes the prompt once (shared via llama_memory_seq_cp)
  • Generates 4 different completions in parallel using 4 slots
  • Returns proper SSE stream with "index": 0-3 for each choice

This is especially efficient when memory-bound, since parallel batching allows better compute utilization while waiting on memory bandwidth: I'm getting 3 to 4x total throughput!
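On the client side, a stream like the one above interleaves chunks from all n choices, so deltas have to be grouped by "index". A minimal sketch, with abbreviated sample lines standing in for the chunks shown above:

```python
# Group streamed deltas by choice index when n > 1.
# The sample SSE lines below are abbreviated, made-up stand-ins for the
# real chat.completion.chunk payloads shown in the test output above.
import json
from collections import defaultdict

sse_lines = [
    'data: {"choices":[{"index":0,"delta":{"content":"Once "}}]}',
    'data: {"choices":[{"index":1,"delta":{"content":"There "}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"upon"}}]}',
    'data: [DONE]',
]

texts = defaultdict(str)
for line in sse_lines:
    if not line.startswith("data: ") or line == "data: [DONE]":
        continue
    chunk = json.loads(line[len("data: "):])
    for choice in chunk["choices"]:
        texts[choice["index"]] += choice["delta"].get("content", "")

assert texts[0] == "Once upon"
assert texts[1] == "There "
```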

JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request Dec 7, 2025
…ggml-org#17775)

* backend support

* server: support multiple generations from one prompt (OAI "n" option)

* fix invalid batch

* format oai

* clean up

* disable ctx shift

* add test

* update comments

* fix style

* add n_cmpl to docs [no ci]

* allowing using both n_cmpl and n
0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
@ggerganov
Member

I think there is a problem with this implementation - each of the parallel completions appears to process the same input prompt. I'll take a deeper look later to confirm, but from a quick test the computation is more than it should be (i.e. we compute the same prompt n_cmpl times instead of once).

}

bool is_child() const {
    return is_processing() && task->id_parent >= 0;
Contributor Author

Suggested change
-    return is_processing() && task->id_parent >= 0;
+    return task->id_parent >= 0;

@ggerganov yeah right, I think the problem is here: we check is_child() to decide whether the slot should be set to a waiting state, but at the time of the check, is_processing() is false

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

Our 'return_progress = true' option can show the issue clearly with a script:

(root|~) cat bench.sh
#!/bin/bash
echo "Testing llama.cpp n=4 redundant prompt processing"
echo ""
TIMESTAMP=$(date +%s%N)
PROMPT="Timestamp: $TIMESTAMP. Numbers: $(seq 1 500 | tr '\n' ' ')"
echo "Prompt: unique timestamp + 500 numbers"
echo "Launching request with n=4..."
echo ""
echo "Tracking tokens computed per index:"
echo ""
curl -N https://www.serveurperso.com/ia/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Dense-Uncensored-Dolphin-Mistral-24B-Venice-Edition\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": \"$PROMPT\"
    }],
    \"n\": 4,
    \"stream\": true,
    \"max_tokens\": 20,
    \"return_progress\": true
  }" 2>/dev/null | while read -r line; do
  if [[ "$line" == data:* ]]; then
    INDEX=$(echo "$line" | grep -oP '"index":\K[0-9]+' | head -1)
    PROCESSED=$(echo "$line" | grep -oP '"processed":\K[0-9]+' | head -1)
    CACHED=$(echo "$line" | grep -oP '"cache":\K[0-9]+' | head -1)
    TIME=$(echo "$line" | grep -oP '"time_ms":\K[0-9]+' | head -1)
    if [ ! -z "$INDEX" ] && [ ! -z "$PROCESSED" ] && [ ! -z "$CACHED" ]; then
      COMPUTED=$((PROCESSED - CACHED))
      printf "[index %s] computed %4d tokens (%4dms)\n" "$INDEX" "$COMPUTED" "$TIME"
    fi
  fi
done
echo ""
echo "Expected: one index computes all tokens, others clone"
echo "Actual: all indices compute independently"
(root|~) ./bench.sh
Testing llama.cpp n=4 redundant prompt processing

Prompt: unique timestamp + 500 numbers
Launching request with n=4...

Tracking tokens computed per index:

[index 0] computed    0 tokens (   0ms)
[index 0] computed  128 tokens (  14ms)
[index 0] computed  256 tokens (  65ms)
[index 0] computed  384 tokens ( 116ms)
[index 0] computed  512 tokens ( 167ms)
[index 0] computed  640 tokens ( 217ms)
[index 0] computed  768 tokens ( 268ms)
[index 0] computed  896 tokens ( 319ms)
[index 0] computed 1024 tokens ( 370ms)
[index 0] computed 1152 tokens ( 421ms)
[index 0] computed 1280 tokens ( 472ms)
[index 0] computed 1408 tokens ( 523ms)
[index 0] computed 1536 tokens ( 574ms)
[index 0] computed 1664 tokens ( 625ms)
[index 0] computed 1792 tokens ( 676ms)
[index 1] computed    0 tokens (   0ms)
[index 0] computed 1908 tokens ( 727ms)
[index 1] computed   12 tokens (  89ms)
[index 1] computed  139 tokens ( 143ms)
[index 1] computed  266 tokens ( 196ms)
[index 1] computed  393 tokens ( 249ms)
[index 1] computed  520 tokens ( 302ms)
[index 1] computed  647 tokens ( 356ms)
[index 1] computed  774 tokens ( 409ms)
[index 1] computed  901 tokens ( 463ms)
[index 1] computed 1028 tokens ( 516ms)
[index 1] computed 1155 tokens ( 569ms)
[index 1] computed 1282 tokens ( 623ms)
[index 1] computed 1409 tokens ( 676ms)
[index 1] computed 1536 tokens ( 730ms)
[index 1] computed 1663 tokens ( 783ms)
[index 1] computed 1790 tokens ( 837ms)
[index 2] computed    0 tokens (   0ms)
[index 1] computed 1908 tokens ( 891ms)
[index 2] computed    9 tokens (  54ms)
[index 2] computed  135 tokens ( 107ms)
[index 2] computed  261 tokens ( 161ms)
[index 2] computed  387 tokens ( 214ms)
[index 2] computed  513 tokens ( 268ms)
[index 2] computed  640 tokens ( 322ms)
[index 2] computed  767 tokens ( 375ms)
[index 2] computed  894 tokens ( 428ms)
[index 2] computed 1021 tokens ( 482ms)
[index 2] computed 1148 tokens ( 535ms)
[index 2] computed 1275 tokens ( 588ms)
[index 2] computed 1402 tokens ( 642ms)
[index 2] computed 1529 tokens ( 695ms)
[index 2] computed 1656 tokens ( 749ms)
[index 2] computed 1783 tokens ( 803ms)
[index 3] computed    0 tokens (   0ms)
[index 2] computed 1908 tokens ( 857ms)
[index 3] computed    2 tokens (  54ms)
[index 3] computed  128 tokens ( 107ms)
[index 3] computed  254 tokens ( 161ms)
[index 3] computed  380 tokens ( 214ms)
[index 3] computed  506 tokens ( 267ms)
[index 3] computed  633 tokens ( 321ms)
[index 3] computed  760 tokens ( 374ms)
[index 3] computed  887 tokens ( 428ms)
[index 3] computed 1014 tokens ( 481ms)
[index 3] computed 1141 tokens ( 535ms)
[index 3] computed 1268 tokens ( 588ms)
[index 3] computed 1395 tokens ( 642ms)
[index 3] computed 1522 tokens ( 695ms)
[index 3] computed 1649 tokens ( 748ms)
[index 3] computed 1776 tokens ( 802ms)
[index 3] computed 1903 tokens ( 855ms)
[index 3] computed 1908 tokens ( 884ms)

Expected: one index computes all tokens, others clone
Actual: all indices compute independently
(root|~)

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

Labels

examples · python (python script changes) · server


Development

Successfully merging this pull request may close these issues.

server : add support for multiple responses

6 participants