server: support multiple generations from one prompt (OAI "n" option)#17775

Merged
ngxson merged 11 commits into ggml-org:master from ngxson:xsn/add_n_support
Dec 6, 2025

Conversation

@ngxson
Contributor

@ngxson ngxson commented Dec 5, 2025

Fix #11142


Implementation

The requirement is that the number of slots must be equal to or larger than the number of "n" completion choices.

  1. When a task is created, we create N-1 child tasks and 1 parent task
  2. The parent task is guaranteed to be loaded first (because we push all tasks into the queue under a single lock). Child tasks are also loaded into slots at this point, but their state is set to SLOT_STATE_WAIT_OTHER
  3. We begin processing the parent's prompt
  4. When the parent's prompt processing is done, we gather all children in the SLOT_STATE_WAIT_OTHER state, then copy the parent's state into those slots via llama_memory_seq_cp
    • Note: at this point, if we cannot yet gather all children (e.g. one of the slots is busy with another task), we wait until that task is done and all children are on their slots. This can potentially decrease overall throughput, but makes the implementation easier to understand
  5. Continue to sampling and token generation as usual

TODO:

  • fix "invalid input batch" error
  • do not allow context shifting
  • add OAI output format
  • add tests
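The parent/child flow in steps 1–5 can be sketched in Python. This is illustrative only — the real implementation is C++ in the server code; `Slot`, `schedule`, and the state names here are simplified stand-ins:

```python
# Illustrative sketch of the parent/child slot flow described above.
# Not the actual server code; names are simplified stand-ins.

PROCESSING, WAIT_OTHER, GENERATING = "processing", "wait_other", "generating"

class Slot:
    def __init__(self, seq_id):
        self.seq_id = seq_id
        self.state  = None
        self.kv     = None   # stands in for this sequence's KV-cache contents

def schedule(prompt, n, slots):
    # steps 1-2: one parent + n-1 children, all claimed under one lock
    assert n <= len(slots), "need at least n free slots"
    parent, children = slots[0], slots[1:n]
    parent.state = PROCESSING
    for c in children:
        c.state = WAIT_OTHER
    # step 3: process the prompt once, on the parent only
    parent.kv = ("prompt-kv", prompt)
    # step 4: copy the parent's state into each waiting child
    # (the real code does this with llama_memory_seq_cp)
    for c in children:
        c.kv    = parent.kv
        c.state = GENERATING
    parent.state = GENERATING
    # step 5: all n sequences now sample/generate independently
    return [parent, *children]

slots  = [Slot(i) for i in range(4)]
active = schedule("hello", 3, slots)
assert all(s.kv == ("prompt-kv", "hello") for s in active)
assert slots[3].state is None   # the 4th slot was never touched
```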

@ngxson ngxson marked this pull request as ready for review December 5, 2025 15:40
@ngxson ngxson requested a review from ggerganov as a code owner December 5, 2025 15:40
@ngxson
Contributor Author

ngxson commented Dec 5, 2025

@allozaur @ServeurpersoCom one application of this feature could be offering multiple response choices in the web UI. Kind of a low-priority feature, but I think it could be quite nice to add!

Edit: we could technically also add per-response sampling control, for example one response with temperature=0.0 and another with 1.0; there are many possibilities, but we need to see what the use case is exactly

Example on chatgpt:


@allozaur
Contributor

allozaur commented Dec 5, 2025

> @allozaur @ServeurpersoCom one application of this feature could be offering multiple response choices in the web UI. Kind of a low-priority feature, but I think it could be quite nice to add!
>
> Edit: we could technically also add per-response sampling control, for example one response with temperature=0.0 and another with 1.0; there are many possibilities, but we need to see what the use case is exactly
>
> Example on chatgpt:

Oh, absolutely! I would love to take over this one, maybe still this year?

@ngxson
Contributor Author

ngxson commented Dec 5, 2025

> Oh, absolutely! I would love to take over this one, maybe still this year?

yeah no rush! feel free to start the task as soon as this PR is merged

@github-actions github-actions bot added the python (python script changes) label Dec 5, 2025
@ITankForCAD

This is more of an idea than a desired feature, at least for the moment, but multiple generations from the same prompt would allow for "best-of-n" scenarios. optillm is a good example of this.

@ngxson
Contributor Author

ngxson commented Dec 6, 2025

@ggerganov pinging in case you missed this PR

Member

@ggerganov ggerganov left a comment


Very nice! The implementation is much simpler than I anticipated.

Comment on lines +497 to +508
server_tokens server_tokens::clone() const {
    server_tokens res;
    res.has_mtmd = has_mtmd;
    res.tokens   = tokens;
    for (auto it = map_idx_to_media.begin(); it != map_idx_to_media.end(); ++it) {
        size_t idx = it->first;
        const mtmd::input_chunk_ptr & chunk = it->second;
        res.map_idx_to_media[idx] = mtmd::input_chunk_ptr(mtmd_input_chunk_copy(chunk.get()));
    }
    return res;
}

Member

Now that we have this function, I think we can enable host-memory prompt caching with mtmd:

  • Update this code:

// TODO: for some reason we can't copy server_tokens, so we have to do this workaround
auto & cur = states.emplace_back();
cur = {
    /*.tokens      =*/ server_tokens(prompt.tokens.get_text_tokens(), false),
    /*.data        =*/ std::move(state_data),
    /*.checkpoints =*/ prompt.checkpoints,
};

  • Remove this condition:

// TODO: mtmd does not support prompt cache
update_cache = update_cache && (ret->mctx == nullptr);

I haven't tested, but I think the only reason prompt caching didn't work was that we weren't sure how to copy the server_tokens. So it's worth giving it a try after these changes.

Contributor Author

Yes it will be nice to enable RAM cache for mtmd. I created an issue so we can have a look later on: #17821

Comment on lines +1710 to +1715
if (slot.is_parent() || slot.is_child()) {
    send_error(slot, "context shift cannot be used for shared prompt", ERROR_TYPE_SERVER);
    slot.release();
    continue;
}

Member

Hm, what is the reason to not support context shift here?

Contributor Author

@ngxson ngxson Dec 6, 2025

Not quite sure about this, but IIUC llama_kv_cache::seq_add does not have a notion of copy-on-write. For example, if a KV cell is used by 2 sequences, shifting one sequence will also cause the second to be shifted

This is fine if the current (generating) token position is synchronized among all sequences, but we don't have explicit logic to guarantee that this will always happen

Contributor Author

Also, the generation length of each sequence can be different, which can be quite difficult to keep track of

Member

I see, that is correct. The problem is that some of the tokens are shared when we use unified KV cache. It would work with split KV cache, but maybe it's not worth the extra logic branching.

Either way, context shifting is probably something we should remove at some point - it does not have much value with today's models, which have contexts of more than 128k tokens.

        slot.copy_state_to(*child);
        child->state = SLOT_STATE_DONE_PROMPT;
    }
    slot.state = SLOT_STATE_DONE_PROMPT;
Member

Is this line needed?

Contributor Author

removed in ea7f066

Comment on lines +2673 to +2674
states.push_back(child.params.oaicompat_chat_syntax);
tasks.push_back(std::move(child));
Member

I think we should improve this by making tasks and states more closely associated with each other - this currently feels error-prone, because one might forget to update the states when adding a new task.

Does it make sense to have the task_result_state be part of the server_task itself?

Contributor Author

> Does it make sense to have the task_result_state be part of the server_task itself?

The principle is that server_task will be std::move'd to the task queue, and eventually moved to a slot, so it cannot hold the task_result_state, because the state needs to stay in the HTTP thread

What I'm thinking is that we can just let server_response_reader create the state for each task, because currently tasks need to be posted by server_response_reader anyway

Btw, the longer-term plan is to only expose server_response_reader to HTTP handlers, as that API is easier to follow and also safer than managing the server_queue/response directly. WDYT?

Contributor Author

I'll implement this in a follow-up PR

@ggerganov
Member

ggerganov commented Dec 6, 2025

Can we make this work with the /completions and /infill endpoints?

Edit: nvm it works - just use "n_cmpl" instead of "n"

@ngxson
Contributor Author

ngxson commented Dec 6, 2025

btw, for /completions and /infill, I added support for both the n_cmpl and n fields
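Since either field is accepted, a client or proxy may want to normalize the two. A minimal sketch — only the field names `n` and `n_cmpl` come from this PR; the helper `n_completions` and its precedence choice are hypothetical:

```python
# Hypothetical helper: normalize the two equivalent request fields.
# Here the llama.cpp-specific "n_cmpl" takes precedence over the
# OAI-style "n" when both are present (an arbitrary choice for this sketch).
def n_completions(body: dict) -> int:
    return int(body.get("n_cmpl", body.get("n", 1)))

assert n_completions({"n_cmpl": 4}) == 4   # /completions, /infill style
assert n_completions({"n": 2}) == 2        # OAI style
assert n_completions({}) == 1              # default: a single completion
```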

@ngxson ngxson merged commit c42712b into ggml-org:master Dec 6, 2025
64 of 75 checks passed
@jacekpoplawski
Contributor

Do I understand correctly that with this change, instead of sending multiple separate requests with the same prompt, I can now send a single request and it will be faster?

@ServeurpersoCom
Contributor

ServeurpersoCom commented Dec 6, 2025

> Do I understand correctly that with this change, instead of sending multiple separate requests with the same prompt, I can now send a single request and it will be faster?

Try it with -np, --parallel (not tested yet, I'm not sure)

@ngxson
Contributor Author

ngxson commented Dec 6, 2025

> Do I understand correctly that with this change, instead of sending multiple separate requests with the same prompt, I can now send a single request and it will be faster?

Yes, it will be faster - the "n" option allows the prompt to be processed exactly once
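As a back-of-envelope illustration of the saving (made-up token counts, cost measured in tokens evaluated):

```python
# Rough cost model: with "n", the prompt is evaluated once and only the
# generated tokens scale with n; with n separate requests, the prompt is
# evaluated n times. Token counts below are made up for illustration.
def tokens_evaluated(prompt_tokens, gen_tokens, n, shared_prompt):
    prompt_cost = prompt_tokens if shared_prompt else n * prompt_tokens
    return prompt_cost + n * gen_tokens

p, t, n = 2000, 100, 4
separate = tokens_evaluated(p, t, n, shared_prompt=False)  # 4*2000 + 4*100
shared   = tokens_evaluated(p, t, n, shared_prompt=True)   #   2000 + 4*100
assert separate == 8400
assert shared   == 2400
```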

@ServeurpersoCom
Contributor

Great work on this PR!

I can confirm parallel sequences work perfectly. Here's my test setup:

# Server with 4 parallel slots
/root/llama.cpp.pascal/build/bin/llama-server \
  --port 8082 -ngl 999 \
  -ctk q8_0 -ctv q8_0 -fa on --mlock \
  -np 4 -kvu --ctx-size 32768 \
  --models-dir /var/www/ia/models

# Testing n=4 with streaming
curl -N https://www.serveurperso.com/ia/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF",
    "messages": [{"role": "user", "content": "Raconte une histoire"}],
    "n": 4,
    "stream": true
  }'
  
data: {"choices":[{"finish_reason":null,"index":0,"delta":{"content":"rena"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":1,"delta":{"content":"érie"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":2,"delta":{"content":" ton"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

data: {"choices":[{"finish_reason":null,"index":3,"delta":{"content":"res"}}],"created":1765035211,"id":"chatcmpl-6sAuIJ14qwTCRYq7lgPcbPvisCDZv6A5","model":"unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF","system_fingerprint":"b7359-7b09f44a5","object":"chat.completion.chunk"}

The implementation correctly:

  • Processes the prompt once (shared via llama_memory_seq_cp)
  • Generates 4 different completions in parallel using 4 slots
  • Returns proper SSE stream with "index": 0-3 for each choice

This is especially efficient when memory-bound, since parallel batching allows better compute utilization while waiting on memory bandwidth: I'm getting 3 to 4x total throughput!
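On the client side, a stream like the one above interleaves chunks from all n choices, so deltas have to be grouped by "index". A minimal sketch, with abbreviated sample lines standing in for the chunks shown above:

```python
# Group streamed deltas by choice index when n > 1.
# The sample SSE lines below are abbreviated, made-up stand-ins for the
# real chat.completion.chunk payloads shown in the test output above.
import json
from collections import defaultdict

sse_lines = [
    'data: {"choices":[{"index":0,"delta":{"content":"Once "}}]}',
    'data: {"choices":[{"index":1,"delta":{"content":"There "}}]}',
    'data: {"choices":[{"index":0,"delta":{"content":"upon"}}]}',
    'data: [DONE]',
]

texts = defaultdict(str)
for line in sse_lines:
    if not line.startswith("data: ") or line == "data: [DONE]":
        continue
    chunk = json.loads(line[len("data: "):])
    for choice in chunk["choices"]:
        texts[choice["index"]] += choice["delta"].get("content", "")

assert texts[0] == "Once upon"
assert texts[1] == "There "
```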

JayZenith pushed a commit to JayZenith/llama.cpp that referenced this pull request Dec 7, 2025
…ggml-org#17775)

* backend support

* server: support multiple generations from one prompt (OAI "n" option)

* fix invalid batch

* format oai

* clean up

* disable ctx shift

* add test

* update comments

* fix style

* add n_cmpl to docs [no ci]

* allowing using both n_cmpl and n
0Marble pushed a commit to 0Marble/llama.cpp that referenced this pull request Dec 18, 2025
@ggerganov
Member

I think there is a problem with this implementation - each of the parallel completions appears to process the same input prompt. I'll take a deeper look later to confirm, but from a quick test the computation is more than it should be (i.e. we compute the same prompt n_cmpl times instead of once).

}

bool is_child() const {
    return is_processing() && task->id_parent >= 0;
Contributor Author

Suggested change
-    return is_processing() && task->id_parent >= 0;
+    return task->id_parent >= 0;

@ggerganov yeah right, I think the problem is here: we check is_child() to decide whether the slot should be set to a waiting state, but at the time of the check, is_processing() is false

@ServeurpersoCom
Contributor

ServeurpersoCom commented Jan 7, 2026

Our 'return_progress = true' option can show the issue clearly with a script:

(root|~) cat bench.sh
#!/bin/bash
echo "Testing llama.cpp n=4 redundant prompt processing"
echo ""
TIMESTAMP=$(date +%s%N)
PROMPT="Timestamp: $TIMESTAMP. Numbers: $(seq 1 500 | tr '\n' ' ')"
echo "Prompt: unique timestamp + 500 numbers"
echo "Launching request with n=4..."
echo ""
echo "Tracking tokens computed per index:"
echo ""
curl -N https://www.serveurperso.com/ia/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"Dense-Uncensored-Dolphin-Mistral-24B-Venice-Edition\",
    \"messages\": [{
      \"role\": \"user\",
      \"content\": \"$PROMPT\"
    }],
    \"n\": 4,
    \"stream\": true,
    \"max_tokens\": 20,
    \"return_progress\": true
  }" 2>/dev/null | while read -r line; do
  if [[ "$line" == data:* ]]; then
    INDEX=$(echo "$line" | grep -oP '"index":\K[0-9]+' | head -1)
    PROCESSED=$(echo "$line" | grep -oP '"processed":\K[0-9]+' | head -1)
    CACHED=$(echo "$line" | grep -oP '"cache":\K[0-9]+' | head -1)
    TIME=$(echo "$line" | grep -oP '"time_ms":\K[0-9]+' | head -1)
    if [ ! -z "$INDEX" ] && [ ! -z "$PROCESSED" ] && [ ! -z "$CACHED" ]; then
      COMPUTED=$((PROCESSED - CACHED))
      printf "[index %s] computed %4d tokens (%4dms)\n" "$INDEX" "$COMPUTED" "$TIME"
    fi
  fi
done
echo ""
echo "Expected: one index computes all tokens, others clone"
echo "Actual: all indices compute independently"
(root|~) ./bench.sh
Testing llama.cpp n=4 redundant prompt processing

Prompt: unique timestamp + 500 numbers
Launching request with n=4...

Tracking tokens computed per index:

[index 0] computed    0 tokens (   0ms)
[index 0] computed  128 tokens (  14ms)
[index 0] computed  256 tokens (  65ms)
[index 0] computed  384 tokens ( 116ms)
[index 0] computed  512 tokens ( 167ms)
[index 0] computed  640 tokens ( 217ms)
[index 0] computed  768 tokens ( 268ms)
[index 0] computed  896 tokens ( 319ms)
[index 0] computed 1024 tokens ( 370ms)
[index 0] computed 1152 tokens ( 421ms)
[index 0] computed 1280 tokens ( 472ms)
[index 0] computed 1408 tokens ( 523ms)
[index 0] computed 1536 tokens ( 574ms)
[index 0] computed 1664 tokens ( 625ms)
[index 0] computed 1792 tokens ( 676ms)
[index 1] computed    0 tokens (   0ms)
[index 0] computed 1908 tokens ( 727ms)
[index 1] computed   12 tokens (  89ms)
[index 1] computed  139 tokens ( 143ms)
[index 1] computed  266 tokens ( 196ms)
[index 1] computed  393 tokens ( 249ms)
[index 1] computed  520 tokens ( 302ms)
[index 1] computed  647 tokens ( 356ms)
[index 1] computed  774 tokens ( 409ms)
[index 1] computed  901 tokens ( 463ms)
[index 1] computed 1028 tokens ( 516ms)
[index 1] computed 1155 tokens ( 569ms)
[index 1] computed 1282 tokens ( 623ms)
[index 1] computed 1409 tokens ( 676ms)
[index 1] computed 1536 tokens ( 730ms)
[index 1] computed 1663 tokens ( 783ms)
[index 1] computed 1790 tokens ( 837ms)
[index 2] computed    0 tokens (   0ms)
[index 1] computed 1908 tokens ( 891ms)
[index 2] computed    9 tokens (  54ms)
[index 2] computed  135 tokens ( 107ms)
[index 2] computed  261 tokens ( 161ms)
[index 2] computed  387 tokens ( 214ms)
[index 2] computed  513 tokens ( 268ms)
[index 2] computed  640 tokens ( 322ms)
[index 2] computed  767 tokens ( 375ms)
[index 2] computed  894 tokens ( 428ms)
[index 2] computed 1021 tokens ( 482ms)
[index 2] computed 1148 tokens ( 535ms)
[index 2] computed 1275 tokens ( 588ms)
[index 2] computed 1402 tokens ( 642ms)
[index 2] computed 1529 tokens ( 695ms)
[index 2] computed 1656 tokens ( 749ms)
[index 2] computed 1783 tokens ( 803ms)
[index 3] computed    0 tokens (   0ms)
[index 2] computed 1908 tokens ( 857ms)
[index 3] computed    2 tokens (  54ms)
[index 3] computed  128 tokens ( 107ms)
[index 3] computed  254 tokens ( 161ms)
[index 3] computed  380 tokens ( 214ms)
[index 3] computed  506 tokens ( 267ms)
[index 3] computed  633 tokens ( 321ms)
[index 3] computed  760 tokens ( 374ms)
[index 3] computed  887 tokens ( 428ms)
[index 3] computed 1014 tokens ( 481ms)
[index 3] computed 1141 tokens ( 535ms)
[index 3] computed 1268 tokens ( 588ms)
[index 3] computed 1395 tokens ( 642ms)
[index 3] computed 1522 tokens ( 695ms)
[index 3] computed 1649 tokens ( 748ms)
[index 3] computed 1776 tokens ( 802ms)
[index 3] computed 1903 tokens ( 855ms)
[index 3] computed 1908 tokens ( 884ms)

Expected: one index computes all tokens, others clone
Actual: all indices compute independently
(root|~)

Anico2 added a commit to Anico2/llama.cpp that referenced this pull request Jan 15, 2026
blime4 referenced this pull request in blime4/llama.cpp Feb 5, 2026

Labels

examples · python (python script changes) · server


Development

Successfully merging this pull request may close these issues.

server : add support for multiple responses

6 participants