sampling : support multiple outputs per sequence by danbev · Pull Request #19833 · ggml-org/llama.cpp

danbev · 2026-02-23T13:51:03Z

This commit adds support for multiple outputs per sequence in the backend sampling implementation.

The main motivation for this change is to be able to support speculative decoding using backend samplers where multiple outputs for the same sequence would be needed.

Example usage with llama-server:

$ llama-server -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf  \
      --spec-type ngram-mod --spec-ngram-size-n 8 --draft-min 1 --draft-max 64 \
      -bs -v

TODO:

- Make the graph "static" for sequences with multiple outputs.
- Perhaps add a configuration option specifying that multiple outputs are not used for backend sampling to avoid the additional nodes in the compute graph.


This commit adds a compute graph parameter named n_sampling_outputs_max, intended as a maximum (cap) on the number of outputs for backend sampling. The motivation is that it gives a configurable value instead of the hardcoded macro LLAMA_MAX_SAMPLING_OUTPUTS, which has been removed. I'm not sure this is the best option, as having multiple outputs per sequence might not be the most common use case; I need to think a bit more about this. I'll commit this to see that CI passes. This parameter should also be exposed as a common option for tools, which I'll do in a follow-up commit.
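To make the role of such a cap concrete, here is a minimal sketch that counts the requested outputs per sequence in a batch and checks them against a cap. Only the name n_sampling_outputs_max comes from the description above; the `batch_token_sketch` type and both helper functions are hypothetical and are not part of the llama.cpp API.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-in for one batch entry; not a llama.cpp type.
struct batch_token_sketch {
    int32_t seq_id; // sequence the token belongs to
    bool    output; // true if an output (sample) is requested for this token
};

// Count the requested outputs for each sequence in the batch.
std::map<int32_t, uint32_t> count_outputs_sketch(const std::vector<batch_token_sketch> & batch) {
    std::map<int32_t, uint32_t> counts;
    for (const auto & t : batch) {
        if (t.output) {
            counts[t.seq_id]++;
        }
    }
    return counts;
}

// Returns true if no sequence requests more outputs than the cap allows.
bool outputs_within_cap_sketch(const std::vector<batch_token_sketch> & batch,
                               uint32_t n_sampling_outputs_max) {
    for (const auto & kv : count_outputs_sketch(batch)) {
        if (kv.second > n_sampling_outputs_max) {
            return false;
        }
    }
    return true;
}
```

With speculative decoding, a single sequence may request one output per drafted token, which is why a per-sequence cap (rather than a per-batch one) is the quantity checked here.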

This commit makes the computation graph static when backend samplers process multiple outputs per sequence. Previously, only active samplers, those with outputs in the current batch, were added to the graph. This could cause graph reallocations if different samplers become active/inactive across batches, even when the number of outputs remained constant.
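The difference between the two strategies can be illustrated with a toy model in which the graph "topology" is just the ordered list of sequence ids whose sampler was added. This is a sketch under the assumption that "static" means every configured sampler is always added; the function names are hypothetical and do not reflect the actual graph-building code.

```cpp
#include <cstdint>
#include <set>
#include <vector>

// Previous behaviour (as described above): only samplers that have outputs
// in the current batch contribute nodes, so the topology depends on the batch.
std::vector<int32_t> build_dynamic_sketch(const std::vector<int32_t> & configured,
                                          const std::set<int32_t> & active) {
    std::vector<int32_t> nodes;
    for (int32_t id : configured) {
        if (active.count(id) > 0) {
            nodes.push_back(id);
        }
    }
    return nodes;
}

// Assumed "static" behaviour: every configured sampler contributes nodes,
// regardless of which ones are active in the current batch.
std::vector<int32_t> build_static_sketch(const std::vector<int32_t> & configured,
                                         const std::set<int32_t> & /*active*/) {
    return configured;
}
```

For configured samplers {0, 1, 2}, a batch with active set {0, 1} and a batch with active set {1, 2} have the same number of outputs, yet the dynamic variant yields different topologies while the static variant yields the same one.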

This commit adds clamping to the backend distribution sampler to avoid the case where idxf values are all zero. If this happens then we will incorrectly create an out of bounds idx value which will cause a crash. This can be reproduced by explicitly setting idxf to zero:

```c++
idxf = ggml_scale(ctx, idxf, 0.0f);
```
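A CPU-side analogue shows why a clamp is needed. The sketch below assumes one plausible formulation of inverse-CDF sampling, where the index is derived from a count of cumulative probabilities that exceed a uniform random number; it is not the backend sampler's actual code, and `sample_index_sketch` is a hypothetical name.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Inverse-CDF sampling. `u` is a uniform random number in [0, 1).
// `probs` is assumed to be non-empty.
//
// count = number of entries whose cumulative probability exceeds u.
// The sampled index is n - count. If count is zero (for example because all
// values are zero), n - count equals n, which is one past the last valid
// index, so the result is clamped to n - 1.
size_t sample_index_sketch(const std::vector<float> & probs, float u) {
    const size_t n = probs.size();
    float  cumsum = 0.0f;
    size_t count  = 0;
    for (float p : probs) {
        cumsum += p;
        if (cumsum > u) {
            count++;
        }
    }
    const size_t idx = n - count;
    return std::min(idx, n - 1); // clamp to the last valid index
}
```

Without the final `std::min`, an all-zero input would return `n` and any subsequent lookup with that index would read out of bounds.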

This commit adds a function to the backend sampler API to retrieve the number of nodes required for sampling. The motivation for this is that this enables graph_max_nodes to get the number of nodes required for backend sampling.
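As a rough sketch of how a per-sampler node count could feed into a graph size estimate: the struct and function below are hypothetical, and the assumption that each sampler's nodes are multiplied by the maximum number of outputs is mine, not something stated in this PR.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sampler description; n_nodes is what a query such as
// backend_n_nodes might report for one sampler.
struct sampler_sketch {
    uint32_t n_nodes; // graph nodes this sampler adds for a single output
};

// Estimate the total node budget: nodes for the model itself plus the nodes
// every sampler may add for up to n_outputs_max outputs (assumed scaling).
uint32_t graph_max_nodes_sketch(uint32_t n_model_nodes,
                                const std::vector<sampler_sketch> & samplers,
                                uint32_t n_outputs_max) {
    uint32_t n = n_model_nodes;
    for (const auto & s : samplers) {
        n += s.n_nodes * n_outputs_max;
    }
    return n;
}
```

Asking each sampler for its own count avoids hardcoding a worst-case constant that would have to be updated whenever a sampler's graph changes.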

danbev requested review from CISC and ggerganov as code owners February 23, 2026 13:51

danbev marked this pull request as draft February 23, 2026 13:51

github-actions bot added the testing, examples, and server labels Feb 23, 2026

danbev force-pushed the backend-sampling-multiple-outputs-per-seq branch from 6c6b36b to 5c92c76 February 27, 2026 05:14

danbev marked this pull request as ready for review February 27, 2026 10:19

danbev requested a review from ngxson as a code owner February 27, 2026 10:19

am17an mentioned this pull request May 18, 2026

Move to backend sampling for MTP draft path #23287

Merged

danbev added 5 commits May 20, 2026 10:18

server : enable backend sampling for multiple outputs per sequence

9e7599f

danbev force-pushed the backend-sampling-multiple-outputs-per-seq branch from 5c92c76 to b4f486d May 20, 2026 08:20

sampling : introduce backend_n_nodes in API

6c76343


danbev marked this pull request as draft May 20, 2026 13:47

danbev wants to merge 6 commits into ggml-org:master from danbev:backend-sampling-multiple-outputs-per-seq
