sampling : support multiple outputs per sequence #19833

Draft
danbev wants to merge 6 commits into ggml-org:master from danbev:backend-sampling-multiple-outputs-per-seq

@danbev (Member) commented Feb 23, 2026

This commit adds support for multiple outputs per sequence in the backend sampling implementation.

The main motivation for this change is to be able to support speculative decoding using backend samplers where multiple outputs for the same sequence would be needed.
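
To illustrate why speculative decoding needs more than one output per sequence, here is a minimal sketch in plain C++. It does not use the llama.cpp API: `batch_token`, `make_verify_batch`, and `outputs_per_seq` are hypothetical names introduced only for this example. The assumption is that verifying a draft of k tokens requests an output at each of the k + 1 positions of the same sequence.

```c++
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical, simplified batch entry (not the llama.cpp llama_batch layout).
struct batch_token {
    int32_t token;   // token id
    int32_t seq_id;  // sequence this token belongs to
    bool    output;  // true if an output (logits/sample) is requested here
};

// Build a verification batch for one sequence: the last accepted token
// followed by the draft tokens, each requesting an output.
std::vector<batch_token> make_verify_batch(int32_t seq_id, int32_t last_token,
                                           const std::vector<int32_t> & draft) {
    assert(seq_id >= 0);
    std::vector<batch_token> batch;
    batch.push_back({ last_token, seq_id, true });
    for (int32_t t : draft) {
        batch.push_back({ t, seq_id, true });
    }
    return batch;
}

// Group the batch indices that request an output by sequence id.
// Plain decoding yields one index per sequence; verifying a draft of
// k tokens yields k + 1 indices for the same sequence.
std::map<int32_t, std::vector<int32_t>> outputs_per_seq(const std::vector<batch_token> & batch) {
    std::map<int32_t, std::vector<int32_t>> res;
    for (int32_t i = 0; i < (int32_t) batch.size(); ++i) {
        if (batch[i].output) {
            res[batch[i].seq_id].push_back(i);
        }
    }
    return res;
}
```

With a draft of three tokens, `outputs_per_seq` reports four output positions for the sequence, which is the case a one-output-per-sequence sampler cannot handle.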


Example usage with llama-server:

$ llama-server -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf  \
      --spec-type ngram-mod --spec-ngram-size-n 8 --draft-min 1 --draft-max 64 \
      -bs -v

TODO:

  • Make the graph "static" for sequences with multiple outputs.
  • Perhaps add a configuration option specifying that multiple outputs are not used for backend sampling to avoid the additional nodes in the compute graph.

@danbev danbev marked this pull request as draft February 23, 2026 13:51
@github-actions github-actions bot added the testing, examples, and server labels Feb 23, 2026
@danbev danbev force-pushed the backend-sampling-multiple-outputs-per-seq branch from 6c6b36b to 5c92c76 Compare February 27, 2026 05:14
@danbev danbev marked this pull request as ready for review February 27, 2026 10:19
@danbev danbev requested a review from ngxson as a code owner February 27, 2026 10:19
danbev added 5 commits May 20, 2026 10:18
This commit adds support for multiple outputs per sequence in the
backend sampling implementation.

The main motivation for this change is to be able to support speculative
decoding using backend samplers where multiple outputs for the same
sequence would be needed.

This commit adds a compute graph parameter named n_sampling_outputs_max,
which is intended to be used as a maximum (cap) on the number of
outputs for backend sampling.

The motivation is that this gives a configurable value instead of
a hardcoded macro (LLAMA_MAX_SAMPLING_OUTPUTS), which has been removed.

I'm not sure if this is the best option, as having multiple outputs per
sequence might not be the most common use case. I need to think a little
bit more about this. I'll commit this to check that CI passes. This
parameter should also be exposed as a common option for tools, which
I'll do in a follow-up commit.
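
A minimal sketch of how such a cap could be applied, assuming the graph reserves a bounded number of sampling slots per sequence. Only the name n_sampling_outputs_max comes from the commit message; the struct `graph_params_sketch`, the function `n_sampling_slots`, and the choice to clamp rather than reject an oversized request are assumptions made for illustration.

```c++
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical parameter struct; only the field name is taken from the commit message.
struct graph_params_sketch {
    uint32_t n_sampling_outputs_max = 1; // cap on sampling outputs per sequence
};

// Number of per-sequence sampling slots to use for a request.
// Assumption: requests above the cap are clamped to the cap.
uint32_t n_sampling_slots(const graph_params_sketch & params, uint32_t n_outputs_requested) {
    assert(params.n_sampling_outputs_max > 0);
    return std::min(n_outputs_requested, params.n_sampling_outputs_max);
}
```

Compared with a macro, a runtime field like this lets a tool choose a cap of 1 for plain decoding and a larger cap only when speculative decoding is enabled.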

This commit makes the computation graph static when backend samplers
process multiple outputs per sequence.

Previously, only active samplers (those with outputs in the current
batch) were added to the graph. This could cause graph reallocations
when different samplers became active or inactive across batches, even
when the number of outputs remained constant.
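
The difference can be sketched by counting graph nodes under both strategies. This is an illustration in plain C++, not the PR's code: `sampler_nodes_sketch`, `nodes_active_only`, and `nodes_static` are hypothetical, and the per-sampler node counts are arbitrary inputs.

```c++
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one configured sampler: the number of
// graph nodes it contributes when it is added to the graph.
struct sampler_nodes_sketch {
    size_t n_nodes;
};

// Previous behavior: only samplers with outputs in the current batch
// contribute nodes, so the total depends on which samplers are active.
size_t nodes_active_only(const std::vector<sampler_nodes_sketch> & samplers,
                         const std::vector<bool> & active) {
    assert(samplers.size() == active.size());
    size_t n = 0;
    for (size_t i = 0; i < samplers.size(); ++i) {
        if (active[i]) {
            n += samplers[i].n_nodes;
        }
    }
    return n;
}

// Static behavior: every configured sampler contributes nodes, so the
// total is the same for every batch.
size_t nodes_static(const std::vector<sampler_nodes_sketch> & samplers) {
    size_t n = 0;
    for (const auto & s : samplers) {
        n += s.n_nodes;
    }
    return n;
}
```

With two samplers of different sizes and exactly one active per batch, `nodes_active_only` returns a different total depending on which one is active, while `nodes_static` does not change. A changing total is the kind of shape change that can force the graph to be reallocated.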

This commit adds clamping to the backend distribution sampler to avoid
the case where idxf values are all zero. If this happens, we will
incorrectly create an out-of-bounds idx value, which will cause a crash.

This can be reproduced by explicitly setting idxf to zero:
```c++
    idxf = ggml_scale(ctx, idxf, 0.0f);
```
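
One way an all-zero value can turn into an out-of-bounds index is sketched below in plain C++. This is an assumption about the mechanism, not a copy of the sampler: it models inverse-CDF sampling where the index is computed as n minus the number of cumulative probabilities that reach the random threshold u. The function `sample_index_sketch` and its `clamp` flag are hypothetical.

```c++
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Inverse-CDF sampling by counting: the sampled index is n minus the number
// of cumulative sums that are >= u. If no cumulative sum reaches u (for
// example due to rounding, or a forced zero as in the reproduction above),
// the count is 0 and the unclamped index is n, one past the last valid index.
int32_t sample_index_sketch(const std::vector<float> & probs, float u, bool clamp) {
    assert(!probs.empty());
    const int32_t n = (int32_t) probs.size();

    float   cum   = 0.0f;
    int32_t count = 0;
    for (float p : probs) {
        cum += p;
        if (cum >= u) {
            ++count;
        }
    }

    int32_t idx = n - count;
    if (clamp) {
        idx = std::min(idx, n - 1); // keep the index inside [0, n - 1]
    }
    return idx;
}
```

For probabilities {0.25, 0.25, 0.5} and u = 0.3 the result is index 1 with or without clamping. When the count is zero, the unclamped result is 3, which is out of bounds, and the clamped result is 2.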
@danbev danbev force-pushed the backend-sampling-multiple-outputs-per-seq branch from 5c92c76 to b4f486d Compare May 20, 2026 08:20
This commit adds a function to the backend sampler API to retrieve the
number of nodes required for sampling.

The motivation for this is that this enables graph_max_nodes to get the
number of nodes required for backend sampling.
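
A sketch of how a node-count query could feed into a graph size estimate. The names here are hypothetical and are not the API added by the commit: `backend_sampler_sketch` stands in for a sampler interface with an optional node-count callback, and `graph_max_nodes_sketch` stands in for graph_max_nodes. Multiplying by the per-sequence output cap is also an assumption.

```c++
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sampler interface with an optional node-count query.
struct backend_sampler_sketch {
    uint32_t (*n_nodes)(void); // nodes needed to sample one output, or nullptr
};

// Estimate the maximum number of graph nodes: the model's own nodes plus,
// for every sampler that reports a count, its nodes for each possible output.
uint32_t graph_max_nodes_sketch(uint32_t n_model_nodes,
                                const std::vector<backend_sampler_sketch> & samplers,
                                uint32_t n_outputs_max) {
    assert(n_outputs_max > 0);
    uint32_t total = n_model_nodes;
    for (const auto & s : samplers) {
        if (s.n_nodes != nullptr) {
            total += s.n_nodes() * n_outputs_max;
        }
    }
    return total;
}
```

Asking each sampler for its own count avoids hardcoding a worst-case constant in the graph size estimate.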
@danbev danbev marked this pull request as draft May 20, 2026 13:47
Labels

examples, server, testing
