sampling : support multiple outputs per sequence #19833
Draft
danbev wants to merge 6 commits into
This commit adds support for multiple outputs per sequence in the backend sampling implementation. The main motivation for this change is to be able to support speculative decoding using backend samplers where multiple outputs for the same sequence would be needed.
This commit adds a compute graph parameter named n_sampling_outputs_max, which is intended to cap the number of outputs for backend sampling. The motivation is that it provides a configurable value instead of the hardcoded macro (LLAMA_MAX_SAMPLING_OUTPUTS), which has been removed. I'm not sure this is the best option, as having multiple outputs per sequence might not be the most common use case. I need to think a little more about this. I'll commit this to check that CI passes. This parameter should also be exposed as a common option for tools, which I'll do in a follow-up commit.
This commit makes the computation graph static when backend samplers process multiple outputs per sequence. Previously, only active samplers (those with outputs in the current batch) were added to the graph. This could cause graph reallocations when different samplers became active or inactive across batches, even when the number of outputs remained constant.
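To illustrate why the active-only approach changes the graph between batches, here is a minimal standalone sketch. It does not use the llama.cpp or ggml APIs: `sampler_slot`, `graph_samplers_dynamic`, and `graph_samplers_static` are hypothetical names made up for this example, and the list of sequence ids stands in for the graph topology.

```c++
#include <vector>

// Hypothetical stand-in for a configured backend sampler (not a llama.cpp type).
struct sampler_slot {
    int  seq_id;              // sequence this sampler belongs to
    bool has_output_in_batch; // true if the current batch has outputs for it
};

// Previous behaviour as described above: only samplers that are active in
// the current batch contribute to the graph.
static std::vector<int> graph_samplers_dynamic(const std::vector<sampler_slot> & slots) {
    std::vector<int> res;
    for (const auto & s : slots) {
        if (s.has_output_in_batch) {
            res.push_back(s.seq_id);
        }
    }
    return res;
}

// Static variant: every configured sampler contributes, regardless of
// whether it has outputs in the current batch.
static std::vector<int> graph_samplers_static(const std::vector<sampler_slot> & slots) {
    std::vector<int> res;
    for (const auto & s : slots) {
        res.push_back(s.seq_id);
    }
    return res;
}
```

With two batches where sequence 0 is active in the first and sequence 1 in the second, the dynamic variant yields different sampler sets (so a different graph) while the static variant yields the same set for both, even though each batch has exactly one output.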
This commit adds clamping to the backend distribution sampler to avoid
the case where idxf values are all zero. If this happens then we will
incorrectly create an out of bounds idx value which will cause a crash.
This can be reproduced by explicitly setting idxf to zero:
```c++
idxf = ggml_scale(ctx, idxf, 0.0f);
```
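For readers not familiar with the failure mode, the following is a CPU-side sketch of the same idea in plain C++, not the ggml graph code from this PR. It assumes the index is derived as the vocabulary size minus the number of cumulative-probability entries that reach the random threshold; that derivation and the helper name `sample_index_clamped` are assumptions for illustration only.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of llama.cpp): inverse-CDF sampling where
// the index is computed from a count, followed by a clamp.
static std::size_t sample_index_clamped(const std::vector<float> & probs, float u) {
    assert(!probs.empty());
    const std::size_t n = probs.size();

    // Count the entries whose cumulative probability reaches the threshold u.
    float       cumsum = 0.0f;
    std::size_t n_ge   = 0;
    for (float p : probs) {
        cumsum += p;
        if (cumsum >= u) {
            n_ge++;
        }
    }

    // Unclamped index. When no entry reaches u (n_ge == 0) this equals n,
    // which is one past the last valid index.
    const std::size_t idx = n - n_ge;

    // Clamp to the last valid index to avoid an out-of-bounds access.
    return std::min(idx, n - 1);
}
```

For example, with `probs = {0.25f, 0.25f, 0.5f}` a threshold of `0.3f` selects index 1, while a threshold that no cumulative value reaches (such as `2.0f`) would produce index 3 without the clamp and is clamped to 2.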
This commit adds a function to the backend sampler API to retrieve the number of nodes required for sampling. The motivation is to allow graph_max_nodes to query the number of nodes required for backend sampling.
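As a rough sketch of how such a query could feed into a node budget, the example below sums a per-sampler node count. Everything here is hypothetical: `sampler_iface`, its `n_nodes` field, and `graph_max_nodes_with_sampling` are invented for illustration, and multiplying by n_sampling_outputs_max is an assumption about how the cap might be applied, not a description of the actual change.

```c++
#include <cstdint>
#include <vector>

// Hypothetical sampler description. The real backend sampler API may expose
// this as a callback with a different name and signature.
struct sampler_iface {
    uint32_t n_nodes; // assumed: graph nodes this sampler adds for one output
};

// Hypothetical budget: base graph nodes plus, for each sampler, its node
// count multiplied by the maximum number of sampling outputs (assumption).
static uint32_t graph_max_nodes_with_sampling(
        uint32_t                           base_nodes,
        const std::vector<sampler_iface> & samplers,
        uint32_t                           n_sampling_outputs_max) {
    uint32_t total = base_nodes;
    for (const auto & s : samplers) {
        total += s.n_nodes * n_sampling_outputs_max;
    }
    return total;
}
```

The point of querying each sampler is that the budget follows the samplers actually configured instead of relying on a fixed worst-case constant.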
This commit adds support for multiple outputs per sequence in the backend sampling implementation.
The main motivation for this change is to be able to support speculative decoding using backend samplers where multiple outputs for the same sequence would be needed.
Example usage with llama-server:
TODO: