
implement adaptive-p sampler #17927

Merged
ggerganov merged 92 commits into ggml-org:master from ddh0:power-law-sampler
Jan 15, 2026

Conversation

@ddh0
Contributor

@ddh0 ddh0 commented Dec 11, 2025

This PR implements a new sampler called adaptive-p that selects tokens near a configurable target probability over time.

How it works

The adaptive-p sampler transforms the token probability distribution to favor tokens that fall near a user-configurable probability target. Internally, the sampler maintains an exponential moving average of the original probabilities of selected tokens. It uses this, together with the user-configured target, to compute an adapted target at each sampling step, steering the running average toward the configured target over time. If recent selections have been higher-probability than the target, the sampler compensates by temporarily favoring lower-probability tokens, and vice versa.

Parameters

This sampler exposes two parameters:

| Parameter | Description | CLI argument | Valid range | Default | Notes |
|-----------|-------------|--------------|-------------|---------|-------|
| `target` | Select tokens near this probability | `--adaptive-target N` | 0.0 - 1.0 | -1.0 | When set to a negative value, the adaptive probability transform is disabled and sampling proceeds normally. Since the default is -1.0, the sampler is disabled by default; this is intentional. |
| `decay` | Decay value for the exponential moving average; lower values are more reactive, higher values are more stable | `--adaptive-decay N` | 0.0 - 0.99 | 0.90 | Clamped to <= 0.99 at init to avoid unbounded accumulation |

In most cases, you only need to adjust --adaptive-target. The default decay of 0.9 (roughly a 10-token history) works well. A good starting value is --adaptive-target 0.55; raise or lower the target in increments of 0.05 as needed.

Usage notes

adaptive-p selects a token ID rather than just mutating candidates, so it must be last in the sampler chain. It shares this behaviour with some existing samplers like mirostat, dist, and greedy (mirostat being the closest relative).

Only mild truncation before this sampler is recommended. We suggest applying min-p before adaptive-p as the only other active sampler in the chain (optionally with top-k as well).

Example usage:

./build/bin/llama-server -m ~/gguf/my-model.gguf --samplers "top-k;min-p;adaptive-p" --top-k 128 --min-p 0.05 --adaptive-target 0.55

Other notes

This sampler was previously called "power law" in earlier versions of this PR, named for the power-law transform we were applying to logits. We are no longer applying that transform. We also experimented with a Gaussian transform, but ultimately settled on the current formula.

Acknowledgements

  • @MrJackSpade - original "power law" sampler idea and implementation, as can be seen here; they continued to help guide me as I worked through the implementation of this sampler. They deserve an enormous amount of credit for the work they put in on this.
  • @Geechan for advice on sampler parameters and collaboration during development
  • @AesSedai for testing, collaboration, and hosting, as well as coming up with the final name "adaptive-p"

@ddh0

This comment was marked as outdated.

@ddh0 ddh0 marked this pull request as ready for review December 11, 2025 23:59
@ddh0 ddh0 requested a review from ggerganov as a code owner December 11, 2025 23:59
@ddh0
Contributor Author

ddh0 commented Dec 12, 2025

Nevermind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.

@ddh0 ddh0 marked this pull request as draft December 12, 2025 02:55
Contributor

@pnb pnb left a comment

This looks very interesting! I wish the original compared to XTC, since the goals seem highly similar.

As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.


// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value = 3.0f;
Contributor

Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?

Contributor Author

The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.

I will post a comment in the main thread with an update once we've got it more figured out. Thank you!

@Beinsezii
Contributor

Beinsezii commented Jan 12, 2026

The thing I worry about is that someone else will pick a different magic number than 1e9 in the future, and then adaptive-p breaks again. That's why I think adaptive-p should calculate its own zero point against -inf or -1e9 or whatever, instead of using a second softmax.

@ddh0
Contributor Author

ddh0 commented Jan 12, 2026

The thing I worry about is that someone else will pick a different magic number than 1e9 in the future, and then adaptive-p breaks again. That's why I think adaptive-p should calculate its own zero point against -inf or -1e9 or whatever.

Yes, I'm not sure. -1e9f is fine for now as far as I'm aware but I'll wait for @ggerganov to give his input on this. Maybe we can have a file-level constant for this threshold.

@ggerganov
Member

Haven't tested this, but can we avoid the 1e9 constant and do something like this instead:

diff --git a/src/llama-sampling.cpp b/src/llama-sampling.cpp
index 11f0394c4..19c5c419e 100644
--- a/src/llama-sampling.cpp
+++ b/src/llama-sampling.cpp
@@ -1495,6 +1495,10 @@ static void llama_sampler_top_p_backend_apply(
     struct ggml_tensor * mask = ggml_step(ctx, cdf_scaled);
     ggml_set_name(mask, "top_p_mask");
 
+    // mask_inf = (mask - 1) * -INFINITY
+    struct ggml_tensor * mask_inf = ggml_scale(ctx, ggml_scale_bias(ctx, mask, 1.0f, -1.0f), -INFINITY);
+    ggml_set_name(mask, "top_p_mask_inf");
+
     // Taking the sum of the mask gives us the sum of elements after the threshold
     // we are interested in.
     struct ggml_tensor * idxf = ggml_sum(ctx, mask);
@@ -1517,8 +1521,9 @@ static void llama_sampler_top_p_backend_apply(
     // top_p_bias = (mask * 1e9f) - 1e9f.
     // So entries in the mask that we want to discard will become -1e9f, and
     // others will be 0 (meaning that will not effect the logits).
-    const float large_val = 1e9f;
-    struct ggml_tensor * top_p_bias = ggml_scale_bias(ctx, mask, large_val, -large_val);
+    //const float large_val = 1e9f;
+    //struct ggml_tensor * top_p_bias = ggml_scale_bias(ctx, mask, large_val, -large_val);
+    struct ggml_tensor * top_p_bias = ggml_add(ctx, mask, mask_inf);
     ggml_set_name(top_p_bias, "top_p_bias");
 
     data->logits = ggml_add(ctx, sorted_logits, top_p_bias);

Would this help?

@ddh0
Contributor Author

ddh0 commented Jan 13, 2026

Would this help?

Thank you! That code specifically didn't work but it was enough to point me in the right direction, and adaptive-p now plays nicely with both CPU and backend sampling, without any more magic numbers. PTAL.

@ggerganov
Member

Nice, this looks like a better solution. Reviewing

Member

@ggerganov ggerganov left a comment

After addressing the minor comments we can merge

Comment on lines +3321 to +3333
auto * ctx = (llama_sampler_adaptive_p *) smpl->ctx;

if (ctx->target < 0.0f) {
// at negative target values, adaptive-p is no-op
// we simply sample from the existing distribution
llama_sampler_softmax_impl(cur_p, false);
cur_p->selected = llama_sample_dist(cur_p, ctx->rng);
return;
}

// softmax and store the original probabilities
llama_sampler_softmax_impl(cur_p, false);
ctx->original_probs.resize(cur_p->size);
Member

The llama_sampler_softmax_impl(cur_p, false); call can be deduplicated here - call once before the if

float decay,
uint32_t seed
) {
auto seed_cur = get_rng_seed(seed);
Member

This seed_cur logic breaks when the seed is LLAMA_DEFAULT_SEED. In that case we generate a random seed based on the time. But with this implementation, if you clone the sampler, it will inherit the "seed_cur" of the source sampler, instead of generating a new seed.

See how the dist sampler maintains a seed_cur in its context and replicate the logic here.

@ddh0
Contributor Author

ddh0 commented Jan 14, 2026

Latest commit addresses both review comments. I added a seed_cur member to the struct and clone the seed member now like dist does.

@z80maniac
Contributor

I think this sampler needs to be described in the server's README. Otherwise the users will have no idea how to use it or that it even exists in the first place.

@ddh0
Contributor Author

ddh0 commented Jan 14, 2026

Hmm, is there not a README that explains all the samplers somewhere? Since the samplers are not just used for llama-server but also llama-cli and exposed in the shared library.

@z80maniac
Contributor

It's more about describing the server API for that sampler. For example, that the users need to use adaptive_target and adaptive_decay JSON params in the /completion endpoint and that the values must be of a certain type (float) and within a certain range. And also, what the default values are. All other samplers are described somewhat like this in the server's README.

The CLI README also describes all the CLI arguments separately.

I have noticed that more expanded descriptions of the samplers can be found at tools/completion/README.md. Maybe the full descriptions of this sampler can be written in there.

@ddh0
Contributor Author

ddh0 commented Jan 14, 2026

I gotcha. Latest commit adds docs to server, CLI, and completion.

@ddh0
Contributor Author

ddh0 commented Jan 14, 2026

Should be ready to go @ggerganov 🙏 thanks for the review

Comment on lines +3384 to +3395

static void llama_sampler_adaptive_p_reset(struct llama_sampler * smpl) {
auto * ctx = (llama_sampler_adaptive_p *) smpl->ctx;
// ctx->target and ctx->decay never change after init, so it's safe to keep them as is.
// original_probs is completely overwritten on every call to _apply.
// so we only need to reset the EMA state and pending token.
ctx->weighted_sum = ctx->target / (1.0f - ctx->decay);
ctx->total_weight = 1.0f / (1.0f - ctx->decay);
ctx->pending_token_id = LLAMA_TOKEN_NULL;
ctx->pending_token_idx = -1;
}

Member

Should apply the same logic for the RNG seeds here as in the dist reset:

static void llama_sampler_dist_reset(struct llama_sampler * smpl) {
auto * ctx = (llama_sampler_dist *) smpl->ctx;
ctx->seed_cur = get_rng_seed(ctx->seed);
ctx->rng.seed(ctx->seed_cur);
}

Contributor Author

Oops. I addressed this in 40fd48f just now

@ddh0
Contributor Author

ddh0 commented Jan 15, 2026

Please excuse my ping @ggerganov - is there anything else needed on my end to push this over the finish line?

@ggerganov ggerganov merged commit 13f1e4a into ggml-org:master Jan 15, 2026
69 of 73 checks passed
@ddh0 ddh0 deleted the power-law-sampler branch February 20, 2026 18:17
MaheshJakkala pushed a commit to MaheshJakkala/llama.cpp that referenced this pull request Mar 15, 2026
* initial commit for branch

* simplify constants

* add params to `struct common_params_sampling`, add reference to PR

* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`

* add args, rename `queue_size` -> `window_size`

* improved comments

* minor

* remove old unused code from algorithm

* minor

* add power law case to `common_sampler_init`, add sampler name mappings

* clarify behaviour when `window_size = 0`

* add missing enums

* remove `target_range` param, make `target == 1` no-op, cleanup code

* oops, straggler

* add missing parameters in `server-task.cpp`

* copy from author

ref:
https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069

* remove old debug log, style nit

* fix compiler warning, add commented-out logging per token

* re-write + change parameters + simplify

* oops forgot args.cpp

* fix leftover `window_size`

* add missing values to `common_params_sampling::print()`

* with logging

* does this fix it?

* no, but does this?

* update default decay

* optimize

* fix bad merge

my git skills are lacking

* silence `missing initializer for member`

* update default decay to 0.9

* fix logging

* format (double)

* add power law to the new `samplers` vector

* log sampler init values

* improve logging messages in llama_sampler_power_law

* remove extraneous logging

* simplify target computation

last commit with debug logging!

* remove debug logging, explicitly clamp params at init

* add `use_power_law` flag + logic, minor cleanup

* update `power-law` -> `adaptive-p`

* fix cold start EMA

- `ctx->weighted_sum` is now initialized and reset to `target / (1.0f -
clamped_decay)`
- `ctx->total_weight` is now initialized and reset to `1.0f / (1.0f -
clamped_decay)`

this fixes a "cold start" problem with the moving average

* update `SHARPNESS` constant to `10.0f`

* minor style fixes

no functional changes

* minor style fixes cont.

* update `llama_sampler_adaptive_p_i` for backend sampling (ref: ggml-org#17004)

* separate into `apply` + `accept` functions

* `pending_token_idx`: switch from `llama_token` to `int32`

functionally identical (`llama.h` has `typedef int32_t llama_token;`),
but it's more correct now

* don't transform logits <= -1e9f

* fix masking in backend top-p, min-p

* address review comments

* typo in comments `RND` -> `RNG`

* add docs

* add recommended values in completion docs

* address PR feedback

* remove trailing whitespace (for CI `editorconfig`)

* add adaptive-p to `common_sampler_types_from_chars`
