implement adaptive-p sampler #17927
Conversation
Never mind, sorry, I think we want to do a little more testing. I'm going to mark this as draft again temporarily.
pnb left a comment:
This looks very interesting! I wish the original compared to XTC, since the goals seem highly similar.
As an aside, I am curious if there is some way to make it work without selecting a token (i.e., only steps 1-3). I see why token selection is necessary, given the need to save the original probability to the history for the adaptive adjustment part. But, for example, maybe it would suffice instead to save the original probability of the highest-probability token after transforming, regardless of which one is eventually selected by a downstream sampler.
src/llama-sampling.cpp (outdated):
```cpp
// fixed power law transform parameters (from original implementation)
const float distribution_width = 0.2f;
const float peak_logit_value   = 3.0f;
```
Should these parameters be configurable like in the original implementation? There is probably a tradeoff with feature creep, having too many options for users to control, but some of these seem potentially important (especially distribution_width). Also, I noticed peak_logit_value is outside the range suggested in the original implementation; is that intentional?
The original author and I are discussing the parameters over the next few days. I agree that the current implementation is probably not ideal, which is why I marked it back as draft.
I will post a comment in the main thread with an update once we've got it more figured out. Thank you!
The thing I worry about is that someone else will pick a different magic number than 1e9 in the future, and then adaptive-p breaks again. That's why I think adaptive-p should calculate its own zero point against -inf or -1e9 or whatever, instead of using a second softmax.
Yes, I'm not sure.
Haven't tested this, but can we avoid the magic value with something like this?

```diff
diff --git a/src/llama-sampling.cpp b/src/llama-sampling.cpp
index 11f0394c4..19c5c419e 100644
--- a/src/llama-sampling.cpp
+++ b/src/llama-sampling.cpp
@@ -1495,6 +1495,10 @@ static void llama_sampler_top_p_backend_apply(
     struct ggml_tensor * mask = ggml_step(ctx, cdf_scaled);
     ggml_set_name(mask, "top_p_mask");
 
+    // mask_inf = (mask - 1) * -INFINITY
+    struct ggml_tensor * mask_inf = ggml_scale(ctx, ggml_scale_bias(ctx, mask, 1.0f, -1.0f), -INFINITY);
+    ggml_set_name(mask_inf, "top_p_mask_inf");
+
     // Taking the sum of the mask gives us the sum of elements after the threshold
     // we are interested in.
     struct ggml_tensor * idxf = ggml_sum(ctx, mask);
@@ -1517,8 +1521,9 @@ static void llama_sampler_top_p_backend_apply(
     // top_p_bias = (mask * 1e9f) - 1e9f.
     // So entries in the mask that we want to discard will become -1e9f, and
     // others will be 0 (meaning that they will not affect the logits).
-    const float large_val = 1e9f;
-    struct ggml_tensor * top_p_bias = ggml_scale_bias(ctx, mask, large_val, -large_val);
+    //const float large_val = 1e9f;
+    //struct ggml_tensor * top_p_bias = ggml_scale_bias(ctx, mask, large_val, -large_val);
+    struct ggml_tensor * top_p_bias = ggml_add(ctx, mask, mask_inf);
     ggml_set_name(top_p_bias, "top_p_bias");
 
     data->logits = ggml_add(ctx, sorted_logits, top_p_bias);
```

Would this help?
Thank you! That code specifically didn't work, but it was enough to point me in the right direction, and adaptive-p now plays nicely with both CPU and backend sampling, without any more magic numbers. PTAL.
Nice, this looks like a better solution. Reviewing.
ggerganov left a comment:
After addressing the minor comments we can merge
```cpp
auto * ctx = (llama_sampler_adaptive_p *) smpl->ctx;

if (ctx->target < 0.0f) {
    // at negative target values, adaptive-p is a no-op:
    // we simply sample from the existing distribution
    llama_sampler_softmax_impl(cur_p, false);
    cur_p->selected = llama_sample_dist(cur_p, ctx->rng);
    return;
}

// softmax and store the original probabilities
llama_sampler_softmax_impl(cur_p, false);
ctx->original_probs.resize(cur_p->size);
```
The `llama_sampler_softmax_impl(cur_p, false);` call can be deduplicated here: call it once before the `if`.
```cpp
    float decay,
    uint32_t seed
) {
    auto seed_cur = get_rng_seed(seed);
```
This seed_cur logic breaks when the seed is LLAMA_DEFAULT_SEED. In that case we generate a random seed based on the time. But with this implementation, if you clone the sampler, it will inherit the "seed_cur" of the source sampler, instead of generating a new seed.
See how the dist sampler maintains a seed_cur in its context and replicate the logic here.
Latest commit addresses both review comments. I added a
I think this sampler needs to be described in the server's README; otherwise users will have no idea how to use it, or that it even exists in the first place.
Hmm, is there not a README that explains all the samplers somewhere? Since the samplers are not just used for the server.
It's more about describing the server API for that sampler, e.g. which request parameters users need to set. The CLI README also describes all the CLI arguments separately. I have noticed that more expanded descriptions of the samplers can be found at
I gotcha. Latest commit adds docs to server, CLI, and completion. |
Should be ready to go @ggerganov 🙏 thanks for the review |
```cpp
static void llama_sampler_adaptive_p_reset(struct llama_sampler * smpl) {
    auto * ctx = (llama_sampler_adaptive_p *) smpl->ctx;
    // ctx->target and ctx->decay never change after init, so it's safe to keep them as is.
    // original_probs is completely overwritten on every call to _apply,
    // so we only need to reset the EMA state and pending token.
    ctx->weighted_sum = ctx->target / (1.0f - ctx->decay);
    ctx->total_weight = 1.0f / (1.0f - ctx->decay);
    ctx->pending_token_id = LLAMA_TOKEN_NULL;
    ctx->pending_token_idx = -1;
}
```
Should apply the same logic for the RNG seeds here as in the dist reset (see llama.cpp/src/llama-sampling.cpp, lines 1111 to 1117 at 516a4ca).
Oops, I addressed this in 40fd48f just now.
Please excuse my ping @ggerganov - is there anything else needed on my end to push this over the finish line? |
* initial commit for branch
* simplify constants
* add params to `struct common_params_sampling`, add reference to PR
* explicitly clamp `min_target` and `max_target` to `[0.0, 1.0]`
* add args, rename `queue_size` -> `window_size`
* improved comments
* minor
* remove old unused code from algorithm
* minor
* add power law case to `common_sampler_init`, add sampler name mappings
* clarify behaviour when `window_size = 0`
* add missing enums
* remove `target_range` param, make `target == 1` no-op, cleanup code
* oops, straggler
* add missing parameters in `server-task.cpp`
* copy from author (ref: https://gist.github.com/MrJackSpade/9be99c7efbba7b95a41377e123b7b069)
* remove old debug log, style nit
* fix compiler warning, add commented-out logging per token
* re-write + change parameters + simplify
* oops, forgot args.cpp
* fix leftover `window_size`
* add missing values to `common_params_sampling::print()`
* with logging
* does this fix it?
* no, but does this?
* update default decay
* optimize
* fix bad merge (my git skills are lacking)
* silence `missing initializer for member`
* update default decay to 0.9
* fix logging
* format (double)
* add power law to the new `samplers` vector
* log sampler init values
* improve logging messages in llama_sampler_power_law
* remove extraneous logging
* simplify target computation (last commit with debug logging!)
* remove debug logging, explicitly clamp params at init
* add `use_power_law` flag + logic, minor cleanup
* update `power-law` -> `adaptive-p`
* fix cold start EMA: `ctx->weighted_sum` is now initialized and reset to `target / (1.0f - clamped_decay)`, and `ctx->total_weight` to `1.0f / (1.0f - clamped_decay)`; this fixes a "cold start" problem with the moving average
* update `SHARPNESS` constant to `10.0f`
* minor style fixes (no functional changes)
* minor style fixes cont.
* update `llama_sampler_adaptive_p_i` for backend sampling (ref: ggml-org#17004)
* separate into `apply` + `accept` functions
* `pending_token_idx`: switch from `llama_token` to `int32` (functionally identical, since `llama.h` has `typedef int32_t llama_token;`, but more correct now)
* don't transform logits <= -1e9f
* fix masking in backend top-p, min-p
* address review comments
* typo in comments: `RND` -> `RNG`
* add docs
* add recommended values in completion docs
* address PR feedback
* remove trailing whitespace (for CI `editorconfig`)
* add adaptive-p to `common_sampler_types_from_chars`
This PR implements a new sampler called adaptive-p that selects tokens near a configurable target probability over time.
How it works
The adaptive-p sampler transforms the token probability distribution to favor tokens that fall near a user-configurable probability target. Internally, the sampler maintains an exponential moving average of the original probabilities of selected tokens. It uses this, along with the user's set target, to compute an adapted target at each sampling step, steering the running average toward the configured target over time. If recent selections have been higher-probability than target, the sampler compensates by temporarily favoring lower-probability tokens, and vice versa.
Parameters
This sampler exposes two parameters:
- `target` (`--adaptive-target N`)
- `decay` (`--adaptive-decay N`)

In most cases, you can just play with `--adaptive-target`. The default decay of 0.9 (for a ~10-token history) works well. A good starting value is `--adaptive-target 0.55`. It is suggested to raise or lower the target in increments of 0.05 as needed.

Usage notes
adaptive-p selects a token ID rather than just mutating candidates, so it must be last in the sampler chain. It shares this behaviour with some existing samplers like mirostat, dist, and greedy (mirostat being the closest relative).
Only mild truncation before this sampler is recommended. We suggest applying min-p before adaptive-p as the only other active sampler in the chain (optionally with top-k as well).
Example usage:
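The original example invocation was not captured here; an illustrative command based on the notes above (the adaptive flags are from this PR, `--min-p` is an existing llama.cpp option, and the model path is a placeholder) might look like:

```shell
llama-cli -m model.gguf --min-p 0.05 --adaptive-target 0.55
```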
Other notes
This sampler was previously called "power law" in earlier versions of this PR, named for the power-law transform we were applying to the logits. We are no longer applying that transform. We also experimented with a Gaussian variant, but ultimately settled on the current formula.
Acknowledgements