
Make string ban more robust and add regex ban #1243

Merged
ikawrakow merged 8 commits into ikawrakow:main from SneedwareInc:main
Mar 11, 2026

Conversation

@SneedwareInc
Contributor

@SneedwareInc SneedwareInc commented Feb 6, 2026

Continuation of #1131.

This PR adds a regex ban and makes the string ban location dependent. Currently the string ban is flawed: if a token is banned, it is banned across the entire buffer. During my testing with long, overlapping strings this frequently backfired; for example, if a token was banned at the beginning but needed later in the context, the model produced nonsense. In this PR the ban is localized to specific token locations.
New arguments:

  • banned_regex: accepts JSON with regexes, case sensitive
  • banned_regex_case_insensitive: accepts JSON with regexes, case insensitive
  • banbuffer_size: number, sets the buffer size; useful when using regexes. By default (or if 0) it is longest string/regex length + 1
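For illustration, a text-completion request using the new fields might look like this (the prompt and patterns are made-up examples, not from the PR):

```json
{
  "prompt": "[INST]Write a story about a cat.[/INST]",
  "banned_strings": ["core guidelines"],
  "banned_regex": ["shivers? (down|up) (her|his) spine"],
  "banned_regex_case_insensitive": ["barely above a whisper"],
  "banbuffer_size": 0
}
```

With banbuffer_size set to 0, the server falls back to the automatic size: longest string/regex length + 1.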

My ST fork for testing: https://github.com/SneedwareInc/ik_SillyTavern

Example ban list: https://huggingface.co/datasets/ChuckMcSneed/ExampleAntislop

Currently I know it works in text completion; I'm not sure about chat completion or OpenAI formats.

@SneedwareInc SneedwareInc mentioned this pull request Feb 6, 2026
@Nexesenex
Contributor

Nexesenex commented Feb 6, 2026

If I may suggest: create different branches on your repo for your different PRs, because you erased the content of your previous one with this one (which is fine, it's a continuation, but it's not the best practice for those who want to access your previous code with ease!).

I will test this.

@SneedwareInc
Contributor Author

@Nexesenex https://github.com/SneedwareInc/ik_llama.cpp/tree/legacy
I've added my old build as 7z archive.

@ikawrakow
Owner

@firecoperana Do you want to look again at this PR?

@firecoperana
Collaborator

Yes, I will look at it when it's ready.

@SneedwareInc
Contributor Author

@firecoperana What do you want me to change/add?

@SneedwareInc SneedwareInc reopened this Feb 19, 2026
// could be improved to support more languages
std::string string_lower(const std::string& str) {
    std::string result = str;
    for (char& c : result) {
        c = (char) std::tolower((unsigned char) c);
    }
    return result;
}
Collaborator

I would keep this. No functional change.

s = string_lower(s);
auto ban_tokens = common_tokenize(llama_get_model(ctx), s, false, true);
if (ban_tokens.size() > slot.n_buffer) {
slot.n_buffer = ban_tokens.size();
Collaborator

Why use the length of the string over the token count? The buffer holds tokens, not each character.

auto ban_tokens = common_tokenize(llama_get_model(ctx), val, false, true);
if (ban_tokens.size() > slot.n_buffer) {
slot.n_buffer = ban_tokens.size();
// Use string length instead of token count
Collaborator

Same here.


count++;
if (!has_next) {
if (slot.stopped_limit && !slot.stopped_eos && !slot.stopped_word) {
Collaborator

What does this do?

slot.token_buffer.resize(n_keep_buffer);

// Adjust decoded count
slot.n_decoded -= n_rewind;
Collaborator
@firecoperana firecoperana Feb 19, 2026

Don't change slot.n_decoded. This will make prompt processing and token generation time and speed calculation incorrect.

n_rewind = check_ban_phrase(slot);
}
// if found string in the ban
if (n_rewind > 0 && (slot.rewind_count < 20 || slot.rewind_count <= 2 * slot.ban_phrases.size())) {
Collaborator

Need some kind of logic to limit the number of times to rewind.

generated_token_probs.clear();

positional_bans.clear();
ban_phrases.clear();
Collaborator

Put them in server_slot::reset()

// Check if we have specific bans for this exact position (slot.n_past)
// Note: slot.n_past is the index of the token we are about to generate.
auto pos_ban_it = slot.positional_bans.find(slot.n_past);
std::vector<llama_token> temp_banned;
Collaborator

This code and the code below could be moved inside rewind_context as it's currently done. Use slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias; to adjust logit_bias.

@SneedwareInc
Contributor Author

I would keep this. No functional change.

Okay

Why use the length of the string over the token count? The buffer holds tokens, not each character.
Same here.

Edge cases like when ALLCAPS gets tokenized as A L L C A P S (7 tokens = string length) while in lowercase it gets tokenized as all caps (2 tokens). Plus it's better for automatic buffer size estimation for regexes.
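The sizing rule argued for here can be sketched as follows (hypothetical helper names; the PR's actual sizing code lives in the server slot setup):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// ASCII-only lowercasing, mirroring the PR's string_lower()
std::string string_lower(const std::string& str) {
    std::string result = str;
    for (char& c : result) c = (char) std::tolower((unsigned char) c);
    return result;
}

// Buffer size = longest banned string length + 1, independent of tokenization.
// For byte-level tokenizers a string's token count cannot exceed its byte
// count, so this bound also covers pathological casings like "ALLCAPS"
// splitting into far more tokens than "allcaps".
size_t auto_buffer_size(const std::vector<std::string>& banned) {
    size_t longest = 0;
    for (const auto& s : banned) longest = std::max(longest, s.size());
    return longest + 1;
}
```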

What does this do?

This specific code block is strictly necessary to prevent valid tokens from being silently discarded when a generation reaches its maximum token limit. Because the server buffers tokens to check for banned phrases, several safe, generated tokens are often waiting in the queue. If this continue statement is removed, hitting the token limit will trigger an immediate break, instantly destroying the buffer and closing the connection. Consequently, every response that hits the token limit will have its final words abruptly cut off before reaching the user. The continue simply allows the loop to finish flushing the already-approved tokens to the client before cleanly releasing the slot.
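Roughly, the flushing behaviour described above looks like this (all names here are illustrative, not the PR's actual API):

```cpp
#include <deque>
#include <string>
#include <vector>

struct Slot {
    std::deque<std::string> token_buffer; // approved tokens awaiting ban checks
    bool stopped_limit = false;           // n_predict was reached
};

// When the token limit is hit, the loop continues (rather than breaking
// immediately) until the buffered, already-approved tokens have been
// flushed to the client, so the end of the response is not cut off.
std::vector<std::string> flush_on_stop(Slot& slot) {
    std::vector<std::string> sent;
    while (!slot.token_buffer.empty()) {
        sent.push_back(slot.token_buffer.front());
        slot.token_buffer.pop_front();
    }
    return sent;
}
```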

Don't change slot.n_decoded. This will make prompt processing and token generation time and speed calculation incorrect.

Do you have any suggestions for an elegant solution that ensures n_predict is the number of tokens you actually get as output, instead of n_decoded minus discarded tokens?

Need some kind of logic to limit the number of times to rewind.

Why set an arbitrary limit? With regex there are many more banned combinations possible per item than with strings.

Put them in server_slot::reset()

They are already there?

this line to remove kv cache is not needed

It is needed. Without it the program does not function correctly. I know you don't test your code properly, so let me demonstrate:
Mistral Nemo Q6_K, temperature=0, "banned_strings": ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","t","u","v","w","x","y","z"],
Prompt:

[INST]Pick a random letter.[/INST]Sure, the random letter I've picked is "

Without it: " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " and it keeps going

With it: S".</s>

This code and the code below could be moved inside rewind_context as it's currently done. Use slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias; to adjust logit_bias.

Okay

@firecoperana
Collaborator

  1. Tokenize twice: once with the original strings and once with the lower-cased ones. Take the longest size as the buffer. The way you are doing it, you create unnecessary delays. For regex purposes, you can just set your own buffer size. There is no need to use the string length.
  2. For slot.n_decoded, let's keep this monotonic. Create a new variable for this and n_output so that it serves the purpose you want. When I use n_predict, I want to limit how many tokens the LLM has calculated. If for some reason all the tokens are banned, I will wait forever for it to stop with your change.
  3. For the rewind limit, the default behavior should have a limit. This is a reasonable limit that works for me. If you don't think that's enough, you can add a payload to override it. For regex bans, you always want to set your own limit.

@SneedwareInc
Contributor Author

Tokenize twice: once with the original strings and once with the lower-cased ones. Take the longest size as the buffer.

That's stupid and does not catch edge cases. Let's keep it simple and reliable: longest string/regex length + 1.

The way you are doing it, you create unnecessary delays.

What delays? Care to demonstrate?

For slot.n_decoded, let's keep this monotonic. Create a new variable for this and n_output so that it serves the purpose you want. When I use n_predict, I want to limit how many tokens the LLM has calculated. If for some reason all the tokens are banned, I will wait forever for it to stop with your change.

Okay

This is a reasonable limit that works for me.

Okay, I'll set the default limit to a reasonable amount (512) that works for me, someone who uses this functionality a lot, and add an option to set it to whatever you want if you are so afraid it will get stuck.

@ikawrakow
Owner

@SneedwareInc I would appreciate it if you were slightly more respectful in your responses to @firecoperana. Thank you.

@SneedwareInc
Contributor Author

@ikawrakow How am I disrespectful?

@Lissanro

Lissanro commented Feb 20, 2026

@SneedwareInc I think he is referring to using phrases like "That's stupid and does not catch edge cases" instead of just "does not catch edge cases", or "I know you don't test your code properly, so let me demonstrate" instead of just "let me demonstrate". Imagine phrases like that being directed at you when you originally missed a lot of use cases. Anyway, I appreciate your work, but firecoperana is also putting in a lot of effort... and not just here; he has done an enormous amount of work and contributions. For what it's worth, I also put a lot of work into testing the previous patches, but I did not catch the edge cases you have mentioned...

My point is, not catching or missing edge cases does not mean not testing properly or being stupid; implementing features is just a lot of work. Your previous patch was also missing a lot of cases, so it is really hard to take everything into account. This is why discussing and testing things together always helps: for non-trivial features that cover a massive number of use cases, it is not really possible for a single person to think of them all. A suggested optimization, or one you think of yourself, may or may not cover all possible cases... So my two cents: if you think something wrong is being suggested, just explain why and what issues it would cause; no need for negative phrases.

@SneedwareInc
Contributor Author

Imagine phrases like that being directed at you when you originally missed a lot of use cases.

I would not care, but that's me; I know that LLMs can make mistakes in code, and I am prepared for insults. But I'll soften my language in future interactions, thanks for pointing it out.

@SneedwareInc
Contributor Author

@firecoperana I've updated the code. Is this what you wanted?
I added 2 new arguments:

  • saturate_predict - establishes how n_predict is treated. If set to true, n_predict is treated as the count of actual output tokens; if set to false, n_predict is the number of predicted tokens, regardless of whether they were flushed out or not. Defaults to false.
  • rewind_count_max - specifies the maximum rewind count. -1 uses an automatic limit: 2 × the total number of bans, or 20, whichever is greater. If set to 0, rewinds are unlimited. Any other positive number is used as the limit. Defaults to -1.
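Under these semantics, the effective rewind limit could be computed along these lines (a sketch based on the description above; the helper name is hypothetical):

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>

// rewind_count_max: -1 = automatic, 0 = unlimited, >0 = explicit cap
int effective_rewind_limit(int rewind_count_max, std::size_t total_bans) {
    if (rewind_count_max == -1)
        return std::max(2 * (int) total_bans, 20); // automatic: 2*bans or 20
    if (rewind_count_max == 0)
        return INT_MAX; // effectively unlimited
    return rewind_count_max;
}
```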

@firecoperana
Collaborator

firecoperana commented Feb 21, 2026

Yes, that works. One last thing is to revert the change of the buffer size for ban strings or change to what I suggested.

@SneedwareInc
Contributor Author

One last thing is to revert the change of the buffer size for ban strings or change to what I suggested.

@firecoperana Here I have to disagree. My position comes from noticing how aggressively some LLMs have tried to bypass the banned strings. I had "core guidelines" banned back when I used the tokenization method for buffer size determination, but the model sneakily bypassed it by writing CORE GUIDELINES, which exceeded the buffer and did not trigger a rewind. Having to guess which variation leads to maximal tokenization, as you are suggesting, is impossible. I am not suggesting string length + 1 without a reason.

Other than that, you seem not to understand that there is no significant speed penalty when buffering, just visual latency in the stream. Let me prove my point with data:

| Attempt | banbuffer_size 4 | banbuffer_size 20 |
|--------:|-----------------:|------------------:|
| 1       | 2338             | 2321              |
| 2       | 2381             | 2322              |
| 3       | 2346             | 2410              |
| 4       | 2492             | 2317              |
| 5       | 2466             | 2456              |
| 6       | 2374             | 2419              |
| 7       | 2424             | 2399              |
| 8       | 2341             | 2458              |
| 9       | 2458             | 2430              |
| 10      | 2418             | 2449              |
| AVG     | 2403.8           | 2398.1            |
| STDEV   | 55.99            | 57.19             |

Times are in ms.

Mistral Nemo, Q6_K,

"n_predict": 100,
"temperature": 0.0,
"banned_strings": ["test"],
"banbuffer_size": 4 or 20,
"saturate_predict": true,

Prompt:

[INST] Write a story about a cat. Write like a female writer with a lot of purple prose.[/INST] In

Can you explain to me why you think your approach is superior?

@Lissanro

@SneedwareInc If you can find time, could you please rebase your patch? I tried to apply https://github.com/ikawrakow/ik_llama.cpp/pull/1243.patch but it has many conflicts:

> patch -p1 < patches/1243.patch                             
patching file common/common.cpp
patching file examples/server/server-context.cpp
Hunk #6 succeeded at 3097 (offset 2 lines).
Hunk #7 succeeded at 3125 (offset 2 lines).
Hunk #8 FAILED at 3137.
Hunk #9 succeeded at 3313 (offset 4 lines).
Hunk #10 succeeded at 3454 (offset 6 lines).
1 out of 10 hunks FAILED -- saving rejects to file examples/server/server-context.cpp.rej
patching file examples/server/server-context.h
patching file examples/server/server-context.cpp
patching file examples/server/server-context.cpp
Hunk #3 succeeded at 3099 (offset 2 lines).
Hunk #4 succeeded at 3110 (offset 2 lines).
Hunk #5 FAILED at 3270.
Hunk #6 FAILED at 3287.
Hunk #7 succeeded at 3328 (offset -62 lines).
2 out of 7 hunks FAILED -- saving rejects to file examples/server/server-context.cpp.rej
patching file examples/server/server-context.h

@SneedwareInc
Contributor Author

@Lissanro For some reason mainline no longer works for me.

Same prompt as above, same settings, mainline crashed after generating just one token, no error log:

llama-server -m mistral-nemo-instruct/ggml-model-Q6_K.gguf -c 8000 --verbose --special -ngl 999

...

VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716361
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716361
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716361
INFO [              slots_idle] all slots are idle | tid="12628" timestamp=1771716361
VERB [          kv_cache_clear] clearing KV cache | tid="12628" timestamp=1771716361
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716361
INFO [      log_server_request] request | tid="9868" timestamp=1771716367 remote_addr="127.0.0.1" remote_port=63281 status=200 method="OPTIONS" path="/completion" params={}
VERB [      log_server_request] request | tid="9868" timestamp=1771716367 request="" response=""
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716367
VERB [      get_available_slot] selected slot by lru | tid="12628" timestamp=1771716367 id_slot=0 t_last=-1
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="12628" timestamp=1771716367 id_slot=0 id_task=0
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716367
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716367
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716367
VERB [    batch_pending_prompt] tokenizing prompt | tid="12628" timestamp=1771716367 id_slot=0 id_task=0
VERB [    batch_pending_prompt] prompt tokenized | tid="12628" timestamp=1771716367 id_slot=0 id_task=0 n_ctx=8192 n_keep=0 n_prompt_tokens=23 prompt_tokens="<s>[INST] Write a story about a cat. Write like a female writer with a lot of purple prose.[/INST] In"
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="12628" timestamp=1771716367 id_slot=0 id_task=0 p0=0
VERB [    batch_pending_prompt] prompt processing progress | tid="12628" timestamp=1771716367 id_slot=0 n_past=23 n_ctx=8192 n_tokens=23 progress=1.0
VERB [    batch_pending_prompt] prompt done | tid="12628" timestamp=1771716367 id_slot=0 n_past=23 n_ctx=8192 n_tokens=23
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716367 n_tokens=23
VERB [            update_slots] run slots completed | tid="12628" timestamp=1771716367
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716367
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716367
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716367
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716367
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716367
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716367 n_tokens=1
VERB [            update_slots] run slots completed | tid="12628" timestamp=1771716367
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716367
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716368
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716368
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716368
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716368
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716368 n_tokens=1
VERB [            update_slots] run slots completed | tid="12628" timestamp=1771716368
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716368
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716368
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716368
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716368
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716368
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716368 n_tokens=1
VERB [           process_token] next token | tid="12628" timestamp=1771716368 id_slot=0 id_task=0 token=1278 token_text=" the" has_next_token=true n_remain=98 n_decoded=4 stopped_eos=false stopped_word=false stopped_limit=false stopping_word=""
llama-server --version
version: 4214 (bd387a27)
built with MSVC 19.42.34433.0 for x64

@ikawrakow @firecoperana Can you please look into that?

@firecoperana
Collaborator

Does #1304 fix it?

@SneedwareInc
Contributor Author

@firecoperana It no longer crashes, but string ban doesn't work:
"banned_strings": ["e"],
Same prompt
Output:

 the resplendent tapestry of twilight, where the sun's last embers kissed the horizon, there resided a feline enchantress named Isolde. Her coat, a symphony of ebony and silver, shimmered like

@firecoperana
Collaborator

#1310 fixes it.

@SneedwareInc SneedwareInc marked this pull request as ready for review February 27, 2026 21:30
@SneedwareInc
Contributor Author

@firecoperana done!

Collaborator

@firecoperana firecoperana left a comment

This PR changes a lot of existing code, so it will take a while for me to fully review. When you use AI to write a PR, watch out for any code that was removed by the AI.

if (ban_pos >= 0 && allow_rewind) {
rewind_context(slot, ban_pos);
slot.rewind_status = true;
slot.ctx_sampling->rewind_samplers = true;
Collaborator

Why is this removed?

}
}
n++;
if (slot.banned_n > 0 && n == slot.banned_n) {
Collaborator

banned_n no longer works

LLAMA_LOG_INFO("Banned pattern detected at pos %d. Banning token %d ('%s') and rewinding.\n",
abs_pos, banned_tok, slot.token_buffer[token_idx].text_to_send.c_str());

slot.positional_bans[abs_pos].insert(banned_tok);
Collaborator

Can this be moved to rewind_context? This function should just check whether a banned string exists. If possible, can you make it return n_rewind? There is no need to make additional changes in this function except for adding regex ban detection. With n_rewind being returned, there is less change in the rewind_context function too.

@SneedwareInc
Contributor Author

@firecoperana I brought back rewind_samplers and record_samplers, restored banned_n, and moved the code to rewind_context. Is there anything else you wish me to do?

@SneedwareInc
Contributor Author

SneedwareInc commented Mar 3, 2026

slot.ctx_sampling->rewind_samplers = true; and slot.ctx_sampling->record_samplers = true; are breaking something. The bans work, but the output quality is degraded when they are present. I will remove them.

@firecoperana
Collaborator

I still see banned_n not working. There is other code that is removed too.
slot.ctx_sampling->rewind_samplers = true; and slot.ctx_sampling->record_samplers = true; are added for adaptive-p. Do you use adaptive p sampling when the output is broken?

Do you mind if I copy your code and create a clean PR? Your PR removes more code that is not related to the regex ban than the last time I checked.

@dungquixote42
Contributor

Perhaps #1359 fixes the issue. @SneedwareInc if you would test it with string/regex bans, I would love to hear the results.

@SneedwareInc
Contributor Author

I still see banned_n not working.

I'll look into that.

There is other code that is removed too.

Such as?

Do you use adaptive p sampling when the output is broken?

No, I do not use or even know what adaptive-p is. I only use temperature and TFS. It is very concerning if rewind samplers cause it to turn on or affect other samplers that should not be affected.

Do you mind if I copy your code and create a clean PR?

You're welcome to cherry-pick this or copy it exactly as-is into a clean branch.

However, please don't rewrite the implementation logic, just copy it verbatim. The last rewrite introduced bugs that weren't caught because they weren't tested against edge cases.

If the issue is just formatting or the unrelated deletions, I can clean those up myself in this PR. I'd strongly prefer we fix the current one rather than risk another untested rewrite.

Perhaps #1359 fixes the issue. @SneedwareInc if you would test it with string/regex bans, I would love to hear the results.

I'll look into that.

}
else if (penalty_prompt->is_array()) {
const auto n_tokens = penalty_prompt->size();
slot.sparams.penalty_prompt_tokens.clear();
Collaborator

Keep this.


const auto preserved_tokens = data.find("preserved_tokens");
if (preserved_tokens != data.end()) {
slot.sparams.preserved_tokens.clear();
Collaborator

Keep this.

}
const auto grammar_triggers = data.find("grammar_triggers");
if (grammar_triggers != data.end()) {
slot.sparams.grammar_triggers.clear();
Collaborator

Keep this.


slot.logit_bias = slot.sparams.logit_bias; // keep a copy to restore
slot.ban_phrases_bias = json_value(data, "banned_bias", params_base.ban_phrases_bias);
slot.banned_n = json_value(data, "banned_n", params_base.banned_n);
Collaborator

Keep this.

slot.n_past_prompt++;
slot.n_past++;
slot.do_checkpoint = false;
if (params_base.do_checkpoint && slot.n_prompt_tokens - slot.n_past_prompt == params_base.ctx_checkpoints_tolerance) {
Collaborator

Keep this.

if (slot.state != SLOT_STATE_PROCESSING || slot.i_batch < (int)i || slot.i_batch >= (int)(i + n_tokens)) {
// save checkpoint during prompt processing
if (slot.command == SLOT_COMMAND_LOAD_PROMPT) {
if (slot.do_checkpoint) {
Collaborator

Keep this.

slot.t_start_generation = ggml_time_us();
slot.t_prompt_processing = (slot.t_start_generation - slot.t_start_process_prompt) / 1e3;
metrics.on_prompt_eval(slot);
// create checkpoint after prompt processing ends
Collaborator

Keep this.

}
}

// create checkpoint during generation
Collaborator

Keep this.

@SneedwareInc SneedwareInc reopened this Mar 6, 2026
@SneedwareInc
Contributor Author

@dungquixote42 Quality degradation is still there, but feels less severe than before. Could it be that adaptive_p somehow gets auto-enabled (I never had it enabled)? Or is it the way I copied it over?

@firecoperana Fixed.

@dungquixote42
Contributor

@dungquixote42 Quality degradation is still there, but feels less severe than before. Could it be that adaptive_p somehow gets auto-enabled (I never had it enabled)? Or is it the way I copied it over?

I fetched this PR and ran it with test code. The adaptive-p sampler is working as intended as far as I can tell; that is, it is a no-op when its target is < 0.
Your frontend is not setting the target to 0, is it? It needs to be negative.

Collaborator

@firecoperana firecoperana left a comment

Besides adding back the code that was removed, also unify the slot.banned_n == 1 and != 1 cases. There is no need for a special case with slot.banned_n == 1 for positional bans, recovering, and setting logit bias.

{
for (auto result = slot.token_buffer.begin() + n_keep_buffer; result != slot.token_buffer.end(); result++) {
if (!tokens.contains(result->tok)) {
slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias;
Collaborator

Suggested change:

- slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias;
+ if (!tokens.contains(result->tok)) {
+     tokens.insert(result->tok);
+     slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias;
+ }

Collaborator

You can combine slot.banned_n == 1 in this as well. No need to create a new if. positional_bans is missing in the banned_n != 0 case.

continue; // sample using speculative decoding
}

// RESTORE AND APPLY POSITIONAL BANS
Collaborator

Move this inside rewind_context.

if (!slot.rewind_status) {
slot.ctx_sampling->params.logit_bias = slot.logit_bias; // restore logit bias

if (slot.banned_n != 1) {
Collaborator

What's the reason to special-case banned_n != 1?

for (size_t i = 0; i < std::min(max_probs, n_probs); i++) {
result.probs.push_back({
cur_p->data[i].id,
common_token_to_piece(ctx, cur_p->data[i].id, special),
Collaborator

keep this

for (size_t i = 0; i < std::min(n_vocab, n_probs); i++) {
result.probs.push_back({
cur[i].id,
common_token_to_piece(ctx, cur[i].id, special),
Collaborator

keep this

}

slot.ctx_sampling->n_rewind = sent_results ? -1 : n_rewind;
if (slot.sparams.adaptive_target >= 0.0f) {
Contributor
@dungquixote42 dungquixote42 Mar 8, 2026

I am not sure if this check (and others elsewhere) is the right solution to the sampler running when it is not supposed to. I cannot reproduce this, so would you print adapt_p_ctx->target from llama_sample_adaptive_p_impl() and show us what it says? Preferably before and after the enable check.

@SneedwareInc
Contributor Author

Quality is degraded right now; I'm not sure which of the changes caused it.

@SneedwareInc
Contributor Author

Should be fixed. Moving // RESTORE AND APPLY POSITIONAL BANS to rewind_context is not possible due to quality degradation.

Collaborator

@firecoperana firecoperana left a comment

Not sure about the degradation due to adaptive p, but if there is, it can be fixed later.

@ikawrakow ikawrakow merged commit 4a24759 into ikawrakow:main Mar 11, 2026