
Make string ban more robust and add regex ban #1243

Merged
ikawrakow merged 8 commits into ikawrakow:main from SneedwareInc:main
Mar 11, 2026

Conversation

@SneedwareInc
Contributor

@SneedwareInc SneedwareInc commented Feb 6, 2026

Continuation of #1131.

This PR adds a regex ban and makes the string ban location dependent. Currently the string ban is flawed: if a token is banned, it is banned across the entire buffer. During my testing with long, overlapping strings this frequently backfired; for example, if a token was banned at the beginning but needed later in the context, the model produced nonsense. In this PR the ban is localized to specific token locations.
New arguments:

  • banned_regex: accepts JSON with regexes, case sensitive
  • banned_regex_case_insensitive: accepts JSON with regexes, case insensitive
  • banbuffer_size: number, sets the buffer size; useful when using regexes. By default (or if 0) it is longest string/regex length + 1
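For illustration, a text-completion request using the new fields might look like this (the prompt and patterns are made-up examples, not from the PR):

```json
{
  "prompt": "[INST]Write a story about a cat.[/INST]",
  "banned_strings": ["core guidelines"],
  "banned_regex": ["shivers? (down|up) (her|his) spine"],
  "banned_regex_case_insensitive": ["barely above a whisper"],
  "banbuffer_size": 0
}
```

With banbuffer_size set to 0, the server falls back to the automatic size: longest string/regex length + 1.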

My ST fork for testing: https://github.com/SneedwareInc/ik_SillyTavern

Example ban list: https://huggingface.co/datasets/ChuckMcSneed/ExampleAntislop

Currently I know it works in text completion; I'm not sure about chat completion or OpenAI formats.

@SneedwareInc SneedwareInc mentioned this pull request Feb 6, 2026
@Nexesenex
Contributor

Nexesenex commented Feb 6, 2026

If I may suggest: create different branches on your repo for your different PRs, because you erased the content of your previous one with this one (which is fine, it's a continuation, but it's not the best practice for those who want to access your previous code with ease!).

I will test this.

@SneedwareInc
Contributor Author

@Nexesenex https://github.com/SneedwareInc/ik_llama.cpp/tree/legacy
I've added my old build as 7z archive.

@ikawrakow
Owner

@firecoperana Do you want to look again at this PR?

@firecoperana
Collaborator

Yes, I will look at it when it's ready.

@SneedwareInc
Contributor Author

@firecoperana What do you want me to change/add?

@SneedwareInc SneedwareInc reopened this Feb 19, 2026
// could be improved to support more languages
std::string string_lower(const std::string& str) {
    std::string result = str;
    for (char& c : result) {
        c = (char) std::tolower((unsigned char) c);
    }
    return result;
}
Collaborator

I would keep this. No functional change.

s = string_lower(s);
auto ban_tokens = common_tokenize(llama_get_model(ctx), s, false, true);
if (ban_tokens.size() > slot.n_buffer) {
slot.n_buffer = ban_tokens.size();
Collaborator

Why use the length of the string over the token count? The buffer holds tokens, not each character.

auto ban_tokens = common_tokenize(llama_get_model(ctx), val, false, true);
if (ban_tokens.size() > slot.n_buffer) {
slot.n_buffer = ban_tokens.size();
// Use string length instead of token count
Collaborator

Same here.


count++;
if (!has_next) {
if (slot.stopped_limit && !slot.stopped_eos && !slot.stopped_word) {
Collaborator

What does this do?

slot.token_buffer.resize(n_keep_buffer);

// Adjust decoded count
slot.n_decoded -= n_rewind;
Collaborator
@firecoperana firecoperana Feb 19, 2026

Don't change slot.n_decoded. This will make prompt processing and token generation time and speed calculation incorrect.

n_rewind = check_ban_phrase(slot);
}
// if found string in the ban
if (n_rewind > 0 && (slot.rewind_count < 20 || slot.rewind_count <= 2 * slot.ban_phrases.size())) {
Collaborator

Need some kind of logic to limit the number of times to rewind.

generated_token_probs.clear();

positional_bans.clear();
ban_phrases.clear();
Collaborator

Put them in server_slot::reset()

// Check if we have specific bans for this exact position (slot.n_past)
// Note: slot.n_past is the index of the token we are about to generate.
auto pos_ban_it = slot.positional_bans.find(slot.n_past);
std::vector<llama_token> temp_banned;
Collaborator

This code and the code below could be moved inside rewind_context as it's currently done. Use slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias; to adjust logit_bias.

@SneedwareInc
Contributor Author

I would keep this. No functional change.

Okay

Why use the length of the string over the token count? The buffer holds tokens, not each character.
Same here.

Edge cases like when ALLCAPS gets tokenized as A L L C A P S (7 tokens = string length) while in lowercase it gets tokenized as all caps (2 tokens). Plus it's better for automatic buffer size estimation for regexes.
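The sizing rule argued for here can be sketched as follows (hypothetical helper names; the PR's actual sizing code lives in the server slot setup):

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// ASCII-only lowercasing, mirroring the PR's string_lower()
std::string string_lower(const std::string& str) {
    std::string result = str;
    for (char& c : result) c = (char) std::tolower((unsigned char) c);
    return result;
}

// Buffer size = longest banned string length + 1, independent of tokenization.
// For byte-level tokenizers a string's token count cannot exceed its byte
// count, so this bound also covers pathological casings like "ALLCAPS"
// splitting into far more tokens than "allcaps".
size_t auto_buffer_size(const std::vector<std::string>& banned) {
    size_t longest = 0;
    for (const auto& s : banned) longest = std::max(longest, s.size());
    return longest + 1;
}
```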

What does this do?

This specific code block is strictly necessary to prevent valid tokens from being silently discarded when a generation reaches its maximum token limit. Because the server buffers tokens to check for banned phrases, several safe, generated tokens are often waiting in the queue. If this continue statement is removed, hitting the token limit will trigger an immediate break, instantly destroying the buffer and closing the connection. Consequently, every response that hits the token limit will have its final words abruptly cut off before reaching the user. The continue simply allows the loop to finish flushing the already-approved tokens to the client before cleanly releasing the slot.
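Roughly, the flushing behaviour described above looks like this (all names here are illustrative, not the PR's actual API):

```cpp
#include <deque>
#include <string>
#include <vector>

struct Slot {
    std::deque<std::string> token_buffer; // approved tokens awaiting ban checks
    bool stopped_limit = false;           // n_predict was reached
};

// When the token limit is hit, the loop continues (rather than breaking
// immediately) until the buffered, already-approved tokens have been
// flushed to the client, so the end of the response is not cut off.
std::vector<std::string> flush_on_stop(Slot& slot) {
    std::vector<std::string> sent;
    while (!slot.token_buffer.empty()) {
        sent.push_back(slot.token_buffer.front());
        slot.token_buffer.pop_front();
    }
    return sent;
}
```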

Don't change slot.n_decoded. This will make prompt processing and token generation time and speed calculation incorrect.

Do you have any suggestions for an elegant solution that ensures n_predict is the number of tokens you actually get as output, instead of n_decoded minus discarded tokens?

Need some kind of logic to limit the number of times to rewind.

Why set an arbitrary limit? With regex there are many more banned combinations possible per item than with strings.

Put them in server_slot::reset()

They are already there?

this line to remove kv cache is not needed

It is needed. Without it the program does not function correctly. I know you don't test your code properly, so let me demonstrate:
Mistral Nemo Q6_K, temperature=0, "banned_strings": ["a","b","c","d","e","f","g","h","i","j","k","l","m","n","o","p","q","r","t","u","v","w","x","y","z"],
Prompt:

[INST]Pick a random letter.[/INST]Sure, the random letter I've picked is "

Without it: " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " " and it keeps going

With it: S".</s>

This code and the code below could be moved inside rewind_context as it's currently done. Use slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias; to adjust logit_bias.

Okay

@firecoperana
Collaborator

  1. Tokenize twice: once with the original strings and once with the lower-cased ones. Take the longest size as the buffer. The way you are doing it, you create unnecessary delays. For regex purposes, you can just set your own buffer size. There is no need to use the string length.
  2. For slot.n_decoded, let's keep this monotonic. Create a new variable for this and n_output so that it serves the purpose you want. When I use n_predict, I want to limit how many tokens the LLM has calculated. If for some reason all the tokens are banned, I will wait forever for it to stop with your change.
  3. For the rewind limit, the default behavior should have a limit. This is a reasonable limit that works for me. If you don't think that's enough, you can add a payload to override it. For regex bans, you always want to set your own limit.

@SneedwareInc
Contributor Author

Tokenize twice: once with the original strings and once with the lower-cased ones. Take the longest size as the buffer.

That's stupid and does not catch edge cases. Let's keep it simple and reliable: longest string/regex length + 1.

The way you are doing it, you create unnecessary delays.

What delays? Care to demonstrate?

For slot.n_decoded, let's keep this monotonic. Create a new variable for this and n_output so that it serves the purpose you want. When I use n_predict, I want to limit how many tokens the LLM has calculated. If for some reason all the tokens are banned, I will wait forever for it to stop with your change.

Okay

This is a reasonable limit that works for me.

Okay, I'll set the default limit to a reasonable amount (512) that works for me, someone who uses this functionality a lot, and add an option to set it to whatever you want if you are so afraid it will get stuck.

@ikawrakow
Owner

@SneedwareInc I would appreciate it if you were slightly more respectful in your responses to @firecoperana. Thank you.

@SneedwareInc
Contributor Author

@ikawrakow How am I disrespectful?

@Lissanro

Lissanro commented Feb 20, 2026

@SneedwareInc I think he is referring to using phrases like "That's stupid and does not catch edge cases" instead of just "does not catch edge cases", or "I know you don't test your code properly, so let me demonstrate" instead of just "let me demonstrate". Imagine phrases like that being directed at you when you originally missed a lot of use cases. Anyway, I appreciate your work, but firecoperana is also putting in a lot of effort... and not just here; he has done an enormous amount of work and contributions. For what it's worth, I also put a lot of work into testing the previous patches, but I did not catch the edge cases you have mentioned...

My point is, not catching or missing edge cases does not mean not testing properly or being stupid; implementing features is just a lot of work. Your previous patch was also missing a lot of cases, so it is really hard to take everything into account. This is why discussing and testing things together always helps: for non-trivial features that cover a massive number of use cases, it is not really possible for a single person to think of them all. A suggested optimization, or one you think of yourself, may or may not cover all possible cases... So my two cents: if you think something wrong is being suggested, just explain why and what issues it would cause; no need for negative phrases.

@SneedwareInc
Contributor Author

Imagine phrases like that being directed at you when you originally missed a lot of use cases.

I would not care, but that's me; I know that LLMs can make mistakes in code, and I am prepared for insults. But I'll soften my language in future interactions, thanks for pointing it out.

@SneedwareInc
Contributor Author

@firecoperana I've updated the code. Is this what you wanted?
I added 2 new arguments:

  • saturate_predict - establishes how n_predict is treated. If set to true, n_predict is treated as the count of actual output tokens; if set to false, n_predict is the number of predicted tokens, regardless of whether they were flushed out or not. Defaults to false.
  • rewind_count_max - specifies the maximum rewind count. -1 uses an automatic limit: 2 × the total number of bans, or 20, whichever is greater. If set to 0, rewinds are unlimited. Any other positive number is used as the limit. Defaults to -1.
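Under these semantics, the effective rewind limit could be computed along these lines (a sketch based on the description above; the helper name is hypothetical):

```cpp
#include <algorithm>
#include <climits>
#include <cstddef>

// rewind_count_max: -1 = automatic, 0 = unlimited, >0 = explicit cap
int effective_rewind_limit(int rewind_count_max, std::size_t total_bans) {
    if (rewind_count_max == -1)
        return std::max(2 * (int) total_bans, 20); // automatic: 2*bans or 20
    if (rewind_count_max == 0)
        return INT_MAX; // effectively unlimited
    return rewind_count_max;
}
```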

@firecoperana
Collaborator

firecoperana commented Feb 21, 2026

Yes, that works. One last thing is to revert the change of the buffer size for ban strings or change to what I suggested.

@SneedwareInc
Contributor Author

One last thing is to revert the change of the buffer size for ban strings or change to what I suggested.

@firecoperana Here I have to disagree. My position comes from noticing how aggressively some LLMs have tried to bypass the banned strings. I had "core guidelines" banned back when I used the tokenization method for buffer size determination, but the model sneakily bypassed it by writing CORE GUIDELINES, which exceeded the buffer and did not trigger a rewind. Having to guess which variation leads to maximal tokenization, as you are suggesting, is impossible. I am not suggesting string length + 1 without a reason.

Other than that, you seem not to understand that there is no significant speed penalty when buffering, just visual latency in the stream. Let me prove my point with data:

| Attempt | banbuffer_size 4 | banbuffer_size 20 |
|--------:|-----------------:|------------------:|
| 1       | 2338             | 2321              |
| 2       | 2381             | 2322              |
| 3       | 2346             | 2410              |
| 4       | 2492             | 2317              |
| 5       | 2466             | 2456              |
| 6       | 2374             | 2419              |
| 7       | 2424             | 2399              |
| 8       | 2341             | 2458              |
| 9       | 2458             | 2430              |
| 10      | 2418             | 2449              |
| AVG     | 2403.8           | 2398.1            |
| STDEV   | 55.99            | 57.19             |

Times are in ms.

Mistral Nemo, Q6_K,

"n_predict": 100,
"temperature": 0.0,
"banned_strings": ["test"],
"banbuffer_size": 4 or 20,
"saturate_predict": true,

Prompt:

[INST] Write a story about a cat. Write like a female writer with a lot of purple prose.[/INST] In

Can you explain to me why you think your approach is superior?

@Lissanro

@SneedwareInc If you can find time, could you please rebase your patch? I tried to apply https://github.com/ikawrakow/ik_llama.cpp/pull/1243.patch but it has many conflicts:

> patch -p1 < patches/1243.patch                             
patching file common/common.cpp
patching file examples/server/server-context.cpp
Hunk #6 succeeded at 3097 (offset 2 lines).
Hunk #7 succeeded at 3125 (offset 2 lines).
Hunk #8 FAILED at 3137.
Hunk #9 succeeded at 3313 (offset 4 lines).
Hunk #10 succeeded at 3454 (offset 6 lines).
1 out of 10 hunks FAILED -- saving rejects to file examples/server/server-context.cpp.rej
patching file examples/server/server-context.h
patching file examples/server/server-context.cpp
patching file examples/server/server-context.cpp
Hunk #3 succeeded at 3099 (offset 2 lines).
Hunk #4 succeeded at 3110 (offset 2 lines).
Hunk #5 FAILED at 3270.
Hunk #6 FAILED at 3287.
Hunk #7 succeeded at 3328 (offset -62 lines).
2 out of 7 hunks FAILED -- saving rejects to file examples/server/server-context.cpp.rej
patching file examples/server/server-context.h

@SneedwareInc
Contributor Author

@Lissanro For some reason mainline no longer works for me.

Same prompt as above, same settings, mainline crashed after generating just one token, no error log:

llama-server -m mistral-nemo-instruct/ggml-model-Q6_K.gguf -c 8000 --verbose --special -ngl 999

...

VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716361
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716361
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716361
INFO [              slots_idle] all slots are idle | tid="12628" timestamp=1771716361
VERB [          kv_cache_clear] clearing KV cache | tid="12628" timestamp=1771716361
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716361
INFO [      log_server_request] request | tid="9868" timestamp=1771716367 remote_addr="127.0.0.1" remote_port=63281 status=200 method="OPTIONS" path="/completion" params={}
VERB [      log_server_request] request | tid="9868" timestamp=1771716367 request="" response=""
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716367
VERB [      get_available_slot] selected slot by lru | tid="12628" timestamp=1771716367 id_slot=0 t_last=-1
======== Prompt cache: cache size: 0, n_keep: 0, n_discarded_prompt: 0, cache_ram_n_min: 0, f_keep: 0.00, cache_ram_similarity: 0.50
INFO [   launch_slot_with_task] slot is processing task | tid="12628" timestamp=1771716367 id_slot=0 id_task=0
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716367
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716367
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716367
VERB [    batch_pending_prompt] tokenizing prompt | tid="12628" timestamp=1771716367 id_slot=0 id_task=0
VERB [    batch_pending_prompt] prompt tokenized | tid="12628" timestamp=1771716367 id_slot=0 id_task=0 n_ctx=8192 n_keep=0 n_prompt_tokens=23 prompt_tokens="<s>[INST] Write a story about a cat. Write like a female writer with a lot of purple prose.[/INST] In"
======== Cache: cache_size = 0, n_past0 =  0, n_past1 =  0, n_past_prompt1 = 0,  n_past2 =  0, n_past_prompt2 =  0
INFO [    batch_pending_prompt] kv cache rm [p0, end) | tid="12628" timestamp=1771716367 id_slot=0 id_task=0 p0=0
VERB [    batch_pending_prompt] prompt processing progress | tid="12628" timestamp=1771716367 id_slot=0 n_past=23 n_ctx=8192 n_tokens=23 progress=1.0
VERB [    batch_pending_prompt] prompt done | tid="12628" timestamp=1771716367 id_slot=0 n_past=23 n_ctx=8192 n_tokens=23
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716367 n_tokens=23
VERB [            update_slots] run slots completed | tid="12628" timestamp=1771716367
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716367
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716367
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716367
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716367
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716367
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716367 n_tokens=1
VERB [            update_slots] run slots completed | tid="12628" timestamp=1771716367
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716367
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716368
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716368
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716368
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716368
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716368 n_tokens=1
VERB [            update_slots] run slots completed | tid="12628" timestamp=1771716368
VERB [              start_loop] wait for new task | tid="12628" timestamp=1771716368
VERB [              start_loop] new task may arrive | tid="12628" timestamp=1771716368
VERB [              start_loop] update_multitasks | tid="12628" timestamp=1771716368
VERB [              start_loop] callback_update_slots | tid="12628" timestamp=1771716368
VERB [            update_slots] posting NEXT_RESPONSE | tid="12628" timestamp=1771716368
VERB [            update_slots] decoding batch | tid="12628" timestamp=1771716368 n_tokens=1
VERB [           process_token] next token | tid="12628" timestamp=1771716368 id_slot=0 id_task=0 token=1278 token_text=" the" has_next_token=true n_remain=98 n_decoded=4 stopped_eos=false stopped_word=false stopped_limit=false stopping_word=""
llama-server --version
version: 4214 (bd387a27)
built with MSVC 19.42.34433.0 for x64

@ikawrakow @firecoperana Can you please look into that?

@firecoperana
Collaborator

Does #1304 fix it?

@SneedwareInc
Contributor Author

@firecoperana It no longer crashes, but string ban doesn't work:
"banned_strings": ["e"],
Same prompt
Output:

 the resplendent tapestry of twilight, where the sun's last embers kissed the horizon, there resided a feline enchantress named Isolde. Her coat, a symphony of ebony and silver, shimmered like

@firecoperana
Collaborator

#1310 fixes it.

@SneedwareInc SneedwareInc marked this pull request as ready for review February 27, 2026 21:30
@SneedwareInc
Contributor Author

@firecoperana done!

Collaborator

@firecoperana firecoperana left a comment

This PR changes a lot of existing code, so it will take a while for me to fully review. When you use AI to write a PR, watch out for any code that was removed by the AI.

if (ban_pos >= 0 && allow_rewind) {
rewind_context(slot, ban_pos);
slot.rewind_status = true;
slot.ctx_sampling->rewind_samplers = true;
Collaborator

Why is this removed?

}
}
n++;
if (slot.banned_n > 0 && n == slot.banned_n) {
Collaborator

banned_n no longer works

LLAMA_LOG_INFO("Banned pattern detected at pos %d. Banning token %d ('%s') and rewinding.\n",
abs_pos, banned_tok, slot.token_buffer[token_idx].text_to_send.c_str());

slot.positional_bans[abs_pos].insert(banned_tok);
Collaborator

Can this be moved to rewind_context? This function should just check whether a banned string exists. If possible, can you make it return n_rewind? There is no need to make additional changes in this function except for adding regex ban detection. With n_rewind being returned, there is less change in the rewind_context function too.

@SneedwareInc
Contributor Author

@firecoperana I brought back rewind_samplers and record_samplers, restored banned_n, and moved the code to rewind_context. Is there anything else you wish me to do?

@SneedwareInc
Contributor Author

SneedwareInc commented Mar 3, 2026

slot.ctx_sampling->rewind_samplers = true; and slot.ctx_sampling->record_samplers = true; are breaking something. The bans work, but the output quality is degraded when they are present. I will remove them.

@firecoperana
Collaborator

I still see banned_n not working. There is other code that is removed too.
slot.ctx_sampling->rewind_samplers = true; and slot.ctx_sampling->record_samplers = true; are added for adaptive-p. Do you use adaptive p sampling when the output is broken?

Do you mind if I copy your code and create a clean PR? Your PR removes more code that is not related to the regex ban than the last time I checked.

@dungquixote42
Contributor

Perhaps #1359 fixes the issue. @SneedwareInc if you would test it with string/regex bans, I would love to hear the results.

@SneedwareInc
Contributor Author

I still see banned_n not working.

I'll look into that.

There is other code that is removed too.

Such as?

Do you use adaptive p sampling when the output is broken?

No, I do not use or even know what adaptive-p is. I only use temperature and TFS. It is very concerning if rewind samplers cause it to turn on or affect other samplers that should not be affected.

Do you mind if I copy your code and create a clean PR?

You're welcome to cherry-pick this or copy it exactly as-is into a clean branch.

However, please don't rewrite the implementation logic, just copy it verbatim. The last rewrite introduced bugs that weren't caught because they weren't tested against edge cases.

If the issue is just formatting or the unrelated deletions, I can clean those up myself in this PR. I'd strongly prefer we fix the current one rather than risk another untested rewrite.

Perhaps #1359 fixes the issue. @SneedwareInc if you would test it with string/regex bans, I would love to hear the results.

I'll look into that.

}
else if (penalty_prompt->is_array()) {
const auto n_tokens = penalty_prompt->size();
slot.sparams.penalty_prompt_tokens.clear();
Collaborator

Keep this.


const auto preserved_tokens = data.find("preserved_tokens");
if (preserved_tokens != data.end()) {
slot.sparams.preserved_tokens.clear();
Collaborator

Keep this.

}
const auto grammar_triggers = data.find("grammar_triggers");
if (grammar_triggers != data.end()) {
slot.sparams.grammar_triggers.clear();
Collaborator

Keep this.


slot.logit_bias = slot.sparams.logit_bias; // keep a copy to restore
slot.ban_phrases_bias = json_value(data, "banned_bias", params_base.ban_phrases_bias);
slot.banned_n = json_value(data, "banned_n", params_base.banned_n);
Collaborator

Keep this.

slot.n_past_prompt++;
slot.n_past++;
slot.do_checkpoint = false;
if (params_base.do_checkpoint && slot.n_prompt_tokens - slot.n_past_prompt == params_base.ctx_checkpoints_tolerance) {
Collaborator

Keep this.

if (slot.state != SLOT_STATE_PROCESSING || slot.i_batch < (int)i || slot.i_batch >= (int)(i + n_tokens)) {
// save checkpoint during prompt processing
if (slot.command == SLOT_COMMAND_LOAD_PROMPT) {
if (slot.do_checkpoint) {
Collaborator

Keep this.

slot.t_start_generation = ggml_time_us();
slot.t_prompt_processing = (slot.t_start_generation - slot.t_start_process_prompt) / 1e3;
metrics.on_prompt_eval(slot);
// create checkpoint after prompt processing ends
Collaborator

Keep this.

}
}

// create checkpoint during generation
Collaborator

Keep this.

@SneedwareInc SneedwareInc reopened this Mar 6, 2026
@SneedwareInc
Contributor Author

@dungquixote42 Quality degradation is still there, but feels less severe than before. Could it be that adaptive_p somehow gets auto-enabled (I never had it enabled)? Or is it the way I copied it over?

@firecoperana Fixed.

@dungquixote42
Contributor

@dungquixote42 Quality degradation is still there, but feels less severe than before. Could it be that adaptive_p somehow gets auto-enabled (I never had it enabled)? Or is it the way I copied it over?

I fetched this PR and ran it with test code. The adaptive-p sampler is working as intended as far as I can tell; that is, it is a no-op when its target is < 0.
Your frontend is not setting the target to 0, is it? It needs to be negative.

Collaborator

@firecoperana firecoperana left a comment

Besides adding back the code that was removed, also unify the slot.banned_n == 1 and != 1 cases. There is no need for a special case with slot.banned_n == 1 for positional bans, recovering, and setting logit bias.

{
for (auto result = slot.token_buffer.begin() + n_keep_buffer; result != slot.token_buffer.end(); result++) {
if (!tokens.contains(result->tok)) {
slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias;
Collaborator

Suggested change:

- slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias;
+ if (!tokens.contains(result->tok)) {
+     tokens.insert(result->tok);
+     slot.ctx_sampling->params.logit_bias[result->tok] += slot.ban_phrases_bias;
+ }

Collaborator

You can combine slot.banned_n == 1 in this as well. No need to create a new if. positional_bans is missing in the banned_n != 0 case.

continue; // sample using speculative decoding
}

// RESTORE AND APPLY POSITIONAL BANS
Collaborator

Move this inside rewind_context.

if (!slot.rewind_status) {
slot.ctx_sampling->params.logit_bias = slot.logit_bias; // restore logit bias

if (slot.banned_n != 1) {
Collaborator

What's the reason to special-case banned_n != 1?

for (size_t i = 0; i < std::min(max_probs, n_probs); i++) {
result.probs.push_back({
cur_p->data[i].id,
common_token_to_piece(ctx, cur_p->data[i].id, special),
Collaborator

keep this

for (size_t i = 0; i < std::min(n_vocab, n_probs); i++) {
result.probs.push_back({
cur[i].id,
common_token_to_piece(ctx, cur[i].id, special),
Collaborator

keep this

}

slot.ctx_sampling->n_rewind = sent_results ? -1 : n_rewind;
if (slot.sparams.adaptive_target >= 0.0f) {
Contributor
@dungquixote42 dungquixote42 Mar 8, 2026

I am not sure if this check (and others elsewhere) is the right solution to the sampler running when it is not supposed to. I cannot reproduce this, so would you print adapt_p_ctx->target from llama_sample_adaptive_p_impl() and show us what it says? Preferably before and after the enable check.

@SneedwareInc
Contributor Author

Quality is degraded right now; I'm not sure which of the changes caused it.

@SneedwareInc
Contributor Author

Should be fixed. Moving // RESTORE AND APPLY POSITIONAL BANS to rewind_context is not possible due to quality degradation.

Collaborator

@firecoperana firecoperana left a comment

Not sure about the degradation due to adaptive p, but if there is, it can be fixed later.

@ikawrakow ikawrakow merged commit 4a24759 into ikawrakow:main Mar 11, 2026