
Fix chat parser regressions: inference crashes/frozen; output backtracked#20660

Closed
jpohhhh wants to merge 1 commit into ggml-org:master from jpohhhh:master

Conversation

@jpohhhh
Contributor

@jpohhhh jpohhhh commented Mar 17, 2026

The new chat parser throws std::runtime_error on result.fail() during the final parse. If the PEG engine returns RESULT_FAIL for any reason, inference crashes. When reproduced via the llama.cpp API, e.g. through llama-server, this shows up as a log line reading Failed to parse input at pos $X. With llama-server specifically, the client sees no finish_reason and an HTTP 500. Here's every report.

The current approach to this is fixing forward: patch the parser to return RESULT_FAIL less often, one class of input at a time, with the implicit assumption that someday no input will ever reach RESULT_FAIL. That's not going to happen. There are 14 RESULT_FAIL return sites in peg-parser.cpp that fire on byte mismatches regardless of LENIENT, and routine agentic workflows hit them: multi-tool-call responses, trailing bytes after </tool_call>, models tacking on text after a tool call.

#20191 got the right idea: don't crash on incomplete parses, extract what you can. It made the parser report NEED_MORE_INPUT instead of FAIL at end-of-input, which falls through to AST extraction. But it only covers end-of-input.

Fix 1: Extend #20191's precedent to mismatches. If the parser did useful work (result.end > 0), return the AST instead of throwing, same as it already does for NEED_MORE_INPUT. Remove the is_partial guard. One line.
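To make Fix 1 concrete, here is a toy model of the guard logic. The names (parse_result, finalize_before, finalize_after) are hypothetical simplifications for illustration, not the real code in common/chat.cpp:

```cpp
#include <stdexcept>
#include <string>

// Toy model of the final-parse guard; names are invented for this sketch.
struct parse_result {
    bool   failed;  // did the PEG engine report RESULT_FAIL?
    size_t end;     // bytes consumed before the parser stopped
};

// Before the fix: only partial (streaming) parses fall back to AST extraction.
std::string finalize_before(const parse_result & r, bool is_partial) {
    if (r.failed) {
        if (is_partial && r.end > 0) {
            return "extract-ast";  // salvage what parsed
        }
        throw std::runtime_error("Failed to parse input at pos " + std::to_string(r.end));
    }
    return "full-parse";
}

// After the fix: any parse that made progress falls back to AST extraction,
// regardless of is_partial.
std::string finalize_after(const parse_result & r, bool is_partial) {
    (void) is_partial;
    if (r.failed) {
        if (r.end > 0) {
            return "extract-ast";
        }
        throw std::runtime_error("Failed to parse input at pos 0");
    }
    return "full-parse";
}
```

The only behavioral change is on the final call (is_partial=false) when the parser did useful work: the old guard throws, the new one extracts.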

Fix 2: until(value_suffix) instead of schema(json()) for TAG_WITH_TAGGED arg capture. Strict json() validation on already-generated output causes PEG backtracking that wipes the entire AST (reasoning, tool names, all of it) over a single malformed character. Grammar constraints still enforce the schema during generation.
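A toy model of why Fix 2 matters: in a PEG, a sequence that fails mid-way discards everything it matched, while a lenient capture-until rule keeps the surrounding structure. This is a deliberately crude stand-in (looks_like_json is not the real schema(json()) validator, and these functions do not exist in peg-parser.cpp):

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Crude stand-in for strict schema(json()) validation of an argument value.
bool looks_like_json(const std::string & s) {
    return !s.empty() && s.front() == '{' && s.back() == '}' &&
           std::count(s.begin(), s.end(), '{') == std::count(s.begin(), s.end(), '}');
}

// Strict behavior: if the args don't validate, the whole tool-call rule fails
// and the PEG backtracks, wiping everything it had matched.
std::vector<std::string> parse_strict(const std::string & name, const std::string & args) {
    if (!looks_like_json(args)) {
        return {};  // backtrack: name, reasoning, everything is gone
    }
    return {name, args};
}

// Lenient behavior (until(value_suffix)): capture the raw bytes up to the
// closing tag and keep the structure, even if the value is slightly malformed.
std::vector<std::string> parse_lenient(const std::string & name, const std::string & args) {
    return {name, args};
}
```

With an extra closing brace in the args, the strict path yields nothing while the lenient path yields a tool call with slightly messy args.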

Tests crash on unpatched master, pass with fix:

cmake -B build_test -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build_test --target test-chat
./build_test/bin/test-chat
# std::runtime_error: Failed to parse input at pos 34: <tool_call>

@github-actions github-actions bot added the testing Everything test related label Mar 17, 2026
@aldehir
Contributor

aldehir commented Mar 17, 2026

common_chat_peg_parse gates the AST fallback on is_partial. During streaming, every partial parse that can't fully match the grammar falls back to AST extraction: reasoning streams, tool names appear, everything works. Then server_task_result_cmpl_final::update() calls the same function on the same accumulated text with is_partial=false, and it throws std::runtime_error instead of using the fallback.

Which version are you testing this against? To my knowledge, this was fixed in #20191, which now applies the partial streaming path at all times.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

Which version are you testing this against?

This is a PR against master; the diff seems to be baselined against it.

To my knowledge, this was fixed in #20191. It now applied the partial streaming path at all times.

This is a completely different codepath, see diff. This comment was accurate and another way of putting the diff/my commentary, if it helps: #20191 (comment)

@aldehir
Contributor

aldehir commented Mar 17, 2026

Yes, I saw the diff.

The result.fail() branch was triggered when EOG is reached prematurely. #20191 adjusts the meaning of failure to only include mismatches between model output and the parser; reaching the end is treated the same as is_partial = true. In fact, the partial nomenclature in the parser was removed as it no longer made sense if it applied at all times. Instead, its functionality was renamed to lenient which is always enabled when parsing model output (even when reaching the end either naturally or via max_tokens). So the code path is the same, but it occurs during parsing and not at the end.

Now back to the failures that can occur: mismatch in output. The parsers are crafted to be forgiving when parsing content, at least until tool calling or structured outputs are required. At those points, the underlying grammar sampler is responsible for ensuring the model output remains parseable. If there is a mismatch between the grammar sampler and the parser, then you may reach a failure, which is something we absolutely don't want. In fact, we see this often when the model produces gibberish, which is usually a symptom of a generation issue, not a parsing one.

Nonetheless, I am amenable to approving the change in the result.fail() branch if you can produce a repro on master. Regarding all of the examples you listed in #18675 (comment):

Repro was available the whole time. @de-wim posted max_tokens=1 returning HTTP 500, a one-liner curl against stock llama-server/cli. @ZUIcat posted the exact error string with Qwen3-Coder. @Galunid bisected it to the refactoring commit. These were all in this thread. I count 7 total that bisected, mentioned max tokens, or mentioned the exact exception string used, and it is in the same if statement that causes the core bug.

These have since been resolved in #20191. I cannot reproduce any errors with max_tokens set, either as 1, 8, or adjusted to ensure generation ends mid tool call. Feel free to drop a curl or python script showing otherwise.

This comment was accurate and another way of putting the diff/my commentary, if it helps: #20191 (comment)

The author of that comment has since mentioned the PR resolved their issue.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

TL;DR: The is_partial && result.end > 0 guard that is causing llama.cpp's clients and own tools to crash out of inference is on master, right now. The PR includes tests that fail without the change and pass with it:

cmake -B build_test -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build_test --target test-chat
./build_test/bin/test-chat
# [chat] All tests passed!

# Revert the one-line change:
sed -i 's/if (result.end > 0)/if (is_partial \&\& result.end > 0)/' common/chat.cpp
cmake --build build_test --target test-chat
./build_test/bin/test-chat
# throws: std::runtime_error: Failed to parse input at pos 149

# Re-apply:
sed -i 's/if (is_partial \&\& result.end > 0)/if (result.end > 0)/' common/chat.cpp
cmake --build build_test --target test-chat
./build_test/bin/test-chat
# [chat] All tests passed!

I think we may be talking past each other on "which version." You mean what commit/binary, I mean the tests are in the PR and they exercise current master. The PR is based on HEAD. I am testing using the llama-server binary and my own API client.


Point by point:

To my knowledge, this was fixed in #20191.

#20191 changed the PEG parse context to always use lenient mode. It did not touch the guard:

$ git show upstream/master:common/chat.cpp | grep -A1 'result.fail()'
    if (result.fail()) {
        if (is_partial && result.end > 0) {

Lenient mode affects how the PEG engine parses. The guard affects what happens after parsing fails. These are different code, different lines, different concerns. #20191 reduced how often result.fail() triggers. It didn't change what happens when it's still reached.

In fact, the partial nomenclature in the parser was removed as it no longer made sense if it applied at all times.

The partial flag was renamed to lenient in the PEG parse context. is_partial is still a parameter of common_chat_peg_parse(), still passed by every caller, and still checked in the guard on line 1737.

So the code path is the same, but it occurs during parsing and not at the end.

No. Lenient mode means the PEG engine internally produces needs_more_input instead of hard failure when it reaches end-of-input. The if (is_partial && result.end > 0) guard runs after parsing, on result.fail(). They're separate.

the underlying grammar sampler is responsible for ensuring the model output remains parseable

For tool_choice=auto, the grammar is lazy, inactive until the trigger fires. Before the trigger, the model generates freely. For max_tokens truncation, the grammar is irrelevant, generation is just cut off. In neither case does the grammar sampler guarantee parseable output.

If there is a mismatch between the grammar sampler and the parser, then you may reach a failure, which is something we absolutely don't want.

Agreed, and the fix is to not throw away the entire AST when it happens. The reasoning parsed. The tool name parsed. One argument value has an extra }. The correct response is to preserve what parsed, not throw std::runtime_error.

In fact, we see this often when the model produces gibberish, which is usually a symptom of a generation issue, not a parsing issue.

An extra } from a 0.8B model is not gibberish. The structural markers (<tool_call>, <function=, <parameter=) are all correct. The parser's job is to extract what it can.

I cannot reproduce any errors with max_tokens set

See test commands above.

The author of that comment has since mentioned the PR resolved their issue.

@trshimizu's first response to #20191 was: "a quick check on my side didn't eliminate the error." Their analysis: "The root problem seems to be that there is no error recovery at the common_chat_parse() level for PEG formats. When is_partial=false and the final token cuts off mid-UTF-8 character, the parse fails with no fallback." That is exactly our bug. They later confirmed the latest commit worked for their specific case, Qwen3.5-397B-A17B in non-reasoning mode. A 397B model produces cleaner output than a 0.8B. The guard didn't go away; their model just stopped triggering it.

This is incorrect. With tool_choice = auto, the grammar is applied when a trigger pattern matches.

The grammar constraining generation and the parser extracting structure from already-generated output are two different things. The trigger fires, the grammar constrains, and the model still produces [{...}]}. Whether that's a trigger timing issue, a token boundary, or a small model being imprecise, the output exists and the parser has to deal with it. schema(json()) rejects it, tool_call rule fails, PEG backtracks the entire root sequence, wipes all AST nodes including reasoning and tool name. until(value_suffix) captures the raw value, structural markers are still matched strictly, caller gets a tool call with slightly messy args instead of nothing.

@Galunid
Contributor

Galunid commented Mar 17, 2026

I cannot reproduce the issue I reported in #18675 (comment) even without this PR. As far as I know it was patched. There's a different regression where qwen3 used to work with enums, but doesn't anymore, but other than that 503 errors were fixed.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

I cannot reproduce the issue I reported in #18675 (comment) even without this PR. As far as I know it was patched. There's a different regression where qwen3 used to work with enums, but doesn't anymore, but other than that 503 errors were fixed.

This is a rare(?) blessed case where we can trace directly back to code from your report instead of vibereproing :) Your report logs "Failed to parse input at pos..."; the only place that string appears in the codebase is this callsite.

@pwilkin
Contributor

pwilkin commented Mar 17, 2026

Let me remind you about the Contribution Policy:

It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

Let me remind you about the Contribution Policy:

It is strictly prohibited to use AI to write your posts for you (bug reports, feature requests, pull request descriptions, Github discussions, responding to humans, ...).

Which part of this is AI written for me?

I'm really disheartened by the reaction to this. There's a fatal crash, at HEAD. It's obvious on its face what causes it. It's a one-line fix.

The feedback so far consists of one maintainer saying it feels spiritually similar to another commit they made recently, so I must not be testing at HEAD, but maybe they'll be amenable to letting it in if we can prove it happens. The other is another maintainer vaguely gesturing that some comment I've written must be AI, and I'm uncertain how to respond because the Markdown is pretty damn good, and I did make my AI run the CLI commands and give me back Markdown, so technically some portion of it is AI? So what do we do then?

In any case, we continue crashing, at HEAD, and there's tests. I'm open to a fun back and forth with judges on style points for the PR and my comments, but I think it'd be more fun for everyone if we do that after fixing the crash.

@pwilkin
Contributor

pwilkin commented Mar 17, 2026

It would really help if instead of producing hallucinated reports and hallucinated fixes you'd actually produce a verifiable, reproducible issue.

Server running master with Qwen3.5 finetune:

ilintar@LinuksowaJaskinia:/devel/alt/tests$ curl -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":1}' http://localhost:2345/v1/chat/completions
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"","reasoning_content":"Thinking"}}],"created":1773755425,"model":"local","system_fingerprint":"b8367-a21d219a0","object":"chat.completion","usage":{"completion_tokens":1,"prompt_tokens":11,"total_tokens":12},"id":"chatcmpl-gDcPAg6k4Oh63NanAOVbfzORLxDnpEEv","timings":{"cache_n":0,"prompt_n":11,"prompt_ms":140.046,"prompt_per_token_ms":12.731454545454545,"prompt_per_second":78.54562072461906,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}}

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

It would really help if instead of producing hallucinated reports and hallucinated fixes you'd actually produce a verifiable, reproducible issue.

Server running master with Qwen3.5 finetune:

ilintar@LinuksowaJaskinia:/devel/alt/tests$ curl -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":1}' http://localhost:2345/v1/chat/completions
{"choices":[{"finish_reason":"length","index":0,"me

I'm not hallucinating anything. You have tests on the PR. They don't disappear because a one-off inference attempt doesn't repro the error because backtracking didn't occur because max tokens = 1 and there wasn't partial thinking/tools. (relevant test: https://github.com/ggml-org/llama.cpp/pull/20660/changes#diff-1c7deadc6d883e0ace74903fdc5d9f93cd403aee331139e18a49ea30a20b84e8R1927)

I just spent 3 days on root causing and fixing this forward. This sort of behavior towards a contributor is novel to me in my 37 years on earth. I feel for you guys, dilettantes with AI must be absolutely swarming the repo.

That being said, this is such a simple blindingly obvious error that we need to get it fixed and make sure the contributor (me) is put in their place later.

The server/CLI do call the method with is_partial=false when done inferencing. There is, in fact, a public method with is_partial as an argument name, at HEAD, even after the changes aldehir made. The server and CLI do, and API callers will, pass is_partial=false when inference is done. The method does have a special case where it throws and backtracks completely only when is_partial=false.

@pwilkin
Contributor

pwilkin commented Mar 17, 2026

I asked for a reproducible example. You provided an example, saying it crashes "on any thinking model". I ran it 10 times just to rule out some random seed stuff, on two thinking models now, zero errors. You're claiming that because you reproduced it at some point, it surely must still be a problem.

Maybe let's get back to this when you can actually provide an example that can reproduce the problem on a live build on the newest master.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

I asked for a reproducible example.

There's tests in the PR with exact output from the model that induces this.

I don't mean to be curt when I say that.

The longer version is the one above that's tl;dr: "see tests and here's the CLI command to run tests, revert, run tests". It is an AI-inflected one, i.e. you correctly pointed out that the markdown-ified CLI commands were AI-generated.

If the too-perfect sed and markdown threw you off, I don't blame you for not running them. But, please, look at test #3: it's exactly what you're looking for. Cut-off generation leads to exception => in server/cli context, incomplete inference / crash.

You provided an example, saying it crashes "on any thinking model"

You're right. That example was WAY too simplified. I even want to go further, let's blame it on my morning coffee: that example is dumb and it was dumb of me to throw it out there. A global statement that any reasoning model with max_tokens=1 is doomed to repro was ridiculous. I should have let the tests speak for themselves.

I ran it 10 times just to rule out some random seed stuff on two thinking models now, zero errors.

You and I are on the same page here: repro'ing via a non-deterministic process is difficult and not helpful, tests would be better.

Maybe let's get back to this when you can actually provide an example that can reproduce the problem on a live build on the newest master.

Maybe I am hallucinating, I'm definitely missing something: A) this diff is against master B) this PR includes tests with the malformed output I'd get at runtime from the model that triggers the error in the code.

Generally, I'm making fun of myself because I'm hoping that makes you see I'm not talking down to you. Then, I'm hoping you think "gee, this guy sure is confident and willing to argue everything, but maybe it's worth me considering A) stop claiming it's made up B) being nice to him, he's been through days investigating, root-causing this, and patching it C) look at the code and tests in the PR and get specific about why the repros provided, and the tests added for them, aren't sufficient"

It's really, really, scary to have a simple "server calls method(false). when false, method throws on any malformed input and throws away AST", sitting around and even a fix forward with tests, the kindest approach, be treated as hostile. Let's make it less scary.

@pwilkin
Contributor

pwilkin commented Mar 17, 2026

Tests which focus on impossible input are not really valid tests though - the way grammar works is it constrains the model from outputting incorrect JSON. If that's broken, then that needs to be fixed - but for that we need a real-life example, not tests.

That's why the reproduction code needs to be a query to a live model, not a test case that might or might not correspond to a real scenario.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

That's why the reproduction code needs to be a query to a live model, not a test case that might or might not correspond to a real scenario.

Can you help me out here? I'm being dead honest with the following, not talking back:

  • I repro'd the heck out of this thing. It's definitely not super obvious at first, you need a darn small model doing things small models couldn't do a month ago, at a relatively high temp. I found the logic error first, and even then, didn't realize how I could repro until I started accidentally using temp > 0 with the 0.8B.

  • I directly copy-pasted the output from this silly little 0.8B into tests.

  • I don't know a more constructive way to provide a repro than doing exactly that.

  • I spent the last 20-30m since my last comment pulling up issues with "Failed to parse input at pos", identifying if they had real output, and converting them to tests. I have two more. But that's the same thing you're not looking for, I suppose? More tests? But it kinda is what you're looking for? Proof the jpohhhh dude isn't just a frustrated teenager vibefixing fake errors?

Don't really know what to do or give. We seem kinda stuck at "the jpohhhh dude may be a vibecoding angry teenager, so we can't really trust when he says the test cases he gives are real output from a real model", which, fair enough man. I'm just an anonymous avatar on the internet.

But then I'm kinda stuck.

I can't force you to trust it, it's fair you don't, I'm an anonymous avatar on the internet, etc. etc.

I can't force you to just consider the code in the abstract. It's fair that it looks like the last line of defense against bad outputs, instead of the thing that'll cause crashes on malformed outputs. Something that might be fruitful is getting 1:1 with aldehir: the refactoring he did is spiritually aligned with "this needs to be is_lenient instead of is_partial"; it's just that there was one call site left, it happened to be the public call site, and we've all been calling it with is_partial = false on the last inference. (Also worth noting, a thought I had when falling asleep last night: aldehir's changes would increase the frequency of this. More outputs are "okay" now in the is_partial == is_lenient == true case; then on that last call with is_partial = false, they're not okay.)

I don't want to spam you with more random outputs from models that you can't trust anyway because I'm an anonymous avatar wielding AI, etc.

Actually... maybe what I was doing was right: just pulling the 2 issues I found with the exception log, and output, and adding tests. Then it's sort of like you're getting repros from people who are verifiably not me.

What do you think?

@pwilkin
Contributor

pwilkin commented Mar 17, 2026

@jpohhhh There was an error with max_token handling earlier. It was fixed quite a while ago (see @aldehir 's first comment about #20191 ). Since then, there should not be any errors related to it.

There is another lingering error that we know of regarding tool calls in reasoning blocks - occasionally in some cases some thinking models will do a tool call within reasoning blocks and then the problem is that the grammar sampler kicks in, causing an <eog> after the tool call finishes even though the reasoning is not even closed, so this will output a weird reasoning block that ends with something that looks like a tool call - we've got a fix for that planned, but it's pending the resolution of #20424

I don't know what to tell you other than "if you did reproduce it before, but can't reproduce it now, then maybe it was just fixed in the meantime" - unless you can provide a reproduction scenario. But note that I've been using Qwen3.5 pretty extensively these past few days, ran a few entire benchmarks on it and have yet to see this problem occur.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

unless you can provide a reproduction scenario. But note that I've been using Qwen3.5 pretty extensively these past few days, ran a few entire benchmarks on it and have yet to see this problem occur.

Narrow, shortest version, I'm afraid it'll still sound combative but on my side it's "being concise and clear" not "snarky", and I don't want to go to AI to whine I'm scared it'll sound snarky and reword it:

I have repro'd. The exact repro is in the tests. The exact output from me repro'ing with a binary is in the tests.

Additionally, I am testing at HEAD. I am editing at HEAD. We can set any other idea aside. I'm not spending 3 days root causing, fixing, then arguing with credentialed people telling me in public that, inter alia, I'm hallucinating and that my fix isn't on master because it's the exact same edit they made recently. Trust me, I would have given up by now. This really sucks. I am having an awful time. I have never, ever, had a figure in authority do "but it does work on my machine" when faced with a straightforward code edit with unit tests. In my experience, that's how you make authority figures upset. So I have 0 idea how to interact and it is very uncomfortable for me.

Longer version:

Same disclaimer as above re: I'm not trying to be combative, at all, just spelling out the technical stuff so we can figure out where our disconnect is. And, again, no AI. I just like Markdown :)

Short version of longer version: my guess is the max_tokens thing is going over y'all's heads because of the other PR taking up mindspace. Thought experiment: the pristine PEG code might get crap output because max_tokens made it end in the middle of a tool call. What happens then? Well, clients call common_chat_parse with is_partial false, common_chat_parse calls common_chat_peg_parse with is_partial false, which then throws an exception. Because the tool call can't be parsed. I understand from there, there is an argument akin to "we throw when that happens because it can't be parsed! and clients aren't catching it because [$FILL_IN_REASON / this is new behavior / they don't understand they're supposed to catch]". Let's set that aside until we agree on whether it does happen.

There's been so many claims re: whether this diff is even possible on master or if I'm lying via vibecoding/vibecommenting, that I think it'll be fruitful to re-lay this out as simple premises / assertions.
Why? At least one of them doesn't match y'all's understanding or experience. And with simple #'d assertions, we can reset & zero in on why I seem wrong. That gives us another avenue, because if I average over all your comments, you're either (A) genuinely confused as to whether the unit tests are repros on master with real output data, or (B) you don't believe they are. I respect (B) given I imagine a large influx of slop PRs, and I'm confidently assertive like an AI would be, but about something that feels 100% wrong given y'all's experience.

Premises

  1. A model's output can be constrained to end after a certain number of tokens are emitted, via a parameter called max_tokens
  2. A method exists, on master, at HEAD, in common/chat.cpp, with the signature common_chat_msg common_chat_peg_parse(const common_peg_arena & src_parser, const std::string &input, bool is_partial, const common_chat_parser_params & params) source
  3. A method exists, on master, at HEAD, in common/chat.cpp, called common_chat_parse. It has one line and calls common_chat_peg_parse source
  4. llama-server calls common_chat_parse source
  5. llama-server calls common_chat_parse with is_partial false when inference is done. source
  6. Per (3) common_chat_parse calls common_chat_peg_parse.
  7. common_chat_peg_parse throws if is_partial is false. source

Theorem

  1. If a model's output is truncated (premise 1) such that common_chat_peg_parse cannot fully match it, and the server calls it with is_partial=false (premise 5), it throws (premise 7). The server does not catch this exception. The client receives a 500 error and no finish_reason.

(n.b. I came back and hand-edited all the source links to have master in them rather than the commit-hash-at-HEAD-on-master. GitHub gives you the commit hash version if you right click -> get permalink when browsing master, but, master in the URL does work. However, the links won't be accurate if/when these files shift lines, and I can't guarantee that didn't happen between the edits I made to the URLs and the time you are reading this, dear reader)
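The premises and theorem above can be sketched as a toy call chain. This is a hypothetical simplification for illustration only; the real functions are common_chat_parse / common_chat_peg_parse in common/chat.cpp and the server's final-result path, and the names below are invented:

```cpp
#include <stdexcept>

// Premise 7 (modeled): the low-level parser throws on a final,
// non-matching parse.
void toy_peg_parse(bool matched, bool is_partial) {
    if (!matched && !is_partial) {
        throw std::runtime_error("Failed to parse input at pos ...");
    }
}

// Premises 3 and 6 (modeled): the one-line wrapper just forwards.
void toy_chat_parse(bool matched, bool is_partial) {
    toy_peg_parse(matched, is_partial);
}

// Premise 5 (modeled): the server's final path passes is_partial=false.
// The claim is that the real server has no handler, so the escaped
// exception surfaces to the client as HTTP 500; the try/catch here only
// exists to model that observed status code.
int toy_server_finalize(bool matched) {
    try {
        toy_chat_parse(matched, /*is_partial=*/false);
        return 200;
    } catch (const std::runtime_error &) {
        return 500;
    }
}
```

Under this model, a clean final parse returns 200 and a truncated, unmatchable one returns 500 with no finish_reason, which is the reported client-visible symptom.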

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Mar 17, 2026

Since then, there should not be any errors related to it.
But note that I've been using Qwen3.5 pretty extensively these past few days, ran a few entire benchmarks on it and have yet to see this problem occur.

Just as a third party here, I have to say that most of these responses boil down to "there shouldn't be" and "It works for me".

I have had, in other repos, defensive maintainers dismiss PRs and it does not feel good. As a developer, I'm also well aware not all PRs are valid ones. Is there another maintainer who can step in, run these tests, and decide or ask for more info?

To show I'm not taking sides here:

And, again, no AI. I just like Markdown :)
I respect (B) given I imagine a large influx of slop PRs and I'm confidently assertive like an AI would be,
you are reading this, dear reader

I think at this point you're probably blowing off steam, but a few things do feel like a combative AI:

  1. markdown, it's not needed
  2. very assertive, indeed. Bordering on overly invested.
  3. verbosity in a bug report, using phrases like "dear reader"

I get it, really. I've been there. Advice: I treat PRs and bug reports like I treat emails to my boss - stay short and to the point, and assume they won't read half of it. More text just means more opportunity for misunderstanding.

At a certain point, be willing to move on.

(Example: just recently I was told that a live feature does not need to be documented. #20384 This is very counter to my (long) experience as a developer.)

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

  1. markdown, it's not needed

Respectfully, using backticks and #'d lists isn't something worth passing judgement on. I'm a designer by nature, engineer by accident, I'm not making stuff unreadable by my standards because AI uses Markdown too.

  2. very assertive, indeed. Bordering on overly invested.

You bet I'm invested. :) This is my PR, if no one responds when the maintainers have technical questions, it doesn't get in. Full stop.

  3. verbosity in a bug report, using phrases like "dear reader"

This isn't a bug report, this is a PR. I'd totally understand if I was filing an issue. Generally, you're right about over-verbosity. Shouldn't do it.

It's a fail-over strategy I use when the only other option is getting in the dirt with people. ex. "you are hallucinating issues and fixes" is concisely responded to with "you are lying", which won't help anything. Being ultra-very-clear about what's on your mind has the opposite effect. It signals "I don't think you're incompetent, I do think this is important, and I have enough knowledge that you can't file me under hallucinating"

I get it, really. I've been there. Advice: I treat PRs and bug reports like I treat emails to my boss - stay short and to the point, and assume they won't read half of it. More text just means more opportunity for misunderstanding.

Appreciate the advice, understand where it's coming from, doesn't apply here. The short version - unit tests + fix + concise explanation - got me a polite version of "this PR can't be on master because the edit was made already" and "this doesn't fix anything, but maybe I'll be amenable to it if you prove it does" and "you're hallucinating." On a PR with unit tests and two lines of changes.

And note no one has given an inch on any of those. We're still stuck at "this can't happen and we can't talk about it until you provide proof it does". On a commit with unit tests, with a runtime repro, and a one-line command that shows they fail without the change and pass with the change. On the most serious type of bug: a method that is called on every inference, throws an exception, on unexceptional behavior.

Less was never going to help, unless I just didn't care at all & gave up completely.

@strawberrymelonpanda
Contributor

Appreciate the advice, understand where it's coming from, doesn't apply here

Well, I did start off by asking for another opinion.

And see the edit about docs at the bottom if you missed it, at a certain point there's just good reason to say: "Not my monkeys, not my circus." You can't always argue a PR into master.

@jpohhhh
Contributor Author

jpohhhh commented Mar 17, 2026

Appreciate the advice, understand where it's coming from, doesn't apply here

Well, I did start off by asking for another opinion.

And see the edit about docs at the bottom if you missed it, at a certain point there's just good reason to say: "Not my monkeys, not my circus." You can't always argue a PR into master.

I'm not "arguing" into master. I'm responding to "this was fixed" "you are hallucinating" "I need a repro from runtime" with answers and links to code.

There's a massive difference between "I don't want your docs" and "We introduced an exception during unexceptional behavior that is a massive regression and short-circuits inference on our own tools, much less API callers. Btw, you're hallucinating the issue, and the fix, and in fact are so incompetent that I don't trust the GitHub UI that says there's an actual change here, because I made the same change a week ago; there's no way the code you're mentioning is in the current codebase."

You're right that the first isn't worth discussing past "no".

Maybe this isn't worth it, either.

Eventually, the server maintainer gets tired of getting bug reports and searches for the log text, finds the exception, sees peg, and puts 2 + 2 together, right? And what do I care what strangers on the internet say?

Idk man. I just want the bug fixed, and I'm willing to answer what they ask. I feel for these dudes because you only react this way when you're burned out and it looks like they've had a two week marathon of fix forwards.

@pwilkin

pwilkin commented Mar 17, 2026

All right, I'll bite:

7. `common_chat_peg_parse` throws if `is_partial` is `false`. [source](https://github.com/ggml-org/llama.cpp/blob/master/common/chat.cpp#L1765)

This is false.

  1. common_chat_peg_parse sets LENIENT:
     `common_peg_parse_flags flags = COMMON_PEG_PARSE_FLAG_LENIENT`
  2. With LENIENT, the parser does not return RESULT_FAIL, but RESULT_NEED_MORE_INPUT:
     `if (!ctx.is_lenient()) {`
  3. fail() is true for the parser iff result == COMMON_PEG_PARSE_RESULT_FAIL

This is exactly what #20191 was about, which we've tried to tell you with @aldehir now quite a few times.

@pwilkin

pwilkin commented Mar 17, 2026

Just as a third party here, I have to say that most of these responses boil down to "there shouldn't be" and "It works for me".

I have had, in other repos, defensive maintainers dismiss PRs and it does not feel good. As a developer, I'm also well aware not all PRs are valid ones. Is there another maintainer who can step in, run these tests, and decide or ask for more info?

That's why we've both looked at it with aldehir and we've both reached the same conclusion.

@jpohhhh

jpohhhh commented Mar 17, 2026

I'll adopt your framing. Premise 7 says common_chat_peg_parse throws if is_partial is false. What it should say is: it throws if is_partial is false and result.fail() is true. Fair correction, that was imprecise - I thought it was clear we were discussing unparsable outputs. That's the crux. llama.cpp now crashes on unparsable outputs. Not on every output with is_partial false :) That's why max_tokens and reasoning/tool calls/agents keep coming up with the log from when it throws the exception.

Your claim is that LENIENT prevents result.fail() from ever being true. Let's walk through the file you linked together, because I think this is where the disconnect is.

// peg-parser.cpp, literal parser
if (pos >= ctx.input.size()) {          // line 350: ran out of input
    if (!ctx.is_lenient()) {                 
        return RESULT_FAIL;             // line 353 — end-of-input, guarded
    }
    return RESULT_NEED_MORE_INPUT;      // line 355 — this is what LENIENT does
}
if (ctx.input[pos] != p.literal[i]) {   // line 357 — character doesn't match
    return RESULT_FAIL;
}

Same function, six lines apart. LENIENT catches "we ran out of input." It does not catch "the next character is wrong." Those are different failure modes and only one of them was addressed by #20191.

The test input isn't truncated. It's malformed, an extra } inside a structurally valid tool call. The parser has plenty of input, it just doesn't match the grammar. That's a mismatch, not end-of-input. LENIENT doesn't help. result.fail() is true. Line 1749 is reached. Line 1770 throws.

(I went through the whole file, peg-parser.cpp has 22 RESULT_FAIL return sites. 8 check is_lenient(). These 14 don't: 1 2 3 4 5 6 7 8 9 10 11 12 13 14. Not to belabour it, just want to be thorough since we've gone back and forth a few times now.)

On unpatched master:

cmake -B build_test -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build_test --target test-chat
./build_test/bin/test-chat
# std::runtime_error: Failed to parse input at pos 34: <tool_call>

@jpohhhh

jpohhhh commented Mar 17, 2026

Just as a third party here, I have to say that most of these responses boil down to "there shouldn't be" and "It works for me".
I have had, in other repos, defensive maintainers dismiss PRs and it does not feel good. As a developer, I'm also well aware not all PRs are valid ones. Is there another maintainer who can step in, run these tests, and decide or ask for more info?

That's why we've both looked at it with aldehir and we've both reached the same conclusion.

To be fair, that doesn't exclude their point: a maintainer who will run the tests would be handy :)

@jpohhhh

jpohhhh commented Mar 18, 2026

Closing in favor of #20708. I spent a couple of hours wading through the parser code, writing tests, and debugging test infra. It's definitely built in that the parser will fail sometimes. I think the max_tokens PR being adjacent to, but not the same as, this one caused a lot of confusion on the maintainers' end, so I will refrain from mentioning it in the new PR.


Labels

testing Everything test related


5 participants