Fix chat parser regressions: inference crashes/frozen; output backtracked#20660
Fix chat parser regressions: inference crashes/frozen; output backtracked#20660jpohhhh wants to merge 1 commit intoggml-org:masterfrom
Conversation
Which version are you testing this against? To my knowledge, this was fixed in #20191. It now applied the partial streaming path at all times. |
This is a PR against master, diff seems to be baseline'd against it.
This is a completely different codepath, see diff. This comment was accurate and another way of putting the diff/my commentary, if it helps: #20191 (comment) |
|
Yes, I saw the diff. The Now back to the failures that can occur: mismatch in output. The parsers are crafted to be forgiving when parsing content, at least until tool calling or structured outputs are required. At those points, the underlying grammar sampler is responsible for ensuring the model output remains parseable. If there is a mismatch between the grammar sampler and the parser, then you may reach a failure, which is something we absolutely don't want. In fact, we see this often when the model produces gibberish, which is usually a symptom of a generation issue, not a parsing one. Nonetheless, I am amenable to approving the change in the
These have since been resolved in #20191. I cannot reproduce any errors with
The author of that comment has since mentioned the PR resolved their issue. |
|
TL;DR: The I think we may be talking past each other on "which version." You mean what commit/binary, I mean the tests are in the PR and they exercise current master. The PR is based on HEAD. I am testing using the llama-server binary and my own API client. Point by point:
#20191 changed the PEG parse context to always use lenient mode. It did not touch the guard: Lenient mode affects how the PEG engine parses. The guard affects what happens after parsing fails. These are different code, different lines, different concerns. #20191 reduced how often
The
No. Lenient mode means the PEG engine internally produces
For
Agreed, and the fix is to not throw away the entire AST when it happens. The reasoning parsed. The tool name parsed. One argument value has an extra
An extra
See test commands above.
@trshimizu's first response to #20191 was: "a quick check on my side didn't eliminate the error." Their analysis: "The root problem seems to be that there is no error recovery at the common_chat_parse() level for PEG formats. When is_partial=false and the final token cuts off mid-UTF-8 character, the parse fails with no fallback." That is exactly our bug. They later confirmed the latest commit worked for their specific case, Qwen3.5-397B-A17B in non-reasoning mode. A 397B model produces cleaner output than a 0.8B. The guard didn't go away; their model just stopped triggering it.
The grammar constraining generation and the parser extracting structure from already-generated output are two different things. The trigger fires, the grammar constrains, and the model still produces |
|
I cannot reproduce the issue I reported in #18675 (comment) even without this PR. As far as I know it was patched. There's a different regression where qwen3 used to work with enums, but doesn't anymore, but other than that 503 errors were fixed. |
This is a rare(?) blessed case where we can trace directly back to code directly from your report instead of vibereproing :) Your report logs "Failed to parse input at pos...", the only place that appears in the codebase is at this callsite. |
|
Let me remind you about the Contribution Policy:
|
Which part of this is AI written for me? I'm really disheartened by the reaction to this. There's a fatal crash, at HEAD. It's obvious at its face what causes it. It's a one line fix. The feedback so far consists of one maintainer saying it feels spiritually similar to another commit they made recently, so I must not be testing at HEAD, but maybe they'll be amenable to letting it in if we can prove it happens. The other is another maintainer generally waving at some comment I'm written must be AI, and I'm uncertain how to respond because the Markdown is pretty damn good, and I did make my AI run the CLI commands and give me back Markdown, so technically some portion of it is AI? So what do we do then? In any case, we continue crashing, at HEAD, and there's tests. I'm open to a fun back and forth with judges on style points for the PR and my comments, but I think it'd be more fun for everyone if we do that after fixing the crash. |
|
It would really help if instead of producing hallucinated reports and hallucinated fixes you'd actually produce a verifiable, reproducable issue. Server running master with Qwen3.5 finetune: ilintar@LinuksowaJaskinia:/devel/alt/tests$ curl -d '{"messages":[{"role":"user","content":"hello"}],"max_tokens":1}' http://localhost:2345/v1/chat/completions
{"choices":[{"finish_reason":"length","index":0,"message":{"role":"assistant","content":"","reasoning_content":"Thinking"}}],"created":1773755425,"model":"local","system_fingerprint":"b8367-a21d219a0","object":"chat.completion","usage":{"completion_tokens":1,"prompt_tokens":11,"total_tokens":12},"id":"chatcmpl-gDcPAg6k4Oh63NanAOVbfzORLxDnpEEv","timings":{"cache_n":0,"prompt_n":11,"prompt_ms":140.046,"prompt_per_token_ms":12.731454545454545,"prompt_per_second":78.54562072461906,"predicted_n":1,"predicted_ms":0.001,"predicted_per_token_ms":0.001,"predicted_per_second":1000000.0}} |
I'm not hallucinating anything. You have tests on the PR. They don't disappear because a one-off inference attempt doesn't repro the error because backtracking didn't occur because max tokens = 1 and there wasn't partial thinking/tools. (relevant test: https://github.com/ggml-org/llama.cpp/pull/20660/changes#diff-1c7deadc6d883e0ace74903fdc5d9f93cd403aee331139e18a49ea30a20b84e8R1927) I just spent 3 days on root causing and fixing this forward. This sort of behavior towards a contributor is novel to me in my 37 years on earth. I feel for you guys, dilletantes with AI must be absolutely swarming the repo. That being said, this is such a simple blindingly obvious error that we need to get it fixed and make sure the contributor (me) is put in their place later. The server/cli do call the method with (is_partial=false) when done inferencing. There is, in fact, a public method with is_partial as an argument name, at HEAD, even after the changes aldehir made. The server, CLI, do, and API callers will, pass is_partial=false when inference is done. The method does have a special case where it throws and backtracks completely only when is_partial=false. |
|
I asked for a reproducible example. You provided an example, saying it crashes "on any thinking model". I ran it 10 times just to rule out some random seed stuff on two thinking models now, zero errors. You're claiming it's not a problem because you reproduced it so it surely must still be a problem. Maybe let's get back to this when you can actually provide an example that can reproduce the problem on a live build on the newest master. |
There's tests in the PR with exact output from the model that induces this. I don't mean to be curt when I say that. The longer version is the one above that's tl;dr: "see tests and here's the CLI command to run tests, revert, run tests". It is AI-inflected one, i.e. you correctly pointed out the markdown-ified CLI commands were AI-generated. If the too-perfect sed and markdown threw you off, I don't blame you for not running them. But, please, look at test #3: it's exactly what you're looking for. Cut-off generation leads to exception => in server/cli context, incomplete inference / crash.
You're right. That example was WAY too simplified. I even want to go further, let's blame it on my morning coffee: that example is dumb and it was dumb of me to throw it out there. A global statement that any reasoning model with max_tokens=1 is doomed to repro was ridiculous. I should have let the tests speak for themselves.
You and I are on the same page here: repro'ing via a non-deterministic process is difficult and not helpful, tests would be better.
Maybe I am hallucinating, I'm definitely missing something: A) this diff is against master B) this PR includes tests with the malformed output I'd get at runtime from the model that triggers the error in the code. Generally, I'm making fun of myself because I'm hoping that makes you see I'm not talking down to you. Then, I'm hoping you think "gee, this guy sure is confident and willing to argue everything but maybe it's worth me considering A) stop claiming it's made up B) being nice to him, he's been through days investigating, root-causing this, and patching it C) look at the code and tests in the PR and get specific request on why the repros provided, and tests added for them, aren't sufficient" It's really, really, scary to have a simple "server calls method(false). when false, method throws on any malformed input and throws away AST", sitting around and even a fix forward with tests, the kindest approach, be treated as hostile. Let's make it less scary. |
|
Tests which focus on impossible input are not really valid tests though - the way grammar works is it constraints the model from outputting incorrect JSON. If that's broken, then that needs to be fixed - but for that we need a real-life example, not tests. That's why the reproduction code needs to be a query to a live model, not a test case that might or might not correspond to a real scenario. |
Can you help me out here? I'm being dead honest with the following, not talking back:
Don't really know what to do or give. We seem kinda stuck at "the jpohhhh dude may be a vibecoding angry teenager, so we can't really trust when he says the test cases he gives are real output from a real model", which, fair enough man. I'm just an anonymous avatar on the internet. But then I'm kinda stuck. I can't force you to trust it, it's fair you don't, I'm an anonymous avatar on the internet, etc. etc. I can't force you to just consider the code in the abstract. It's fair that this it looks like the last line of defense against bad outputs, instead of the thing that'll cause crashes on malformed outputs. Something that might be fruitful is getting 1:1 with aldehir: the refactoring he did is spiritually aligned with "this needs to be is_lenient instead of is_partial", it's just, there was one call site left and it happened to be the public call site, and we've all been calling it with is_partial = false on the last inference. (also worth noting, thought I had when falling asleep last night: aldehir's changes would increase the frequency of this, more outputs are "okay" now in the is_partial == is_lenient == true case, then on that last call with is_partial = false, now they're not okay) I don't want to spam you with more random outputs from models that you can't trust anyway because I'm an an anonymous avatar wielding AI, etc. Actually....maybe what I was doing was right...just pulling the 2 issues I found with the exception log, and output, and adding tests. Then its sort of like you're getting repros from people who are verifiably not me. What do you think? |
|
@jpohhhh There was an error with max_token handling earlier. It was fixed quite a while ago (see @aldehir 's first comment about #20191 ). Since then, there should not be any errors related to it. There is another lingering error that we know of regarding tool calls in reasoning blocks - occasionally in some cases some thinking models will do a tool call within reasoning blocks and then the problem is that the grammar sampler kicks in, causing an I don't know what to tell you other than "if you did reproduce it before, but can't reproduce it now, then maybe it was just fixed in the meantime" - unless you can provide a reproduction scenario. But note that I've been using Qwen3.5 pretty extensively these past few days, ran a few entire benchmarks on it and have yet to see this problem occur. |
Narrow, shortest version, I'm afraid it'll still sound combative but on my side it's "being concise and clear" not "snarky", and I don't want to go to AI to whine I'm scared it'll sound snarky and reword it: I have repro'd. The exact repro is in the tests. The exact output from me repo'ing with a binary is in the tests. Additionally, I am testing at HEAD. I am editing at HEAD. We can set any other idea aside. I'm not spending 3 days root causing, fixing, then arguing with credentialed people telling me in public that, inter alia, I'm hallucinating and that my fix isn't on master because it's the exact same edit they made recently. Trust me, I would have given up by now. This really sucks. I am having an awful time. I have never, ever, had a figure in authority do "but it does work on my machine" when faced with a straightforward code edit with unit tests. In my experience, that's how you make authority figures upset. So I have 0 idea how to interact and it is very uncomfortable for me. Longer version: Same disclaimer as above re: I'm not trying to be combative, at all, just spell out the technical stuff so we can figure out where our disconnect. And, again, no AI. I just like Markdown :) Short version of longer version: my guess is the There's been so many claims re: whether this diff is even possible on master or if I'm lying via vibecoding/vibecommenting, that I think it'll be fruitful to re-lay this out as simple premises / assertions. Premises
Theorem
(n.b. I came back and hand-edited all the source links to have |
Just as a third party here, I have to say that most of these responses boil down to "there shouldn't be" and "It works for me". I have had, in other repos, defensive maintainers dismiss PRs and it does not feel good. As a developer, I'm also well aware not all PRs are valid ones. Is there another maintainer who can step in, run these tests, and decide or ask for more info? To show I'm not taking sides here:
I think at this point I think you're probably blowing off steam, but a few things does feel like a combative AI:
I get it, really. I've been there. Advice: I treat PRs and bug reports like I treat emails to my boss - stay short and to the point, and assume they won't read half of it. More text just means more opportunity for misunderstanding. At a certain point, be willing to move on. (Example: just recently I was told that a live feature does not need to be documented. #20384 This is very counter to my (long) experience as a developer.) |
Respectfully, using backticks and #'d lists isn't something worth passing judgement on. I'm a designer by nature, engineer by accident, I'm not making stuff unreadable by my standards because AI uses Markdown too.
You bet I'm invested. :) This is my PR, if no one responds when the maintainers have technical questions, it doesn't get in. Full stop.
This isn't a bug report, this is a PR. I'd totally understand if I was filing an issue. Generally, you're right about over-verbosity. Shouldn't do it. It's a fail-over strategy I use when the only other option is getting in the dirt with people. ex. "you are hallucinating issues and fixes" is concisely responded to with "you are lying", which won't help anything. Being ultra-very-clear about what's on your mind has the opposite effect. It signals "I don't think you're incompetent, I do think this is important, and I have enough knowledge that you can't file me under hallucinating"
Appreciate the advice, understand where it's coming from, doesn't apply here. The short version - unit tests + fix + concise explanation - got me a polite version of "this PR can't be on master because the edit was made already" and "this doesn't fix anything, but maybe I'll be amenable to it if you prove it does" and "you're hallucinating." On a PR with unit tests and two lines of changes. And note no one has given an inch on any of those. We're still stuck at "this can't happen and we can't talk about it until you provide proof it does". On a commit with unit tests, with a runtime repro, and a one-line command that shows they fail without the change and pass with the change. On the most serious type of bug: a method that is called on every inference, throws an exception, on unexceptional behavior. Less was never going to help, unless I just didn't care at all & gave up completely. |
Well, I did start off by asking for another opinion. And see the edit about docs at the bottom if you missed it, at a certain point there's just good reason to say: "Not my monkeys, not my circus." You can't always argue a PR into master. |
I'm not "arguing" into master. I'm responding to "this was fixed" "you are hallucinating" "I need a repro from runtime" with answers and links to code. There's a massive difference between "I don't want your docs" and "We introduced an exception during unexceptional behavior that is a massive regression and short circuits inference on our own tools, much less API callers. Btw ,you're hallucinating the issue, and the fix, and in fact are so incompetent that I don't trust the GitHub UI that says there's an actual change here, because I made the same change a week ago, there's no way the code you're mentioning is in the current codebase" You're right that the first isn't worth discussing past "no". Maybe this isn't worth it, either. Eventually, the server maintainer gets tired of getting bug reports and searches for the log text, finds the exception, sees peg, and puts 2 + 2 together, right? And what do I care what strangers on the internet say? Idk man. I just want the bug fixed, and I'm willing to answer what they ask. I feel for these dudes because you only react this way when you're burned out and it looks like they've had a two week marathon of fix forwards. |
|
All right, I'l bite:
This is false.
common_peg_parse_flags flags = COMMON_PEG_PARSE_FLAG_LENIENT
llama.cpp/common/peg-parser.cpp Line 352 in cf23ee2
This is exactly what #20191 was about, which we've tried to tell you with @aldehir now quite a few times. |
That's why we've both looked at it with aldehir and we've both reached the same conclusion. |
|
I'll adopt your framing. Premise 7 says Your claim is that // peg-parser.cpp, literal parser
if (pos >= ctx.input.size()) { // line 350: ran out of input
if (!ctx.is_lenient()) {
return RESULT_FAIL; // line 353 — end-of-input, guarded
}
return RESULT_NEED_MORE_INPUT; // line 355 — this is what LENIENT does
}
if (ctx.input[pos] != p.literal[i]) { // line 357 — character doesnt match
return RESULT_FAIL;
}Same function, six lines apart. The test input isn't truncated. It's malformed, an extra (I went through the whole file, On unpatched master: |
To be fair, that doesn't exclude their claim a maintainer that will run tests would be handy :) |
|
Closing in favor of a new #20708. spent a couple hours wading through the parser code and writing tests and debugging test infra. it's definitely built-in that the parser will fail sometimes. I think the |
The new chat parser throws
std::runtime_erroronresult.fail()during final parse. If the PEG engine returnsRESULT_FAILfor any reason, inference crashes. When repro'd with llama.cpp API, ex. by llama-server, this looks like a log withFailed to parse input at pos $X. With llama-server specifically, it looks like nofinish_reason, HTTP 500. Here's every report.The current approach to this is fixing forward: patch the parser to return
RESULT_FAILless often, one class of input at a time, with the implicit assumption that someday no input will ever reachRESULT_FAIL. That's not going to happen. There are 14RESULT_FAILreturn sites inpeg-parser.cppthat fire on byte mismatches regardless of LENIENT, and routine agentic workflows hit them: multi-tool-call responses, trailing bytes after</tool_call>, models tacking on text after a tool call.#20191 got the right idea: don't crash on incomplete parses, extract what you can. It made the parser report
NEED_MORE_INPUTinstead ofFAILat end-of-input, which falls through to AST extraction. But it only covers end-of-input.Fix 1: Extend #20191's precedent to mismatches. If the parser did useful work (
result.end > 0), return the AST instead of throwing, same as it already does forNEED_MORE_INPUT. Remove theis_partialguard. One line.Fix 2:
until(value_suffix)instead ofschema(json())for TAG_WITH_TAGGED arg capture. Strictjson()validation on already-generated output causes PEG backtracking that wipes the entire AST (reasoning, tool names, all of it) over a single malformed character. Grammar constraints still enforce the schema during generation.Tests crash on unpatched master, pass with fix: