chat: new parser should not crash inference#20708
jpohhhh wants to merge 1 commit into ggml-org:master
Conversation
|
Do you have a way to reproduce this through, say, the server API? The tests provided are a bit confusing; I'm not sure I understand what you're trying to exercise. It looks like you're trying to show that if a model gave malformed JSON it would crash? But doesn't that seem appropriate? How are you meant to parse JSON that's not valid JSON? |
You nailed it: that's the core issue: how can you parse invalid JSON? When you can't, is throw-and-crash-the-server, only on the final call to parse, the right behavior? If so, was shipping a new parser a good time to introduce that? For building a perfect parser? Maybe. You'll get crash reports and can do workarounds for some cases. For API clients? No. The tests are the repro from calling the server APIs. (tl;dr: qwen 3.5 0.8B, trailing char causes invalid JSON) The chain:
Looking across the crash reports from this, the parser maintainers are valiantly in fix-forward, one-by-one mode, making the parser handle more inputs correctly so RESULT_FAIL is returned less often. That's work, and it can work one-off, e.g. they work around the most obvious case (truncation) by hand-patching it to return NEED_MORE_INPUT instead of FAIL if it would otherwise fail. But it doesn't change the fact that the parser can fail, and it's a regression to make that exceptional. |
What would the alternative be, absorb the error and pretend it didn't happen? Resulting in who knows what kind of behavior? That sounds far more dangerous to me. Unpredictable behavior, especially in tool calling, sounds like a bad idea. Crashing is the right call IMO. |
|
Hi, the old parser also crashed on invalid JSON. Are you advocating removing grammar-constrained decoding (for JSON) from llama.cpp? |
No, and no. (I'm not sure why you asked about removing grammar-constrained decoding; please feel free to follow up with any code or commentary that indicates that. I don't want to give other readers that impression either.) I tested on 34df42f (the commit before #18675). Same models, same requests, same temperatures. 200 with finish_reason every time. The throw at chat.cpp L1740 is new. Not advocating removing grammar constraints. Advocating not throwing when the parser fails on the final call, same as the streaming path already handles it, and how the API functioned before the new parser. Stock model repros in #20729 if helpful. |
The alternative is what the streaming path already does. During streaming, the exact same text is parsed with is_partial=true and the extracted content is relayed to the client.

Separately, I found repros on stock models. See #20729 for details, but the short version: Llama 3.2 3B with tools defined, model decides not to call them, HTTP 500. Cohere Command-R7B with tools, any prompt, HTTP 500. 10 of 48 bundled templates are affected. All regressed.

The thought-experiment side of this does fascinate me: what should we do? Put another way: there are 0 LLM APIs I am familiar with that, when inference is done, turn strict, roll back all output internally to a zero-length response, then return a 500, and never complete the SSE stream.

Put another, another, way: this is a clear regression from before the new parser, and I'm not sure why the new parser is placing such a heavy burden on itself, i.e. expecting that it can never fail, and that if it does it should catastrophically take down everything with it.

Put another, another, another way: the behavior change here wouldn't be that you suddenly get invalid JSON presented as valid. It's that the parser doesn't take down the whole stack, suddenly and only on the final call, so you do get your finish_reason, and there are 0 deltas to it, which is what I think you'd expect in this scenario. |
|
Thank you @aldehir: it's best, for me, to just revert that line. PR is now just the one line for |
|
See comments. Please adjust tests to use the existing |
|
How do you recommend we reconcile this with allowing bugs in the parser/grammar to go unnoticed due to being overly permissive? For example, #20650 demonstrates a bug when there are no required arguments. However, if we extract partial content and return success as this PR does, this class of bug would likely go unnoticed. This feels like an exceptional case. I'd appreciate your input on the matter. |
Ty for your help, apologies that the tests were so messy. I kept appending and experimenting, and should have cleaned up for review. This no-args case is interesting, ty for sharing; added a few things in this PR:
A no-args tool is an interesting idea, but something I've never tried. I'm not familiar with what inference APIs do in general in this case, but it seems reasonable. Perhaps separate from your question: if you're thinking about the parser and the no-args case causing issues in general, I'm going to take a look at the tool-parsing code for the no-arg case after I write this comment. The parser is predictable and has high correctness, the grammar structure is clear; it might be a straightforward addition and I am excited to challenge myself a bit. |
|
Ah, I wasn't expecting you to resolve the issue. Just wondered how we could avoid issues like that going unnoticed. A warning seems appropriate. |
|
Fixed some nits and added the template back in; it was failing on my end. Since #20424 was merged in, the tests might fail. Can you rebase on top of master to see? I believe we just need to add |
There's a lot of work to do AFAIK (10/49 builtins failing, see #20729), and I'm happy to help :) Once this gets in I'll start pitching in on those. (Didn't mean to request re-review, apologies, and I don't think I can undo it. But I'll have comments resolved in 10 min.) |
|
All set! (I shouldn't have said 10 minutes, that guaranteed it'd take longer :) |
|
I think you should adapt the tests for the changes done in #20424; seems some stuff is failing now. |
|
I'm not a fan of adding support for handling an impossible case. As of now, tool calling grammar enforcement will force an

As a general design decision, I really do not like the idea of parser errors getting silently ignored and producing only warnings. From my perspective, this generates a cascade of the following:

I definitely prefer the current way of doing things, where the parser returns a hard error. That way, whoever encounters the error knows that something is broken, files a bug report, and we get it fixed. The way you're proposing basically masks errors with a temporary recovery that doesn't really accomplish anything (since if we got a partial parse, the answer is probably broken anyway, so it's better to let the client know we've hit an error than pretend everything is going well). |
This is a compelling argument, and one I agree with. Example: #20532 recently surfaced after the change to the gpt-oss parser. The issue was a problem with how checkpoints were handled, and this was missed in the past because the old parser was really lenient: it would skip invalid sequences until it found a valid one. I know, because I wrote it. This most likely impacted workflows that forked off, but went uncaught because it was simply let through. We did emit warnings when this happened, yet no issue--that I can see--was opened. In light of this, I'm feeling less comfortable with this PR. |
|
Here's what actually happens from the client's perspective: non-streaming requests get HTTP 500 with no finish_reason; streaming requests get content deltas and then silence, with no final event and no [DONE].
Do we agree that's wrong? |
|
Hate to say it, but suppressing the error isn't the right path forward here. Perhaps if we had some sort of telemetry, but that's a whole can of worms and undoubtedly unpopular. The fact is, an error thrown at this point is exceptional and indicative of a bug. There is some churn after the autoparser implementation, but that's to be expected. The lack of stable releases makes this difficult to navigate, but we'll eventually stabilize. FWIW, @pwilkin had his branch open for at least a month and did plenty of advertising for testers. I don't see how this could have been managed differently. I appreciate your contributions that spun off from this PR and the work you put into it, even if it's not accepted. @pwilkin I think we need to expand the error to include the whole generation; this should help track the root cause of these issues. |
|
@jpohhhh Yeah, as I said above, we have a reproduction scenario that might describe the case you're mentioning - it's the "tool calling within reasoning" scenario. Since the reasoning prefill fixes have been merged, I'll start working on a fix there. I know it's frustrating when the transition happens, but the autoparser was a huge endeavor and there are bound to be bugs - I tested as much as I could, a lot of people contributed, we found some very interesting and non-trivial bugs, but in the end there are always going to be more bug reports when it goes to production. I think I've done like 10 PRs since the launch fixing various bugs; some were one-liners, some, like #20424, ended up as over 500 LoC modifications. That's just how it is. The whole idea was to have a stable system where bugs are reproducible, and I think we're getting close to that. |
|
I want to be direct about what I'm seeing from the outside. I've been working on this for four days. I am not frustrated. I can hand-patch and move on; it's one line. I am trying to help because I've been in a fix-forward swamp myself. It feels like exhilarating, honest work. But:
This is not "the autoparser has some bugs, we'll fix-forward." This is the error-handling strategy itself producing errors it can't catch, because it doesn't test its own failure modes. This is a team grinding and doing fantastic work, but too close to its own work to see the fundamental thing: this is a regression, at HEAD, introduced with a large PR, that demands the large PR immediately be stable enough that any error is exceptional. And even that's wrong, as the code below shows.

cc @ggerganov: only because the technical evidence is clear, the PR was closed on reasoning that contradicts the shipped code, and I don't think the people closest to this work have the distance to see it. They're doing incredible work and grinding through a massive transition. But the error-handling policy, "the regression that causes hard crashes is good because crashes surface bugs", is itself untested, already inconsistent with LENIENT, and the throw it defends is currently broken. What's going on here needs a second pair of eyes, and the team needs permission from someone they respect to make things easier on themselves while relieving users.

Code section:

```cpp
// GLM-4.7-Flash, final parse (is_partial=false), tools enabled.

// Case A: input truncated mid-word (simulates max_tokens)
auto msg = common_chat_peg_parse(arena, "Thinking.\n</think><tool_call>get_wea",
                                 /*is_partial=*/false, pp);
assert(msg.reasoning_content == "Thinking.\n");
// No throw. No warning. Returns reasoning silently.

// Case B: complete valid tool call, one trailing \n
common_chat_peg_parse(arena, "...<tool_call>get_weather"
                             "<arg_key>city</arg_key><arg_value>Tokyo</arg_value>"
                             "</tool_call>\n",
                      /*is_partial=*/false, pp);
// Crashes.
```

Both are final parse. Both are "the parser couldn't consume all the input." Case A returns content silently. Case B crashes. This is master right now.

This PR was closed on the principle that parser errors should not be silently ignored on final parse. That is already not the policy; Case A proves it. LENIENT is hardcoded always-on, including on final parse. When the parser runs out of input, LENIENT silently returns whatever was extracted. No throw, no warning.

The only failures that crash are byte mismatches: when the input has a byte the parser doesn't expect, like a trailing `\n`.

"Grammar enforcement makes this impossible." Six bundled templates crash on stock requests right now: Cohere Command-R, Command-R7B, Nemotron Nano v2, Functionary v3.2, GPT-OSS, Solar.

"The throw surfaces bugs so users can file reports." The throw is broken. #20424 uses |
|
@jpohhhh I'm really struggling to understand why you're doing things this way. Instead of actually starting with meaningful issue reports, you come in with a PR that purportedly fixes some issues that you cannot show reproductions for. Only after a ton of explanations do you backtrack, and yet you still insist on "my PR is good because it fixes all those issues"; yet again, you have failed to create a meaningful issue report for them. You say stuff like:

Great, that's entirely possible. Where's the issue report with the crashes? Why are you insisting that we mask bad behavior instead of providing examples so we can actually fix them?

And no, nobody is "regressing public API". The behavior of the server when handling bad / malformed requests is exactly the opposite of public API, and nobody should be relying on it, like, ever. You're saying "the throw is broken" - no, one path in the throw has a bug that might make the error message less readable under some circumstances; that's all there is to it. Nice that you found it, nice that you provided a fix for it, less nice that you're spending time quarreling here instead of, say, verifying that your fix actually compiles.

Your trying to go over our heads to @ggerganov is frankly disrespectful and off-putting. You are not the maintainer of this code, and it's not you who has to deal with all the issue reports about it. |
|
The throw path didn't exist before #18675. Before that commit, these requests returned 200 with finish_reason. Now, non-streaming gets 500, and streaming gets content deltas then silence: no final event with finish_reason, no [DONE]. On text the parser already validated and relayed to the client during streaming. That is a regression.

The throw path's error message, the entire justification for the throw over a LOG_WRN, is itself broken.

You thumbs-upped my comment asking whether streaming delivering content and then the final parse throwing on the same text was wrong. You replied saying "Yeah,..."; then the PR was closed without it being addressed or discussed. On #18675, users posted repros. On #20660, you called my reports and fixes hallucinated. Two are now on master. You wrote that I "failed to create a meaningful issue report." #20729 has curl commands for every crashing template.

I hear the frustration in your response. I think you're carrying too much. You're fielding crash reports while insisting crash reports are the only valid way to find bugs, and then dismissing them when they come in. The decision to keep the throw creates the reports you don't have bandwidth to process. It's not sustainable, and it's not necessary.

The codebase already made the decision you're arguing against. LENIENT already silently returns content on final parse for truncated input (Case A / Case B, runnable on master). You're guaranteed not to emit invalid output, because the result is a fail: the parser is succeeding at its job while claiming it is exceptionally failing. Every response the parser gave was valid. No new invalid outputs are emitted as a result of this PR.

Every parser crash is a symptom of the same decision: that the throw path should exist. Every fix I file, every fix you file, every one of the 10 PRs you've done since launch, is fix-forwarding around a regression that could be resolved by not throwing.
I am being conscripted into fixing forward on a project that has decided it can break userspace, and so is everyone who files a crash report. I have been telling people for years that llama.cpp is stable, and I know that because I rely on its API and it has never regressed userspace. Right now I can't. I don't think that's what you intend. That's also why I tagged ggerganov.

The output was fine. You already sent the client all the output during the stream. If it wasn't parsed properly, if it was invalid JSON, it would have fail'd before the regression site.

I'm right there with you on the PR volume; I think I'm at 6 in 48 hours. Let me know if you have any particular help you need; otherwise, I'll keep at it. I understand your intent now.
The new chat parser throws `std::runtime_error` on `result.fail()` during final parse. If the PEG engine returns `RESULT_FAIL` for any reason, inference crashes. In llama-server this exhibits as no `finish_reason`, HTTP 500. Here's every report.

The current approach to this is fixing forward: patch the parser to return `RESULT_FAIL` less often, one class of input at a time, with the implicit assumption that someday no input will ever reach `RESULT_FAIL`. That's not going to happen. There are 14 `RESULT_FAIL` return sites in `peg-parser.cpp` that fire on byte mismatches regardless of LENIENT, and routine agentic workflows hit them: multi-tool-call responses, trailing bytes after `</tool_call>`, models tacking on text after a tool call. (Example in the unit tests, discovered at runtime via Qwen 3.5.)

#20191 got the right idea: don't crash on incomplete parses, extract what you can. It made the parser report `NEED_MORE_INPUT` instead of `FAIL` at end-of-input, which falls through to AST extraction. But it only covers end-of-input. The throw is still there for everything else.

Fix 1: extend #20191's precedent to mismatches. If the parser did useful work (`result.end > 0`), return the AST instead of throwing, same as it already does for `NEED_MORE_INPUT`. Remove the `is_partial` guard. One line.

Fix 2: use `until(value_suffix)` instead of `schema(json())` for TAG_WITH_TAGGED arg capture. Strict `json()` validation on already-generated output causes PEG backtracking that wipes the entire AST (reasoning, tool names, all of it) over a single malformed character. Grammar constraints still enforce the schema during generation. (~~grammar constraints won't be enforced~~, ty @aldehir)

Tests crash on unpatched master, pass with fix: