
chat: new parser should not crash inference #20708

Closed
jpohhhh wants to merge 1 commit into ggml-org:master from jpohhhh:master

Conversation

@jpohhhh (Contributor)

@jpohhhh jpohhhh commented Mar 18, 2026

The new chat parser throws std::runtime_error on result.fail() during final parse. If the PEG engine returns RESULT_FAIL for any reason, inference crashes. In llama-server this exhibits as no finish_reason, HTTP 500. Here's every report.

The current approach to this is fixing forward: patch the parser to return RESULT_FAIL less often, one class of input at a time, with the implicit assumption that someday no input will ever reach RESULT_FAIL. That's not going to happen. There are 14 RESULT_FAIL return sites in peg-parser.cpp that fire on byte mismatches regardless of LENIENT, and routine agentic workflows hit them: multi-tool-call responses, trailing bytes after </tool_call>, models tacking on text after a tool call. (example in unit tests, discovered at runtime via Qwen 3.5)

#20191 got the right idea: don't crash on incomplete parses, extract what you can. It made the parser report NEED_MORE_INPUT instead of FAIL at end-of-input, which falls through to AST extraction. But it only covers end-of-input. The throw is still there for everything else.

Fix 1: Extend #20191's precedent to mismatches. If the parser did useful work (result.end > 0), return the AST instead of throwing, same as it already does for NEED_MORE_INPUT. Remove the is_partial guard. One line.
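
In sketch form (hypothetical types mirroring the parser result, not the actual peg-parser.cpp API), Fix 1 amounts to one extra condition before the throw:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the parser's result; names are illustrative,
// not the real peg-parser.cpp types.
struct parse_result {
    bool   failed;
    size_t end;   // bytes successfully consumed before the failure
};

enum class final_action { extract_ast, throw_error };

// Fix 1 in sketch form: if the parser did useful work (end > 0), fall
// through to AST extraction instead of throwing, matching the existing
// NEED_MORE_INPUT behavior described above.
final_action decide(const parse_result & r) {
    if (!r.failed)  return final_action::extract_ast;
    if (r.end > 0)  return final_action::extract_ast; // proposed change
    return final_action::throw_error;
}
```

The only behavioral change is the middle branch: a failed parse that consumed zero bytes still surfaces as an error.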

Fix 2: until(value_suffix) instead of schema(json()) for TAG_WITH_TAGGED arg capture. Strict json() validation on already-generated output causes PEG backtracking that wipes the entire AST (reasoning, tool names, all of it) over a single malformed character. Grammar constraints still enforce the schema during generation. - grammar constraints won't be enforced, ty @aldehir

Tests crash on unpatched master, pass with fix:

cmake -B build_test -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
cmake --build build_test --target test-chat
./build_test/bin/test-chat
# std::runtime_error: Failed to parse input at pos 34: <tool_call>

@github-actions github-actions bot added the "testing (Everything test related)" label Mar 18, 2026
@bartowski1182 (Contributor)

Do you have a way to reproduce this through like the server API? The tests provided are a bit confusing, I'm not sure I understand what you're trying to exercise

it looks like you're trying to show if a model gave malformed json it would crash? but doesn't that seem appropriate? How are you meant to parse JSON that's not valid JSON

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 18, 2026

Do you have a way to reproduce this through like the server API? The tests provided are a bit confusing, I'm not sure I understand what you're trying to exercise

it looks like you're trying to show if a model gave malformed json it would crash? but doesn't that seem appropriate? How are you meant to parse JSON that's not valid JSON

You nailed it, that's the core issue: how can you parse invalid JSON? When you can't, is throwing and crashing the server, only on the final call to parse, the right behavior? If so, was the rollout of a new parser a good time to introduce that? For building a perfect parser, maybe: you'll get crash reports and can do workarounds for some cases. For API clients, no.

The tests are the repro from calling the server APIs. (tl;dr: qwen 3.5 0.8B, trailing char causes invalid JSON)

The chain:

  1. Server calls final parse with is_partial=false: server-task.h L381
  2. Parser throws std::runtime_error: chat.cpp L1740
  3. Exception caught as server_error, returns HTTP 500: server.cpp L47-57
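
The chain above can be condensed into a standalone sketch (illustrative stand-ins for server-task.h, chat.cpp, and the server.cpp error handler; none of these names are the real API):

```cpp
#include <stdexcept>
#include <string>

// Illustrative only: a stand-in for the server's final-parse call chain.
// parser_fails models the PEG engine returning RESULT_FAIL.
int handle_final_parse(const std::string & text, bool parser_fails) {
    try {
        if (parser_fails) {
            // step 2: chat.cpp throws on RESULT_FAIL during the final parse
            throw std::runtime_error("Failed to parse input at pos 34: " + text);
        }
        return 200; // normal completion, finish_reason is emitted
    } catch (const std::exception &) {
        return 500; // step 3: server.cpp maps the exception to HTTP 500
    }
}
```
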

Looking across the crash reports from this, the parser maintainers are valiantly in fix-forward one by one mode, making the parser handle more inputs correctly so RESULT_FAIL is returned less often.

That's work, and it can work case by case. For example, they worked around the most obvious case (truncation) by hand-patching the parser to return NEED_MORE_INPUT instead of FAIL where it would otherwise fail.

But it doesn't change the fact that the parser can fail and it's a regression to make that exceptional.

@bartowski1182 (Contributor)

When you can't, is throw and crash the server only on the final call to parse the right behavior?

What would the alternative be, absorb the error and pretend it didn't happen? Resulting in who knows what kind of behavior? That sounds far more dangerous to me, unpredictable behavior, especially in tool calling, sounds like a bad idea.. crashing is the right call IMO

@aldehir (Contributor)

aldehir commented Mar 18, 2026

Hi, the old parser also crashed on invalid JSON.

Are you advocating to remove grammar constrain decoding (for JSON) from llama.cpp?

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 18, 2026

Hi, the old parser also crashed on invalid JSON.

Are you advocating to remove grammar constrain decoding (for JSON) from llama.cpp?

No, and no. (I'm not sure why you asked about removing grammar-constrained decoding; please feel free to follow up with any code or commentary that suggests that. I don't want to give other readers that impression, either.)

I tested on 34df42f (commit before #18675). Same models, same requests, same temperatures. 200 with finish_reason every time. The throw at chat.cpp L1740 is new.

I'm not advocating removing grammar constraints. I'm advocating not throwing when the parser fails on the final call, the same way the streaming path already handles it, and the way the API functioned before the new parser.

Stock model repros in #20729 if helpful.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 18, 2026

When you can't, is throw and crash the server only on the final call to parse the right behavior?

What would the alternative be, absorb the error and pretend it didn't happen? Resulting in who knows what kind of behavior? That sounds far more dangerous to me, unpredictable behavior, especially in tool calling, sounds like a bad idea.. crashing is the right call IMO

The alternative is what the streaming path already does. During streaming, the exact same text is parsed with is_partial=true, the parse fails, and it falls back to partial AST extraction. Reasoning streams, tool names appear, everything works. Then the final parse runs on the exact same text with is_partial=false, and instead of the fallback, it throws. The data is already there from streaming. #20729 just catches the throw and uses it. #20728 (this one) catches the throw and won't emit further output.
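
A minimal sketch of that asymmetry (illustrative types, not the real chat.cpp signatures): the same failing result extracts content during streaming and throws on the final call:

```cpp
#include <stdexcept>
#include <string>

// Illustrative only, not the real chat.cpp types: the same failing
// parse is handled differently depending on is_partial, which is the
// asymmetry described above.
struct peg_result {
    bool   failed;
    size_t end;
};

std::string finalize(const peg_result & r, bool is_partial, const std::string & extracted) {
    if (!r.failed || is_partial) {
        return extracted; // streaming path: fall back to partial AST extraction
    }
    // final parse on the exact same text: throws instead of falling back
    throw std::runtime_error("Failed to parse input");
}
```
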

Separately, I found repros on stock models. See #20729 for details, but the short version: Llama 3.2 3B with tools defined, model decides not to call them, HTTP 500. Cohere Command-R7B with tools, any prompt, HTTP 500. 10 of 48 bundled templates are affected. All regressed.

The thought experiment side of this does fascinate me: what should we do?

Put another way: there are 0 LLM APIs I am familiar with that, when inference is done, turn strict, rollback all output internally to a 0 response, then return a 500, and never complete the SSE stream.

Put another, another, way: This is a clear regression from before the new parser, and I'm not sure why the new parser is placing such a heavy burden on itself, i.e. expecting that it can never fail and if it does it should catastrophically take down everything with it.

Put another, another, another way: The behavior change here wouldn't be you suddenly get invalid JSON that's presented as valid. It's that the parser doesn't take down the whole stack suddenly and only on the final call, so you do get your finish_reason, and there's 0 deltas to it, which is what I think you'd expect in this scenario

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 18, 2026

Thank you @aldehir: it's best, for me, to just revert that line. The PR is now just the one-line is_partial fix for the runtime-exception regression.

@aldehir (Contributor)

aldehir commented Mar 19, 2026

See comments. Please adjust tests to use the existing peg_tester infrastructure. If we need to add additional methods, that's OK as long as they make sense.

@aldehir (Contributor)

aldehir commented Mar 19, 2026

@jpohhhh,

How do you recommend we reconcile this with allowing bugs in the parser/grammar to go unnoticed due to the parser being overly permissive?

For example, #20650 demonstrates a bug when there are no required arguments. However, if we extract partial content and return success as this PR does, this class of bug would likely go unnoticed. This feels like an exceptional case.

I'd appreciate your input on the matter.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 19, 2026

@jpohhhh,

How do you recommend we reconcile this with allowing bugs in the parser/grammar to go unnoticed due to the parser being overly permissive?

For example, #20650 demonstrates a bug when there are no required arguments. However, if we extract partial content and return success as this PR does, this class of bug would likely go unnoticed. This feels like an exceptional case.

I'd appreciate your input on the matter.

Ty for your help, apologies that the tests were so messy. I kept appending and experimenting and should have cleaned up for review.

This no-args case is interesting, thank you for sharing. I added a few things in this PR:

  • LOG_WRN when result.fail() && result.end > 0 with failure position and the unparsed tail. Avoids cases going silently unnoticed, if people have logs and are reporting odd behavior they've observed, we have an idea of where to look. (also lets us get data from arbitrary logs shared on other bugs, i.e. can search GitHub issues for the string to get test cases)
    - A test case from #20650 ("Misc. bug: Getting 500 - Failed to parse input at pos x when tool calling") using expect_partial_rollback for GLM-4.7-Flash with <tool_call>name</tool_call> (no arg tags). Removed; it now lives in #20764 ("chat: handle tool calls with no required args in TAG_WITH_TAGGED format"), the PR for supporting no-args tools. I decided it didn't make sense in this PR since we don't expect this state to be long-lasting.
  • peg_tester.expect_partial_rollback() - lets us document cases where the grammar doesn't yet cover a particular output shape. If the partial rollback doesn't happen, the test failure message says this is probably a good thing and that the test should be updated to expect extracted content instead. - removed, see above
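
A sketch of what the warning in the first bullet could log (LOG_WRN is llama.cpp's logging macro; a plain formatted string stands in here, and the function name is hypothetical):

```cpp
#include <string>

// Hypothetical helper: format the proposed warning for
// result.fail() && result.end > 0, including the failure position and
// the unparsed tail so field reports become searchable.
std::string format_parse_warning(const std::string & input, size_t end) {
    return "chat parser: partial parse, stopped at pos " + std::to_string(end) +
           ", unparsed tail: '" + input.substr(end) + "'";
}
```
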

A no args tool is an interesting idea, but something I've never tried. I am not familiar with what inference APIs do in general in this case but it seems reasonable

Perhaps separate from your question: if you're thinking about the parser and no-args tools causing issues in general, I'm going to take a look at the tool parsing code for the no-arg case after I write this comment. The parser is predictable and highly correct, and the grammar structure is clear; it might be a straightforward addition, and I'm excited to challenge myself a bit.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 19, 2026

No-args parsing is in #20764, so I took the peg_tester.expect_partial_rollback() and the #20650 test out of this commit

@aldehir (Contributor)

aldehir commented Mar 19, 2026

Ah, I wasn't expecting you to resolve the issue. Just wondered how we could avoid issues like that going unnoticed. A warning seems appropriate.

@aldehir (Contributor)

aldehir commented Mar 19, 2026

Fixed some nits and added the template back in, it was failing on my end.

Since #20424 was merged, the tests might fail. Can you rebase on top of master to see? I believe we just need to add <think> to the start of every test input.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 19, 2026

Ah, I wasn't expecting you to resolve the issue. Just wondered how we could avoid issues like that going unnoticed. A warning seems appropriate.

There's a lot of work to do AFAIK (10/49 builtins failing, see #20729), and I'm happy to help :) Once this gets in I'll start pitching in on those.

(didn't mean to request re-review, apologies. And I don't think I can undo it. But I'll have comments resolved in 10 min)

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 19, 2026

All set! (I shouldn't have said 10 minutes, that guaranteed it'd take longer :)

@aldehir aldehir (Contributor) left a comment

Ah, I see. I'll let @pwilkin look at the peg tester inclusions. Otherwise, LGTM. Thank you for the contribution, and for pointing out the broken templates. We'll tackle those next.

(And wait for CI)

@pwilkin (Contributor)

pwilkin commented Mar 19, 2026

I think you should adapt the tests for the changes done in #20424; it seems some stuff is failing now.

@pwilkin (Contributor)

pwilkin commented Mar 19, 2026

I'm not a fan of adding support for handling an impossible case. As of now, tool calling grammar enforcement will force an <eog> right after the terminating tool marker. Therefore, I don't know why we would support this case.

As a general design decision, I really do not like the idea of parser errors getting silently ignored and producing only warnings. From my perspective, this generates a cascade of the following:
-> there is a subtle bug in the parser, but since there is no actual bug displayed during execution, people never see it
-> instead, people become convinced that "model is broken" and/or "cannot handle X", where X is the scenario where this keeps happening
-> nobody reports a specific issue because there is no clear error message

I definitely prefer the current way of doing things where the parser returns a hard error. That way, whoever encounters the error knows that something is broken, files a bug report and we get it fixed. The way you're proposing is basically masking errors with temporary recovery that really doesn't do anything (since if we got a partial parse, the answer is probably broken anyway, so it's better to let the client know we've hit an error than pretend everything is going well).

@aldehir (Contributor)

aldehir commented Mar 19, 2026

As a general design decision, I really do not like the idea of parser errors getting silently ignored and producing only warnings. From my perspective, this generates a cascade of the following:
-> there is a subtle bug in the parser, but since there is no actual bug displayed during execution, people never see it
-> instead, people become convinced that "model is broken" and/or "cannot handle X", where X is the scenario where this keeps happening
-> nobody reports a specific issue because there is no clear error message

This is a compelling argument, and one I agree with. Example: #20532 recently surfaced after the change to the gpt-oss parser. The issue was a problem with how checkpoints were handled, and this was missed in the past because the old parser was really lenient: it would skip invalid sequences until it found a valid one. I know, because I wrote it. This most likely impacted workflows that forked off, but it went uncaught because it was simply let through. We did emit warnings when this happened, yet no issue--that I can see--was opened.

In light of this, I'm feeling less comfortable with this PR.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 19, 2026

Here's what actually happens from the client's perspective:

  1. Streaming begins. The server calls common_chat_parse with is_partial=true on every token.
  2. The client receives reasoning tokens, tool names, arguments — all successfully extracted and streamed as SSE deltas.
  3. Generation finishes. The server calls common_chat_parse on the exact same accumulated text, but with is_partial=false.
  4. This time it throws. The client gets HTTP 500. No finish_reason. The content it already received just... vanishes from the API's perspective.

Do we agree that's wrong?

@aldehir (Contributor)

aldehir commented Mar 19, 2026

Hate to say it, but suppressing the error isn't the right path forward here. Perhaps if we had some sort of telemetry, but that's a whole can of worms and undoubtedly unpopular. The fact is, an error thrown at this point is exceptional and indicative of a bug.

There is some churn after the autoparser implementation, but that's to be expected. The lack of stable releases makes this difficult to navigate, but we'll eventually stabilize. FWIW, @pwilkin had his branch open for at least a month and did plenty of advertising for testers. I don't see how this could have been managed differently.

I appreciate your contributions that spun off from this PR and the work you put into it, even if not accepted.

@pwilkin I think we need to expand the error to include the whole generation, this should help tracking the root cause of these issues.

@aldehir aldehir closed this Mar 19, 2026
@pwilkin (Contributor)

pwilkin commented Mar 19, 2026

@jpohhhh Yeah, as I said above, we have a reproduction scenario that might describe the case you're mentioning - it's the "tool calling within reasoning" scenario. Since the reasoning prefill fixes have been merged, I'll start working on a fix there.

I know it's frustrating when the transition happens, but the autoparser was a huge endeavor and there are bound to be bugs. I tested as much as I could, a lot of people contributed, and we found some very interesting and non-trivial bugs, but in the end there are always going to be more bug reports when it goes to production. I think I've done something like 10 PRs since the launch fixing various bugs; some were one-liners, some like #20424 ended up as over 500 LoC modifications. That's just how it is. The whole idea was to have a stable system where bugs are reproducible, and I think we're getting close to that.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 19, 2026

I want to be direct about what I'm seeing from the outside. I've been working on this for four days. I am not frustrated. I can hand-patch and move on. It's one line. I am trying to help because I've been in a fix-forward swamp myself. It feels like exhilarating, honest work. But:

  • The decision here is to regress public API
  • The argument for the throw is "it surfaces bugs." The throw itself has a bug that no one caught.
  • The argument is "errors shouldn't be silently ignored." The codebase already silently ignores them for truncation.
  • The argument is "grammar makes crashes impossible." Six built-in templates crash.
  • The argument is unjustifiable given the actual outcome: the client already received all of this "failed parse" output during streaming, and it's perfectly fine JSON.
  • We agree that the outcome is wrong.
  • The PR was closed as a "tool calling within reasoning" issue. None of the repros are that.
  • Until #20764 ("chat: handle tool calls with no required args in TAG_WITH_TAGGED format", the fix I did this AM), every Llama 3.x request with tools where the model didn't call one was a 500. That took you from 10/42 built-in templates failing to 6, currently. There's tons and tons of road left to get to the point where the parser never fails, and...
  • ...the idea the call site means the parser failed, in a substantive way, is a red herring: a \n after a tool call does it, among innumerable other things. "Failed" doesn't mean "we screwed up and have invalid output" it means "we didn't consume literally every byte" (see "code section" below)
  • There are zero unit tests exercising the throw path. Zero.

This is not "the autoparser has some bugs, we'll fix forward." This is the error handling strategy itself producing errors that it can't catch, because it doesn't test its own failure modes. This is a team grinding and doing fantastic work but too close to its own work to see the fundamental thing: this is a regression, at HEAD, introduced with a large PR, that demands the large PR immediately be stable enough that an error is exceptional. And even that's wrong, because fail() does not mean "the parser failed in a way we can fix." It can mean all sorts of things, the cleanest one being "the model added a \n after a tool call". Please reopen.

cc @ggerganov: only because the technical evidence is clear, the PR was closed on reasoning that contradicts the shipped code, and I don't think the people closest to this work have the distance to see it. They're doing incredible work and grinding through a massive transition. But the error handling policy, "the regression that causes hard crashes is good because they surface bugs", is itself untested, already inconsistent with LENIENT, and the throw it defends is currently broken. What's going on here needs a second pair of eyes, and the team needs permission from someone they respect to make things easier on themselves while relieving users.

Code section

// GLM-4.7-Flash, final parse (is_partial=false), tools enabled.

// Case A: input truncated mid-word (simulates max_tokens)
auto msg = common_chat_peg_parse(arena, "Thinking.\n</think><tool_call>get_wea",
                                  /*is_partial=*/false, pp);
assert(msg.reasoning_content == "Thinking.\n");
// No throw. No warning. Returns reasoning silently.

// Case B: complete valid tool call, one trailing \n
common_chat_peg_parse(arena, "...<tool_call>get_weather"
                              "<arg_key>city</arg_key><arg_value>Tokyo</arg_value>"
                              "</tool_call>\n",
                       /*is_partial=*/false, pp);
// Crashes.

Both are final parse. Both are "the parser couldn't consume all the input." Case A returns content silently. Case B crashes. This is master right now.

This PR was closed on the principle that parser errors should not be silently ignored on final parse. That is already not the policy. Case A proves it. LENIENT is hardcoded always-on, including on final parse. When the parser runs out of input, LENIENT silently returns whatever was extracted. No throw, no warning, no is_partial check. Every max_tokens truncation of structured output hits this path. It's correct. Nobody wants max_tokens to crash the server.

The only failures that crash are byte mismatches: when the input has a byte the parser doesn't expect, like a \n after </tool_call>. Same situation ("parser didn't consume everything"), different outcome. This PR makes them consistent.

"Grammar enforcement makes this impossible." Six bundled templates crash on stock requests right now: Cohere Command-R, Command-R7B, Nemotron Nano v2, Functionary v3.2, GPT-OSS, Solar.

"The throw surfaces bugs so users can file reports." The throw is broken. #20424 uses input.substr(result.end) where result.end is in effective_input space. After #20424, crashes with std::out_of_range: basic_string. No content, just a C++ exception. The thing that's better than a LOG_WRN because it produces actionable error messages does not produce an actionable error message. It doesn't produce a readable one. And there's still not one unit test for the throw path, which is how this shipped without anyone noticing.

@pwilkin (Contributor)

pwilkin commented Mar 20, 2026

@jpohhhh I'm really struggling to understand why you're doing things this way. Instead of starting with meaningful issue reports, you come in with a PR that purportedly fixes some issues that you cannot show reproduction for. Only after a ton of explanations do you backtrack, and yet you still insist on the "my PR is good because it fixes all those issues" line; yet again, you have failed to create a meaningful issue report for them.

You say stuff like:
"Six bundled templates crash on stock requests right now: Cohere Command-R, Command-R7B, Nemotron Nano v2, Functionary v3.2, GPT-OSS, Solar."

Great, that's entirely possible. Where's the issue report with the crashes? Why are you insisting that we mask bad behavior instead of providing examples so we can actually fix them?

And no, nobody is "regressing public API". The behavior of the server when handling bad / malformed requests is exactly the opposite of public API, and nobody should be relying on it, like, ever.

You're saying "the throw is broken" - no, one path in the throw has a bug that might make the error message less readable under some circumstances, and that's all there is to it. Nice that you found it, nice that you provided a fix for it, less nice that you're spending time quarreling here instead of, let's say, verifying that your fix actually compiles.

You're trying to go over our heads to @ggerganov, which is frankly disrespectful and off-putting. You are not the maintainer of this code and it's not you who has to deal with all the issue reports about it.

@jpohhhh (Contributor, Author)

jpohhhh commented Mar 20, 2026

The throw path didn't exist before #18675. Before that commit, these requests returned 200 with finish_reason. Now, non-streaming gets 500, streaming gets content deltas then silence. No final event with finish_reason, no [DONE]. On text the parser already validated and relayed to the client during streaming. That is a regression.

The throw path's error message, the entire justification for the throw over a LOG_WRN, is std::out_of_range: basic_string. Every thinking model. From the day #20424 shipped. Zero tests on the throw path. Your commit broke it, nobody caught it, because the path was never tested. I wrote the test and the fix. You merged it.

You thumbs-upped my comment asking whether streaming delivering content and then the final parse throwing on the same text was wrong. You replied saying "Yeah,..." then the PR was closed without it being addressed or discussed.

On #18675, users posted repros. On #20660, you called my reports and fixes hallucinated. Two are now on master. You wrote that I "failed to create a meaningful issue report." #20729 has curl commands for every crashing template.

I hear the frustration in your response. I think you're carrying too much. You're fielding crash reports while insisting crash reports are the only valid way to find bugs, and then dismissing them when they come in. The decision to keep the throw creates the reports you don't have bandwidth to process. It's not sustainable and it's not necessary. The codebase already made the decision you're arguing against: LENIENT already silently returns content on final parse for truncated input (Case A / Case B, runnable on master). You're guaranteed not to emit invalid output, because anything invalid would already have failed the parse. The parser is succeeding at its job while claiming it is exceptionally failing. Every response the parser gave was valid. No new invalid outputs are emitted as a result of this PR.

Every parser crash is a symptom of the same decision: that the throw path should exist. Every fix I file, every fix you file, every one of the 10 PRs you've done since launch, is fix-forwarding around a regression that could be resolved by not throwing. I am being conscripted into fixing forward on a project that has decided it can break userspace, and so is everyone who files a crash report.

I have been telling people for years that llama.cpp is stable, and I know that because I rely on its API and it has never regressed userspace. Right now I can't. I don't think that's what you intend. That's also why I tagged ggerganov. The output was fine. You already sent the client all the output during the stream. If it wasn't parsed properly, if it was invalid JSON, it would have fail'd before the regression site.

I'm right there with you on the PR volume, I think I'm at 6 in 48 hours. Let me know if you have any particular help you need; otherwise, I'll keep at it. I understand your intent now.
