server: workaround new chat parser regression by jpohhhh · Pull Request #20729 · ggml-org/llama.cpp

jpohhhh · 2026-03-18T19:55:05Z

The new chat parser (#18675, 566059a26) added a std::runtime_error (chat.cpp L1740) that is thrown when parsing fails on the final call (is_partial=false). During streaming, parse failures fall back to partial results and everything works. On the final parse (server-task.h L381) of the exact same text, the parser throws instead.

Server users get an HTTP 500 and no final SSE event, even though they have already received all of the output and nothing further would have been emitted had the final parse not thrown. API users calling common_chat_parse directly get a std::runtime_error that did not exist before the new parser. In both cases, inference completes and output is generated, then discarded on the final parse. The same requests return 200 with finish_reason and timings on 34df42f7b (the commit before #18675).

This PR catches the throw on the final parse and falls back to the last successful streaming result.
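
A minimal sketch of the catch-and-fall-back pattern described above, not the actual diff. Everything here is hypothetical: `parsed_msg` and `parse_chat` are toy stand-ins (the real entry point is common_chat_parse, whose signature is not reproduced here), and `finalize_message` is an illustrative name.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a parsed chat message; not llama.cpp's real type.
struct parsed_msg {
    std::string content;
};

// Hypothetical stand-in for the parser: lenient while streaming
// (is_partial=true), strict on the final call (is_partial=false).
// For illustration, text that does not start with '{' is a parse failure.
parsed_msg parse_chat(const std::string & text, bool is_partial) {
    const bool ok = !text.empty() && text[0] == '{';
    if (!ok && !is_partial) {
        throw std::runtime_error("Failed to parse input at pos 0: " + text);
    }
    return parsed_msg{text};
}

// Workaround pattern: attempt the strict final parse; if it throws, keep the
// last result that was successfully produced during streaming.
parsed_msg finalize_message(const std::string & text, const parsed_msg & last_streamed) {
    try {
        return parse_chat(text, /*is_partial=*/false);
    } catch (const std::runtime_error &) {
        return last_streamed;
    }
}
```

With this shape, a request whose text parsed leniently during streaming still ends with a normal final result instead of an error, and text that parses cleanly takes the strict path unchanged.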

I use the server API and would prefer that the throw on the final parse be removed in common_chat_parse itself (#20708); that is the key regression. The parser maintainers haven't been amenable to that, so this PR works around it at the server level instead.

These repro with stock models and do not repro on 34df42f7b (the commit before #18675):

# Llama 3.2 3B, tools defined, model doesn't call them
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a haiku"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"temperature":1.5,"max_tokens":200}'
# {"error":{"code":500,"message":"Failed to parse input at pos 0: {\"name\": \"write_haiku\", \"parameters\": {\"city\": \"none\"}}","type":"server_error"}}

# Cohere Command-R7B, tools defined, any prompt
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"temperature":0.7,"max_tokens":200}'
# {"error":{"code":500,"message":"Failed to parse input at pos 0: You got it backwards! The answer is 5...","type":"server_error"}}

# GPT-OSS, json_schema response format
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"name: Alice, age: 30"}],"response_format":{"type":"json_schema","json_schema":{"name":"person","schema":{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}},"required":["name","age"]}}},"temperature":0.7,"max_tokens":500}'
# {"error":{"code":500,"message":"Failed to parse input at pos 487: ...Got it! Could you let me know what you'd like me to do with that information?...","type":"server_error"}}

Why these fail

The parser and grammar are both generated from the chat template. The grammar constrains generation, the parser extracts structure afterward. When they're aligned, everything works. Three ways they become misaligned:

Grammar is lazy with tool_choice=auto (chat-auto-parser-generator.cpp L64): grammar only activates after a trigger pattern. If the model never hits the trigger (decides not to call tools), output is unconstrained but the parser still expects tool call structure.
Grammar not generated for some request combos. GPT-OSS with json_schema builds a parser (chat.cpp L858) but skips grammar (chat.cpp L904). Parser validates, grammar doesn't constrain.
Model ignores template tags. Cohere expects <|START_RESPONSE|> wrapping. The model skips the tags, so the parser fails at pos 0. (I'm not sure why the model skips them; this is a very interesting case.)
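
To make the first case concrete, here is a toy model of the lazy-grammar mismatch. It is a sketch under assumptions, not llama.cpp code: the `<tool_call>` trigger string and all three function names are invented for illustration, and real templates use their own trigger patterns.

```cpp
#include <string>

// Hypothetical trigger marker; real templates define their own tool-call tags.
const std::string TRIGGER = "<tool_call>";

// Lazy grammar: constraints apply only once the trigger appears in the output.
bool grammar_is_active(const std::string & output) {
    return output.find(TRIGGER) != std::string::npos;
}

// A strict parser that expects the output to begin with the tool-call
// structure, whether or not the grammar ever activated.
bool strict_parser_accepts(const std::string & output) {
    return output.rfind(TRIGGER, 0) == 0; // true only if output starts with TRIGGER
}

// Misaligned: generation was never constrained, yet the parser rejects it.
bool is_misaligned(const std::string & output) {
    return !grammar_is_active(output) && !strict_parser_accepts(output);
}
```

In this toy, a plain-text answer such as a haiku never hits the trigger, so nothing constrained it, but the strict parser still rejects it at position 0.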

10 of 48 bundled templates crash

--chat-template-file overrides the model's built-in template, so any small model can be used to test all 48 bundled templates. The model generates output that doesn't match the foreign template, which is exactly the kind of unexpected input the parser needs to handle without crashing.

Affected: Llama 3.2, Llama 3.3, Cohere Command-R, Cohere Command-R7B, GLM-4.7, Nemotron Nano v2, Functionary v3.2, Mistral Small 3.2, GPT-OSS, Solar.

# Run the same request against every bundled template and report parse failures
for TPL in models/templates/*.jinja; do
  # Stop any server left over from the previous iteration
  pkill -f llama-server; sleep 1
  ./build/bin/llama-server -m model.gguf --port 9999 --jinja --chat-template-file "$TPL" &
  # Give the server time to load the model
  sleep 8
  RESP=$(curl -s http://localhost:9999/v1/chat/completions -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"tools":[{"type":"function","function":{"name":"f","description":"f","parameters":{"type":"object","properties":{"x":{"type":"string"}},"required":["x"]}}}],"temperature":1.5,"max_tokens":100}')
  echo "$RESP" | grep -q "Failed to parse" && echo "CRASH: $(basename "$TPL")" || echo "OK: $(basename "$TPL")"
done

server: workaround new chat parser regression

841e09e

github-actions bot added examples server labels Mar 18, 2026

loci-dev mentioned this pull request Mar 19, 2026

UPSTREAM PR #20729: server: workaround new chat parser regression auroralabs-loci/llama.cpp#1272

Open

jpohhhh wants to merge 1 commit into ggml-org:master from jpohhhh:workaround_parser_throw
