UPSTREAM PR #20729: server: workaround new chat parser regression #1272

Open
loci-dev wants to merge 1 commit into main from loci/pr-20729-workaround_parser_throw

Conversation

@loci-dev

Note

Source pull request: ggml-org/llama.cpp#20729

The new chat parser (#18675, 566059a26) added a std::runtime_error throw (chat.cpp L1740) when parsing fails on the final call (is_partial=false). During streaming, parse failures fall back to partial results and everything works; on the final parse (server-task.h L381) of the exact same text, it throws instead.

Server users get HTTP 500 and no final SSE event, even though they already received all the output and no further output would have been emitted if this didn't throw. API users calling common_chat_parse directly get a std::runtime_error that didn't exist before the new parser. In both cases, inference completes and output is generated, then discarded on the final parse. The same requests return 200 with finish_reason and timings on 34df42f7b (the commit before #18675).

This PR catches the throw on the final parse and falls back to the last successful streaming result.

I use the server API and would prefer that the throw on final parse be removed in common_chat_parse itself (#20708); that is the key regression. But the parser maintainers haven't been amenable to that, so this PR works around it at the server level instead.

Repros with stock models. Does not repro on 34df42f7b (commit before #18675):

# Llama 3.2 3B, tools defined, model doesn't call them
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write a haiku"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"temperature":1.5,"max_tokens":200}'
# {"error":{"code":500,"message":"Failed to parse input at pos 0: {\"name\": \"write_haiku\", \"parameters\": {\"city\": \"none\"}}","type":"server_error"}}

# Cohere Command-R7B, tools defined, any prompt
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"tools":[{"type":"function","function":{"name":"get_weather","description":"Get weather","parameters":{"type":"object","properties":{"city":{"type":"string"}},"required":["city"]}}}],"temperature":0.7,"max_tokens":200}'
# {"error":{"code":500,"message":"Failed to parse input at pos 0: You got it backwards! The answer is 5...","type":"server_error"}}

# GPT-OSS, json_schema response format
curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"name: Alice, age: 30"}],"response_format":{"type":"json_schema","json_schema":{"name":"person","schema":{"type":"object","properties":{"name":{"type":"string"},"age":{"type":"integer"}},"required":["name","age"]}}},"temperature":0.7,"max_tokens":500}'
# {"error":{"code":500,"message":"Failed to parse input at pos 487: ...Got it! Could you let me know what you'd like me to do with that information?...","type":"server_error"}}
Why these fail

The parser and grammar are both generated from the chat template. The grammar constrains generation, the parser extracts structure afterward. When they're aligned, everything works. Three ways they become misaligned:

  1. Grammar is lazy with tool_choice=auto (chat-auto-parser-generator.cpp L64): grammar only activates after a trigger pattern. If the model never hits the trigger (decides not to call tools), output is unconstrained but the parser still expects tool call structure.

  2. Grammar not generated for some request combos. GPT-OSS with json_schema builds a parser (chat.cpp L858) but skips grammar (chat.cpp L904). Parser validates, grammar doesn't constrain.

  3. Model ignores template tags. Cohere expects <|START_RESPONSE|> wrapping; the model skips it, so the parser fails at pos 0. (Not sure why the model skips it; this is a very interesting case.)

10 of 48 bundled templates crash

--chat-template-file overrides the model's built-in template, so any small model can be used to test all 48 bundled templates. The model generates output that doesn't match the foreign template, which is exactly the kind of unexpected input the parser needs to handle without crashing.

Affected: Llama 3.2, Llama 3.3, Cohere Command-R, Cohere Command-R7B, GLM-4.7, Nemotron Nano v2, Functionary v3.2, Mistral Small 3.2, GPT-OSS, Solar.

for TPL in models/templates/*.jinja; do
  pkill -f llama-server; sleep 1
  ./build/bin/llama-server -m model.gguf --port 9999 --jinja --chat-template-file "$TPL" &
  sleep 8
  RESP=$(curl -s http://localhost:9999/v1/chat/completions -H "Content-Type: application/json" \
    -d '{"messages":[{"role":"user","content":"What is 2+2?"}],"tools":[{"type":"function","function":{"name":"f","description":"f","parameters":{"type":"object","properties":{"x":{"type":"string"}},"required":["x"]}}}],"temperature":1.5,"max_tokens":100}')
  echo "$RESP" | grep -q "Failed to parse" && echo "CRASH: $(basename "$TPL")" || echo "OK: $(basename "$TPL")"
done


loci-review bot commented Mar 19, 2026

No meaningful performance changes were detected across 120773 analyzed functions in the following binaries: build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.llama-bench, build.bin.libmtmd.so, build.bin.libllama.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.libggml.so, build.bin.llama-quantize, build.bin.llama-qwen2vl-cli, build.bin.llama-tokenize, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli.

🔎 Full breakdown: Loci Inspector
💬 Questions? Tag @loci-dev

@loci-dev loci-dev force-pushed the main branch 5 times, most recently from e6c519b to 59f2b25 Compare March 23, 2026 02:17
