json-schema-to-grammar: expand PCRE shorthands in pattern strings#23436
json-schema-to-grammar: expand PCRE shorthands in pattern strings#23436iOptimizeThings wants to merge 3 commits into
Conversation
|
Hi @iOptimizeThings, thanks for your contribution! Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:
Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below. |
I did the code/testing manually, traced the bug through the error message, found the issue in _visit_pattern in json-schema-to-grammar.cpp, wrote the fixes, and tested it on my RTX 5070 with 8 cases (standalone and bracketed for \d, \w, \s, \b). AI was used only to help format the PR description text (disclosed). regarding multiple PRs: my other open PR is #23381 which is currently under review. I wasn't aware of the one-PR limit for new contributors sorry!! happy to close this and reopen once that one merges if that's preferred, or leave it to your discretion. |
|
No worries, can you add some tests? |
7dc6302 to
d3dd76d
Compare
bb07162 to
7241d8f
Compare
|
@aldehir added 14 C++ tests covering all 8 shorthands (\d \D \w \W \s \S \b \B), both standalone and bracket-class forms where applicable. All pass on a WSL build with -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_TESTS=ON Placed them in the C++ only section, |
| if (sub_pattern[i] == '\\') { | ||
| square_brackets += sub_pattern.substr(i, 2); | ||
| i += 2; | ||
| char next = sub_pattern[i + 1]; |
There was a problem hiding this comment.
Needs a check for i + 1 < length.
There was a problem hiding this comment.
Fixed, added i + 1 < length guard.
|
The naive approach of expanding the character class inline doesn't work for negated classes. For example, which is incorrect. It should probably expand to something like: We should probably handle the general case of any combination of shorthand notations, not just this one. For example, |
|
Good catch. new commit handles these now, negated shorthands now go into a separate bucket and emit alternation instead of being inlined. Tests added and passing: |
Overview
Tools that use PCRE shorthands like \d, \w, \s in their JSON schema
patternfields cause the server to fail witherror parsing grammar: unknown escape at \d. The grammar constraint gets silently dropped and the model runs unconstrained.This happens because the JSON schema to GBNF converter was copying those shorthands through verbatim, but the GBNF parser only knows standard escapes (\n, \t, etc.) not PCRE character class shorthands.
Additional information
The fix expands them to their GBNF equivalents at conversion time, both standalone and inside bracket expressions:
This comes up a lot with MCP tools, TypeScript MCP SDK and OpenAPI-generated schemas use \d and \w routinely.
Fixes #22314
@deiteris verified the repro on my end, all 6 pattern variants pass now.
Requirements