json-schema-to-grammar: expand PCRE shorthands in pattern strings #23436

Open
iOptimizeThings wants to merge 3 commits into ggml-org:master from iOptimizeThings:fix/gbnf-pcre-shorthands

@iOptimizeThings iOptimizeThings commented May 20, 2026

Overview

Tools that use PCRE shorthands like \d, \w, \s in their JSON schema pattern fields cause the server to fail with error parsing grammar: unknown escape at \d. The grammar constraint gets silently dropped and the model runs unconstrained.

This happens because the JSON schema to GBNF converter copied those shorthands through verbatim, but the GBNF parser only knows standard escapes (\n, \t, etc.), not PCRE character class shorthands.

Additional information

The fix expands them to their GBNF equivalents at conversion time, both standalone and inside bracket expressions:

  • \d / \D → [0-9] / [^0-9]
  • \w / \W → [a-zA-Z0-9_] / [^a-zA-Z0-9_]
  • \s / \S → [ \t\n\r] / [^ \t\n\r]
  • \b / \B → skipped (word boundary has no GBNF equivalent)
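
As a rough illustration of the mapping above, here is a minimal sketch of the standalone case. The names `expand_shorthand` and `expand_standalone` are hypothetical and are not taken from the PR's diff; the sketch also deliberately ignores bracket expressions, which need separate handling.

```cpp
#include <optional>
#include <string>

// Hypothetical helper (not the PR's actual code): maps the character that
// follows a backslash to the GBNF class listed above.
// Returns std::nullopt when the character is not a PCRE class shorthand.
std::optional<std::string> expand_shorthand(char c) {
    switch (c) {
        case 'd': return std::string("[0-9]");
        case 'D': return std::string("[^0-9]");
        case 'w': return std::string("[a-zA-Z0-9_]");
        case 'W': return std::string("[^a-zA-Z0-9_]");
        case 's': return std::string("[ \\t\\n\\r]");
        case 'S': return std::string("[^ \\t\\n\\r]");
        case 'b':
        case 'B': return std::string();  // word boundary: skipped entirely
        default:  return std::nullopt;
    }
}

// Expands shorthands that appear outside bracket expressions.
// Assumption: the input contains no [...] groups.
std::string expand_standalone(const std::string & pattern) {
    std::string out;
    for (size_t i = 0; i < pattern.size(); i++) {
        // Only look ahead when a character actually follows the backslash.
        if (pattern[i] == '\\' && i + 1 < pattern.size()) {
            if (auto rep = expand_shorthand(pattern[i + 1])) {
                out += *rep;
            } else {
                out += pattern.substr(i, 2);  // keep other escapes verbatim
            }
            i++;  // consume the escaped character
        } else {
            out += pattern[i];
        }
    }
    return out;
}
```

With this sketch, a pattern such as `\d{3}-\w+` would come out as `[0-9]{3}-[a-zA-Z0-9_]+`, which the GBNF parser can read.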

This comes up a lot with MCP tools: the TypeScript MCP SDK and OpenAPI-generated schemas use \d and \w routinely.

Fixes #22314

@deiteris verified the repro on my end, all 6 pattern variants pass now.

Requirements

@iOptimizeThings iOptimizeThings requested a review from a team as a code owner May 20, 2026 18:31

ggml-gh-bot Bot commented May 20, 2026

Hi @iOptimizeThings, thanks for your contribution!

Per our contribution guidelines, the automated PR checker found the following issue(s) that need your attention:

  • Multiple open PRs from a new contributor: We limit new contributors (those without a previously merged PR) to 1 open PR at a time. You currently have 2 open PRs.

  • AI-generated content: This project does not accept PRs, descriptions or commit messages that are fully or predominantly AI-generated. If you have used AI to assist you in writing code, please make sure to disclose that explicitly.


Please note that maintainers reserve the right to make final decisions on PRs. If you believe there is a mistake, please comment below.

@iOptimizeThings
Author

I did the code/testing manually, traced the bug through the error message, found the issue in _visit_pattern in json-schema-to-grammar.cpp, wrote the fixes, and tested it on my RTX 5070 with 8 cases (standalone and bracketed for \d, \w, \s, \b). AI was used only to help format the PR description text (disclosed).

regarding multiple PRs: my other open PR is #23381 which is currently under review. I wasn't aware of the one-PR limit for new contributors sorry!! happy to close this and reopen once that one merges if that's preferred, or leave it to your discretion.

@aldehir
Contributor

aldehir commented May 20, 2026

No worries, can you add some tests?

@iOptimizeThings iOptimizeThings force-pushed the fix/gbnf-pcre-shorthands branch from 7dc6302 to d3dd76d Compare May 20, 2026 22:34
@iOptimizeThings iOptimizeThings force-pushed the fix/gbnf-pcre-shorthands branch from bb07162 to 7241d8f Compare May 20, 2026 22:38
@iOptimizeThings
Author

@aldehir I added 14 C++ tests covering all 8 shorthands (\d \D \w \W \s \S \b \B), in both standalone and bracket-class forms where applicable. All pass on a WSL build with -DBUILD_SHARED_LIBS=OFF -DLLAMA_BUILD_TESTS=ON.

I placed them in the C++-only section. json_schema_to_grammar.py has the same gap and would fail these, but that's separate from what #22314 reports, so I'll leave it for a follow-up.

if (sub_pattern[i] == '\\') {
square_brackets += sub_pattern.substr(i, 2);
i += 2;
char next = sub_pattern[i + 1];
Contributor

Needs a check for i + 1 < length.

Author

Fixed, added i + 1 < length guard.

@aldehir
Contributor

aldehir commented May 22, 2026

The naive approach of expanding the character class inline doesn't work for negated classes.

For example, [\d\W] would expand to:

[0-9^a-zA-Z0-9_]

which is incorrect. It should probably expand to something like:

([0-9] | [^a-zA-Z0-9_])

We should probably handle the general case of any combination of shorthand notations, not just this one. For example, [\s\S] is a common pattern for matching any character, since . doesn't match [\n\r] without DOTALL enabled.

@iOptimizeThings
Author

Good catch. The new commit handles these: negated shorthands now go into a separate bucket and are emitted as an alternation instead of being inlined. Tests added and passing:

  - regexp [\d\W] mixed pos-and-neg class
  - regexp [\s\S] any-char class
  - regexp [a-z\D] literal-and-neg class
  - regexp [\d\w\D] multi-shorthand mixed class
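
For readers following along, the "separate bucket" idea can be sketched roughly as below. This is an illustration under assumptions, not the code in the commit: `expand_bracket` is a hypothetical name, the input is assumed to be the text between `[` and `]`, and an outer negation (`[^...]`) is not handled.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: `body` is the content of a non-negated bracket
// expression, e.g. "\\d\\W" for the regex class [\d\W].
std::string expand_bracket(const std::string & body) {
    std::string positive;              // members that can share one [...]
    std::vector<std::string> negated;  // each negated shorthand gets its own [^...]

    for (size_t i = 0; i < body.size(); i++) {
        if (body[i] == '\\' && i + 1 < body.size()) {
            char next = body[i + 1];
            i++;  // consume the escaped character
            switch (next) {
                case 'd': positive += "0-9"; break;
                case 'w': positive += "a-zA-Z0-9_"; break;
                case 's': positive += " \\t\\n\\r"; break;
                case 'D': negated.push_back("[^0-9]"); break;
                case 'W': negated.push_back("[^a-zA-Z0-9_]"); break;
                case 'S': negated.push_back("[^ \\t\\n\\r]"); break;
                default:  // any other escape is kept verbatim
                    positive += '\\';
                    positive += next;
                    break;
            }
        } else {
            positive += body[i];
        }
    }

    // No negated shorthand: a single class is enough.
    if (negated.empty()) {
        return "[" + positive + "]";
    }

    // Otherwise emit an alternation of the positive class and each negated one.
    std::string out = "(";
    bool first = true;
    if (!positive.empty()) {
        out += "[" + positive + "]";
        first = false;
    }
    for (const auto & n : negated) {
        if (!first) {
            out += " | ";
        }
        out += n;
        first = false;
    }
    return out + ")";
}
```

Under these assumptions, `[\d\W]` becomes `([0-9] | [^a-zA-Z0-9_])`, matching the form suggested in the review, while a class with only positive members such as `[a-z\d]` stays a single `[a-z0-9]`.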

Labels

testing Everything test related

Development

Successfully merging this pull request may close these issues.

Misc. bug: JSON Schema to GBNF grammar fails with tools that use PCRE shorthands

2 participants