Improve the coverage of the JSON Schema specification #342

Merged
merged 6 commits into from
Nov 12, 2023

Conversation

rlouf
Member

@rlouf rlouf commented Nov 8, 2023

In this PR I simplify the codebase by using the referencing library to dereference the specification. We also check the validity of the schema before compiling it. We add coverage for the following:

  • Schemas where minLength and maxLength are both defined;
  • The pattern keyword, which constrains a string to match a regular expression;
  • oneOf, allOf and anyOf, which are now all properly covered;
  • type given as an array of values;
  • Arrays without a specified item type;
  • Enums without a specified type.
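
To illustrate the string and enum cases above, here is a minimal sketch of how such keywords could be turned into regular expressions. It is not the code from this PR: the helper names `string_schema_to_regex` and `enum_to_regex` are hypothetical, and the sketch assumes strings that contain no escaped quotes.

```python
import json
import re


def string_schema_to_regex(schema: dict) -> str:
    """Hypothetical helper: regex for a JSON string value (no escaped quotes)."""
    if "pattern" in schema:
        pattern = schema["pattern"]
        # Drop the anchors so the pattern can sit between the JSON quotes.
        if pattern.startswith("^"):
            pattern = pattern[1:]
        if pattern.endswith("$"):
            pattern = pattern[:-1]
        return f'"{pattern}"'
    if "minLength" in schema or "maxLength" in schema:
        min_len = schema.get("minLength", 0)
        max_len = schema.get("maxLength", "")  # empty means "no upper bound"
        return f'"[^"]{{{min_len},{max_len}}}"'
    return '"[^"]*"'


def enum_to_regex(values: list) -> str:
    """Hypothetical helper: alternation over the JSON encoding of each value."""
    return "(" + "|".join(re.escape(json.dumps(v)) for v in values) + ")"
```

With both bounds set, `string_schema_to_regex({"minLength": 2, "maxLength": 4})` yields `"[^"]{2,4}"`, which needs no lookahead. Serializing each enum value with `json.dumps` is what lets an enum without a declared type mix strings, numbers and `null`.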

TODO

  • More testing
  • Handle optional fields (those not specified in "required")
  • minItems for arrays
  • maxItems for arrays

Note

We cannot handle minItems and maxItems because of interegular's lack of support for lookaheads:

 interegular.patterns.Unsupported: Group can not have lookbacks/lookaheads that go beyond the group bounds.

Related issues

#330 #215

We use the `referencing` library to dereference fields in the JSON
Schema, which simplifies the codebase considerably and prevents reference
errors. We also support the combination of `minLength` and `maxLength`,
as well as the `pattern` keyword.
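
For readers unfamiliar with dereferencing, the sketch below shows what the step amounts to. It is a hand-rolled, standard-library stand-in and not the `referencing` API or the code in this PR: the function names are hypothetical, only local `#/...` references are handled, and recursive schemas would loop forever.

```python
def resolve_pointer(root, ref: str):
    """Follow a local JSON pointer such as '#/$defs/name' inside `root`."""
    if not ref.startswith("#/"):
        raise ValueError(f"only local references are supported, got {ref!r}")
    node = root
    for part in ref[2:].split("/"):
        # Undo JSON pointer escaping (~1 is '/', ~0 is '~').
        part = part.replace("~1", "/").replace("~0", "~")
        node = node[int(part)] if isinstance(node, list) else node[part]
    return node


def dereference(node, root):
    """Return a copy of `node` with every local '$ref' replaced by its target."""
    if isinstance(node, dict):
        if "$ref" in node:
            return dereference(resolve_pointer(root, node["$ref"]), root)
        return {key: dereference(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [dereference(value, root) for value in node]
    return node
```

Calling `dereference(schema, schema)` on a schema whose properties point into `$defs` returns a schema with the referenced definitions inlined, so the later compilation step never has to look a reference up itself.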
@rlouf rlouf marked this pull request as ready for review November 9, 2023 15:33
@janvandervegt-db

Regarding the minItems and maxItems lookahead issue: would it be possible to support the case where minItems and maxItems have the same value? You could then generate a regular expression that "hardcodes" the number of items.

@rlouf
Member Author

rlouf commented Nov 10, 2023

Regarding the minItems and maxItems lookahead issue: would it be possible to support the case where minItems and maxItems have the same value? You could then generate a regular expression that "hardcodes" the number of items.

We could, would that be useful?

@janvandervegt-db

For our use case that would be super helpful. We are using LLMs to synthesize a lot of structured data. Frequently a single input maps to a list of different outputs; the exact number is not critical, but there need to be at least a few. Generating them in the same call leads to higher diversity in the outputs and much simpler pipelines.

Two alternatives for us would be:

  • Manually provide hardcoded properties for each of the entries (item1, item2, ...)
  • Manipulate the JSON schema generated by Pydantic to dynamically do this before passing it to outlines and then include custom parsing logic

Both of these options are quite ugly compared to restricting the guided generation to a fixed number of items in the array.

@rlouf
Member Author

rlouf commented Nov 10, 2023

Noted, I added the feature. Don't hesitate to open an issue if there is something missing.
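
As a rough illustration of the "hardcoded" approach discussed above, a regular expression for an array with a fixed number of items can be built with plain repetition and no lookahead. This is a sketch under assumptions, not the implementation merged here: `fixed_length_array_regex` is a hypothetical name, and whitespace between items is ignored.

```python
import re


def fixed_length_array_regex(item_regex: str, num_items: int) -> str:
    """Hypothetical helper: regex for a JSON array of exactly `num_items` items."""
    if num_items == 0:
        return r"\[\]"
    # Group the item pattern so alternations inside it stay self-contained.
    item = f"(?:{item_regex})"
    # First item, then exactly (num_items - 1) repetitions of ",item".
    rest = rf"(,{item}){{{num_items - 1}}}"
    return rf"\[{item}{rest}\]"


pattern = fixed_length_array_regex(r"\d+", 3)
print(re.fullmatch(pattern, "[1,2,3]") is not None)
print(re.fullmatch(pattern, "[1,2]") is not None)
```

The two `print` calls check a three-item array and a two-item array against a pattern built for exactly three integer items.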

@brandonwillard brandonwillard merged commit e73d7fd into dottxt-ai:main Nov 12, 2023
5 checks passed
@rlouf rlouf deleted the json-schema-support branch January 12, 2024 07:03