Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add an optional extended parser subclass (YAMLAnchorReplayingFactory) able to inline anchors #502

Conversation

HeikoBoettger-KarlStorz
Copy link
Contributor

This pull request proposes a workaround for the lack of being able validate yaml files using anchors. The solution is to simply inline all anchors.

What's the matter with the upstream YAMLParser?

The YAMLParser currently produces the JsonEvents representing String and sets an additional status that a consumer needs to retrieve calling an additional method. However when trying to perform a YAMLSchema validation the current approach is to use the existing JSONSchema validation code which doesn't know about this extra method. As a result validating a yaml-file using aliases and anchors is failing to validate due to lack of anchor support.

The backlog contains already some issue talking about the need of a bigger refactoring to fully support anchors and aliases.

What's is the workaround?

This pull request adds a YAMLParserExt class (I didn't had a good idea how to name it) and a factory to make use of it. When the events are requested the implementation remembers the events produced by yaml content that has an anchor. Later when a alias is found and the anchor exists, it simply returns the same events that were part of the anchored content. As a result the schema validator will see the document as if the anchored content was inlined.

What's the risk pulling this in?

Repeating the events might have some unknown side-effects such as that a document will appear larger in terms of produced events than it actually is or code expecting the events to be unique might not work. I also haven't looked into whether there might be issue determining the position (line & column number) of whatever produced the event in the original document. However it works for use to validate our yaml-files and using a separate class to implement this option doesn't break any existing code.

Some thoughts:

  • I haven't checked whether there are solution using $ref from JSON but I can imaging the yaml aliases can be positioned more flexible.
  • It might be possible to solve this differently by using a smaller ParserImpl interface (a EventSource-interface).
  • The event based approach is not ideal when it comes to handling aliases since there is no DOM to just point to past data structures.
  • Parsing huge yaml files with big anchored content might require a lot of memory

In case you consider pulling this in, please let me know where I can find an example of a unit test, I am more than happy to provide one. The approach I have taken in our internal project was simply running the validation across some yaml documents containing anchors and checking the validation results.

@cowtowncoder
Copy link
Member

Ok, this is interesting but I need to digest it for a bit to know what I think. :)

As to unit tests, they are under yaml/src/test/java/....

@cowtowncoder
Copy link
Member

Ok, no big objections, but I think we should figure out better name -- "Ext" would be ok for general-purpose extension, but here it is quite specific. I don't have an immediate good name suggestion either, but let's think of something.

@HeikoBoettger-KarlStorz
Copy link
Contributor Author

Here are some ideas for a better name:

  • YAMLInlineAnchorFactory/Parser

  • YAMLInliningFactory/Parser

  • YAMLInlineFactory/Parser

  • YAMLExpandAliasesFactory/Parser

  • YAMLExpandingFactory/Parser

  • YAMLExpandReferencesFactory/Parser

  • YAMLExpandRefsFactory/Parser

  • YAMLFollowRefsFactory/Parser

I personally like the wording "expand"/"follow" more since the anchored nodes are not inlined in the sense that they are recreated as duplicated objects and changing any potential line number information. Using "Refs" instead of "Anchors"/"Aliases" allows for future changes in terms of supporting whatever will be the replacement for Merge Keys (only in YAML 1.1 which is deprecated) in future YAML versions.

@yawkat
Copy link
Member

yawkat commented Dec 2, 2024

I think this also needs some basic recursion depth / event list size limits, similar to the core decoders. While this feature is opt-in, any CVE reports will still have knock-on effects on other yaml users, and possibly other dataformats-text users.

@yawkat
Copy link
Member

yawkat commented Dec 2, 2024

How about YAMLAnchorReplayingParser

@HeikoBoettger-KarlStorz
Copy link
Contributor Author

@yawkat Do you use any coding style checker or specific IDE in the project that helps being compliant with the coding style?

@HeikoBoettger-KarlStorz
Copy link
Contributor Author

@yawkat I added all the suggested changes and renamed. Do you prefer semi-linear history? If so I will rebase the branch.

@HeikoBoettger-KarlStorz
Copy link
Contributor Author

@yawkat I added the upper limits as suggested but I wonder whether this should be better made configurable via some property set through on the factory.

@cowtowncoder cowtowncoder added the cla-needed PR looks good (although may also require code review), but CLA needed from submitter label Dec 7, 2024
@cowtowncoder
Copy link
Member

Looks like code doesn't quite compile (refactoring?).

@HeikoBoettger-KarlStorz HeikoBoettger-KarlStorz force-pushed the improved-yaml-anchors-support branch from d6686c9 to ffdafed Compare December 9, 2024 10:07
@HeikoBoettger-KarlStorz
Copy link
Contributor Author

@cowtowncoder Fixed it and added a unittest similar to the one existing for the YAMLParser

@cowtowncoder
Copy link
Member

Thank you @HeikoBoettger-KarlStorz. One last thing (aside from my needing to re-review this) is CLA. Unless we've asked for it earlier (it only needs to be sent once & it's good for all other contributions), it's here:

https://github.com/FasterXML/jackson/blob/master/contributor-agreement.pdf

and is usually easiest to do by printing, filling & signing, scanning/taking photo, emailing to cla at fasterxml dot com.

Once that is received I can proceed with merging, assuming review goes well.

@HeikoBoettger-KarlStorz
Copy link
Contributor Author

@cowtowncoder Since there were two contributor agreement file, I used the one for corporate and it was sent to clas at fasterxml dot com on in 03.10.2024 as written in https://github.com/FasterXML/jackson/blob/master/contributor-agreement-corporate.txt. Is that the wrong mail address? If I need to get the other pdf signed by somebody in the company, this will probably take some time. I can resent the document to the mail address you have mentioned.

@yawkat
Copy link
Member

yawkat commented Dec 10, 2024

The corporate CLA is correct. You do not need to get the individual contributor CLA signed by your company.

@cowtowncoder
Copy link
Member

@HeikoBoettger-KarlStorz as per @yawkat's comment sending either one is enough.

However, need to send cla NOT clas (seems to be common misunderstanding). Since I don't see the CLA could you try re-sending to [email protected]?
Thanks!

yawkat added a commit to yawkat/jackson that referenced this pull request Dec 11, 2024
cowtowncoder pushed a commit to FasterXML/jackson that referenced this pull request Dec 11, 2024
@cowtowncoder cowtowncoder removed the cla-needed PR looks good (although may also require code review), but CLA needed from submitter label Dec 12, 2024
@cowtowncoder
Copy link
Member

CLA received. Also, ref to [email protected] fixed -- but I also did create alias so both cla and clas should actually work.

Will need to review soon, bit overloaded but hoping to get back to this soon.

@cowtowncoder
Copy link
Member

This looks good to me, but since I am leaving for an extended weekend, will not merge yet (partly since merge to master/3.0 may take some time) -- but will do so early next week.

@cowtowncoder cowtowncoder changed the title Improved yaml parsing by adding an extended parser subclass able to inline anchors Add an optional extended parser subclass able to inline anchors Dec 21, 2024
@cowtowncoder cowtowncoder changed the title Add an optional extended parser subclass able to inline anchors Add an optional extended parser subclass (YAMLAnchorReplayingFactory) able to inline anchors Dec 21, 2024
@cowtowncoder cowtowncoder merged commit 95a4300 into FasterXML:2.19 Dec 21, 2024
4 checks passed
@cowtowncoder cowtowncoder added the 2.19 Fix or feature targeted at 2.19 release label Dec 21, 2024
@cowtowncoder cowtowncoder added this to the 2.19.0 milestone Dec 21, 2024
@cowtowncoder
Copy link
Member

Merged in 2.19 for 2.19.0.

NOTE: did NOT include in master/3.0 since merging was very difficult: partly due to Jackson side changes (which are sizable), but also since 3.0 uses snakeyaml-engine implementation instead of older snakeyaml.

Porting this over later on should not be much more work, just noting for now.

If anyone wants to I'd of course be happy to help with a PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
2.19 Fix or feature targeted at 2.19 release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants