Support JSON_EXISTS, JSON_VALUE and JSON_QUERY functions#9831

Merged
kasiafi merged 9 commits into trinodb:master from
kasiafi:286jsonFeature
May 27, 2022

Member

@kasiafi kasiafi commented Nov 2, 2021

No description provided.

@cla-bot cla-bot bot added the cla-signed label Nov 2, 2021
@kasiafi kasiafi force-pushed the 286jsonFeature branch 6 times, most recently from fe236ff to 70e48c9 Compare November 8, 2021 18:54
@kasiafi kasiafi force-pushed the 286jsonFeature branch 2 times, most recently from 0acac64 to e1f3ccd Compare November 12, 2021 17:19
@kasiafi kasiafi changed the title Support JSON_EXISTS, JSON_VALUE and JSON_QUERY functions in grammar and AST Support JSON_EXISTS, JSON_VALUE and JSON_QUERY functions in grammar, AST and Analyzer Nov 29, 2021
@kasiafi kasiafi force-pushed the 286jsonFeature branch 5 times, most recently from 5fa3108 to 1866670 Compare November 30, 2021 13:23
@kasiafi kasiafi force-pushed the 286jsonFeature branch 7 times, most recently from a53e24a to 5b355b5 Compare December 14, 2021 14:42
@kasiafi kasiafi force-pushed the 286jsonFeature branch 3 times, most recently from 4a4b801 to a17b750 Compare December 16, 2021 07:45
@kasiafi kasiafi force-pushed the 286jsonFeature branch 4 times, most recently from b1c768f to b5d5f26 Compare December 30, 2021 09:43
Member

In that case, the rewriter should create the JsonExists node with the same location as the original one, and we should remove the location-less constructor. Same for the other ones.
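
A minimal sketch of this suggestion, assuming hypothetical stand-in classes (`NodeLocation`, `JsonExistsSketch`, `RewriterSketch`) rather than the actual AST classes in this PR. With no location-less constructor available, the rewriter has to carry the original location over explicitly:

```java
import java.util.Optional;

// Hypothetical stand-ins for illustration; not the real AST classes.
final class NodeLocation
{
    final int line;
    final int column;

    NodeLocation(int line, int column)
    {
        this.line = line;
        this.column = column;
    }
}

final class JsonExistsSketch
{
    private final Optional<NodeLocation> location;
    private final String path;

    // The only constructor takes a location, so a rewritten node cannot silently lose it.
    JsonExistsSketch(Optional<NodeLocation> location, String path)
    {
        this.location = location;
        this.path = path;
    }

    Optional<NodeLocation> getLocation()
    {
        return location;
    }

    String getPath()
    {
        return path;
    }
}

final class RewriterSketch
{
    // Rebuild the node with a new path while reusing the original node's location.
    static JsonExistsSketch rewrite(JsonExistsSketch original, String newPath)
    {
        return new JsonExistsSketch(original.getLocation(), newPath);
    }
}
```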

Member

I'd call this serialize

Member

What's this class for? The name is too vague and the method below is not very clear. I'm thinking of whether it should be renamed or placed elsewhere.

Member Author

I added a javadoc, and renamed the class to ParameterUtil.

Member

When would this happen?

Member Author

When the function is not registered, which shouldn't happen. Should I remove the try-catch?

Member

Call .build() here directly and declare the variable as List<Type>. No need to have an explicit variable to hold the intermediate.

Comment on lines 464 to 466
Member

Remove or add a TODO comment.

Member

This should not be in the "planner" package, but in the io.trino.json package.

Member

This seems backward. The fixed part should be the path (i.e., the "program" or "IR"), the functions, etc. The variable part should be the json over which the path is being evaluated.

That will allow caching resolutions and other similar decisions and attach them to the corresponding IR nodes.
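
A rough sketch of the shape being suggested here, under simplifying assumptions: `PathEvaluatorSketch` is a hypothetical name, the path IR is reduced to a chain of member accessors, and nested `Map`s stand in for JSON objects. The evaluator is constructed once from the path and then applied to many inputs:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the path (the fixed "program") is bound at construction time;
// the JSON input (the variable part) is supplied per evaluation.
final class PathEvaluatorSketch
{
    private final List<String> keys; // stand-in for the path IR: a chain of member accessors

    PathEvaluatorSketch(List<String> keys)
    {
        this.keys = List.copyOf(keys);
    }

    // Evaluate the fixed path over one input. Returns null when the path does not match.
    Object evaluate(Object json)
    {
        Object current = json;
        for (String key : keys) {
            if (!(current instanceof Map)) {
                return null;
            }
            current = ((Map<?, ?>) current).get(key);
        }
        return current;
    }
}
```

Because the evaluator instance outlives a single input, anything resolved while walking the path can be stored on that instance and reused for later inputs.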

Member Author

refactored

Member

Where is the operator being cached? Ideally, it should be attached to the node (via a map of node->operator) to avoid having to re-resolve it every time.

Member Author

The operators are cached in CachingResolver.

@kasiafi kasiafi force-pushed the 286jsonFeature branch 2 times, most recently from b5341a7 to 04d22e4 Compare May 6, 2022 13:23
Member Author

kasiafi commented May 6, 2022

@martint I refactored operator caching according to our discussion. Please take a look at the last commit.

@kasiafi kasiafi force-pushed the 286jsonFeature branch 3 times, most recently from 32c0f72 to 4838a88 Compare May 19, 2022 13:09
Member

It's not obvious from looking at this why it's safe to skip re-initializing the evaluator if it's already set. Technically, and from the JsonPathInvocationContext's perspective, the path could be different. The fact that it isn't is a side effect of how the JSON constructs in SQL work (the path is constant) and how they are mapped to functions. But that knowledge is too far removed from this site.

I don't think this class is needed. Instead, the functions should initialize and hold on to the JsonPathEvaluator instance upon first invocation (i.e., wherever they are calling initializeIfNecessary)

Member Author

I think that JsonPathInvocationContext is needed. It is passed through the instanceFactory, so a single instance of JsonPathInvocationContext persists across the multiple rows processed by the JSON function.

We can't skip it and just pass JsonPathEvaluator as the context object, because we can create the JsonPathEvaluator only after the function is invoked with certain arguments (including path). The context object already exists at that moment (it comes as one of the arguments to the function).

Member

Yes, you're right. This is the state object for the (stateful) function. However, it could be changed to be just that -- a state object, and leave all logic to be done in the caller.
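
One way to read this suggestion, as a sketch only: the state object holds a field and nothing else, and the function decides when to initialize it. `InvocationStateSketch` and `JsonFunctionSketch` are hypothetical names, and a `Function<String, String>` stands in for the real evaluator:

```java
import java.util.function.Function;

// Hypothetical names throughout; this is not the PR's code.
final class InvocationStateSketch
{
    // Pure state: no initialization logic lives here.
    Function<String, String> evaluator;
}

final class JsonFunctionSketch
{
    static int evaluatorsCreated; // counts initializations, for illustration only

    // Invoked once per row with the same state object and, by SQL semantics, the same constant path.
    static String invoke(InvocationStateSketch state, String path, String input)
    {
        if (state.evaluator == null) {
            // The initialize-once decision sits in the caller, next to the
            // knowledge that the path is constant for this call site.
            evaluatorsCreated++;
            state.evaluator = json -> path + " over " + json;
        }
        return state.evaluator.apply(input);
    }
}
```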

kasiafi added 8 commits May 26, 2022 16:48
Let `handleSubqueries` find subqueries nested under `Expression`
nodes, linked via children not being `Expression`.

`json_extract` and `json_extract_scalar` are old JSON-processing
functions which have very limited capabilities, and are not
compliant with the spec. However, they have a simple, lightweight
implementation, optimized for the use case.

`json_query` and `json_value` are new spec-compliant JSON-processing
functions, which support the complete specification for JSON path.
Their implementation is much more complicated, which results both from
the scope of the supported feature and from the "streaming" semantics.
These benchmarks compare both implementations applied to the
common use case.

Benchmark json_extract vs json_query:
Benchmark                                            (depth)  Mode  Cnt     Score    Error  Units
BenchmarkJsonFunctions.benchmarkJsonExtractFunction        1  avgt   15   813.030 ± 12.263  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractFunction        3  avgt   15   939.202 ± 94.621  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractFunction        6  avgt   15  1004.352 ± 12.945  ns/op
BenchmarkJsonFunctions.benchmarkJsonQueryFunction          1  avgt   15  1136.371 ± 17.502  ns/op
BenchmarkJsonFunctions.benchmarkJsonQueryFunction          3  avgt   15  1399.780 ± 27.967  ns/op
BenchmarkJsonFunctions.benchmarkJsonQueryFunction          6  avgt   15  1666.810 ± 29.572  ns/op

Benchmark json_extract_scalar vs json_value:
Benchmark                                                  (depth)  Mode  Cnt     Score    Error  Units
BenchmarkJsonFunctions.benchmarkJsonExtractScalarFunction        1  avgt   15   644.762 ±  9.195  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractScalarFunction        3  avgt   15   720.244 ± 13.288  ns/op
BenchmarkJsonFunctions.benchmarkJsonExtractScalarFunction       10  avgt   15   928.069 ± 17.623  ns/op
BenchmarkJsonFunctions.benchmarkJsonValueFunction                1  avgt   15   811.158 ± 14.694  ns/op
BenchmarkJsonFunctions.benchmarkJsonValueFunction                3  avgt   15  1042.795 ± 40.555  ns/op
BenchmarkJsonFunctions.benchmarkJsonValueFunction               10  avgt   15  1663.328 ± 27.320  ns/op

Member

Why not flatten the caches and use a composite key made of IrPathNode ref, left type and right type?

Member Author

Then one node could take too much space.

Member

That should be ok, especially if there's a case where a single node sees many combinations of types.

It would also make the code simpler. If we later find that this is not sufficient, we can think of better caching and eviction strategies, or JIT-style codegen/execution.

Member Author

The cache is now flattened.
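
For readers following the thread, a flattened cache with a composite key could look roughly like the sketch below. All names are hypothetical, plain `String`s stand in for types and for the resolved operator, and the node is compared by reference as the review comment suggests; the code actually merged may differ.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a flattened operator cache; not the code from this PR.
final class OperatorCacheSketch
{
    // Composite key: node identity plus the two operand types.
    private static final class Key
    {
        private final Object node;
        private final String leftType;
        private final String rightType;

        Key(Object node, String leftType, String rightType)
        {
            this.node = node;
            this.leftType = leftType;
            this.rightType = rightType;
        }

        @Override
        public boolean equals(Object other)
        {
            if (!(other instanceof Key)) {
                return false;
            }
            Key that = (Key) other;
            // Nodes are compared by reference, types by value.
            return node == that.node && leftType.equals(that.leftType) && rightType.equals(that.rightType);
        }

        @Override
        public int hashCode()
        {
            return 31 * (31 * System.identityHashCode(node) + leftType.hashCode()) + rightType.hashCode();
        }
    }

    private final Map<Key, String> operators = new HashMap<>();

    // Resolve once per (node, leftType, rightType); later lookups reuse the cached result.
    String resolve(Object node, String leftType, String rightType, Supplier<String> resolver)
    {
        return operators.computeIfAbsent(new Key(node, leftType, rightType), key -> resolver.get());
    }

    int size()
    {
        return operators.size();
    }
}
```

A single map keeps lookup to one step, at the cost of one entry per type combination a node actually sees, which is the trade-off discussed above.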

@kasiafi kasiafi mentioned this pull request May 27, 2022
Member

findepi commented May 28, 2022

Nice!
It took a while, so I'm all the happier that it landed. 🎉

Development

Successfully merging this pull request may close these issues.

Add support for json_path wildcard