Skip to content

Conversation

@duxiao1212
Copy link
Contributor

@duxiao1212 duxiao1212 commented Feb 24, 2025

Description

Canonicalize json_extract output by replacing the copy operation with writing the JSON to output stream with sorted keys

Motivation and Context

This relates to issue #24563.
Ensuring JSON output canonicalization guarantees the accuracy of JSON comparisons and aligns the behavior with json_parse

Impact

low impact

Test Plan

SET SESSION canonicalized_json_extract = true

  1. json_extract result should be canonicalized
image
  1. If the JSON result is invalid, a PrestoException will be thrown
image
  1. Benchmark test result
    Enabling canonicalized JSON extract (isCanonicalizedJsonExtract = true) causes a 11.7% performance drop compared to disabling it (isCanonicalizedJsonExtract = false), specifically in complex test cases involving nested JSON objects in the JSON path.
Pasted Graphic

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

Presto Changes

  • Add session property canonicalized_json_extract. :pr:24614

@duxiao1212 duxiao1212 requested a review from a team as a code owner February 24, 2025 03:21
@prestodb-ci prestodb-ci added the from:Meta PR from Meta label Feb 24, 2025
@shangm2
Copy link
Contributor

shangm2 commented Feb 24, 2025

LGTM. Maybe also add a verifier run?

@duxiao1212 duxiao1212 force-pushed the master branch 4 times, most recently from d7fc7ce to 01e3b68 Compare February 24, 2025 15:34
@rschlussel
Copy link
Contributor

Also let's gate this change behind a configuration/session property. You can look at usages of other properties in SqlFunctionProperties for examples

@duxiao1212 duxiao1212 changed the title fix Update json_extract to Produce Canonicalized Output [WIP] fix Update json_extract to Produce Canonicalized Output Mar 5, 2025
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

general approach looks good to me!

@duxiao1212 duxiao1212 changed the title [WIP] fix Update json_extract to Produce Canonicalized Output fix Update json_extract to Produce Canonicalized Output Mar 9, 2025
@duxiao1212 duxiao1212 force-pushed the master branch 3 times, most recently from 9038dd7 to a97427a Compare March 9, 2025 01:20
@duxiao1212 duxiao1212 requested a review from rschlussel March 9, 2025 07:15
Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run a verifier comparing this with Prestissimo and check if they now match?

private boolean warnOnPossibleNans;
private boolean legacyCharToVarcharCoercion;
private boolean legacyJsonCast = true;
private boolean canonicalizedJsonExtract = true;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rschlussel, shall we enable canonicalizedJsonExtract by default?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should because it brings parity with presto c++. do you have any performance numbers about the extra cost? You can create a benchmark and get numbers for both on and off (see https://github.com/prestodb/presto/blob/master/presto-main/src/test/java/com/facebook/presto/operator/scalar/BenchmarkJsonToArrayCast.java for an example) and if the regression isn't too significant we can enable by default.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rschlussel, @amitkdutta, please check the benchmark presto-main/src/test/java/com/facebook/presto/operator/scalar/BenchmarkJsonExtract.java and the test result in the summary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

given the 11% regression, let's disable by default and we can enable it ourselves in our configuration. Otherwise, this looks good to go.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @rschlussel, I've updated the default to false. Thanks for your thoughtful review! Can I get a stamp on this?

@duxiao1212 duxiao1212 force-pushed the master branch 4 times, most recently from 138e49a to a80d9af Compare March 13, 2025 00:58
@duxiao1212
Copy link
Contributor Author

Can you run a verifier comparing this with Prestissimo and check if they now match?

@duxiao1212 duxiao1212 closed this Mar 13, 2025
@duxiao1212 duxiao1212 reopened this Mar 13, 2025
@duxiao1212
Copy link
Contributor Author

Can you run a verifier comparing this with Prestissimo and check if they now match?
@bikramSingh91, @kgpai, can you please confirm whether the JSON_EXTRACT function is currently available and functional in prestissimo?

@duxiao1212 duxiao1212 merged commit 06c8e35 into prestodb:master Mar 14, 2025
92 checks passed
@facebook-github-bot
Copy link
Collaborator

@duxiao1212 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@kgpai
Copy link
Contributor

kgpai commented Mar 14, 2025

@duxiao1212 Yes json_extract is available in Prestissimo . Feel free to verify against it.

@prestodb-ci prestodb-ci mentioned this pull request Mar 28, 2025
30 tasks
duxiao1212 added a commit that referenced this pull request Apr 7, 2025
## Description
The original pull request
[#24614](#24614) incorrectly
compares canonicalizedJsonExtract and legacyJsonCast in the equals
function of an object. This issue can be seen in the code
[here](https://github.com/prestodb/presto/pull/24614/files#diff-e921c5d186f9d5daa836bc7330f52caf8c1b84d19cf42288d5a8a7c9a6d2a5d5R156).

As a result, whenever a SQL function requires caching, the cache is
never hit, leading to the creation of new SQL function objects
repeatedly. This behavior eventually causes an OOM error in the JVM
metaspace. and eventually this error led to UER SEV.

After the problematic comparison was updated and tested through shadow
cluster by @rschlussel , we are confident that the issue has been
resolved in this PR. Therefore, we plan to bring back the json
canonicalized extract


## Motivation and Context
Reintroduced json_extract to generate canonicalized output

## Impact
<!---Describe any public API or user-facing feature change or any
performance impact-->
low impact 

## Test Plan
<!---Please fill in how you tested your change-->
N/A

## Contributor checklist

- [x] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [x] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.

## Release Notes
Please follow [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines)
and fill in the release notes below.

```
== NO RELEASE NOTE ==
```
Leziak pushed a commit to Leziak/presto that referenced this pull request Apr 9, 2025
# This is the 1st commit message:

[native] Add a test group for async data cache e2e tests.

# This is the commit message prestodb#2:

[native] Advance velox.

# This is the commit message prestodb#3:

Fix error when describing a nonexistent table

# This is the commit message prestodb#4:

[native] test modification.

# This is the commit message prestodb#5:

Changes to enable ssl/tls in hms
Co-authored-by: Arin Mathew <[email protected]>

Changes to move ssl related properties to seperate class

# This is the commit message prestodb#6:

[native] Add tests for UUID type

# This is the commit message prestodb#7:

Reintroduced json_extract to generate canonicalized output (prestodb#24879)

## Description
The original pull request
[prestodb#24614](prestodb#24614) incorrectly
compares canonicalizedJsonExtract and legacyJsonCast in the equals
function of an object. This issue can be seen in the code
[here](https://github.com/prestodb/presto/pull/24614/files#diff-e921c5d186f9d5daa836bc7330f52caf8c1b84d19cf42288d5a8a7c9a6d2a5d5R156).

As a result, whenever a SQL function requires caching, the cache is
never hit, leading to the creation of new SQL function objects
repeatedly. This behavior eventually causes an OOM error in the JVM
metaspace. and eventually this error led to UER SEV.

After the problematic comparison was updated and tested through shadow
cluster by @rschlussel , we are confident that the issue has been
resolved in this PR. Therefore, we plan to bring back the json
canonicalized extract


## Motivation and Context
Reintroduced json_extract to generate canonicalized output

## Impact
<!---Describe any public API or user-facing feature change or any
performance impact-->
low impact 

## Test Plan
<!---Please fill in how you tested your change-->
N/A

## Contributor checklist

- [x] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [x] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.

## Release Notes
Please follow [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines)
and fill in the release notes below.

```
== NO RELEASE NOTE ==
```
# This is the commit message prestodb#8:

Reuse JoinType in IndexJoinNode

Reuse JoinType instead of creating IndexJoinNode's own. JoinType is
already part of Prestissimo protocol. Adding IndexJoinNode with another
JoinType would cause naming conflict.

# This is the commit message prestodb#9:

Add jaro-winkler implementation, documentation and tests

# This is the commit message prestodb#10:

Add documentation for Iceberg support in PrestoCPP

# This is the commit message prestodb#11:

Add memory pool debug regex

# This is the commit message prestodb#12:

doc on hive csv limitations

# This is the commit message prestodb#13:

Add support for S3 WebIdentity authentication

# This is the commit message prestodb#14:

use com.facebook.airlift:security in presto-hive-metastore

# This is the commit message prestodb#15:

[native] Advance Velox

# This is the commit message prestodb#16:

[native] Add protocol for index lookup join plan

# This is the commit message prestodb#17:

[native] Add sidecar and sidecar plugin documentation
pradeepvaka pushed a commit to pradeepvaka/presto that referenced this pull request Apr 11, 2025
pradeepvaka pushed a commit to pradeepvaka/presto that referenced this pull request Apr 11, 2025
…24879)

## Description
The original pull request
[prestodb#24614](prestodb#24614) incorrectly
compares canonicalizedJsonExtract and legacyJsonCast in the equals
function of an object. This issue can be seen in the code
[here](https://github.com/prestodb/presto/pull/24614/files#diff-e921c5d186f9d5daa836bc7330f52caf8c1b84d19cf42288d5a8a7c9a6d2a5d5R156).

As a result, whenever a SQL function requires caching, the cache is
never hit, leading to the creation of new SQL function objects
repeatedly. This behavior eventually causes an OOM error in the JVM
metaspace. and eventually this error led to UER SEV.

After the problematic comparison was updated and tested through shadow
cluster by @rschlussel , we are confident that the issue has been
resolved in this PR. Therefore, we plan to bring back the json
canonicalized extract


## Motivation and Context
Reintroduced json_extract to generate canonicalized output

## Impact
<!---Describe any public API or user-facing feature change or any
performance impact-->
low impact 

## Test Plan
<!---Please fill in how you tested your change-->
N/A

## Contributor checklist

- [x] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [x] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.

## Release Notes
Please follow [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines)
and fill in the release notes below.

```
== NO RELEASE NOTE ==
```
duxiao1212 added a commit that referenced this pull request Apr 30, 2025
## Description
Previously, in the PR #24614, we
accidentally compared different fields in the SqlFunctionProperties,
which caused an OOM error in the JVM metaspace and eventually led to a
SEV. To prevent this issue from occurring again, we need to ensure that
the SqlFunctionProperties equality function is implemented correctly.
Additionally, we should create a unit test to reproduce the issue and
prevent it from happening in the future.


## Motivation and Context
Prevent incorrect implementation of the equals method in
SqlFunctionProperties

## Impact
low impact

## Test Plan
N/A

## Contributor checklist

- [x] Please make sure your submission complies with our [contributing
guide](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md),
in particular [code
style](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#code-style)
and [commit
standards](https://github.com/prestodb/presto/blob/master/CONTRIBUTING.md#commit-standards).
- [x] PR description addresses the issue accurately and concisely. If
the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with its default value), SQL syntax,
functions, or other functionality.
- [x] If release notes are required, they follow the [release notes
guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.

## Release Notes

```
== NO RELEASE NOTE ==
```
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:Meta PR from Meta

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants