Skip to content

feat(function): Implement theta sketch math functions#13844

Closed
nmahadevuni wants to merge 1 commit intofacebookincubator:mainfrom
nmahadevuni:theta_sketch_math_funcs
Closed

feat(function): Implement theta sketch math functions#13844
nmahadevuni wants to merge 1 commit intofacebookincubator:mainfrom
nmahadevuni:theta_sketch_math_funcs

Conversation

@nmahadevuni
Copy link
Collaborator

@nmahadevuni nmahadevuni commented Jun 23, 2025

Required for the new Iceberg statistics introduced in Presto Java. Ported from https://github.com/apache/datasketches-cpp/tree/master/theta.

Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. The benefit of approximate answers is that they are often faster and more efficient to compute than functions which result in full accuracy.

Theta sketches enable distinct value counting on datasets and also provide the ability to perform set operations. For more information on Theta sketches, please see the Apache Datasketches Theta sketch documentation

The Presto PR which introduced these changes is prestodb/presto#20993. A brief intro to these functions

New Sketch Functions

Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary
blob representing an Apache Datasketches Theta Sketch, so we implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing statistics.

  1. sketch_theta(<column>) -> varbinary: An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binary
    format to gain access to a CompactSketch instance and associated methods.

  2. sketch_theta_estimate(<varbinary sketch>) -> double: A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's output

  3. sketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int): This is another scalar function, but returns a row type containing
    more human-readable information about the sketch such as the theta parameter as well as upper and lower bounds
    for 1 standard deviation from the estimate

@netlify
Copy link

netlify bot commented Jun 23, 2025

Deploy Preview for meta-velox canceled.

Name Link
🔨 Latest commit b8ba93b
🔍 Latest deploy log https://app.netlify.com/projects/meta-velox/deploys/685cdb337deca20008e97288

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 23, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_math_funcs branch from d134ed2 to 233237d Compare June 23, 2025 11:33
@nmahadevuni nmahadevuni changed the title Implement theta sketch math functions feat(function): Implement theta sketch math functions Jun 23, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_math_funcs branch from 233237d to 7610b92 Compare June 23, 2025 12:21
@nmahadevuni nmahadevuni changed the title feat(function): Implement theta sketch math functions [WIP]: feat(function): Implement theta sketch math functions Jun 23, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_math_funcs branch 2 times, most recently from 678c42d to 2ffde26 Compare June 23, 2025 12:39
@aditi-pandit
Copy link
Collaborator

@nmahadevuni : The theta code adapted from Data sketches is better fit under velox/external directory than velox/common. Can you try moving it there ?

@nmahadevuni nmahadevuni force-pushed the theta_sketch_math_funcs branch from 2ffde26 to 1117707 Compare June 23, 2025 19:37
@nmahadevuni nmahadevuni requested a review from aditi-pandit June 23, 2025 19:38
@nmahadevuni
Copy link
Collaborator Author

Thanks @aditi-pandit. I have moved the code to external directory.

@nmahadevuni nmahadevuni changed the title [WIP]: feat(function): Implement theta sketch math functions feat(function): Implement theta sketch math functions Jun 24, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_math_funcs branch from 1117707 to b8ba93b Compare June 26, 2025 05:31
Copy link
Contributor

@yingsu00 yingsu00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nmahadevuni Where is the function registration stuff?

@nmahadevuni
Copy link
Collaborator Author

@nmahadevuni Where is the function registration stuff?

I'm working on it in a separate PR. Adding it here would make it even bigger.

@PingLiuPing
Copy link
Collaborator

The sketch document link is wrong.
Besides the unit tests, how did you verify this works as expected when integrate with Prestissimo?

@majetideepak
Copy link
Collaborator

majetideepak commented Jul 14, 2025

@nmahadevuni Is this the Java PR that describes the sketches prestodb/presto#20993?
Can you update the PR description about the usage of these sketches? Does Presto optimizer currently use these sketches? How can we test these? Will Gluten also benefit from this?
It looks like we will need other functions from the library in the future as well. So does it make sense to add a dependency instead?

@majetideepak
Copy link
Collaborator

@nmahadevuni Can we use KLL sketches, which are already present in Velox?

@yingsu00
Copy link
Contributor

@nmahadevuni Can we use KLL sketches, which are already present in Velox?

Unfortunately Iceberg doesn't use KLL sketch. In Apache Iceberg’s Puffin statistics format, only Theta sketches (apache‑datasketches‑theta‑v1) and deletion vectors (deletion‑vector‑v1) are defined as built‑in blob types. KLL was discussed but not planned. see: Request to add KLL Datasketch and hive ColumnStatisticsObj and as standard blob types to puffin file.

@nmahadevuni
Copy link
Collaborator Author

nmahadevuni commented Jul 18, 2025

The sketch document link is wrong. Besides the unit tests, how did you verify this works as expected when integrate with Prestissimo?

@PingLiuPing Thank you. Sorry for the late reply. I have tested by updating my local presto-native-execution/velox and making sure the test output from Java functions match with Velox functions with a few known exceptions which were discussed and confirmed with datasketch project architect in apache/datasketches-cpp#460

Corrected the documentation link.

@nmahadevuni
Copy link
Collaborator Author

Once this is merged, I will write integration tests in presto-native-execution module

@nmahadevuni
Copy link
Collaborator Author

nmahadevuni commented Jul 18, 2025

@nmahadevuni Is this the Java PR that describes the sketches prestodb/presto#20993? Can you update the PR description about the usage of these sketches? Does Presto optimizer currently use these sketches? How can we test these? Will Gluten also benefit from this? It looks like we will need other functions from the library in the future as well. So does it make sense to add a dependency instead?

Yes, prestodb/presto#20993 is the Java PR that added support for these functions. The write and read support for these statistics was added in the same PR, so optimizer would see the new statistics if they were written.

We can add a dependency too. Do you want to switch to adding a dependency?

@nmahadevuni
Copy link
Collaborator Author

btw, KLL sketches are already implemented in Velox. velox/functions/lib/KllSketch.cpp

* limitations under the License.
*/

// Adapted from Apache DataSketches
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a proper section to the NOTICE.txt, you will have to add the contents of the Apache DataSketches NOTICE.txt there too, as is required by the ASL 2.0.

@majetideepak
Copy link
Collaborator

@nmahadevuni It is better to add a dependency since the code size is huge. Tim, @aditi-pandit, and I believe we can start by adding this to Presto C++ and then moving to Velox once we have more Iceberg functionality implemented here.

@nmahadevuni
Copy link
Collaborator Author

@nmahadevuni It is better to add a dependency since the code size is huge. Tim, @aditi-pandit, and I believe we can start by adding this to Presto C++ and then moving to Velox once we have more Iceberg functionality implemented here.

How does it work adding a dependency to Presto C++? We have to add aggregate and scalar functions in Velox which depends on this.

@majetideepak
Copy link
Collaborator

We have to add aggregate and scalar functions in Velox which depends on this.

Can't we register these functions in Presto C++?

@nmahadevuni
Copy link
Collaborator Author

We have to add aggregate and scalar functions in Velox which depends on this.

Can't we register these functions in Presto C++?

We should implement the aggregate function also in PrestoC++, I don't see any prior example of this.

@aditi-pandit
Copy link
Collaborator

aditi-pandit commented Jul 24, 2025

We have to add aggregate and scalar functions in Velox which depends on this.

Can't we register these functions in Presto C++?

We should implement the aggregate function also in PrestoC++, I don't see any prior example of this.

@nmahadevuni : Agree there aren't prior pure Presto aggregate functions. What is needed is for you to move your code :
i) External theta sketch library code to a folder in https://github.com/prestodb/presto/tree/master/presto-native-execution/presto_cpp/external
ii) Add function code in a new folder 'functions' in https://github.com/prestodb/presto/tree/master/presto-native-execution/presto_cpp/main
iii) Add function registration in https://github.com/prestodb/presto/blob/master/presto-native-execution/presto_cpp/main/PrestoServer.cpp#L1320
iv) Write e2e test SQL for the functions in some file in https://github.com/prestodb/presto/tree/master/presto-native-execution/presto_cpp/main

@majetideepak
Copy link
Collaborator

External theta sketch library code to a folder in

This needs to be a dependency instead.

@nmahadevuni
Copy link
Collaborator Author

Closing this for prestodb/presto#25685

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants