feat(function): Implement theta sketch math functions #13844
nmahadevuni wants to merge 1 commit into facebookincubator:main
@nmahadevuni: The theta code adapted from DataSketches is a better fit under the velox/external directory than velox/common. Can you try moving it there?
Thanks @aditi-pandit. I have moved the code to the external directory.
yingsu00
@nmahadevuni Where is the function registration stuff?
I'm working on it in a separate PR. Adding it here would make it even bigger.
The sketch document link is wrong.
@nmahadevuni Is prestodb/presto#20993 the Java PR that describes the sketches?
@nmahadevuni Can we use KLL sketches, which are already present in Velox?
Unfortunately, Iceberg doesn't use KLL sketches. In Apache Iceberg's Puffin statistics format, only Theta sketches (apache-datasketches-theta-v1) and deletion vectors (deletion-vector-v1) are defined as built-in blob types. KLL was discussed but not planned. See: "Request to add KLL Datasketch and hive ColumnStatisticsObj and as standard blob types to puffin file."
@PingLiuPing Thank you. Sorry for the late reply. I have tested by updating my local Corrected the documentation link.
Once this is merged, I will write integration tests in the presto-native-execution module.
Yes, prestodb/presto#20993 is the Java PR that added support for these functions. The write and read support for these statistics was added in the same PR, so the optimizer would see the new statistics if they were written. We can add a dependency too. Do you want to switch to adding a dependency?
By the way, KLL sketches are already implemented in Velox.
 * limitations under the License.
 */

// Adapted from Apache DataSketches
Please add a proper section to NOTICE.txt. You will also have to add the contents of the Apache DataSketches NOTICE.txt there, as required by the ASL 2.0.
@nmahadevuni It is better to add a dependency since the code size is huge. Tim, @aditi-pandit, and I believe we can start by adding this to Presto C++ and then move it to Velox once we have more Iceberg functionality implemented here.
How does adding a dependency to Presto C++ work? We have to add aggregate and scalar functions in Velox that depend on it.
Can't we register these functions in Presto C++?
We would then have to implement the aggregate function in Presto C++ too, and I don't see any prior example of this.
@nmahadevuni: Agreed, there aren't prior pure Presto aggregate functions. What is needed is for you to move your code:
This needs to be a dependency instead. |
Closing this in favor of prestodb/presto#25685.
Required for the new Iceberg statistics introduced in Presto Java. Ported from https://github.com/apache/datasketches-cpp/tree/master/theta.
Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. The benefit of approximate answers is that they are often faster and cheaper to compute than exact results.
Theta sketches enable distinct value counting on datasets and also provide the ability to perform set operations. For more information on Theta sketches, please see the Apache DataSketches Theta sketch documentation.
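To make the idea concrete, here is a minimal toy sketch of the theta (KMV-style) technique. It is an illustration under stated assumptions, not the code in this PR and not the Apache DataSketches implementation: the class name `ToyThetaSketch`, its methods, and the splitmix64-style hash are all hypothetical, and the binary format is not modeled at all. It retains the k smallest hashes, treats the (k+1)-th smallest as theta, estimates the distinct count as retained / theta, and shows a union as one example of a set operation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <limits>
#include <set>

// Toy theta (KMV-style) sketch, for illustration only. Hypothetical API.
class ToyThetaSketch {
 public:
  explicit ToyThetaSketch(std::size_t k) : k_(k) {
    assert(k > 0);
  }

  // Hash the value and try to retain it.
  void update(std::uint64_t value) {
    insertHash(mix(value));
  }

  // Union: keep the smaller theta, then retain only hashes below it.
  void merge(const ToyThetaSketch& other) {
    theta_ = std::min(theta_, other.theta_);
    hashes_.erase(hashes_.lower_bound(theta_), hashes_.end());
    for (std::uint64_t h : other.hashes_) {
      insertHash(h);
    }
  }

  // Distinct-count estimate: retained entries divided by theta as a fraction.
  double estimate() const {
    const double retained = static_cast<double>(hashes_.size());
    if (theta_ == kMaxTheta) {
      return retained; // Exact mode: nothing has been discarded yet.
    }
    return retained / std::ldexp(static_cast<double>(theta_), -64);
  }

 private:
  static constexpr std::uint64_t kMaxTheta =
      std::numeric_limits<std::uint64_t>::max();

  // splitmix64 finalizer, used here as a stand-in hash function.
  static std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  void insertHash(std::uint64_t h) {
    if (h >= theta_) {
      return; // Outside the sampled region.
    }
    hashes_.insert(h);
    if (hashes_.size() > k_) {
      // The largest retained hash becomes the new theta and is dropped.
      auto last = std::prev(hashes_.end());
      theta_ = *last;
      hashes_.erase(last);
    }
  }

  std::size_t k_;
  std::uint64_t theta_ = kMaxTheta;
  std::set<std::uint64_t> hashes_;
};
```

With k = 1024 the relative error of such an estimator is roughly 1/sqrt(k), or about 3%. Below k distinct values the toy stays in exact mode and returns the true count.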
The Presto PR that introduced these changes is prestodb/presto#20993. A brief introduction to these functions follows.
New Sketch Functions
Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary blob representing an Apache DataSketches Theta Sketch, so we implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing statistics.
sketch_theta(<column>) -> varbinary: An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binary format to gain access to a CompactSketch instance and associated methods.
sketch_theta_estimate(<varbinary sketch>) -> double: A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's output.
sketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int): This is another scalar function, but it returns a row type containing more human-readable information about the sketch, such as the theta parameter as well as upper and lower bounds for 1 standard deviation from the estimate.
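As a rough illustration of how the fields of the summary row relate to each other, the sketch below derives them from a retained-entry count and a theta fraction. Everything in it is an assumption made for illustration: `ToySketchSummary` and `summarize` are hypothetical names, and the bounds use a simple normal approximation of binomial sampling (standard deviation of about sqrt(n(1 - theta)) / theta), which may differ from the bounds that the DataSketches library actually computes.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Hypothetical mirror of the row returned by sketch_theta_summary.
struct ToySketchSummary {
  double estimate;
  double theta;
  double upperBoundStd1;
  double lowerBoundStd1;
  std::int32_t retainedEntries;
};

// Derives the summary fields from the retained count and theta in (0, 1].
// Assumption: a normal approximation stands in for the library's bounds.
inline ToySketchSummary summarize(std::int32_t retainedEntries, double theta) {
  assert(retainedEntries >= 0);
  assert(theta > 0.0 && theta <= 1.0);
  const double n = static_cast<double>(retainedEntries);
  const double estimate = n / theta;
  // Each distinct value is retained with probability theta, so n is roughly
  // binomial and the estimate n / theta has this standard deviation.
  const double stdDev = std::sqrt(n * (1.0 - theta)) / theta;
  // The true count can never be below the number of retained entries.
  const double lower = std::max(n, estimate - stdDev);
  return {estimate, theta, estimate + stdDev, lower, retainedEntries};
}
```

When theta is 1.0 the sketch has seen fewer values than its capacity, so the estimate is exact and both bounds collapse onto it. As theta shrinks, the interval widens around the estimate.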