Apache DataSketches plugin for Trino#6643
Apache DataSketches plugin for Trino#6643ShashwatArghode wants to merge 1 commit intotrinodb:masterfrom
Conversation
Praveen2112
left a comment
There was a problem hiding this comment.
Thanks for working on this. Can we have some tests for like validating the result when we hit it to Trino. And a docs describing the functions would also be helpful.
...ino-datasketches/src/main/java/io/trino/plugin/datasketches/state/SketchStateSerializer.java
Outdated
Show resolved
Hide resolved
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Estimate.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Since they are more details to capture, can we move it to a new file ? WDYT ?
There was a problem hiding this comment.
I think keeping the structure consistent with other state factories would make more sense. Ex. HyperLogLog, LongApproximateMostFrequent, NumericHistogram etc.
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Union.java
Outdated
Show resolved
Hide resolved
|
Please change the commit message to Trino |
b1d0670 to
6fec5ed
Compare
6fec5ed to
52fc9b9
Compare
636a2f5 to
f814770
Compare
|
Apologies for the delay in getting back to this. I have addressed all the comments and added the doc as well. Thanks for reviewing the PR, do let me know if it needs more changes. |
There was a problem hiding this comment.
I think this plugin is set of functions as Teradata functions and Machine learning functions. I would move to functions directory.
There was a problem hiding this comment.
I have made the suggested changes to the doc. Please let me know if the there is any issue with sphinx syntax or anything needs to be changed
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/SketchUDFPlugin.java
Outdated
Show resolved
Hide resolved
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Estimate.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Could you also add tests using Trino queries? You can refer to TestMLQueries or any other function tests.
There was a problem hiding this comment.
Will look into this as soon as I get some time.
There was a problem hiding this comment.
Looked into TestMLQueries and adding test cases using Trino queries, but, for that, we will either need to support the creation of datasketches in Presto or register Hive UDFs for test cases for inserting Sketch in the database in binary format. We can add this when we add support for the creation of Sketches in Presto.
There was a problem hiding this comment.
As we talked in Slack, I assume it's testable without such supports. We can prepare Sketches data in Java library and then cast to varbinary format at this time.
There was a problem hiding this comment.
BytesWritable from hadoop turned out to be the key to insert the sketches in binary format in the database. Added the test case.
plugin/trino-datasketches/src/main/java/io/trino/plugin/datasketches/theta/Union.java
Outdated
Show resolved
Hide resolved
f814770 to
8a09d22
Compare
127131b to
a638a5e
Compare
a638a5e to
b772682
Compare
|
@ebyhr, Do let me know if any more changes are needed to this. |
|
👋 @ShashwatArghode - this PR has become inactive. If you're still interested in working on it, please let us know, and we can try to get reviewers to help with that. We're working on closing out old and inactive PRs, so if you're too busy or this has too many merge conflicts to be worth picking back up, we'll be making another pass to close it out in a few weeks. |
|
Hi @bitsondatadev, |
| public interface SketchState | ||
| extends AccumulatorState | ||
| { | ||
| Slice getSketch(); |
There was a problem hiding this comment.
#14290 added a Theta sketch aggregation function: IcebergThetaSketchForStats
I forgot about this PR (and it is still waiting for "syntax review"), but anyway that function is for Iceberg plugin's internal use. Also this happens to be sketch building aggregation function that's not being added here.
Anyway, i was able to use UpdateSketch directly in the state class.
For sketch union aggregation, the state could probably hold directly onto an org.apache.datasketches.theta.Union object
| private UnionWithParams() {} | ||
|
|
||
| @InputFunction | ||
| public static void input(@AggregationState SketchState state, @SqlType(StandardTypes.VARBINARY) Slice inputValue, @SqlType(StandardTypes.INTEGER) Integer normEntries, @SqlType(StandardTypes.BIGINT) Long seed) |
| { | ||
| state.setNominalEntries(normEntries); | ||
| state.setSeed(seed); | ||
| state.setSketch(inputValue); |
There was a problem hiding this comment.
Can this overwrite the state/result after previous input function invocation?
| DataSketches functions | ||
| ---------------------- | ||
|
|
||
| .. function:: theta_sketch_union(sketches) -> sketch |
There was a problem hiding this comment.
there are two variants of theta_sketch_union, one having parameters. The parameters need to be documented.
There was a problem hiding this comment.
also, should the docs say this is an aggregation function?
There was a problem hiding this comment.
-> sketch
there is no "sketch" type. should this be "varbinary"?
|
I was just skimming. Not reviewing until the syntax is reviewed (#6643 (comment)) |
|
Hi @findepi , could I ask what open concerns are there on the syntax? I couldn't find relevant context from the comment above. |
|
We are interested in this as well and can provide any help required to take this forward. |
|
👋 @ShashwatArghode - this PR is inactive and doesn't seem to be under development. If you'd like to continue work on this at any point in the future, feel free to re-open. |
|
@sopel39 i'd suggest creating a new PR instead of reopening 3-4 year old one. |
|
And we need to wait for the new release https://github.com/apache/datasketches-java that will be JDK25 compatible. |
Plugin to query Apache Datasketches (https://datasketches.apache.org/docs/Theta/ThetaSketchFramework.html). This includes merge and estimate function for theta sketch.