feat(native): Implement Sketch Theta aggregate and scalar functions#25685
feat(native): Implement Sketch Theta aggregate and scalar functions#25685nmahadevuni wants to merge 1 commit intoprestodb:masterfrom
Conversation
bf2c33e to
2bc57ba
Compare
c600fd9 to
c46ffb4
Compare
|
Do these new functions need documentation? Perhaps in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/functions/sketch.rst. |
c46ffb4 to
3601af1
Compare
3601af1 to
4636f1d
Compare
These functions are already documented on this page. This PR implements the same functions for Prestissimo. |
9d6c62c to
6109e9b
Compare
|
@aditi-pandit @czentgr Can you please review this? |
| github.event_name == 'schedule' || needs.changes.outputs.codechange == 'true' | ||
| run: ccache -sz | ||
|
|
||
| - name: Build required adapter dependencies |
There was a problem hiding this comment.
Does the code change in this file expected?
There was a problem hiding this comment.
Yes. It installs the required dependency Apache DataSketches.
NOTICES
Outdated
|
|
||
| Prior to moving to ASF, the software for this project was developed at | ||
| Yahoo Inc. (https://developer.yahoo.com). | ||
| ------- No newline at end of file |
There was a problem hiding this comment.
Please add new line here.
| endif() | ||
| endif() | ||
|
|
||
| if(PRESTO_ENABLE_TESTING) |
There was a problem hiding this comment.
| if(PRESTO_ENABLE_TESTING) | |
| if(PRESTO_ENABLE_TESTING AND PRESTO_ENABLE_DATASKETCHES) | |
| add_subdirectory(functions/aggregates/tests) | |
| endif() |
There was a problem hiding this comment.
Thanks for this code @nmahadevuni.
i) Can you add some documentation for these functions as well ?
ii) Did you check if these functions are reported by side-car ?
iii) Can you also add e2e tests in https://github.com/prestodb/presto/tree/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests for comparing native vs java results ? Add some of the iceberg ANALYZE tests as well ?
| } | ||
|
|
||
| function install_datasketches { | ||
| # grpc |
There was a problem hiding this comment.
What is this comment for ?
| #pragma once | ||
|
|
||
| #include "DataSketches/theta_sketch.hpp" | ||
| #include "velox/functions/Macros.h" |
There was a problem hiding this comment.
Leave a blank line before velox includes.
| const arg_type<velox::Varbinary>& in) { | ||
| auto compactSketch = | ||
| datasketches::wrapped_compact_theta_sketch::wrap(in.data(), in.size()); | ||
| double estimate = compactSketch.get_estimate(); |
There was a problem hiding this comment.
Do we need all these local variables ? Can these be initialized within std::make_tuple for the result ?
|
|
||
| #include "DataSketches/theta_sketch.hpp" | ||
| #include "DataSketches/theta_union.hpp" | ||
| #include "velox/exec/Aggregate.h" |
There was a problem hiding this comment.
Add blank line before velox includes.
| #include "velox/exec/SimpleAggregateAdapter.h" | ||
| #include "velox/functions/prestosql/aggregates/AggregateNames.h" | ||
|
|
||
| using namespace facebook::velox; |
There was a problem hiding this comment.
using namespace isn't allowed in header files.
There was a problem hiding this comment.
I feel it might be better to have a single folder under functions called theta_sketch and add both the scalar and aggregate functions in it. The current setup is a bit odd.
| bool addInput( | ||
| HashStringAllocator* /*allocator*/, | ||
| exec::optional_arg_type<T> data) { | ||
| if (!data.has_value()) |
There was a problem hiding this comment.
I feel
if (data.has_value()) {
...
}
return true;
is more readable.
Can you follow that style ?
| exec::optional_arg_type<Varbinary> other) { | ||
| if (!other.has_value()) | ||
| return true; | ||
| thetaUnion.update(updateSketch); |
There was a problem hiding this comment.
Seems like all the functions are calling updateSketch.reset() at the end of the function, then what is the point of calling thetaUnion.update(updateSketch) at the beginning of these functions ?
There was a problem hiding this comment.
updateSketch will be updated in addInput and thetaUnion needs to be updated with these entries and reset updateSketch to avoid duplicate entries. In Java implementation of DataSketches, only ThetaUnion class is sufficient because it also has an UpdateSketch data member, but in C++ implementation, it doesn't and so we need these two separately maintained.
| bool writeIntermediateResult( | ||
| bool nonNullGroup, | ||
| exec::out_type<Varbinary>& out) { | ||
| thetaUnion.update(updateSketch); |
There was a problem hiding this comment.
Can you abstract a function for this code, it seems to be repeated in several write functions ?
| {getExpectedResult<T>(values)}, VARBINARY())}); | ||
|
|
||
| testAggregations( | ||
| {vectors}, {}, {"pressto.default.sketch_theta(c0)"}, {expected}); |
6109e9b to
fc9fdae
Compare
|
@aditi-pandit @PingLiuPing Thanks for the review. I have addressed the comments. Please review. |
09e9bdd to
0ffa161
Compare
steveburnett
left a comment
There was a problem hiding this comment.
Thanks for the doc, looks great! One minor rephrasing suggested, let me know if my suggestion changes your intended meaning in a way that you disagree with.
| ================ | ||
|
|
||
| Sketches are data structures that can approximately answer particular questions | ||
| about a dataset when full accuracy is not required. The benefit of approximate |
There was a problem hiding this comment.
| about a dataset when full accuracy is not required. The benefit of approximate | |
| about a dataset when full accuracy is not required. Approximate |
|
|
||
| Sketches are data structures that can approximately answer particular questions | ||
| about a dataset when full accuracy is not required. The benefit of approximate | ||
| answers is that they are often faster and more efficient to compute than |
There was a problem hiding this comment.
| answers is that they are often faster and more efficient to compute than | |
| answers are often faster and more efficient to compute than |
Suggestion for conciseness and readability.
0ffa161 to
a21d7a8
Compare
a21d7a8 to
8c74f5f
Compare
8c74f5f to
768041f
Compare
768041f to
fce440c
Compare
aditi-pandit
left a comment
There was a problem hiding this comment.
Thanks @nmahadevuni. Have bunch of comments.
Also, since we are mainly writing this function to support ANALYZE for iceberg and the logic in https://github.com/prestodb/presto/blob/master/presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java, then we should run some Iceberg e2e tests for it. wdyt ?
| #include "presto_cpp/main/common/Utils.h" | ||
| #include "presto_cpp/main/connectors/Registration.h" | ||
| #include "presto_cpp/main/connectors/SystemConnector.h" | ||
| #ifdef PRESTO_ENABLE_DATASKETCHES |
There was a problem hiding this comment.
I feel these can be included by default as Java also has them in presto-main-base
https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/operator/scalar/ThetaSketchFunctions.java
| argTypes.size(), 1, "{} takes at most one argument", name); | ||
| auto inputType = argTypes[0]; | ||
| if (velox::exec::isRawInput(step)) { | ||
| switch (inputType->kind()) { |
There was a problem hiding this comment.
The formating of the lines seems off... PTAL.
| std::vector<std::shared_ptr<velox::exec::AggregateFunctionSignature>> signatures; | ||
|
|
||
| for (const auto& inputType : | ||
| {"smallint", "integer", "bigint", "real", "double", "varchar"}) { |
There was a problem hiding this comment.
You don't have DATE/TIME and DECIMAL variants in theta_sketch. From this code it seems like the TableStatisticsMaker expects these types to be supported https://github.com/prestodb/presto/blob/master/presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java#L673.
| } | ||
|
|
||
| bool writeFinalResult(bool nonNullGroup, velox::exec::out_type<velox::Varbinary>& out) { | ||
| updateUnion(); |
There was a problem hiding this comment.
This code seems exactly as writeIntermediateResult. Can you abstract a common function for it ?
There was a problem hiding this comment.
You can simply call this file TestThetaSketchFunctions.java
| MaterializedResult result = nativeQueryRunner.execute(session, "SELECT " + functionNamespace + "sketch_theta_estimate(CAST(NULL as VARBINARY))"); | ||
|
|
||
| assertTrue(result.getOnlyValue() == null); | ||
| functionAssertions.assertFunction("sketch_theta_estimate(CAST(NULL as VARBINARY))", DOUBLE, null); |
There was a problem hiding this comment.
This is not typically how we test function. The general method is to have a SQL statement that is run on both java and native QueryRunner and validate the results match.
Check https://github.com/prestodb/presto/blob/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests/TestOrderByQueries.java
fce440c to
e2c21d2
Compare
e2c21d2 to
7481560
Compare
| } | ||
|
|
||
| function install_datasketches { | ||
| github_checkout apache/datasketches-cpp 5.2.0 --depth 1 |
There was a problem hiding this comment.
Don't use git clone. Use wget_and_untar instead for https://github.com/apache/datasketches-cpp/archive/refs/tags/5.2.0.tar.gz.
| Prior to moving to ASF, the software for this project was developed at | ||
| Yahoo Inc. (https://developer.yahoo.com). | ||
| ------- | ||
| ------- |
|
@nmahadevuni, when you have time, please take a look at my suggestions for the doc and let me know what you think. |
…heta functions (#26831) ## Description This change adds a Apache DataSketches CPP package to the setup scripts ## Motivation and Context This package is required to implement Theta sketch aggregate and scalar functions required for Iceberg statistics. The functions will be implemented in a different PR #25685 . ## Impact No impact ## Test Plan <!---Please fill in how you tested your change--> ``` == NO RELEASE NOTE == ```
…heta functions (prestodb#26831) ## Description This change adds a Apache DataSketches CPP package to the setup scripts ## Motivation and Context This package is required to implement Theta sketch aggregate and scalar functions required for Iceberg statistics. The functions will be implemented in a different PR prestodb#25685 . ## Impact No impact ## Test Plan <!---Please fill in how you tested your change--> ``` == NO RELEASE NOTE == ```
presto-native-tests/pom.xml
Outdated
| <excludedGroups>remote-function</excludedGroups> | ||
| <systemPropertyVariables> | ||
| <PRESTO_SERVER>/root/project/build/debug/presto_cpp/main/presto_server</PRESTO_SERVER> | ||
| <PRESTO_SERVER>/Users/nmahadevuni/mywork/code/opensource/presto-fork/presto-native-execution/_build/debug/presto_cpp/main/presto_server</PRESTO_SERVER> |
There was a problem hiding this comment.
This looks like a change to make the tests work locally. Revert this or CI/CD could break
70c711d to
34baa74
Compare
|
There are compilation failures, probably because dependency image did not have this: #26831 |
34baa74 to
bad191b
Compare
401857b to
e8e9038
Compare
e8e9038 to
efa121d
Compare
Description
Implements Sketch Theta aggregate and scalar functions required for the new Iceberg statistics introduced in Presto Java.
Motivation and Context
Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. The benefit of approximate answers is that they are often faster and more efficient to compute than functions which result in full accuracy.
Theta sketches enable distinct value counting on datasets and also provide the ability to perform set operations. For more information on Theta sketches, please see the Apache Datasketches Theta sketch documentation
The Presto PR which introduced these changes is #20993. A brief intro to these functions
New Sketch Functions
Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary
blob representing an Apache Datasketches Theta Sketch, so we implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing statistics.
sketch_theta(<column>) -> varbinary:An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binaryformat to gain access to a CompactSketch instance and associated methods.
sketch_theta_estimate(<varbinary sketch>) -> double:A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's outputsketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int):This is another scalar function, but returns a row type containingmore human-readable information about the sketch such as the theta parameter as well as upper and lower bounds
for 1 standard deviation from the estimate
Impact
No impact
Test Plan
Added tests in ThetaSketchAggregationTest.cpp