feat(native): Implement Sketch Theta aggregate and scalar functions by nmahadevuni · Pull Request #25685 · prestodb/presto

nmahadevuni · 2025-08-05T06:22:28Z

Description

Implements Sketch Theta aggregate and scalar functions required for the new Iceberg statistics introduced in Presto Java.

Motivation and Context

Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. Approximate answers are often faster and more efficient to compute than exact ones.

Theta sketches enable distinct value counting on datasets and also support set operations. For more information on Theta sketches, please see the Apache DataSketches Theta sketch documentation.

The Presto PR that introduced these functions is #20993. A brief introduction to them follows.

New Sketch Functions

Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary blob representing an Apache DataSketches Theta Sketch, so we implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing statistics.

sketch_theta(<column>) -> varbinary: An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binary format to gain access to a CompactSketch instance and associated methods.

sketch_theta_estimate(<varbinary sketch>) -> double: A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's output.

sketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int): This is another scalar function, but returns a row type containing more human-readable information about the sketch, such as the theta parameter as well as upper and lower bounds for 1 standard deviation from the estimate.
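To make the estimate, theta, and retained_entries fields more concrete, here is a toy sketch in C++ that keeps the k smallest hash values and derives those three numbers from them. It is only an illustration under stated assumptions: it is not the DataSketches implementation, it does not produce the Puffin-compatible binary format, and the names ToyThetaSketch, ToySummary, and mix are hypothetical. The upper and lower bounds are left out because the thread does not describe how they are computed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>

// Toy illustration only: NOT the DataSketches implementation and not
// compatible with its binary format. All names here are hypothetical.
struct ToySummary {
  double estimate;
  double theta;
  std::size_t retainedEntries;
};

class ToyThetaSketch {
 public:
  explicit ToyThetaSketch(std::size_t k) : k_(k) {
    assert(k > 0);
  }

  void update(std::uint64_t value) {
    const std::uint64_t hash = mix(value);
    if (estimationMode_ && hash >= thetaLong_) {
      return; // At or above the threshold: this value is not sampled.
    }
    retained_.insert(hash); // Duplicate values hash identically and collapse.
    if (retained_.size() > k_) {
      // Lower the threshold to the largest retained hash and drop that hash.
      auto largest = std::prev(retained_.end());
      thetaLong_ = *largest;
      retained_.erase(largest);
      estimationMode_ = true;
    }
  }

  ToySummary summary() const {
    // theta is the sampling threshold expressed as a fraction of hash space.
    const double theta = estimationMode_
        ? static_cast<double>(thetaLong_) / std::ldexp(1.0, 64)
        : 1.0;
    const double retained = static_cast<double>(retained_.size());
    return {retained / theta, theta, retained_.size()};
  }

 private:
  // splitmix64 finalizer, used here as a simple 64-bit mixing function.
  static std::uint64_t mix(std::uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  std::size_t k_;
  bool estimationMode_ = false;
  std::uint64_t thetaLong_ = 0;
  std::set<std::uint64_t> retained_;
};
```

While fewer than k distinct values have been seen, theta stays at 1.0 and the estimate equals the exact count. Once the toy starts discarding hashes, theta drops below 1.0 and the estimate becomes the retained count divided by theta.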

Impact

No impact

Test Plan

Added tests in ThetaSketchAggregationTest.cpp

== NO RELEASE NOTE ==

steveburnett · 2025-08-05T13:47:03Z

Do these new functions need documentation? Perhaps in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/functions/sketch.rst.

nmahadevuni · 2025-08-06T09:37:19Z

> Do these new functions need documentation? Perhaps in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/functions/sketch.rst.

These functions are already documented on this page. This PR implements the same functions for Prestissimo.

nmahadevuni · 2025-08-07T11:22:36Z

@aditi-pandit @czentgr Can you please review this?

PingLiuPing · 2025-08-07T12:42:02Z

.github/workflows/prestocpp-linux-build-and-unit-test.yml

          github.event_name == 'schedule' || needs.changes.outputs.codechange == 'true'
        run: ccache -sz

+      - name: Build required adapter dependencies


Is the code change in this file expected?

Yes. It installs the required dependency Apache DataSketches.

PingLiuPing · 2025-08-07T12:43:38Z

NOTICES

+
+Prior to moving to ASF, the software for this project was developed at
+Yahoo Inc. (https://developer.yahoo.com).
+-------


Please add a newline here.

PingLiuPing · 2025-08-07T12:45:52Z

presto-native-execution/presto_cpp/main/CMakeLists.txt

  endif()
 endif()
+
+if(PRESTO_ENABLE_TESTING)


Suggested change

-if(PRESTO_ENABLE_TESTING)
+if(PRESTO_ENABLE_TESTING AND PRESTO_ENABLE_DATASKETCHES)
+  add_subdirectory(functions/aggregates/tests)
+endif()

aditi-pandit

Thanks for this code @nmahadevuni.

i) Can you add some documentation for these functions as well?
ii) Did you check whether these functions are reported by the sidecar?
iii) Can you also add e2e tests in https://github.com/prestodb/presto/tree/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests for comparing native vs Java results? Could you add some of the Iceberg ANALYZE tests as well?

@pramodsatya

aditi-pandit · 2025-08-09T19:01:22Z

presto-native-execution/scripts/setup-adapters.sh

 }

+function install_datasketches {
+  # grpc


What is this comment for?

Unrelated. Removed.

aditi-pandit · 2025-08-09T19:03:53Z

presto-native-execution/presto_cpp/main/functions/ThetaSketchFunctions.h

+#pragma once
+
+#include "DataSketches/theta_sketch.hpp"
+#include "velox/functions/Macros.h"


Leave a blank line before velox includes.

aditi-pandit · 2025-08-09T19:05:32Z

presto-native-execution/presto_cpp/main/functions/ThetaSketchFunctions.h

+      const arg_type<velox::Varbinary>& in) {
+    auto compactSketch =
+        datasketches::wrapped_compact_theta_sketch::wrap(in.data(), in.size());
+    double estimate = compactSketch.get_estimate();


Do we need all these local variables? Can these be initialized within std::make_tuple for the result?
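A minimal sketch of what that suggestion could look like, assuming a stand-in sketch type so the snippet compiles on its own. StubSketch, SummaryRow, and summarize are hypothetical names, and the getters return fixed values; only the getter names are modeled on the DataSketches C++ theta sketch interface. The actual function in the PR writes into a Velox row output rather than returning a tuple.

```cpp
#include <cstdint>
#include <tuple>

// Hypothetical stand-in for a wrapped compact sketch. The values are fixed
// for illustration and carry no statistical meaning.
struct StubSketch {
  double get_estimate() const { return 3.0; }
  double get_theta() const { return 1.0; }
  double get_upper_bound(std::uint8_t /*numStdDevs*/) const { return 3.0; }
  double get_lower_bound(std::uint8_t /*numStdDevs*/) const { return 3.0; }
  std::uint32_t get_num_retained() const { return 3; }
};

// Field order follows the sketch_theta_summary row described in the PR.
using SummaryRow = std::tuple<double, double, double, double, std::int32_t>;

// Builds the result directly inside std::make_tuple, with no named locals.
inline SummaryRow summarize(const StubSketch& sketch) {
  return std::make_tuple(
      sketch.get_estimate(),
      sketch.get_theta(),
      sketch.get_upper_bound(1),
      sketch.get_lower_bound(1),
      static_cast<std::int32_t>(sketch.get_num_retained()));
}
```

Whether this reads better than named locals is a style call; the named version can be easier to step through in a debugger.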

aditi-pandit · 2025-08-09T19:09:33Z

presto-native-execution/presto_cpp/main/functions/aggregates/ThetaSketchAggregate.h

+
+#include "DataSketches/theta_sketch.hpp"
+#include "DataSketches/theta_union.hpp"
+#include "velox/exec/Aggregate.h"


Add blank line before velox includes.

aditi-pandit · 2025-08-09T19:09:52Z

presto-native-execution/presto_cpp/main/functions/aggregates/ThetaSketchAggregate.h

+#include "velox/exec/SimpleAggregateAdapter.h"
+#include "velox/functions/prestosql/aggregates/AggregateNames.h"
+
+using namespace facebook::velox;


using namespace isn't allowed in header files.

aditi-pandit · 2025-08-09T19:52:53Z

presto-native-execution/presto_cpp/main/functions/ThetaSketchFunctions.h

I feel it might be better to have a single folder under functions called theta_sketch and add both the scalar and aggregate functions in it. The current setup is a bit odd.

aditi-pandit · 2025-08-09T19:54:41Z

presto-native-execution/presto_cpp/main/functions/aggregates/ThetaSketchAggregate.h

+    bool addInput(
+        HashStringAllocator* /*allocator*/,
+        exec::optional_arg_type<T> data) {
+      if (!data.has_value())


I feel

if (data.has_value()) { ... } return true;

is more readable. Can you follow that style?

aditi-pandit · 2025-08-09T19:57:22Z

presto-native-execution/presto_cpp/main/functions/aggregates/ThetaSketchAggregate.h

+        exec::optional_arg_type<Varbinary> other) {
+      if (!other.has_value())
+        return true;
+      thetaUnion.update(updateSketch);


It seems all the functions call updateSketch.reset() at the end, so what is the point of calling thetaUnion.update(updateSketch) at the beginning of these functions?

updateSketch is updated in addInput. thetaUnion then needs to be updated with those entries, after which updateSketch is reset to avoid duplicate entries. In the Java implementation of DataSketches, the ThetaUnion class alone is sufficient because it also has an UpdateSketch data member; the C++ implementation does not, so the two must be maintained separately.
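The pattern described in this reply can be sketched as follows. This is a simplified model, not the PR's code: BufferSketch and MergedSet are hypothetical stand-ins built on exact hash sets rather than the DataSketches update sketch and union, and Accumulator, flush, combine, and result are illustrative names. The point is only the control flow, where buffered input is merged into the union and the buffer is cleared before any combine or write step.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <unordered_set>

// Hypothetical stand-in for an update sketch: buffers raw input values.
struct BufferSketch {
  std::unordered_set<std::int64_t> entries;
  void update(std::int64_t value) { entries.insert(value); }
  void reset() { entries.clear(); }
  bool empty() const { return entries.empty(); }
};

// Hypothetical stand-in for a union: accumulates merged entries.
struct MergedSet {
  std::unordered_set<std::int64_t> entries;
  void update(const BufferSketch& sketch) {
    entries.insert(sketch.entries.begin(), sketch.entries.end());
  }
  void update(const MergedSet& other) {
    entries.insert(other.entries.begin(), other.entries.end());
  }
  std::size_t distinct() const { return entries.size(); }
};

struct Accumulator {
  BufferSketch updateSketch;
  MergedSet thetaUnion;

  // Raw input only touches the buffer; nulls are skipped.
  void addInput(std::optional<std::int64_t> data) {
    if (data.has_value()) {
      updateSketch.update(*data);
    }
  }

  // Moves buffered entries into the union, then clears the buffer so the
  // same entries are not merged again by a later call.
  void flush() {
    thetaUnion.update(updateSketch);
    updateSketch.reset();
  }

  // Merging another partial result first flushes local input.
  void combine(const MergedSet& other) {
    flush();
    thetaUnion.update(other);
  }

  // Any write step also flushes before reading the union.
  std::size_t result() {
    flush();
    return thetaUnion.distinct();
  }
};
```

Pulling the two-step update into one helper such as flush also keeps the combine and write paths from repeating the same pair of calls.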

aditi-pandit · 2025-08-09T19:58:21Z

presto-native-execution/presto_cpp/main/functions/aggregates/ThetaSketchAggregate.h

+    bool writeIntermediateResult(
+        bool nonNullGroup,
+        exec::out_type<Varbinary>& out) {
+      thetaUnion.update(updateSketch);


Can you abstract a function for this code? It seems to be repeated in several write functions.

aditi-pandit · 2025-08-09T20:02:57Z

...o-native-execution/presto_cpp/main/functions/aggregates/tests/ThetaSketchAggregationTest.cpp

+        {getExpectedResult<T>(values)}, VARBINARY())});
+
+    testAggregations(
+        {vectors}, {}, {"pressto.default.sketch_theta(c0)"}, {expected});


"pressto" spelling.

nmahadevuni · 2025-09-15T07:19:03Z

@aditi-pandit @PingLiuPing Thanks for the review. I have addressed the comments. Please review.

steveburnett

Thanks for the doc, looks great! One minor rephrasing suggested; let me know if my suggestion changes your intended meaning in a way that you disagree with.

steveburnett · 2025-09-15T14:01:37Z

presto-docs/src/main/sphinx/presto_cpp/functions/sketch.rst

+================
+
+Sketches are data structures that can approximately answer particular questions
+about a dataset when full accuracy is not required. The benefit of approximate


Suggested change

-about a dataset when full accuracy is not required. The benefit of approximate
+about a dataset when full accuracy is not required. Approximate

steveburnett · 2025-09-15T14:02:17Z

presto-docs/src/main/sphinx/presto_cpp/functions/sketch.rst

+
+Sketches are data structures that can approximately answer particular questions
+about a dataset when full accuracy is not required. The benefit of approximate
+answers is that they are often faster and more efficient to compute than


Suggested change

-answers is that they are often faster and more efficient to compute than
+answers are often faster and more efficient to compute than

Suggestion for conciseness and readability.

aditi-pandit

Thanks @nmahadevuni. I have a bunch of comments.

Also, since we are mainly writing this function to support ANALYZE for Iceberg and the logic in https://github.com/prestodb/presto/blob/master/presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java, we should run some Iceberg e2e tests for it. What do you think?

aditi-pandit · 2025-10-26T02:51:12Z

presto-native-execution/presto_cpp/main/PrestoServer.cpp

 #include "presto_cpp/main/common/Utils.h"
 #include "presto_cpp/main/connectors/Registration.h"
 #include "presto_cpp/main/connectors/SystemConnector.h"
+#ifdef PRESTO_ENABLE_DATASKETCHES


I feel these can be included by default as Java also has them in presto-main-base
https://github.com/prestodb/presto/blob/master/presto-main-base/src/main/java/com/facebook/presto/operator/scalar/ThetaSketchFunctions.java

aditi-pandit · 2025-10-26T03:01:46Z

presto-native-execution/presto_cpp/main/functions/theta_sketch/ThetaSketchAggregate.cpp

+            argTypes.size(), 1, "{} takes at most one argument", name);
+        auto inputType = argTypes[0];
+        if (velox::exec::isRawInput(step)) {
+          switch (inputType->kind()) {


The formatting of the lines seems off... PTAL.

aditi-pandit · 2025-10-26T03:09:54Z

presto-native-execution/presto_cpp/main/functions/theta_sketch/ThetaSketchAggregate.h

+  std::vector<std::shared_ptr<velox::exec::AggregateFunctionSignature>> signatures;
+
+  for (const auto& inputType :
+       {"smallint", "integer", "bigint", "real", "double", "varchar"}) {


You don't have DATE/TIME and DECIMAL variants in theta_sketch. From this code it seems like the TableStatisticsMaker expects these types to be supported https://github.com/prestodb/presto/blob/master/presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java#L673.

aditi-pandit · 2025-10-26T03:13:20Z

presto-native-execution/presto_cpp/main/functions/theta_sketch/ThetaSketchAggregate.cpp

+    }
+
+    bool writeFinalResult(bool nonNullGroup, velox::exec::out_type<velox::Varbinary>& out) {
+      updateUnion();


This code seems exactly the same as writeIntermediateResult. Can you abstract a common function for it?

aditi-pandit · 2025-10-26T03:16:10Z

...est/java/com/facebook/presto/nativetests/functions/TestThetaSketchFunctionsNativeVsJava.java

You can simply call this file TestThetaSketchFunctions.java

aditi-pandit · 2025-10-26T03:29:34Z

...est/java/com/facebook/presto/nativetests/functions/TestThetaSketchFunctionsNativeVsJava.java

+        MaterializedResult result = nativeQueryRunner.execute(session, "SELECT " + functionNamespace + "sketch_theta_estimate(CAST(NULL as VARBINARY))");
+
+        assertTrue(result.getOnlyValue() == null);
+        functionAssertions.assertFunction("sketch_theta_estimate(CAST(NULL as VARBINARY))", DOUBLE, null);


This is not typically how we test functions. The general method is to have a SQL statement that is run on both the Java and native QueryRunner and validate that the results match.
Check https://github.com/prestodb/presto/blob/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests/TestOrderByQueries.java

czentgr · 2025-12-18T16:19:57Z

presto-native-execution/scripts/setup-centos.sh

 }

+function install_datasketches {
+  github_checkout apache/datasketches-cpp 5.2.0 --depth 1


Don't use git clone. Use wget_and_untar instead for https://github.com/apache/datasketches-cpp/archive/refs/tags/5.2.0.tar.gz.

Joe-Abraham · 2025-12-19T11:24:40Z

NOTICES

 Prior to moving to ASF, the software for this project was developed at
 Yahoo Inc. (https://developer.yahoo.com).
-------
+-------


nit: newline at EOF is missing.

steveburnett · 2026-01-06T19:18:51Z

@nmahadevuni, when you have time, please take a look at my suggestions for the doc and let me know what you think.

…heta functions (#26831) ## Description This change adds a Apache DataSketches CPP package to the setup scripts ## Motivation and Context This package is required to implement Theta sketch aggregate and scalar functions required for Iceberg statistics. The functions will be implemented in a different PR #25685 . ## Impact No impact ## Test Plan  ``` == NO RELEASE NOTE == ```

mblanco-denodo · 2026-01-19T08:35:47Z

presto-native-tests/pom.xml

                    <excludedGroups>remote-function</excludedGroups>
                    <systemPropertyVariables>
-                        <PRESTO_SERVER>/root/project/build/debug/presto_cpp/main/presto_server</PRESTO_SERVER>
+                        <PRESTO_SERVER>/Users/nmahadevuni/mywork/code/opensource/presto-fork/presto-native-execution/_build/debug/presto_cpp/main/presto_server</PRESTO_SERVER>


This looks like a change to make the tests work locally. Revert this, or CI/CD could break.

mblanco-denodo · 2026-02-12T09:32:06Z

There are compilation failures, probably because the dependency image did not include #26831.
Please rebase the branch so that the code compiles in testing.

nmahadevuni requested review from a team as code owners August 5, 2025 06:22

prestodb-ci added the from:IBM PR from IBM label Aug 5, 2025

prestodb-ci requested review from a team, Joe-Abraham and namya28 and removed request for a team August 5, 2025 06:22

nmahadevuni changed the title ~~Theta sketch native functions~~ [WIP]: Implement Sketch Theta aggregate and scalar functions Aug 5, 2025

nmahadevuni force-pushed the theta_sketch_native_functions branch from bf2c33e to 2bc57ba Compare August 5, 2025 07:28

nmahadevuni requested review from czentgr and unidevel as code owners August 5, 2025 10:32

nmahadevuni force-pushed the theta_sketch_native_functions branch 3 times, most recently from c600fd9 to c46ffb4 Compare August 5, 2025 12:28

nmahadevuni force-pushed the theta_sketch_native_functions branch from c46ffb4 to 3601af1 Compare August 6, 2025 06:02

nmahadevuni changed the title ~~[WIP]: Implement Sketch Theta aggregate and scalar functions~~ [WIP]: [native] Implement Sketch Theta aggregate and scalar functions Aug 6, 2025

nmahadevuni force-pushed the theta_sketch_native_functions branch from 3601af1 to 4636f1d Compare August 6, 2025 08:24

nmahadevuni force-pushed the theta_sketch_native_functions branch 3 times, most recently from 9d6c62c to 6109e9b Compare August 7, 2025 06:04

nmahadevuni changed the title ~~[WIP]: [native] Implement Sketch Theta aggregate and scalar functions~~ [native] Implement Sketch Theta aggregate and scalar functions Aug 7, 2025

nmahadevuni requested a review from aditi-pandit August 7, 2025 11:14

PingLiuPing reviewed Aug 7, 2025

aditi-pandit reviewed Aug 9, 2025

nmahadevuni mentioned this pull request Aug 12, 2025

feat(function): Implement theta sketch math functions facebookincubator/velox#13844

Closed

nmahadevuni mentioned this pull request Aug 22, 2025

fix: Aggregate destroyAccumulator memset call with dynamic class input facebookincubator/velox#14568

Closed

nmahadevuni force-pushed the theta_sketch_native_functions branch from 6109e9b to fc9fdae Compare August 25, 2025 07:43

nmahadevuni force-pushed the theta_sketch_native_functions branch from 09e9bdd to 0ffa161 Compare September 15, 2025 08:15

steveburnett requested changes Sep 15, 2025

nmahadevuni force-pushed the theta_sketch_native_functions branch from 0ffa161 to a21d7a8 Compare October 13, 2025 07:03

nmahadevuni changed the title ~~[native] Implement Sketch Theta aggregate and scalar functions~~ feat: [native] Implement Sketch Theta aggregate and scalar functions Oct 14, 2025

nmahadevuni force-pushed the theta_sketch_native_functions branch from a21d7a8 to 8c74f5f Compare October 15, 2025 06:38

nmahadevuni changed the title ~~feat: [native] Implement Sketch Theta aggregate and scalar functions~~ feat(native): Implement Sketch Theta aggregate and scalar functions Oct 15, 2025

nmahadevuni force-pushed the theta_sketch_native_functions branch from 8c74f5f to 768041f Compare October 16, 2025 16:28

nmahadevuni force-pushed the theta_sketch_native_functions branch from 768041f to fce440c Compare October 24, 2025 09:23

aditi-pandit reviewed Oct 26, 2025

nmahadevuni force-pushed the theta_sketch_native_functions branch from fce440c to e2c21d2 Compare November 13, 2025 12:03

nmahadevuni mentioned this pull request Dec 5, 2025

[native]: Implement Sketch Theta aggregate and scalar functions. #26745

Open

nmahadevuni force-pushed the theta_sketch_native_functions branch from e2c21d2 to 7481560 Compare December 17, 2025 18:57

czentgr reviewed Dec 18, 2025

nmahadevuni mentioned this pull request Dec 19, 2025

build(native): Add dependency on Apache DataSketches CPP for Sketch Theta functions #26831

Merged

Joe-Abraham reviewed Dec 19, 2025

mblanco-denodo suggested changes Jan 19, 2026

nmahadevuni force-pushed the theta_sketch_native_functions branch 2 times, most recently from 70c711d to 34baa74 Compare January 20, 2026 09:42

nmahadevuni force-pushed the theta_sketch_native_functions branch from 34baa74 to bad191b Compare February 12, 2026 19:53

nmahadevuni requested review from ZacBlanco and hantangwangd as code owners February 12, 2026 19:53

nmahadevuni force-pushed the theta_sketch_native_functions branch 2 times, most recently from 401857b to e8e9038 Compare February 13, 2026 07:13

feat(native): Implement Sketch Theta aggregate and scalar functions

efa121d

nmahadevuni force-pushed the theta_sketch_native_functions branch from e8e9038 to efa121d Compare February 13, 2026 11:17
