Skip to content

feat(native): Implement Sketch Theta aggregate and scalar functions#25685

Open
nmahadevuni wants to merge 1 commit intoprestodb:masterfrom
nmahadevuni:theta_sketch_native_functions
Open

feat(native): Implement Sketch Theta aggregate and scalar functions#25685
nmahadevuni wants to merge 1 commit intoprestodb:masterfrom
nmahadevuni:theta_sketch_native_functions

Conversation

@nmahadevuni
Copy link
Member

Description

Implements Sketch Theta aggregate and scalar functions required for the new Iceberg statistics introduced in Presto Java.

Motivation and Context

Sketches are data structures that can approximately answer particular questions about a dataset when full accuracy is not required. The benefit of approximate answers is that they are often faster and more efficient to compute than functions which result in full accuracy.

Theta sketches enable distinct value counting on datasets and also provide the ability to perform set operations. For more information on Theta sketches, please see the Apache Datasketches Theta sketch documentation

The Presto PR which introduced these changes is #20993. A brief intro to these functions

New Sketch Functions

Iceberg's Puffin spec defines the format that NDVs must be written in. Currently, the only available format is a binary
blob representing an Apache Datasketches Theta Sketch, so we implemented three basic functions which expose the sketch so that Iceberg can eventually consume it when writing statistics.

sketch_theta(<column>) -> varbinary: An aggregation function which accepts a column and generates a binary representation of the org.apache.datasketches.theta.CompactSketch. Applications can easily consume this raw binary
format to gain access to a CompactSketch instance and associated methods.

sketch_theta_estimate(<varbinary sketch>) -> double: A scalar function which consumes a raw binary sketch and produces the estimate. This is effectively the same as calling CompactSketch::getEstimate. I've exposed this as a convenience for checking the sketch's output

sketch_theta_summary(<varbinary sketch>) -> row(estimate double, theta double, upper_bound_std1 double, lower_bound_std1 double, retained_entries int): This is another scalar function, but returns a row type containing
more human-readable information about the sketch such as the theta parameter as well as upper and lower bounds
for 1 standard deviation from the estimate

Impact

No impact

Test Plan

Added tests in ThetaSketchAggregationTest.cpp

== NO RELEASE NOTE ==

@nmahadevuni nmahadevuni requested review from a team as code owners August 5, 2025 06:22
@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Aug 5, 2025
@prestodb-ci prestodb-ci requested review from a team, Joe-Abraham and namya28 and removed request for a team August 5, 2025 06:22
@nmahadevuni nmahadevuni changed the title Theta sketch native functions [WIP]: Implement Sketch Theta aggregate and scalar functions Aug 5, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from bf2c33e to 2bc57ba Compare August 5, 2025 07:28
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch 3 times, most recently from c600fd9 to c46ffb4 Compare August 5, 2025 12:28
@steveburnett
Copy link
Contributor

@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from c46ffb4 to 3601af1 Compare August 6, 2025 06:02
@nmahadevuni nmahadevuni changed the title [WIP]: Implement Sketch Theta aggregate and scalar functions [WIP]: [native] Implement Sketch Theta aggregate and scalar functions Aug 6, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from 3601af1 to 4636f1d Compare August 6, 2025 08:24
@nmahadevuni
Copy link
Member Author

Do these new functions need documentation? Perhaps in https://github.com/prestodb/presto/blob/master/presto-docs/src/main/sphinx/functions/sketch.rst.

These functions are already documented on this page. This PR implements the same functions for Prestissimo.

@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch 3 times, most recently from 9d6c62c to 6109e9b Compare August 7, 2025 06:04
@nmahadevuni nmahadevuni changed the title [WIP]: [native] Implement Sketch Theta aggregate and scalar functions [native] Implement Sketch Theta aggregate and scalar functions Aug 7, 2025
@nmahadevuni
Copy link
Member Author

@aditi-pandit @czentgr Can you please review this?

github.event_name == 'schedule' || needs.changes.outputs.codechange == 'true'
run: ccache -sz

- name: Build required adapter dependencies
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the code change in this file expected?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. It installs the required dependency Apache DataSketches.

NOTICES Outdated

Prior to moving to ASF, the software for this project was developed at
Yahoo Inc. (https://developer.yahoo.com).
------- No newline at end of file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add new line here.

endif()
endif()

if(PRESTO_ENABLE_TESTING)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if(PRESTO_ENABLE_TESTING)
if(PRESTO_ENABLE_TESTING AND PRESTO_ENABLE_DATASKETCHES)
add_subdirectory(functions/aggregates/tests)
endif()

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this code @nmahadevuni.

i) Can you add some documentation for these functions as well ?
ii) Did you check if these functions are reported by side-car ?
iii) Can you also add e2e tests in https://github.com/prestodb/presto/tree/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests for comparing native vs java results ? Add some of the iceberg ANALYZE tests as well ?

@pramodsatya

}

function install_datasketches {
# grpc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is this comment for ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated. Removed

#pragma once

#include "DataSketches/theta_sketch.hpp"
#include "velox/functions/Macros.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leave a blank line before velox includes.

const arg_type<velox::Varbinary>& in) {
auto compactSketch =
datasketches::wrapped_compact_theta_sketch::wrap(in.data(), in.size());
double estimate = compactSketch.get_estimate();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need all these local variables ? Can these be initialized within std::make_tuple for the result ?


#include "DataSketches/theta_sketch.hpp"
#include "DataSketches/theta_union.hpp"
#include "velox/exec/Aggregate.h"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add blank line before velox includes.

#include "velox/exec/SimpleAggregateAdapter.h"
#include "velox/functions/prestosql/aggregates/AggregateNames.h"

using namespace facebook::velox;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

using namespace isn't allowed in header files.

Copy link
Contributor

@aditi-pandit aditi-pandit Aug 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel it might be better to have a single folder under functions called theta_sketch and add both the scalar and aggregate functions in it. The current setup is a bit odd.

bool addInput(
HashStringAllocator* /*allocator*/,
exec::optional_arg_type<T> data) {
if (!data.has_value())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel

if (data.has_value()) {
 ...
}
return true;

is more readable.

Can you follow that style ?

exec::optional_arg_type<Varbinary> other) {
if (!other.has_value())
return true;
thetaUnion.update(updateSketch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like all the functions are calling updateSketch.reset() at the end of the function, then what is the point of calling thetaUnion.update(updateSketch) at the beginning of these functions ?

Copy link
Member Author

@nmahadevuni nmahadevuni Aug 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateSketch will be updated in addInput and thetaUnion needs to be updated with these entries and reset updateSketch to avoid duplicate entries. In Java implementation of DataSketches, only ThetaUnion class is sufficient because it also has an UpdateSketch data member, but in C++ implementation, it doesn't and so we need these two separately maintained.

bool writeIntermediateResult(
bool nonNullGroup,
exec::out_type<Varbinary>& out) {
thetaUnion.update(updateSketch);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you abstract a function for this code, it seems to be repeated in several write functions ?

{getExpectedResult<T>(values)}, VARBINARY())});

testAggregations(
{vectors}, {}, {"pressto.default.sketch_theta(c0)"}, {expected});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"pressto" spelling.

@nmahadevuni
Copy link
Member Author

@aditi-pandit @PingLiuPing Thanks for the review. I have addressed the comments. Please review.

@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from 09e9bdd to 0ffa161 Compare September 15, 2025 08:15
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc, looks great! One minor rephrasing suggested, let me know if my suggestion changes your intended meaning in a way that you disagree with.

================

Sketches are data structures that can approximately answer particular questions
about a dataset when full accuracy is not required. The benefit of approximate
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
about a dataset when full accuracy is not required. The benefit of approximate
about a dataset when full accuracy is not required. Approximate


Sketches are data structures that can approximately answer particular questions
about a dataset when full accuracy is not required. The benefit of approximate
answers is that they are often faster and more efficient to compute than
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
answers is that they are often faster and more efficient to compute than
answers are often faster and more efficient to compute than

Suggestion for conciseness and readability.

@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from 0ffa161 to a21d7a8 Compare October 13, 2025 07:03
@nmahadevuni nmahadevuni changed the title [native] Implement Sketch Theta aggregate and scalar functions feat: [native] Implement Sketch Theta aggregate and scalar functions Oct 14, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from a21d7a8 to 8c74f5f Compare October 15, 2025 06:38
@nmahadevuni nmahadevuni changed the title feat: [native] Implement Sketch Theta aggregate and scalar functions feat(native): Implement Sketch Theta aggregate and scalar functions Oct 15, 2025
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from 8c74f5f to 768041f Compare October 16, 2025 16:28
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from 768041f to fce440c Compare October 24, 2025 09:23
Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @nmahadevuni. Have bunch of comments.

Also, since we are mainly writing this function to support ANALYZE for iceberg and the logic in https://github.com/prestodb/presto/blob/master/presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java, then we should run some Iceberg e2e tests for it. wdyt ?

#include "presto_cpp/main/common/Utils.h"
#include "presto_cpp/main/connectors/Registration.h"
#include "presto_cpp/main/connectors/SystemConnector.h"
#ifdef PRESTO_ENABLE_DATASKETCHES
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argTypes.size(), 1, "{} takes at most one argument", name);
auto inputType = argTypes[0];
if (velox::exec::isRawInput(step)) {
switch (inputType->kind()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The formating of the lines seems off... PTAL.

std::vector<std::shared_ptr<velox::exec::AggregateFunctionSignature>> signatures;

for (const auto& inputType :
{"smallint", "integer", "bigint", "real", "double", "varchar"}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't have DATE/TIME and DECIMAL variants in theta_sketch. From this code it seems like the TableStatisticsMaker expects these types to be supported https://github.com/prestodb/presto/blob/master/presto-iceberg/src/main/java/com/facebook/presto/iceberg/TableStatisticsMaker.java#L673.

}

bool writeFinalResult(bool nonNullGroup, velox::exec::out_type<velox::Varbinary>& out) {
updateUnion();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code seems exactly as writeIntermediateResult. Can you abstract a common function for it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can simply call this file TestThetaSketchFunctions.java

MaterializedResult result = nativeQueryRunner.execute(session, "SELECT " + functionNamespace + "sketch_theta_estimate(CAST(NULL as VARBINARY))");

assertTrue(result.getOnlyValue() == null);
functionAssertions.assertFunction("sketch_theta_estimate(CAST(NULL as VARBINARY))", DOUBLE, null);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not typically how we test function. The general method is to have a SQL statement that is run on both java and native QueryRunner and validate the results match.
Check https://github.com/prestodb/presto/blob/master/presto-native-tests/src/test/java/com/facebook/presto/nativetests/TestOrderByQueries.java

}

function install_datasketches {
github_checkout apache/datasketches-cpp 5.2.0 --depth 1
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Prior to moving to ASF, the software for this project was developed at
Yahoo Inc. (https://developer.yahoo.com).
-------
-------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit : EOF is missing.

@steveburnett
Copy link
Contributor

@nmahadevuni, when you have time, please take a look at my suggestions for the doc and let me know what you think.

aditi-pandit pushed a commit that referenced this pull request Jan 9, 2026
…heta functions (#26831)

## Description
This change adds a Apache DataSketches CPP package to the setup scripts

## Motivation and Context
This package is required to implement Theta sketch aggregate and scalar
functions required for Iceberg statistics. The functions will be
implemented in a different PR
#25685 .

## Impact
No impact

## Test Plan
<!---Please fill in how you tested your change-->


```
== NO RELEASE NOTE ==
```
tdcmeehan pushed a commit to rdtr/presto that referenced this pull request Jan 14, 2026
…heta functions (prestodb#26831)

## Description
This change adds a Apache DataSketches CPP package to the setup scripts

## Motivation and Context
This package is required to implement Theta sketch aggregate and scalar
functions required for Iceberg statistics. The functions will be
implemented in a different PR
prestodb#25685 .

## Impact
No impact

## Test Plan
<!---Please fill in how you tested your change-->


```
== NO RELEASE NOTE ==
```
<excludedGroups>remote-function</excludedGroups>
<systemPropertyVariables>
<PRESTO_SERVER>/root/project/build/debug/presto_cpp/main/presto_server</PRESTO_SERVER>
<PRESTO_SERVER>/Users/nmahadevuni/mywork/code/opensource/presto-fork/presto-native-execution/_build/debug/presto_cpp/main/presto_server</PRESTO_SERVER>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a change to make the tests work locally. Revert this or CI/CD could break

@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch 2 times, most recently from 70c711d to 34baa74 Compare January 20, 2026 09:42
@mblanco-denodo
Copy link
Contributor

There are compilation failures, probably because dependency image did not have this: #26831
Please rebase the branch so that the code compile on testing

@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from 34baa74 to bad191b Compare February 12, 2026 19:53
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch 2 times, most recently from 401857b to e8e9038 Compare February 13, 2026 07:13
@nmahadevuni nmahadevuni force-pushed the theta_sketch_native_functions branch from e8e9038 to efa121d Compare February 13, 2026 11:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants