Support arbitrary aggregation functions during ANALYZE (v1) #14220

Closed
findepi wants to merge 4 commits into trinodb:master from findepi:findepi/arbitrary-stats

findepi commented Sep 20, 2022

A connector may ask the engine to collect anything defined by the `ColumnStatisticType` SPI enum. This is convenient, but sometimes a connector needs to provide its own way of calculating statistics.

For example, Iceberg statistics include:

apache-datasketches-theta-v1 blob type

A serialized form of a "compact" Theta sketch produced by the Apache DataSketches library. The sketch is obtained by constructing Alpha family sketch with default seed, and feeding it with individual distinct values converted to bytes using Iceberg's single-value serialization.

This has two components that are not supported today:

  • a new data sketch with a specific configuration (so that results can be shared with different query engines)
  • a well-defined input pre-processing step that relies on existing Iceberg concepts, which are alien to the Trino engine

This PR addresses the first limitation. It allows the connector to pick an aggregation function of its choice for statistics collection.
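
To make the idea concrete, here is a minimal Java sketch of how a request for a named aggregation function could sit next to a request for a fixed enum value. All names here (`FixedStatistic`, `StatisticRequest`, `example_theta_sketch`) are hypothetical illustrations; they are not the Trino SPI and not the code in this PR.

```java
import java.util.Optional;

// Hypothetical fixed menu of statistics, in the spirit of an SPI enum.
enum FixedStatistic { MIN_VALUE, MAX_VALUE, NUMBER_OF_DISTINCT_VALUES }

// Hypothetical request: either a fixed statistic or a connector-chosen aggregation function.
record StatisticRequest(
        String columnName,
        Optional<FixedStatistic> fixed,
        Optional<String> aggregationFunction) {

    StatisticRequest {
        // Exactly one of the two options must be present.
        if (fixed.isPresent() == aggregationFunction.isPresent()) {
            throw new IllegalArgumentException("Exactly one of fixed or aggregationFunction must be set");
        }
    }

    static StatisticRequest ofFixed(String columnName, FixedStatistic statistic) {
        return new StatisticRequest(columnName, Optional.of(statistic), Optional.empty());
    }

    static StatisticRequest ofAggregation(String columnName, String functionName) {
        return new StatisticRequest(columnName, Optional.empty(), Optional.of(functionName));
    }

    // Renders the aggregation call that an engine could plan for this request.
    String describe() {
        String function = fixed.isPresent() ? fixed.get().name() : aggregationFunction.get();
        return function + "(" + columnName + ")";
    }
}
```

With this shape, `StatisticRequest.ofAggregation("orderkey", "example_theta_sketch")` expresses a statistic that the fixed enum cannot name, while `ofFixed` keeps the existing options available.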

cla-bot bot added the cla-signed label Sep 20, 2022
findepi force-pushed the findepi/arbitrary-stats branch from a5068df to ccf6c0b September 20, 2022 14:24
findepi force-pushed the findepi/arbitrary-stats branch 2 times, most recently from 4224b00 to 0649485 September 20, 2022 21:28
findepi changed the title from "Support arbitrary statistics during ANALYZE" to "Support arbitrary aggregation functions during ANALYZE" Sep 20, 2022
findepi added the enhancement label Sep 21, 2022

findepi commented Sep 21, 2022

Here is an alternative version of this PR, which maintains backward compatibility: #14233

findepi changed the title from "Support arbitrary aggregation functions during ANALYZE" to "Support arbitrary aggregation functions during ANALYZE (1)" Sep 21, 2022
findepi changed the title from "Support arbitrary aggregation functions during ANALYZE (1)" to "Support arbitrary aggregation functions during ANALYZE (v1)" Sep 21, 2022

No need to record that, since it's a purely local operation.

`ColumnStatisticMetadata` is used as a map key in `StatisticAggregationsDescriptor`. Before the change, hand-written serialization was used for that key. After the change, the map is replaced with a list of key/value pairs for serialization purposes.

The `ColumnStatisticType` enum defined what could be collected during statistics collection. While it looked generic, the available options matched exactly the stats that the Hive metastore collects. Different metadata stores may require different statistics to be collected, for example data sketches with a specific configuration.

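
The commit note about `ColumnStatisticMetadata` as a map key describes a common pattern: when a map key is a structured object, serializing the map as a list of key/value pairs avoids having to encode the key as a string by hand. A minimal Java sketch of that pattern follows. `StatKey`, `StatEntry`, and `StatEntries` are hypothetical stand-ins and not the Trino classes; the actual serialization library is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical structured map key; stands in for a class like ColumnStatisticMetadata.
record StatKey(String columnName, String statisticName) {}

// One key/value pair, which a serializer can write as a plain object.
record StatEntry(StatKey key, String value) {}

final class StatEntries {
    private StatEntries() {}

    // Flatten the map into an ordered list of pairs for serialization.
    static List<StatEntry> toEntries(Map<StatKey, String> map) {
        List<StatEntry> entries = new ArrayList<>();
        map.forEach((key, value) -> entries.add(new StatEntry(key, value)));
        return entries;
    }

    // Rebuild the map after deserialization, rejecting duplicate keys.
    static Map<StatKey, String> toMap(List<StatEntry> entries) {
        Map<StatKey, String> map = new LinkedHashMap<>();
        for (StatEntry entry : entries) {
            if (map.putIfAbsent(entry.key(), entry.value()) != null) {
                throw new IllegalArgumentException("Duplicate key: " + entry.key());
            }
        }
        return map;
    }
}
```

The in-memory type can stay a map while only the wire format changes, so callers that look up values by key are unaffected.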
findepi force-pushed the findepi/arbitrary-stats branch from 0649485 to 4973016 September 21, 2022 12:20
findepi marked this pull request as draft September 23, 2022 15:30
findepi closed this Sep 26, 2022
findepi deleted the findepi/arbitrary-stats branch September 26, 2022 14:26
Labels

cla-signed, enhancement
