Add function names to ColumnStatisticType#20871

Closed
ZacBlanco wants to merge 1 commit into prestodb:master from ZacBlanco:colstat-function-upstream

Conversation

@ZacBlanco
Contributor

@ZacBlanco ZacBlanco commented Sep 14, 2023

Description

This change adds function name parameters to the enum values of ColumnStatisticType. The function names are used when generating ColumnStatisticMetadata for the ANALYZE query.

This change prepares for letting connectors override the function used to execute the statistics aggregations, mainly to support connectors whose underlying histogram or NDV (number of distinct values) representations differ. For example, Hive natively generates KLL sketches from Apache DataSketches for histograms, while Spark has its own custom format. Some other connectors, such as MySQL, use a JSON format. Iceberg tables can store NDV estimates in Apache DataSketches Theta sketches. This change is necessary if we intend to support a variety of statistics from other connectors.

Allowing the connector to override the function for each statistic type will let us generate statistic data in the format needed by the connector in order to store it in the connector-specific catalog metadata.
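
To make the idea concrete, here is a minimal sketch of how a default function name per statistic type could combine with a connector-supplied override. It is not the code in this PR: `StatType` is a simplified stand-in for `ColumnStatisticType`, and `StatFunctionResolver`, its methods, the chosen default function names, and the override name `hypothetical_theta_sketch` are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified stand-in for ColumnStatisticType (assumption: only a subset of
// values is shown, and the default function names are illustrative).
enum StatType {
    MIN_VALUE("min"),
    MAX_VALUE("max"),
    NUMBER_OF_DISTINCT_VALUES("approx_distinct"),
    NUMBER_OF_NON_NULL_VALUES("count");

    private final String defaultFunctionName;

    StatType(String defaultFunctionName) {
        this.defaultFunctionName = defaultFunctionName;
    }

    String getDefaultFunctionName() {
        return defaultFunctionName;
    }
}

// Hypothetical helper: a connector registers overrides for the statistic
// types whose storage format differs from the engine default.
final class StatFunctionResolver {
    private final Map<StatType, String> overrides = new EnumMap<>(StatType.class);

    StatFunctionResolver override(StatType type, String functionName) {
        overrides.put(type, functionName);
        return this;
    }

    // Returns the connector's override if one exists, else the enum default.
    String functionFor(StatType type) {
        return overrides.getOrDefault(type, type.getDefaultFunctionName());
    }
}
```

Under these assumptions, a connector that stores NDV estimates as Theta sketches would call `new StatFunctionResolver().override(StatType.NUMBER_OF_DISTINCT_VALUES, "hypothetical_theta_sketch")`, and every other statistic type would still resolve to its default aggregation.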

Motivation and Context

Eventual implementation of more column statistics

Impact

No user-facing impact.

Test Plan

No additional features need to be tested

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== NO RELEASE NOTE ==

@ZacBlanco requested a review from a team as a code owner September 14, 2023 21:45
@ZacBlanco force-pushed the colstat-function-upstream branch from 64de279 to f86ff8c September 15, 2023 00:15
@ZacBlanco
Contributor Author

Closing this in favor of #20993.

@ZacBlanco ZacBlanco closed this Sep 29, 2023