Skip to content

Add reservoir_sample aggregation function#21296

Merged
tdcmeehan merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-reservoir-sample-functions
Dec 11, 2023
Merged

Add reservoir_sample aggregation function#21296
tdcmeehan merged 1 commit intoprestodb:masterfrom
ZacBlanco:upstream-reservoir-sample-functions

Conversation

@ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Nov 2, 2023

Description

This commit introduces a new reservoir_sample aggregate function which, as opposed to the existing TABLESAMPLE operator lets users pick a fixed sample size.

The fixed sample sizes lets users create samples of a known total size while guaranteeing every record has an equal probability of being chosen.

Motivation and Context

Useful for generating fixed-size samples

Impact

New reservoir_sample aggregation function.

Test Plan

Unit tests for the function are included.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* A reservoir_sample aggregation function is now available

@ZacBlanco ZacBlanco requested a review from a team as a code owner November 2, 2023 00:14
@ZacBlanco ZacBlanco requested a review from presto-oss November 2, 2023 00:14
@github-actions
Copy link

github-actions bot commented Nov 2, 2023

Codenotify: Notifying subscribers in CODENOTIFY files for diff 24aec09...6f47a4c.

Notify File(s)
@steveburnett presto-docs/src/main/sphinx/functions/aggregate.rst
presto-docs/src/main/sphinx/sql/select.rst

@ZacBlanco ZacBlanco force-pushed the upstream-reservoir-sample-functions branch 3 times, most recently from cb91ef5 to 5aac275 Compare November 2, 2023 17:52
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc! I made some suggestions for conciseness and readability. As always, if my suggestions change your intended meaning in a way that doesn't accurately represent your intended meaning, let me know and I'm happy to discuss.

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Tested latest changes in a new local build, the docs look fine. Thanks!

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two small nits rechecking the doc in a new local build.

@ZacBlanco
Copy link
Contributor Author

@ClarenceThreepwood FYI..found the cause of the issue I was having earlier. I needed the function to be called on null inputs, but isCalledOnNullInput in the function header doesn't automatically fix it.

I have to put the undocumented @NullablePosition annotation alongside the nullable parameters as well to get the input function to be called

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a pointer in the docs or as a code comment on what the error message would be if a very large sample would need to cross an exchange boundary. IIRC it was the same error as any other non-scalar type that doesn't fit a max-width limit we have for non-scalars ?

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the additions! I've suggested some minor punctuation and phrasing changes for you to consider.

@ZacBlanco ZacBlanco force-pushed the upstream-reservoir-sample-functions branch from fd4f489 to 56145b5 Compare November 16, 2023 18:00
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

New local build, everything looks great. Thanks!

@ZacBlanco ZacBlanco force-pushed the upstream-reservoir-sample-functions branch from 56145b5 to 7d4a024 Compare November 21, 2023 09:14
@ZacBlanco ZacBlanco requested a review from aaneja November 21, 2023 09:14
@ZacBlanco ZacBlanco force-pushed the upstream-reservoir-sample-functions branch from 7d4a024 to 7df7367 Compare November 21, 2023 09:17
This commit introduces a new `reservoir_sample` aggregate
function which, as opposed to the existing TABLESAMPLE
operator lets users pick a fixed sample size.

The fixed sample sizes lets users create samples of a known
total size while guaranteeing every record has an equal
probability of being chosen.

Co-authored-by: xiz675 <32505316+xiz675@users.noreply.github.com>
@ZacBlanco ZacBlanco force-pushed the upstream-reservoir-sample-functions branch from 7df7367 to 6f47a4c Compare December 7, 2023 23:56
@tdcmeehan tdcmeehan self-assigned this Dec 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants