Skip to content

Implement key-based sampling in the planner#16766

Merged
rongrong merged 1 commit intoprestodb:masterfrom
kaikalur:smart-sampling
Sep 29, 2021
Merged

Implement key-based sampling in the planner#16766
rongrong merged 1 commit intoprestodb:masterfrom
kaikalur:smart-sampling

Conversation

@kaikalur
Copy link
Copy Markdown
Contributor

@kaikalur kaikalur commented Sep 20, 2021

We have many ad hoc queries working on large datasets that keep failing/running long time/timing out. So for cases when the user just wants to get a sense of the results, we do smart sampling based on mapping hashes of keys to a percent. When the feature is enabled, we traverse the plan for the query and sample the first integer or string key found (in that order)

  • both sides of an equi-join
  • group by
  • partition by of a window/rownum/topnrownum
  • left-side (source) of semijoin

We apply the sampling predicate only once in every branch of graph so that eventually all qualifying scans will be sampled.

Test plan - added tests

Please make sure your submission complies with our Development, Formatting, and Commit Message guidelines. Don't forget to follow our attribution guidelines for any code copied from other projects.

Fill in the release notes towards the bottom of the PR description.
See Release Notes Guidelines for details.

== RELEASE NOTES ==

General Changes
* Added Smart Sampling feature that samples the tables at query (planning) time. This helps in cases when the user is just exploring the data, testing a pipeline or looking for trends. We sample all the tables (and only tables) in such a way that all "key-based" operations like JOIN/GROUP BY/window functions etc will be complete for the sampled keys. 

@kaikalur kaikalur requested a review from rongrong September 20, 2021 18:03
@kaikalur kaikalur changed the title Implement key-based sampling in the oplanner Implement key-based sampling in the planner Sep 20, 2021
@kaikalur kaikalur requested a review from highker September 20, 2021 22:20
@kaikalur
Copy link
Copy Markdown
Contributor Author

Fixed it to not create new plan nodes, instead use the idiomatic replaceChildren call.

@kaikalur kaikalur force-pushed the smart-sampling branch 2 times, most recently from 4ccdedb to 22c9322 Compare September 21, 2021 16:54
@kaikalur kaikalur force-pushed the smart-sampling branch 2 times, most recently from cc04fa5 to 2113935 Compare September 23, 2021 00:36
@kaikalur kaikalur requested a review from rongrong September 23, 2021 00:37
@kaikalur kaikalur force-pushed the smart-sampling branch 4 times, most recently from de4a83e to e546149 Compare September 24, 2021 18:41
@kaikalur
Copy link
Copy Markdown
Contributor Author

Cleaned up more - renamed the function to key_sampling_percent so it's generally useful and also documented it.

@kaikalur kaikalur force-pushed the smart-sampling branch 2 times, most recently from 76a4d49 to e959496 Compare September 28, 2021 03:36
Copy link
Copy Markdown
Contributor

@rongrong rongrong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's merge conflict. Please rebase.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you also want to introduce a version of this function for varbinary?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm interesting. I will open an issue for later. For now are using only varchar version.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not really a description of what this feature does. If you think there's no way to describe it clearly here, add proper documentation of this feature and refer it so people can learn more about it.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah - where do I add it? Add a chapter to the user doc?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually the PR message has pretty much the documentation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users won't read PR messages though. This is ok for now. But we probably want to document recommended usage, behavior and limitations once we polish the feature off.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Irritating that Intellij is inconsistent with this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

static import

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits:

keys.stream()
.filter(x -> TypeUtils.isIntegralType(x.getType().getTypeSignature(), functionAndTypeManager))
.findFirst();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nits: break into multiple lines.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can throw if the function doesn't exist. You probably want to handle that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm - doesn't it throw PrestoException? Then we can just propagate it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The error message might not be clear to users. We'd probably throw a function not found error, but the function is not used in user query. Ideally we'd probably want to catch here and rewrite the error message to something like sampling method not found.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

Comment on lines 74 to 78
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think these are necessary.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looked idiomatic - all planner classes use it. So copied lol. Removed now.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not all. Many didn't I think. 😛

@kaikalur kaikalur requested a review from rongrong September 29, 2021 03:05
@kaikalur kaikalur force-pushed the smart-sampling branch 3 times, most recently from 86968fa to d4afcc9 Compare September 29, 2021 17:45
@rongrong rongrong merged commit ae77100 into prestodb:master Sep 29, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants