Implement key-based sampling in the planner #16766
3c24f8f to 35877a9
Fixed it so that it no longer creates new plan nodes and instead uses the idiomatic replaceChildren call.
4ccdedb to 22c9322
presto-main/src/main/java/com/facebook/presto/sql/planner/PlanOptimizers.java
presto-main/src/main/java/com/facebook/presto/SystemSessionProperties.java
presto-main/src/main/java/com/facebook/presto/operator/scalar/sql/SimpleSamplingPercent.java
presto-main/src/main/java/com/facebook/presto/sql/planner/optimizations/SmartSampler.java
cc04fa5 to 2113935
de4a83e to e546149
Cleaned up more: renamed the function to key_sampling_percent so it's generally useful, and also documented it.
76a4d49 to e959496
rongrong left a comment
There's merge conflict. Please rebase.
Do you also want to introduce a version of this function for varbinary?
Hmm, interesting. I will open an issue for later. For now we are using only the varchar version.
This is still not really a description of what this feature does. If you think there's no way to describe it clearly here, add proper documentation of this feature and refer to it so people can learn more about it.
Yeah - where do I add it? Add a chapter to the user doc?
Actually the PR message has pretty much the documentation.
Users won't read PR messages though. This is ok for now. But we probably want to document recommended usage, behavior and limitations once we polish the feature off.
e959496 to 75e54f7
Done. Irritating that IntelliJ is inconsistent with this.
nits:
keys.stream()
        .filter(x -> TypeUtils.isIntegralType(x.getType().getTypeSignature(), functionAndTypeManager))
        .findFirst();
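For readers without the diff at hand, here is a minimal, self-contained sketch of the selection rule the PR description states (first integral key, otherwise first string key), written in the stream style suggested above. The `Key` and `KeyType` types and the `chooseSamplingKey` name are hypothetical stand-ins for illustration only; they are not the planner's actual classes.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class SamplingKeySelectionSketch
{
    private SamplingKeySelectionSketch() {}

    // Hypothetical, simplified type model; the real planner uses its own type system.
    enum KeyType { INTEGER, BIGINT, VARCHAR, DOUBLE }

    // Hypothetical stand-in for a key column found while traversing the plan.
    static final class Key
    {
        final String name;
        final KeyType type;

        Key(String name, KeyType type)
        {
            this.name = name;
            this.type = type;
        }
    }

    static boolean isIntegral(KeyType type)
    {
        return type == KeyType.INTEGER || type == KeyType.BIGINT;
    }

    // Prefer the first integral key; fall back to the first varchar key.
    static Optional<Key> chooseSamplingKey(List<Key> keys)
    {
        Optional<Key> integral = keys.stream()
                .filter(key -> isIntegral(key.type))
                .findFirst();
        if (integral.isPresent()) {
            return integral;
        }
        return keys.stream()
                .filter(key -> key.type == KeyType.VARCHAR)
                .findFirst();
    }

    public static void main(String[] args)
    {
        List<Key> keys = Arrays.asList(
                new Key("price", KeyType.DOUBLE),
                new Key("name", KeyType.VARCHAR),
                new Key("id", KeyType.BIGINT));
        // The integral key wins even though the varchar key appears earlier.
        System.out.println(chooseSamplingKey(keys).map(key -> key.name).orElse("<none>"));
    }
}
```

Returning an `Optional` keeps the "no usable key" case explicit, so a caller can skip sampling for that branch instead of failing.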
nits: break into multiple lines.
This can throw if the function doesn't exist. You probably want to handle that.
Hmm - doesn't it throw PrestoException? Then we can just propagate it?
The error message might not be clear to users. We'd probably throw a function-not-found error, but the function does not appear in the user's query. Ideally we'd catch the exception here and rewrite the error message to something like "sampling method not found".
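A rough sketch of that catch-and-rewrite pattern, under stated assumptions: every name below (`FunctionNotFoundException`, `SamplingConfigurationException`, `resolve`, the map-based registry) is hypothetical and stands in for whatever the planner really uses, and none of it is Presto API. It only shows the shape of the handling: catch the low-level lookup failure and rethrow with a message about sampling, keeping the original as the cause.

```java
import java.util.Collections;
import java.util.Map;
import java.util.function.Function;

final class SamplingFunctionLookupSketch
{
    private SamplingFunctionLookupSketch() {}

    // Hypothetical low-level error raised when a function name cannot be resolved.
    static final class FunctionNotFoundException extends RuntimeException
    {
        FunctionNotFoundException(String message)
        {
            super(message);
        }
    }

    // Hypothetical user-facing error that talks about sampling, not functions.
    static final class SamplingConfigurationException extends RuntimeException
    {
        SamplingConfigurationException(String message, Throwable cause)
        {
            super(message, cause);
        }
    }

    // Hypothetical resolver: throws if the name is not registered.
    static Function<String, Double> resolve(Map<String, Function<String, Double>> registry, String name)
    {
        Function<String, Double> function = registry.get(name);
        if (function == null) {
            throw new FunctionNotFoundException("Function " + name + " not registered");
        }
        return function;
    }

    // Wraps the lookup so the user sees an error about sampling rather than
    // about a function that never appeared in their query.
    static Function<String, Double> resolveSamplingFunction(Map<String, Function<String, Double>> registry, String name)
    {
        try {
            return resolve(registry, name);
        }
        catch (FunctionNotFoundException e) {
            throw new SamplingConfigurationException("Sampling method not found: " + name, e);
        }
    }

    public static void main(String[] args)
    {
        Map<String, Function<String, Double>> empty = Collections.emptyMap();
        try {
            resolveSamplingFunction(empty, "key_sampling_percent");
        }
        catch (SamplingConfigurationException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original exception as the cause preserves the stack trace for whoever debugs the failure while the user-facing message stays about sampling.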
I don't think these are necessary.
It looked idiomatic - all planner classes use it. So copied lol. Removed now.
Not all. Many didn't I think. 😛
75e54f7 to f7c937a
86968fa to d4afcc9
d4afcc9 to 79b33c4
We have many ad hoc queries over large datasets that keep failing, running for a long time, or timing out. For cases where the user just wants to get a sense of the results, we do smart sampling by mapping hashes of keys to a percentage. When the feature is enabled, we traverse the query plan and sample on the first integer or string key found (in that order).
We apply the sampling predicate only once in each branch of the plan graph, so that eventually all qualifying scans are sampled.
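To make the idea of mapping a key's hash to a percentage concrete, here is a minimal sketch. It assumes CRC32 as the hash purely because it ships with the JDK; the hash actually used by key_sampling_percent is not stated in this thread, and the `keyToFraction`, `keep`, and `sample` names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

final class KeySamplingSketch
{
    private KeySamplingSketch() {}

    // Maps a key to a value in [0.0, 1.0) based on its hash.
    // CRC32 is an assumption for illustration, not the hash used by the PR.
    static double keyToFraction(String key)
    {
        CRC32 crc = new CRC32();
        crc.update(key.getBytes(StandardCharsets.UTF_8));
        // getValue() is a non-negative long, so the remainder is in 0..99.
        return (crc.getValue() % 100) / 100.0;
    }

    // Sampling predicate: keep a row when its key falls below the sample fraction.
    static boolean keep(String key, double sampleFraction)
    {
        return keyToFraction(key) < sampleFraction;
    }

    static List<String> sample(List<String> keys, double sampleFraction)
    {
        return keys.stream()
                .filter(key -> keep(key, sampleFraction))
                .collect(Collectors.toList());
    }

    public static void main(String[] args)
    {
        List<String> keys = Arrays.asList("user-1", "user-2", "user-3", "user-4");
        // The same key always gets the same decision, whichever table it comes from.
        System.out.println(sample(keys, 0.5));
    }
}
```

Because the decision depends only on the key value, every scan that filters on the same key with the same fraction keeps the same set of key values, unlike independent random row sampling.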
Test plan - added tests