-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17075][SQL] implemented filter estimation #16395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ok to test |
|
Test build #70562 has finished for PR 16395 at commit
|
56d1579 to
e9d4f4d
Compare
|
Test build #70788 has finished for PR 16395 at commit
|
|
Test build #70789 has finished for PR 16395 at commit
|
b1932fb to
1fc44a9
Compare
|
Test build #70830 has finished for PR 16395 at commit
|
1fc44a9 to
784015e
Compare
|
cc @wzhfy @rxin @srinathshankar @hvanhovell @cloud-fan |
|
Test build #70894 has finished for PR 16395 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar to comment i made on the project PR, it'd be great to just create a leaf logical plan node in which we can pass arbitrary statistics and use that to make all the estimation suites unit test suites, rather than end-to-end test suites.
That way we can also have more control over the input we test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not thread safe. maybe turn FilterEstimation into a class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some documentation here on the high level algorithm?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
basically i spent 2 mins reading this code and i have no idea how it works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
document what "update" means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null values or 0 rows?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two methods?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please use // to document inline comments
/** */ are reserved for classdocs/function docs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the return value? selectivity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the return value? selectivity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the return value? selectivity?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.
|
@ron8hu Can you update the test cases based on the latest master? We have a new test infrastructure now. |
784015e to
210b11b
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this necessary? isn't this just
case op @ EqualTo(ar: AttributeReference, l: Literal) =>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I can use patterns for variable binding. Fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can probably remove this since it doesn't really carry any information ... (plan's type is already Filter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need this LogicalPlan so that we can access its child node's statistics information.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as i commented on the other pr, i think we should use named arguments here so readers would know what 0, 4 ,4 means.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also i'd rename filteredColStats to just expected
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed.
954d2fc to
eac69af
Compare
|
Test build #73376 has finished for PR 16395 at commit
|
| } | ||
|
|
||
| case Or(cond1, cond2) => | ||
| // For ease of debugging, we compute percent1 and percent2 in 2 statements. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: this can also apply to the And case
| * @return an optional double value to show the percentage of rows meeting a given condition | ||
| * It returns None if no statistics collected for a given column. | ||
| */ | ||
| def evaluateIsNull( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: evaluateNullCheck
| hSet: Set[Any], | ||
| update: Boolean) | ||
| : Option[Double] = { | ||
| if (!mutableColStats.contains(attrRef.exprId)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can have a method for this logic
| } | ||
|
|
||
| def rangeContainsLiteral(r: Range, lit: Literal): Boolean = r match { | ||
| case _: DefaultRange => true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we move this logic into each Range implementation?
| // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal. | ||
| // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec. | ||
| val hSetBigdec = hSet.map(e => BigDecimal(e.toString)) | ||
| val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can use rangeContainsLiteral here.
| // So we will change the order if not. | ||
|
|
||
| // EqualTo does not care about the order | ||
| case op @ EqualTo(ar: AttributeReference, l: Literal) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we also handle EqualNullSafe?
|
|
||
| // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal. | ||
| // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec. | ||
| val hSetBigdec = hSet.map(e => BigDecimal(e.toString)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we should filter out null values first
| ar: AttributeReference, | ||
| filterNode: Filter, | ||
| expectedColStats: ColumnStat, | ||
| rowCount: Option[BigInt] = None) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use BigInt please, all the callers pass a Some(value)
|
LGTM except some minor comments, you can address them in follow-up |
|
Test build #73377 has finished for PR 16395 at commit
|
|
thanks, merging to master! |
## What changes were proposed in this pull request? This is a follow-up of #16395. It fixes some code style issues, naming issues, some missing cases in pattern match, etc. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes #17065 from cloud-fan/follow-up.
## What changes were proposed in this pull request? We traverse predicate and evaluate the logical expressions to compute the selectivity of a FILTER operator. ## How was this patch tested? We add a new test suite to test various logical operators. Author: Ron Hu <[email protected]> Closes apache#16395 from ron8hu/filterSelectivity.
## What changes were proposed in this pull request? This is a follow-up of apache#16395. It fixes some code style issues, naming issues, some missing cases in pattern match, etc. ## How was this patch tested? existing tests. Author: Wenchen Fan <[email protected]> Closes apache#17065 from cloud-fan/follow-up.
What changes were proposed in this pull request?
We traverse predicate and evaluate the logical expressions to compute the selectivity of a FILTER operator.
How was this patch tested?
We add a new test suite to test various logical operators.