-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-17077] [SQL] Cardinality estimation for project operator #16430
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #70703 has finished for PR 16430 at commit
|
|
Test build #70711 has started for PR 16430 at commit |
|
retest this please |
|
Test build #70716 has finished for PR 16430 at commit
|
|
retest this please |
|
Test build #70720 has finished for PR 16430 at commit
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in order to write a unit test, can we create a logical plan node with a some fake statistics that's passed in? that way we don't need everything end to end and can even put this in the catalyst package.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically it would be great to make this actually a unit test suite, rather than an end-to-end test suite.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(the way you can fix this is to create a leaf logical plan node with statistics you can pass in)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so rather than estimating like this, can we get the data size of the child node, and use that to estimate the data size of the parent?
for fixed length types, we know the size; for variable length types, we assume the size is evenly distributed.
e.g. if the total length is 1000, and we have rowcount 10, and we have 3 fields: a int, b long, c string
then we assume the avg length per row is 100, and the avg length of c would be 100 - 4 - 8 = 88?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can update to use this algorithm in a separate pr. we can merge this pr if we fix the issue with test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
case a: Alias if inputAttrStats.contains(a.child) =>
...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also is it possible that we are seeing other NamedExpression like AttriuteReference here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't need to deal with AttributeReference here, we can get it directly from child.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I extract attr: Attribute because inputAttrStats is a AttributeMap and only accepts Attribute
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rename this ProjectEstimationSuite?
|
Thanks for review! I'll fix it today |
d222020 to
a5ca31c
Compare
|
Test build #70975 has finished for PR 16430 at commit
|
| val expectedAttrStats = toAttributeMap(expectedColStats, project) | ||
| // The number of rows won't change for project. | ||
| val expectedStats = Statistics( | ||
| sizeInBytes = 2 * getRowSize(project.output, expectedAttrStats), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the way this test is written getRowSize is completely untested. We can almost change getRowSize to always return 0 and all the tests would pass. Can you have test cases covering it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested getRowSize for int type. But yes, we should have a separate test for this method.
| val inputAttrStats = childStats.attributeStats | ||
| // Match alias with its child's column stat | ||
| val aliasStats = project.expressions.collect { | ||
| case alias @ Alias(attr: Attribute, _) if inputAttrStats.contains(attr) => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
my question from before was really whether we need to match on other things as well (that are not just Alias - e.g. can an attribute be other NamedExpression?)
cc @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the long run, we should define a statistics interface in Expression, so that we can propagate the column stats more naturally, for more cases(not only Alias, but also Add, Mod, etc.). But currently catalyst doesn't propagate attributes correctly, e.g. https://issues.apache.org/jira/browse/SPARK-17995 (Union, Except, etc. has the same problem), we may need to hack a lot of places to propagate column stats correctly.
According to @wzhfy 's benchmark, it turns out we can speed up most of the cases if we take care of Alias, so I'm ok with the current approach.
|
Alright I'm going to merge this since this patch introduces test infrastructure that can be used by other tests. Please submit a follow-up PR to add more test cases. |
## What changes were proposed in this pull request? Support cardinality estimation for project operator. ## How was this patch tested? Add a test suite and a base class in the catalyst package. Author: Zhenhua Wang <[email protected]> Closes apache#16430 from wzhfy/projectEstimation.
## What changes were proposed in this pull request? Support cardinality estimation for project operator. ## How was this patch tested? Add a test suite and a base class in the catalyst package. Author: Zhenhua Wang <[email protected]> Closes apache#16430 from wzhfy/projectEstimation.
What changes were proposed in this pull request?
Support cardinality estimation for project operator.
How was this patch tested?
Add a test suite and a base class in the catalyst package.