Skip to content

Conversation

@wzhfy
Copy link
Contributor

@wzhfy wzhfy commented Dec 29, 2016

What changes were proposed in this pull request?

Support cardinality estimation for project operator.

How was this patch tested?

Add a test suite and a base class in the catalyst package.

@wzhfy wzhfy changed the title [SPARK-17077] [SQL] Cardinality estimation project operator [SPARK-17077] [SQL] Cardinality estimation for project operator Dec 29, 2016
@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 29, 2016

cc @rxin @hvanhovell @cloud-fan

@SparkQA
Copy link

SparkQA commented Dec 29, 2016

Test build #70703 has finished for PR 16430 at commit 12a48fa.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Dec 29, 2016

Test build #70711 has started for PR 16430 at commit d222020.

@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 29, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Dec 29, 2016

Test build #70716 has finished for PR 16430 at commit d222020.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor Author

wzhfy commented Dec 29, 2016

retest this please

@SparkQA
Copy link

SparkQA commented Dec 29, 2016

Test build #70720 has finished for PR 16430 at commit d222020.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in order to write a unit test, can we create a logical plan node with a some fake statistics that's passed in? that way we don't need everything end to end and can even put this in the catalyst package.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Basically it would be great to make this actually a unit test suite, rather than an end-to-end test suite.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(the way you can fix this is to create a leaf logical plan node with statistics you can pass in)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so rather than estimating like this, can we get the data size of the child node, and use that to estimate the data size of the parent?

for fixed length types, we know the size; for variable length types, we assume the size is evenly distributed.

e.g. if the total length is 1000, and we have rowcount 10, and we have 3 fields: a int, b long, c string

then we assume the avg length per row is 100, and the avg length of c would be 100 - 4 - 8 = 88?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can update to use this algorithm in a separate pr. we can merge this pr if we fix the issue with test.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

case a: Alias if inputAttrStats.contains(a.child) =>
  ...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also is it possible that we are seeing other NamedExpression like AttriuteReference here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to deal with AttributeReference here, we can get it directly from child.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I extract attr: Attribute because inputAttrStats is a AttributeMap and only accepts Attribute

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename this ProjectEstimationSuite?

@wzhfy
Copy link
Contributor Author

wzhfy commented Jan 6, 2017

Thanks for review! I'll fix it today

@wzhfy wzhfy force-pushed the projectEstimation branch from d222020 to a5ca31c Compare January 6, 2017 08:47
@SparkQA
Copy link

SparkQA commented Jan 6, 2017

Test build #70975 has finished for PR 16430 at commit a5ca31c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val expectedAttrStats = toAttributeMap(expectedColStats, project)
// The number of rows won't change for project.
val expectedStats = Statistics(
sizeInBytes = 2 * getRowSize(project.output, expectedAttrStats),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way this test is written getRowSize is completely untested. We can almost change getRowSize to always return 0 and all the tests would pass. Can you have test cases covering it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested getRowSize for int type. But yes, we should have a separate test for this method.

val inputAttrStats = childStats.attributeStats
// Match alias with its child's column stat
val aliasStats = project.expressions.collect {
case alias @ Alias(attr: Attribute, _) if inputAttrStats.contains(attr) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my question from before was really whether we need to match on other things as well (that are not just Alias - e.g. can an attribute be other NamedExpression?)

cc @cloud-fan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the long run, we should define a statistics interface in Expression, so that we can propagate the column stats more naturally, for more cases(not only Alias, but also Add, Mod, etc.). But currently catalyst doesn't propagate attributes correctly, e.g. https://issues.apache.org/jira/browse/SPARK-17995 (Union, Except, etc. has the same problem), we may need to hack a lot of places to propagate column stats correctly.

According to @wzhfy 's benchmark, it turns out we can speed up most of the cases if we take care of Alias, so I'm ok with the current approach.

@rxin
Copy link
Contributor

rxin commented Jan 9, 2017

Alright I'm going to merge this since this patch introduces test infrastructure that can be used by other tests. Please submit a follow-up PR to add more test cases.

@asfgit asfgit closed this in 3ccabdf Jan 9, 2017
uzadude pushed a commit to uzadude/spark that referenced this pull request Jan 27, 2017
## What changes were proposed in this pull request?

Support cardinality estimation for project operator.

## How was this patch tested?

Add a test suite and a base class in the catalyst package.

Author: Zhenhua Wang <[email protected]>

Closes apache#16430 from wzhfy/projectEstimation.
cmonkey pushed a commit to cmonkey/spark that referenced this pull request Feb 15, 2017
## What changes were proposed in this pull request?

Support cardinality estimation for project operator.

## How was this patch tested?

Add a test suite and a base class in the catalyst package.

Author: Zhenhua Wang <[email protected]>

Closes apache#16430 from wzhfy/projectEstimation.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants