[SPARK-17077] [SQL] Cardinality estimation for project operator #16430

wzhfy · 2016-12-29T03:24:43Z

What changes were proposed in this pull request?

Support cardinality estimation for project operator.

How was this patch tested?

Add a test suite and a base class in the catalyst package.

wzhfy · 2016-12-29T03:37:31Z

cc @rxin @hvanhovell @cloud-fan

SparkQA · 2016-12-29T05:41:49Z

Test build #70703 has finished for PR 16430 at commit 12a48fa.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-12-29T06:42:39Z

Test build #70711 has started for PR 16430 at commit d222020.

wzhfy · 2016-12-29T08:22:54Z

retest this please

SparkQA · 2016-12-29T10:17:45Z

Test build #70716 has finished for PR 16430 at commit d222020.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

wzhfy · 2016-12-29T12:08:05Z

retest this please

SparkQA · 2016-12-29T14:29:14Z

Test build #70720 has finished for PR 16430 at commit d222020.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-01-06T05:28:14Z

sql/core/src/test/scala/org/apache/spark/sql/estimation/EstimationSuite.scala

in order to write a unit test, can we create a logical plan node with some fake statistics that are passed in? that way we don't need everything end to end and can even put this in the catalyst package.

Basically it would be great to make this actually a unit test suite, rather than an end-to-end test suite.

(the way you can fix this is to create a leaf logical plan node with statistics you can pass in)
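A minimal sketch of the idea above, using simplified stand-in types rather than the real catalyst classes. Every name here (`Stats`, `Plan`, `FakeStatsLeaf`, `ToyProject`) is hypothetical and only illustrates a leaf node whose statistics are injected by the test:

```scala
object FakeStatsExample {
  // Hypothetical, simplified stand-in for plan statistics (not Spark's Statistics class).
  final case class Stats(sizeInBytes: BigInt, rowCount: Option[BigInt])

  sealed trait Plan {
    def children: Seq[Plan]
    def stats: Stats
  }

  // Leaf node: its statistics are supplied by the caller, not derived from any data source.
  final case class FakeStatsLeaf(output: Seq[String], injected: Stats) extends Plan {
    val children: Seq[Plan] = Nil
    def stats: Stats = injected
  }

  // Toy project: the row count is unchanged by a projection.
  // This sketch leaves sizeInBytes untouched; a real estimator would recompute it.
  final case class ToyProject(projectList: Seq[String], child: Plan) extends Plan {
    val children: Seq[Plan] = Seq(child)
    def stats: Stats = Stats(child.stats.sizeInBytes, child.stats.rowCount)
  }
}
```

With a node like this, a test can build `ToyProject(Seq("a"), FakeStatsLeaf(...))` and assert on the resulting statistics without reading any table or starting a session.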

rxin · 2017-01-06T05:35:11Z

.../src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/EstimationUtils.scala

so rather than estimating like this, can we get the data size of the child node, and use that to estimate the data size of the parent?

for fixed length types, we know the size; for variable length types, we assume the size is evenly distributed.

e.g. if the total length is 1000, and we have rowcount 10, and we have 3 fields: a int, b long, c string

then we assume the avg length per row is 100, and the avg length of c would be 100 - 4 - 8 = 88?

we can update to use this algorithm in a separate pr. we can merge this pr if we fix the issue with test.
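The arithmetic in the comment above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: the `ColType` hierarchy and `avgVariableWidth` are hypothetical names, and splitting the leftover bytes evenly across several variable-length columns is one reading of "evenly distributed".

```scala
object AvgWidthSketch {
  sealed trait ColType
  case object IntCol extends ColType    // fixed width: 4 bytes
  case object LongCol extends ColType   // fixed width: 8 bytes
  case object StringCol extends ColType // variable width

  // Known width for fixed-length types, None for variable-length types.
  def fixedWidth(t: ColType): Option[Long] = t match {
    case IntCol    => Some(4L)
    case LongCol   => Some(8L)
    case StringCol => None
  }

  // Average bytes assumed per variable-length column.
  // Returns None when there are no rows or no variable-length columns.
  def avgVariableWidth(totalBytes: Long, rowCount: Long, cols: Seq[ColType]): Option[Long] = {
    val fixedBytes = cols.flatMap(c => fixedWidth(c).toList).sum
    val numVariable = cols.count(c => fixedWidth(c).isEmpty)
    if (rowCount <= 0 || numVariable == 0) None
    else {
      val avgRowBytes = totalBytes / rowCount
      // Clamp at zero in case the fixed-width columns already exceed the average row.
      Some(math.max(0L, (avgRowBytes - fixedBytes) / numVariable))
    }
  }
}
```

For the example in the comment, `avgVariableWidth(1000L, 10L, Seq(IntCol, LongCol, StringCol))` gives `Some(88)`, i.e. 100 - 4 - 8.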

rxin · 2017-01-06T05:37:13Z

...rc/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/ProjectEstimation.scala

case a: Alias if inputAttrStats.contains(a.child) => ...

also is it possible that we are seeing other NamedExpression like AttributeReference here?

We don't need to deal with AttributeReference here, we can get it directly from child.

I extract attr: Attribute because inputAttrStats is an AttributeMap and only accepts Attribute.

rxin · 2017-01-06T05:55:49Z

sql/core/src/test/scala/org/apache/spark/sql/estimation/EstimationSuite.scala

rename this ProjectEstimationSuite?

wzhfy · 2017-01-06T07:01:05Z

Thanks for review! I'll fix it today

SparkQA · 2017-01-06T11:08:26Z

Test build #70975 has finished for PR 16430 at commit a5ca31c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-01-06T23:44:14Z

...st/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/ProjectEstimationSuite.scala

+    val expectedAttrStats = toAttributeMap(expectedColStats, project)
+    // The number of rows won't change for project.
+    val expectedStats = Statistics(
+      sizeInBytes = 2 * getRowSize(project.output, expectedAttrStats),


the way this test is written, getRowSize is completely untested: the expected value is itself computed with getRowSize, so we could change getRowSize to always return 0 and almost all the tests would still pass. Can you have test cases covering it?

I tested getRowSize for int type. But yes, we should have a separate test for this method.
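To illustrate the concern, here is a small sketch with a hypothetical stand-in for getRowSize (a plain sum of assumed column widths, which is not necessarily what the real method computes). It contrasts a check whose expected value comes from the function under test with one that uses a hand-computed literal:

```scala
object RowSizeTestSketch {
  // Hypothetical stand-in for getRowSize: sum of assumed per-column widths in bytes.
  def rowSize(widths: Seq[Long]): Long = widths.sum

  // Deliberately broken variant, mirroring "always return 0".
  def brokenRowSize(widths: Seq[Long]): Long = 0L

  // Tautological check: the expected value is produced by the function being tested,
  // so any implementation passes.
  def tautological(f: Seq[Long] => Long, widths: Seq[Long], rows: Long): Boolean =
    rows * f(widths) == rows * f(widths)

  // Independent check: the expected value is a literal worked out by hand.
  def independent(f: Seq[Long] => Long, widths: Seq[Long], rows: Long, expected: Long): Boolean =
    rows * f(widths) == expected
}
```

For two rows of an int and a long column (4 + 8 bytes), the hand-computed size is 24. The tautological check accepts `brokenRowSize`, while the independent check rejects it and accepts `rowSize`.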

rxin · 2017-01-06T23:47:15Z

...in/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/ProjectEstimation.scala

+      val inputAttrStats = childStats.attributeStats
+      // Match alias with its child's column stat
+      val aliasStats = project.expressions.collect {
+        case alias @ Alias(attr: Attribute, _) if inputAttrStats.contains(attr) =>


my question from before was really whether we need to match on other things as well (that are not just Alias - e.g. can an attribute be other NamedExpression?)

cc @cloud-fan

In the long run, we should define a statistics interface in Expression, so that we can propagate the column stats more naturally, for more cases (not only Alias, but also Add, Mod, etc.). But currently catalyst doesn't propagate attributes correctly, e.g. https://issues.apache.org/jira/browse/SPARK-17995 (Union, Except, etc. have the same problem), so we may need to hack a lot of places to propagate column stats correctly.

According to @wzhfy's benchmark, it turns out we can speed up most of the cases if we take care of Alias, so I'm ok with the current approach.
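The Alias-only approach discussed in this thread can be sketched with a toy expression model. The types below (`Expr`, `Attr`, `Alias`, `Add`, `ColStat`) and `projectStats` are hypothetical simplifications: real catalyst attributes are identified by expression IDs rather than names, and the actual ProjectEstimation code may differ.

```scala
object AliasStatsSketch {
  sealed trait Expr
  final case class Attr(name: String) extends Expr
  final case class Alias(child: Expr, name: String) extends Expr
  final case class Add(left: Expr, right: Expr) extends Expr

  // Hypothetical, reduced column statistics.
  final case class ColStat(distinctCount: Long, nullCount: Long)

  // Column stats for the output of a projection, given the child's column stats.
  def projectStats(projectList: Seq[Expr], input: Map[Attr, ColStat]): Map[Attr, ColStat] = {
    // An alias that directly wraps an attribute inherits that attribute's stats.
    val aliasStats = projectList.collect {
      case Alias(attr: Attr, name) if input.contains(attr) => Attr(name) -> input(attr)
    }
    // Plain attributes keep the stats they already have in the child.
    val passThrough = projectList.collect {
      case a: Attr if input.contains(a) => a -> input(a)
    }
    // Anything else (e.g. an alias over Add) gets no column stats in this sketch.
    (aliasStats ++ passThrough).toMap
  }
}
```

In this sketch, projecting `a`, `b AS b2`, and `(a + b) AS s` yields stats for `a` and `b2` only, which is the limitation a statistics interface on Expression would aim to lift.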

rxin · 2017-01-09T05:15:18Z

Alright I'm going to merge this since this patch introduces test infrastructure that can be used by other tests. Please submit a follow-up PR to add more test cases.

## What changes were proposed in this pull request?

Support cardinality estimation for project operator.

## How was this patch tested?

Add a test suite and a base class in the catalyst package.

Author: Zhenhua Wang

Closes apache#16430 from wzhfy/projectEstimation.

wzhfy changed the title ~~[SPARK-17077] [SQL] Cardinality estimation project operator~~ [SPARK-17077] [SQL] Cardinality estimation for project operator Dec 29, 2016


wzhfy added 2 commits January 6, 2017 14:31

project estimation (f9c378b)

fix typo (2557f4a)

refactor test suite (a5ca31c)

wzhfy force-pushed the projectEstimation branch from d222020 to a5ca31c January 6, 2017 08:47


asfgit closed this in 3ccabdf Jan 9, 2017
