[SPARK-17075][SQL] implemented filter estimation #16395

ron8hu · 2016-12-24T03:33:58Z

What changes were proposed in this pull request?

We traverse the predicate and evaluate its logical expressions to compute the selectivity of a FILTER operator.
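
Editor's sketch (not code from this patch): one common way to turn a predicate tree into a selectivity is to combine the selectivities of its children under an independence assumption. The `Pred` model, the class names, and the choice to treat a side without statistics as selectivity 1.0 are all assumptions made for illustration; the patch's actual handling may differ.

```scala
// Hypothetical, simplified predicate model; these are not Catalyst's expression classes.
sealed trait Pred
case class Leaf(selectivity: Option[Double]) extends Pred // None means no column statistics
case class AndPred(left: Pred, right: Pred) extends Pred
case class OrPred(left: Pred, right: Pred) extends Pred
case class NotPred(child: Pred) extends Pred

object SelectivitySketch {
  // Returns the estimated fraction of rows satisfying `p`, or None if nothing is known.
  def estimate(p: Pred): Option[Double] = p match {
    case Leaf(s) => s
    // Independence assumption: P(A and B) = P(A) * P(B)
    case AndPred(l, r) => combine(estimate(l), estimate(r))((a, b) => a * b)
    // Inclusion-exclusion under independence: P(A or B) = P(A) + P(B) - P(A) * P(B)
    case OrPred(l, r) => combine(estimate(l), estimate(r))((a, b) => a + b - a * b)
    case NotPred(c) => estimate(c).map(1.0 - _)
  }

  // Assumption: a side without statistics is treated as "keeps every row" (1.0).
  private def combine(a: Option[Double], b: Option[Double])(
      f: (Double, Double) => Double): Option[Double] =
    (a, b) match {
      case (None, None) => None
      case _ => Some(f(a.getOrElse(1.0), b.getOrElse(1.0)))
    }
}
```

For example, two conjuncts with selectivities 0.5 and 0.25 combine to 0.125, and the same pair under `OrPred` combines to 0.625.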

How was this patch tested?

We add a new test suite to test various logical operators.

ron8hu · 2016-12-24T03:44:59Z

cc @wzhfy @rxin @hvanhovell @cloud-fan

cloud-fan · 2016-12-24T03:59:14Z

ok to test

rxin · 2016-12-24T04:39:34Z

cc @srinathshankar

SparkQA · 2016-12-24T06:16:47Z

Test build #70562 has finished for PR 16395 at commit 56d1579.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class NumericRange(min: JDecimal, max: JDecimal) extends Range

SparkQA · 2017-01-03T00:35:15Z

Test build #70788 has finished for PR 16395 at commit e9d4f4d.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-03T01:29:17Z

Test build #70789 has finished for PR 16395 at commit b1932fb.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-01-03T23:42:53Z

Test build #70830 has finished for PR 16395 at commit 1fc44a9.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

ron8hu · 2017-01-05T00:30:52Z

cc @wzhfy @rxin @srinathshankar @hvanhovell @cloud-fan
Happy New Year! This PR is ready for code review.

SparkQA · 2017-01-05T02:23:28Z

Test build #70894 has finished for PR 16395 at commit 784015e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-01-06T06:39:21Z

sql/core/src/test/scala/org/apache/spark/sql/estimation/FilterEstimationSuite.scala

similar to the comment i made on the project PR, it'd be great to just create a leaf logical plan node to which we can pass arbitrary statistics, and use that to make all the estimation suites unit test suites rather than end-to-end test suites.

That way we can also have more control over the input we test.

rxin · 2017-01-06T06:40:50Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

this is not thread safe. maybe turn FilterEstimation into a class.

Agreed. fixed.
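
Editor's sketch of the concern above, under assumptions: if mutable statistics live in a singleton `object`, concurrent estimations share them; giving each estimation its own class instance keeps the state private. `FilterEstimatorSketch`, `ColStat`, and `narrowToSingleValue` are hypothetical names, not the patch's API.

```scala
import scala.collection.mutable

// Hypothetical stand-in for per-column statistics.
final case class ColStat(distinctCount: BigInt)

// One instance per estimation: the mutable map is owned by the instance,
// so two estimations running at the same time never touch each other's state.
class FilterEstimatorSketch(initial: Map[String, ColStat]) {
  private val mutableColStats: mutable.Map[String, ColStat] =
    mutable.Map(initial.toSeq: _*)

  // Example of an in-place update, e.g. after an equality predicate on `col`.
  def narrowToSingleValue(col: String): Unit =
    if (mutableColStats.contains(col)) mutableColStats(col) = ColStat(1)

  def stats: Map[String, ColStat] = mutableColStats.toMap
}
```

Two instances built from the same input stay independent: updating one leaves the other's statistics unchanged.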

rxin · 2017-01-06T06:41:21Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

can you add some documentation here on the high level algorithm?

basically i spent 2 mins reading this code and i have no idea how it works.

rxin · 2017-01-06T06:41:48Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

document what "update" means.

rxin · 2017-01-06T06:45:15Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/Range.scala

null values or 0 rows?

rxin · 2017-01-06T06:45:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/Range.scala

two methods?

rxin · 2017-01-06T06:46:31Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

please use // to document inline comments

/** */ are reserved for classdocs/function docs.

rxin · 2017-01-06T06:47:13Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

what's the return value? selectivity?

Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.

rxin · 2017-01-06T06:47:18Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

what's the return value? selectivity?

Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.

rxin · 2017-01-06T06:47:23Z

...src/main/scala/org/apache/spark/sql/catalyst/plans/logical/estimation/FilterEstimation.scala

what's the return value? selectivity?

Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.

wzhfy · 2017-01-09T05:56:33Z

@ron8hu Can you update the test cases based on the latest master? We have a new test infrastructure now.

ron8hu · 2017-01-11T05:01:34Z

cc @rxin @wzhfy
Have updated code based on rxin's comments. Please review again.

rxin · 2017-01-11T06:58:53Z

...main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/EstimationUtils.scala

is this necessary? isn't this just

case op @ EqualTo(ar: AttributeReference, l: Literal) =>

Yes, I can use patterns for variable binding. Fixed.
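
Editor's sketch of the suggested pattern: `op @` binds the whole matched node, while the typed patterns both check and bind its children in a single `case`. The classes below are simplified stand-ins that only borrow the Catalyst names; they are assumptions for illustration.

```scala
// Simplified stand-ins; not the real Catalyst classes.
sealed trait Expr
case class AttributeReference(name: String) extends Expr
case class Literal(value: Any) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr

object BinderSketch {
  def describe(e: Expr): String = e match {
    // One pattern replaces separate isInstanceOf checks and casts.
    case op @ EqualTo(ar: AttributeReference, l: Literal) =>
      s"${op.productPrefix}: ${ar.name} = ${l.value}"
    case _ => "unsupported"
  }
}
```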

rxin · 2017-01-11T07:09:48Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

we can probably remove this since it doesn't really carry any information ... (plan's type is already Filter)

We need this LogicalPlan so that we can access its child node's statistics information.

rxin · 2017-01-11T07:12:35Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

as i commented on the other pr, i think we should use named arguments here so readers would know what 0, 4, 4 means.

also i'd rename filteredColStats to just expected
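
Editor's sketch of the named-argument suggestion. `NamedArgStat` and its field names are hypothetical; they only assume that the bare numbers in the test are count and length fields of a statistics case class.

```scala
// Hypothetical statistics record used only for this illustration.
case class NamedArgStat(distinctCount: BigInt, nullCount: BigInt, avgLen: Long, maxLen: Long)

object NamedArgsSketch {
  // Positional: the reader has to look up what each number means.
  val positional: NamedArgStat = NamedArgStat(2, 0, 4, 4)

  // Named: the same value, but self-describing at the call site.
  val expected: NamedArgStat =
    NamedArgStat(distinctCount = 2, nullCount = 0, avgLen = 4, maxLen = 4)
}
```

Both values are equal; only the readability of the call site changes.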

SparkQA · 2017-02-24T00:54:16Z

Test build #73376 has finished for PR 16395 at commit eac69af.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-24T01:16:21Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+        }
+
+      case Or(cond1, cond2) =>
+        // For ease of debugging, we compute percent1 and percent2 in 2 statements.


nit: this can also apply to the And case

cloud-fan · 2017-02-24T01:18:07Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+   * @return an optional double value to show the percentage of rows meeting a given condition
+   *         It returns None if no statistics collected for a given column.
+   */
+  def evaluateIsNull(


nit: evaluateNullCheck
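
Editor's sketch of what a null-check estimate with this return contract could look like: `None` when the column has no statistics, otherwise the fraction of rows that pass. `NullStat` and the parameter list are assumptions; the signature in the patch is only partially visible above.

```scala
// Hypothetical minimal statistics: just the number of nulls in the column.
final case class NullStat(nullCount: BigInt)

object NullCheckSketch {
  // isNull = true estimates `col IS NULL`; false estimates `col IS NOT NULL`.
  def evaluateNullCheck(
      stat: Option[NullStat],
      rowCount: BigInt,
      isNull: Boolean): Option[Double] =
    stat.map { s =>
      if (rowCount.signum <= 0) {
        0.0 // empty input: no row can satisfy the predicate
      } else {
        val nullFraction = (BigDecimal(s.nullCount) / BigDecimal(rowCount)).toDouble
        if (isNull) nullFraction else 1.0 - nullFraction
      }
    }
}
```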

cloud-fan · 2017-02-24T01:34:21Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+      hSet: Set[Any],
+      update: Boolean)
+    : Option[Double] = {
+    if (!mutableColStats.contains(attrRef.exprId)) {


we can have a method for this logic

cloud-fan · 2017-02-24T01:39:43Z

...alyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/Range.scala

  }

+  def rangeContainsLiteral(r: Range, lit: Literal): Boolean = r match {
+    case _: DefaultRange => true


shall we move this logic into each Range implementation?
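
Editor's sketch of that refactoring: instead of one helper that pattern-matches on the range type, each implementation answers the containment question itself. The trait is named `ValueRange` here to avoid shadowing Scala's built-in `Range`, and `BigDecimal` bounds are an assumption (the build log above shows `JDecimal`).

```scala
// Hypothetical trait; each implementation decides containment for itself.
sealed trait ValueRange {
  def contains(value: BigDecimal): Boolean
}

// No usable min/max statistics: conservatively assume any literal may be present.
case class DefaultRange() extends ValueRange {
  def contains(value: BigDecimal): Boolean = true
}

// Closed interval [min, max] taken from column statistics.
case class NumericRange(min: BigDecimal, max: BigDecimal) extends ValueRange {
  def contains(value: BigDecimal): Boolean = value >= min && value <= max
}
```

Callers then write `range.contains(v)` and never need to match on the concrete range type.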

cloud-fan · 2017-02-24T01:40:09Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
+        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
+        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
+        val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)


we can use rangeContainsLiteral here.

cloud-fan · 2017-02-24T01:49:23Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+      // So we will change the order if not.
+
+      // EqualTo does not care about the order
+      case op @ EqualTo(ar: AttributeReference, l: Literal) =>


shall we also handle EqualNullSafe?

cloud-fan · 2017-02-24T01:55:40Z

...ain/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/FilterEstimation.scala

+
+        // To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
+        // Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
+        val hSetBigdec = hSet.map(e => BigDecimal(e.toString))


we should filter out null values first
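
Editor's sketch of why the null filter matters: calling `e.toString` on a `null` element throws a `NullPointerException`, so nulls have to be dropped before the conversion to `BigDecimal`. The helper name and the explicit `min`/`max` parameters are assumptions for illustration.

```scala
object InSetSketch {
  // Keeps only the non-null IN-list values that fall inside [min, max].
  def validValues(hSet: Set[Any], min: BigDecimal, max: BigDecimal): Set[BigDecimal] =
    hSet
      .filter(_ != null)                  // drop nulls before calling toString
      .map(e => BigDecimal(e.toString))   // assumes the remaining values are numeric
      .filter(e => e >= min && e <= max)
}
```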

cloud-fan · 2017-02-24T02:06:18Z

...yst/src/test/scala/org/apache/spark/sql/catalyst/statsEstimation/FilterEstimationSuite.scala

+      ar: AttributeReference,
+      filterNode: Filter,
+      expectedColStats: ColumnStat,
+      rowCount: Option[BigInt] = None)


use BigInt please, all the callers pass a Some(value)

cloud-fan · 2017-02-24T02:08:09Z

LGTM except some minor comments, you can address them in follow-up

SparkQA · 2017-02-24T03:08:53Z

Test build #73377 has finished for PR 16395 at commit a48a4fd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-02-24T04:18:52Z

thanks, merging to master!

## What changes were proposed in this pull request?

This is a follow-up of #16395. It fixes some code style issues, naming issues, some missing cases in pattern match, etc.

## How was this patch tested?

existing tests.

Author: Wenchen Fan

Closes #17065 from cloud-fan/follow-up.

## What changes were proposed in this pull request?

We traverse predicate and evaluate the logical expressions to compute the selectivity of a FILTER operator.

## How was this patch tested?

We add a new test suite to test various logical operators.

Author: Ron Hu

Closes apache#16395 from ron8hu/filterSelectivity.

ron8hu changed the title ~~implemented first version of filter estimation~~ [SPARK-17075][SQL][WIP] implemented filter estimation Dec 24, 2016

ron8hu force-pushed the filterSelectivity branch from 56d1579 to e9d4f4d Compare January 3, 2017 00:28

ron8hu changed the title ~~[SPARK-17075][SQL][WIP] implemented filter estimation~~ [SPARK-17075][SQL] implemented filter estimation Jan 3, 2017

ron8hu force-pushed the filterSelectivity branch from b1932fb to 1fc44a9 Compare January 3, 2017 21:36

ron8hu force-pushed the filterSelectivity branch from 1fc44a9 to 784015e Compare January 5, 2017 00:06

rxin reviewed Jan 6, 2017

ron8hu force-pushed the filterSelectivity branch from 784015e to 210b11b Compare January 11, 2017 04:58

rxin reviewed Jan 11, 2017

ron8hu added 16 commits February 23, 2017 16:45

update code based on wzhfy's comments

f24cf3e

filtered NDV should be no larger than initial NDV

3ebf3a8

add tests to handle decimal data type

9763635

add test cases for float and double types

35c213f

add cast-as-date test cases

6b8aab3

update calls to getOutputSize

894d85c

use Range.isIntersected to decide if a literal is in boundary

97aacdf

handle date/timestamp string literal

2b4a10a

solve merge conflict in EstimationUtils

f54a6ce

remove useless type checking since type coercion already did it

07e6320

add string column tests. remove float time tests.

7ba6609

improve readability

1f3619f

specify update = false for Or condition

6411d21

remove the unused import. save with Unix style line ending

11b6a0b

update date column test case

298d255

clean up internal/external type conversion

eac69af

ron8hu force-pushed the filterSelectivity branch from 954d2fc to eac69af Compare February 24, 2017 00:51

use Javadoc style indentation for multiline comments

a48a4fd

cloud-fan reviewed Feb 24, 2017

asfgit closed this in d7e43b6 Feb 24, 2017

cloud-fan mentioned this pull request Feb 25, 2017

[SPARK-17075][SQL][followup] fix some minor issues and clean up the code #17065

Closed

ron8hu deleted the filterSelectivity branch April 4, 2017 00:33
