
Conversation

@ron8hu
Contributor

@ron8hu ron8hu commented Dec 24, 2016

What changes were proposed in this pull request?

We traverse the predicate and evaluate its logical expressions to compute the selectivity of a FILTER operator.
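A rough standalone sketch of this approach (illustrative only: the simplified Expr and ColStats types and the independence/uniformity assumptions are not the code in this patch) recurses over the predicate tree, combining child selectivities at And/Or/Not and consulting per-column statistics at the comparison leaves:

```scala
// Minimal standalone sketch of predicate-tree selectivity estimation.
// Expr, ColStats and the combination rules are illustrative only.
object SelectivitySketch {
  sealed trait Expr
  case class And(l: Expr, r: Expr) extends Expr
  case class Or(l: Expr, r: Expr) extends Expr
  case class Not(c: Expr) extends Expr
  case class EqualTo(col: String, value: Double) extends Expr
  case class LessThan(col: String, value: Double) extends Expr

  // Per-column statistics: distinct count and [min, max] range.
  case class ColStats(distinctCount: Long, min: Double, max: Double)

  def selectivity(e: Expr, stats: Map[String, ColStats]): Double = e match {
    case And(l, r) => selectivity(l, stats) * selectivity(r, stats)
    case Or(l, r) =>
      val (p1, p2) = (selectivity(l, stats), selectivity(r, stats))
      p1 + p2 - p1 * p2
    case Not(c) => 1.0 - selectivity(c, stats)
    case EqualTo(col, v) =>
      stats.get(col) match {
        case Some(s) if v >= s.min && v <= s.max => 1.0 / s.distinctCount
        case Some(_) => 0.0 // literal falls outside the column's range
        case None    => 1.0 // no stats collected: assume nothing is filtered
      }
    case LessThan(col, v) =>
      stats.get(col) match {
        case Some(s) if v <= s.min => 0.0
        case Some(s) if v >= s.max => 1.0
        case Some(s) => (v - s.min) / (s.max - s.min) // assume uniform distribution
        case None    => 1.0
      }
  }

  def main(args: Array[String]): Unit = {
    val stats = Map("a" -> ColStats(distinctCount = 10, min = 1, max = 10))
    println(selectivity(And(LessThan("a", 6), EqualTo("a", 3)), stats)) // ~0.0556
  }
}
```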

How was this patch tested?

We add a new test suite to test various logical operators.

@ron8hu ron8hu changed the title implemented first version of filter estimation [SPARK-17075][SQL][WIP] implemented filter estimation Dec 24, 2016
@ron8hu
Contributor Author

ron8hu commented Dec 24, 2016

cc @wzhfy @rxin @hvanhovell @cloud-fan

@cloud-fan
Contributor

ok to test

@rxin
Contributor

rxin commented Dec 24, 2016

cc @srinathshankar

@SparkQA

SparkQA commented Dec 24, 2016

Test build #70562 has finished for PR 16395 at commit 56d1579.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class NumericRange(min: JDecimal, max: JDecimal) extends Range

@ron8hu ron8hu force-pushed the filterSelectivity branch from 56d1579 to e9d4f4d on January 3, 2017 00:28
@SparkQA

SparkQA commented Jan 3, 2017

Test build #70788 has finished for PR 16395 at commit e9d4f4d.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ron8hu ron8hu changed the title [SPARK-17075][SQL][WIP] implemented filter estimation [SPARK-17075][SQL] implemented filter estimation Jan 3, 2017
@SparkQA

SparkQA commented Jan 3, 2017

Test build #70789 has finished for PR 16395 at commit b1932fb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ron8hu ron8hu force-pushed the filterSelectivity branch from b1932fb to 1fc44a9 on January 3, 2017 21:36
@SparkQA

SparkQA commented Jan 3, 2017

Test build #70830 has finished for PR 16395 at commit 1fc44a9.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ron8hu ron8hu force-pushed the filterSelectivity branch from 1fc44a9 to 784015e on January 5, 2017 00:06
@ron8hu
Contributor Author

ron8hu commented Jan 5, 2017

cc @wzhfy @rxin @srinathshankar @hvanhovell @cloud-fan
Happy New Year! This PR is ready for code review.

@SparkQA

SparkQA commented Jan 5, 2017

Test build #70894 has finished for PR 16395 at commit 784015e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

similar to comment i made on the project PR, it'd be great to just create a leaf logical plan node in which we can pass arbitrary statistics and use that to make all the estimation suites unit test suites, rather than end-to-end test suites.

That way we can also have more control over the input we test.
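A minimal sketch of such a test-only leaf node, under a simplified plan/statistics model (StatsTestPlan, StatisticsSketch and ColumnStatSketch below are hypothetical names, not Spark's classes):

```scala
// Standalone sketch of a test-only leaf plan that simply returns the
// statistics it was constructed with; names and types are illustrative.
object StatsTestPlanSketch {
  case class ColumnStatSketch(distinctCount: BigInt, nullCount: BigInt,
                              min: Option[Any], max: Option[Any])
  case class StatisticsSketch(rowCount: BigInt,
                              attributeStats: Map[String, ColumnStatSketch])

  sealed trait PlanSketch { def stats: StatisticsSketch }

  // Leaf node into which a test can inject arbitrary statistics.
  case class StatsTestPlan(injected: StatisticsSketch) extends PlanSketch {
    override def stats: StatisticsSketch = injected
  }

  def main(args: Array[String]): Unit = {
    val child = StatsTestPlan(StatisticsSketch(
      rowCount = 10,
      attributeStats = Map("cint" -> ColumnStatSketch(10, 0, Some(1), Some(10)))))
    // An estimation unit test can now read child.stats directly,
    // without building an end-to-end query plan.
    assert(child.stats.rowCount == 10)
  }
}
```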

Contributor Author

OK. fixed.

Contributor

this is not thread safe. maybe turn FilterEstimation into a class.
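For illustration, the refactoring being asked for looks roughly like this: move the mutable estimation state out of a shared singleton and into an instance created per estimation call. The names and the toy estimate logic are illustrative, not the patch's code:

```scala
// Before (problematic): a singleton holds mutable per-query state, so two
// threads planning queries concurrently can interleave updates.
object FilterEstimationUnsafe {
  private val colSelectivities = scala.collection.mutable.Map.empty[String, Double]
  def estimate(conds: Seq[(String, Double)]): Double = {
    colSelectivities.clear() // races with other callers
    conds.foreach { case (col, sel) => colSelectivities(col) = sel }
    colSelectivities.values.product
  }
}

// After: the state lives in an instance created per estimation, so nothing is shared.
class FilterEstimationSafe {
  private val colSelectivities = scala.collection.mutable.Map.empty[String, Double]
  def estimate(conds: Seq[(String, Double)]): Double = {
    conds.foreach { case (col, sel) => colSelectivities(col) = sel }
    colSelectivities.values.product
  }
}

object FilterEstimationDemo extends App {
  println(new FilterEstimationSafe().estimate(Seq("a" -> 0.5, "b" -> 0.4))) // 0.2
}
```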

Contributor Author

Agreed. fixed.

Contributor

can you add some documentation here on the high level algorithm?

Contributor

basically i spent 2 mins reading this code and i have no idea how it works.

Contributor Author

fixed.

Contributor

document what "update" means.

Contributor Author

fixed.

Contributor

@rxin rxin Jan 6, 2017

null values or 0 rows?

Contributor

two methods?

Contributor

please use // to document inline comments

/** */ are reserved for classdocs/function docs.

Contributor Author

fixed.

Contributor

what's the return value? selectivity?

Contributor Author

@ron8hu ron8hu Jan 9, 2017

Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.
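Such a doc comment might take roughly this shape (a simplified, hypothetical signature rather than the patch's exact method), with /** */ for the method doc and // for inline notes:

```scala
object ReturnDocSketch {
  /**
   * Estimates the fraction of rows that satisfy a single condition.
   *
   * @param update whether to narrow the tracked column statistics as a side
   *               effect of evaluating the condition
   * @return an optional Double in [0.0, 1.0] giving the percentage of rows
   *         meeting the condition, or None if no statistics were collected
   *         for the referenced column
   */
  def evaluateCondition(update: Boolean): Option[Double] = {
    // Inline explanation uses // comments; /** */ is reserved for the doc above.
    Some(1.0) // placeholder: real logic would consult column statistics
  }
}
```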

Contributor

what's the return value? selectivity?

Contributor Author

Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.

Contributor

what's the return value? selectivity?

Contributor Author

Yes, the return value is a double value showing the percentage of rows meeting a given condition. Also I will add comments for this method in JavaDoc style.

@wzhfy
Contributor

wzhfy commented Jan 9, 2017

@ron8hu Can you update the test cases based on the latest master? We have a new test infrastructure now.

@ron8hu ron8hu force-pushed the filterSelectivity branch from 784015e to 210b11b on January 11, 2017 04:58
@ron8hu
Contributor Author

ron8hu commented Jan 11, 2017

cc @rxin @wzhfy
I have updated the code based on rxin's comments. Please review again.

Contributor

is this necessary? isn't this just

case op @ EqualTo(ar: AttributeReference, l: Literal) =>
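For context, the @ binder keeps a handle on the whole matched node while destructuring it, so no separate cast or re-match is needed. A small standalone illustration with a simplified Expression hierarchy (not Spark's):

```scala
// Standalone illustration of pattern variable binding with `@`.
object PatternBindingSketch {
  sealed trait Expression
  case class AttributeReference(name: String) extends Expression
  case class Literal(value: Any) extends Expression
  case class EqualTo(left: Expression, right: Expression) extends Expression

  def describe(e: Expression): String = e match {
    // `op` is bound to the whole EqualTo node while ar and l are extracted,
    // so no extra match or asInstanceOf is needed to refer to it.
    case op @ EqualTo(ar: AttributeReference, l: Literal) =>
      s"$op compares column ${ar.name} with literal ${l.value}"
    case _ => "unsupported"
  }

  def main(args: Array[String]): Unit =
    println(describe(EqualTo(AttributeReference("key"), Literal(10))))
}
```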

Contributor Author

Yes, I can use patterns for variable binding. Fixed.

Contributor

we can probably remove this since it doesn't really carry any information ... (plan's type is already Filter)

Contributor Author

We need this LogicalPlan so that we can access its child node's statistics information.

Contributor

as i commented on the other PR, i think we should use named arguments here so readers would know what 0, 4, 4 means.
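As an illustration of the suggestion, using a simplified, hypothetical stand-in for the statistics class rather than Spark's actual ColumnStat, named arguments make the literal values self-describing at the call site:

```scala
// Named arguments make otherwise opaque literals readable at the call site.
object NamedArgsSketch {
  // Simplified stand-in for a column-statistics case class.
  case class ColumnStatSketch(distinctCount: BigInt, nullCount: BigInt,
                              avgLen: Long, maxLen: Long)

  // Positional: a reader cannot tell what 0, 4, 4 mean without the definition.
  val positional = ColumnStatSketch(10, 0, 4, 4)

  // Named: the meaning of each value is explicit.
  val named = ColumnStatSketch(distinctCount = 10, nullCount = 0, avgLen = 4, maxLen = 4)

  def main(args: Array[String]): Unit = assert(positional == named)
}
```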

Contributor

also i'd rename filteredColStats to just expected

Contributor Author

fixed.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73376 has finished for PR 16395 at commit eac69af.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

case Or(cond1, cond2) =>
// For ease of debugging, we compute percent1 and percent2 in 2 statements.
Contributor

nit: this can also apply to the And case

* @return an optional double value to show the percentage of rows meeting a given condition
* It returns None if no statistics collected for a given column.
*/
def evaluateIsNull(
Contributor

nit: evaluateNullCheck

hSet: Set[Any],
update: Boolean)
: Option[Double] = {
if (!mutableColStats.contains(attrRef.exprId)) {
Contributor

we can have a method for this logic
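The suggestion is to pull the repeated "do we have statistics for this attribute?" check into one helper; a sketch with hypothetical names (not the patch's code):

```scala
// Sketch of factoring a repeated "do we have stats for this column?" check
// into a single helper; the types and names are illustrative.
object ColStatsGuardSketch {
  case class ExprId(id: Long)
  case class AttributeReference(name: String, exprId: ExprId)
  case class ColumnStatSketch(distinctCount: BigInt, nullCount: BigInt)

  class Estimator(colStats: Map[ExprId, ColumnStatSketch]) {
    // Shared guard used by every evaluate* method instead of repeating
    // the contains() check inline.
    private def statsFor(attr: AttributeReference): Option[ColumnStatSketch] =
      colStats.get(attr.exprId)

    def evaluateEquality(attr: AttributeReference): Option[Double] =
      statsFor(attr).map(s => 1.0 / s.distinctCount.toDouble)
  }

  def main(args: Array[String]): Unit = {
    val a = AttributeReference("key", ExprId(1))
    val est = new Estimator(Map(ExprId(1) -> ColumnStatSketch(10, 0)))
    println(est.evaluateEquality(a))                                  // Some(0.1)
    println(est.evaluateEquality(AttributeReference("x", ExprId(2)))) // None
  }
}
```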

}

def rangeContainsLiteral(r: Range, lit: Literal): Boolean = r match {
case _: DefaultRange => true
Contributor

shall we move this logic into each Range implementation?
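The suggestion amounts to replacing an external pattern match on the Range subtype with a polymorphic contains method; a sketch over a simplified, hypothetical Range hierarchy:

```scala
// Sketch: instead of a free function that pattern-matches on Range subtypes,
// each Range implementation knows whether it contains a value.
object RangeContainsSketch {
  sealed trait Range { def contains(v: BigDecimal): Boolean }

  // A "default" range used when min/max are unknown: treat every value as possible.
  case object DefaultRange extends Range {
    override def contains(v: BigDecimal): Boolean = true
  }

  case class NumericRange(min: BigDecimal, max: BigDecimal) extends Range {
    override def contains(v: BigDecimal): Boolean = v >= min && v <= max
  }

  // A column known to hold only nulls: no literal can match.
  case object NullRange extends Range {
    override def contains(v: BigDecimal): Boolean = false
  }

  def main(args: Array[String]): Unit = {
    val r: Range = NumericRange(BigDecimal(1), BigDecimal(10))
    println(r.contains(BigDecimal(5)))          // true
    println(NullRange.contains(BigDecimal(5)))  // false
  }
}
```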

// To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
// Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
val validQuerySet = hSetBigdec.filter(e => e >= statsRange.min && e <= statsRange.max)
Contributor

we can use rangeContainsLiteral here.

// So we will change the order if not.

// EqualTo does not care about the order
case op @ EqualTo(ar: AttributeReference, l: Literal) =>
Contributor

shall we also handle EqualNullSafe?
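For reference, <=> (EqualNullSafe) differs from = only on nulls: null <=> null is true and null <=> x is false, whereas = evaluates to null in both cases. One way an estimator might treat it, sketched with simplified types and a uniformity assumption (not the patch's code):

```scala
// Sketch: null-safe equality against a non-null literal behaves like EqualTo
// for estimation purposes, while comparison with a null literal selects
// exactly the null rows. Simplified types; uniformity assumed.
object NullSafeEqualitySketch {
  case class ColumnStatSketch(distinctCount: Long, nullCount: Long)

  def equalToSelectivity(s: ColumnStatSketch, rowCount: Long): Double =
    if (s.distinctCount == 0) 0.0
    else (1.0 - s.nullCount.toDouble / rowCount) / s.distinctCount

  def equalNullSafeSelectivity(s: ColumnStatSketch, rowCount: Long,
                               literalIsNull: Boolean): Double =
    if (literalIsNull) s.nullCount.toDouble / rowCount // col <=> null keeps the null rows
    else equalToSelectivity(s, rowCount)               // otherwise same as col = literal

  def main(args: Array[String]): Unit = {
    val s = ColumnStatSketch(distinctCount = 10, nullCount = 20)
    println(equalNullSafeSelectivity(s, rowCount = 100, literalIsNull = true))  // 0.2
    println(equalNullSafeSelectivity(s, rowCount = 100, literalIsNull = false)) // 0.08
  }
}
```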


// To facilitate finding the min and max values in hSet, we map hSet values to BigDecimal.
// Using hSetBigdec, we can find the min and max values quickly in the ordered hSetBigdec.
val hSetBigdec = hSet.map(e => BigDecimal(e.toString))
Contributor

we should filter out null values first
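Concretely, a null element in the IN-list would break the BigDecimal(e.toString) conversion and can never contribute matching rows to a filter, so it can be dropped before the conversion; a minimal sketch with a hypothetical helper (not the patch's code):

```scala
// Sketch: drop nulls from the IN-list before mapping to BigDecimal,
// since null.toString would throw and a null element never adds matches
// under SQL three-valued logic.
object InListSketch {
  def validSetFraction(hSet: Set[Any], min: BigDecimal, max: BigDecimal,
                       distinctCount: Long): Double = {
    val nonNull = hSet.filter(_ != null)
    val hSetBigdec = nonNull.map(e => BigDecimal(e.toString))
    val validQuerySet = hSetBigdec.filter(e => e >= min && e <= max)
    if (distinctCount == 0) 0.0
    else math.min(validQuerySet.size.toDouble / distinctCount, 1.0)
  }

  def main(args: Array[String]): Unit =
    // 3 of the non-null values fall inside [1, 10]; with 10 distinct values
    // the estimated selectivity is 0.3.
    println(validSetFraction(Set(1, 5, 9, 42, null), BigDecimal(1), BigDecimal(10), 10))
}
```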

ar: AttributeReference,
filterNode: Filter,
expectedColStats: ColumnStat,
rowCount: Option[BigInt] = None)
Contributor

use BigInt please, all the callers pass a Some(value)

@cloud-fan
Contributor

LGTM except some minor comments; you can address them in a follow-up.

@SparkQA

SparkQA commented Feb 24, 2017

Test build #73377 has finished for PR 16395 at commit a48a4fd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

@asfgit asfgit closed this in d7e43b6 Feb 24, 2017
asfgit pushed a commit that referenced this pull request Feb 26, 2017
## What changes were proposed in this pull request?

This is a follow-up of #16395. It fixes some code style issues, naming issues, some missing cases in pattern match, etc.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes #17065 from cloud-fan/follow-up.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
## What changes were proposed in this pull request?

We traverse predicate and evaluate the logical expressions to compute the selectivity of a FILTER operator.

## How was this patch tested?

We add a new test suite to test various logical operators.

Author: Ron Hu <[email protected]>

Closes apache#16395 from ron8hu/filterSelectivity.
Yunni pushed a commit to Yunni/spark that referenced this pull request Feb 27, 2017
## What changes were proposed in this pull request?

This is a follow-up of apache#16395. It fixes some code style issues, naming issues, some missing cases in pattern match, etc.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <[email protected]>

Closes apache#17065 from cloud-fan/follow-up.
@ron8hu ron8hu deleted the filterSelectivity branch April 4, 2017 00:33