Conversation

@gatorsmile
Member

When the filter is "b in ('1', '2')", the filter is not pushed down to Parquet. Thanks!
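
For example (a minimal sketch; the Parquet path here is only illustrative), a query of this shape currently produces no Parquet-level filter for the In predicate, even with Parquet filter push-down (SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED) turned on:

    // Illustrative only: with Parquet filter push-down enabled, one would expect
    // the In predicate on column "b" to be pushed down, but no filter is generated.
    val df = sqlContext.read.parquet("/tmp/table1")   // hypothetical path
      .where("b in ('1', '2')")
    df.show()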

@gatorsmile
Member Author

After reading the other push-down PR, I think this one also needs a review from @liancheng. Any comments are welcome! Thanks!

@SparkQA

SparkQA commented Dec 12, 2015

Test build #47615 has finished for PR 10278 at commit 79be2c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 12, 2015

Test build #47616 has finished for PR 10278 at commit 2ff70bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Do you have a test case that actually shows a wrong answer being computed?

@gatorsmile
Member Author

This only happens in 1.5. Do you need me to write a test case for 1.5?

@marmbrus
Contributor

Any bug fix should have a regression test. We could always change the optimizer in a way that no longer hides this bug.

@gatorsmile
Member Author

OK, I will try to force it. Thanks!

@marmbrus
Contributor

It's fine if the test only fails on 1.5.

@gatorsmile
Member Author

Great! : )

Let me also post the test case I ran against the latest 1.5. Without my fix, the first call to show() does not return the row (2, 0). Feel free to let me know if you want me to include the following test case.

    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
      withTempPath { dir =>
        val path = s"${dir.getCanonicalPath}/table1"
        (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", "b").write.parquet(path)

        // Without the fix, this query incorrectly drops the row (2, 0).
        val df = sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))")
        df.show()

        // The logically equivalent rewrite of the same predicate (via De Morgan's law).
        val df1 = sqlContext.read.parquet(path).where("not (a = 2) or not(b in ('1'))")
        df1.show()
      }
    }

@gatorsmile
Member Author

I might have found another bug in Parquet pushdown. I will submit a fix once I can confirm it.

@SparkQA

SparkQA commented Dec 13, 2015

Test build #47618 has finished for PR 10278 at commit c9af771.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@gatorsmile Sorry for the late reply and thanks for the nice catch!

The In predicate push-down issue has been tracked by SPARK-11164 and was addressed as part of PR #8956. Unfortunately, we didn't merge that PR due to other problems in it. Could you please add SPARK-11164 to your PR title?

For the Not push-down rule:

  1. I'm for adding it to branch-1.5 since it's a pretty safe one.
  2. I think we might also want to add a more general CNF conversion rule to master, which should be done in a separate PR, of course.

One benefit of CNF is that it enables more filter push-down opportunities.

Since we don't have existential/universal quantifiers in our predicates, I think CNF conversion in Spark SQL can be as simple as repeatedly pushing Not and Or inward (or downward) using De Morgan's laws and the distributive law:

    object CNFConversion extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case filter: Filter =>
          import org.apache.spark.sql.catalyst.dsl.expressions._

          filter.copy(condition = filter.condition.transform {
            case Not(x Or y) => !x && !y
            case Not(x And y) => !x || !y
            case (x And y) Or z => (x || z) && (y || z)
            case x Or (y And z) => (x || y) && (x || z)
          })
      }
    }

(Notice that this version doesn't handle common expression elimination.)
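
To make the effect of these rewrite rules concrete, here is a minimal, self-contained sketch (not Spark code; the toy Pred ADT and the toCNF helper are purely illustrative, and the function is not a full fixpoint CNF conversion) that applies the same De Morgan / distribution steps to the predicate from the test case above, NOT(a = 2 AND b IN ('1')):

    // Toy predicate ADT standing in for Catalyst expressions (illustration only).
    sealed trait Pred
    case class Leaf(name: String) extends Pred
    case class Not(p: Pred) extends Pred
    case class And(l: Pred, r: Pred) extends Pred
    case class Or(l: Pred, r: Pred) extends Pred

    // Push Not and Or inward using the same rules as the CNFConversion sketch above.
    // (Not a complete fixpoint conversion; just enough for this example.)
    def toCNF(p: Pred): Pred = p match {
      case Not(Or(x, y))    => And(toCNF(Not(x)), toCNF(Not(y)))     // De Morgan
      case Not(And(x, y))   => Or(toCNF(Not(x)), toCNF(Not(y)))      // De Morgan
      case Or(And(x, y), z) => And(toCNF(Or(x, z)), toCNF(Or(y, z))) // distribute
      case Or(x, And(y, z)) => And(toCNF(Or(x, y)), toCNF(Or(x, z))) // distribute
      case And(l, r)        => And(toCNF(l), toCNF(r))
      case Or(l, r)         => Or(toCNF(l), toCNF(r))
      case other            => other
    }

    // NOT(a = 2 AND b IN ('1'))  ==>  NOT(a = 2) OR NOT(b IN ('1'))
    toCNF(Not(And(Leaf("a = 2"), Leaf("b IN ('1')"))))
    // => Or(Not(Leaf("a = 2")), Not(Leaf("b IN ('1')")))

In general, CNF turns the condition into a conjunction of clauses, and each top-level conjunct can then be pushed down (or kept for post-filtering) independently, which is where the extra push-down opportunities mentioned above come from.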

That said, the Not push-down rule is actually a subset of CNF conversion. There was once a PR aimed at adding CNF conversion for data source filter push-down only, but it wasn't merged (see SPARK-6624 and PR #6713). As @marmbrus commented there, CNF conversion might be worth adding to the optimizer.

@rxin @marmbrus I'm not super confident about the CNF conversion conclusion above, so please correct me if I'm wrong.

@gatorsmile gatorsmile changed the title [SPARK-12218] [SQL] Fixed the Parquet's filter generation rule when Not is included in Parquet filter pushdown [SPARK-12218] [SPARK-11164] [SQL] Fixed the Parquet's filter generation rule when Not is included in Parquet filter pushdown Dec 17, 2015
@gatorsmile
Member Author

Thank you for your detailed explanation! @liancheng

I have the same opinion as @marmbrus. We should include CNF conversion in our optimizer. Some RDBMS systems do it during the query-rewriting phase. Below are my two cents on CNF.

Generally, CNF conversion is an important concept in query optimization, especially when we support indexing in Spark. When (multi-attribute) indexes exist over some subset of the conjuncts, we can employ these indexes to improve selectivity.

Thanks!

Contributor

Looks like this is the real problem. It is not safe to push down just one side here. This is the place where we drop that In, because createFilter(schema, In(...)) returns None.
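
A simplified sketch of the hazard (this is not the actual ParquetFilters code; the rows and predicates below just mirror the test case earlier in this thread): if the And conversion keeps whichever side still converts when the other side (the In) cannot be converted, then applying the Not on top yields a stricter filter than the original predicate, and rows that should survive get dropped.

    // Illustrative only: model the pushed-down filter as a per-row Boolean predicate.
    case class Row(a: Int, b: String)

    // Original predicate: NOT (a = 2 AND b IN ('1'))
    def original(r: Row): Boolean = !(r.a == 2 && Set("1").contains(r.b))

    // Buggy push-down: In cannot be converted, so only `a = 2` survives under the And,
    // and the Not is then applied to that partial filter: NOT (a = 2).
    def buggyPushdown(r: Row): Boolean = !(r.a == 2)

    val rows = (1 to 5).map(i => Row(i, (i % 2).toString))

    rows.filter(original)      // keeps Row(2, "0"), since "0" is not in ('1')
    rows.filter(buggyPushdown) // wrongly drops Row(2, "0")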

Contributor

@gatorsmile I am also going to try to come up with a fix. We can later see which one is more suitable.

Member Author

Glad to see your fix. : ) Thank you!

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47939 has finished for PR 10278 at commit e219ac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@yhuai @liancheng Regarding this PR, should I keep it open? This PR also has another fix that pushes down the In filter. Thanks!

@yhuai
Contributor

yhuai commented Dec 18, 2015

You can change it to just handle In. Actually, I am wondering what the problem with the other PR was that made us decide not to add In there?

@gatorsmile
Member Author

After reading through it, it looks like that PR was closed due to a separate issue with String filters in the same PR. Please correct me if my understanding is wrong. @liancheng

Do you want to deliver this by submitting another PR, @viirya? Either way is fine with me. Thanks!

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@SparkQA

SparkQA commented Dec 19, 2015

Test build #48036 has finished for PR 10278 at commit 64cd5e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public class JavaTwitterHashTagJoinSentiments
      • case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)
      • abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback
      • case class UnresolvedWindowExpression(
      • case class WindowExpression(
      • case class Lead(input: Expression, offset: Expression, default: Expression)
      • case class Lag(input: Expression, offset: Expression, default: Expression)
      • abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction
      • abstract class RowNumberLike extends AggregateWindowFunction
      • trait SizeBasedWindowFunction extends AggregateWindowFunction
      • case class RowNumber() extends RowNumberLike
      • case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction
      • case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindowFunction
      • abstract class RankLike extends AggregateWindowFunction
      • case class Rank(children: Seq[Expression]) extends RankLike
      • case class DenseRank(children: Seq[Expression]) extends RankLike
      • case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBasedWindowFunction

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Dec 19, 2015

Test build #48045 has finished for PR 10278 at commit 64cd5e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@gatorsmile Yeah, you're right. #8956 initially aimed to fix other issues, and it also included the fix for In.

BTW, it's almost always strictly better to open small PRs that contain ONLY a single change than bigger ones that contain multiple changes. The former are much easier to review and merge. (One-liner PRs are super welcome!)

@gatorsmile
Member Author

Thank you for your suggestions! @liancheng : )

Next time, I will not mix multiple fixes in the same PR.

@gatorsmile gatorsmile changed the title [SPARK-12218] [SPARK-11164] [SQL] Fixed the Parquet's filter generation rule when Not is included in Parquet filter pushdown [SPARK-11164] [SQL] Add InSet pushdown filter back for Parquet Dec 22, 2015
@gatorsmile
Member Author

@yhuai Do you think it can be merged?

Please give the credit to @viirya, who submitted it in the past. Thank you!

@liancheng
Contributor

@gatorsmile Could you please update the PR description?

@gatorsmile
Member Author

@liancheng Done. : )

@liancheng
Contributor

Thanks! I'm merging this to master, and will attribute this one to @viirya.

@asfgit asfgit closed this in 50301c0 Dec 23, 2015
@gatorsmile gatorsmile deleted the parquetFilterNot branch December 23, 2015 06:23
@gatorsmile
Member Author

Thank you!

@viirya
Member

viirya commented Dec 23, 2015

Thanks @gatorsmile @liancheng
