Conversation

@gatorsmile
Member

When the filter is "b in ('1', '2')", the filter is not pushed down to Parquet. Thanks!
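
For example (a minimal sketch; the Parquet path here is only illustrative), a query of this shape currently produces no Parquet-level filter for the In predicate, even with Parquet filter push-down (SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED) turned on:

    // Illustrative only: with Parquet filter push-down enabled, one would expect
    // the In predicate on column "b" to be pushed down, but no filter is generated.
    val df = sqlContext.read.parquet("/tmp/table1")   // hypothetical path
      .where("b in ('1', '2')")
    df.show()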

@gatorsmile
Member Author

After reading the other push-down PR, I think this one also needs a review from @liancheng. Any comments are welcome! Thanks!

@SparkQA

SparkQA commented Dec 12, 2015

Test build #47615 has finished for PR 10278 at commit 79be2c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 12, 2015

Test build #47616 has finished for PR 10278 at commit 2ff70bf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@marmbrus
Contributor

Do you have a test case that actually shows a wrong answer being computed?

@gatorsmile
Member Author

This only happens in 1.5. Do you need me to write a test case for 1.5?

@marmbrus
Contributor

Any bug fix should have a regression test. We could always change the optimizer in a way that no longer hides this bug.

@gatorsmile
Member Author

OK, I will try to force it. Thanks!

@marmbrus
Contributor

It's fine if the test only fails on 1.5.

@gatorsmile
Member Author

Great! : )

Let me also post the test case I ran against the latest 1.5. Without my fix, the first call to show() does not return the row (2, 0). Feel free to let me know if you want me to include the following test case.

    withSQLConf(SQLConf.PARQUET_FILTER_PUSHDOWN_ENABLED.key -> "true") {
      withTempPath { dir =>
        val path = s"${dir.getCanonicalPath}/table1"
        (1 to 5).map(i => (i, (i % 2).toString)).toDF("a", "b").write.parquet(path)

        // Without the fix, this query incorrectly drops the row (2, 0).
        val df = sqlContext.read.parquet(path).where("not (a = 2 and b in ('1'))")
        df.show()

        // The logically equivalent rewrite of the same predicate (via De Morgan's law).
        val df1 = sqlContext.read.parquet(path).where("not (a = 2) or not(b in ('1'))")
        df1.show()
      }
    }

@gatorsmile
Member Author

I might have found another bug in Parquet pushdown. I will submit a fix once I can confirm it.

@SparkQA

SparkQA commented Dec 13, 2015

Test build #47618 has finished for PR 10278 at commit c9af771.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@gatorsmile Sorry for the late reply and thanks for the nice catch!

The In predicate push-down issue has been tracked by SPARK-11164 and was addressed as part of PR #8956. Unfortunately, we didn't merge that PR due to other problems in it. Could you please add SPARK-11164 to your PR title?

For the Not push-down rule:

  1. I'm for adding it to branch-1.5 since it's a pretty safe one.
  2. I think we might also want to add a more general CNF conversion rule to master, which should be done in a separate PR, of course.

One benefit of CNF is that it enables more filter push-down opportunities.

Since we don't have existential/universal quantifiers in our predicates, I think CNF conversion in Spark SQL can be as simple as repeatedly pushing Not and Or inward (or downward) using De Morgan's laws and the distributive law:

    object CNFConversion extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan transform {
        case filter: Filter =>
          import org.apache.spark.sql.catalyst.dsl.expressions._

          filter.copy(condition = filter.condition.transform {
            case Not(x Or y) => !x && !y
            case Not(x And y) => !x || !y
            case (x And y) Or z => (x || z) && (y || z)
            case x Or (y And z) => (x || y) && (x || z)
          })
      }
    }

(Notice that this version doesn't handle common expression elimination.)
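
To make the effect of these rewrite rules concrete, here is a minimal, self-contained sketch (not Spark code; the toy Pred ADT and the toCNF helper are purely illustrative, and the function is not a full fixpoint CNF conversion) that applies the same De Morgan / distribution steps to the predicate from the test case above, NOT(a = 2 AND b IN ('1')):

    // Toy predicate ADT standing in for Catalyst expressions (illustration only).
    sealed trait Pred
    case class Leaf(name: String) extends Pred
    case class Not(p: Pred) extends Pred
    case class And(l: Pred, r: Pred) extends Pred
    case class Or(l: Pred, r: Pred) extends Pred

    // Push Not and Or inward using the same rules as the CNFConversion sketch above.
    // (Not a complete fixpoint conversion; just enough for this example.)
    def toCNF(p: Pred): Pred = p match {
      case Not(Or(x, y))    => And(toCNF(Not(x)), toCNF(Not(y)))     // De Morgan
      case Not(And(x, y))   => Or(toCNF(Not(x)), toCNF(Not(y)))      // De Morgan
      case Or(And(x, y), z) => And(toCNF(Or(x, z)), toCNF(Or(y, z))) // distribute
      case Or(x, And(y, z)) => And(toCNF(Or(x, y)), toCNF(Or(x, z))) // distribute
      case And(l, r)        => And(toCNF(l), toCNF(r))
      case Or(l, r)         => Or(toCNF(l), toCNF(r))
      case other            => other
    }

    // NOT(a = 2 AND b IN ('1'))  ==>  NOT(a = 2) OR NOT(b IN ('1'))
    toCNF(Not(And(Leaf("a = 2"), Leaf("b IN ('1')"))))
    // => Or(Not(Leaf("a = 2")), Not(Leaf("b IN ('1')")))

In general, CNF turns the condition into a conjunction of clauses, and each top-level conjunct can then be pushed down (or kept for post-filtering) independently, which is where the extra push-down opportunities mentioned above come from.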

That said, the Not push-down rule is actually a subset of CNF conversion. There was once a PR aimed at adding CNF conversion for data source filter push-down only, but it wasn't merged (see SPARK-6624 and PR #6713). As @marmbrus commented there, CNF conversion might be worth adding to the optimizer.

@rxin @marmbrus I'm not super confident about the CNF conversion conclusion above, so please correct me if I'm wrong.

@gatorsmile gatorsmile changed the title [SPARK-12218] [SQL] Fixed the Parquet's filter generation rule when Not is included in Parquet filter pushdown [SPARK-12218] [SPARK-11164] [SQL] Fixed the Parquet's filter generation rule when Not is included in Parquet filter pushdown Dec 17, 2015
@gatorsmile
Member Author

Thank you for your detailed explanation! @liancheng

I have the same opinion as @marmbrus. We should include CNF conversion in our optimizer. Some RDBMS systems do it during the query-rewriting phase. Below are my two cents on CNF.

Generally, CNF conversion is an important concept in query optimization, especially when we support indexing in Spark. When (multi-attribute) indexes exist over some subset of the conjuncts, we can employ these indexes to improve selectivity.

Thanks!

Contributor

Looks like this is the real problem. It is not safe to push down just one side here. This is the place where we drop that In, because createFilter(schema, In(...)) returns None.
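
A simplified sketch of the hazard (this is not the actual ParquetFilters code; the rows and predicates below just mirror the test case earlier in this thread): if the And conversion keeps whichever side still converts when the other side (the In) cannot be converted, then applying the Not on top yields a stricter filter than the original predicate, and rows that should survive get dropped.

    // Illustrative only: model the pushed-down filter as a per-row Boolean predicate.
    case class Row(a: Int, b: String)

    // Original predicate: NOT (a = 2 AND b IN ('1'))
    def original(r: Row): Boolean = !(r.a == 2 && Set("1").contains(r.b))

    // Buggy push-down: In cannot be converted, so only `a = 2` survives under the And,
    // and the Not is then applied to that partial filter: NOT (a = 2).
    def buggyPushdown(r: Row): Boolean = !(r.a == 2)

    val rows = (1 to 5).map(i => Row(i, (i % 2).toString))

    rows.filter(original)      // keeps Row(2, "0"), since "0" is not in ('1')
    rows.filter(buggyPushdown) // wrongly drops Row(2, "0")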

Contributor

@gatorsmile I am also going to try to come up with a fix. We can later see which one is more suitable.

Member Author

Glad to see your fix. : ) Thank you!

@SparkQA

SparkQA commented Dec 17, 2015

Test build #47939 has finished for PR 10278 at commit e219ac1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member Author

@yhuai @liancheng Regarding this PR, should I keep it open? This PR also has another fix that pushes down the In filter. Thanks!

@yhuai
Contributor

yhuai commented Dec 18, 2015

You can change it to just handle In. Actually, I am wondering what the problem with the other PR was that made us decide not to add In there?

@gatorsmile
Member Author

After reading through it, it looks like that PR was closed due to a separate issue with String filters in the same PR. Please correct me if my understanding is wrong. @liancheng

Do you want to deliver this by submitting another PR, @viirya? Either way is fine with me. Thanks!

# Conflicts:
#	sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala
@SparkQA

SparkQA commented Dec 19, 2015

Test build #48036 has finished for PR 10278 at commit 64cd5e6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public class JavaTwitterHashTagJoinSentiments
      • case class UnresolvedAlias(child: Expression, aliasName: Option[String] = None)
      • abstract class ImperativeAggregate extends AggregateFunction with CodegenFallback
      • case class UnresolvedWindowExpression(
      • case class WindowExpression(
      • case class Lead(input: Expression, offset: Expression, default: Expression)
      • case class Lag(input: Expression, offset: Expression, default: Expression)
      • abstract class AggregateWindowFunction extends DeclarativeAggregate with WindowFunction
      • abstract class RowNumberLike extends AggregateWindowFunction
      • trait SizeBasedWindowFunction extends AggregateWindowFunction
      • case class RowNumber() extends RowNumberLike
      • case class CumeDist() extends RowNumberLike with SizeBasedWindowFunction
      • case class NTile(buckets: Expression) extends RowNumberLike with SizeBasedWindowFunction
      • abstract class RankLike extends AggregateWindowFunction
      • case class Rank(children: Seq[Expression]) extends RankLike
      • case class DenseRank(children: Seq[Expression]) extends RankLike
      • case class PercentRank(children: Seq[Expression]) extends RankLike with SizeBasedWindowFunction

@gatorsmile
Member Author

retest this please

@SparkQA

SparkQA commented Dec 19, 2015

Test build #48045 has finished for PR 10278 at commit 64cd5e6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@liancheng
Contributor

@gatorsmile Yeah, you're right. #8956 initially aimed to fix other issues, and it also included the fix for In.

BTW, it's almost always strictly better to open small PRs that contain ONLY a single change than bigger ones that contain multiple changes. The former are much easier to review and merge. (One-liner PRs are super welcome!)

@gatorsmile
Member Author

Thank you for your suggestions! @liancheng : )

Next time, I will not mix multiple fixes in the same PR.

@gatorsmile gatorsmile changed the title [SPARK-12218] [SPARK-11164] [SQL] Fixed the Parquet's filter generation rule when Not is included in Parquet filter pushdown [SPARK-11164] [SQL] Add InSet pushdown filter back for Parquet Dec 22, 2015
@gatorsmile
Member Author

@yhuai Do you think it can be merged?

Please give the credit to @viirya, who submitted it in the past. Thank you!

@liancheng
Contributor

@gatorsmile Could you please update the PR description?

@gatorsmile
Member Author

@liancheng Done. : )

@liancheng
Contributor

Thanks! I'm merging this to master, and will attribute this one to @viirya.

@asfgit asfgit closed this in 50301c0 Dec 23, 2015
@gatorsmile gatorsmile deleted the parquetFilterNot branch December 23, 2015 06:23
@gatorsmile
Member Author

Thank you!

@viirya
Member

viirya commented Dec 23, 2015

Thanks @gatorsmile @liancheng
