Conversation

@viirya
Member

@viirya viirya commented Oct 1, 2015

@SparkQA

SparkQA commented Oct 1, 2015

Test build #43149 has finished for PR 8956 at commit f27288e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@rxin
Contributor

rxin commented Oct 3, 2015

cc @liancheng

@SparkQA

SparkQA commented Oct 3, 2015

Test build #43213 has finished for PR 8956 at commit 4d00ed0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@viirya
Member Author

viirya commented Oct 3, 2015

retest this please.

@SparkQA

SparkQA commented Oct 3, 2015

Test build #43215 has finished for PR 8956 at commit 4d00ed0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@SparkQA

SparkQA commented Oct 5, 2015

Test build #43238 has finished for PR 8956 at commit eb134b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

Contributor

Nit: remove ()

@liancheng
Contributor

Off topic but related, SetInFilter.canDrop can also leverage statistics information:

    private val min = valueSet.min
    private val max = valueSet.max

    override def canDrop(statistics: Statistics[T]): Boolean = {
      // Skip the whole row group when its [min, max] range is disjoint
      // from the filter's value set.
      statistics.getMax.compareTo(min) < 0 || max.compareTo(statistics.getMin) < 0
    }

(And we probably should rename SetInFilter to InSetFilter.)
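
A minimal, self-contained sketch of what such a statistics-aware set filter could look like. Note that `ColumnStats` here is a hypothetical stand-in for parquet-mr's `Statistics` class, used only so the sketch compiles on its own; the real filter would extend `UserDefinedPredicate`:

```scala
// Hypothetical stand-in for parquet-mr's Statistics, for illustration only.
final case class ColumnStats[T <: Comparable[T]](getMin: T, getMax: T)

// Sketch of an InSetFilter whose canDrop leverages row-group statistics:
// a row group can be skipped when its [min, max] range is disjoint from
// the filter's value set.
final case class InSetFilter[T <: Comparable[T]](valueSet: Set[T]) {
  require(valueSet.nonEmpty, "valueSet must not be empty")

  private val setMin: T = valueSet.reduce((a, b) => if (a.compareTo(b) <= 0) a else b)
  private val setMax: T = valueSet.reduce((a, b) => if (a.compareTo(b) >= 0) a else b)

  // Per-record check.
  def keep(value: T): Boolean = valueSet.contains(value)

  // Row-group check: drop when the group's range and the set's range are disjoint.
  def canDrop(statistics: ColumnStats[T]): Boolean =
    statistics.getMax.compareTo(setMin) < 0 || setMax.compareTo(statistics.getMin) < 0
}
```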

@viirya viirya changed the title [SPARK-10895][SQL] Push down string filters to Parquet [SPARK-10895][SPARK-11164][SQL] Push down string filters to Parquet Oct 17, 2015
@viirya viirya changed the title [SPARK-10895][SPARK-11164][SQL] Push down string filters to Parquet [SPARK-10895][SPARK-11164][SQL] Push down InSet and string filters to Parquet Oct 17, 2015
@viirya
Member Author

viirya commented Oct 17, 2015

@liancheng Thank you for your detailed comments. I've updated this patch. Once the tests pass, please review it again to see if there are any remaining problems.

@SparkQA

SparkQA commented Oct 17, 2015

Test build #43878 has finished for PR 8956 at commit 02bbab8.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])
    • abstract class StringFilter extends UserDefinedPredicate[Binary]
    • case class StringStartsWithFilter(prefix: String) extends StringFilter
    • case class StringEndsWithFilter(suffix: String) extends StringFilter
    • case class StringContainsFilter(str: String) extends StringFilter

@viirya
Member Author

viirya commented Oct 17, 2015

retest this please.

@SparkQA

SparkQA commented Oct 17, 2015

Test build #43880 has finished for PR 8956 at commit 02bbab8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])
    • abstract class StringFilter extends UserDefinedPredicate[Binary]
    • case class StringStartsWithFilter(prefix: String) extends StringFilter
    • case class StringEndsWithFilter(suffix: String) extends StringFilter
    • case class StringContainsFilter(str: String) extends StringFilter

@rxin
Contributor

rxin commented Oct 17, 2015

Are there any performance improvements by pushing this down?

@viirya
Member Author

viirya commented Oct 17, 2015

I can run some performance tests later.

@rxin
Contributor

rxin commented Oct 17, 2015

Thanks - that'd be great.

@viirya
Member Author

viirya commented Oct 19, 2015

Sorry, I am traveling. I will submit the test results in a few days.

@rxin
Contributor

rxin commented Oct 19, 2015

@viirya do you mind closing this and reopening it when it's ready?

@viirya
Member Author

viirya commented Oct 20, 2015

Sure.

@viirya viirya closed this Oct 20, 2015
@viirya
Member Author

viirya commented Oct 25, 2015

@rxin I am curious: although I haven't observed a significant performance improvement so far from a simple projection + filter experiment, does pushing these filters down to the Parquet side retrieve less data and reduce the memory footprint? If so, even at the same performance level, is this patch still worth merging?

@rxin
Contributor

rxin commented Oct 25, 2015

If we don't observe performance improvements, it's definitely not worth it. Can you post how you measured it, along with the performance results? Thanks.

@viirya
Member Author

viirya commented Oct 25, 2015

ok. Thanks. Because we found that with pushdown filters we can avoid OOM problems when processing large data in our daily usage, I am wondering if it is helpful to others too.

I will post the performance test results later.

@rxin
Contributor

rxin commented Oct 25, 2015

How does pushdown avoid OOM?

@viirya
Member Author

viirya commented Oct 25, 2015

Because we can pre-filter the data? Without pushdown, all the data is loaded into memory first and only filtered afterwards.

@rxin
Contributor

rxin commented Oct 25, 2015

Is that the case? I thought we load records one by one (or a small batch at a time) and then apply the filter directly on them?

@viirya
Member Author

viirya commented Oct 25, 2015

Hmm, I am not sure about that. I assumed that the Parquet relation reads all the data first when no pushdown filters are applied, and that Spark SQL's Filter operation is applied afterwards. Maybe @liancheng can answer this?

@liancheng
Contributor

Well, it depends. The situation is a little bit tricky to explain. In general there are two cases:

  1. String filters with high selectivity, namely most records can be dropped
  • Performance

    Usually, I'd expect no noticeable performance gain, because each record is checked against the pushed-down filter, and string operations themselves are CPU bound. So performance should be similar to the case where no filter is pushed down at all.

    However, a properly implemented StringStartsWithFilter.canDrop (as I mentioned in this comment) can bring big performance win since it can drop entire row groups whenever possible. But this requires us to bump parquet-mr to 1.8+ first, which is done in [SPARK-9876] [SQL] Bumps parquet-mr to 1.8.1 #9225.

  • Memory footprint

    What @viirya observed is reasonable. One benefit of Parquet filters is that they are applied before record assembly, i.e. we can drop a record before converting the underlying Parquet column values into an InternalRow. I think that's why @viirya observed that the OOM was gone.

    (BTW, ParquetRelation processes all the data using iterators, so we don't read all the data first and then apply the filters. My theory is that it's the InternalRow materialization that costs more memory.)

  2. String filters with low selectivity, namely most records can NOT be dropped
  • Performance

    In this case, I'd expect performance regression. This is because currently Spark SQL tends to be pessimistic, and always applies all the filters again even if some of them are pushed down. In this case, almost all records are filtered twice. Since string operations are CPU bound, this can be time consuming.

  • Memory footprint

    Since the string filters in this PR "steal" the underlying byte arrays without copying them, I'd expect the memory footprint is similar to the normal case.
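
To make the StringStartsWithFilter.canDrop idea above concrete, here is a hedged sketch. It is an illustration, not Spark's actual implementation: parquet-mr's UserDefinedPredicate operates on Binary values and Statistics objects, but plain Strings keep the row-group logic visible:

```scala
// Sketch of a prefix filter with a statistics-aware canDrop.
final case class StringStartsWithFilter(prefix: String) {
  // Per-record check: CPU bound, run on every record surviving canDrop.
  def keep(value: String): Boolean = value != null && value.startsWith(prefix)

  // Row-group check: the group [min, max] can be dropped when no string in
  // it can start with `prefix` -- either max sorts before the prefix, or
  // min already sorts after every string beginning with the prefix.
  def canDrop(min: String, max: String): Boolean =
    max.compareTo(prefix) < 0 ||
      min.take(prefix.length).compareTo(prefix) > 0
}
```

When canDrop returns true, an entire row group is skipped before record assembly, which is where the large performance and memory wins would come from in the high-selectivity case.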

@viirya
Member Author

viirya commented Oct 26, 2015

Thank you @liancheng for the clear explanation!

So it looks like the only benefit of this patch is a reduced memory footprint in certain cases. If you all think it is not worth merging, we should keep it closed.

@liancheng
Contributor

@viirya I think we can add StringStartsWithFilter later after #9225 is merged. Also we are considering removing the defensive filtering. But yeah, for now let's keep this one closed.

@viirya
Member Author

viirya commented Oct 27, 2015

@liancheng ok. Thank you.

@viirya viirya deleted the parquet-stringfilter-pushdown branch December 27, 2023 18:18