Conversation

@viirya
Member

@viirya viirya commented Oct 1, 2015

@SparkQA

SparkQA commented Oct 1, 2015

Test build #43149 has finished for PR 8956 at commit f27288e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@rxin
Contributor

rxin commented Oct 3, 2015

cc @liancheng

@SparkQA

SparkQA commented Oct 3, 2015

Test build #43213 has finished for PR 8956 at commit 4d00ed0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@viirya
Member Author

viirya commented Oct 3, 2015

retest this please.

@SparkQA

SparkQA commented Oct 3, 2015

Test build #43215 has finished for PR 8956 at commit 4d00ed0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

@SparkQA

SparkQA commented Oct 5, 2015

Test build #43238 has finished for PR 8956 at commit eb134b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class StringFilter(

Contributor

Nit: remove ()

@liancheng
Contributor

Off topic but related, SetInFilter.canDrop can also leverage statistics information:

    private val min = valueSet.min
    private val max = valueSet.max

    override def canDrop(statistics: Statistics[T]): Boolean = {
      // Skip the whole row group when its [min, max] range is disjoint
      // from the filter's value set.
      statistics.getMax.compareTo(min) < 0 || max.compareTo(statistics.getMin) < 0
    }

(And we probably should rename SetInFilter to InSetFilter.)
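
A minimal, self-contained sketch of what such a statistics-aware set filter could look like. Note that `ColumnStats` here is a hypothetical stand-in for parquet-mr's `Statistics` class, used only so the sketch compiles on its own; the real filter would extend `UserDefinedPredicate`:

```scala
// Hypothetical stand-in for parquet-mr's Statistics, for illustration only.
final case class ColumnStats[T <: Comparable[T]](getMin: T, getMax: T)

// Sketch of an InSetFilter whose canDrop leverages row-group statistics:
// a row group can be skipped when its [min, max] range is disjoint from
// the filter's value set.
final case class InSetFilter[T <: Comparable[T]](valueSet: Set[T]) {
  require(valueSet.nonEmpty, "valueSet must not be empty")

  private val setMin: T = valueSet.reduce((a, b) => if (a.compareTo(b) <= 0) a else b)
  private val setMax: T = valueSet.reduce((a, b) => if (a.compareTo(b) >= 0) a else b)

  // Per-record check.
  def keep(value: T): Boolean = valueSet.contains(value)

  // Row-group check: drop when the group's range and the set's range are disjoint.
  def canDrop(statistics: ColumnStats[T]): Boolean =
    statistics.getMax.compareTo(setMin) < 0 || setMax.compareTo(statistics.getMin) < 0
}
```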

@viirya viirya changed the title [SPARK-10895][SQL] Push down string filters to Parquet [SPARK-10895][SPARK-11164][SQL] Push down string filters to Parquet Oct 17, 2015
@viirya viirya changed the title [SPARK-10895][SPARK-11164][SQL] Push down string filters to Parquet [SPARK-10895][SPARK-11164][SQL] Push down InSet and string filters to Parquet Oct 17, 2015
@viirya
Member Author

viirya commented Oct 17, 2015

@liancheng Thank you for your detailed comments. I've updated this patch. Once the tests pass, please review it again to see if there are any remaining problems.

@SparkQA

SparkQA commented Oct 17, 2015

Test build #43878 has finished for PR 8956 at commit 02bbab8.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])
    • abstract class StringFilter extends UserDefinedPredicate[Binary]
    • case class StringStartsWithFilter(prefix: String) extends StringFilter
    • case class StringEndsWithFilter(suffix: String) extends StringFilter
    • case class StringContainsFilter(str: String) extends StringFilter

@viirya
Member Author

viirya commented Oct 17, 2015

retest this please.

@SparkQA

SparkQA commented Oct 17, 2015

Test build #43880 has finished for PR 8956 at commit 02bbab8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • case class InSetFilter[T <: Comparable[T]](valueSet: Set[T])
    • abstract class StringFilter extends UserDefinedPredicate[Binary]
    • case class StringStartsWithFilter(prefix: String) extends StringFilter
    • case class StringEndsWithFilter(suffix: String) extends StringFilter
    • case class StringContainsFilter(str: String) extends StringFilter

@rxin
Contributor

rxin commented Oct 17, 2015

Are there any performance improvements by pushing this down?

@viirya
Member Author

viirya commented Oct 17, 2015

I can run some performance tests later.

@rxin
Contributor

rxin commented Oct 17, 2015

Thanks - that'd be great.

@viirya
Member Author

viirya commented Oct 19, 2015

Sorry, I am traveling. I will submit the test results in a few days.

@rxin
Contributor

rxin commented Oct 19, 2015

@viirya do you mind closing this and reopening it when it's ready?

@viirya
Member Author

viirya commented Oct 20, 2015

Sure.

@viirya viirya closed this Oct 20, 2015
@viirya
Member Author

viirya commented Oct 25, 2015

@rxin I am curious: although I haven't observed a significant performance improvement so far from a simple projection + filter experiment, does pushing these filters down to the Parquet side retrieve less data and reduce the memory footprint? If so, even at the same performance level, is this patch still worth merging?

@rxin
Contributor

rxin commented Oct 25, 2015

If we don't observe performance improvements, it's definitely not worth it. Can you post how you measured it, along with the performance results? Thanks.

@viirya
Member Author

viirya commented Oct 25, 2015

ok. Thanks. Because we found that with pushdown filters we can avoid OOM problems when processing large data in our daily usage, I am wondering if it is helpful to others too.

I will post the performance test results later.

@rxin
Contributor

rxin commented Oct 25, 2015

How does pushdown avoid OOM?

@viirya
Member Author

viirya commented Oct 25, 2015

Because we can pre-filter the data? Without pushdown, all the data is loaded into memory first and only filtered afterwards.

@rxin
Contributor

rxin commented Oct 25, 2015

Is that the case? I thought we load records one by one (or a small batch at a time) and then apply the filter directly on them?

@viirya
Member Author

viirya commented Oct 25, 2015

Hmm, I am not sure about that. I assumed that the Parquet relation reads all the data first when no pushdown filters are applied, and that Spark SQL's Filter operation is applied afterwards. Maybe @liancheng can answer this?

@liancheng
Contributor

Well, it depends. The situation is a little bit tricky to explain. In general there are two cases:

  1. String filters with high selectivity, namely most records can be dropped
  • Performance

    Usually, I'd expect no noticeable performance gain, because each record is checked against the pushed-down filter, and string operations themselves are CPU bound. So performance should be similar to the case where no filter is pushed down at all.

    However, a properly implemented StringStartsWithFilter.canDrop (as I mentioned in this comment) can bring big performance win since it can drop entire row groups whenever possible. But this requires us to bump parquet-mr to 1.8+ first, which is done in [SPARK-9876] [SQL] Bumps parquet-mr to 1.8.1 #9225.

  • Memory footprint

    What @viirya observed is reasonable. One benefit of Parquet filters is that they are applied before record assembly, i.e. we can drop a record before converting the underlying Parquet column values into an InternalRow. I think that's why @viirya observed that the OOM was gone.

    (BTW, ParquetRelation processes all the data using iterators, so we don't read all the data first and then apply the filters. My theory is that it's the InternalRow materialization that costs more memory.)

  2. String filters with low selectivity, namely most records can NOT be dropped
  • Performance

    In this case, I'd expect performance regression. This is because currently Spark SQL tends to be pessimistic, and always applies all the filters again even if some of them are pushed down. In this case, almost all records are filtered twice. Since string operations are CPU bound, this can be time consuming.

  • Memory footprint

    Since the string filters in this PR "steal" the underlying byte arrays without copying them, I'd expect the memory footprint is similar to the normal case.
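
To make the StringStartsWithFilter.canDrop idea above concrete, here is a hedged sketch. It is an illustration, not Spark's actual implementation: parquet-mr's UserDefinedPredicate operates on Binary values and Statistics objects, but plain Strings keep the row-group logic visible:

```scala
// Sketch of a prefix filter with a statistics-aware canDrop.
final case class StringStartsWithFilter(prefix: String) {
  // Per-record check: CPU bound, run on every record surviving canDrop.
  def keep(value: String): Boolean = value != null && value.startsWith(prefix)

  // Row-group check: the group [min, max] can be dropped when no string in
  // it can start with `prefix` -- either max sorts before the prefix, or
  // min already sorts after every string beginning with the prefix.
  def canDrop(min: String, max: String): Boolean =
    max.compareTo(prefix) < 0 ||
      min.take(prefix.length).compareTo(prefix) > 0
}
```

When canDrop returns true, an entire row group is skipped before record assembly, which is where the large performance and memory wins would come from in the high-selectivity case.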

@viirya
Member Author

viirya commented Oct 26, 2015

Thank you @liancheng for the clear explanation!

So it looks like the only benefit of this patch is a reduced memory footprint in certain cases. If you all think it is not worth merging, we should keep it closed.

@liancheng
Contributor

@viirya I think we can add StringStartsWithFilter later after #9225 is merged. Also we are considering removing the defensive filtering. But yeah, for now let's keep this one closed.

@viirya
Member Author

viirya commented Oct 27, 2015

@liancheng ok. Thank you.

@viirya viirya deleted the parquet-stringfilter-pushdown branch December 27, 2023 18:18