[SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates #22597
Conversation
Test build #96810 has finished for PR 22597 at commit
```scala
// Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
// in order to distinguish predicate pushdown for nested columns.
private def quoteAttributeNameIfNeeded(name: String) : String = {
  if (!name.contains("`") && name.contains(".")) {
```
Does this condition take a backtick in the column name into account? For instance:
```
>>> spark.range(1).toDF("abc`.abc").show()
+--------+
|abc`.abc|
+--------+
|       0|
+--------+
```
Thank you for the review. I'll consider that, too.
@HyukjinKwon. Actually, Spark 2.3.2 ORC (native/hive) doesn't support a backtick character in column names; it fails at write time. And although Spark 2.4.0 broadens the supported special characters in column names, such as `.` and `"`, the backtick character is still not handled.
So, for that one, I'll proceed in another PR, since it's an improvement rather than a regression fix.
Also, cc @gatorsmile and @dbtsai.
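For context, a minimal sketch of what the full helper could look like. Only the guard condition appears in the diff excerpt above, so the backquote-wrapping body below is an assumption rather than the committed code:

```scala
// Sketch only: the diff above shows just the guard condition; the wrapping is assumed.
// Quote a column name that contains dots (and no backticks) so that ORC 1.5.0+
// treats it as a single top-level column rather than a nested field reference.
private def quoteAttributeNameIfNeeded(name: String): String = {
  if (!name.contains("`") && name.contains(".")) {
    s"`$name`"
  } else {
    name
  }
}
```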
For the ORC and Avro improvement, SPARK-25722 has been created.
Is it possible to add tests, like the Parquet ones that remove the filter on the Spark SQL side, to ensure that the predicate is pushed down to the reader? Thanks.
Yes. Please add a test case.
Thank you for the review, @dbtsai and @gatorsmile. BTW, what do you mean by removing? The pushed filter doesn't introduce a correctness issue like Parquet's does. Since it's a performance slowdown, this PR wants to fix it; we don't want to remove filters in this PR. Also, for a performance slowdown, we cannot add a test case; we usually write a benchmark to detect this kind of regression. Do we want to add a benchmark case to FilterPushdownBenchmark instead?
In
Thanks. I got it. You mean

I haven't looked into it, but the Parquet record-level filtering is disabled by default, so if we remove predicates on the Spark side, the result can be wrong even if the predicates are pushed to Parquet.
Test build #97329 has finished for PR 22597 at commit
| test("SPARK-25579 ORC PPD should support column names with dot") { | ||
| import testImplicits._ |
Can we add a test at OrcFilterSuite too?
Ur, this is OrcFilterSuite.

> Can we add a test at OrcFilterSuite too?

For HiveOrcFilterSuite, the Hive ORC implementation doesn't support dots in column names.
Okay. One end-to-end test should be enough.
```scala
val path = new File(dir, "orc").getCanonicalPath
Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
checkAnswer(stripSparkFilter(df), Row(1, 2))
```
@dongjoon-hyun, technically shouldn't we test if the stripes are filtered? I added some tests a while ago (stripSparkFilter was added by me as well, FWIW):
spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala
Lines 445 to 459 in 5d572fc
| test("Support for pushing down filters for decimal types") { | |
| withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") { | |
| val data = (0 until 10).map(i => Tuple1(BigDecimal.valueOf(i))) | |
| withTempPath { file => | |
| // It needs to repartition data so that we can have several ORC files | |
| // in order to skip stripes in ORC. | |
| spark.createDataFrame(data).toDF("a").repartition(10) | |
| .write.orc(file.getCanonicalPath) | |
| val df = spark.read.orc(file.getCanonicalPath).where("a == 2") | |
| val actual = stripSparkFilter(df).count() | |
| assert(actual < 10) | |
| } | |
| } | |
| } |
Like that test, this test also generates two ORC files with one row each and tests whether PPD works.

That's explicitly enabled for the Parquet tests (and it was disabled by default by me, FWIW). For ORC tests, since ORC doesn't support record-by-record filtering, the test checks that the output is smaller than the original data. Some Parquet tests do this as well, for instance Lines 1016 to 1040 in 5d726b8.
Test build #97366 has finished for PR 22597 at commit
```scala
val path = new File(dir, "orc").getCanonicalPath
Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
checkAnswer(stripSparkFilter(df), Row(1, 2))
```
To confirm, this only works when (1, 2) and (3, 4) are in different row groups? (Not sure what the terminology is in ORC.)
Yup.
Yep. It works when they are in different stripes.
How do we generalize this to nested cases? The parent struct can contain a dot as well.
The ORC data source doesn't support nested column pruning yet.
Thank you for the review, @dbtsai! I ignored PPD for nested columns here because Spark doesn't push those down in 2.4, and still won't until your PR (#22573). With your PR, Spark 3.0 will support that, and we can update this to handle those cases, too.
@cloud-fan. Actually, ORC 1.5.0 started to support PPD with nested columns (ORC-323), so @dbtsai and I discussed supporting that before. We are going to support ORC PPD with nested columns in Spark 3.0 without regression.
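As a hypothetical illustration (not from the PR itself) of the ambiguity the quoting resolves, and of why the nested case needs separate work:

```scala
// Hypothetical spark-shell session; assumes spark.sql.orc.filterPushdown=true
// and an illustrative output path /tmp/flat.
scala> Seq((1, "a"), (2, "b")).toDF("col.dot", "value").write.orc("/tmp/flat")

// "col.dot" is one flat column whose name happens to contain a dot. If the name
// reaches ORC unquoted, ORC 1.5.0+ can read it as field "dot" inside a struct
// named "col", so the pushed predicate does not match; quoting it as `col.dot`
// keeps it a single top-level column.
scala> spark.read.orc("/tmp/flat").where("`col.dot` = 1").show()

// Generalizing to true nested columns (e.g. a struct whose own name contains a
// dot) would require quoting each path segment, which is the follow-up work
// targeted at Spark 3.0.
```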
```scala
withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempDir { dir =>
    val path = new File(dir, "orc").getCanonicalPath
    Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
```
How about explicitly repartitioning to make separate output files?
We are using the default parallelism from TestSparkSession on two rows, and it already generates separate output files.
If you are concerned about possible flakiness, we can increase the number of rows to 10, call repartition(10), and check assert(actual < 10) as you did before. Do you want that?
Do not rely on implicit environment values; let's make the test as explicit as possible.
Sure. Thank you for the confirmation, @cloud-fan and @HyukjinKwon.
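For reference, a hedged sketch of how the test could be made explicit along those lines. It follows the repartition-and-count pattern of the decimal-type test quoted earlier; the row count, data, and assertion are illustrative rather than the committed code:

```scala
// Illustrative only: mirrors the repartition-and-count pattern discussed above,
// not necessarily the exact test that was committed.
withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
  withTempDir { dir =>
    val path = new File(dir, "orc").getCanonicalPath
    // Repartition explicitly so several ORC files (and stripes) are written,
    // making the stripe-skipping effect of the pushed predicate observable.
    (0 until 10).map(i => (i, i)).toDF("col.dot.1", "col.dot.2")
      .repartition(10).write.orc(path)
    val df = spark.read.orc(path).where("`col.dot.1` = 1")
    // If the predicate reaches the ORC reader, whole files/stripes are skipped,
    // so the count after stripping the Spark-side filter stays below 10.
    assert(stripSparkFilter(df).count() < 10)
  }
}
```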
```scala
    df.write.mode(SaveMode.Overwrite).orc(path.getCanonicalPath)
  }

  protected def checkPredicatePushDown(df: DataFrame, numRows: Int, predicate: String): Unit = {
```
@HyukjinKwon. I refactored this since it's now repeated three times.
Also, this function should live here because the existing two instances are in OrcQueryTest and the new instance is in OrcQuerySuite. There is another similar instance, but I skipped it because it doesn't follow the same pattern.
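For context, a hedged sketch of what such a helper could look like, based on the repartition-and-count pattern quoted earlier; the committed body is not shown in this excerpt, so the implementation below is an assumption:

```scala
// Hypothetical body; the committed helper may differ in details.
protected def checkPredicatePushDown(df: DataFrame, numRows: Int, predicate: String): Unit = {
  withTempPath { file =>
    // Repartition so the rows are spread over several ORC files, letting a
    // pushed predicate skip whole files/stripes.
    df.repartition(numRows).write.orc(file.getCanonicalPath)
    val filtered = spark.read.orc(file.getCanonicalPath).where(predicate)
    // Fewer rows than written, after stripping the Spark-side filter, means
    // the predicate was applied by the ORC reader itself.
    assert(stripSparkFilter(filtered).count() < numRows)
  }
}
```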
LGTM

Test build #97436 has finished for PR 22597 at commit

retest this please

Test build #97441 has finished for PR 22597 at commit

retest this please

Test build #97445 has finished for PR 22597 at commit

Merged to master and branch-2.4.
[SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates
## What changes were proposed in this pull request?
This PR aims to fix an ORC performance regression in the Spark 2.4.0 RCs compared to Spark 2.3.2. Currently, for column names with `.`, the pushed predicates are ignored.
**Test Data**
```scala
scala> val df = spark.range(Int.MaxValue).sample(0.2).toDF("col.with.dot")
scala> df.write.mode("overwrite").orc("/tmp/orc")
```
**Spark 2.3.2**
```scala
scala> spark.sql("set spark.sql.orc.impl=native")
scala> spark.sql("set spark.sql.orc.filterPushdown=true")
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 1542 ms
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 152 ms
```
**Spark 2.4.0 RC3**
```scala
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 4074 ms
scala> spark.time(spark.read.orc("/tmp/orc").where("`col.with.dot` < 10").show)
+------------+
|col.with.dot|
+------------+
| 5|
| 7|
| 8|
+------------+
Time taken: 1771 ms
```
## How was this patch tested?
Pass the Jenkins with a newly added test case.
Closes #22597 from dongjoon-hyun/SPARK-25579.
Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit 2c664ed)
Signed-off-by: hyukjinkwon <[email protected]>
Thank you all!