[SPARK-25579][SQL] Use quoted attribute names if needed in pushed ORC predicates #22597
Changes from 3 commits
```diff
@@ -17,18 +17,20 @@
 package org.apache.spark.sql.execution.datasources.orc

 import java.io.File
 import java.nio.charset.StandardCharsets
 import java.sql.{Date, Timestamp}

 import scala.collection.JavaConverters._

 import org.apache.orc.storage.ql.io.sarg.{PredicateLeaf, SearchArgument}

-import org.apache.spark.sql.{Column, DataFrame}
+import org.apache.spark.sql.{Column, DataFrame, Row}
 import org.apache.spark.sql.catalyst.dsl.expressions._
 import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.planning.PhysicalOperation
 import org.apache.spark.sql.execution.datasources.{DataSourceStrategy, HadoopFsRelation, LogicalRelation}
 import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.test.SharedSQLContext
 import org.apache.spark.sql.types._

@@ -383,4 +385,17 @@ class OrcFilterSuite extends OrcTest with SharedSQLContext {
     )).get.toString
   }
 }
+
+  test("SPARK-25579 ORC PPD should support column names with dot") {
+    import testImplicits._
+
+    withSQLConf(SQLConf.ORC_FILTER_PUSHDOWN_ENABLED.key -> "true") {
+      withTempDir { dir =>
+        val path = new File(dir, "orc").getCanonicalPath
+        Seq((1, 2), (3, 4)).toDF("col.dot.1", "col.dot.2").write.orc(path)
+        val df = spark.read.orc(path).where("`col.dot.1` = 1 and `col.dot.2` = 2")
+        checkAnswer(stripSparkFilter(df), Row(1, 2))
+      }
+    }
+  }
 }
```
Does this condition take a backtick in the column name into account? For instance,
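A hypothetical illustration of the concern (not the reviewer's original example), runnable in spark-shell:

```scala
import spark.implicits._

// A column whose name itself contains a backtick: naively wrapping this
// name in backticks would produce a malformed quoted identifier.
val df = Seq((1, 2)).toDF("col`tick", "col.dot")
```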
Thank you for the review. I'll consider that, too.
@HyukjinKwon Actually, Spark 2.3.2 ORC (native/hive) doesn't support a backtick character in column names; it fails on the write operation. And although Spark 2.4.0 broadens the supported special characters in column names, such as `.` and `"`, the backtick character is not handled yet. So, for that one, I'll proceed in another PR since it's an improvement instead of a regression.

Also, cc @gatorsmile and @dbtsai.
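An illustration of the limitation described above (a sketch assuming the Spark 2.3.x behavior reported in this comment; the output path is illustrative and the exact exception is not shown):

```scala
import spark.implicits._

// Per the comment above, the ORC writers in Spark 2.3.2 are reported to
// reject a backtick in a column name at write time.
Seq((1, 2)).toDF("col`tick", "normal")
  .write.orc("/tmp/orc-backtick")  // expected to fail on write
```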
For the `ORC` and `AVRO` improvement, SPARK-25722 is created.