[SPARK-12139] [SQL] REGEX Column Specification #18023
Conversation
ok to test
    def enableHiveSupportQuotedIdentifiers(): Boolean = {
      SparkEnv.get != null &&
        SparkEnv.get.conf != null &&
        SparkEnv.get.conf.getBoolean("hive.support.quoted.identifiers", false)

Spark SQL always supports quoted identifiers. However, the missing part is the REGEX column specification. How about adding such a conf to SQLConf?

Added to SQLConf.
Test build #77062 has finished for PR 18023 at commit
    val escapedIdentifier = "`(.+)`".r
    val ret = Option(ctx.fieldName.getStart).map(_.getText match {
      case r @ escapedIdentifier(i) =>
        UnresolvedRegex(i, Some(unresolved_attr.name))

How about no change in the parser? Is it possible to simply resolve it in ResolveReferences? BTW, we also need to handle the same issue in the Dataset APIs.

Before we get to ResolveReferences, we need to flag the regex as a special case, just like Star; ResolveReferences then resolves based on the expression type (UnresolvedRegex, Star, etc.). The Dataset API goes down the same path. I have added a unit test to DatasetSuite.scala for the regex case.
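The resolution path described here can be illustrated with a Spark-free toy model. All type names below are simplified stand-ins for the real Catalyst classes, not the actual API:

```scala
// Toy model: the parser tags a backtick-quoted pattern as a special
// unresolved node (analogous to Star), and the resolver later expands it
// against the child plan's output columns.
object ResolveSketch {
  sealed trait Unresolved
  case class UnresolvedAttribute(name: String) extends Unresolved
  case class UnresolvedRegex(pattern: String) extends Unresolved // like Star

  def resolve(e: Unresolved, output: Seq[String]): Seq[String] = e match {
    case UnresolvedRegex(p)     => output.filter(_.matches(p)) // expand by regex
    case UnresolvedAttribute(n) => output.filter(_ == n)       // exact lookup
  }

  def main(args: Array[String]): Unit = {
    val out = Seq("a", "ab", "b1")
    println(resolve(UnresolvedRegex("a.*"), out))    // List(a, ab)
    println(resolve(UnresolvedAttribute("b1"), out)) // List(b1)
  }
}
```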
Please update the PR title to
Test build #77063 has finished for PR 18023 at commit
    case Some(t) => input.output.filter(_.qualifier.filter(resolver(_, t)).nonEmpty)
      .filter(_.name.matches(expr))
    }

An Attribute is always a NamedExpression; why do we need this?

You are right, we don't need it any more. Removed.
    *
    * @param table an optional table that should be the target of the expansion. If omitted, all
    *              tables' columns are produced.
    */

expr is the pattern, right? Maybe we should give it a better name.

Renamed to regexPattern.
    val expandedAttributes: Seq[Attribute] = table match {
      // If there is no table specified, use all input attributes that match expr
      case None => input.output.filter(_.name.matches(expr))
      // If there is a table, pick out attributes that are part of this table that match expr

input.output.filter(_.qualifier.exists(resolver(_, t))) is a bit more concise.

Updated.
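The suggestion above in miniature: exists expresses the same predicate as filter(...).nonEmpty without building an intermediate collection. A plain-Scala sketch, with an illustrative case-insensitive resolver standing in for Spark's:

```scala
// Compare the verbose and concise forms of "does any qualifier resolve to t".
object ExistsSketch {
  // Illustrative resolver: case-insensitive comparison, mirroring Spark's
  // default (case-insensitive) analysis mode.
  val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

  def main(args: Array[String]): Unit = {
    val qualifier = Seq("t1", "T2")
    val verbose = qualifier.filter(resolver(_, "t2")).nonEmpty // builds a Seq
    val concise = qualifier.exists(resolver(_, "t2"))          // short-circuits
    println((verbose, concise)) // (true,true)
  }
}
```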
    case UnresolvedAttribute(nameParts) =>
    case unresolved_attr @ UnresolvedAttribute(nameParts) =>
      if (conf.supportQuotedIdentifiers) {
        val escapedIdentifier = "`(.+)`".r

We don't need to compile the same regex over and over. Can you move this to ParserUtils? I am also wondering whether we shouldn't do the match in the parser itself.

Added an API in ParserUtils. I think the parser can still get the backticks; after the match, the backticks are stripped off.
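A minimal, Spark-free sketch of the point being made: the "`(.+)`".r pattern is compiled once (as in ParserUtils) rather than per call, and matching via the capture group yields the identifier with the backticks already stripped. Object and method names here are illustrative:

```scala
// Compile the escaped-identifier regex once instead of on every call;
// the capture group strips the surrounding backticks during the match.
object EscapedIdentifierSketch {
  val escapedIdentifier = "`(.+)`".r // group(1) = text between the backticks

  def stripBackticks(token: String): Option[String] = token match {
    case escapedIdentifier(inner) => Some(inner) // backticks stripped
    case _                        => None        // not a quoted identifier
  }

  def main(args: Array[String]): Unit = {
    println(stripBackticks("`(a|b)?+.+`")) // Some((a|b)?+.+)
    println(stripBackticks("plainName"))   // None
  }
}
```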
    val attr = ctx.fieldName.getText
    expression(ctx.base) match {
      case UnresolvedAttribute(nameParts) =>
      case unresolved_attr @ UnresolvedAttribute(nameParts) =>

Please use a guard, e.g. case unresolved_attr @ UnresolvedAttribute(nameParts) if conf.supportQuotedIdentifiers =>. That makes the logic down the line much simpler.

It is more concise to put the if inside the case unresolved_attr @ UnresolvedAttribute(nameParts) branch; if we used a guard, we would still need to handle the case where conf.supportQuotedIdentifiers is false.
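The trade-off being discussed can be sketched without Spark: with a guard, the disabled-conf case falls through to the next pattern, so it still needs its own branch. All names below are illustrative:

```scala
// Guard style: the regex case only fires when the flag is on; otherwise the
// quoted token falls through and is treated as a plain attribute name.
object GuardSketch {
  val escapedIdentifier = "`(.+)`".r

  def interpret(token: String, quotedRegexEnabled: Boolean): String =
    token match {
      case escapedIdentifier(pattern) if quotedRegexEnabled => s"regex:$pattern"
      case other                                            => s"attr:$other"
    }

  def main(args: Array[String]): Unit = {
    println(interpret("`a.*`", quotedRegexEnabled = true))  // regex:a.*
    println(interpret("`a.*`", quotedRegexEnabled = false)) // attr:`a.*`
  }
}
```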
    */
    override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
      if (conf.supportQuotedIdentifiers) {
        val escapedIdentifier = "`(.+)`".r

We don't need to compile the same regex over and over. Can you move this to ParserUtils?

Added an API in ParserUtils.
    case unresolved_attr @ UnresolvedAttribute(nameParts) =>
      if (conf.supportQuotedIdentifiers) {
        val escapedIdentifier = "`(.+)`".r
        val ret = Option(ctx.fieldName.getStart).map(_.getText match {

Using an Option here does not add anything.

Removed the Option.
    override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
      if (conf.supportQuotedIdentifiers) {
        val escapedIdentifier = "`(.+)`".r
        val ret = Option(ctx.getStart).map(_.getText match {

Using an Option here does not add anything.

Removed the Option.
Test build #77068 has finished for PR 18023 at commit
Like what we did for df.select(df("(a|b)?+.+"))
      .intConf
      .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

    val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")

How about renaming it to spark.sql.parser.quotedRegexColumnNames?

Renamed.
    val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")
      .internal()
      .doc("When true, identifiers specified by regex patterns will be expanded.")

We only do it for the column names, right?

It must be quoted. Thus, we also need to mention that in the description.

Yes, this only applies to column names. Updated the doc.
    sb.toString()
    }

    val escapedIdentifier = "`(.+)`".r

Please add a comment for this.

Added.
Test build #77081 has finished for PR 18023 at commit

Test build #77086 has finished for PR 18023 at commit
      .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

    val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames")
      .internal()

@rxin @hvanhovell @cloud-fan Should we keep it internal?

I think it should be public. I didn't realize that I put it under internal.

Should be public.
    assert(e.message.contains("Invalid number of arguments"))
    }

    test("SPARK-12139: REGEX Column Specification for Hive Queries") {

Could you create a file in https://github.com/apache/spark/tree/master/sql/core/src/test/resources/sql-tests/inputs? All new SQL test cases need to be moved there now. You can run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite" to generate the result files. Thanks!

Yes, let's use those rather than adding more tests to SQLQuerySuite. I'd love to get rid of SQLQuerySuite...

OK. Moved the test to sql/core/src/test/resources/sql-tests/inputs.
    |FROM testData2 t
    |WHERE a = 1
    """.stripMargin),
    Row(1) :: Row(2) :: Nil)

The above two test queries are not needed in the new suite.

Removed. I was trying to make sure the existing behaviors are not broken.
Test build #77099 has finished for PR 18023 at commit
    -- Clean-up
    DROP VIEW IF EXISTS testData;
    DROP VIEW IF EXISTS testData2;

No need to drop the temp views.

Removed.
    CREATE OR REPLACE TEMPORARY VIEW testData2 AS SELECT * FROM VALUES
    (1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)
    AS testData2(a, b);

Since the test cases are testing regex pattern matching on column names, could you add more names and let the regex patterns match more columns?

Sure, added two more columns.
    def col(colName: String): Column = colName match {
      case "*" =>
        Column(ResolvedStar(queryExecution.analyzed.output))
      case ParserUtils.escapedIdentifier(i) if sqlContext.conf.supportQuotedRegexColumnName =>

Please avoid using i or j; use meaningful variable names instead.

Updated.
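A simplified, Spark-free sketch of the dispatch in col above: "*" expands to every column, a backtick-quoted name is treated as a regex over column names when the conf is on, and anything else is an exact lookup. The types and names here are stand-ins, not the real Dataset API:

```scala
object ColSketch {
  val escapedIdentifier = "`(.+)`".r
  val columns = Seq("a", "b", "ab", "c") // stand-in for the analyzed output

  def col(colName: String, quotedRegexEnabled: Boolean): Seq[String] =
    colName match {
      case "*" => columns // like ResolvedStar: all columns
      case escapedIdentifier(columnNameRegex) if quotedRegexEnabled =>
        columns.filter(_.matches(columnNameRegex)) // like UnresolvedRegex
      case name => columns.filter(_ == name) // plain column lookup
    }

  def main(args: Array[String]): Unit = {
    println(col("`(a|b)`", quotedRegexEnabled = true)) // List(a, b)
    println(col("a", quotedRegexEnabled = false))      // List(a)
  }
}
```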
@viirya regarding
Yes, we still have hive.support.quoted.identifiers; only when spark.sql.parser.quotedRegexColumnNames = true,
@gatorsmile regarding
I added some tests for aggregation. But for GROUP BY, we cannot have a regex there: GROUP BY requires the fields to be orderable. As Hive states, and as our previous comments noted, we should only support regex in SELECT.
@janewangfb It is fine that we do not support it, but we still need to add test cases for these negative cases. Thanks!
@gatorsmile, regarding:
@gatorsmile, regarding
Yes, I have added test cases.
@gatorsmile regarding:
Test build #79476 has finished for PR 18023 at commit

Test build #79503 has finished for PR 18023 at commit
    case unresolved_attr @ UnresolvedAttribute(nameParts) =>
      ctx.fieldName.getStart.getText match {
        case escapedIdentifier(columnNameRegex)
          if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>

Sorry, we recently reverted a PR: in the parser, we are unable to use SQLConf.get. Could you please change SQLConf.get back to conf?

Updated.
    case escapedIdentifier(columnNameRegex)
      if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
      UnresolvedRegex(columnNameRegex, Some(unresolved_attr.name),
        SQLConf.get.caseSensitiveAnalysis)

The same here.

Rolled back to conf.
    ctx.getStart.getText match {
      case escapedIdentifier(columnNameRegex)
        if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
        UnresolvedRegex(columnNameRegex, None, SQLConf.get.caseSensitiveAnalysis)

The same here.

Rolled back to conf.
    UnresolvedAttribute.quoted(ctx.getText)
    ctx.getStart.getText match {
      case escapedIdentifier(columnNameRegex)
        if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>

The same here.

Rolled back to conf.
    private[sql] abstract class DataSourceTest extends QueryTest {

    protected def sqlTest(sqlString: String, expectedAnswer: Seq[Row]) {
    protected def sqlTest(sqlString: String, expectedAnswer: Seq[Row], enableRegex: String = "true") {

enableRegex: String = "true" -> enableRegex: Boolean = false. Could you change the type to Boolean, call .toString below, and set the default to false?

Updated.
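The reviewer's suggestion in isolation: take a Boolean with default false and call .toString only where the conf value (a string) is assembled. This is a hedged sketch with made-up helper names, not the actual DataSourceTest code:

```scala
object SqlTestSketch {
  // Illustrative stand-in for prefixing a query with the conf setting;
  // SQL conf values are stored as strings, hence the .toString.
  def withConf(sqlString: String, enableRegex: Boolean = false): String =
    s"SET spark.sql.parser.quotedRegexColumnNames=${enableRegex.toString}; " +
      sqlString

  def main(args: Array[String]): Unit = {
    println(withConf("SELECT `(a)?+.+` FROM t"))
    println(withConf("SELECT `(a)?+.+` FROM t", enableRegex = true))
  }
}
```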
    CaseWhen(branches, Option(ctx.elseExpression).map(expression))
    }

    private def canApplyRegex(ctx: ParserRuleContext): Boolean = withOrigin(ctx) {

Please add a comment above this function to explain why regex can be applied under NamedExpression only. Thanks!

Added.
Test build #79504 has finished for PR 18023 at commit
@gatorsmile regarding:
(I saw this comment in email but didn't find it in the PR.) I have reverted the unneeded changes.
Test build #79535 has finished for PR 18023 at commit
The last comment is about
LGTM
Thanks! Merging to master.
Sure, I could do a follow-up PR to resolve DataFrameNaFunctions.fill. Thanks for reviewing this PR.
What changes were proposed in this pull request?
Hive interprets regular expressions such as `(a)?+.+` in the query specification. This PR enables Spark to support this feature when hive.support.quoted.identifiers is set to true.

How was this patch tested?

    scala> hc.sql("SELECT `(a|b)?+.+` FROM test1").collect.foreach(println)
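The possessive pattern in the example carries the Hive meaning "all columns except a and b": (a|b)?+ consumes a leading a or b without backtracking, so those exact names cannot also satisfy the trailing .+. A plain-Scala check of that matching behavior (the column names here are illustrative):

```scala
object RegexColumnSketch {
  def main(args: Array[String]): Unit = {
    val columns = Seq("a", "b", "ab", "key", "value")
    // (a|b)?+ is possessive: once it consumes "a" or "b" it never gives the
    // character back, so the single-letter columns fail the final .+.
    val selected = columns.filter(_.matches("(a|b)?+.+"))
    println(selected) // List(ab, key, value)
  }
}
```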