[SPARK-12139] [SQL] REGEX Column Specification #18023
```diff
@@ -1259,25 +1259,37 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
   }

   /**
-   * Create a dereference expression. The return type depends on the type of the parent, this can
-   * either be a [[UnresolvedAttribute]] (if the parent is an [[UnresolvedAttribute]]), or an
-   * [[UnresolvedExtractValue]] if the parent is some expression.
+   * Create a dereference expression. The return type depends on the type of the parent.
+   * If the parent is an [[UnresolvedAttribute]], it can be a [[UnresolvedAttribute]] or
+   * a [[UnresolvedRegex]] for regex quoted in ``; if the parent is some other expression,
+   * it can be [[UnresolvedExtractValue]].
    */
   override def visitDereference(ctx: DereferenceContext): Expression = withOrigin(ctx) {
     val attr = ctx.fieldName.getText
     expression(ctx.base) match {
-      case UnresolvedAttribute(nameParts) =>
-        UnresolvedAttribute(nameParts :+ attr)
+      case unresolved_attr @ UnresolvedAttribute(nameParts) =>
```
**Contributor:** How about …

**Contributor (Author):** This won't work. In your first "case", `ctx.fieldName.getStart.getText` is …

**Contributor:** Oh sorry, I made a mistake, …

**Contributor (Author):** Yes, `ctx.fieldName.getText` will trim the backquote.
```diff
+        ctx.fieldName.getStart.getText match {
+          case escapedIdentifier(columnNameRegex) if conf.supportQuotedRegexColumnName =>
+            UnresolvedRegex(columnNameRegex, Some(unresolved_attr.name), conf.caseSensitiveAnalysis)
+          case _ =>
+            UnresolvedAttribute(nameParts :+ attr)
+        }
       case e =>
         UnresolvedExtractValue(e, Literal(attr))
     }
   }
```
|
|
```diff

   /**
-   * Create an [[UnresolvedAttribute]] expression.
+   * Create an [[UnresolvedAttribute]] expression or a [[UnresolvedRegex]] if it is a regex
```
**Contributor:** What if we always create …

**Contributor (Author):** We should only create `UnresolvedRegex` when necessary.

**Contributor:** There seems to be no problem if we always go with the …

**Contributor (Author):** The code complexity will be similar, because if the column is backtick-quoted, we need to extract the pattern; …

**Contributor:** I'm not talking about algorithmic complexity; I'm saying that we can simplify the logic by avoiding detecting the regex string.

**Contributor (Author):** @cloud-fan, the code path is shared by both `select a`, `select a.b`, ON clauses, etc. If it is `select a.b`, the table part also goes here, but later there is no project expansion. If it is an ON clause, the string is already stripped and is not a regex any more. Only with column names do we have the project expanding (similar to star). So we need the regex pattern match to know that this is only for columns. Do you have any suggestion? Currently Hive only supports SELECT column regex expansion, and this PR matches the Hive behavior.
```diff
+   * quoted in ``
    */
   override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
-    UnresolvedAttribute.quoted(ctx.getText)
+    ctx.getStart.getText match {
+      case escapedIdentifier(columnNameRegex) if conf.supportQuotedRegexColumnName =>
+        UnresolvedRegex(columnNameRegex, None, conf.caseSensitiveAnalysis)
+      case _ =>
+        UnresolvedAttribute.quoted(ctx.getText)
+    }
   }

   /**
```
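The gating above can be exercised outside the parser. A minimal standalone sketch (the `columnReference` helper and its string results are illustrative, not Spark API): only when the config flag is on and the text is backtick-quoted does the column become a regex reference.

```scala
// The pattern introduced in ParserUtils by this PR.
val escapedIdentifier = "`(.+)`".r

// Stand-in for visitColumnReference: strings model the expressions that
// would be built (UnresolvedRegex vs. UnresolvedAttribute).
def columnReference(text: String, supportQuotedRegex: Boolean): String =
  text match {
    case escapedIdentifier(regex) if supportQuotedRegex => s"UnresolvedRegex($regex)"
    case _ => s"UnresolvedAttribute($text)"
  }
```

With the flag off, `` `(a|b)` `` falls through to the plain-attribute case, matching the pre-PR behavior.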
```diff
@@ -177,6 +177,12 @@ object ParserUtils {
     sb.toString()
   }

+  /** the column name pattern in quoted regex without qualifier */
+  val escapedIdentifier = "`(.+)`".r
```
**Member:** Please add a comment for this.

**Contributor (Author):** Added.
```diff
+
+  /** the column name pattern in quoted regex with qualifier */
+  val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r
```
**Contributor:** These 2 seem hacky to me, we can always create …

**Contributor (Author):** When the config is on, we need to extract XYZ from …
```diff
+
   /** Some syntactic sugar which makes it easier to work with optional clauses for LogicalPlans. */
   implicit class EnhancedLogicalPlan(val plan: LogicalPlan) extends AnyVal {
     /**
```
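The two extractors can be tried standalone; this sketch simply copies the patterns from the diff above and shows what each capture group yields.

```scala
// Copied from the ParserUtils diff above.
val escapedIdentifier = "`(.+)`".r
val qualifiedEscapedIdentifier = ("(.+)" + """.""" + "`(.+)`").r

// `...` without a qualifier: the single group captures the regex body.
val unqualified = "`(a|b)`" match { case escapedIdentifier(r) => r }

// t.`...`: the first group captures the qualifier, the second the regex body.
val (qualifier, body) = "t.`(a|b)`" match {
  case qualifiedEscapedIdentifier(q, r) => (q, r)
}
```

Note that the separator `"""."""` is the regex any-character, not an escaped literal dot; with the greedy `(.+)` in front, the match still splits at the last dot before the backtick for well-formed input.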
```diff
@@ -847,6 +847,12 @@ object SQLConf {
       .intConf
       .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

+  val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames")
+    .doc("When true, quoted Identifiers (using backticks) in SELECT statement are interpreted" +
```
**Member:** Not only SELECT statements. It can be almost any query.

**Contributor (Author):** We should only support SELECT. It does not make sense to do `select a from test where` … Also, for Hive (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select), it only supports the SELECT statements: …

**Member:** I agree. It only makes sense when we use it in a SELECT statement. However, our parser allows …

**Member:** Is it possible for users to use regex columns in an agg, such as …

**Contributor (Author):** @gatorsmile for quoted identifiers: if it is not in SELECT, it would not be affected.

**Contributor (Author):** @viirya I tried it out, e.g., `val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()` …

**Member:** @janewangfb You can do something like: … So I guess you can also do something like: …

**Contributor (Author):** Yes, `df.groupBy("a", "b").agg(df.col("` …
```diff
+      " as regular expressions.")
+    .booleanConf
+    .createWithDefault(false)

   object Deprecated {
     val MAPRED_REDUCE_TASKS = "mapred.reduce.tasks"
   }

@@ -1105,6 +1111,8 @@ class SQLConf extends Serializable with Logging {

   def starSchemaFTRatio: Double = getConf(STARSCHEMA_FACT_TABLE_RATIO)

+  def supportQuotedRegexColumnName: Boolean = getConf(SUPPORT_QUOTED_REGEX_COLUMN_NAME)

   /** ********************** SQLConf functionality methods ************ */

   /** Set Spark SQL configuration properties. */
```
```diff
@@ -41,7 +41,7 @@ import org.apache.spark.sql.catalyst.expressions._
 import org.apache.spark.sql.catalyst.expressions.aggregate._
 import org.apache.spark.sql.catalyst.json.{JacksonGenerator, JSONOptions}
 import org.apache.spark.sql.catalyst.optimizer.CombineUnions
-import org.apache.spark.sql.catalyst.parser.ParseException
+import org.apache.spark.sql.catalyst.parser.{ParseException, ParserUtils}
 import org.apache.spark.sql.catalyst.plans._
 import org.apache.spark.sql.catalyst.plans.logical._
 import org.apache.spark.sql.catalyst.plans.physical.{Partitioning, PartitioningCollection}

@@ -1188,8 +1188,29 @@ class Dataset[T] private[sql](
     case "*" =>
       Column(ResolvedStar(queryExecution.analyzed.output))
     case _ =>
-      val expr = resolve(colName)
-      Column(expr)
+      if (sqlContext.conf.supportQuotedRegexColumnName) {
+        colRegex(colName)
+      } else {
+        val expr = resolve(colName)
+        Column(expr)
+      }
   }

+  /**
+   * Selects column based on the column name specified as a regex and return it as [[Column]].
```
|
**Member:** Please add …

**Contributor (Author):** Added.
```diff
+   * @group untypedrel
+   * @since 2.3.0
+   */
+  def colRegex(colName: String): Column = {
```
**Member:** For example, we can do: … But we can't do the same thing with …

**Contributor (Author):** I have tested it out. It works for both cases, and I have added a test case in DatasetSuite.scala.
```diff
+    val caseSensitive = sparkSession.sessionState.conf.caseSensitiveAnalysis
+    colName match {
+      case ParserUtils.escapedIdentifier(columnNameRegex) =>
+        Column(UnresolvedRegex(columnNameRegex, None, caseSensitive))
+      case ParserUtils.qualifiedEscapedIdentifier(nameParts, columnNameRegex) =>
+        Column(UnresolvedRegex(columnNameRegex, Some(nameParts), caseSensitive))
+      case _ =>
+        Column(resolve(colName))
+    }
+  }

   /**
```
New test file (`@@ -0,0 +1,39 @@`):

```sql
CREATE OR REPLACE TEMPORARY VIEW testData AS SELECT * FROM VALUES
(1, "1", "11"), (2, "2", "22"), (3, "3", "33"), (4, "4", "44"), (5, "5", "55"), (6, "6", "66")
AS testData(key, value1, value2);

CREATE OR REPLACE TEMPORARY VIEW testData2 AS SELECT * FROM VALUES
(1, 1, 1, 2), (1, 2, 1, 2), (2, 1, 2, 3), (2, 2, 2, 3), (3, 1, 3, 4), (3, 2, 3, 4)
AS testData2(A, B, c, d);

-- AnalysisException
SELECT `(a)?+.+` FROM testData2 WHERE a = 1;
SELECT t.`(a)?+.+` FROM testData2 t WHERE a = 1;
SELECT `(a|b)` FROM testData2 WHERE a = 2;
SELECT `(a|b)?+.+` FROM testData2 WHERE a = 2;

set spark.sql.parser.quotedRegexColumnNames=true;

-- Regex columns
SELECT `(a)?+.+` FROM testData2 WHERE a = 1;
SELECT `(A)?+.+` FROM testData2 WHERE a = 1;
SELECT t.`(a)?+.+` FROM testData2 t WHERE a = 1;
SELECT t.`(A)?+.+` FROM testData2 t WHERE a = 1;
SELECT `(a|B)` FROM testData2 WHERE a = 2;
SELECT `(A|b)` FROM testData2 WHERE a = 2;
SELECT `(a|B)?+.+` FROM testData2 WHERE a = 2;
SELECT `(A|b)?+.+` FROM testData2 WHERE a = 2;
SELECT `(e|f)` FROM testData2;
SELECT t.`(e|f)` FROM testData2 t;
SELECT p.`(KEY)?+.+`, b, testdata2.`(b)?+.+` FROM testData p join testData2 ON p.key = testData2.a WHERE key < 3;
SELECT p.`(key)?+.+`, b, testdata2.`(b)?+.+` FROM testData p join testData2 ON p.key = testData2.a WHERE key < 3;

set spark.sql.caseSensitive=true;

CREATE OR REPLACE TEMPORARY VIEW testdata3 AS SELECT * FROM VALUES
(0, 1), (1, 2), (2, 3), (3, 4)
AS testdata3(a, b);

-- Regex columns
SELECT `(A)?+.+` FROM testdata3;
SELECT `(a)?+.+` FROM testdata3;
```
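The `(a)?+.+` idiom used throughout these tests relies on Java's possessive quantifier `?+`. A standalone sketch (plain Scala, no Spark needed) of why it reads as "every column except a":

```scala
// (a)?+ possessively consumes a leading "a" and never gives it back, so for
// the exact name "a" the mandatory .+ has nothing left to match and the
// whole pattern fails; every other name still matches.
val kept = Seq("a", "b", "c", "ab").filter(_.matches("(a)?+.+"))
// kept == Seq("b", "c", "ab")
```

"ab" still matches because after the group consumes "a" there is a "b" left for `.+`.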
**Reviewer:** Please use a guard, e.g. `case unresolved_attr @ UnresolvedAttribute(nameParts) if conf.supportQuotedIdentifiers =>`. That makes the logic down the line much simpler.

**Author:** It is more concise to put the `if` inside the `case unresolved_attr @ UnresolvedAttribute(nameParts)`. If we use a guard, we still need to handle the case when `conf.supportQuotedIdentifiers` is false.