Skip to content
Closed
Show file tree
Hide file tree
Changes from 5 commits
Commits
Show all changes
42 commits
Select commit Hold shift + click to select a range
af55afd
Fix SPARK-12139: REGEX Column Specification for Hive Queries
janewangfb May 18, 2017
6f9bdb0
Fix SPARK-12139: REGEX Column Specification for Hive Queries
janewangfb May 18, 2017
43beb07
Merge branch 'support_select_regex' of https://github.com/janewangfb/…
janewangfb May 18, 2017
7699e87
add unittests for DataSet.
janewangfb May 18, 2017
6e37517
Address hvanhovell's comments
janewangfb May 19, 2017
bee07cd
Address gatorsmile's comments
janewangfb May 19, 2017
d5e450a
address gatorsmile's comment
janewangfb May 19, 2017
979bfb6
add the gold file
janewangfb May 19, 2017
612bedf
address gatorsmile's comment
janewangfb May 19, 2017
48c54aa
address gatorsmile's comment
janewangfb May 19, 2017
0284d01
address cloud-fan's commentS
janewangfb May 22, 2017
129243a
Merge branch 'master' into support_select_regex
janewangfb May 23, 2017
5df5494
Merge branch 'master' into support_select_regex
janewangfb May 26, 2017
779724d
Address cloud-fan's comments
janewangfb May 26, 2017
a27023c
fix typo
janewangfb May 27, 2017
e8d4054
Merge branch 'master' into support_select_regex
janewangfb May 27, 2017
201f4d6
Merge branch 'master' into support_select_regex
janewangfb May 30, 2017
a0e3773
roll back code
janewangfb May 30, 2017
6b091dc
merge master
janewangfb Jun 26, 2017
537b3bc
Merge branch 'master' into support_select_regex
janewangfb Jun 26, 2017
da60368
add gatorsmile's comments
janewangfb Jun 26, 2017
9c582eb
Merge branch 'master' into support_select_regex
janewangfb Jun 28, 2017
79e58f0
address gatorsmile's comments
janewangfb Jun 28, 2017
616b726
made regex case insensitive
janewangfb Jun 28, 2017
f98207b
merge master
janewangfb Jun 28, 2017
321211d
fix regex case insensitive match
janewangfb Jun 28, 2017
4e36ed9
Address gatorsmile's comments
janewangfb Jun 30, 2017
04b62c6
Merge branch 'master' into support_select_regex
janewangfb Jun 30, 2017
2d0dd1c
Merge branch 'master' into support_select_regex
janewangfb Jun 30, 2017
448c3e2
address gatorsmile's comment
janewangfb Jun 30, 2017
2ef2c14
Merge branch 'master' into support_select_regex
janewangfb Jul 5, 2017
d65c462
address gatorsmile's comments and fixed his failed testcases
janewangfb Jul 5, 2017
65e5eec
Merge branch 'master' into support_select_regex
janewangfb Jul 5, 2017
65886cd
fix build failure
janewangfb Jul 5, 2017
ca89a4a
Merge branch 'master' into support_select_regex
janewangfb Jul 7, 2017
d3eed1a
Address gatorsmile and viirya's comments
janewangfb Jul 7, 2017
956b849
address gatorsmile's comments
janewangfb Jul 10, 2017
8adad7c
Merge branch 'master' into support_select_regex
janewangfb Jul 10, 2017
56e2b83
merge master and resolve conflicts
janewangfb Jul 11, 2017
d613ff9
fix build failure
janewangfb Jul 11, 2017
f5104e4
Merge branch 'master' into support_select_regex
janewangfb Jul 11, 2017
a5f9c44
address gaatorsmile's comments
janewangfb Jul 11, 2017
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -83,6 +83,28 @@ case class UnresolvedTableValuedFunction(
override lazy val resolved = false
}

/**
* Represents all of the input attributes to a given relational operator, for example in
* "SELECT `(id)?+.+` FROM ...".
*
* @param table an optional table that should be the target of the expansion. If omitted all
* tables' columns are produced.
*/
case class UnresolvedRegex(regexPattern: String, table: Option[String])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we move it below UnresolvedStar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

moved

extends Star with Unevaluable {
override def expand(input: LogicalPlan, resolver: Resolver): Seq[NamedExpression] = {
table match {
// If there is no table specified, use all input attributes that match expr
case None => input.output.filter(_.name.matches(regexPattern))
// If there is a table, pick out attributes that are part of this table that match expr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we consider "struct expansion" like what we did in UnresolvedStar?

Copy link
Contributor Author

@janewangfb janewangfb May 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the table argument of UnresolvedStart

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

* @param target an optional name that should be the target of the expansion.  If omitted all
  *              targets' columns are produced. This can either be a table name or struct name. This
  *              is a list of identifiers that is the path of the expansion.

shall we support record.`(id)?+.+` ?

Copy link
Contributor Author

@janewangfb janewangfb May 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for this diff, we support column only. We could consider it in the future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does hive support it?

Copy link
Contributor Author

@janewangfb janewangfb May 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hive supper regex column specification, see https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select.

Hive does not support record.(id)?+.+. I have tried out:

describe jane2;
Starting task [DDLTask{id=Stage-0, name=DDL}] in serial mode
Succeeded task DDLTask{id=Stage-0, name=DDL}
OK

col_name data_type comment

a int None
b struct<c:int,d:string> None

select b.(c)?+.+ from jane2;
Command ran:
1: select b.(c)?+.+ from jane2

FAILED: RuntimeException cannot find field (c)?+.+(lowercase form: (c)?+.+) in [c, d]
Caused by: RuntimeException: cannot find field (c)?+.+(lowercase form: (c)?+.+) in [c, d]

case Some(t) => input.output.filter(_.qualifier.exists(resolver(_, t)))
.filter(_.name.matches(regexPattern))
}
}

override def toString: String = table.map(_ + ".").getOrElse("") + regexPattern
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: table.map(_ + "." + regexPattern).getOrElse(regexPattern)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

}

/**
* Holds the name of an attribute that has yet to be resolved.
*/
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -1230,25 +1230,37 @@ class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
}

/**
* Create a dereference expression. The return type depends on the type of the parent, this can
* either be a [[UnresolvedAttribute]] (if the parent is an [[UnresolvedAttribute]]), or an
* [[UnresolvedExtractValue]] if the parent is some expression.
* Create a dereference expression. The return type depends on the type of the parent.
* If the parent is an [[UnresolvedAttribute]], it can be a [[UnresolvedAttribute]] or
* a [[UnresolvedRegex]] for regex quoted in ``; if the parent is some other expression,
* it can be [[UnresolvedExtractValue]].
*/
override def visitDereference(ctx: DereferenceContext): Expression = withOrigin(ctx) {
val attr = ctx.fieldName.getText
expression(ctx.base) match {
case UnresolvedAttribute(nameParts) =>
UnresolvedAttribute(nameParts :+ attr)
case unresolved_attr @ UnresolvedAttribute(nameParts) =>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use a guard, e.g.: case unresolved_attr @ UnresolvedAttribute(nameParts) if conf.supportQuotedIdentifiers => . That makes the logic down the line much simpler.

Copy link
Contributor Author

@janewangfb janewangfb May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is concise to put the if inside the case unresolved_attr @ UnresolvedAttribute(nameParts).

if we use guard, we still need to handle the case when the conf.supportQuotedIdentifiers is false.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about

case u @ UnresolvedAttribute(nameParts) if nameParts.length == 1 && conf.supportQuotedRegexColumnName =>
  UnresolvedRegex(ctx.fieldName.getStart.getText, Some(nameParts.head))

// If there are more dereferences, turn `UnresolvedRegex` back to `UnresolvedAttribute`
case UnresolvedRegex(regex, table) =>
  UnresolvedAttribute(table.toSeq + regex)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this wont work. In your first "case", ctx.fieldName.getStart.getText is XYZ, nameparts is XYZ. and the table part should come from ctx.base.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh sorry I made a mistake, ctx.fieldName.getText will trim the backquote?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, ctx.fieldName.getText will trim the backquote

matchEscapedIdentifier(ctx.fieldName.getStart.getText) match {
case Some(i) if conf.supportQuotedIdentifiers =>
UnresolvedRegex(i, Some(unresolved_attr.name))
case _ =>
UnresolvedAttribute(nameParts :+ attr)
}
case e =>
UnresolvedExtractValue(e, Literal(attr))
}
}

/**
* Create an [[UnresolvedAttribute]] expression.
* Create an [[UnresolvedAttribute]] expression or a [[UnresolvedRegex]] if it is a regex
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if we always create UnresolvedRegex?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should only create UnresolvedRegex when necessary.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there seems no problem if we always go with the UnresolvedRegex, then we can simplify the code and remove the logic to detect regex string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the code complexity will be similar, because if the column is ``, we need to extract the pattern;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not talking about algorithm complexity, but saying that we can simplify the logic by avoiding detecting the regex string.

Copy link
Contributor Author

@janewangfb janewangfb May 30, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cloud-fan, the code path is shared by by both select a, select a.b and on cause etc. If it is select a.b, the table part also go here, but later there is no project expand. If it is on cause, the the string is already striped, not regex any more. Only with column names, we will have the project expanding (similar to start). So, we will need the regex pattern match to know that this is only for columns.

Do you have any suggestion? Currently Hive only supports select column regex expansion. and this PR matches the hive behavior.

* quoted in ``
*/
override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
UnresolvedAttribute.quoted(ctx.getText)
matchEscapedIdentifier(ctx.getStart.getText) match {
case Some(i) if conf.supportQuotedIdentifiers =>
UnresolvedRegex(i, None)
case _ =>
UnresolvedAttribute.quoted(ctx.getText)
}
}

/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -177,6 +177,18 @@ object ParserUtils {
sb.toString()
}

val escapedIdentifier = "`(.+)`".r
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add a comment for this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

added.


/**
* Return the substring extracted using regex
*/
def matchEscapedIdentifier(b: String): Option[String] = {
b match {
case escapedIdentifier(i) => Some(i)
case _ => None
}
}

/** Some syntactic sugar which makes it easier to work with optional clauses for LogicalPlans. */
implicit class EnhancedLogicalPlan(val plan: LogicalPlan) extends AnyVal {
/**
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -795,6 +795,12 @@ object SQLConf {
.intConf
.createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")
Copy link
Member

@gatorsmile gatorsmile May 19, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming it to spark.sql.parser.quotedRegexColumnNames?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed.

.internal()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rxin @hvanhovell @cloud-fan Should we keep it internal?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be public. I didn't realize that that I put it under internal.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be public

.doc("When true, identifiers specified by regex patterns will be expanded.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We only do it for the column names, right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It must be quoted. Thus, we also need to mention it in the description.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. this only applies to column names. updated the doc.

.booleanConf
.createWithDefault(false)

object Deprecated {
val MAPRED_REDUCE_TASKS = "mapred.reduce.tasks"
}
Expand Down Expand Up @@ -1051,6 +1057,8 @@ class SQLConf extends Serializable with Logging {

def starSchemaFTRatio: Double = getConf(STARSCHEMA_FACT_TABLE_RATIO)

def supportQuotedIdentifiers: Boolean = getConf(SUPPORT_QUOTED_IDENTIFIERS)

/** ********************** SQLConf functionality methods ************ */

/** Set Spark SQL configuration properties. */
Expand Down
34 changes: 34 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/DatasetSuite.scala
Original file line number Diff line number Diff line change
Expand Up @@ -244,6 +244,40 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
("a", ClassData("a", 1)), ("b", ClassData("b", 2)), ("c", ClassData("c", 3)))
}

test("select 3, regex") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-> test("REGEX column specification")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDF()
intercept[AnalysisException] {
ds.select(expr("`(_1)?+.+`").as[Int])
}

intercept[AnalysisException] {
ds.select(expr("`(_1|_2)`").as[Int])
}

withSQLConf(SQLConf.SUPPORT_QUOTED_IDENTIFIERS.key -> "true") {
checkDataset(
ds.select(expr("`(_1)?+.+`").as[Int]),
1, 2, 3)
val m = ds.select(expr("`(_1|_2)`"))

checkDataset(
ds.select(expr("`(_1|_2)`"))
.select(expr("named_struct('a', _1, 'b', _2)").as[ClassData]),
ClassData("a", 1), ClassData("b", 2), ClassData("c", 3))

checkDataset(
ds.alias("g")
.select(expr("g.`(_1)?+.+`").as[Int]),
1, 2, 3)

checkDataset(
ds.alias("g")
.select(expr("g.`(_1|_2)`"))
.select(expr("named_struct('a', _1, 'b', _2)").as[ClassData]),
ClassData("a", 1), ClassData("b", 2), ClassData("c", 3))
}
}

test("filter") {
val ds = Seq(("a", 1), ("b", 2), ("c", 3)).toDS()
checkDataset(
Expand Down
88 changes: 88 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
Original file line number Diff line number Diff line change
Expand Up @@ -2624,4 +2624,92 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
val e = intercept[AnalysisException](sql("SELECT nvl(1, 2, 3)"))
assert(e.message.contains("Invalid number of arguments"))
}

test("SPARK-12139: REGEX Column Specification for Hive Queries") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you create a file in https://github.com/apache/spark/tree/master/sql/core/src/test/resources/sql-tests/inputs? Now, all the new SQL test cases need to be moved there.

You can run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite" to generate the result files. Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes let's use those rather than adding more files to SQLQuerySUite. I'd love to get rid of SQLQuerySuite ....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. moved the test to sql/core/src/test/resources/sql-tests/inputs

// hive.support.quoted.identifiers is turned off by default
checkAnswer(
sql(
"""
|SELECT b
|FROM testData2
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)

checkAnswer(
sql(
"""
|SELECT t.b
|FROM testData2 t
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The above two test queries are not needed in the new suite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed. I was trying to make sure that the existing behaviors are not broken.


intercept[AnalysisException] {
sql(
"""
|SELECT `(a)?+.+`
|FROM testData2
|WHERE a = 1
""".stripMargin)
}

intercept[AnalysisException] {
sql(
"""
|SELECT t.`(a)?+.+`
|FROM testData2 t
|WHERE a = 1
""".stripMargin)
}

// now, turn on hive.support.quoted.identifiers
withSQLConf(SQLConf.SUPPORT_QUOTED_IDENTIFIERS.key -> "true") {
checkAnswer(
sql(
"""
|SELECT b
|FROM testData2
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)

checkAnswer(
sql(
"""
|SELECT t.b
|FROM testData2 t
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)

checkAnswer(
sql(
"""
|SELECT `(a)?+.+`
|FROM testData2
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)

checkAnswer(
sql(
"""
|SELECT t.`(a)?+.+`
|FROM testData2 t
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)

checkAnswer(
sql(
"""
|SELECT p.`(key)?+.+`, b, testdata2.`(b)?+.+`
|FROM testData p join testData2
|ON p.key = testData2.a
|WHERE key < 3
""".stripMargin),
Row("1", 1, 1) :: Row("1", 2, 1) :: Row("2", 1, 2) :: Row("2", 2, 2) :: Nil)
}
}
}