[SPARK-12139] [SQL] REGEX Column Specification #18023

janewangfb · 2017-05-18T00:28:22Z

What changes were proposed in this pull request?

Hive interprets regular expressions, e.g., (a)?+.+, in query specifications. This PR enables Spark to support this feature when hive.support.quoted.identifiers is set to true.

How was this patch tested?

Added unit tests in SQLQuerySuite.scala.
Ran spark-shell and tested the originally failing query:
scala> hc.sql("SELECT (a|b)?+.+ from test1").collect.foreach(println)
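To make the pattern semantics concrete, here is a minimal sketch, assuming column names are matched in full against a Java regular expression (as String.matches does). The column list and the object name are hypothetical. Because ?+ is a possessive quantifier, (a|b)?+.+ cannot backtrack after consuming a or b, so it selects every column except those named exactly a or b.

```scala
object RegexColumnDemo {
  // Hypothetical column names, for illustration only.
  val columns: Seq[String] = Seq("a", "b", "c", "ab", "key")

  // Keep the columns whose whole name matches the pattern.
  def select(pattern: String): Seq[String] = columns.filter(_.matches(pattern))

  def main(args: Array[String]): Unit = {
    println(select("(a|b)?+.+")) // List(c, ab, key)
    println(select("(a)?+.+"))   // List(b, c, ab, key)
  }
}
```

Note that ab is still selected: the possessive group consumes only the leading a, and .+ matches the remaining b.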

gatorsmile · 2017-05-18T18:48:09Z

ok to test

gatorsmile · 2017-05-18T18:54:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+  def enableHiveSupportQuotedIdentifiers() : Boolean = {
+    SparkEnv.get != null &&
+      SparkEnv.get.conf != null &&
+      SparkEnv.get.conf.getBoolean("hive.support.quoted.identifiers", false)


Spark SQL always supports quoted identifiers. However, the missing part is the REGEX Column Specification. How about adding such a conf to SQLConf?

Added to SQLConf

SparkQA · 2017-05-18T18:55:17Z

Test build #77062 has finished for PR 18023 at commit af55afd.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UnresolvedRegex(expr: String, table: Option[String]) extends Star with Unevaluable

gatorsmile · 2017-05-18T18:57:29Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+          val escapedIdentifier = "`(.+)`".r
+          val ret = Option(ctx.fieldName.getStart).map(_.getText match {
+            case r@escapedIdentifier(i) =>
+              UnresolvedRegex(i, Some(unresolved_attr.name))


How about no change in the parser?

Is it possible to simply resolve it in ResolveReferences?

BTW, we also need to handle the same issue in the Dataset APIs.

Before we get to ResolveReferences, we need to determine that the regex is a special case, just like Star. ResolveReferences then resolves based on whether the expression type is UnresolvedRegex, Star, etc.

The Dataset API goes through the same path. I have added a unit test for regex in DatasetSuite.scala.

gatorsmile · 2017-05-18T18:58:13Z

Please update the PR title to [SPARK-12139] [SQL] REGEX Column Specification

SparkQA · 2017-05-18T19:39:56Z

Test build #77063 has finished for PR 18023 at commit 43beb07.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2017-05-18T22:15:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

+      case Some(t) => input.output.filter(_.qualifier.filter(resolver(_, t)).nonEmpty)
+        .filter(_.name.matches(expr))
+    }
+


An Attribute is always a NamedExpression, why do we need this?

You are right. We don't need it anymore. Removed.

hvanhovell · 2017-05-18T22:17:35Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

+ *
+ * @param table an optional table that should be the target of the expansion.  If omitted all
+ *              tables' columns are produced.
+ */


expr is the pattern right? Maybe we should give it a better name.

renamed to regexPattern

hvanhovell · 2017-05-18T22:18:02Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/unresolved.scala

+    val expandedAttributes: Seq[Attribute] = table match {
+      // If there is no table specified, use all input attributes that match expr
+      case None => input.output.filter(_.name.matches(expr))
+      // If there is a table, pick out attributes that are part of this table that match expr


input.output.filter(_.qualifier.exists(resolver(_, t))) is a bit more concise.
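The equivalence behind this suggestion can be checked with a small sketch. Attr, resolver, and the sample attributes below are hypothetical stand-ins, not Catalyst classes; the point is only that Option.exists(p) gives the same result as Option.filter(p).nonEmpty.

```scala
object QualifierFilterDemo {
  // Hypothetical stand-in for an attribute: a name plus an optional table qualifier.
  final case class Attr(name: String, qualifier: Option[String])

  // Assumed case-insensitive name resolver, for illustration only.
  val resolver: (String, String) => Boolean = _.equalsIgnoreCase(_)

  val output: Seq[Attr] = Seq(
    Attr("a", Some("t1")),
    Attr("b", Some("T1")),
    Attr("c", Some("t2")),
    Attr("d", None))

  // Original form from the diff: filter the Option, then test for non-emptiness.
  def verbose(t: String): Seq[Attr] =
    output.filter(_.qualifier.filter(resolver(_, t)).nonEmpty)

  // Suggested form: Option.exists expresses the same check directly.
  def concise(t: String): Seq[Attr] =
    output.filter(_.qualifier.exists(resolver(_, t)))
}
```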

hvanhovell · 2017-05-18T22:21:33Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

-      case UnresolvedAttribute(nameParts) =>
+      case unresolved_attr @ UnresolvedAttribute(nameParts) =>
+        if (conf.supportQuotedIdentifiers) {
+          val escapedIdentifier = "`(.+)`".r


We don't need to compile the same regex over and over. Can you move this to the ParserUtils...

I am also wondering if we shouldn't do the match in the parser itself.

Add API in ParserUtils.

I think the parser can still see the backticks. After that, the backticks are stripped off.

hvanhovell · 2017-05-18T22:23:26Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

    val attr = ctx.fieldName.getText
    expression(ctx.base) match {
-      case UnresolvedAttribute(nameParts) =>
+      case unresolved_attr @ UnresolvedAttribute(nameParts) =>


Please use a guard, e.g.: case unresolved_attr @ UnresolvedAttribute(nameParts) if conf.supportQuotedIdentifiers => . That makes the logic down the line much simpler.

It is more concise to put the if inside the case unresolved_attr @ UnresolvedAttribute(nameParts).

If we use a guard, we still need to handle the case when conf.supportQuotedIdentifiers is false.
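As a rough illustration of the guard style under discussion, the sketch below uses hypothetical types (Parsed, RegexColumn, PlainColumn) and a hypothetical regexEnabled flag in place of the real conf. With a guard, a disabled flag falls through to the default case, so the non-regex path is written only once.

```scala
object GuardDemo {
  // Hypothetical result types, not Catalyst expressions.
  sealed trait Parsed
  final case class RegexColumn(pattern: String) extends Parsed
  final case class PlainColumn(name: String) extends Parsed

  // Matches a backquoted identifier and captures the text between the backticks.
  private val escapedIdentifier = "`(.+)`".r

  def parse(token: String, regexEnabled: Boolean): Parsed = token match {
    // Taken only when the token is backquoted AND the flag is on.
    case escapedIdentifier(pattern) if regexEnabled => RegexColumn(pattern)
    // Everything else, including backquoted tokens with the flag off.
    case other => PlainColumn(other.stripPrefix("`").stripSuffix("`"))
  }
}
```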

hvanhovell · 2017-05-18T23:56:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

   */
  override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
+    if (conf.supportQuotedIdentifiers) {
+      val escapedIdentifier = "`(.+)`".r


We don't need to compile the same regex over and over. Can you move this to the ParserUtils...

Add API in ParserUtils.
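A minimal sketch of the compile-once idea, assuming a utility object along the lines of ParserUtils. The object name ParserUtilsSketch and the unquote helper are hypothetical; only the pattern itself comes from the diff. A val on an object is initialized once, so the regex is not recompiled on every visit.

```scala
import scala.util.matching.Regex

object ParserUtilsSketch {
  // Compiled once when the object is initialized, then reused by every caller.
  // Matches an identifier wrapped in backticks and captures the text between them.
  val escapedIdentifier: Regex = "`(.+)`".r

  // Returns the text between the backticks, or None if the token is not backquoted.
  def unquote(token: String): Option[String] = token match {
    case escapedIdentifier(inner) => Some(inner)
    case _ => None
  }
}
```

Scala's Regex extractor requires the whole input to match, so a token such as abc without backticks falls through to None.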

hvanhovell · 2017-05-18T23:58:06Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+      case unresolved_attr @ UnresolvedAttribute(nameParts) =>
+        if (conf.supportQuotedIdentifiers) {
+          val escapedIdentifier = "`(.+)`".r
+          val ret = Option(ctx.fieldName.getStart).map(_.getText match {


Using an option here does not add a thing.

removed the option

hvanhovell · 2017-05-18T23:58:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

  override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
+    if (conf.supportQuotedIdentifiers) {
+      val escapedIdentifier = "`(.+)`".r
+      val ret = Option(ctx.getStart).map(_.getText match {


Using an option here does not add a thing.

removed the option

SparkQA · 2017-05-19T01:28:41Z

Test build #77068 has finished for PR 18023 at commit 7699e87.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-05-19T04:31:33Z

Like what we did for * in Column.scala, we also need to handle the Dataset APIs. You can follow the way we handle star there.

df.select(df("(a|b)?+.+"))

gatorsmile · 2017-05-19T04:34:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .intConf
      .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

+  val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")


How about renaming it to spark.sql.parser.quotedRegexColumnNames?

gatorsmile · 2017-05-19T04:34:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala


+  val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")
+    .internal()
+    .doc("When true, identifiers specified by regex patterns will be expanded.")


We only do it for the column names, right?

It must be quoted. Thus, we also need to mention it in the description.

yes. this only applies to column names. updated the doc.

gatorsmile · 2017-05-19T04:41:40Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/ParserUtils.scala

    sb.toString()
  }

+  val escapedIdentifier = "`(.+)`".r


Please add a comment for this.

SparkQA · 2017-05-19T06:45:05Z

Test build #77081 has finished for PR 18023 at commit 6e37517.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class UnresolvedRegex(regexPattern: String, table: Option[String])

SparkQA · 2017-05-19T07:24:19Z

Test build #77086 has finished for PR 18023 at commit bee07cd.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-05-19T16:26:50Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

+  val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames")
+    .internal()


@rxin @hvanhovell @cloud-fan Should we keep it internal?

I think it should be public. I didn't realize that I put it under internal.

should be public

gatorsmile · 2017-05-19T16:32:57Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

    assert(e.message.contains("Invalid number of arguments"))
  }
+
+  test("SPARK-12139: REGEX Column Specification for Hive Queries") {


Could you create a file in https://github.com/apache/spark/tree/master/sql/core/src/test/resources/sql-tests/inputs? Now, all the new SQL test cases need to be moved there.

You can run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite" to generate the result files. Thanks!

Yes, let's use those rather than adding more files to SQLQuerySuite. I'd love to get rid of SQLQuerySuite ....

ok. moved the test to sql/core/src/test/resources/sql-tests/inputs

gatorsmile · 2017-05-19T16:33:58Z

sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala

+          |FROM testData2 t
+          |WHERE a = 1
+        """.stripMargin),
+      Row(1) :: Row(2) :: Nil)


The above two test queries are not needed in the new suite.

removed. I was trying to make sure that the existing behaviors are not broken.

SparkQA · 2017-05-19T20:50:48Z

Test build #77099 has finished for PR 18023 at commit 979bfb6.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-05-19T22:23:31Z

sql/core/src/test/resources/sql-tests/inputs/query_regex_column.sql

+
+-- Clean-up
+DROP VIEW IF EXISTS testData;
+DROP VIEW IF EXISTS testData2;


No need to drop the temp views.

gatorsmile · 2017-05-19T22:25:17Z

sql/core/src/test/resources/sql-tests/inputs/query_regex_column.sql

+
+CREATE OR REPLACE TEMPORARY VIEW testData2 AS SELECT * FROM VALUES
+(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)
+AS testData2(a, b);


Since the test cases are testing the regex pattern matching in column names, could you add more names and let the regex pattern match more columns?

sure. added two more columns

gatorsmile · 2017-05-19T22:28:45Z

sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala

  def col(colName: String): Column = colName match {
    case "*" =>
      Column(ResolvedStar(queryExecution.analyzed.output))
+    case ParserUtils.escapedIdentifier(i) if sqlContext.conf.supportQuotedRegexColumnName =>


Please avoid using i or j. Instead, use meaningful variable names.

janewangfb · 2017-07-07T23:13:22Z

@viirya regarding

Do we still care hive.support.quoted.identifiers? If no, please update the PR description accordingly.

Yes, we still have hive.support.quoted.identifiers. Only when spark.sql.parser.quotedRegexColumnNames = true will ... in SELECT statements be treated as a regex, as in Hive.

janewangfb · 2017-07-07T23:26:38Z

@gatorsmile regarding

Could you add some test cases in the other parts of the query? For example, group by clauses.

I added some tests for aggregation. But for GROUP BY, we cannot have regex there, because GROUP BY requires the fields to be orderable. As Hive states, and as noted in our previous comments, we should only support regex in SELECT.

gatorsmile · 2017-07-08T06:04:59Z

@janewangfb That is fine we do not support it, but we still need to add test cases for these negative cases. Thanks!

janewangfb · 2017-07-10T19:18:44Z

@gatorsmile, regarding:

@janewangfb If we turn on the flag spark.sql.parser.quotedRegexColumnNames by default, the following test cases failed. Could you do some investigations? Thanks!

org.apache.spark.sql.SQLQuerySuite
The struct type is not currently supported in the regex.
Some special characters have different meanings in regex.

org.apache.spark.sql.DataFrameSuite
Some special characters have different meanings in regex.

org.apache.spark.sql.SingleLevelAggregateHashMapSuite
org.apache.spark.sql.DataFrameAggregateSuite
org.apache.spark.sql.TwoLevelAggregateHashMapSuite
org.apache.spark.sql.TwoLevelAggregateHashMapWithVectorizedMapSuite
org.apache.spark.sql.DataFrameNaFunctionsSuite
org.apache.spark.sql.DataFrameStatSuite
These four failed for the same test case: regex is not allowed in an AS alias.

org.apache.spark.sql.SQLQueryTestSuite
This suite behaves the same whether the default value of spark.sql.parser.quotedRegexColumnNames is true or false.

org.apache.spark.sql.execution.datasources.json.JsonSuite
For map struct, regex should not be allowed in the A[B] part.

org.apache.spark.sql.DatasetSuite
Expected. Explicitly set spark.sql.parser.quotedRegexColumnNames to false for those tests.

org.apache.spark.sql.sources.TableScanSuite
Some special characters have different meanings in regex.

org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite
Regex is not allowed in WHERE.
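The "special characters" failures can be reproduced in isolation. The sketch below uses a hypothetical column name, a+b, and assumes full-string matching with Java regex semantics: once a quoted name is read as a pattern, metacharacters such as + stop matching themselves.

```scala
import java.util.regex.Pattern

object SpecialCharDemo {
  // Hypothetical column name containing a regex metacharacter.
  val columnName = "a+b"

  // Read as a regex, "a+b" means "one or more 'a' followed by 'b'",
  // so it does not match the literal name "a+b".
  val asRegex: Boolean = columnName.matches("a+b")

  // Quoting the pattern restores literal matching.
  val asLiteral: Boolean = columnName.matches(Pattern.quote("a+b"))
}
```

Here asRegex is false and asLiteral is true, which is why existing tests that backquote such names change behavior when the flag is on.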

janewangfb · 2017-07-10T19:19:55Z

@gatorsmile, regarding

That is fine we do not support it, but we still need to add test cases for these negative cases. Thanks!

Yes, I have added test cases.

janewangfb · 2017-07-10T19:31:18Z

@gatorsmile regarding:

Could you add some test cases in the other parts of the query? For example, group by clauses.
Yes, added.

SparkQA · 2017-07-10T21:47:32Z

Test build #79476 has finished for PR 18023 at commit 8adad7c.

This patch fails SparkR unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser
class SparkSqlParser(conf: SQLConf) extends AbstractSqlParser
class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf)
class VariableSubstitution(conf: SQLConf)

SparkQA · 2017-07-11T05:19:00Z

Test build #79503 has finished for PR 18023 at commit 56e2b83.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
class ArrowSerializer(FramedSerializer):

gatorsmile · 2017-07-11T06:42:23Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+      case unresolved_attr @ UnresolvedAttribute(nameParts) =>
+        ctx.fieldName.getStart.getText match {
+          case escapedIdentifier(columnNameRegex)
+            if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>


Sorry, we recently reverted a PR. In the parser, we are unable to use SQLConf.get.

Could you please change SQLConf.get back to conf?

gatorsmile · 2017-07-11T06:42:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+          case escapedIdentifier(columnNameRegex)
+            if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
+            UnresolvedRegex(columnNameRegex, Some(unresolved_attr.name),
+              SQLConf.get.caseSensitiveAnalysis)


The same here.

rolled back to conf.

gatorsmile · 2017-07-11T06:42:47Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

+    ctx.getStart.getText match {
+      case escapedIdentifier(columnNameRegex)
+        if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
+        UnresolvedRegex(columnNameRegex, None, SQLConf.get.caseSensitiveAnalysis)


The same here

rolled back to conf.

gatorsmile · 2017-07-11T06:42:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

-    UnresolvedAttribute.quoted(ctx.getText)
+    ctx.getStart.getText match {
+      case escapedIdentifier(columnNameRegex)
+        if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>


The same here

rolled back to conf

gatorsmile · 2017-07-11T06:46:37Z

sql/core/src/test/scala/org/apache/spark/sql/sources/DataSourceTest.scala

 private[sql] abstract class DataSourceTest extends QueryTest {

-  protected def sqlTest(sqlString: String, expectedAnswer: Seq[Row]) {
+  protected def sqlTest(sqlString: String, expectedAnswer: Seq[Row], enableRegex: String = "true") {


enableRegex: String = "true" -> enableRegex: Boolean = false

Could you change the type to Boolean and call .toString below and set the default to false?
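The requested signature might look like the following sketch. The conf map and the withConf helper are hypothetical stand-ins for the session configuration and the test utilities; only the Boolean parameter with a false default and the .toString conversion reflect the review comment.

```scala
import scala.collection.mutable

object SqlTestSketch {
  val RegexKey = "spark.sql.parser.quotedRegexColumnNames"

  // Hypothetical stand-in for the session configuration.
  val conf: mutable.Map[String, String] = mutable.Map(RegexKey -> "false")

  // Sets a key for the duration of `body`, then restores the previous value.
  def withConf[T](key: String, value: String)(body: => T): T = {
    val previous = conf.get(key)
    conf(key) = value
    try body
    finally {
      previous match {
        case Some(v) => conf(key) = v
        case None => conf -= key
      }
    }
  }

  // Boolean parameter defaulting to false; converted with .toString
  // only at the configuration boundary.
  def sqlTest(sqlString: String, enableRegex: Boolean = false): String =
    withConf(RegexKey, enableRegex.toString) {
      s"$sqlString [regex=${conf(RegexKey)}]"
    }
}
```

Keeping the parameter typed as Boolean means callers cannot pass an arbitrary string, and existing call sites that omit the argument keep the regex feature off.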

gatorsmile · 2017-07-11T06:58:19Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

    CaseWhen(branches, Option(ctx.elseExpression).map(expression))
  }

+  private def canApplyRegex(ctx: ParserRuleContext): Boolean = withOrigin(ctx) {


Please add a comment above this function to explain why regex can be applied under NamedExpression only. Thanks!
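One way to picture such a check is a walk up the parse tree looking for a named-expression ancestor. The sketch below is an assumption about the general approach, not the PR's code: Node and the kind strings are hypothetical, whereas the real parser works on ANTLR rule contexts. It reflects the restriction stated earlier in this thread that regex is supported only in SELECT, not in WHERE, GROUP BY, or an AS alias.

```scala
import scala.annotation.tailrec

object CanApplyRegexSketch {
  // Hypothetical parse-tree node with a link to its parent.
  final case class Node(kind: String, parent: Option[Node])

  // True if any ancestor of `node` is a named expression (a select-list item).
  @tailrec
  def underNamedExpression(node: Node): Boolean = node.parent match {
    case Some(p) if p.kind == "namedExpression" => true
    case Some(p) => underNamedExpression(p)
    case None => false
  }
}
```

For example, a column reference whose parent chain is namedExpression, then select, qualifies, while one whose chain is where, then select, does not.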

SparkQA · 2017-07-11T07:04:55Z

Test build #79504 has finished for PR 18023 at commit d613ff9.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

janewangfb · 2017-07-11T17:47:30Z

@gatorsmile regarding:

Could you revert back all the unneeded changes? (in JsonSuite.scala).

(I saw this comment in email but didn't find it in the PR.) I have reverted the unneeded changes.

SparkQA · 2017-07-11T20:13:08Z

Test build #79535 has finished for PR 18023 at commit a5f9c44.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-07-12T04:59:47Z

The last comment is about DataFrameNaFunctions.fill. It does not work when spark.sql.parser.quotedRegexColumnNames is on. Could you resolve that in the follow-up PR?

gatorsmile · 2017-07-12T04:59:57Z

LGTM

gatorsmile · 2017-07-12T05:01:05Z

Thanks! Merging to master.

janewangfb · 2017-07-12T05:05:19Z

@gatorsmile

Sure, I could have a follow-up PR to resolve DataFrameNaFunctions.fill.

thanks for reviewing this PR.

Fix SPARK-12139: REGEX Column Specification for Hive Queries

af55afd

gatorsmile reviewed May 18, 2017

janewangfb added 2 commits May 18, 2017 12:24

Fix SPARK-12139: REGEX Column Specification for Hive Queries

6f9bdb0

Merge branch 'support_select_regex' of https://github.com/janewangfb/…

43beb07

…spark into support_select_regex

janewangfb changed the title ~~Fix SPARK-12139: REGEX Column Specification for Hive Queries~~ [SPARK-12139] [SQL] REGEX Column Specification May 18, 2017

add unittests for DataSet.

7699e87

hvanhovell requested changes May 18, 2017

Address hvanhovell's comments

6e37517

gatorsmile reviewed May 19, 2017

Address gatorsmile's comments

bee07cd

gatorsmile reviewed May 19, 2017

janewangfb added 2 commits May 19, 2017 11:26

address gatorsmile's comment

d5e450a

add the gold file

979bfb6

gatorsmile reviewed May 19, 2017

Address gatorsmile and viirya's comments

d3eed1a

janewangfb added 2 commits July 10, 2017 12:29

address gatorsmile's comments

956b849

Merge branch 'master' into support_select_regex

8adad7c

merge master and resolve conflicts

56e2b83

fix build failure

d613ff9

gatorsmile reviewed Jul 11, 2017

Merge branch 'master' into support_select_regex

f5104e4

address gaatorsmile's comments

a5f9c44

asfgit closed this in 2cbfc97 Jul 12, 2017

cloud-fan mentioned this pull request Nov 20, 2023

[SPARK-43980][SQL] introducing select * except syntax #43843

Closed
