@janewangfb
Contributor

@janewangfb janewangfb commented May 18, 2017

What changes were proposed in this pull request?

Hive interprets regular expressions, e.g., (a)?+.+, in query specifications. This PR enables Spark to support this feature when hive.support.quoted.identifiers is set to true.
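
As a rough illustration of what such a pattern selects (plain Scala, no Spark; the column names and the helper selectColumns are made up for this example): with Java regex semantics, the possessive group in (a)?+.+ means the pattern matches every name except the exact name a.

```scala
object RegexColumnSketch {
  // Keep only the column names that fully match the regex,
  // in the spirit of the `name.matches(pattern)` filter discussed in this PR.
  def selectColumns(columns: Seq[String], pattern: String): Seq[String] =
    columns.filter(_.matches(pattern))

  def main(args: Array[String]): Unit = {
    val columns = Seq("a", "b", "c", "ab") // hypothetical column names
    println(selectColumns(columns, "(a)?+.+"))   // List(b, c, ab)
    println(selectColumns(columns, "(a|b)?+.+")) // List(c, ab)
  }
}
```

The possessive quantifier ?+ never gives back the character it consumed, so a name consisting only of the excluded column cannot satisfy the trailing .+ part.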

How was this patch tested?

  • Added unit tests in SQLQuerySuite.scala
  • Ran the originally failing query in spark-shell:
    scala> hc.sql("SELECT (a|b)?+.+ from test1").collect.foreach(println)

@gatorsmile
Member

ok to test

def enableHiveSupportQuotedIdentifiers() : Boolean = {
SparkEnv.get != null &&
SparkEnv.get.conf != null &&
SparkEnv.get.conf.getBoolean("hive.support.quoted.identifiers", false)
Member

Spark SQL always supports quoted identifiers. However, the missing part is the REGEX Column Specification. How about adding such a conf to SQLConf?

Contributor Author

Added to SQLConf

@SparkQA

SparkQA commented May 18, 2017

Test build #77062 has finished for PR 18023 at commit af55afd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRegex(expr: String, table: Option[String]) extends Star with Unevaluable

val escapedIdentifier = "`(.+)`".r
val ret = Option(ctx.fieldName.getStart).map(_.getText match {
case r@escapedIdentifier(i) =>
UnresolvedRegex(i, Some(unresolved_attr.name))
Member

How about no change in the parser?

Is it possible to simply resolve it in ResolveReferences?

BTW, we also need to handle the same issue in the Dataset APIs.

Contributor Author

Before we get to ResolveReferences, we need to identify the regex as a special case, just like Star. ResolveReferences then resolves based on the expression type (UnresolvedRegex, Star, etc.).

The Dataset API goes through the same path. I have added a unit test for regex to DatasetSuite.scala.

@gatorsmile
Member

Please update the PR title to [SPARK-12139] [SQL] REGEX Column Specification

@janewangfb janewangfb changed the title from "Fix SPARK-12139: REGEX Column Specification for Hive Queries" to "[SPARK-12139] [SQL] REGEX Column Specification" May 18, 2017
@SparkQA

SparkQA commented May 18, 2017

Test build #77063 has finished for PR 18023 at commit 43beb07.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case Some(t) => input.output.filter(_.qualifier.filter(resolver(_, t)).nonEmpty)
.filter(_.name.matches(expr))
}

Contributor

An Attribute is always a NamedExpression, why do we need this?

Contributor Author

You are right. We don't need it any more. Removed.

*
* @param table an optional table that should be the target of the expansion. If omitted all
* tables' columns are produced.
*/
Contributor

expr is the pattern, right? Maybe we should give it a better name.

Contributor Author

@janewangfb janewangfb May 19, 2017

renamed to regexPattern

val expandedAttributes: Seq[Attribute] = table match {
// If there is no table specified, use all input attributes that match expr
case None => input.output.filter(_.name.matches(expr))
// If there is a table, pick out attributes that are part of this table that match expr
Contributor

input.output.filter(_.qualifier.exists(resolver(_, t))) is a bit more concise.
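
A small sketch of why the two forms are interchangeable, using a hypothetical Attr case class and a case-insensitive resolver as stand-ins for the Catalyst types (neither is the real Spark API):

```scala
object QualifierFilterSketch {
  // Hypothetical stand-in for a Catalyst attribute: a name plus an optional table qualifier.
  final case class Attr(name: String, qualifier: Option[String])

  // Hypothetical resolver that compares identifiers case-insensitively.
  val resolver: (String, String) => Boolean = (a, b) => a.equalsIgnoreCase(b)

  // Original form: filter the Option, then check that something is left.
  def verbose(attrs: Seq[Attr], table: String): Seq[Attr] =
    attrs.filter(_.qualifier.filter(resolver(_, table)).nonEmpty)

  // Suggested form: Option.exists expresses the same check in one call.
  def concise(attrs: Seq[Attr], table: String): Seq[Attr] =
    attrs.filter(_.qualifier.exists(resolver(_, table)))

  def main(args: Array[String]): Unit = {
    val attrs = Seq(Attr("a", Some("t1")), Attr("b", Some("T1")), Attr("c", Some("t2")), Attr("d", None))
    println(verbose(attrs, "t1") == concise(attrs, "t1")) // true
    println(concise(attrs, "t1").map(_.name))             // List(a, b)
  }
}
```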

Contributor Author

updated.

case UnresolvedAttribute(nameParts) =>
case unresolved_attr @ UnresolvedAttribute(nameParts) =>
if (conf.supportQuotedIdentifiers) {
val escapedIdentifier = "`(.+)`".r
Contributor

We don't need to compile the same regex over and over. Can you move this to the ParserUtils...

I am also wondering if we shouldn't do the match in the parser itself.
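
A minimal sketch of the "compile once" idea, assuming a shared utility object (named ParserUtilsSketch here to make clear it is not the real ParserUtils; the unquote helper is hypothetical):

```scala
import scala.util.matching.Regex

object ParserUtilsSketch {
  // Compiled once when the object is initialized, then shared by every caller,
  // instead of re-running "`(.+)`".r on each visit of a parse-tree node.
  val escapedIdentifier: Regex = "`(.+)`".r

  // Hypothetical helper: returns the text between the backticks
  // when the whole token is a backtick-quoted identifier.
  def unquote(token: String): Option[String] = token match {
    case escapedIdentifier(inner) => Some(inner)
    case _                        => None
  }

  def main(args: Array[String]): Unit = {
    println(unquote("`(a)?+.+`")) // Some((a)?+.+)
    println(unquote("a"))         // None
  }
}
```

Used as an extractor in a pattern match, a Scala Regex must match the entire input, so an unquoted token falls through to the default case.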

Contributor Author

@janewangfb janewangfb May 19, 2017

Added an API in ParserUtils.

I think the match has to happen in the parser, where the backticks are still available. After that, the backticks are stripped off.

val attr = ctx.fieldName.getText
expression(ctx.base) match {
case UnresolvedAttribute(nameParts) =>
case unresolved_attr @ UnresolvedAttribute(nameParts) =>
Contributor

Please use a guard, e.g.: case unresolved_attr @ UnresolvedAttribute(nameParts) if conf.supportQuotedIdentifiers => . That makes the logic down the line much simpler.

Contributor Author

@janewangfb janewangfb May 19, 2017

It is more concise to put the if inside the case unresolved_attr @ UnresolvedAttribute(nameParts).

If we use a guard, we still need to handle the case when conf.supportQuotedIdentifiers is false.
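
For comparison, a sketch of what the guard-based variant could look like, with the disabled case handled by the next case clause. The types and the dereference function below are simplified, hypothetical stand-ins, not the actual AstBuilder code:

```scala
object GuardSketch {
  // Simplified, hypothetical stand-ins for the Catalyst expression classes.
  sealed trait Expr
  final case class UnresolvedAttribute(nameParts: Seq[String]) extends Expr
  final case class UnresolvedRegex(regexPattern: String, table: Option[String]) extends Expr

  private val escapedIdentifier = "`(.+)`".r

  // base is the already-parsed qualifier; fieldToken is the raw token text of the field.
  def dereference(base: Expr, fieldToken: String, regexEnabled: Boolean): Expr =
    (base, fieldToken) match {
      // The guard keeps the regex branch flat ...
      case (UnresolvedAttribute(parts), escapedIdentifier(regex)) if regexEnabled =>
        UnresolvedRegex(regex, Some(parts.mkString(".")))
      // ... and when the flag is off (or the token is unquoted), matching falls through to here.
      case (UnresolvedAttribute(parts), token) =>
        UnresolvedAttribute(parts :+ token.stripPrefix("`").stripSuffix("`"))
      // Simplification: any other base expression is returned unchanged in this sketch.
      case (other, _) => other
    }
}
```

With the flag off, a backtick-quoted field is treated as an ordinary quoted name, which is the situation the reply above says still has to be handled.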

*/
override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
if (conf.supportQuotedIdentifiers) {
val escapedIdentifier = "`(.+)`".r
Contributor

We don't need to compile the same regex over and over. Can you move this to the ParserUtils...

Contributor Author

Added an API in ParserUtils.

case unresolved_attr @ UnresolvedAttribute(nameParts) =>
if (conf.supportQuotedIdentifiers) {
val escapedIdentifier = "`(.+)`".r
val ret = Option(ctx.fieldName.getStart).map(_.getText match {
Contributor

Using an option here does not add a thing.

Contributor Author

removed the option

override def visitColumnReference(ctx: ColumnReferenceContext): Expression = withOrigin(ctx) {
if (conf.supportQuotedIdentifiers) {
val escapedIdentifier = "`(.+)`".r
val ret = Option(ctx.getStart).map(_.getText match {
Contributor

Using an option here does not add a thing.

Contributor Author

removed the option

@SparkQA

SparkQA commented May 19, 2017

Test build #77068 has finished for PR 18023 at commit 7699e87.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

Like what we did for * in Column.scala, we also need to handle the Dataset APIs. You can follow the way we handle star there.

df.select(df("(a|b)?+.+"))

.intConf
.createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")
Member

@gatorsmile gatorsmile May 19, 2017

How about renaming it to spark.sql.parser.quotedRegexColumnNames?

Contributor Author

renamed.


val SUPPORT_QUOTED_IDENTIFIERS = buildConf("spark.sql.support.quoted.identifiers")
.internal()
.doc("When true, identifiers specified by regex patterns will be expanded.")
Member

We only do it for the column names, right?

Member

It must be quoted. Thus, we also need to mention it in the description.

Contributor Author

yes. this only applies to column names. updated the doc.

sb.toString()
}

val escapedIdentifier = "`(.+)`".r
Member

Please add a comment for this.

Contributor Author

added.

@SparkQA

SparkQA commented May 19, 2017

Test build #77081 has finished for PR 18023 at commit 6e37517.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class UnresolvedRegex(regexPattern: String, table: Option[String])

@SparkQA

SparkQA commented May 19, 2017

Test build #77086 has finished for PR 18023 at commit bee07cd.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.createWithDefault(UnsafeExternalSorter.DEFAULT_NUM_ELEMENTS_FOR_SPILL_THRESHOLD.toInt)

val SUPPORT_QUOTED_REGEX_COLUMN_NAME = buildConf("spark.sql.parser.quotedRegexColumnNames")
.internal()
Member

@rxin @hvanhovell @cloud-fan Should we keep it internal?

Contributor Author

I think it should be public. I didn't realize that I had put it under internal.

Contributor

should be public

assert(e.message.contains("Invalid number of arguments"))
}

test("SPARK-12139: REGEX Column Specification for Hive Queries") {
Member

Could you create a file in https://github.com/apache/spark/tree/master/sql/core/src/test/resources/sql-tests/inputs? Now, all the new SQL test cases need to be moved there.

You can run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite" to generate the result files. Thanks!

Contributor

Yes, let's use those rather than adding more tests to SQLQuerySuite. I'd love to get rid of SQLQuerySuite...

Contributor Author

ok. moved the test to sql/core/src/test/resources/sql-tests/inputs

|FROM testData2 t
|WHERE a = 1
""".stripMargin),
Row(1) :: Row(2) :: Nil)
Member

The above two test queries are not needed in the new suite.

Contributor Author

removed. I was trying to make sure that the existing behaviors are not broken.

@SparkQA

SparkQA commented May 19, 2017

Test build #77099 has finished for PR 18023 at commit 979bfb6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


-- Clean-up
DROP VIEW IF EXISTS testData;
DROP VIEW IF EXISTS testData2;
Member

No need to drop the temp views.

Contributor Author

removed


CREATE OR REPLACE TEMPORARY VIEW testData2 AS SELECT * FROM VALUES
(1, 1), (1, 2), (2, 1), (2, 2), (3, 1), (3, 2)
AS testData2(a, b);
Member

Since the test cases are testing the regex pattern matching in column names, could you add more names and let the regex pattern match more columns?

Contributor Author

sure. added two more columns

def col(colName: String): Column = colName match {
case "*" =>
Column(ResolvedStar(queryExecution.analyzed.output))
case ParserUtils.escapedIdentifier(i) if sqlContext.conf.supportQuotedRegexColumnName =>
Member

Please avoid using i or j. Instead, use meaningful variable names.

Contributor Author

updated

@janewangfb
Contributor Author

@viirya regarding

Do we still care hive.support.quoted.identifiers? If no, please update the PR description accordingly.

Yes, we still have hive.support.quoted.identifiers. Only when spark.sql.parser.quotedRegexColumnNames = true will ... in SELECT statements be treated as a regex, as in Hive.

@janewangfb
Contributor Author

@gatorsmile regarding

Could you add some test cases in the other parts of the query? For example, group by clauses.

I added some tests for aggregation. But for GROUP BY, we cannot have regex there: GROUP BY requires the fields to be orderable. As Hive states, and per our previous comments, we should only support regex in SELECT.

@gatorsmile
Member

gatorsmile commented Jul 8, 2017

@janewangfb It is fine that we do not support it, but we still need to add test cases for these negative cases. Thanks!

@janewangfb
Contributor Author

@gatorsmile, regarding:

@janewangfb If we turn on the flag spark.sql.parser.quotedRegexColumnNames by default, the following test cases failed. Could you do some investigations? Thanks!

org.apache.spark.sql.SQLQuerySuite
The struct type is not currently supported in the regex.
Some special characters have a different meaning in regex.

org.apache.spark.sql.DataFrameSuite
Some special characters have a different meaning in regex.

org.apache.spark.sql.SingleLevelAggregateHashMapSuite
org.apache.spark.sql.DataFrameAggregateSuite
org.apache.spark.sql.TwoLevelAggregateHashMapSuite
org.apache.spark.sql.TwoLevelAggregateHashMapWithVectorizedMapSuite
org.apache.spark.sql.DataFrameNaFunctionsSuite
org.apache.spark.sql.DataFrameStatSuite
These suites failed for the same test case: regex is not allowed in an AS alias.

org.apache.spark.sql.SQLQueryTestSuite
This suite behaves the same whether the default value of spark.sql.parser.quotedRegexColumnNames is true or false.

org.apache.spark.sql.execution.datasources.json.JsonSuite
For map and struct access, regex should not be allowed in the A[B] part.

org.apache.spark.sql.DatasetSuite
Expected. Explicitly set spark.sql.parser.quotedRegexColumnNames to false for those tests.

org.apache.spark.sql.sources.TableScanSuite
Some special characters have a different meaning in regex.

org.apache.spark.sql.execution.datasources.parquet.ParquetFilterSuite
Regex is not allowed in WHERE.
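
To make the "special characters" failures above concrete, here is a small plain-Scala sketch. The column name a+b and the helper matchesAsRegex are invented for illustration; the point is only that a quoted name containing a regex metacharacter stops matching itself once it is interpreted as a pattern.

```scala
import java.util.regex.Pattern

object SpecialCharSketch {
  // Interprets the quoted text as a regex and tests it against a column name.
  def matchesAsRegex(columnName: String, quotedText: String): Boolean =
    columnName.matches(quotedText)

  def main(args: Array[String]): Unit = {
    val columnName = "a+b" // hypothetical column name containing a regex metacharacter

    // As a regex, a+b means "one or more a, then b", so it does not match the literal name.
    println(matchesAsRegex(columnName, "a+b"))                // false
    println(matchesAsRegex("aab", "a+b"))                     // true

    // Quoting the pattern restores the literal interpretation.
    println(matchesAsRegex(columnName, Pattern.quote("a+b"))) // true
  }
}
```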

@janewangfb
Contributor Author

@gatorsmile, regarding

That is fine we do not support it, but we still need to add test cases for these negative cases. Thanks!

Yes, I have added test cases.

@janewangfb
Contributor Author

@gatorsmile regarding:

Could you add some test cases in the other parts of the query? For example, group by clauses.
Yes, added.

@SparkQA

SparkQA commented Jul 10, 2017

Test build #79476 has finished for PR 18023 at commit 8adad7c.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
  • class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser
  • class SparkSqlParser(conf: SQLConf) extends AbstractSqlParser
  • class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf)
  • class VariableSubstitution(conf: SQLConf)

@SparkQA

SparkQA commented Jul 11, 2017

Test build #79503 has finished for PR 18023 at commit 56e2b83.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class ArrowSerializer(FramedSerializer):

case unresolved_attr @ UnresolvedAttribute(nameParts) =>
ctx.fieldName.getStart.getText match {
case escapedIdentifier(columnNameRegex)
if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
Member

Sorry, we recently reverted a PR. In the parser, we are unable to use SQLConf.get.

Could you please change SQLConf.get back to conf?

Contributor Author

updated

case escapedIdentifier(columnNameRegex)
if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
UnresolvedRegex(columnNameRegex, Some(unresolved_attr.name),
SQLConf.get.caseSensitiveAnalysis)
Member

The same here.

Contributor Author

rolled back to conf.

ctx.getStart.getText match {
case escapedIdentifier(columnNameRegex)
if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
UnresolvedRegex(columnNameRegex, None, SQLConf.get.caseSensitiveAnalysis)
Member

The same here

Contributor Author

rolled back to conf.

UnresolvedAttribute.quoted(ctx.getText)
ctx.getStart.getText match {
case escapedIdentifier(columnNameRegex)
if SQLConf.get.supportQuotedRegexColumnName && canApplyRegex(ctx) =>
Member

The same here

Contributor Author

rolled back to conf

private[sql] abstract class DataSourceTest extends QueryTest {

protected def sqlTest(sqlString: String, expectedAnswer: Seq[Row]) {
protected def sqlTest(sqlString: String, expectedAnswer: Seq[Row], enableRegex: String = "true") {
Member

enableRegex: String = "true" -> enableRegex: Boolean = false

Could you change the type to Boolean and call .toString below and set the default to false?
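
One way the suggested signature could look, sketched without Spark: a mutable map stands in for the session configuration, and the run callback stands in for executing the query. Both are assumptions made for this example, as is the currentSetting accessor.

```scala
import scala.collection.mutable

object SqlTestSketch {
  val RegexKey = "spark.sql.parser.quotedRegexColumnNames"

  // Hypothetical stand-in for a session configuration (string keys, string values).
  private val conf = mutable.Map.empty[String, String]

  def currentSetting: Option[String] = conf.get(RegexKey)

  // enableRegex is a Boolean defaulting to false; .toString converts it only
  // at the point where the string-valued configuration is written.
  def sqlTest(sqlString: String, enableRegex: Boolean = false)(run: String => Unit): Unit = {
    val previous = conf.get(RegexKey)
    conf(RegexKey) = enableRegex.toString
    try {
      run(sqlString)
    } finally {
      // Restore whatever was set before the test ran.
      previous match {
        case Some(value) => conf(RegexKey) = value
        case None        => conf -= RegexKey
      }
    }
  }
}
```

Keeping the parameter typed as Boolean means callers cannot pass an arbitrary string, and the default of false leaves existing tests on the non-regex behavior.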

Contributor Author

updated

CaseWhen(branches, Option(ctx.elseExpression).map(expression))
}

private def canApplyRegex(ctx: ParserRuleContext): Boolean = withOrigin(ctx) {
Member

Please add a comment above this function to explain why regex can be applied under NamedExpression only. Thanks!
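
A sketch of one plausible shape for such a check, assuming it walks up the parse tree looking for a named-expression ancestor. The Ctx classes below are hypothetical stand-ins for the ANTLR context classes, and the traversal itself is an assumption rather than the code under review:

```scala
object CanApplyRegexSketch {
  // Hypothetical, minimal stand-ins for ANTLR parse-tree contexts.
  class Ctx(val parent: Option[Ctx])
  class NamedExpressionCtx(p: Option[Ctx]) extends Ctx(p)
  class WhereCtx(p: Option[Ctx]) extends Ctx(p)

  // A regex column specification expands to several columns, which only makes
  // sense in a select list, so require a NamedExpressionCtx among the ancestors.
  @scala.annotation.tailrec
  def canApplyRegex(ctx: Ctx): Boolean = ctx.parent match {
    case Some(_: NamedExpressionCtx) => true
    case Some(ancestor)              => canApplyRegex(ancestor)
    case None                        => false
  }

  def main(args: Array[String]): Unit = {
    val query = new Ctx(None)
    val columnInSelect = new Ctx(Some(new NamedExpressionCtx(Some(query))))
    val columnInWhere = new Ctx(Some(new WhereCtx(Some(query))))
    println(canApplyRegex(columnInSelect)) // true
    println(canApplyRegex(columnInWhere))  // false
  }
}
```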

Contributor Author

added

@SparkQA

SparkQA commented Jul 11, 2017

Test build #79504 has finished for PR 18023 at commit d613ff9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@janewangfb
Contributor Author

@gatorsmile regarding:

Could you revert back all the unneeded changes? (in JsonSuite.scala).

(I saw this comment in email but didn't find it in the PR.) I have reverted the unneeded changes.

@SparkQA

SparkQA commented Jul 11, 2017

Test build #79535 has finished for PR 18023 at commit a5f9c44.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

The last comment is about DataFrameNaFunctions.fill. It does not work when spark.sql.parser.quotedRegexColumnNames is on. Could you resolve that in the follow-up PR?

@gatorsmile
Member

LGTM

@gatorsmile
Member

Thanks! Merging to master.

@asfgit asfgit closed this in 2cbfc97 Jul 12, 2017
@janewangfb
Contributor Author

@gatorsmile

Sure, I could have a follow-up PR to resolve DataFrameNaFunctions.fill.

thanks for reviewing this PR.
