
Conversation

Contributor

@huaxingao huaxingao commented Oct 21, 2018

What changes were proposed in this pull request?

Currently, the nested columns are not escaped correctly:

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a.b`

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`   // ambiguous

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a.b`    // ambiguous

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = ```a.b``.c`    // too verbose

It should be something like the following:

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = a.b

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = a.b

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = `a.b`.c

How was this patch tested?

Added tests.

Member

Hi, @huaxingao. Can we use the following one-line patch instead of the 12-line change?

- nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+ nameParts.map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n).mkString(".")

cc @cloud-fan and @gatorsmile and @dbtsai .
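For reference, a standalone sketch of what the suggested one-liner does (QuoteSketch and quoteNameParts are hypothetical names for illustration only; the real change lives inside UnresolvedAttribute):

```scala
// Hypothetical standalone helper mirroring the suggested one-line patch.
// With more than one name part, every part is backtick-quoted; a single
// part is quoted only if it contains a dot.
object QuoteSketch {
  def quoteNameParts(nameParts: Seq[String]): String =
    nameParts
      .map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n)
      .mkString(".")

  def main(args: Array[String]): Unit = {
    println(quoteNameParts(Seq("a", "b")))    // `a`.`b`
    println(quoteNameParts(Seq("a.b")))       // `a.b`
    println(quoteNameParts(Seq("a.b", "c")))  // `a.b`.`c`
  }
}
```

These outputs match the AFTER examples quoted later in this thread: the nested column `a`.`b` is no longer ambiguous with the literal column name `a.b`.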

Contributor Author

@dongjoon-hyun Thanks! Fixed.

Member

@dongjoon-hyun left a comment

@huaxingao . I like this improvement because of the following four examples.

BEFORE

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a.b`

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`   // ambiguous

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a.b`    // ambiguous

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = ```a.b``.c`    // too verbose

AFTER

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a`.`b`

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a`.`b`

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = `a.b`.`c`

Since the updated test result doesn't clearly show your contribution, could you add some test cases covering the above four examples explicitly? It will prevent future regressions, too.

@SparkQA

SparkQA commented Oct 21, 2018

Test build #97686 has finished for PR 22788 at commit 2aa4ad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2018

Test build #97687 has finished for PR 22788 at commit e941acd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Oct 22, 2018

@huaxingao Can you update the PR description based on @dongjoon-hyun's comments in #22788 (review)? It makes the improvement of this change clearer.

-- !query 9 output
org.apache.spark.sql.AnalysisException
- Reference 't1.i1' is ambiguous, could be: mydb1.t1.i1, mydb1.t1.i1.; line 1 pos 7
+ Reference '`t1`.`i1`' is ambiguous, could be: mydb1.t1.i1, mydb1.t1.i1.; line 1 pos 7
Contributor

why is the new format better? it's more verbose, isn't it?

Member

I thought of some examples in the above.

Contributor

These examples only make sense when we have the outer backticks. e.g. 't1.i1' is good.


val e = intercept[AnalysisException](sql("SELECT v.i from (SELECT i FROM v)"))
assert(e.message ==
  "cannot resolve '`v.i`' given input columns: [__auto_generated_subquery_name.i]")
Contributor

I think the problem here is the outermost backticks; do you know where we add them?

Contributor Author

The outermost backticks are added in the case _ branch of sql, in case class UnresolvedAttribute(nameParts: Seq[String]):

  override def sql: String = name match {
    case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => name
    case _ => quoteIdentifier(name)
  }

Here, nameParts is a Seq of "v", "i", and name is v.i.
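This also explains the "too verbose" output in the PR description. Assuming quoteIdentifier escapes embedded backticks by doubling them (a minimal sketch; BacktickSketch is a hypothetical name, not a Spark API), quoting the already partially quoted name a second time produces the tripled backticks:

```scala
// Minimal sketch of backtick quoting (assumption: embedded backticks
// are escaped by doubling, as Catalyst's quoteIdentifier does).
object BacktickSketch {
  def quoteIdentifier(name: String): String =
    "`" + name.replace("`", "``") + "`"

  def main(args: Array[String]): Unit = {
    // name for Seq("a.b", "c") is "`a.b`.c"; quoting that whole string
    // again yields the "too verbose" form from the PR description:
    println(quoteIdentifier("`a.b`.c"))  // ```a.b``.c`
  }
}
```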

@huaxingao huaxingao changed the title [SPARK-25769][SQL]change nested columns from a.b to a.b [SPARK-25769][SQL]escape nested columns by backtick each of the column name Oct 22, 2018
@SparkQA

SparkQA commented Oct 22, 2018

Test build #97877 has finished for PR 22788 at commit 99bfd00.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.


  def name: String =
-   nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+   nameParts.map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n).mkString(".")
Contributor

I don't think this is better for name; we should update sql, though.

@SparkQA

SparkQA commented Oct 23, 2018

Test build #97881 has finished for PR 22788 at commit 99bfd00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Oct 23, 2018

@cloud-fan @dongjoon-hyun Instead of changing the Filter API, do you think using a proper escape character, as this PR does in #22573, is a good approach?

@cloud-fan
Contributor

Yea, I think so. We can even use JSON to be safer: e.g. for a.b.c.d, we can encode it as a JSON array [a,b,c,d]. At the data source side, use a JSON parser to read it back.

@dbtsai
Member

dbtsai commented Oct 23, 2018

@cloud-fan I like the idea of using JSON, but that will also change the definition of the string format. Do we just use JSON for the nested case, so the existing data sources don't have to be changed?

@cloud-fan
Contributor

Yea, only use JSON if it's a nested column.
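A sketch of the JSON-array idea being discussed (NestedNameSketch and its helpers are hypothetical, not an API proposed in this PR): encode the parts of a nested column name as a JSON string array, and keep a single-part name as a plain string so existing data sources are unaffected:

```scala
// Hypothetical encoding for column names as discussed above:
// plain string for a top-level column, JSON array for a nested one.
object NestedNameSketch {
  // Escape the characters JSON requires escaping inside string literals.
  private def jsonEscape(s: String): String =
    s.flatMap {
      case '"'  => "\\\""
      case '\\' => "\\\\"
      case c    => c.toString
    }

  def encode(parts: Seq[String]): String =
    if (parts.length == 1) parts.head
    else parts.map(p => "\"" + jsonEscape(p) + "\"").mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    println(encode(Seq("a", "b", "c", "d")))  // ["a","b","c","d"]
    println(encode(Seq("i")))                 // i
  }
}
```

The data source side would then detect the leading `[` and decode the nested case with any JSON parser, which avoids the escaping ambiguity entirely.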

@viirya
Member

viirya commented Oct 23, 2018

From above examples,

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`   // ambiguous

Is this ambiguous? Doesn't it mean a column simply named a.b? Btw I don't see the difference between before and after this change.

@dongjoon-hyun
Member

@viirya. Please see all four examples. I guess you missed the context.

BTW, I'm fine with any of the methods if we can proceed further, @cloud-fan and @dbtsai.

@cloud-fan
Contributor

I agree with the problem described in the PR description that UnresolvedAttribute.sql is not ideal. But we should just update UnresolvedAttribute.sql, not the name method. name is used in other places and I think it has no problem.

@viirya
Member

viirya commented Oct 24, 2018

@dongjoon-hyun Oh, I see. The ambiguity is in the results of sql for several inputs.

@SparkQA

SparkQA commented Oct 25, 2018

Test build #98003 has finished for PR 22788 at commit 458f77a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val sql4 = $"`a.b`.c".expr.asInstanceOf[UnresolvedAttribute].sql
assert(sql4 === "`a.b`.`c`")
}
Member

Hi, @huaxingao. Can we move this test to ColumnExpressionSuite? I made a PR against your branch; please review and merge.

Contributor Author

Merged. Thank you very much! @dongjoon-hyun

-- !query 18 output
org.apache.spark.sql.AnalysisException
- cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+ cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
Contributor

Do you think we should just make sql the same as name? It looks to me that 'db1.t1.i1' is better than '`db1`.`t1`.`i1`', as it's more compact and not ambiguous.

Contributor Author

Yes. I agree. For the four examples, we will have the following results:

$"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  a.b

$"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  `a.b`

$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  a.b

$"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a.b`.`c`
  make sql same as name:  `a.b`.c

Does this look good to everybody?

Contributor

LGTM

Contributor

@huaxingao Were you planning to make a change so that name is the same as sql? Currently they are different.

Contributor

@cloud-fan I had a question about the 3rd example from huaxin; I wanted to confirm again that the output looks okay to you.

from huaxin:

$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  a.b

Is it okay to drop the backtick from the first identifier?

Contributor Author

@dilipbiswal Currently, sql is not the same as name.

Contributor

Is it okay to drop the backtick from the first identifier

AFAIK both name and sql are for display/messages. I think dropping the backticks is fine if there is no ambiguity.

Contributor

@cloud-fan Makes sense. Thanks for clarification...

@SparkQA

SparkQA commented Oct 30, 2018

Test build #98270 has finished for PR 22788 at commit 7ff2696.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title [SPARK-25769][SQL]escape nested columns by backtick each of the column name [SPARK-25769][SQL]make UnresolvedAttribute.sql escape nested columns correctly Nov 1, 2018
@gatorsmile
Member

This sounds like a safe change. cc @liancheng

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98366 has finished for PR 22788 at commit 05688f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98365 has finished for PR 22788 at commit 693c512.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2018

Test build #98376 has finished for PR 22788 at commit 3c81840.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

I have a question regarding the test failures in ExpressionTypeCheckingSuite. Most of the tests in this suite failed after I changed UnresolvedAttribute.sql to be the same as UnresolvedAttribute.name. For example:

  test("check types for unary arithmetic") {
    assertError(BitwiseNot('stringField), "requires integral type")
  }

The test failed with

"cannot resolve '~`stringField`' due to data type mismatch: argument 1 requires integral
 type, however, '`stringField`' is of string type.;" did not contain "cannot resolve
 '~stringField' due to data type mismatch:"

It seems that the root cause of the failure is that UnresolvedAttribute.sql no longer matches AttributeReference.sql after my fix. I doubt whether it is correct to make UnresolvedAttribute.sql the same as UnresolvedAttribute.name.

@huaxingao
Contributor Author

@cloud-fan @dongjoon-hyun
Because of the above test failures in ExpressionTypeCheckingSuite, shall I revert to the previous change:

  override def sql: String = nameParts.map { part =>
    part match {
      case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) =>
        part
      case _ =>
        quoteIdentifier(part)
    }
  }.mkString(".")

Or change ExpressionTypeCheckingSuite?

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

I'm retriggering this to see the test failures.

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100471 has finished for PR 22788 at commit 3c81840.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
