
Conversation

Contributor

@huaxingao huaxingao commented Oct 21, 2018

What changes were proposed in this pull request?

Currently, the nested columns are not escaped correctly:

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a.b`

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`   // ambiguous

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a.b`    // ambiguous

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = ```a.b``.c`    // too verbose

It should be something like the following:

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = a.b

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = a.b

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = `a.b`.c

How was this patch tested?

Added tests.

Member

Hi, @huaxingao. Can we use the following one-line patch instead of the 12-line change?

- nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+ nameParts.map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n).mkString(".")

cc @cloud-fan and @gatorsmile and @dbtsai .
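For reference, a standalone sketch of what the suggested one-liner does (QuoteSketch and quoteNameParts are hypothetical names for illustration only; the real change lives inside UnresolvedAttribute):

```scala
// Hypothetical standalone helper mirroring the suggested one-line patch.
// With more than one name part, every part is backtick-quoted; a single
// part is quoted only if it contains a dot.
object QuoteSketch {
  def quoteNameParts(nameParts: Seq[String]): String =
    nameParts
      .map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n)
      .mkString(".")

  def main(args: Array[String]): Unit = {
    println(quoteNameParts(Seq("a", "b")))    // `a`.`b`
    println(quoteNameParts(Seq("a.b")))       // `a.b`
    println(quoteNameParts(Seq("a.b", "c")))  // `a.b`.`c`
  }
}
```

These outputs match the AFTER examples quoted later in this thread: the nested column `a`.`b` is no longer ambiguous with the literal column name `a.b`.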

Contributor Author

@dongjoon-hyun Thanks! Fixed.

Member

@dongjoon-hyun left a comment

@huaxingao . I like this improvement because of the following four examples.

BEFORE

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a.b`

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`   // ambiguous

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a.b`    // ambiguous

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = ```a.b``.c`    // too verbose

AFTER

scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a`.`b`

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`

scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a`.`b`

scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = `a.b`.`c`

Since the updated test result doesn't clearly show your contribution, could you add some test cases covering the above four examples explicitly? It will prevent future regressions, too.

@SparkQA

SparkQA commented Oct 21, 2018

Test build #97686 has finished for PR 22788 at commit 2aa4ad9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Oct 21, 2018

Test build #97687 has finished for PR 22788 at commit e941acd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Member

viirya commented Oct 22, 2018

@huaxingao Can you update the PR description based on @dongjoon-hyun's comments in #22788 (review)? It makes the improvement of this change clearer.

-- !query 9 output
org.apache.spark.sql.AnalysisException
- Reference 't1.i1' is ambiguous, could be: mydb1.t1.i1, mydb1.t1.i1.; line 1 pos 7
+ Reference '`t1`.`i1`' is ambiguous, could be: mydb1.t1.i1, mydb1.t1.i1.; line 1 pos 7
Contributor

why is the new format better? it's more verbose, isn't it?

Member

I thought of some examples in the above.

Contributor

These examples only make sense when we have the outer backticks. e.g. 't1.i1' is good.


val e = intercept[AnalysisException](sql("SELECT v.i from (SELECT i FROM v)"))
assert(e.message ==
  "cannot resolve '`v.i`' given input columns: [__auto_generated_subquery_name.i]")
Contributor

I think the problem here is the outermost backticks; do you know where we add them?

Contributor Author

The outermost backticks are added in the case _ branch of sql, in case class UnresolvedAttribute(nameParts: Seq[String]):

  override def sql: String = name match {
    case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => name
    case _ => quoteIdentifier(name)
  }

Here, nameParts is a Seq of "v", "i", and name is v.i.
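This also explains the "too verbose" output in the PR description. Assuming quoteIdentifier escapes embedded backticks by doubling them (a minimal sketch; BacktickSketch is a hypothetical name, not a Spark API), quoting the already partially quoted name a second time produces the tripled backticks:

```scala
// Minimal sketch of backtick quoting (assumption: embedded backticks
// are escaped by doubling, as Catalyst's quoteIdentifier does).
object BacktickSketch {
  def quoteIdentifier(name: String): String =
    "`" + name.replace("`", "``") + "`"

  def main(args: Array[String]): Unit = {
    // name for Seq("a.b", "c") is "`a.b`.c"; quoting that whole string
    // again yields the "too verbose" form from the PR description:
    println(quoteIdentifier("`a.b`.c"))  // ```a.b``.c`
  }
}
```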

@huaxingao huaxingao changed the title [SPARK-25769][SQL]change nested columns from a.b to a.b [SPARK-25769][SQL]escape nested columns by backtick each of the column name Oct 22, 2018
@SparkQA

SparkQA commented Oct 22, 2018

Test build #97877 has finished for PR 22788 at commit 99bfd00.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Retest this please.


  def name: String =
-   nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+   nameParts.map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n).mkString(".")
Contributor

I don't think this is better for name; we should update sql, though.

@SparkQA

SparkQA commented Oct 23, 2018

Test build #97881 has finished for PR 22788 at commit 99bfd00.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dbtsai
Member

dbtsai commented Oct 23, 2018

@cloud-fan @dongjoon-hyun Instead of changing the Filter API, do you think using a proper escape character, as this PR does in #22573, is a good approach?

@cloud-fan
Contributor

Yea, I think so. We can even use JSON to be safer: e.g. for a.b.c.d, we can encode it as a JSON array [a,b,c,d]. At the data source side, use a JSON parser to read it back.

@dbtsai
Member

dbtsai commented Oct 23, 2018

@cloud-fan I like the idea of using JSON, but that will also change the definition of the string format. Do we just use JSON for the nested case, so the existing data sources don't have to be changed?

@cloud-fan
Contributor

Yea, only use JSON if it's a nested column.
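A sketch of the JSON-array idea being discussed (NestedNameSketch and its helpers are hypothetical, not an API proposed in this PR): encode the parts of a nested column name as a JSON string array, and keep a single-part name as a plain string so existing data sources are unaffected:

```scala
// Hypothetical encoding for column names as discussed above:
// plain string for a top-level column, JSON array for a nested one.
object NestedNameSketch {
  // Escape the characters JSON requires escaping inside string literals.
  private def jsonEscape(s: String): String =
    s.flatMap {
      case '"'  => "\\\""
      case '\\' => "\\\\"
      case c    => c.toString
    }

  def encode(parts: Seq[String]): String =
    if (parts.length == 1) parts.head
    else parts.map(p => "\"" + jsonEscape(p) + "\"").mkString("[", ",", "]")

  def main(args: Array[String]): Unit = {
    println(encode(Seq("a", "b", "c", "d")))  // ["a","b","c","d"]
    println(encode(Seq("i")))                 // i
  }
}
```

The data source side would then detect the leading `[` and decode the nested case with any JSON parser, which avoids the escaping ambiguity entirely.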

@viirya
Member

viirya commented Oct 23, 2018

From above examples,

scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`   // ambiguous

Is this ambiguous? Doesn't it mean a column simply named a.b? Btw I don't see the difference between before and after this change.

@dongjoon-hyun
Member

@viirya. Please see all four examples. I guess you missed the context.

BTW, I'm fine with any of the methods if we can proceed further, @cloud-fan and @dbtsai.

@cloud-fan
Contributor

I agree with the problem described in the PR description that UnresolvedAttribute.sql is not ideal. But we should just update UnresolvedAttribute.sql, not the name method. name is used in other places and I think it has no problem.

@viirya
Member

viirya commented Oct 24, 2018

@dongjoon-hyun Oh, I see. The ambiguity is in the results of sql for several inputs.

@SparkQA

SparkQA commented Oct 25, 2018

Test build #98003 has finished for PR 22788 at commit 458f77a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


val sql4 = $"`a.b`.c".expr.asInstanceOf[UnresolvedAttribute].sql
assert(sql4 === "`a.b`.`c`")
}
Member

Hi, @huaxingao. Can we move this test to ColumnExpressionSuite? I made a PR against your branch; please review and merge.

Contributor Author

Merged. Thank you very much! @dongjoon-hyun

-- !query 18 output
org.apache.spark.sql.AnalysisException
- cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
+ cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7
Contributor

Do you think we should just make sql the same as name? It looks to me that 'db1.t1.i1' is better than '`db1`.`t1`.`i1`', as it's more compact and not ambiguous.

Contributor Author

Yes. I agree. For the four examples, we will have the following results:

$"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  a.b

$"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  `a.b`

$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  a.b

$"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a.b`.`c`
  make sql same as name:  `a.b`.c

Does this look good to everybody?

Contributor

LGTM

Contributor

@huaxingao Were you planning to make a change so that name is the same as sql? Currently they are different.

Contributor

@cloud-fan I had a question about the 3rd example from huaxin; I wanted to confirm again that the output looks okay to you.

from huaxin:

$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
  with my previous fix:   `a`.`b`
  make sql same as name:  a.b

Is it okay to drop the backtick from the first identifier?

Contributor Author

@dilipbiswal Currently, sql is not the same as name.

Contributor

Is it okay to drop the backtick from the first identifier

AFAIK both name and sql are for display/messages. I think dropping the backticks is fine if there is no ambiguity.

Contributor

@cloud-fan Makes sense. Thanks for clarification...

@SparkQA

SparkQA commented Oct 30, 2018

Test build #98270 has finished for PR 22788 at commit 7ff2696.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao huaxingao changed the title [SPARK-25769][SQL]escape nested columns by backtick each of the column name [SPARK-25769][SQL]make UnresolvedAttribute.sql escape nested columns correctly Nov 1, 2018
@gatorsmile
Member

This sounds like a safe change. cc @liancheng

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98366 has finished for PR 22788 at commit 05688f5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 1, 2018

Test build #98365 has finished for PR 22788 at commit 693c512.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Nov 2, 2018

Test build #98376 has finished for PR 22788 at commit 3c81840.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

I have a question regarding the test failures in ExpressionTypeCheckingSuite. Most of the tests in this suite failed after I changed UnresolvedAttribute.sql to be the same as UnresolvedAttribute.name. For example:

  test("check types for unary arithmetic") {
    assertError(BitwiseNot('stringField), "requires integral type")
  }

The test failed with

"cannot resolve '~`stringField`' due to data type mismatch: argument 1 requires integral
 type, however, '`stringField`' is of string type.;" did not contain "cannot resolve
 '~stringField' due to data type mismatch:"

It seems that the root cause of the failure is that UnresolvedAttribute.sql no longer matches AttributeReference.sql after my fix. I doubt whether it is correct to make UnresolvedAttribute.sql the same as UnresolvedAttribute.name.

@huaxingao
Contributor Author

@cloud-fan @dongjoon-hyun
Because of the above test failures in ExpressionTypeCheckingSuite, shall I revert to the previous change:

  override def sql: String = nameParts.map { part =>
    part match {
      case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) =>
        part
      case _ =>
        quoteIdentifier(part)
    }
  }.mkString(".")

Or change ExpressionTypeCheckingSuite?

@HyukjinKwon
Member

retest this please

@HyukjinKwon
Member

I'm retriggering this to see the test failures.

@SparkQA

SparkQA commented Dec 27, 2018

Test build #100471 has finished for PR 22788 at commit 3c81840.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
