-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25769][SQL]make UnresolvedAttribute.sql escape nested columns correctly #22788
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| } | ||
| } | ||
| ).mkString(".") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @huaxingao . Can we have the following one-line patch instead 12 line changes?
- nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".")
+ nameParts.map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n).mkString(".")cc @cloud-fan and @gatorsmile and @dbtsai .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dongjoon-hyun Thanks! Fixed.
dongjoon-hyun
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huaxingao . I like this improvement because of the following four examples.
BEFORE
scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a.b`
scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b` // ambiguous
scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a.b` // ambiquous
scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = ```a.b``.c` // two verboseAFTER
scala> $"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res0: String = `a`.`b`
scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b`
scala> $"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res2: String = `a`.`b`
scala> $"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res3: String = `a.b`.`c`Since the updated test result doesn't clealy show your contribution, could you add some test cases covering the above four examples explicitly? It will prevent future regression, too.
|
Test build #97686 has finished for PR 22788 at commit
|
|
Test build #97687 has finished for PR 22788 at commit
|
|
@huaxingao Can you update the PR description based on @dongjoon-hyun's comments #22788 (review). It is more clear on the improvement of this change. |
| -- !query 9 output | ||
| org.apache.spark.sql.AnalysisException | ||
| Reference 't1.i1' is ambiguous, could be: mydb1.t1.i1, mydb1.t1.i1.; line 1 pos 7 | ||
| Reference '`t1`.`i1`' is ambiguous, could be: mydb1.t1.i1, mydb1.t1.i1.; line 1 pos 7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is the new format better? it's more verbose, isn't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought some examples in the above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These examples only make sense when we have the outer backticks. e.g. 't1.i1' is good.
|
|
||
| val e = intercept[AnalysisException](sql("SELECT v.i from (SELECT i FROM v)")) | ||
| assert(e.message == | ||
| "cannot resolve '`v.i`' given input columns: [__auto_generated_subquery_name.i]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem here is the out-most backticks, do you know where we add it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The out-most backticks is added in
case _, in case class UnresolvedAttribute(nameParts: Seq[String])
override def sql: String = name match {
case ParserUtils.escapedIdentifier(_) | ParserUtils.qualifiedEscapedIdentifier(_, _) => name
case _ => quoteIdentifier(name)
}
the nameParts is a Seq of "v" "i", name is v.i
a.b to a.b|
Test build #97877 has finished for PR 22788 at commit
|
|
Retest this please. |
|
|
||
| def name: String = | ||
| nameParts.map(n => if (n.contains(".")) s"`$n`" else n).mkString(".") | ||
| nameParts.map(n => if (nameParts.length > 1 || n.contains(".")) s"`$n`" else n).mkString(".") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is better for name, we should update sql though.
|
Test build #97881 has finished for PR 22788 at commit
|
|
@cloud-fan @dongjoon-hyun instead of changing |
|
Yea I think so, we can even use JSON to be safer. e.g. for |
|
@cloud-fan I like the idea of using JSON, but that will also change the definition of string format. Do we just use JSON for nested case so the existing data source doesn't have to be changed? |
|
yea, only use json if it's a nested column. |
|
From above examples, scala> $"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
res1: String = `a.b` // ambiguousIs this ambiguous? Doesn't it mean a column simply named |
|
@viirya . Please see the all four examples. I guess you missed the context. BTW, I'm good for any methods if we can proceed further, @cloud-fan and @dbtsai . |
|
I agree with the problem described in the PR description that |
|
@dongjoon-hyun Oh I see. The ambiguousness is in the results of |
|
Test build #98003 has finished for PR 22788 at commit
|
|
|
||
| val sql4 = $"`a.b`.c".expr.asInstanceOf[UnresolvedAttribute].sql | ||
| assert(sql4 === "`a.b`.`c`") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi, @huaxingao . Can we move this test to ColumnExpressionSuite? I made a PR to you. Please review and merge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Merged. Thank you very much! @dongjoon-hyun
| -- !query 18 output | ||
| org.apache.spark.sql.AnalysisException | ||
| cannot resolve '`db1.t1.i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7 | ||
| cannot resolve '`db1`.`t1`.`i1`' given input columns: [mydb2.t1.i1, mydb2.t1.i1]; line 1 pos 7 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you think we should just make sql same as name? It looks to me that 'db1.t1.i1' is better than '`db1`.`t1`.`i1`', as it's more compact and is not ambiguous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I agree. For the four examples, we will have the following results:
$"a.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
with my previous fix:
`a`.`b`
make sql same as name:
a.b
$"`a.b`".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
with my previous fix:
`a`.`b`
make sql same as name:
`a.b`
$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
with my previous fix:
`a`.`b`
make sql same as name:
a.b
$"`a.b`.c".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
with my previous fix:
`a.b`.`c`
make sql same as name:
`a.b`.c
Does this look good to everybody?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@huaxingao Were you planning to make a change to make name same as sql ? Currently they are different ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan I had a question, the 3rd example from huaxin, wanted to confirm again if the output looks okay to you.
from huaxin
$"`a`.b".expr.asInstanceOf[org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute].sql
with my previous fix:
`a`.`b`
make sql same as name:
a.b
Is it okay to drop the backtick from the first identifier ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dilipbiswal
Currently sql is not the same as name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it okay to drop the backtick from the first identifier
AFAIK both name and sql are for display/message. I think dropping backtick is fine if no ambiguous
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan Makes sense. Thanks for clarification...
Refactor the test case
|
Test build #98270 has finished for PR 22788 at commit
|
|
This sounds a safe change. cc @liancheng |
|
Test build #98366 has finished for PR 22788 at commit
|
|
Test build #98365 has finished for PR 22788 at commit
|
|
Test build #98376 has finished for PR 22788 at commit
|
|
I have a question regarding the test failure in The test failed with It seems that the root cause of the failure is that |
|
@cloud-fan @dongjoon-hyun Or change |
|
retest this please |
|
I'm retriggering this to see the test failures. |
|
Test build #100471 has finished for PR 22788 at commit
|
What changes were proposed in this pull request?
Currently, the nested columns are not escaped correctly:
It should be something as the following:
How was this patch tested?
Add test