[SPARK-20690][SQL] Subqueries in FROM should have alias names #17935
viirya wants to merge 9 commits into apache:master
|
Test build #76743 has finished for PR 17935 at commit
|
|
Test build #76745 has finished for PR 17935 at commit
|
|
Test build #76771 has finished for PR 17935 at commit
|
|
Test build #76772 has finished for PR 17935 at commit
|
|
Test build #76783 has started for PR 17935 at commit |
|
retest this please. |
|
Test build #76785 has finished for PR 17935 at commit
|
|
cc @cloud-fan |
sql(
  """
    | select 1
    | from (select 1 from onerow t1 LIMIT 1)
I'm surprised we support this syntax; I think a subquery in the FROM clause must have an alias.
I checked with Postgres: it throws the error "subquery in FROM must have an alias". Can you check with other databases? Thanks!
MySQL:
mysql> select 1 from (select 1 from test);
ERROR 1248 (42000): Every derived table must have its own alias
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries
Hive supports subqueries only in the FROM clause (through Hive 0.12). The subquery has to be given a name because every table in a FROM clause must have a name.
Hive also requires an alias name.
https://docs.oracle.com/cd/E17952_01/mysql-5.1-en/from-clause-subqueries.html
The [AS] name clause is mandatory, because every table in a FROM clause must have a name. Any columns in the subquery select list must have unique names.
Oracle also requires it.
cc @hvanhovell Shall we change the parser? I think it's hard to reason about the semantics of an anonymous subquery.
Yeah, in this change I remove the qualifier after an anonymous subquery. I'm not sure that is what we always want.
I think we should change the parser and require an alias for subqueries.
Yeah, this seems confusing. Subqueries should have an alias. Let's try to add that.
    Row(3, 3.0, 2, 3.0) :: Row(3, 3.0, 2, 3.0) :: Nil)
}

test("SPARK-20690: Do not add missing attributes through subqueries") {
Do we still need this test? I think it's all covered by the parser test.
relationPrimary
    : tableIdentifier sample? (AS? strictIdentifier)?        #tableName
-   | '(' queryNoWith ')' sample? (AS? strictIdentifier)?     #aliasedQuery
+   | '(' queryNoWith ')' sample? (AS? strictIdentifier)      #aliasedQuery
Shall we also update AstBuilder for this?
Should we also force an alias for the next line? '(' relation ')' sample? (AS? strictIdentifier)?
I had the same question when I made this change.
MySQL supports `select 1 from (test);` but doesn't support `select 1 from (test) as a;`.
Postgres supports neither syntax.
Hive supports `select * from (test);` but doesn't support `select * from (test) as a;`.
An aliased relation does not seem to be commonly supported, so I left it untouched.
Yay, unclear semantics. Ok, that is fine for now.
 * Aliased subquery.
 *
 * @param alias the alias name for this subquery.
 * @param child the LogicalPlan
Nit: the LogicalPlan -> the logical plan of this subquery
}

- case f @ Filter(cond, child) if child.resolved =>
+ case f @ Filter(cond, child) if !f.resolved && child.resolved =>
Only added for Filter? How about Sort in the same rule?
hvanhovell left a comment:
One small question, otherwise LGTM.
|
Test build #76969 has finished for PR 17935 at commit
|
|
Test build #76994 has finished for PR 17935 at commit
|
|
Sure.
…On May 17, 2017 12:10 PM, "Wenchen Fan" ***@***.***> wrote:
***@***.**** commented on this pull request.
------------------------------
In sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala
<#17935 (comment)>:
> @@ -631,13 +631,13 @@ class CachedTableSuite extends QueryTest with SQLTestUtils with SharedSQLContext
val ds2 =
sql(
"""
- |SELECT * FROM (SELECT max(c1) FROM t1 GROUP BY c1)
+ |SELECT * FROM (SELECT max(c1) as c1 FROM t1 GROUP BY c1) t1
can you pick another name? t1 appeared twice...
|
|
thanks, merging to master! |
|
oops, you pushed a new commit... But it should be fine as the change is very small. |
|
Thanks @cloud-fan @hvanhovell @gatorsmile @wzhfy |
|
@cloud-fan Yeah, I changed the alias name in one test. I tested it locally.
|
Test build #77007 has finished for PR 17935 at commit
|
|
No problem. The new commit passes tests. |
## What changes were proposed in this pull request?
The Analyzer adds missing attributes to Filter. But it shouldn't do so through subqueries like this:
select 1 from (select 1 from onerow t1 LIMIT 1) where t1.c1=1
This query works in the current codebase. However, the outer WHERE clause shouldn't be able to refer to the `t1.c1` attribute.
The root cause is that we previously allowed subqueries in FROM to have no alias name. This is confusing and isn't supported by various databases such as MySQL, Postgres, and Oracle. We shouldn't support it either.
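To make the scoping argument concrete, here is a minimal toy sketch. It is not Catalyst code: `Attr`, `Table`, `AliasedSubquery`, and `ScopeDemo.resolve` are hypothetical names invented for illustration. It assumes a derived table re-qualifies every output column with its own alias, so the inner qualifier `t1` is no longer visible to the outer query.

```scala
// Toy model only: these types are hypothetical and are not Catalyst's classes.
final case class Attr(qualifier: Option[String], name: String)

sealed trait Plan { def output: Seq[Attr] }

// A base table referenced under an alias, e.g. `onerow t1`.
final case class Table(alias: String, columns: Seq[String]) extends Plan {
  def output: Seq[Attr] = columns.map(c => Attr(Some(alias), c))
}

// A derived table `(child) alias`. Every output column is re-qualified with
// the subquery's alias, so qualifiers used inside the subquery are hidden.
final case class AliasedSubquery(alias: String, child: Plan) extends Plan {
  def output: Seq[Attr] = child.output.map(_.copy(qualifier = Some(alias)))
}

object ScopeDemo {
  // Look up `qualifier.name` among the columns that a plan exposes.
  def resolve(plan: Plan, qualifier: String, name: String): Option[Attr] =
    plan.output.find(a => a.qualifier.contains(qualifier) && a.name == name)

  def main(args: Array[String]): Unit = {
    val derived = AliasedSubquery("t", Table("t1", Seq("c1")))
    println(resolve(derived, "t1", "c1")) // inner qualifier: not found
    println(resolve(derived, "t", "c1"))  // outer alias: found
  }
}
```

With a mandatory alias, the outer query has exactly one name through which it can reach the subquery's columns, which is the property the reviewers ask for above.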
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes apache#17935 from viirya/SPARK-20690.
|
I was trying to run a test case from another database which does support unaliased subqueries in the |
|
@JoshRosen Thanks for filing this issue. I'll look into it. |
|
@JoshRosen what was the other type of database you were using? |
|
@ash211, I was attempting to re-use test cases from CockroachDB (which is surprisingly permissive in the SQL it accepts compared to Postgres). |
|
Guys - isn't this API breaking? |
|
Also the description / title is completely different from the JIRA ticket. |
|
@rxin Sorry I forgot to change JIRA ticket. I changed it now. |
|
Other committers please revert this change until we find a solution or verify that almost no users write queries like this. |
|
I'm OK with reverting this. Just for reference: the SQL 2003 grammar seems to require an alias name for a derived table. As investigated earlier, MySQL, Postgres, and Oracle also require an alias name for a derived table. So it seems to me that the syntax without an alias name is incorrect, and I would expect few users to write queries like this.
|
I don't think that argument is useful at all. For example, none of the other databases support the DataFrame API. Does that mean few users will write DataFrame code? |
|
Sorry, to be more precise: regarding the derived-table syntax in SQL, the databases I listed above are widely used, and they don't support a derived table without an alias name. The SQL 2003 grammar doesn't support it either. Based on that, I tend to think that a derived table without an alias name is not a widely used syntax among database users. But we know there are exceptions such as CockroachDB, as @JoshRosen pointed out. And you're right that many Spark users may already be using this syntax, since we supported it. This isn't an argument against reverting, just a reference to help you decide on this issue.
|
The reason I found out about this is that one of the widely circulated TPC-DS benchmark harnesses online uses this syntax.