Skip to content

[SPARK-20690][SQL] Subqueries in FROM should have alias names#17935

Closed
viirya wants to merge 9 commits intoapache:masterfrom
viirya:SPARK-20690
Closed

[SPARK-20690][SQL] Subqueries in FROM should have alias names#17935
viirya wants to merge 9 commits intoapache:masterfrom
viirya:SPARK-20690

Conversation

@viirya
Copy link
Copy Markdown
Member

@viirya viirya commented May 10, 2017

What changes were proposed in this pull request?

We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this:

select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1

This query works in current codebase. However, the outside where clause shouldn't be able to refer t1.c1 attribute.

The root cause is we allow subqueries in FROM have no alias names previously, it is confusing and isn't supported by various databases such as MySQL, Postgres, Oracle. We shouldn't support it too.

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 10, 2017

Test build #76743 has finished for PR 17935 at commit bb16061.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 10, 2017

Test build #76745 has finished for PR 17935 at commit 974217f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 11, 2017

Test build #76771 has finished for PR 17935 at commit 51aafde.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class ColumnStatsMap(originalMap: AttributeMap[ColumnStat])
  • s\"($

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 11, 2017

Test build #76772 has finished for PR 17935 at commit c19cbbb.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 11, 2017

Test build #76783 has started for PR 17935 at commit a904214.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 11, 2017

retest this please.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 11, 2017

Test build #76785 has finished for PR 17935 at commit a904214.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-20690][SQL][WIP] Analyzer shouldn't add missing attributes through subqueries [SPARK-20690][SQL] Analyzer shouldn't add missing attributes through subqueries May 11, 2017
@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 11, 2017

cc @cloud-fan

sql(
"""
| select 1
| from (select 1 from onerow t1 LIMIT 1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised we support this syntax, I think the FROM clause must have an alias.

I checked with postgres, it will throw exception subquery in FROM must have an alias, can you check with other databases? Thanks!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mysql:

mysql> select 1 from (select 1 from test);
ERROR 1248 (42000): Every derived table must have its own alias

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SubQueries

Hive supports subqueries only in the FROM clause (through Hive 0.12). The subquery has to be given a name because every table in a FROM clause must have a name.

Hive also requires an alias name.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://docs.oracle.com/cd/E17952_01/mysql-5.1-en/from-clause-subqueries.html

The [AS] name clause is mandatory, because every table in a FROM clause must have a name. Any columns in the subquery select list must have unique names.

Oracle also requires it.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @hvanhovell shall we change the parser? I think it's hard to reason about the semantic of an anonymous subquery

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, in this change I remove qualifier after an anonymous subquery. Not sure if it is what we always want.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should change the parser and require alias for subquery.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this seems confusing. Subqueries should be have an alias. Let's try to add that.

@viirya viirya changed the title [SPARK-20690][SQL] Analyzer shouldn't add missing attributes through subqueries [SPARK-20690][SQL] Subqueries in FROM should have alias names May 16, 2017
Row(3, 3.0, 2, 3.0) :: Row(3, 3.0, 2, 3.0) :: Nil)
}

test("SPARK-20690: Do not add missing attributes through subqueries") {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we still need this test? I think it's all covered by the parser test

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. We can remove this.

relationPrimary
: tableIdentifier sample? (AS? strictIdentifier)? #tableName
| '(' queryNoWith ')' sample? (AS? strictIdentifier)? #aliasedQuery
| '(' queryNoWith ')' sample? (AS? strictIdentifier) #aliasedQuery
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also update AstBuilder for this?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also force an alias for the next line? '(' relation ')' sample? (AS? strictIdentifier)?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also have this question when I change this.

MySQL supports select 1 from (test); but doesn't support select 1 from (test) as a ;

Postgres doesn't support both syntax.

Hive supports select * from (test); but doesn't support select * from (test) as a;.

Seems aliased relation is not commonly supported. So I leave it untouched.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yay, unclear semantics. Ok, that is fine for now.

* Aliased subquery.
*
* @param alias the alias name for this subquery.
* @param child the LogicalPlan
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the LogicalPlan -> the logical plan of this subquery

}

case f @ Filter(cond, child) if child.resolved =>
case f @ Filter(cond, child) if !f.resolved && child.resolved =>
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only added for Filter? How about Sort in the same rule?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure.

Copy link
Copy Markdown
Contributor

@hvanhovell hvanhovell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small question, otherwise LGTM.

relationPrimary
: tableIdentifier sample? (AS? strictIdentifier)? #tableName
| '(' queryNoWith ')' sample? (AS? strictIdentifier)? #aliasedQuery
| '(' queryNoWith ')' sample? (AS? strictIdentifier) #aliasedQuery
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we also force an alias for the next line? '(' relation ')' sample? (AS? strictIdentifier)?

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 16, 2017

Test build #76969 has finished for PR 17935 at commit 1e595b2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 17, 2017

Test build #76994 has finished for PR 17935 at commit 9ec0414.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 17, 2017 via email

@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master!

@cloud-fan
Copy link
Copy Markdown
Contributor

oops, you pushed a new commit... But it should be fine as the change is very small.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 17, 2017

Thanks @cloud-fan @hvanhovell @gatorsmile @wzhfy

@asfgit asfgit closed this in 7463a88 May 17, 2017
@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 17, 2017

@cloud-fan Yeah, I change the aliased name in one test. I tested it locally.

@SparkQA
Copy link
Copy Markdown

SparkQA commented May 17, 2017

Test build #77007 has finished for PR 17935 at commit 9b117f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 17, 2017

No problem. The new commit passes tests.

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
## What changes were proposed in this pull request?

We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this:

    select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1

This query works in current codebase. However, the outside where clause shouldn't be able to refer `t1.c1` attribute.

The root cause is we allow subqueries in FROM have no alias names previously, it is confusing and isn't supported by various databases such as MySQL, Postgres, Oracle. We shouldn't support it too.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#17935 from viirya/SPARK-20690.
lycplus pushed a commit to lycplus/spark that referenced this pull request May 24, 2017
## What changes were proposed in this pull request?

We add missing attributes into Filter in Analyzer. But we shouldn't do it through subqueries like this:

    select 1 from  (select 1 from onerow t1 LIMIT 1) where  t1.c1=1

This query works in current codebase. However, the outside where clause shouldn't be able to refer `t1.c1` attribute.

The root cause is we allow subqueries in FROM have no alias names previously, it is confusing and isn't supported by various databases such as MySQL, Postgres, Oracle. We shouldn't support it too.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes apache#17935 from viirya/SPARK-20690.
@JoshRosen
Copy link
Copy Markdown
Contributor

I was trying to run a test case from another database which does support unaliased subqueries in the FROM clause and hit a confusing parser error due to this patch's behavior change. While I agree that we shouldn't necessarily support this syntax, I think that the current error message that we're returning isn't very good so I've file https://issues.apache.org/jira/browse/SPARK-20916 to improve it.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented May 30, 2017

@JoshRosen Thanks for filing this issue. I'll look into it.

@ash211
Copy link
Copy Markdown
Contributor

ash211 commented Jun 5, 2017

@JoshRosen what was the other type of database you were using?

@JoshRosen
Copy link
Copy Markdown
Contributor

@ash211, I was attempting to re-use test cases from CockroachDB (which is surprisingly permissive in the SQL it accepts compared to Postgres).

@rxin
Copy link
Copy Markdown
Contributor

rxin commented Jun 29, 2017

Guys - isn't this API breaking?

@rxin
Copy link
Copy Markdown
Contributor

rxin commented Jun 29, 2017

Also the description / title is completely different from the JIRA ticket.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented Jun 29, 2017

@rxin Sorry I forgot to change JIRA ticket. I changed it now.

@rxin
Copy link
Copy Markdown
Contributor

rxin commented Jun 30, 2017

Other committers please revert this change until we find a solution or verify that almost no users write queries like this.

@viirya
Copy link
Copy Markdown
Member Author

viirya commented Jun 30, 2017

I'm ok to revert this. Just a little reference.

Seems it is required to have alias name for derived table in SQL 2003 grammar:
https://github.com/ronsavage/SQL/blob/master/sql-2003-2.bnf#L2221

And as investigating before, MySQL, Postgres, Oracle require alias name for derived table.

So it seems to me that the syntax of no alias name is incorrect and the users tending to write queries like this is expected to be few.

For your reference.

@rxin
Copy link
Copy Markdown
Contributor

rxin commented Jun 30, 2017

I don't think that argument is useful at all. For example, none of the other databases support the DataFrame API. Does that mean few users will write DataFrame code?

@viirya
Copy link
Copy Markdown
Member Author

viirya commented Jun 30, 2017

Sorry, to be accurate, for the syntax of derived table in SQL, the databases I listed above are commonly seen in the market, and they don't support it without alias name. SQL 2003 grammar also doesn't support it.

Based on the above, I'd tend to think that derived table without alias name is not a widely used syntax among database users.

But we know there's exception such as CockroachDB, as @JoshRosen pointed out. And you're right that, it's possible there're already many Spark users using this syntax, as we support it.

This isn't an argument to object reverting, rather just a reference for you to decide on this issue.

@rxin
Copy link
Copy Markdown
Contributor

rxin commented Jun 30, 2017

The reason I found out about this is because the one of the widely circulated TPC-DS benchmark harness online uses this.

@viirya viirya deleted the SPARK-20690 branch December 27, 2023 18:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants