-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Plan LATERAL
subqueries
#11456
Plan LATERAL
subqueries
#11456
Conversation
7cf77ec
to
5cb0c79
Compare
5cb0c79
to
f54781e
Compare
Thank you! IMO, missing lateral and correlated subquery support is one of the crucial things to fix in DF. Any work in this area is much appreciated. |
f54781e
to
ee34089
Compare
datafusion/sql/src/relation/mod.rs
Outdated
// the `for` loop in `plan_table_with_joins`. | ||
let old_from_schema = planner_context.set_outer_from_schema(None).unwrap(); | ||
let new_query_schema = match planner_context.outer_query_schema() { | ||
Some(lhs) => Some(Arc::new(lhs.join(&old_from_schema)?)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably want to replace the lhs.join(<rhs>)
call here with a version of a join
where identical column references on the right side override their lhs
counterparts (probably lhs.merge(<rhs>)
will suffice).
See #11464 for details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a second commit to the PR that fixes the issue mentioned above (see the commit message for details).
/// extends the FROM schema, returning the existing one, if any | ||
pub fn extend_outer_from_schema(&mut self, schema: &DFSchemaRef) -> Result<()> { | ||
self.outer_from_schema = match self.outer_from_schema.as_ref() { | ||
Some(from_schema) => Some(Arc::new(from_schema.join(schema)?)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my other comment.
Actually, looking at the Postgres behavior it seems that keeping join
here is better (the j1_id
column reference is then reported as ambiguous):
SELECT j1_string, j2_string FROM
j1 AS x JOIN j1 as y USING(j1_string)
LEFT JOIN LATERAL (SELECT * FROM j2 WHERE j1_id < j2_id) AS j2 ON(true);
Thank you @aalexandrov -- I will review this PR carefully, but likely will not have a chance until tomorrow or Tuesday |
c0a65a4
to
db91644
Compare
datafusion/sql/src/relation/mod.rs
Outdated
// factors, and those can never be the first factor in a FROM list. This | ||
// means we arrived here through the `for` loop in `plan_from_tables` or | ||
// the `for` loop in `plan_table_with_joins`. | ||
let old_from_schema = planner_context.set_outer_from_schema(None).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the following query, and it panicked. It's valid in PostgreSQL.
DataFusion CLI v40.0.0
> select * from lateral (select 1) t;
thread 'main' panicked at datafusion/datafusion/sql/src/relation/mod.rs:162:75:
called `Option::unwrap()` on a `None` value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Somehow I was thinking that LATERAL
is not valid / accepted by the parser for the first element in a SELECT
list. I fixed the issue and added a regression test in sql_integration.rs
.
This PR is very much on my list to review, but I fear I may not get to it today (I have a bunch of PRs waiting upstream in arrow-rs I need to attend to) |
db91644
to
79aab9d
Compare
Still on my list... . I am not sure if @jackwener or @mingmwang might have time to review this (or if they are familiar with this particular rewrite) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @aalexandrov -- sorry for not getting a chance to review this PR earlier (I have been quite busy with other stuff and JOINs are pretty low on my priority list unfortunately)
The basic code in this PR looks good to me-- but I don't know enough about the semantics of lateral joins to know if it is
Can you please add some tests to sqllogictest that shows that this code gets the same code in postgres and in DataFusion
Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest
The instructions for postgres compatibility tests are here
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#running-tests-postgres-compatibility
Ideally you should be able to extend one of the existing test files in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files
This PR only focuses on extending I was hoping to merge that first before taking a stab at the required extensions in |
Implementing in phases makes sense to me and is consistent with what we have done in other places However, I still think we should add basic .slt test coverage (even if they simply error our with "unsupported subquery") is to document, with code, what the expected behavior is if someone runs this kind of query. I think a runtime "not yet implemented" error is fine, but going into infinite loops or internal errors is probably less good. |
In order to compute the `set_outer_from_schema` argument we currently use `DFSchema::join`. When we combine the current outer FROM schema with the current outer query schema columns from the latter should override columns from the first, so the correct way is to use `DFSchema::merge`. To witness the fix, note that the query in the fixed test case isn't planned as expected without the accompanying changes.
79aab9d
to
4274826
Compare
I added some tests to check the behavior of the SQL planner and assert the current runtime behavior (unimplemented because physical planning doesn't know what to do with correlated column references) to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @aalexandrov - this code and tests look good to me.
I again apologize for not having a chance to review this PR sooner.
All in all, I think this is a good step forward.
I took the liberty of merging this PR up from main to resolve a conflict and fixed some expected tests
cc @jonahgao in case you are interested in reviewing the SQL planning aspects
🚀 |
Thanks again for bearing with us @aalexandrov -- this is a great step |
Thanks, @alamb. Can you point me to the unnesting rules that already exist? I might try to extend them to cover some of the more basic queries first. |
The optimizer rules are here: https://docs.rs/datafusion/latest/datafusion/optimizer/optimizer/index.html I think in particular two that are relevant are: |
Which issue does this PR close?
Closes the planning part of #10048. The query from the issue will now fail in the optimization phase similar to #6140.
Rationale for this change
Adds an unsupported fragment of the SQL syntax.
What changes are included in this PR?
Extends the planner with support for
LATERAL
subqueries. See the tests added insql_integration.rs
for details.Are these changes tested?
Yes, I added tests in
sql_integration.rs
.Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If we document the supported SQL fragment in the user-facing docs as BNF-style diagrams we might want to adapt those.
I don't think so.