Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Plan LATERAL subqueries #11456

Merged
merged 4 commits into from
Aug 20, 2024
Merged

Plan LATERAL subqueries #11456

merged 4 commits into from
Aug 20, 2024

Conversation

aalexandrov
Copy link
Contributor

Which issue does this PR close?

Closes the planning part of #10048. The query from the issue will now fail in the optimization phase similar to #6140.

Rationale for this change

Adds an unsupported fragment of the SQL syntax.

What changes are included in this PR?

Extends the planner with support for LATERAL subqueries. See the tests added in sql_integration.rs for details.

Are these changes tested?

Yes, I added tests in sql_integration.rs.

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If we document the supported SQL fragment in the user-facing docs as BNF-style diagrams we might want to adapt those.

I don't think so.

@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion labels Jul 13, 2024
@github-actions github-actions bot removed documentation Improvements or additions to documentation development-process Related to development process of DataFusion labels Jul 14, 2024
@aalexandrov aalexandrov changed the title Plan from lateral Plan LATERAL subqueries Jul 14, 2024
@ozankabak
Copy link
Contributor

Thank you! IMO, missing lateral and correlated subquery support is one of the crucial things to fix in DF. Any work in this area is much appreciated.

// the `for` loop in `plan_table_with_joins`.
let old_from_schema = planner_context.set_outer_from_schema(None).unwrap();
let new_query_schema = match planner_context.outer_query_schema() {
Some(lhs) => Some(Arc::new(lhs.join(&old_from_schema)?)),
Copy link
Contributor Author

@aalexandrov aalexandrov Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably want to replace the lhs.join(<rhs>) call here with a version of a join where identical column references on the right side override their lhs counterparts (probably lhs.merge(<rhs>) will suffice).

See #11464 for details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a second commit to the PR that fixes the issue mentioned above (see the commit message for details).

/// extends the FROM schema, returning the existing one, if any
pub fn extend_outer_from_schema(&mut self, schema: &DFSchemaRef) -> Result<()> {
self.outer_from_schema = match self.outer_from_schema.as_ref() {
Some(from_schema) => Some(Arc::new(from_schema.join(schema)?)),
Copy link
Contributor Author

@aalexandrov aalexandrov Jul 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my other comment.

Actually, looking at the Postgres behavior it seems that keeping join here is better (the j1_id column reference is then reported as ambiguous):

SELECT j1_string, j2_string FROM
  j1 AS x JOIN j1 as y USING(j1_string)
  LEFT JOIN LATERAL (SELECT * FROM j2 WHERE j1_id < j2_id) AS j2 ON(true);

@alamb
Copy link
Contributor

alamb commented Jul 14, 2024

Thank you @aalexandrov -- I will review this PR carefully, but likely will not have a chance until tomorrow or Tuesday

// factors, and those can never be the first factor in a FROM list. This
// means we arrived here through the `for` loop in `plan_from_tables` or
// the `for` loop in `plan_table_with_joins`.
let old_from_schema = planner_context.set_outer_from_schema(None).unwrap();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the following query, and it panicked. It's valid in PostgreSQL.

DataFusion CLI v40.0.0

> select * from lateral (select 1) t;
thread 'main' panicked at datafusion/datafusion/sql/src/relation/mod.rs:162:75:
called `Option::unwrap()` on a `None` value

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somehow I was thinking that LATERAL is not valid / accepted by the parser for the first element in a SELECT list. I fixed the issue and added a regression test in sql_integration.rs.

@alamb
Copy link
Contributor

alamb commented Jul 16, 2024

This PR is very much on my list to review, but I fear I may not get to it today (I have a bunch of PRs waiting upstream in arrow-rs I need to attend to)

@alamb
Copy link
Contributor

alamb commented Jul 22, 2024

Still on my list... .

I am not sure if @jackwener or @mingmwang might have time to review this (or if they are familiar with this particular rewrite)

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @aalexandrov -- sorry for not getting a chance to review this PR earlier (I have been quite busy with other stuff and JOINs are pretty low on my priority list unfortunately)

The basic code in this PR looks good to me-- but I don't know enough about the semantics of lateral joins to know if it is

Can you please add some tests to sqllogictest that shows that this code gets the same code in postgres and in DataFusion

Here are the instructions: https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest

The instructions for postgres compatibility tests are here
https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest#running-tests-postgres-compatibility

Ideally you should be able to extend one of the existing test files in https://github.com/apache/datafusion/tree/main/datafusion/sqllogictest/test_files

@aalexandrov
Copy link
Contributor Author

aalexandrov commented Aug 17, 2024

Can you please add some tests to sqllogictest that shows that this code gets the same code in postgres and in DataFusion.

This PR only focuses on extending datafusion/sql/src/planner.rs so we can get logical plans for the LATERAL syntax.

I was hoping to merge that first before taking a stab at the required extensions in datafusion/sql/src/planner.rs that would allow us to run those queries. Once we have those I will be able to add some SLT tests and fix any newly uncovered bugs in both files.

@alamb
Copy link
Contributor

alamb commented Aug 17, 2024

I was hoping to merge that first before taking a stab at the required extensions in datafusion/sql/src/planner.rs that would allow us to run those queries. Once we have those I will be able to add some SLT tests and fix any newly uncovered bugs in both files.

Implementing in phases makes sense to me and is consistent with what we have done in other places

However, I still think we should add basic .slt test coverage (even if they simply error our with "unsupported subquery") is to document, with code, what the expected behavior is if someone runs this kind of query.

I think a runtime "not yet implemented" error is fine, but going into infinite loops or internal errors is probably less good.

In order to compute the `set_outer_from_schema` argument we currently
use `DFSchema::join`. When we combine the current outer FROM schema with
the current outer query schema columns from the latter should override
columns from the first, so the correct way is to use `DFSchema::merge`.

To witness the fix, note that the query in the fixed test case isn't
planned as expected without the accompanying changes.
@aalexandrov
Copy link
Contributor Author

I added some tests to check the behavior of the SQL planner and assert the current runtime behavior (unimplemented because physical planning doesn't know what to do with correlated column references) to joins.slt.

@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Aug 19, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @aalexandrov - this code and tests look good to me.

I again apologize for not having a chance to review this PR sooner.

All in all, I think this is a good step forward.

I took the liberty of merging this PR up from main to resolve a conflict and fixed some expected tests

cc @jonahgao in case you are interested in reviewing the SQL planning aspects

@alamb
Copy link
Contributor

alamb commented Aug 20, 2024

🚀

@alamb alamb merged commit 6e34280 into apache:main Aug 20, 2024
24 checks passed
@alamb
Copy link
Contributor

alamb commented Aug 20, 2024

Thanks again for bearing with us @aalexandrov -- this is a great step

@aalexandrov
Copy link
Contributor Author

Thanks, @alamb. Can you point me to the unnesting rules that already exist? I might try to extend them to cover some of the more basic queries first.

@alamb
Copy link
Contributor

alamb commented Aug 26, 2024

Thanks, @alamb. Can you point me to the unnesting rules that already exist? I might try to extend them to cover some of the more basic queries first.

The optimizer rules are here: https://docs.rs/datafusion/latest/datafusion/optimizer/optimizer/index.html

I think in particular two that are relevant are:

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
sql SQL Planner sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants