Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce user defined SQL planner API #11180

Merged
merged 19 commits into from
Jul 2, 2024
Merged

Conversation

jayzhan211
Copy link
Contributor

@jayzhan211 jayzhan211 commented Jun 30, 2024

Which issue does this PR close?

Closes #11168
part of #10534
Closes #11137

TLDR is to make it easier for users to customize sql planning, specifically to handle operator translation

Rationale for this change

Let sql planner customizable
Follow up from #11168

What changes are included in this PR?

  1. Array function rewrite
  2. GetField
  3. Array literal

TODO in next PRs

#10534, to close this, we have more functions to move.

  • date_part
  • create_struct
  • create_named_struct
  • sql_overlay_to_expr
  • sql_position_to_expr
  • sql_substring_to_expr
  • sql_compound_identifier_to_expr

Are these changes tested?

Are there any user-facing changes?

alamb and others added 6 commits June 28, 2024 13:29
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the sql SQL Planner label Jun 30, 2024
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate labels Jun 30, 2024
Signed-off-by: jayzhan211 <[email protected]>
Arc::new(ArrayFunctionPlanner::try_new(provider).unwrap()) as _;

let field_access_planner =
Arc::new(FieldAccessPlanner::try_new(provider).unwrap()) as _;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

get_field for struct does not need array expression, but I'm not sure should we move it out

Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@jayzhan211
Copy link
Contributor Author

Interesting, I got stack overflow in CI but not in local.

Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot added the sqllogictest SQL Logic Tests (.slt) label Jun 30, 2024
@jayzhan211 jayzhan211 marked this pull request as ready for review June 30, 2024 09:21
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 this is very cool and I think close

I left a bunch of stylistic / naming / comment suggestions -- let me know what you think

datafusion/expr/src/planner.rs Outdated Show resolved Hide resolved
datafusion/expr/src/planner.rs Outdated Show resolved Hide resolved
}

/// This trait allows users to customize the behavior of the SQL planner
pub trait UserDefinedPlanner {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking about different names for this struct that might be more specific to SQL and thus make it easier to find

How about UserDefinedSQLPlanner? or SQLPlannerExtensions?

}
}

pub struct BinaryExpr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it may be confusing if this struct is also called BinaryExpr given https://docs.rs/datafusion/latest/datafusion/logical_expr/struct.BinaryExpr.html. What do you think about callng this RawBinaryExpr or maybe SqlBinaryExpr (and then rename FieldAccessExpr similarly)?

Here is some suggested comments to explain the structure (it would be good to do the same for FieldAccessExpr too):

Suggested change
pub struct BinaryExpr {
/// An operator with two arguments to plan
///
/// Note `left` and `right` are DataFusion [`Expr`]s but the `op` is the SQL AST operator.
/// This structure is used by [`UserDefinedPlanner`] to plan operators with custom expressions.
pub struct BinaryExpr {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also struggle with this naming too 😆 . I will take RawBinaryExpr

Comment on lines 125 to 133
pub enum PlannerSimplifyResult {
/// The function call was simplified to an entirely new Expr
Simplified(Expr),
/// the function call could not be simplified, and the arguments
/// are return unmodified.
OriginalBinaryExpr(BinaryExpr),
OriginalFieldAccessExpr(FieldAccessExpr),
OriginalArray(Vec<Expr>),
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this might be easier to use if it were generic so that the callsites would know exactly the type of the returns Original

SO something like

Suggested change
pub enum PlannerSimplifyResult {
/// The function call was simplified to an entirely new Expr
Simplified(Expr),
/// the function call could not be simplified, and the arguments
/// are return unmodified.
OriginalBinaryExpr(BinaryExpr),
OriginalFieldAccessExpr(FieldAccessExpr),
OriginalArray(Vec<Expr>),
}
pub enum PlannerSimplifyResult<T> {
/// The structure was planned as an entirely new Expr by the planner
Simplified(Expr),
/// the planner did not handle planning the structure, and it is returned
/// unmodified.
OriginalBinaryExpr(T),
}

field_access_expr = expr;
}
_ => {
return exec_err!("Unexpected result encountered. Did you expect an OriginalFieldAccessExpr?")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can avoid this type of check by using a generic in PlannerSimplifyResult

pub expr: Expr,
}

pub enum PlannerSimplifyResult {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this isn't really "simplifying" anything maybe we could call it PlannerResult instead of PlannerSimplifyResult

/// Get all user defined window function names
fn udwf_names(&self) -> Vec<String>;
}
pub use datafusion_expr::planner::ContextProvider;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 for backwards compatibility

Comment on lines 189 to 190
/// user defined planner extensions
pub(crate) planners: Vec<Arc<dyn UserDefinedPlanner>>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about using a single planner

Suggested change
/// user defined planner extensions
pub(crate) planners: Vec<Arc<dyn UserDefinedPlanner>>,
/// user defined planner extensions
pub planner: Option<Arc<dyn UserDefinedPlanner>>>,

This would match the basic pattern we have elsewhere (like proto extension codec) and people could implement composable planners like @lewiszlw did in https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/composed_extension_codec.rs

Copy link
Contributor Author

@jayzhan211 jayzhan211 Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this

  1. If we only accept single planner, does that mean we need to implement all the trait functions in one struct? We can't have array function planner optional in this case.
  2. If the user expect their own planner + our builtin planner, how can they enable both? With current design, they just chain them with with_user_defined_planner.
query
.with_user_defined_planner(array_planner)
.with_user_defined_planner(field_access_planner)
.with_user_defined_planner(user's planner)
...
  1. The example given https://github.com/apache/datafusion/blob/main/datafusion-examples/examples/composed_extension_codec.rs seems to have multiple codecs too 🤔
/// A PhysicalExtensionCodec that tries one of multiple inner codecs
/// until one works
#[derive(Debug)]
struct ComposedPhysicalExtensionCodec {
    codecs: Vec<Arc<dyn PhysicalExtensionCodec>>,
}

@@ -4910,10 +4910,11 @@ select array_ndims(arrow_cast([null], 'List(List(List(Int64)))'));
3

# array_ndims scalar function #2
# TODO: dimensions 20 is the maximum without stack overflow, find a way to enable deep nested array
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is perhaps because the stackframe got larger in debug builds with this change -- one way to workaround it might be to pull the code for planning binary_exprs and the code for planning array access into their own functions rather than leave them inline in plan_expr

@alamb alamb changed the title User defined sql planner Introduce user defined SQL planner API Jun 30, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
@github-actions github-actions bot removed the sqllogictest SQL Logic Tests (.slt) label Jun 30, 2024
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Signed-off-by: jayzhan211 <[email protected]>
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jayzhan211 -- I think this looks great. I pushed two commits to this branch to update comments and rename the variant to avoid having to go back and forth again. Hopefully that is ok

@samuelcolvin what do you think of this PR ? Is it enough to work with instead of #11137?

// register crate of array expressions (if enabled)
#[cfg(feature = "array_expressions")]
{
let array_planner =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb mentioned this pull request Jul 1, 2024
6 tasks
@samuelcolvin
Copy link
Contributor

In general this looks great. I'm going to try to use it on datafusion-functions-json and let you know how I get on.

Should be able to get this done today, or tomorrow if life get's in the way.

///
/// This structure is used by [`UserDefinedSQLPlanner`] to plan operators with
/// custom expressions.
#[derive(Debug, Clone)]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we don't need Clone 🤔 Same as RawFieldAccessExpr and PlannerResult . Others looks good to me

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree there isn't a critical usecase for it now, but I figured it didn't hurt. If you feel strongly I will remove Clone

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not a critical issue.

@jayzhan211 jayzhan211 merged commit 48a1754 into apache:main Jul 2, 2024
24 checks passed
@jayzhan211 jayzhan211 deleted the udplan branch July 2, 2024 00:00
@jayzhan211
Copy link
Contributor Author

Thanks @alamb

@samuelcolvin
Copy link
Contributor

I'm a bit surprised this got merged without waiting for me to try it on datafusion-functions-json!

I said just above that I would try to do that yesterday or today.

As far as I can see, this PR provides no way to register external custom query planners. I guess I'll work on a follow up...

@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

Thanks @samuelcolvin I agree iterating with some other PRs is a good idea.

@alamb
Copy link
Contributor

alamb commented Jul 2, 2024

I wrote up #11207 to track moving the remaining SQL planning features to this API

}
}

not_impl_err!("GetFieldAccess not supported by UserDefinedExtensionPlanners: {field_access_expr:?}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👋 I'm curious about this default branch because I'm seeing this error when doing integration testing with the delta-rs package. To the best of my knowledge we're not using any extension planners, but now this is failing with the latest datafusion. I'm uncertain whether we're doing something wrong, or if the default behavior of falling through to not_impl_err is causing us grief

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @rtyler -- looks to me like the delta.rs code is managing its own SessionState -- like in https://github.com/delta-io/delta-rs/blob/main/crates/core/src/delta_datafusion/mod.rs#L1687 (BTW the Delta API is really nicely thought out)

So I think you'll need to register the same planners in your SessionContext 🤔

Helpfully I think @Omega359 just made a PR to make this easier: #11296

I feel in general DataFusion is hard to use / configure correctly if you are using a custom SessionState / configuration -- one potential thing we were discussing is #11182 (comment) -- I'll file a ticket to make this a thing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I had a similar issue you you @rtyler. Found a workaround described here: #11477

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Proposed fix is here: #11485

@lewiszlw
Copy link
Member

lewiszlw commented Jul 8, 2024

Sorry, I didn't follow up this topic before. I'm thinking if we can have one UserDefinedSQLPlanner trait which all methods have default official implementation, users just need to impl method they want to customize.
We have to loop UserDefinedSQLPlanner instances in current design, it seems not very efficient and has much boilerplate code. And these UserDefinedSQLPlanner instaces might have conflicts.

@alamb
Copy link
Contributor

alamb commented Jul 8, 2024

We have to loop UserDefinedSQLPlanner instances in current design, it seems not very efficient and has much boilerplate code. And these UserDefinedSQLPlanner instaces might have conflicts.

I agree the looping is not ideal.

In terms of conflicts, I think the semantics are pretty well defined now (the planners are ordered and whichever successfully plans the option first will be chosen).

findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* Prototype user defined sql planner might look like

* at arrow

Signed-off-by: jayzhan211 <[email protected]>

* get field

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* plan array literal

Signed-off-by: jayzhan211 <[email protected]>

* move to functions-array

Signed-off-by: jayzhan211 <[email protected]>

* license

Signed-off-by: jayzhan211 <[email protected]>

* cleanup

Signed-off-by: jayzhan211 <[email protected]>

* fix

Signed-off-by: jayzhan211 <[email protected]>

* change nested array test

Signed-off-by: jayzhan211 <[email protected]>

* address comment

Signed-off-by: jayzhan211 <[email protected]>

* fix stack overflow issue

Signed-off-by: jayzhan211 <[email protected]>

* upd cli

Signed-off-by: jayzhan211 <[email protected]>

* fix doc

Signed-off-by: jayzhan211 <[email protected]>

* fix doc

Signed-off-by: jayzhan211 <[email protected]>

* Rename PlannerResult::Simplified to PlannerResult::Planned

* Update comments and add Debug/Clone impls

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate logical-expr Logical plan and expressions sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants