Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reference visitor TreeNode APIs, change ExecutionPlan::children() and PhysicalExpr::children() return references #10543

Merged
merged 6 commits into from
May 27, 2024

Conversation

peter-toth
Copy link
Contributor

@peter-toth peter-toth commented May 16, 2024

Which issue does this PR close?

Part of #10121, required for #10505 and #10426.

Rationale for this change

The current TreeNode visitor APIs (TreeNode::visit() and TreeNode::apply()) have a limiation due to the lifetimes of the TreeNode references passed to TreeNodeVisitor::f_down(), TreeNodeVisitor::f_up() and the f closure of apply() don't match the lifetime of the root TreeNode reference on which the APIs are called.
This restriction means that we can't build up data structures that contain references to descendant treenodes.

E.g. the following code snippet to collect how many times subexpressions occur in an expression tree doesn't work:

let e = sum((col("a") * (lit(1) - col("b"))) * (lit(1) + col("c")));
let mut m: HashMap<&Expr, usize> = HashMap::new();
e.apply(|e| {
    *m.entry(e).or_insert(0) += 1;
    Ok(TreeNodeRecursion::Continue)
});

println!("m: {:#?}", m);

This PR changes the TreeNode visitor APIs to make sure the lifetime of references match the lifetime of the root TreeNode reference so the above example will work.

Please note:
The LogicalPlan::apply_with_subqueries() and LogicalPlan::visit_with_subqueries() APIs, that are similar to TreeNode's base APIs but provide subquery support, can't be made stricter easily. This is because in LogicalPlan::apply_expressions() and LogicalPlan::apply_subqueries() we create temporary Expr::eq, Expr::Column and LogicalPlan::Subquery objects that are not compatible with the root treenode's lifetime.

What changes are included in this PR?

  • This PR modifies TreeNode::apply(), TreeNode::visit() and TreeNode::apply_children() APIs.
  • TreeNode implementations (Expr, LogicalPlan, ConcreteTreeNode and DynTreeNode) are amended to be able to implement the stricter APIs.

Are these changes tested?

Yes, with new UTs.

Are there any user-facing changes?

No.

@peter-toth
Copy link
Contributor Author

cc @alamb, @berkaysynnada, @ozankabak

@berkaysynnada
Copy link
Contributor

cc @alamb, @berkaysynnada, @ozankabak

Thanks @peter-toth. After a quick look, I started thinking it might be better to use this new ref_visitor API instead of keeping also the original one. I'll take a closer look tomorrow to see if that makes sense.

@peter-toth
Copy link
Contributor Author

Thanks @peter-toth. After a quick look, I started thinking it might be better to use this new ref_visitor API instead of keeping also the original one. I'll take a closer look tomorrow to see if that makes sense.

Sure. We can go that way, but the changes needed will be bigger, see the description of the PR for the details.

@alamb
Copy link
Contributor

alamb commented May 16, 2024

I agree with @berkaysynnada in #10543 (comment) that in an ideal world we woul change TreeNode::visit and TreeNode::apply, however as @peter-toth notes this would:

  1. Be a much larger change (and we would have to sort out how it works for DynTreeNodes)
  2. be a breaking API change

Thus, I suggest we merge this PR as is, and file a follow on PR to potentially unify the API. While it in an ideal world we wouldn't have both, I think this is a step in the right direction

Is that ok with you @berkaysynnada ?

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this contribution @peter-toth 🙏

I think we should wait for @berkaysynnada 's feedback before merging this PR byg from my perspective it is good

Can we also please add a note of visit_ref / apply_ref in the description here:

/// [`apply`], [`visit`], [`exists`].

I think we can add the doc as a follow on PR too

@ozankabak
Copy link
Contributor

ozankabak commented May 16, 2024

Breaking the full migration into two PRs (with the latter one removing the old usage and migrating to the new one) sounds reasonable to me. It will probably make sense to do the follow-on quickly because we already will have one major API change (format options) in the next version. Instead of having two versions with two major API changes, it'd probably be better to have one version with these API changes lumped in together.

@peter-toth, how big do you think this PR will be if we do all the changes at once?

@peter-toth
Copy link
Contributor Author

@ozankabak, let me check the changes required for the alternative today or tomorrow and come back to you.

@peter-toth
Copy link
Contributor Author

I'm still working on an alternative to this PR and will need a couple of more days to test a few different ideas...

@ozankabak
Copy link
Contributor

No worries. Will be happy to review and help iterate once you are ready

@alamb
Copy link
Contributor

alamb commented May 20, 2024

What do we think about merging this PR and filing a follow on ticket to unify the APIs?

@peter-toth
Copy link
Contributor Author

What do we think about merging this PR and filing a follow on ticket to unify the APIs?

I'm ok with merging the current state of the PR. But I was also thinking about how to improve it:

As far as I see we have 3 options here:

  1. Adjust the current TreeNode implementations to be compatible with the more strict apply_ref() / visit_ref() and then replace the current apply() and visit() implementation to their ..._ref() versions.
    This change would require:
    1. adding a new method to all DynTreeNodes to retrun references to their children,
    2. fixing LogicalPlan::Join as it only contains the left and right expressions in its on: Vec<(Expr, Expr)> field and LogicalPlan:apply_expressions() creates a temporary Expr::eq instance on the top of the pairs. But creating such temporary objects during visit is not compatible with the stricter API so we should change LogicalPlan::Join to contain Expr::eqs directly.
    3. fixing Expr::Exists, Expr::InSubquery and Expr::ScalarSubquery as they only contain the Subquery strcuts and LogicalPlan::apply_subqueries() creates a temporary LogicalPlan::Subquery instance. We should change the 3 expressions to contain LogicalPlan::Subquerys directly.
  2. Keep the current apply() / visit() and their ..._ref() variants separate like this PR does. Maybe we could adjust the the implementation to not offer apply_ref() / visit_ref() (instead of their current throwing an error behaviour) on TreeNodes that doesn't support it (DynTreeNodes).
  3. Implement apply() / visit() in a way that the references used in the API closures and f_down() / f_up() methods encode if the reference is a strict reference (whose lifetime match the root node's lifetime) or a "loose" reference whose reference doesn't match (used for returning references to temporary node instances).
    I.e. we could introduce a new tree node reference type:
    pub enum TreeNodeRefImpl<'n, 'a, N: TreeNode> {
        Permanent(&'n N), // reference matches the root node's lifetime: 'n
        Temporary(&'a N), // reference doesn't match the root node's lifetime
    }
    
    This change would require to introduce a TreeNodeRef trait and moving the apply() / visit() functionality to there. Both the TreeNode's apply_children() implementations and the API user's closures and f_down() / f_up() methods would need to take into account if the current TreeNodeRef is permanent or temporary, which would make the whole thing quite complex...

So far I've been playing with 3. but it became very complex and still doesn't fully work. Also, I'm no longer sure that such an API makes sense as the API user can't specify what kind of references they want. E.g. in CSE (#10473) I need permanent references and can't do anything with other references whose lifetime doesn't match the root node's lifetime.

So now I'm working on implementing 1., but it will be a pervasive change as there are 56 (ExecutionPlan) + 15 (PhysicalExpr) DynTreeNode implementations in the current codebase. They currently offer fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> and fn children(&self) -> Vec<Arc<dyn PhysicalExpr>> methods to get their children and now I'm changing those to return Vec<&Arc<dyn ...>>.
But if you have a better idea please share.

@github-actions github-actions bot added physical-expr Physical Expressions optimizer Optimizer rules core Core DataFusion crate labels May 22, 2024
@peter-toth peter-toth force-pushed the add-reference-visitor-treenode-apis branch 6 times, most recently from dfd8cd2 to 7470408 Compare May 22, 2024 18:41
@peter-toth peter-toth force-pushed the add-reference-visitor-treenode-apis branch from 7470408 to a857d7f Compare May 22, 2024 19:23
@peter-toth
Copy link
Contributor Author

peter-toth commented May 22, 2024

I've rebased the PR on the latest main. The first commit is exactly the same as the previous state of this PR was, the second commit changes TreeNode:apply() and TreeNode:visit() APIs to use references whose lifetime matches the root treenode's lifetime.
The only TreeNode that was not compatible with this stricter APIs is DynTreeNode. I had to change it and its implementations to return children as Vec<&Arc<Self>> (instead of Vec<Arc<Self>>).

This PR doesn't modify the LogicalPlan::apply_with_subqueries() and LogicalPlan:visit_with_subqueries() APIs. This is becaususe in LogicalPlan::apply_expressions() and LogicalPlan::apply_subqueries() we create temporary Expr::eq, Expr::Column and LogicalPlan::Subquery objects that are not compatible with the root treenode's lifetime.
If we wanted to make the LogicalPlan visitor APIs stricter we would need further changes in LogicalPlan::Join, LogicalPlan::Unnest, LogicalPlan::Extension, Expr::Exists, Expr::InSubquery, Expr::ScalarSubquery enums, but I don't think we need those stricter APIs currently...

I've also updated the PR description.

@berkaysynnada
Copy link
Contributor

Thanks a lot, @peter-toth! This looks great to me. The new children(&self) -> Vec<&Arc<dyn ExecutionPlan>> signature also uncovers implicit clones, and many of them seem to be redundant. Our TreeNode API functionality has reached a very good level. You've done a great job.

Perhaps we could add documentation with code snippets to exemplify the usage of methods and their purposes, to ease the experience for less experienced users.

@@ -258,6 +259,10 @@ impl CaseExpr {
}
}

lazy_static! {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't create temporary Arc::new(NoOp::new()) objects in children(), but as NoOp is just a placeholder, that we need in with_new_children(), we can use 1 global instance. As Arcs can't be const, the simplest solution seemed to be using lazy_static!, but if someone has better idea please share.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we skip pushing for the None cases? The user would know which indices refer to which expressions if needed. My concern is that this behavior could spread to expressions that have a flexible number of children.

Copy link
Contributor Author

@peter-toth peter-toth May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to be able to restore a CaseExpr in CaseExpr::with_new_children() from the a children: Vec<Arc<dyn PhysicalExpr>>. If we don't push anyting for the 2 optional CaseExpr.expr and CaseExpr.else_expr fields, then I don't think there is a way to tell if the first element of children belongs to the optional CaseExpr.expr or to CaseExpr.when_then_expr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with_new_children() of CaseExpr says that:
// For physical CaseExpr, we do not allow modifying children size.
That sentence tells me that we can also should not change the existence of expressions. Can we just insert the new children into the Some() ones in the element order?

Copy link
Contributor Author

@peter-toth peter-toth May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, but please check how NoOps are used there, if we don't have those NoOps in children then how can we restore expr, else_expr and when_then_expr? The problem is that we have 2 optional fields...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a world where NoOp's are never used, I don't see a reason why this implementation wouldn't work.

        let expr = if self.expr().is_some() {
            Some(children[0].clone())
        } else {
            None
        };

        let else_expr = if self.expr().is_some() {
            Some(children[children.len() - 1].clone())
        } else {
            None
        };

        let branches = match (&expr, &else_expr) {
            (Some(_), Some(_)) => children[1..children.len() - 1].to_vec(),
            (Some(_), None) => children[1..children.len()].to_vec(),
            (None, Some(_)) => children[0..children.len() - 1].to_vec(),
            (None, None) => children[0..children.len()].to_vec(),
        };
        let mut when_then_expr: Vec<WhenThen> = vec![];
        for (prev, next) in branches.into_iter().tuples() {
            when_then_expr.push((prev, next));
        }

Of course, NoOps are safer and don't require any assumptions, but I'm just curious about what I might be missing.

Copy link
Contributor Author

@peter-toth peter-toth May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, I got it now. I will fix it today...

Copy link
Contributor Author

@peter-toth peter-toth May 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we can assume that the number of children and their role can't change then probably we can do this: b7eaa47

@alamb
Copy link
Contributor

alamb commented May 23, 2024

I hope to review this PR later today or tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @peter-toth and @berkaysynnada for the review

I think this looks amazing -- really nice 👌

What I think we should do is merge this PR and file some follow on tickets (e.g. to add an example of how to use this API, apply the same treatment to apply_with_subqueries, etc.

I'll plan to file those follow on tickets in the next day or two and merge this PR in unless anyone else would like more time to review

Agian, really nice work @peter-toth

@@ -169,7 +169,7 @@ impl ExecutionPlan for WorkTableExec {
&self.cache
}

fn children(&self) -> Vec<Arc<dyn ExecutionPlan>> {
fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reading through this I think it is also an API change (as it changes the API for ExecutionPlan::children -- however I think it is for the best.

Though I wonder if we are going to be changing the signature anyways, I wonder if we should consider something that doesn't require an allocation like

    fn children(&self) -> [&Arc<dyn ExecutionPlan>] {

Instead of

    fn children(&self) -> Vec<&Arc<dyn ExecutionPlan>> {

🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can return an array if its size is not fixed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe &[&Arc<dyn ExecutionPlan>] was meant?

Copy link
Contributor Author

@peter-toth peter-toth May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we could return a slice, but the current Vec is in sync with other implementations how we usually return children. Like LogicalPlan::inputs() or ConcreteTreeNode::children().
Or shall we change those too?

Copy link
Contributor Author

@peter-toth peter-toth May 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actualy, I don't think we can return a slice of references. Returning an empty slice here would be ok, but at other places where there are children to return (e.g. in BinaryExpr) we need to build a temporary container (vec or array) to store the references of children and then return a slice of the container, but who will own the container?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I opened a PR as returning a slice in ConcreteTreeNode is possible: #10666
But it will not work for LogicalPlan or ExecutionPlan or PhysicalExpr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've tried to return &[Arc<dyn ExecutionPlan>] and it fails unsurprisingly -- unless we stores all children in a vector like Union:

    fn children(&self) -> &[Arc<dyn ExecutionPlan>] {
        self.inputs.as_slice()
    }

Otherwise we cannot return a temporary slice

@alamb alamb added the api change Changes the API exposed to users of the crate label May 24, 2024
@alamb alamb changed the title Add reference visitor TreeNode APIs Add reference visitor TreeNode APIs, change ExecutionPlan::children to return references May 24, 2024
@peter-toth peter-toth changed the title Add reference visitor TreeNode APIs, change ExecutionPlan::children to return references Add reference visitor TreeNode APIs, change ExecutionPlan::children() and PhysicalExpr::children() return references May 25, 2024
@alamb
Copy link
Contributor

alamb commented May 25, 2024

Awesome!

I plan to merge this PR tomorrow unless anyone else would like time to review.

@alamb
Copy link
Contributor

alamb commented May 26, 2024

I filed #10678 to handle visit_with_subqueries

I am waiting to merge this PR until the CI is fixed on main #10677

@alamb alamb merged commit ae10754 into apache:main May 27, 2024
23 checks passed
@alamb
Copy link
Contributor

alamb commented May 27, 2024

Thank you again @peter-toth @ozankabak and @berkaysynnada -- I am woking on some doc examples for this as well

@alamb
Copy link
Contributor

alamb commented May 27, 2024

I made these two PRs to add examples of using TreeNode:

  1. Minor: Add examples of using TreeNode with Expr #10686
  2. Minor: Add examples of using TreeNode with LogicalPlan #10687

Also a bonus to improve: #10685

@peter-toth
Copy link
Contributor Author

Thanks all for the review!

@alamb, those API example PRs look great!

Michael-J-Ward added a commit to Michael-J-Ward/datafusion-python that referenced this pull request Jun 13, 2024
andygrove pushed a commit to apache/datafusion-python that referenced this pull request Jun 14, 2024
* deps: update datafusion to 39.0.0, pyo3 to 0.21, and object_store to 0.10.1

`datafusion-common` also depends on `pyo3`, so they need to be upgraded together.

* feat: remove GetIndexField

datafusion replaced Expr::GetIndexField with a FieldAccessor trait.

Ref apache/datafusion#10568
Ref apache/datafusion#10769

* feat: update ScalarFunction

The field `func_name` was changed to `func` as part of removing `ScalarFunctionDefinition` upstream.

Ref apache/datafusion#10325

* feat: incorporate upstream array_slice fixes

Fixes #670

* update ExectionPlan::children impl for DatasetExec

Ref apache/datafusion#10543

* update value_interval_daytime

Ref apache/arrow-rs#5769

* update regexp_replace and regexp_match

Fixes #677

* add gil-refs feature to pyo3

This silences pyo3's deprecation warnings for its new Bounds api.

It's the 1st step of the migration, and should be removed before merge.

Ref https://pyo3.rs/v0.21.0/migration#from-020-to-021

* fix signature for octet_length

Ref apache/datafusion#10726

* update signature for covar_samp

AggregateUDF expressions now have a builder API design, which removes arguments like filter and order_by

Ref apache/datafusion#10545
Ref apache/datafusion#10492

* convert covar_pop to expr_fn api

Ref: https://github.com/apache/datafusion/pull/10418/files

* convert median to expr_fn api

Ref apache/datafusion#10644

* convert variance sample to UDF

Ref apache/datafusion#10667

* convert first_value and last_value to UDFs

Ref apache/datafusion#10648

* checkpointing with a few todos to fix remaining compile errors

* impl PyExpr::python_value for IntervalDayTime and IntervalMonthDayNano

* convert sum aggregate function to UDF

* remove unnecessary clone on double reference

* apply cargo fmt

* remove duplicate allow-dead-code annotation

* update tpch examples for new pyarrow interval

Fixes #665

* marked q11 tpch example as expected fail

Ref #730

* add default stride of None back to array_slice
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…n()` and `PhysicalExpr::children()` return references (apache#10543)

* add reference visitor APIs

* use stricter references in apply() and visit()

* avoid where clause

* remove NO_OP

* remove assert after removing NO_OP

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api change Changes the API exposed to users of the crate core Core DataFusion crate logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Physical Expressions
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants