Skip to content

[SPARK-41049][SQL] Revisit stateful expression handling#39248

Closed
cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan:expr
Closed

[SPARK-41049][SQL] Revisit stateful expression handling#39248
cloud-fan wants to merge 2 commits intoapache:masterfrom
cloud-fan:expr

Conversation

@cloud-fan
Copy link
Contributor

@cloud-fan cloud-fan commented Dec 27, 2022

What changes were proposed in this pull request?

Spark has a Stateful trait for stateful expressions. The basic idea is to have fresh copies of stateful expressions before evaluating them. This is to avoid issues caused by the flexible DataFrame APIs:

  1. A single expression instance may appear more than once in the expression tree. We have to replace it with fresh copies to avoid sharing states.
  2. An expression tree can be evaluated by multiple callers at the same time. We have to use fresh copies before expression evaluation to avoid sharing states.

However, the handling of stateful expression has several problems. This PR fixes all of them:

  1. We should use fresh copies with codegen as well. If the root expression extends CodegenFallback, then the expression tree will be evaluated using the interpreted mode, even with the codegen code path.
  2. The fresh copies will be dropped if the stateful expression is deeply nested (3 layers).
  3. InterpretedSafeProjection never implemented initialize() for initializing Nondeterministic expressions.
  4. ConvertToLocalRelation called a InterpretedMutableProjection constructor which did not implement the existing Stateful-copying logic. I fixed this by moving that logic out of a factory method and into class's main constructor, guaranteeing that it will always run.
  5. Stateful expression is not always nondeterministic, e.g. ScalaUDF. I removed the Stateful trait and added a def stateful: Boolean function in Expression.

Why are the changes needed?

Fix stateful expression handling

Does this PR introduce any user-facing change?

Yes, now we never share states for stateful expressions, which may produce wrong result.

How was this patch tested?

new tests

@github-actions github-actions bot added the SQL label Dec 27, 2022
@cloud-fan cloud-fan force-pushed the expr branch 2 times, most recently from 0838a76 to de9c17b Compare December 28, 2022 06:57
@cloud-fan cloud-fan changed the title [WIP] revisit stateful expression handling [SPARK-41049][SQL] Revisit stateful expression handling Dec 28, 2022
@cloud-fan cloud-fan marked this pull request as ready for review December 28, 2022 07:16
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pre-existing bug. The final expressions we use is exprs, not expressions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The old usage of .transform here contained a subtle bug related to how fastEquals works.

Let's say that we have a tree which looks like this:

Outer(Middle(Stateful()))

where Outer and Middle are non-Stateful expressions.

When the .transform is applied to Stateful() and .freshCopy() is called, the returned value will be == to the original Stateful expression but will have a different object identity (because it's a fresh object). Internally, .transform will use fastEquals to check whether the transformation modified the node. Stateful overrides fastEquals so that it only considers object identity, so the transform will return the freshCopy() result.

At the next level up, Middle will check whether any of its children have been changed in the recursive bottom-up transformation (see childrenFastEquals() in withNewChildren(), which is called from mapChildren()). It will detect that its children have changed, so the transform will return a new Middle node.

Finally, at the top level, Outer will perform the same check to see if any of its children have changed. This time, however, it will be calling Middle.fastEquals instead of Stateful.fastEquals. Middle's fastEquals method is the regular implementation which also considers object equality. Both the original and new Middle nodes will be ==, so fastEquals will be true and Outer will conclude that its children have not been changed by the transformation and the original Outer will be returned (losing the copy of the stateful expression).

In other words, the old .transform and copying logic here was incorrect if the Stateful expression was nested more than a single level deep.

In this PR I chose to fix this by adding a freshCopyIfContainsStatefulExpression() method to Expression which implements a custom tree traversal considers only object identity when determining whether the transform has changed a node or a node's children.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, this is a nice catch!

@cloud-fan
Copy link
Contributor Author

Comment on lines +32 to +33
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, will sub-expr elimination extract common stateful expressions as common expr and break the rule (not reusing)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 stateful but deterministic expressions always produce the same result given the same input sequence. So it's OK to apply sub-expr elimination.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be called after prepareExpressions is finished, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or more specifically, the implementation should initialize the final expression that it actually uses.

Copy link
Member

@viirya viirya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The proposed logic looks good to me. There are some test failures that looks related.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The keys in tags still refer to original tree node. Is it okay?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this possible? keys are basically strings, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the code here basically follows TreeNode.withNewChildren

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, you're correct. It is just a string for node name.

@cloud-fan
Copy link
Contributor Author

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 4321604 Dec 31, 2022
viirya pushed a commit that referenced this pull request Jan 3, 2023
…he base class

### What changes were proposed in this pull request?

This is a followup of #39248 , to add one more code cleanup. The expression initialization code is duplicated 6 times and we should put it in the base class.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #39364 from cloud-fan/expr.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Liang-Chi Hsieh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants