[SPARK-41049][SQL] Revisit stateful expression handling by cloud-fan · Pull Request #39248 · apache/spark

cloud-fan · 2022-12-27T15:32:06Z

What changes were proposed in this pull request?

Spark has a Stateful trait for stateful expressions. The basic idea is to have fresh copies of stateful expressions before evaluating them. This is to avoid issues caused by the flexible DataFrame APIs:

A single expression instance may appear more than once in the expression tree. We have to replace it with fresh copies to avoid sharing states.
An expression tree can be evaluated by multiple callers at the same time. We have to use fresh copies before expression evaluation to avoid sharing states.

However, the handling of stateful expression has several problems. This PR fixes all of them:

We should use fresh copies with codegen as well. If the root expression extends CodegenFallback, then the expression tree will be evaluated using the interpreted mode, even with the codegen code path.
The fresh copies will be dropped if the stateful expression is deeply nested (3 layers).
InterpretedSafeProjection never implemented initialize() for initializing Nondeterministic expressions.
ConvertToLocalRelation called a InterpretedMutableProjection constructor which did not implement the existing Stateful-copying logic. I fixed this by moving that logic out of a factory method and into class's main constructor, guaranteeing that it will always run.
Stateful expression is not always nondeterministic, e.g. ScalaUDF. I removed the Stateful trait and added a def stateful: Boolean function in Expression.

Why are the changes needed?

Fix stateful expression handling

Does this PR introduce any user-facing change?

Yes, now we never share states for stateful expressions, which may produce wrong result.

How was this patch tested?

new tests

cloud-fan · 2022-12-28T07:17:50Z

.../src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala

This is a pre-existing bug. The final expressions we use is exprs, not expressions

cloud-fan · 2022-12-28T07:20:16Z

.../src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedMutableProjection.scala

The old usage of .transform here contained a subtle bug related to how fastEquals works.

Let's say that we have a tree which looks like this:

Outer(Middle(Stateful()))

where Outer and Middle are non-Stateful expressions.

When the .transform is applied to Stateful() and .freshCopy() is called, the returned value will be == to the original Stateful expression but will have a different object identity (because it's a fresh object). Internally, .transform will use fastEquals to check whether the transformation modified the node. Stateful overrides fastEquals so that it only considers object identity, so the transform will return the freshCopy() result.

At the next level up, Middle will check whether any of its children have been changed in the recursive bottom-up transformation (see childrenFastEquals() in withNewChildren(), which is called from mapChildren()). It will detect that its children have changed, so the transform will return a new Middle node.

Finally, at the top level, Outer will perform the same check to see if any of its children have changed. This time, however, it will be calling Middle.fastEquals instead of Stateful.fastEquals. Middle's fastEquals method is the regular implementation which also considers object equality. Both the original and new Middle nodes will be ==, so fastEquals will be true and Outer will conclude that its children have not been changed by the transformation and the original Outer will be returned (losing the copy of the stateful expression).

In other words, the old .transform and copying logic here was incorrect if the Stateful expression was nested more than a single level deep.

In this PR I chose to fix this by adding a freshCopyIfContainsStatefulExpression() method to Expression which implements a custom tree traversal considers only object identity when determining whether the transform has changed a node or a node's children.

Oh, this is a nice catch!

cloud-fan · 2022-12-28T07:27:13Z

cc @viirya @HyukjinKwon @gengliangwang @allisonwang-db

viirya · 2022-12-29T06:16:37Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionsEvaluator.scala

Hmm, will sub-expr elimination extract common stateful expressions as common expr and break the rule (not reusing)?

2 stateful but deterministic expressions always produce the same result given the same input sequence. So it's OK to apply sub-expr elimination.

viirya · 2022-12-29T06:19:52Z

...catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ExpressionsEvaluator.scala

This should be called after prepareExpressions is finished, right?

or more specifically, the implementation should initialize the final expression that it actually uses.

viirya

The proposed logic looks good to me. There are some test failures that looks related.

viirya · 2022-12-29T06:37:17Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala

The keys in tags still refer to original tree node. Is it okay?

How is this possible? keys are basically strings, right?

Note: the code here basically follows TreeNode.withNewChildren

Yea, you're correct. It is just a string for node name.

cloud-fan · 2022-12-31T02:05:19Z

thanks for the review, merging to master!

…he base class ### What changes were proposed in this pull request? This is a followup of #39248 , to add one more code cleanup. The expression initialization code is duplicated 6 times and we should put it in the base class. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #39364 from cloud-fan/expr. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Liang-Chi Hsieh <[email protected]>

github-actions bot added the SQL label Dec 27, 2022

cloud-fan force-pushed the expr branch 2 times, most recently from 0838a76 to de9c17b Compare December 28, 2022 06:57

cloud-fan changed the title ~~[WIP] revisit stateful expression handling~~ [SPARK-41049][SQL] Revisit stateful expression handling Dec 28, 2022

cloud-fan marked this pull request as ready for review December 28, 2022 07:16

cloud-fan commented Dec 28, 2022

View reviewed changes

cloud-fan force-pushed the expr branch from de9c17b to 5e2ac37 Compare December 28, 2022 07:26

cloud-fan mentioned this pull request Dec 28, 2022

[SPARK-42169] Implement code generation for to_csv function (StructsToCsv) #39097

Closed

viirya reviewed Dec 29, 2022

View reviewed changes

cloud-fan added 2 commits December 30, 2022 21:22

revist stateful expression handling

5167a2f

fix tests

3c9e101

cloud-fan force-pushed the expr branch from 4e0da31 to 3c9e101 Compare December 30, 2022 13:22

viirya approved these changes Dec 30, 2022

View reviewed changes

cloud-fan closed this in 4321604 Dec 31, 2022

cloud-fan mentioned this pull request Jan 3, 2023

[SPARK-41049][SQL][FOLLOWUP] Move expression initialization code to the base class #39364

Closed

Conversation

cloud-fan commented Dec 27, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 28, 2022

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

viirya left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Dec 31, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cloud-fan commented Dec 27, 2022 •

edited

Loading