-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-28306][SQL] Make NormalizeFloatingNumbers rule idempotent #25080
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ping @cloud-fan due to |
|
Test build #107376 has finished for PR 25080 at commit
|
|
Test build #107378 has finished for PR 25080 at commit
|
|
Why this is reported as a bug? If this is a bug, please make a regression test with the title prefix |
|
Changed to improvement. |
|
Thanks, @yeshengm . |
| comparePlans(optimized, correctAnswer) | ||
| } | ||
|
|
||
| test("normalize floating points in window function expressions - idempotence") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so we can remove this test after we add idempotence policy and change the once policy in this test suite to idempotence?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. Do we have to add a mark here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not necessary, I just want to confirm it.
|
|
||
| case _ if expr.dataType == FloatType || expr.dataType == DoubleType => | ||
| NormalizeNaNAndZero(expr) | ||
| KnownFloatingPointNormalized(NormalizeNaNAndZero(expr)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, from my understanding, we didn't quite like such approach though like analysis barrier. Scope here is small so might be fine but this doesn't particularly look like a good fix.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is from TransformArray, since we can't easily tell whether a TransformArray is for FP normalization or not. Otherwise we can just check for NormalizeNaNAndZero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And we don't want to add a new kind of TransformArray node in the final logical plan either (and related logic)... I can't really think of an elegant approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This has a much less impact than the AnalysisBarrier -- this only applies to expressions whereas the AnalysisBarrier applied to plans.
We'd to leave markers in place in case a plan gets re-optimized after the initial optimization, and we have to have something that provides such information persisted in the plan.
The alternative for providing this information would be something like having a new dedicated expression type for floating point array normalization, which would also be disruptive to the expression tree structure. In terms of code reuse and semantic clarity, I'd say Yesheng's current design strikes the best balance.
| } | ||
| } | ||
|
|
||
| case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we override toString here, so that it's invisible to end users when running EXPLAIN?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's already handled in Expression::toString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cloud-fan should it be invisible though? I'd rather leave a trace of the marker in the plan, but we could make it less verbose by making it something like adding a prefix to the child instead of the regular tostring, e.g. print
normalizing-transform(...)
instead of
knownfloatingpointnormalized(transform(...))
WDYT?
|
Test build #107382 has finished for PR 25080 at commit
|
|
Test build #107394 has finished for PR 25080 at commit
|
|
Test build #107396 has finished for PR 25080 at commit
|
|
retest this please |
|
Test build #107400 has finished for PR 25080 at commit
|
rednaxelafx
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks!
|
I'm okie with it too |
|
thanks, merging to master! |
## What changes were proposed in this pull request? The optimizer rule `NormalizeFloatingNumbers` is not idempotent. It will generate multiple `NormalizeNaNAndZero` and `ArrayTransform` expression nodes for multiple runs. This patch fixed this non-idempotence by adding a marking tag above normalized expressions. It also adds missing UTs for `NormalizeFloatingNumbers`. ## How was this patch tested? New UTs. Closes apache#25080 from yeshengm/spark-28306. Authored-by: Yesheng Ma <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
The optimizer rule
NormalizeFloatingNumbersis not idempotent. It will generate multipleNormalizeNaNAndZeroandArrayTransformexpression nodes for multiple runs. This patch fixed this non-idempotence by adding a marking tag above normalized expressions. It also adds missing UTs forNormalizeFloatingNumbers.How was this patch tested?
New UTs.