[SPARK-28306][SQL] Make NormalizeFloatingNumbers rule idempotent #25080

yeshengm · 2019-07-08T23:26:56Z

What changes were proposed in this pull request?

The optimizer rule NormalizeFloatingNumbers is not idempotent: each additional run generates further NormalizeNaNAndZero and ArrayTransform expression nodes. This patch fixes the non-idempotence by adding a marking tag above normalized expressions. It also adds the missing UTs for NormalizeFloatingNumbers.
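The non-idempotence and the tag-based fix can be illustrated on a toy expression tree. This is only a minimal sketch, not Spark's Catalyst code: `Expr`, `Col`, `NormalizeSketch` and both `normalize*` functions are hypothetical stand-ins, and only the names `NormalizeNaNAndZero` and `KnownFloatingPointNormalized` are borrowed from this PR.

```scala
// Toy model of the idempotence problem; these are NOT Spark's Catalyst classes.
sealed trait Expr
case class Col(name: String) extends Expr                         // a floating-point column
case class NormalizeNaNAndZero(child: Expr) extends Expr
case class KnownFloatingPointNormalized(child: Expr) extends Expr // marker tag

object NormalizeSketch {
  // Without a marker: every run wraps the expression again.
  def normalizeNaive(e: Expr): Expr = NormalizeNaNAndZero(e)

  // With a marker: an already-tagged expression is returned unchanged,
  // so a second run of the rule is a no-op.
  def normalizeTagged(e: Expr): Expr = e match {
    case tagged: KnownFloatingPointNormalized => tagged
    case other => KnownFloatingPointNormalized(NormalizeNaNAndZero(other))
  }
}
```

Applying `normalizeNaive` twice nests two `NormalizeNaNAndZero` nodes, whereas `normalizeTagged` reaches a fixed point after the first run.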

How was this patch tested?

New UTs.

yeshengm · 2019-07-09T00:09:41Z

ping @cloud-fan due to NormalizeFloatingNumbers

SparkQA · 2019-07-09T00:53:05Z

Test build #107376 has finished for PR 25080 at commit 8b914cc.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class KnownFloatingPointNormalized(child: Expression) extends UnaryExpression

SparkQA · 2019-07-09T00:59:58Z

Test build #107378 has finished for PR 25080 at commit 58ec2a2.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait TaggingExpression extends UnaryExpression
case class KnownNotNull(child: Expression) extends TaggingExpression
case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression

dongjoon-hyun · 2019-07-09T01:37:20Z

Why is this reported as a bug? If it is a bug, please add a regression test with the title prefix SPARK-28306.

yeshengm · 2019-07-09T01:57:38Z

Changed to improvement.

dongjoon-hyun · 2019-07-09T02:03:59Z

Thanks, @yeshengm .

cloud-fan · 2019-07-09T02:55:07Z

.../test/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingPointNumbersSuite.scala

+    comparePlans(optimized, correctAnswer)
+  }
+
+  test("normalize floating points in window function expressions - idempotence") {


So we can remove this test after we add the idempotence policy and switch this test suite from the once policy to idempotence?

Yep. Do we have to add a mark here?

not necessary, I just want to confirm it.

HyukjinKwon · 2019-07-09T03:30:22Z

...talyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/NormalizeFloatingNumbers.scala


    case _ if expr.dataType == FloatType || expr.dataType == DoubleType =>
-      NormalizeNaNAndZero(expr)
+      KnownFloatingPointNormalized(NormalizeNaNAndZero(expr))


Hm, from my understanding, we didn't quite like this kind of approach, as with the analysis barrier. The scope here is small so it might be fine, but this doesn't particularly look like a good fix.

The problem comes from ArrayTransform, since we can't easily tell whether an ArrayTransform is for FP normalization or not. Otherwise we could just check for NormalizeNaNAndZero.

And we don't want to add a new kind of ArrayTransform node to the final logical plan either (and the related logic)... I can't really think of an elegant approach.

This has a much smaller impact than the AnalysisBarrier: it only applies to expressions, whereas the AnalysisBarrier applied to plans.
We need to leave markers in place in case a plan gets re-optimized after the initial optimization, so something that carries this information has to persist in the plan.

The alternative for providing this information would be something like having a new dedicated expression type for floating point array normalization, which would also be disruptive to the expression tree structure. In terms of code reuse and semantic clarity, I'd say Yesheng's current design strikes the best balance.
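To make the trade-off concrete, here is a toy sketch of why arrays are the hard case. A generic array-transform node looks the same whether the rule built it or the user wrote it, so the rule cannot recognise its own output without either a marker or a dedicated node type. All types below (`Expr`, `ArrayCol`, `LambdaVar`, `Normalizer`) are hypothetical stand-ins rather than Catalyst classes, and the two-argument `ArrayTransform` is a simplification.

```scala
// Toy model; NOT Spark's Catalyst classes.
sealed trait Expr
case class ArrayCol(name: String) extends Expr                 // an array<double> column
case object LambdaVar extends Expr                             // the lambda's element variable
case class NormalizeNaNAndZero(child: Expr) extends Expr
case class ArrayTransform(arg: Expr, body: Expr) extends Expr  // generic transform(arg, x -> body)
case class KnownFloatingPointNormalized(child: Expr) extends Expr

object Normalizer {
  // Normalizes a floating-point array; the marker records that the work is done.
  def normalize(e: Expr): Expr = e match {
    case done: KnownFloatingPointNormalized => done // skip on re-optimization
    case arr =>
      KnownFloatingPointNormalized(ArrayTransform(arr, NormalizeNaNAndZero(LambdaVar)))
  }

  // Counts ArrayTransform nodes, to show the tree does not grow across runs.
  def countTransforms(e: Expr): Int = e match {
    case ArrayTransform(a, b)            => 1 + countTransforms(a) + countTransforms(b)
    case NormalizeNaNAndZero(c)          => countTransforms(c)
    case KnownFloatingPointNormalized(c) => countTransforms(c)
    case _                               => 0
  }
}
```

In this sketch the marker reuses the existing `ArrayTransform` node instead of introducing a normalization-specific transform type, which is the code-reuse argument made above.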

cloud-fan · 2019-07-09T04:58:10Z

...atalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/constraintExpressions.scala

+  }
+}
+
+case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression


shall we override toString here, so that it's invisible to end users when running EXPLAIN?

I think it's already handled in Expression::toString?

@cloud-fan should it be invisible though? I'd rather leave a trace of the marker in the plan, but we could make it less verbose by, say, adding a prefix to the child instead of using the regular toString, e.g. print
normalizing-transform(...)
instead of
knownfloatingpointnormalized(transform(...))

WDYT?
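The two rendering options being compared could be sketched as follows. This is a toy model with a hypothetical `render` method and hypothetical class names (`Transform`, `VerboseMarker`, `PrefixMarker`); it does not use Spark's actual `Expression.toString` machinery.

```scala
// Toy model of the two rendering options; NOT Spark's Expression classes.
sealed trait Expr { def render: String }

case class Transform(args: String) extends Expr {
  def render: String = s"transform($args)"
}

// Option A: the marker prints like any other unary expression.
case class VerboseMarker(child: Expr) extends Expr {
  def render: String = s"knownfloatingpointnormalized(${child.render})"
}

// Option B: the marker only adds a short prefix to its child's rendering.
case class PrefixMarker(child: Expr) extends Expr {
  def render: String = s"normalizing-${child.render}"
}
```

Either way the marker stays visible in the printed plan; option B just keeps the output shorter.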

SparkQA · 2019-07-09T05:03:24Z

Test build #107382 has finished for PR 25080 at commit 201b287.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
trait TaggingExpression extends UnaryExpression
case class KnownNotNull(child: Expression) extends TaggingExpression
case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression

SparkQA · 2019-07-09T06:20:45Z

Test build #107394 has finished for PR 25080 at commit 1f6773c.

This patch fails to build.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-07-09T07:05:01Z

Test build #107396 has finished for PR 25080 at commit f255c8e.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

yeshengm · 2019-07-09T08:01:37Z

retest this please

SparkQA · 2019-07-09T11:20:47Z

Test build #107400 has finished for PR 25080 at commit f255c8e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rednaxelafx

LGTM. Thanks!

HyukjinKwon · 2019-07-10T00:34:31Z

I'm okie with it too

cloud-fan · 2019-07-11T02:22:24Z

thanks, merging to master!

## What changes were proposed in this pull request?

The optimizer rule `NormalizeFloatingNumbers` is not idempotent. It will generate multiple `NormalizeNaNAndZero` and `ArrayTransform` expression nodes for multiple runs. This patch fixed this non-idempotence by adding a marking tag above normalized expressions. It also adds missing UTs for `NormalizeFloatingNumbers`.

## How was this patch tested?

New UTs.

Closes apache#25080 from yeshengm/spark-28306.

Authored-by: Yesheng Ma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

[SPARK-28306] Fix idempotence for normalization

8b914cc

yeshengm changed the title ~~[SPARK-28306] Fix idempotence for optimizer rule NormalizeFloatingNumbers~~ [SPARK-28306][SQL] Fix idempotence for optimizer rule NormalizeFloatingNumbers Jul 8, 2019

dongjoon-hyun added the SQL label Jul 8, 2019

dongjoon-hyun changed the title ~~[SPARK-28306][SQL] Fix idempotence for optimizer rule NormalizeFloatingNumbers~~ [SPARK-28306][SQL] Make NormalizeFloatingNumbers rule idempotent Jul 9, 2019

code dedup

201b287

yeshengm force-pushed the spark-28306 branch from 58ec2a2 to 201b287 Compare July 9, 2019 01:54

cloud-fan reviewed Jul 9, 2019

cloud-fan approved these changes Jul 9, 2019

HyukjinKwon reviewed Jul 9, 2019

cloud-fan reviewed Jul 9, 2019

nit

f255c8e

yeshengm force-pushed the spark-28306 branch from 1f6773c to f255c8e Compare July 9, 2019 06:27

rednaxelafx approved these changes Jul 9, 2019

cloud-fan closed this in 7021588 Jul 11, 2019
