@yeshengm
Contributor

@yeshengm yeshengm commented Jul 8, 2019

What changes were proposed in this pull request?

The optimizer rule NormalizeFloatingNumbers is not idempotent: each additional run generates further NormalizeNaNAndZero and ArrayTransform expression nodes. This patch fixes the non-idempotence by adding a marker tag above normalized expressions. It also adds the missing unit tests (UTs) for NormalizeFloatingNumbers.
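
To make the description above concrete, here is a minimal sketch of the idea on a toy expression tree. It assumes nothing about Catalyst internals: `Attr`, `Normalize`, and `KnownNormalized` are hypothetical stand-ins for a column reference, NormalizeNaNAndZero, and the marker tag, and the two functions are simplified illustrations rather than the rule's actual code.

```scala
// Toy model only: these types are hypothetical stand-ins, not Spark's Catalyst classes.
object TaggingSketch {
  sealed trait Expr
  case class Attr(name: String) extends Expr            // a floating-point column
  case class Normalize(child: Expr) extends Expr        // stands in for NormalizeNaNAndZero
  case class KnownNormalized(child: Expr) extends Expr  // stands in for the marker tag

  // Untagged rule: wraps its input on every run, so it is not idempotent.
  def normalizeUntagged(e: Expr): Expr = Normalize(e)

  // Tagged rule: an expression that already carries the marker is left alone.
  def normalizeTagged(e: Expr): Expr = e match {
    case KnownNormalized(_) => e
    case other              => KnownNormalized(Normalize(other))
  }

  def main(args: Array[String]): Unit = {
    val col = Attr("a")
    println(normalizeUntagged(normalizeUntagged(col))) // Normalize(Normalize(Attr(a)))
    println(normalizeTagged(normalizeTagged(col)))     // KnownNormalized(Normalize(Attr(a)))
  }
}
```

In this sketch the second run of the tagged rule returns its input unchanged, which is the property the patch is after.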

How was this patch tested?

New unit tests.

@yeshengm yeshengm changed the title from "[SPARK-28306] Fix idempotence for optimizer rule NormalizeFloatingNumbers" to "[SPARK-28306][SQL] Fix idempotence for optimizer rule NormalizeFloatingNumbers" on Jul 8, 2019
@yeshengm
Contributor Author

yeshengm commented Jul 9, 2019

Ping @cloud-fan, since this touches NormalizeFloatingNumbers.

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107376 has finished for PR 25080 at commit 8b914cc.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class KnownFloatingPointNormalized(child: Expression) extends UnaryExpression

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107378 has finished for PR 25080 at commit 58ec2a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TaggingExpression extends UnaryExpression
  • case class KnownNotNull(child: Expression) extends TaggingExpression
  • case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression

@dongjoon-hyun dongjoon-hyun changed the title from "[SPARK-28306][SQL] Fix idempotence for optimizer rule NormalizeFloatingNumbers" to "[SPARK-28306][SQL] Make NormalizeFloatingNumbers rule idempotent" on Jul 9, 2019
@dongjoon-hyun
Member

Why is this reported as a bug? If it is a bug, please add a regression test whose title is prefixed with SPARK-28306.

@yeshengm
Contributor Author

yeshengm commented Jul 9, 2019

Changed to improvement.

@dongjoon-hyun
Member

Thanks, @yeshengm .

comparePlans(optimized, correctAnswer)
}

test("normalize floating points in window function expressions - idempotence") {
Contributor

So we can remove this test once we add an idempotence policy and switch this test suite from the Once policy to the idempotence one?

Contributor Author

Yep. Do we have to add a mark here?

Contributor

Not necessary; I just wanted to confirm.
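
As an illustration of what an idempotence check over a rule could look like, here is a minimal sketch. `isIdempotentOn` is a hypothetical helper, and the two string-based rules are toy stand-ins for optimizer rules; none of this is taken from Spark's test utilities or from the policy discussed above.

```scala
// Hypothetical helper for illustration; not part of Spark's test utilities.
object IdempotenceCheck {
  // A rule is idempotent on `plan` if applying it a second time changes nothing.
  def isIdempotentOn[A](rule: A => A)(plan: A): Boolean = {
    val once = rule(plan)
    rule(once) == once
  }

  // Toy rules over strings standing in for plans.
  val wrapAlways: String => String = s => s"normalize($s)"
  val wrapOnce: String => String =
    s => if (s.startsWith("normalize(")) s else s"normalize($s)"

  def main(args: Array[String]): Unit = {
    println(isIdempotentOn(wrapAlways)("a")) // false
    println(isIdempotentOn(wrapOnce)("a"))   // true
  }
}
```

A suite-wide check of this shape would make a dedicated "run the rule twice" test per rule redundant, which is the point raised in the comment above.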


  case _ if expr.dataType == FloatType || expr.dataType == DoubleType =>
-   NormalizeNaNAndZero(expr)
+   KnownFloatingPointNormalized(NormalizeNaNAndZero(expr))
Member

Hm, from my understanding, we didn't much like this kind of approach in the past, the analysis barrier being one example. The scope here is small, so it might be fine, but this doesn't look like a particularly good fix.

Contributor Author

The problem comes from ArrayTransform: we can't easily tell whether an ArrayTransform is there for FP normalization or not. Otherwise we could just check for NormalizeNaNAndZero.

Contributor Author

And we don't want to add a new kind of ArrayTransform node (and the related logic) to the final logical plan either... I can't really think of a more elegant approach.

Contributor

This has much less impact than the AnalysisBarrier: it only applies to expressions, whereas the AnalysisBarrier applied to plans.
We need to leave markers in place in case a plan gets re-optimized after the initial optimization, so something that carries this information has to be persisted in the plan.

The alternative for providing this information would be something like having a new dedicated expression type for floating point array normalization, which would also be disruptive to the expression tree structure. In terms of code reuse and semantic clarity, I'd say Yesheng's current design strikes the best balance.
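
The trade-off discussed in this thread can be sketched on a toy model. Everything here is an assumption made for illustration: `Transform` stands in for ArrayTransform with an opaque lambda, `KnownNormalized` for the marker, and `canonical` for per-element normalization. Because the lambda cannot be inspected, a rule-inserted transform looks the same as a user-written one, and only the marker distinguishes them.

```scala
// Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
object ArrayMarkerSketch {
  sealed trait Expr
  case class ArrayAttr(name: String) extends Expr
  // Stand-in for ArrayTransform: the lambda is opaque, so its purpose cannot be inspected.
  case class Transform(arg: Expr, f: Double => Double) extends Expr
  case class KnownNormalized(child: Expr) extends Expr

  // Maps -0.0 to 0.0 and any NaN to Double.NaN; other values pass through.
  def canonical(d: Double): Double =
    if (d.isNaN) Double.NaN else if (d == 0.0) 0.0 else d

  def normalizeArray(e: Expr): Expr = e match {
    case KnownNormalized(_) => e // marker present: already normalized, leave as is
    case other              => KnownNormalized(Transform(other, canonical))
  }

  def main(args: Array[String]): Unit = {
    // A user-written transform must still be normalized, exactly once.
    val user = Transform(ArrayAttr("xs"), _ * 2)
    val once = normalizeArray(user)
    println(normalizeArray(once) eq once) // true
  }
}
```

Without the marker, the rule would have to either re-wrap every `Transform` it sees or introduce a dedicated node type, which is the alternative described in the comment above.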

}
}

case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression
Contributor

@cloud-fan cloud-fan Jul 9, 2019

shall we override toString here, so that it's invisible to end users when running EXPLAIN?

Contributor Author

I think it's already handled in Expression::toString?

Contributor

@cloud-fan should it be invisible, though? I'd rather leave a trace of the marker in the plan, but we could make it less verbose by adding a prefix to the child's output instead of using the regular toString, e.g. print
normalizing-transform(...)
instead of
knownfloatingpointnormalized(transform(...))

WDYT?
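
The rendering options raised in this thread can be compared side by side in a small sketch. The types and the three functions (`verbose`, `hidden`, `prefixed`) are hypothetical and do not reflect how Expression's toString is actually implemented in Spark; they only reproduce the output strings quoted above.

```scala
// Toy model only: hypothetical rendering functions, not Spark's toString logic.
object ExplainSketch {
  sealed trait Expr
  case class Transform(args: String) extends Expr
  case class KnownNormalized(child: Expr) extends Expr

  // Regular rendering: the marker shows up as its own wrapper.
  def verbose(e: Expr): String = e match {
    case Transform(args)        => s"transform($args)"
    case KnownNormalized(child) => s"knownfloatingpointnormalized(${verbose(child)})"
  }

  // Invisible marker: only the child is printed.
  def hidden(e: Expr): String = e match {
    case Transform(args)        => s"transform($args)"
    case KnownNormalized(child) => hidden(child)
  }

  // Prefix rendering: the marker leaves a short trace on the child.
  def prefixed(e: Expr): String = e match {
    case Transform(args)        => s"transform($args)"
    case KnownNormalized(child) => s"normalizing-${prefixed(child)}"
  }

  def main(args: Array[String]): Unit = {
    val e = KnownNormalized(Transform("..."))
    println(verbose(e))  // knownfloatingpointnormalized(transform(...))
    println(hidden(e))   // transform(...)
    println(prefixed(e)) // normalizing-transform(...)
  }
}
```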

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107382 has finished for PR 25080 at commit 201b287.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait TaggingExpression extends UnaryExpression
  • case class KnownNotNull(child: Expression) extends TaggingExpression
  • case class KnownFloatingPointNormalized(child: Expression) extends TaggingExpression

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107394 has finished for PR 25080 at commit 1f6773c.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107396 has finished for PR 25080 at commit f255c8e.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yeshengm
Contributor Author

yeshengm commented Jul 9, 2019

retest this please

@SparkQA

SparkQA commented Jul 9, 2019

Test build #107400 has finished for PR 25080 at commit f255c8e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@rednaxelafx rednaxelafx left a comment

LGTM. Thanks!

@HyukjinKwon
Member

I'm okay with it too.

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 7021588 Jul 11, 2019
j-baker pushed a commit to palantir/spark that referenced this pull request Jan 28, 2020
## What changes were proposed in this pull request?
The optimizer rule `NormalizeFloatingNumbers` is not idempotent. It will generate multiple `NormalizeNaNAndZero` and `ArrayTransform` expression nodes for multiple runs. This patch fixed this non-idempotence by adding a marking tag above normalized expressions. It also adds missing UTs for `NormalizeFloatingNumbers`.

## How was this patch tested?
New UTs.

Closes apache#25080 from yeshengm/spark-28306.

Authored-by: Yesheng Ma <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>