Skip to content

[SPARK-31705][SQL][FOLLOWUP] Avoid the unnecessary CNF computation for full-outer joins#28810

Closed
maropu wants to merge 4 commits intoapache:masterfrom
maropu:SPARK-31705
Closed

[SPARK-31705][SQL][FOLLOWUP] Avoid the unnecessary CNF computation for full-outer joins#28810
maropu wants to merge 4 commits intoapache:masterfrom
maropu:SPARK-31705

Conversation

@maropu
Copy link
Member

@maropu maropu commented Jun 12, 2020

What changes were proposed in this pull request?

To avoid the unnecessary CNF computation for full-outer joins, this PR fixes code for filtering out full-outer joins at the entrance of the rule.

Why are the changes needed?

To mitigate optimizer overhead.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing tests.

@maropu
Copy link
Member Author

maropu commented Jun 12, 2020

cc: @gengliangwang @cloud-fan

@gengliangwang
Copy link
Member

@maropu Thanks for the improvement. Shall we update PushPredicateThroughJoin as well?

@maropu
Copy link
Member Author

maropu commented Jun 12, 2020

Yea, sure.

case UsingJoin(_, _) => sys.error("Untransformed Using join node")

case jt =>
sys.error(s"Unexpected join type: $jt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since we are here, can we throw an exception instead? sys.error will exit the JVM IIRC

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about IllegalStateException? Throwing an exception here looks okay to me, but I personally think we need consistent handling for unexpected code pathes. I've checed how to handle this unexpected behaivour in the other optimzier rules, then I found there are some rules to use sys.error;

$ grep -nr "sys.error" .
./Optimizer.scala:1345:          sys.error(s"Unexpected join type: $jt")
./Optimizer.scala:1381:          sys.error(s"Unexpected join type: $jt")
./PushCNFPredicateThroughJoin.scala:64:            sys.error(s"Unexpected join type: $jt")
./subquery.scala:458:        sys.error(s"Unexpected operator in scalar subquery: $lp")
./subquery.scala:496:          sys.error(s"Correlated subquery has unexpected operator $op below filter")
./subquery.scala:498:        case op @ _ => sys.error(s"Unexpected operator $op in correlated subquery")
./subquery.scala:502:    sys.error("This line should be unreachable")
./subquery.scala:564:              case op => sys.error(s"Unexpected operator $op in corelated subquery")

Is it better to change them, too?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about AnalysisException?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AnalysisException is mainly used for an analyzing phase? https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/AnalysisException.scala#L24

yea, I know some rules in the optimizer throw AnalysisException though...

(base) maropu@~/Repositories/spark/spark-master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer: (SPARK-31705 $)$grep -nr "throw" .
./joins.scala:187:        // `requires attributes from more than one child`, we throw firstly here for better
./joins.scala:189:        throw new AnalysisException("Using PythonUDF in join condition of join type" +
./joins.scala:206:          throw new AnalysisException("Using PythonUDF in join condition of join type" +
./NormalizeFloatingNumbers.scala:100:      throw new IllegalStateException("grouping/join/window partition keys cannot be map type.")
./NormalizeFloatingNumbers.scala:134:    case _ => throw new IllegalStateException(s"fail to normalize $expr")
./objects.scala:250:        throw new IllegalStateException("LambdaVariable should never has 0 as its ID.")
./objects.scala:255:          throw new IllegalStateException(
./objects.scala:264:          throw new IllegalStateException(
./Optimizer.scala:1345:          throw new IllegalStateException(s"Unexpected join type: $jt")
./Optimizer.scala:1381:          throw new IllegalStateException(s"Unexpected join type: $jt")
./Optimizer.scala:1436:          throw new AnalysisException(
./PushCNFPredicateThroughJoin.scala:64:            throw new IllegalStateException(s"Unexpected join type: $jt")
./ReplaceExceptWithFilter.scala:106:    throw new IllegalStateException("Leaf node is expected")
./ReplaceNullWithFalseInPredicate.scala:105:        throw new IllegalArgumentException(message)
./subquery.scala:79:            throw new AnalysisException("Found conflicting attributes " +

@SparkQA
Copy link

SparkQA commented Jun 12, 2020

Test build #123908 has finished for PR 28810 at commit 6cad6c6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jun 12, 2020

retest this please

@SparkQA
Copy link

SparkQA commented Jun 12, 2020

Test build #123917 has finished for PR 28810 at commit 7850c2b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2020

Test build #123923 has finished for PR 28810 at commit 1251b68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Jun 12, 2020

Test build #123931 has finished for PR 28810 at commit 1251b68.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case UsingJoin(_, _) => sys.error("Untransformed Using join node")

case jt =>
throw new IllegalStateException(s"Unexpected join type: $jt")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IllegalStateException means unexpected code path, I think this is the standard in Java, +1

@maropu
Copy link
Member Author

maropu commented Jun 16, 2020

retest this please

@maropu
Copy link
Member Author

maropu commented Jun 16, 2020

ping @cloud-fan @gengliangwang

@SparkQA
Copy link

SparkQA commented Jun 16, 2020

Test build #124081 has finished for PR 28810 at commit 1251b68.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member Author

maropu commented Jun 16, 2020

retest this please

case NaturalJoin(_) => sys.error("Untransformed NaturalJoin node")
case UsingJoin(_, _) => sys.error("Untransformed Using join node")

case jt =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: jt => other

case FullOuter => j
case NaturalJoin(_) => sys.error("Untransformed NaturalJoin node")
case UsingJoin(_, _) => sys.error("Untransformed Using join node")
case jt =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

@SparkQA
Copy link

SparkQA commented Jun 16, 2020

Test build #124104 has finished for PR 28810 at commit 1251b68.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@wangyum wangyum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make conjunctiveNormalForm to protected in this pr?

@maropu
Copy link
Member Author

maropu commented Jun 16, 2020

Thanks, all! Pending, Jenkins.

@SparkQA
Copy link

SparkQA commented Jun 16, 2020

Test build #124122 has finished for PR 28810 at commit 6dbff0c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum wangyum closed this in 8d57709 Jun 16, 2020
@wangyum
Copy link
Member

wangyum commented Jun 16, 2020

Merged to master.

@SparkQA
Copy link

SparkQA commented Jun 16, 2020

Test build #124120 has finished for PR 28810 at commit f4497a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants