[SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize #29104

leanken-zz · 2020-07-14T15:40:30Z

What changes were proposed in this pull request?

Normally, a null-aware anti join (NAAJ) is planned as a BroadcastNestedLoopJoin, which is very time consuming, for instance in TPCH Query 16:

select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#45'
    and p_type not like 'MEDIUM POLISHED%'
    and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size

The NOT IN subquery in the above query is planned into

LeftAnti
condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey))
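
As a rough illustration of why this condition ends up in a nested loop, here is a minimal sketch in plain Scala (not Spark code). It assumes `Option[Int]` as a stand-in for a nullable key, and the names `sqlEq`, `naajCondition` and `leftAntiNestedLoop` are hypothetical.

```scala
object NestedLoopNaajSketch {
  // SQL three-valued equality: None models NULL / unknown.
  def sqlEq(a: Option[Int], b: Option[Int]): Option[Boolean] =
    for (x <- a; y <- b) yield x == y

  // Or(EqualTo(a, b), IsNull(EqualTo(a, b))):
  // true when a = b is true or when a = b is unknown.
  def naajCondition(a: Option[Int], b: Option[Int]): Boolean =
    sqlEq(a, b).getOrElse(true)

  // LeftAnti: keep a streamed row only if no build row satisfies the condition.
  // Every streamed row is compared with every build row, hence O(M * N).
  def leftAntiNestedLoop(
      streamed: Seq[Option[Int]],
      build: Seq[Option[Int]]): Seq[Option[Int]] =
    streamed.filter(s => !build.exists(b => naajCondition(s, b)))
}
```

Because the condition is true for unknown comparisons as well as for matches, it is not a plain equality, which is presumably why it is not picked up as an equi-join key.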

BroadcastNestedLoopJoinExec evaluates this condition in O(M*N). But if the NAAJ involves only a single column, we can always turn the buildSide into a HashSet so that the streamedSide only needs to look up keys in it, which reduces the work to O(M).

This optimization only targets null-aware anti join with a single column, because multi-column support is much more complicated; we might be able to support multi-column in the future.
After applying this patch, the TPCH Query 16 run time decreases from 41 mins to 30s.

The semantics of the null-aware anti join are:
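
A minimal sketch of the single-column semantics, reconstructed from the checks discussed in this thread (empty build side, NULL on the build side, NULL streamed key, hash lookup). It is plain Scala rather than the patch's code; `Option[Int]` stands in for a nullable key and `SingleColumnNaajSketch` is a hypothetical name.

```scala
object SingleColumnNaajSketch {
  def nullAwareAntiJoin(
      streamed: Seq[Option[Int]],
      build: Seq[Option[Int]]): Seq[Option[Int]] = {
    if (build.isEmpty) {
      // Empty subquery: NOT IN holds for every streamed row, NULL keys included.
      streamed
    } else if (build.exists(_.isEmpty)) {
      // A NULL on the build side: NOT IN can never evaluate to true.
      Seq.empty
    } else {
      // Built once in O(N); each probe below is an expected O(1) lookup.
      val keys: Set[Int] = build.flatten.toSet
      streamed.filter {
        case None => false                 // NULL streamed key: comparison is unknown, drop the row
        case Some(k) => !keys.contains(k)  // keep only keys with no match
      }
    }
  }
}
```

The two build-side checks are decided once, before any streamed row is looked at, so the per-row work is a single hash probe.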

Why are the changes needed?

TPCH is a common benchmark for distributed compute engines. The other 21 queries work fine on Spark, except for Query 16; applying this patch will make Spark more competitive among these popular engines. BTW, this patch has restrictive rules and only applies to the single-column NAAJ case, which is safe enough.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

SQLQueryTestSuite with NOT IN SQL, adding CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin on and off.
Added cases in org.apache.spark.sql.JoinSuite.
Added cases in org.apache.spark.sql.SubquerySuite.
Compared performance before and after applying this patch against TPCH Query 16.
Ran e2e tests against the following config combinations:

Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "false",
  "spark.sql.codegen.wholeStage" -> "false"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "false",
  "spark.sql.codegen.wholeStage" -> "true"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "true",
  "spark.sql.codegen.wholeStage" -> "false"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "true",
  "spark.sql.codegen.wholeStage" -> "true"
)
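
The four maps above are the cross product of the two boolean toggles with the feature flag held on. A small sketch of generating them rather than writing them out by hand; it uses only the config keys listed above, and `ConfigMatrixSketch` is a hypothetical name.

```scala
object ConfigMatrixSketch {
  // One map per combination of adaptive execution and whole-stage codegen,
  // with the null-aware anti join optimization always enabled.
  val configs: Seq[Map[String, String]] =
    for {
      adaptive <- Seq("false", "true")
      wholeStage <- Seq("false", "true")
    } yield Map(
      "spark.sql.optimizeNullAwareAntiJoin" -> "true",
      "spark.sql.adaptive.enabled" -> adaptive,
      "spark.sql.codegen.wholeStage" -> wholeStage)
}
```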

leanken-zz · 2020-07-14T15:42:45Z

@cloud-fan
Could you please have a quick look at this issue, many thanks !!

maropu · 2020-07-15T01:55:00Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+              notInSubquerySingleColumnOptimizeStreamedKeyIndex,
+              notInSubquerySingleColumnOptimizeStreamedKey.dataType
+            )
+            val notInKeyEqual = params.buildSideHashSet.contains(streamedRowNotInKey)


Could we reuse [Unsafe|Long]HashedRelation here?

Done, changed to HashedRelation. Nice advice.

maropu · 2020-07-15T01:59:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+            if leftAttr.semanticEquals(tmpLeft) && rightAttr.semanticEquals(tmpRight) =>
+          notInSubquerySingleColumnOptimizeSetStreamedKey(leftAttr, rightAttr)
+          if (notInSubquerySingleColumnOptimizeStreamedKeyIndex != -1) {
+            true
+          } else {
+            logWarning(s"failed to find notInSubquerySingleColumnOptimizeStreamedKeyIndex," +
+              s" fallback to leftExistenceJoin.")
+            false


Is this code block the same as lines 244-251? If so, could you merge them? How about defining an extractor object for this case?

I checked the source code in subquery.scala and found that
Or(EqualTo(a, b), IsNull(EqualTo(a, b))) is the only possible pattern, so there is no need to handle two Or patterns. I removed the duplicate code.

// See org/apache/spark/sql/catalyst/optimizer/subquery.scala
val inConditions = values.zip(sub.output).map(EqualTo.tupled)
val nullAwareJoinConds = inConditions.map(c => Or(c, IsNull(c)))

maropu · 2020-07-15T02:00:32Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+      // or(a=b,isnull(a=b))
+      // or(isnull(a=b),a=b)
+      condition.get match {
+        case _@Or(_@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference),


btw, could you follow the format of the other code? For example, we need a space between @ and EqualTo.

maropu · 2020-07-15T02:06:36Z

Could you update TPCDSQueryBenchmark-results.txt, too? Can the number of q16 get better?

spark/sql/core/benchmarks/TPCDSQueryBenchmark-results.txt

Line 101 in 03b5707

    
           q16                                                1658           1707          69          0.0      Infinity       1.0X

maropu · 2020-07-15T02:06:49Z

ok to test

maropu · 2020-07-15T02:22:51Z

oh, btw, thanks for the first contribution, @leanken .

leanken-zz · 2020-07-15T02:56:16Z

oh, btw, thanks for the first contribution, @leanken .

Will reply your comments ASAP, many thanks.

SparkQA · 2020-07-15T07:05:02Z

Test build #125871 has finished for PR 29104 at commit 042ca4a.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds the following public classes (experimental):
case class NotInSubquerySingleColumnOptimizeParams(

dilipbiswal · 2020-07-15T07:06:29Z

retest this please

leanken-zz · 2020-07-15T12:02:10Z

Could you update TPCDSQueryBenchmark-results.txt, too? Can the number of q16 get better?

spark/sql/core/benchmarks/TPCDSQueryBenchmark-results.txt

Line 101 in 03b5707

q16 1658 1707 69 0.0 Infinity 1.0X

I am afraid that the TPCDS queries do not have a NotInSubquery case; TPCDS uses Not Exists instead of Not In. What I ran before is TPCH Query 16. But I am more than happy to write TPCH benchmark code and run the benchmark after this issue is closed, if needed, ^_^

maropu · 2020-07-15T12:11:51Z

Ah, I see and I missed that. You said not TPCDS but TPCH, right.

leanken-zz · 2020-07-15T12:12:18Z

The original 800,000,000 * 4,898 comparisons could easily cost 50~60 mins; after applying this patch, it takes only 9s.

SparkQA · 2020-07-15T23:43:19Z

Test build #125884 has finished for PR 29104 at commit 40d7174.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

leanken-zz · 2020-07-16T01:20:47Z

@maropu Any further comments?

maropu · 2020-07-16T01:53:20Z

But this optimize is only targeting on NotInSubquery with single column case.

One question; why did you apply this optimization only in the case? The optimization itself looks more general though.

leanken-zz · 2020-07-16T02:07:15Z

But this optimize is only targeting on NotInSubquery with single column case.

One question; why did you apply this optimization only in the case? The optimization itself looks more general though.

you mean extend it to support multi columns?

maropu · 2020-07-16T02:10:15Z

yea, we cannot handle the case?

leanken-zz · 2020-07-16T02:15:08Z

-- Test cases for multi-column ``WHERE a NOT IN (SELECT c FROM r ...)'':
-- | # | does subquery include null? | do filter columns contain null? | a = c? | b = d? | row included in result? |
-- | 1 | empty | * | * | * | yes |
-- | 2 | 1+ row has null for all columns | * | * | * | no |
-- | 3 | no row has null for all columns | (yes, yes) | * | * | no |
-- | 4 | no row has null for all columns | (no, yes) | yes | * | no |
-- | 5 | no row has null for all columns | (no, yes) | no | * | yes |
-- | 6 | no | (no, no) | yes | yes | no |
-- | 7 | no | (no, no) | _ | _ | yes |

Multi-column Not(IsNull) is much more complicated. I am afraid that the lookup code and if-else logic will be unreadable.
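
To make the table above concrete, here is a hedged sketch of two-column NOT IN under SQL three-valued logic, in plain Scala. It assumes `Option[Int]` values instead of the decimals in the test file, and the names `tupleEq` and `notIn` are hypothetical.

```scala
object MultiColumnNotInSketch {
  type Row = Seq[Option[Int]]

  // Three-valued tuple equality:
  //   Some(false) if any column pair is definitely unequal,
  //   None (unknown) if a NULL prevents a decision,
  //   Some(true) if every column pair is equal.
  def tupleEq(l: Row, r: Row): Option[Boolean] = {
    val cols = l.zip(r).map { case (a, b) => for (x <- a; y <- b) yield x == y }
    if (cols.contains(Some(false))) Some(false)
    else if (cols.contains(None)) None
    else Some(true)
  }

  // A row passes NOT IN only if it is definitely unequal to every subquery row.
  def notIn(row: Row, subquery: Seq[Row]): Boolean =
    subquery.forall(r => tupleEq(row, r) == Some(false))
}
```

Note that `notIn` still scans the whole subquery result for each row; the sketch shows the semantics, not an optimized lookup.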

leanken-zz · 2020-07-16T02:29:24Z

let me take some time to find out common pattern among single and multi column support.

maropu · 2020-07-16T02:41:23Z

hm, it might be okay to support the limited optimization as a first step if it has a huge impact on the performance of common cases. But, I think the method (& parameter) names should be more general and we need to leave a TODO for future work.

leanken-zz · 2020-07-16T02:52:46Z

For example.
-- Case 4
-- (one column null, other column matches a row in the subquery result -> row not returned)
SELECT *
FROM m
WHERE b = 1.0 -- Matches (null, 1.0)
AND (a, b) NOT IN (SELECT *
FROM s
WHERE c IS NOT NULL) -- Matches (0, 1.0), (2, 3.0), (4, null)
;

In this case, I can't use InternalRow(null, 1.0) to look up in the HashedRelation. I need to exclude all null columns and try to find a match within the non-null columns, so I think HashedRelation is not a suitable structure for multi-column support. If we go multi-column and need to deal with null columns, I can't use a hash lookup, so it will still be M*N, and that's not going to help.

leanken-zz · 2020-07-16T03:02:27Z

let's say in streamedSide there is a record
(null, 1, null)
and buildSide is
(1, 2, 3)
(1, 1, 3)
(null, 1, 3)

If I need to confirm a Not In, I need to extract the second column values and build a HashSet; but what if the next streamedSide record is
(1, null, null)

In simple words, I could not build a "CUBE-LIKE" HashSet for all column combinations; or I can just fall back to comparing records against the BuildSide row by row, which is still M*N.

So I think multiple columns are not suitable for this hash optimization because of the null-safe complexity.
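
The problem described above can be sketched as follows: the columns a streamed row can be compared on depend on where its NULLs are, so each NULL pattern needs its own projected key set. This is an illustrative sketch only; `nonNullPositions` and `projectedKeys` are hypothetical helpers, and build rows holding a NULL at a projected position are simply skipped to keep it short.

```scala
object NullPatternSketch {
  type Row = Seq[Option[Int]]

  // Column indices on which a streamed row can actually be compared.
  def nonNullPositions(row: Row): Seq[Int] =
    row.zipWithIndex.collect { case (Some(_), i) => i }

  // A key set over the build side restricted to the given positions.
  // Simplification: build rows with a NULL at one of these positions are ignored.
  def projectedKeys(build: Seq[Row], positions: Seq[Int]): Set[Seq[Int]] =
    build.flatMap { r =>
      val vals = positions.map(i => r(i))
      if (vals.forall(_.isDefined)) Some(vals.flatten) else None
    }.toSet
}
```

With the records from the comment above, `(null, 1, null)` needs a set keyed on the second column while `(1, null, null)` needs one keyed on the first, and with k columns there are up to 2^k such patterns.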

leanken-zz · 2020-07-16T03:03:56Z

hm, it might be okay to support the limited optimization as a first step if it has a huge impact on the performance of common cases. But, I think the method (& parameter) names should be more general and we need to leave a TODO for future work.

I also think these method and parameter names are too long. If you have better suggestions, that would be a great help.

leanken-zz · 2020-07-16T04:01:47Z

ping @maropu on the multi column support conclusion.

agrawaldevesh · 2020-07-16T17:08:09Z

I don't see a unit test in this PR. Can you please add one. Thanks.

agrawaldevesh

This is pretty neat and it would make Spark look pretty cool on TPCH. Thanks for taking this up.

But Please add some Unit Tests ! Another way to test this "Exhaustively" would be to have a config to force this optimization and then run it through the existing Not in test suites, which already do a fairly good job. But I think you would have to copy that test perhaps to make sure it runs fully with this "forced config" enabled.

As for the general design of this optimization, I feel a bit uncomfortable doing this check at "Runtime". An alternative design would be to do this check somehow at the optimizer / compile time and then set a flag in the regular BroadcastHashJoin that it should now be null aware. I am wondering if you considered that strategy ? It might be a better UX for the user: The explain plan and spark UI would be more faithful.

On that note ? Is it worth somehow communicating to the user that their BroadcastNLJ was "accelerated" using this approach ? Do you want to up-level that as a metric etc such that it can show up in the Spark UI. Not sure if it is worth the plumbing.

Thanks.

agrawaldevesh · 2020-07-16T17:10:18Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+      // BuildSide must be single column, condition must be the following pattern
+      // Or(EqualTo(a, b), IsNull(EqualTo(a, b)))
+      condition.get match {
+        case _ @ Or(


I believe you can write this more simply as:

Or(EqualTo(leftAttr, rightAttr), IsNull(EqualTo(tmpLeft, tmpRight))) if ...

I believe you don't need the dashes.

agrawaldevesh · 2020-07-16T17:12:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+              AttributeSeq(left.output))
+          )
+        streamedIter.filter(row => {
+          // See. not-in-unit-tests-single-column.sql for detail filter rules


Lets not refer to test code for describing production code :-)

agrawaldevesh · 2020-07-16T17:15:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+    buildConf("spark.sql.notInSubquery.singleColumn.optimize.enabled")
+      .internal()
+      .doc("When true, single column not in subquery execution in BroadcastNestedLoopJoinExec " +
+        "will be optimized from M*N calculation into M*log(N) calculation using HashMap lookup " +


N00b/dumb question: Why M*log(N) instead of M * 1? Shouldn't the HT probe lookup be O(1) and not O(log N)?

agrawaldevesh · 2020-07-16T17:19:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

+      .booleanConf
+      .createWithDefault(false)
+
+  val NOT_IN_SUBQUERY_SINGLE_COLUMN_OPTIMIZE_ROW_COUNT_THRESHOLD =


What should be the relationship of this config vs spark.sql.autoBroadcastJoinThreshold? Shouldn't the threshold be based on the size of the build side in bytes rather than the number of rows? I think it is confusing to have two configs for a similar sort of information: how big the table can be.

Done. Removed this config; spark.sql.autoBroadcastJoinThreshold is used instead.

agrawaldevesh · 2020-07-16T17:21:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+      isBuildRowsEmpty: Boolean)
+
+  private def notInSubquerySingleColumnOptimizeEnabled: Boolean = {
+    if (SQLConf.get.notInSubquerySingleColumnOptimizeEnabled && right.output.length == 1) {


Dumb question: Should left.output.length be checked as well ?

I believe not everyone would know of the nuances with multi-column Null Aware anti join (see http://www.vldb.org/pvldb/vol2/vldb09-423.pdf section 6.1 and 6.2), so it would be nice to mention at least that multi-column is not being handled because it is insanely complicated.

left.output.length could be more than 1.

agrawaldevesh · 2020-07-16T17:37:14Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+          )
+        streamedIter.filter(row => {
+          // See. not-in-unit-tests-single-column.sql for detail filter rules
+          if (params.isBuildRowsEmpty) {


I believe you can pull this check out .. No point in going through a "filter" that is unconditionally true.

agrawaldevesh · 2020-07-16T17:38:53Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+            val lookupRow: UnsafeRow = keyGenerator(row)
+            val notInKeyEqual = params.buildSideHashedRelation.get(lookupRow) match {
+              case null => false
+              case _ => true


Can this be simplified to params.buildSideHashedRelation.get(lookupRow) != null ?

agrawaldevesh · 2020-07-16T17:42:43Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+        BindReferences.bindReferences[Expression](
+          Seq(right.output.head), AttributeSeq(right.output)),
+        buildRows.length),
+      buildRows.exists(row => row.isNullAt(0)),


dumb question: row is guaranteed to have only a single column, right ?

yes. it is.

agrawaldevesh · 2020-07-16T17:44:19Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+              case _ => true
+            }
+
+            if (!streamedRowIsNull && !params.isNullExists && !notInKeyEqual) {


The check for isNullExists can also be pulled out: If isNullExists, then we will unconditionally return nothing.

You have already paid the one-time cost of doing a scan on the build side to check if any nulls exist (when preparing the params), you might as well exploit that check to gain some speed here :-)

agrawaldevesh · 2020-07-16T17:45:58Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastNestedLoopJoinExec.scala

+          .notInSubquerySingleColumnOptimizeRowCountThreshold}, fallback to leftExistenceJoin.")
+      leftExistenceJoin(relation, false)
+    } else {
+      val params = notInSubquerySingleColumnOptimizeBuildParams(buildRows)


This Null aware single column hash join is subtle and could use some comments or perhaps a reference. Perhaps you could link to section 6.1 of that paper above ?

done. with TODO and comment left.

leanken-zz · 2020-07-16T23:24:55Z

@agrawaldevesh thanks for your feedback, I will first consider your suggestion about doing it in optimizer. it might take some time.

leanken-zz · 2020-07-17T00:19:50Z

Hi. @agrawaldevesh
I am afraid that putting the optimization into BroadcastHashJoinExec is not that easy.
Right now, I've got
BroadcastNestedLoopJoinExec(LeftAnti with condition Or(EqualTo(a=b), IsNull(EqualTo(a=b))))

If I want to translate it into BroadcastHashJoinExec, first of all I need a join key, right?
BroadcastHashJoinExec(LeftAnti joinKey(a=b), with condition)
But the EquiJoinKeys themselves already break the integrity of the original condition Or(EqualTo(a=b), IsNull(EqualTo(a=b)))

Let's see what codegenAnti is like:

s"""
         |boolean $found = false;
         |// generate join key for stream side
         |${keyEv.code}
         |// Check if the key has nulls.
         |if (!($anyNull)) {
         |  // Check if the HashedRelation exists.
         |  UnsafeRow $matched = (UnsafeRow)$relationTerm.getValue(${keyEv.value});
         |  if ($matched != null) {
         |    // Evaluate the condition.
         |    $checkCondition {
         |      $found = true;
         |    }
         |  }
         |}
         |if (!$found) {
         |  $numOutput.add(1);
         |  ${consume(ctx, input)}
         |}
       """.stripMargin

Anti join with a key keeps the streamedSide row if the streamedSide key is null, but it's the total opposite in NotInSubquery. I can certainly do some if-else checks here, but it might mess up the whole BroadcastHashJoinExec code.
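
The difference on NULL streamed keys can be sketched in plain Scala. This mirrors the control flow of the codegenAnti template quoted above (without an extra join condition) next to a null-aware variant; both function names are hypothetical and the build side is assumed to be non-empty and free of NULL keys.

```scala
object AntiJoinNullKeySketch {
  // Regular LeftAnti hash join: a NULL key skips the lookup, so nothing is
  // "found" and the streamed row is kept.
  def regularAntiKeeps(key: Option[Int], buildKeys: Set[Int]): Boolean = {
    val found = key.exists(buildKeys.contains)
    !found
  }

  // Null-aware anti join: a NULL key makes the comparison unknown,
  // so the streamed row is dropped.
  def nullAwareAntiKeeps(key: Option[Int], buildKeys: Set[Int]): Boolean =
    key.exists(k => !buildKeys.contains(k))
}
```

The two functions agree on every non-NULL key and differ only for `None`.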

Besides the streamedSide null key difference, we need to go through the entire buildSide to see whether a null key exists, which is also kind of weird.

BroadcastHashJoinExec assumes that it has a join key, but if I apply my NotInSubquery check here, it would be like: hey, I found two keys that should be joined, but wait a minute, there is a tiny corner case here, so back off.

If it's up to me to choose, I won't break the integrity of BroadcastHashJoinExec; I would rather count NotInSubquerySingleColumn as a runtime optimization.

So, I am pulling out the relevant information for you guys, seeking advice before I move forward to the next step.

Option A.
Count NotInSubquerySingleColumn as a runtime optimization.

Option B.
Move the code into BroadcastHashJoinExec, but the codegen looks tricky.

Looking forward to your reply, many many thanks.

sub commit
1. change spark.sql.nullAwareAntiJoin.optimize.enabled => spark.sql.optimizeNullAwareAntiJoin
2. add assertion for isNullAware
3. update CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin
4. code style refined.
Change-Id: I871fe95664e233908bb39b63444e73c4a24126c0

typo. Change-Id: Id5db52227468cb4b22bad1923eab85bd6ce6fb5d

Change-Id: Icbf28bdbee90de6b09172ab4c495383002f340a4

sub commit
1. change EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys to singleton object
2. change default implementation of NullAwareHashedRelation to throw UnsupportedOperationException
Change-Id: I173ce102bb704677699b89daa1c9906f748c94aa

agrawaldevesh · 2020-07-28T00:43:08Z

sql/core/src/test/resources/sql-tests/inputs/group-by-filter.sql

+--CONFIG_DIM1 spark.sql.optimizeNullAwareAntiJoin=true
+--CONFIG_DIM1 spark.sql.optimizeNullAwareAntiJoin=false
+


Thanks for adding these. It gives us more confidence.

thanks to @cloud-fan I know of this better way to do e2e case coverage when adding a new feature.

SparkQA · 2020-07-28T03:21:15Z

Test build #126672 has finished for PR 29104 at commit 233eff6.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-07-28T04:42:12Z

github action passes, I'm merging it to master, thanks for your great work!

MaxGekk · 2020-08-03T11:07:49Z

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala

      .checkValue(_ >= 0, "The value must be non-negative.")
      .createWithDefault(8)

+  val OPTIMIZE_NULL_AWARE_ANTI_JOIN =


You forgot to add version(). Here is the follow up PR #29335

viirya · 2020-08-18T23:35:48Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala

+object EmptyHashedRelationWithAllNullKeys extends NullAwareHashedRelation {
+  override def asReadOnlyCopy(): EmptyHashedRelationWithAllNullKeys.type = this


This object name really confuses. EmptyHashedRelation is from empty input, and EmptyHashedRelationWithAllNullKeys is from non-empty input.

Yes, indeed, but I can't come up with better naming. Could you please help with the naming? I will create a new PR for the code refinement, since this PR is closed.

probably just remove Empty to make it HashedRelationWithAllNullKeys?

viirya · 2020-08-18T23:44:38Z

sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala

+        return s"""
+                  |boolean $found = false;
+                  |// generate join key for stream side
+                  |${keyEv.code}
+                  |if ($anyNull) {
+                  |  $found = true;
+                  |} else {
+                  |  UnsafeRow $matched = (UnsafeRow)$relationTerm.getValue(${keyEv.value});
+                  |  if ($matched != null) {
+                  |    $found = true;
+                  |  }
+                  |}
+                  |
+                  |if (!$found) {
+                  |  $numOutput.add(1);
+                  |  ${consume(ctx, input)}
+                  |}


Seems we can get rid of the found variable and move these two lines into the if/else above. found also looks incorrect semantically: anyNull being true doesn't mean we found a matched row.

how about

s"""
   |// generate join key for stream side
   |${keyEv.code}
   |if (!$anyNull && $relationTerm.getValue(${keyEv.value}) == null) {
   |  $numOutput.add(1);
   |  ${consume(ctx, input)}
   |}
 """.stripMargin

Maybe I could update this code as well with the new HashedRelation name in the next PR.
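
A quick way to convince oneself that the proposed simplification keeps the same rows is to model both control flows as boolean functions. This is a plain Scala sketch, not generated code; `keepWithFound` and `keepDirect` are hypothetical names, and `matched` is passed by name so the lookup is skipped for NULL keys in both versions.

```scala
object FoundVsDirectSketch {
  // Control flow of the template under review: `found` is set both for
  // NULL keys and for real matches, and the row is emitted when it is unset.
  def keepWithFound(anyNull: Boolean, matched: => Boolean): Boolean = {
    var found = false
    if (anyNull) {
      found = true
    } else if (matched) {
      found = true
    }
    !found
  }

  // Proposed form: emit the row directly when the key is non-NULL and unmatched.
  def keepDirect(anyNull: Boolean, matched: => Boolean): Boolean =
    !anyNull && !matched
}
```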

…join

### What changes were proposed in this pull request?

NULL-aware ANTI join (https://issues.apache.org/jira/browse/SPARK-32290) detects NULL join keys during building the map for `HashedRelation`, and will immediately return `HashedRelationWithAllNullKeys` without taking care of the map built already. Before returning `HashedRelationWithAllNullKeys`, the map needs to be freed properly to save memory and keep memory accounting correctly.

### Why are the changes needed?

Save memory and keep memory accounting correctly for the join query.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests introduced in #29104.

Closes #32939 from c21/free-null-aware.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…join

### What changes were proposed in this pull request?

NULL-aware ANTI join (https://issues.apache.org/jira/browse/SPARK-32290) detects NULL join keys during building the map for `HashedRelation`, and will immediately return `HashedRelationWithAllNullKeys` without taking care of the map built already. Before returning `HashedRelationWithAllNullKeys`, the map needs to be freed properly to save memory and keep memory accounting correctly.

### Why are the changes needed?

Save memory and keep memory accounting correctly for the join query.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests introduced in #29104.

Closes #32939 from c21/free-null-aware.

Authored-by: Cheng Su <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit e0d81d9)
Signed-off-by: Wenchen Fan <[email protected]>

…degen is disabled

### What changes were proposed in this pull request?

BHJ LeftAnti does not update numOutputRows when codegen is disabled.

### Why are the changes needed?

PR #29104 only updates numOutputRows when codegen is enabled; when codegen is disabled, numOutputRows is not updated and stays at 0.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

add UT

Closes #38489 from cxzl25/SPARK-41003.

Authored-by: sychen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

### What changes were proposed in this pull request?

Run `NullPropagation` after NOT IN subquery rewrite.

### Why are the changes needed?

NOT IN subqueries like `SELECT * FROM t1 WHERE c NOT IN (SELECT c FROM t2)` are rewritten as left anti join `t1.c = t2.c` with additional `OR IsNull(t1.c = t2.c)` conditions, which prevent equi-join implementations from being used, so those joins end up as `BroadcastNestedLoopJoin`. When we know the columns can't be null, we can either drop those additional conditions during subquery rewrite or call `NullPropagation` after the rewrite to simplify them to `false`. This PR contains the latter. Please note that #29104 already optimized the single-column NOT IN subqueries from `BroadcastNestedLoopJoin` to "null aware" `BroadcastHashJoin` very well, but when the columns are not nullable we can optimize multi-column cases as well and the join doesn't need to be "null aware".

### Does this PR introduce _any_ user-facing change?

Yes, performance improvement.

### How was this patch tested?

New UTs were added and some existing tests were adjusted to keep their validity.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53733 from peter-toth/SPARK-54972-improve-not-in-with-non-nullables.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Peter Toth <[email protected]>

probot-autolabeler bot added the SQL label Jul 14, 2020

maropu reviewed Jul 15, 2020

agrawaldevesh reviewed Jul 16, 2020

leanken-zz added 4 commits July 28, 2020 06:06

[SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize

7824c45

typo. Change-Id: Id5db52227468cb4b22bad1923eab85bd6ce6fb5d

comment refine.

5e050c4

Change-Id: Icbf28bdbee90de6b09172ab4c495383002f340a4

leanken-zz force-pushed the leanken-SPARK-32290 branch from 395823d to 233eff6 Compare July 27, 2020 22:06

agrawaldevesh reviewed Jul 28, 2020

cloud-fan closed this in 12b9787 Jul 28, 2020

leanken-zz deleted the leanken-SPARK-32290 branch July 28, 2020 08:38

MaxGekk reviewed Aug 3, 2020

agrawaldevesh mentioned this pull request Aug 8, 2020

[SPARK-32399][SQL] Full outer shuffled hash join #29342

Closed

viirya reviewed Aug 18, 2020

c21 mentioned this pull request Jun 16, 2021

[SPARK-35791][SQL] Release on-going map properly for NULL-aware ANTI join #32939

Closed

This was referenced Aug 30, 2022

Rules of anti join in Spark can be not aligned with query engines' behavior apache/incubator-gluten#341

Closed

The behaviour of not in and left anti join is different from PostgreSQL ClickHouse/ClickHouse#40788

Open

cxzl25 mentioned this pull request Nov 3, 2022

[SPARK-41003][SQL] BHJ LeftAnti does not update numOutputRows when codegen is disabled #38489

Closed

peter-toth mentioned this pull request Jan 8, 2026

[SPARK-54972][SQL] Improve NOT IN subqueries with non-nullable columns #53733

Closed
