[SPARK-45755][SQL] Improve Dataset.isEmpty() by applying global limit 1 (apache#251)

wangyum · GitHub Enterprise · commit 9cdeeb80b8a5 · 2024-03-05T01:14:00.000-06:00
### What changes were proposed in this pull request?

This PR makes `Dataset.isEmpty()` apply a global limit of 1 before executing. `LimitPushDown` may then push the global limit 1 down to lower nodes to improve query performance.

Note that a global limit 1 is used here because a local limit cannot be pushed down in the group-only case: https://github.com/apache/spark/blob/89ca8b6065e9f690a492c778262080741d50d94d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala#L766-L770

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual testing:

```scala
spark.range(300000000).selectExpr("id", "array(id, id % 10, id % 100) as eo").write.saveAsTable("t1")
spark.range(100000000).selectExpr("id", "array(id, id % 10, id % 1000) as eo").write.saveAsTable("t2")
println(spark.sql("SELECT * FROM t1 LATERAL VIEW explode_outer(eo) AS e UNION SELECT * FROM t2 LATERAL VIEW explode_outer(eo) AS e").isEmpty)
```

Before this PR | After this PR
-- | --
<img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/417adc05-4160-4470-b63c-125faac08c9c"> | <img width="430" alt="image" src="https://github.com/apache/spark/assets/5399861/bdeff231-e725-4c55-9da2-1b4cd59ec8c8">

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#43617 from wangyum/SPARK-45755.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: Jiaan Geng <beliefer@163.com>
(cherry picked from commit c7bba9b)
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala b/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala
@@ -651,7 +651,7 @@ class Dataset[T] private[sql](
    * @group basic
    * @since 2.4.0
    */
-  def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
+  def isEmpty: Boolean = withAction("isEmpty", select().limit(1).queryExecution) { plan =>
     plan.executeTake(1).isEmpty
   }
 

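One way to see the effect of the change is to compare the optimized logical plans of `select()` and `select().limit(1)` on the same Dataset. The following is a minimal sketch, not part of the patch: the object name `IsEmptyLimitSketch`, the helper `hasGlobalLimit`, the local `SparkSession` settings, and the small `range`/`union` input are all assumptions chosen for illustration, and the exact plan shape may differ between Spark versions.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.catalyst.plans.logical.{GlobalLimit, LogicalPlan}

// Hypothetical helper object for illustration; not part of Spark or of this patch.
object IsEmptyLimitSketch {

  // True if the plan contains a GlobalLimit node anywhere in its tree.
  def hasGlobalLimit(plan: LogicalPlan): Boolean =
    plan.collect { case g: GlobalLimit => g }.nonEmpty

  // Optimized plan used by the old isEmpty: no limit in the logical plan.
  def planBefore(df: DataFrame): LogicalPlan =
    df.select().queryExecution.optimizedPlan

  // Optimized plan used by the new isEmpty: a limit of 1 is added up front,
  // which gives optimizer rules such as LimitPushDown something to work with.
  def planAfter(df: DataFrame): LogicalPlan =
    df.select().limit(1).queryExecution.optimizedPlan

  def main(args: Array[String]): Unit = {
    // Assumed local session settings, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("isEmpty-limit-sketch")
      .getOrCreate()

    // Small stand-in input; the PR's manual test uses much larger tables.
    val df = spark.range(1000).union(spark.range(1000)).toDF()

    println(planBefore(df).treeString)
    println(planAfter(df).treeString)

    spark.stop()
  }
}
```

Printing both plans side by side shows where, if anywhere, the optimizer places limit nodes below the top-level limit for a given query shape.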