
[SPARK-28595][SQL] explain should not trigger partition listing#25328

Closed
cloud-fan wants to merge 4 commits into apache:master from cloud-fan:ui

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

Sometimes when you explain a query, you get stuck for a while. What's worse, you get stuck again if you explain again.

This is caused by FileSourceScanExec:

  1. In its toString, it needs to report the number of partitions it reads. This needs to query the hive metastore.
  2. In its outputOrdering, it needs to get all the files. This needs to query the hive metastore.

This PR fixes this by:

  1. toString no longer reports the number of partitions it reads. We should report it via SQL metrics instead.
  2. The outputOrdering is not very useful. We can only apply it if a) all the bucket columns are read and b) there is only one file in each bucket. This condition is very hard to meet, and even when we do meet it, sorting an already-sorted file is fast, so avoiding the sort does not gain much. I think it's worth giving up this optimization so that explain doesn't get stuck.
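The fix pattern in item 1 can be sketched as follows. This is a minimal sketch with invented names (`ExplainSketch`, `FileScanNode` are not Spark's actual classes): keep the expensive partition listing lazy so that producing the plan string (what explain prints) never forces a metastore lookup, and only actual execution does.

```scala
// Sketch of the fix pattern; the counter stands in for Hive metastore queries.
object ExplainSketch {
  var metastoreCalls = 0

  class FileScanNode(table: String) {
    // Partition listing is deferred until execution actually needs it.
    lazy val partitions: Seq[String] = {
      metastoreCalls += 1 // stands in for a Hive metastore query
      Seq("j=1", "j=2")
    }

    // Before the fix, the plan string embedded the partition count,
    // which forced the listing during explain. After the fix, toString
    // omits it; the count is reported via SQL metrics at run time.
    override def toString: String = s"FileScan $table"
  }
}
```

Calling `toString` on a node leaves `metastoreCalls` untouched; only forcing `partitions` (as execution would) increments it.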

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @hvanhovell @maryannxue @viirya

@dongjoon-hyun (Member)

I agree that it was very hard to meet the condition. BTW, IIRC, the main reason for that optimization was to get the same result as Hive for LIMIT queries that didn't have ORDER BY.

> I think it's worth giving up this optimization so that explain doesn't get stuck.


SparkQA commented Aug 1, 2019

Test build #108525 has finished for PR 25328 at commit fb8793d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

I don't think a system needs to guarantee the output order of a SQL query without ORDER BY. But let me add a legacy config to keep this optimization, just in case. What do you think, @dongjoon-hyun?
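The legacy-flag idea can be sketched as below. Names here are invented for illustration; the actual config this PR adds is spark.sql.legacy.bucketedTableScan.outputOrdering (default false), and the two preconditions are the ones from the PR description.

```scala
// Hedged sketch of gating the optimization behind a legacy flag.
object LegacyFlagSketch {
  // Mirrors a default-off legacy config.
  final case class ScanConf(legacyBucketedScanOrdering: Boolean = false)

  // The scan claims sorted output only when the flag is on AND both
  // rare preconditions hold; otherwise a sort is planned, which is
  // cheap on already-sorted data.
  def claimsSortedOutput(conf: ScanConf,
                         allBucketColumnsRead: Boolean,
                         oneFilePerBucket: Boolean): Boolean =
    conf.legacyBucketedScanOrdering && allBucketColumnsRead && oneFilePerBucket
}
```

With the flag at its default, the optimization is never applied, so building the plan (and thus explain) never needs the file listing that checks precondition b).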


SparkQA commented Aug 2, 2019

Test build #108571 has finished for PR 25328 at commit fa763eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2019

Test build #108688 has finished for PR 25328 at commit 0652c22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please


SparkQA commented Aug 6, 2019

Test build #108699 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2019

Test build #108700 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

protected override def afterAll(): Unit = {
  spark.sessionState.conf.unsetConf(SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING)
}
(Contributor)

Should we do a "store and recover the old conf" instead?

@cloud-fan (Contributor, Author)

In the test, we assume that every test suite keeps the shared SparkSession clean after its tests run. So the old conf should be the default conf here, and we only need to call unsetConf to restore the default.

@maryannxue (Contributor)

sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
spark.table("t").explain()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
@gatorsmile (Member) commented Aug 7, 2019

Add a test case that returns a non-zero count when spark.sql.legacy.bucketedTableScan.outputOrdering is set to true?

@gatorsmile (Member)

LGTM except one comment.


SparkQA commented Aug 7, 2019

Test build #108745 has finished for PR 25328 at commit 264a259.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 7, 2019

Test build #108747 has finished for PR 25328 at commit bf9b261.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please


SparkQA commented Aug 7, 2019

Test build #108753 has finished for PR 25328 at commit bf9b261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 469423f Aug 7, 2019
@natangsalgia

@cloud-fan :

> 2. The outputOrdering is not very useful. We can only apply it if a) all the bucket columns are read and b) there is only one file in each bucket. This condition is very hard to meet, and even when we do meet it, sorting an already-sorted file is fast, so avoiding the sort does not gain much. I think it's worth giving up this optimization so that explain doesn't get stuck.

We see cases where sorting pre-sorted data adds 30+ minutes of runtime when reading terabytes of data. There was a similar discussion on the mailing list this August [1].

Are there plans to remove this config? That could break Spark users with large datasets who benefit from this optimization.
