
[SPARK-28595][SQL] explain should not trigger partition listing#25328

Closed
cloud-fan wants to merge 4 commits into apache:master from cloud-fan:ui

Conversation

@cloud-fan (Contributor)

What changes were proposed in this pull request?

Sometimes when you explain a query, you get stuck for a while. What's worse, you get stuck again if you explain again.

This is caused by FileSourceScanExec:

  1. In its toString, it needs to report the number of partitions it reads. This needs to query the hive metastore.
  2. In its outputOrdering, it needs to get all the files. This needs to query the hive metastore.

This PR fixes this by:

  1. toString no longer reports the number of partitions it reads. We should report it via SQL metrics instead.
  2. The outputOrdering is not very useful. We can only apply it if a) all the bucket columns are read and b) there is only one file in each bucket. This condition is very hard to meet, and even when we do meet it, sorting an already-sorted file is fast, so avoiding the sort does not gain much. I think it's worth giving up this optimization so that explain doesn't get stuck.
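The fix pattern in item 1 can be sketched as follows. This is a minimal sketch with invented names (`ExplainSketch`, `FileScanNode` are not Spark's actual classes): keep the expensive partition listing lazy so that producing the plan string (what explain prints) never forces a metastore lookup, and only actual execution does.

```scala
// Sketch of the fix pattern; the counter stands in for Hive metastore queries.
object ExplainSketch {
  var metastoreCalls = 0

  class FileScanNode(table: String) {
    // Partition listing is deferred until execution actually needs it.
    lazy val partitions: Seq[String] = {
      metastoreCalls += 1 // stands in for a Hive metastore query
      Seq("j=1", "j=2")
    }

    // Before the fix, the plan string embedded the partition count,
    // which forced the listing during explain. After the fix, toString
    // omits it; the count is reported via SQL metrics at run time.
    override def toString: String = s"FileScan $table"
  }
}
```

Calling `toString` on a node leaves `metastoreCalls` untouched; only forcing `partitions` (as execution would) increments it.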

How was this patch tested?

Existing tests.

@cloud-fan (Contributor, Author)

cc @hvanhovell @maryannxue @viirya

@dongjoon-hyun (Member)

I agree that it was very hard to meet the condition. BTW, IIRC, the main reason for that optimization was to get the same result as Hive for LIMIT queries that didn't have ORDER BY.

> I think it's worth giving up this optimization so that explain doesn't get stuck.


SparkQA commented Aug 1, 2019

Test build #108525 has finished for PR 25328 at commit fb8793d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

I don't think a system needs to guarantee the output order of a SQL query without ORDER BY. But let me add a legacy config to keep this optimization, just in case. What do you think, @dongjoon-hyun?
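The legacy-flag idea can be sketched as below. Names here are invented for illustration; the actual config this PR adds is spark.sql.legacy.bucketedTableScan.outputOrdering (default false), and the two preconditions are the ones from the PR description.

```scala
// Hedged sketch of gating the optimization behind a legacy flag.
object LegacyFlagSketch {
  // Mirrors a default-off legacy config.
  final case class ScanConf(legacyBucketedScanOrdering: Boolean = false)

  // The scan claims sorted output only when the flag is on AND both
  // rare preconditions hold; otherwise a sort is planned, which is
  // cheap on already-sorted data.
  def claimsSortedOutput(conf: ScanConf,
                         allBucketColumnsRead: Boolean,
                         oneFilePerBucket: Boolean): Boolean =
    conf.legacyBucketedScanOrdering && allBucketColumnsRead && oneFilePerBucket
}
```

With the flag at its default, the optimization is never applied, so building the plan (and thus explain) never needs the file listing that checks precondition b).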


SparkQA commented Aug 2, 2019

Test build #108571 has finished for PR 25328 at commit fa763eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2019

Test build #108688 has finished for PR 25328 at commit 0652c22.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile (Member)

retest this please


SparkQA commented Aug 6, 2019

Test build #108699 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 6, 2019

Test build #108700 has finished for PR 25328 at commit 0652c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

protected override def afterAll(): Unit = {
  spark.sessionState.conf.unsetConf(SQLConf.LEGACY_BUCKETED_TABLE_SCAN_OUTPUT_ORDERING)
}
(Contributor)

Should we do a "store and recover the old conf" instead?

@cloud-fan (Contributor, Author)

In the test, we assume that every test suite keeps the shared SparkSession clean after its tests run. So the old conf should be the default conf here, and we only need to call unsetConf to restore the default.

@maryannxue (Contributor)

sql("CREATE TABLE t USING json PARTITIONED BY (j) AS SELECT 1 i, 2 j")
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
spark.table("t").explain()
assert(HiveCatalogMetrics.METRIC_PARTITIONS_FETCHED.getCount == 0)
@gatorsmile (Member) commented Aug 7, 2019

Add a test case that returns a non-zero count when spark.sql.legacy.bucketedTableScan.outputOrdering is set to true?

@gatorsmile (Member)

LGTM except one comment.


SparkQA commented Aug 7, 2019

Test build #108745 has finished for PR 25328 at commit 264a259.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Aug 7, 2019

Test build #108747 has finished for PR 25328 at commit bf9b261.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

retest this please


SparkQA commented Aug 7, 2019

Test build #108753 has finished for PR 25328 at commit bf9b261.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor, Author)

thanks for the review, merging to master!

@cloud-fan cloud-fan closed this in 469423f Aug 7, 2019
@natangsalgia

@cloud-fan :

> 2. The outputOrdering is not very useful. We can only apply it if a) all the bucket columns are read and b) there is only one file in each bucket. This condition is very hard to meet, and even when we do meet it, sorting an already-sorted file is fast, so avoiding the sort does not gain much. I think it's worth giving up this optimization so that explain doesn't get stuck.

We see cases where sorting pre-sorted data adds 30+ minutes of runtime when reading terabytes of data. There was a similar discussion on the mailing list this August [1].

Are there plans to remove this config? That could break Spark users with large datasets who benefit from this optimization.
