[SPARK-15365] [SQL]: When table size statistics are not available from metastore, we should fallback to HDFS #13150
Conversation
(I remember I was told that chaining getOrElse is not preferred because it is confusing)
What alternative was chosen as more readable/less confusing?
I hope this PR is helpful: #12256
Yeah, there is nothing wrong with getOrElse, but chaining a lot of them together (along with Option.filter) can get really confusing. While you are at it, it'd be better to fix this: expanding it into longer (and maybe more imperative-style) code could make it less confusing.
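For context, the kind of expression under discussion looks roughly like the following; this is a reconstruction for illustration (pieced together from the fragments quoted later in this thread), not the exact code in the PR:

```scala
// Illustrative only: a nested Option chain of the style being criticized.
// totalSize and rawDataSize stand for the Hive table properties consulted
// here; defaultSizeInBytes is the conservative fallback estimate.
val sizeInBytes: Long =
  Option(totalSize).map(_.toLong).filter(_ > 0)
    .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
      .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes))
```

The rewrite suggested further down replaces this nesting with a flat if/else if chain.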
@rxin I have replaced the chained calls with a conventional if/else if.
@Parth-Brahmbhatt this looks pretty good. However, given that hitting the underlying filesystem directly can incur a lot of latency (especially in the case of S3), can you please put this change behind a conf flag (with a comment about the potential performance issues)? Additionally, it might be nice to set the conf to false by default to prevent silent regressions for existing queries (especially if we're targeting this for 2.0).
@sameeragarwal Added a config option. @rxin can you take a look one more time?
…m metastore, fall back to HDFS.
…of stats or not. Default is false.
@sameeragarwal @rxin FYI, GitHub currently has some latency issues, so you probably can't see the updates.
Force-pushed from 04e013d to f5d5dde.
val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
  SQLConfigBuilder("spark.sql.enableFallBackToHdfsForStats")
    .doc("If the table statistics are not available from table metadata enable fall back to hdfs" +
nit: missing period after hdfs
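For reference, the complete entry presumably looks something like the following; the builder pattern matches the fragment in the diff above, the default of false was requested earlier in the thread, and the doc text is paraphrased (with the period from the nit applied). This is a sketch, not a verbatim copy of the PR:

```scala
// Sketch of the config entry under review: key and default come from this
// thread; the doc string wording is abbreviated for illustration.
val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
  SQLConfigBuilder("spark.sql.enableFallBackToHdfsForStats")
    .doc("If the table statistics are not available from table metadata, enable falling back " +
      "to HDFS to estimate the table size.")
    .booleanConf
    .createWithDefault(false)
```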
LGTM
jenkins test this please
Test build #59215 has finished for PR 13150 at commit
      .getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
  // if the size is still less than zero, we try to get the file size from HDFS.
  // given this is only needed for optimization, if the HDFS call fails we return the default.
  if (Option(totalSize).map(_.toLong).getOrElse(0L) > 0) {
can we write something like this to make it easier to read?
if (totalSize != null && totalSize.toLong > 0L) {
  totalSize.toLong
} else if (rawDataSize != null && rawDataSize.toLong > 0) {
  rawDataSize.toLong
} else if (sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled) {
  ...
} else {
  sparkSession.sessionState.conf.defaultSizeInBytes
}
Done.
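For readers following the thread, the elided branch (`...`) is where the on-disk size would be computed. A rough, self-contained sketch of that kind of helper, using the Hadoop FileSystem API, might look like the following; the helper name, parameters, and error handling are illustrative, not the PR's exact code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Illustrative helper: sum the bytes under the table's storage location.
// If the filesystem call fails, fall back to the default size estimate,
// matching the intent of the in-line comment quoted above.
def sizeFromFileSystem(tableLocation: String,
                       hadoopConf: Configuration,
                       defaultSizeInBytes: Long): Long = {
  try {
    val path = new Path(tableLocation)
    path.getFileSystem(hadoopConf).getContentSummary(path).getLength
  } catch {
    case _: Exception => defaultSizeInBytes
  }
}
```

Hitting the filesystem directly like this is what motivates the conf flag above, since getContentSummary can be slow on object stores such as S3.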
This looks good. Just two minor nits; if you can fix those, that would be great. Also, would it be possible to add a test case?
@rxin I have added a test case.
Great, thanks. Jenkins, test this please.
sql(
  s"""CREATE EXTERNAL TABLE csv_table(page_id INT, impressions INT)
     ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
need to indent this properly. I can fix this when I merge if Jenkins passes.
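For context, a test over such an external CSV table would presumably enable the new flag and check that the estimated size now comes from the files on disk. A hypothetical check (not necessarily the PR's exact assertion), assuming the Spark 2.0-era `statistics` API on logical plans:

```scala
// Hypothetical check: with the fallback enabled, the external table's
// estimated size should be positive and strictly below the very large
// conservative default used when metastore statistics are missing.
spark.conf.set("spark.sql.enableFallBackToHdfsForStats", "true")
val sizeInBytes =
  spark.table("csv_table").queryExecution.analyzed.statistics.sizeInBytes
assert(sizeInBytes > 0 && sizeInBytes < Long.MaxValue)
```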
Test build #3017 has finished for PR 13150 at commit
Merging in master/2.0.
…metastore, we should fallback to HDFS

## What changes were proposed in this pull request?
Currently if a table is used in join operation we rely on Metastore returned size to calculate if we can convert the operation to Broadcast join. This optimization only kicks in for table's that have the statistics available in metastore. Hive generally rolls over to HDFS if the statistics are not available directly from metastore and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.

## How was this patch tested?
I have executed queries locally to test.

Author: Parth Brahmbhatt <[email protected]>

Closes #13150 from Parth-Brahmbhatt/SPARK-15365.

(cherry picked from commit 4acabab)
Signed-off-by: Reynold Xin <[email protected]>
@Parth-Brahmbhatt you should add the email address you used in your commit to your GitHub profile so the commit is associated with your account. Thanks.
@rxin Thanks for taking the time to review and merge the patch. I have added the email to my profile.
What changes were proposed in this pull request?
Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS when statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
How was this patch tested?
I have executed queries locally to test.
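As a usage sketch (the table names below are placeholders; the config key is the one added by this PR), enabling the fallback and inspecting the plan might look like:

```scala
// Enable the new fallback (off by default to avoid surprising filesystem calls).
spark.conf.set("spark.sql.enableFallBackToHdfsForStats", "true")

// With the fallback on, a small Hive table that lacks metastore statistics can
// still be sized from its files, so the planner may choose a broadcast join
// when that size is below spark.sql.autoBroadcastJoinThreshold.
spark.sql(
  """SELECT f.id, d.name
    |FROM fact_table f
    |JOIN small_dim_table d ON f.id = d.id""".stripMargin).explain()
// Expect a BroadcastHashJoin node in the printed physical plan if
// small_dim_table's on-disk size is under the broadcast threshold.
```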