
Conversation

@Parth-Brahmbhatt
Contributor

What changes were proposed in this pull request?

Currently, if a table is used in a join, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS when statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of broadcast joins.

How was this patch tested?

I have executed queries locally to test.
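The HDFS fallback described above can be sketched as follows. This is a minimal illustration using Hadoop's `FileSystem` API, not the PR's actual code; the names `sizeFromHdfs`, `tablePath`, and `defaultSize` are hypothetical.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: query HDFS for the on-disk size of a table directory,
// falling back to a default if the filesystem call fails.
def sizeFromHdfs(tablePath: Path, hadoopConf: Configuration, defaultSize: Long): Long = {
  try {
    val fs = tablePath.getFileSystem(hadoopConf)
    // getContentSummary walks the directory tree and sums file lengths.
    fs.getContentSummary(tablePath).getLength
  } catch {
    // Statistics are only an optimization hint, so return the default
    // rather than failing the query if HDFS is unreachable.
    case _: java.io.IOException => defaultSize
  }
}
```

Because `getContentSummary` recursively lists the table directory, it can be slow on high-latency filesystems such as S3, which motivates the conf guard discussed later in the review.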

Member

(I remember I was told that chaining getOrElse is not preferred because it is confusing)

Contributor Author

What alternative was chosen as more readable/less confusing?

Member

@HyukjinKwon HyukjinKwon May 18, 2016

I hope this PR is helpful: #12256

Contributor

Yeah, there is nothing wrong with getOrElse, but chaining a lot of them together (also with Option's filter) can get really confusing. While you are at it, it'd be better to fix this.

Expanding it to be longer (and maybe use some imperative style code) could make it less confusing.
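To illustrate the readability point being made here (with hypothetical values, not the PR's code), compare a chained `Option` pipeline against the equivalent imperative-style conditional:

```scala
// Hypothetical metastore properties: a missing value and a present one.
val totalSize: String = null
val rawDataSize: String = "1024"
val defaultSize = 8L * 1024 * 1024

// Chained form: correct, but once map/filter/getOrElse nest, the
// fallback order is hard to see at a glance.
val chained = Option(totalSize).map(_.toLong).filter(_ > 0)
  .getOrElse(Option(rawDataSize).map(_.toLong).filter(_ > 0)
    .getOrElse(defaultSize))

// Expanded form: each fallback case is a visible branch.
val expanded =
  if (totalSize != null && totalSize.toLong > 0) totalSize.toLong
  else if (rawDataSize != null && rawDataSize.toLong > 0) rawDataSize.toLong
  else defaultSize

// Both evaluate to 1024L here.
```

The two forms are equivalent; the review's argument is purely about which one a later reader can verify faster.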

Contributor Author

@rxin I have replaced it with a conventional if/else if.

@sameeragarwal
Member

@Parth-Brahmbhatt this looks pretty good. However, given that hitting the underlying filesystem directly can incur a lot of latency (especially in the case of S3), can you please guard this change behind a conf (with a comment about the potential performance issues)? Additionally, it might be nice to set the conf to false by default to prevent silent regressions for existing queries (especially if we're targeting this for 2.0).

@Parth-Brahmbhatt
Contributor Author

@sameeragarwal Added config option. @rxin can you take a look one more time?

@Parth-Brahmbhatt
Contributor Author

@sameeragarwal @rxin FYI, GitHub currently has some latency issues, so you probably can't see the updates.


val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
  SQLConfigBuilder("spark.sql.enableFallBackToHdfsForStats")
    .doc("If the table statistics are not available from table metadata enable fall back to hdfs" +
Member

nit: missing period after hdfs
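For context, a complete conf entry of this shape might read as follows. This is a hedged sketch only: the continuation of the doc string is illustrative, and the `false` default reflects the reviewer's earlier suggestion; the final merged code may differ.

```scala
// Sketch of the full builder, assuming Spark 2.0's internal
// SQLConfigBuilder API (doc string continuation is illustrative).
val ENABLE_FALL_BACK_TO_HDFS_FOR_STATS =
  SQLConfigBuilder("spark.sql.enableFallBackToHdfsForStats")
    .doc("If the table statistics are not available from table metadata enable fall back to hdfs." +
      " This can be expensive on filesystems with high listing latency, such as S3.")
    .booleanConf
    .createWithDefault(false)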

@sameeragarwal
Member

LGTM

@sameeragarwal
Member

sameeragarwal commented May 24, 2016

jenkins test this please

@SparkQA

SparkQA commented May 24, 2016

Test build #59215 has finished for PR 13150 at commit f5d5dde.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

.getOrElse(sparkSession.sessionState.conf.defaultSizeInBytes)))
// if the size is still less than zero, we try to get the file size from HDFS.
// given this is only needed for optimization, if the HDFS call fails we return the default.
if (Option(totalSize).map(_.toLong).getOrElse(0L) > 0) {
Contributor

@rxin rxin May 24, 2016

can we write something like this to make it easier to read?

if (totalSize != null && totalSize.toLong > 0L) {
  totalSize.toLong
} else if (rawDataSize != null && rawDataSize.toLong > 0) {
  rawDataSize.toLong
} else if (sparkSession.sessionState.conf.fallBackToHdfsForStatsEnabled) {
  ...
} else {
  sparkSession.sessionState.conf.defaultSizeInBytes
}

Contributor Author

Done.

@rxin
Contributor

rxin commented May 24, 2016

This looks good. Just two minor nits. If you can fix those that would be great.

Also - would it be possible to add a test case?

@Parth-Brahmbhatt
Contributor Author

@rxin added a test case.

@rxin
Contributor

rxin commented May 25, 2016

Great - thanks.

Jenkins, test this please.


sql(
s"""CREATE EXTERNAL TABLE csv_table(page_id INT, impressions INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
Contributor

need to indent this properly. I can fix this when I merge if Jenkins passes.

@SparkQA

SparkQA commented May 25, 2016

Test build #3017 has finished for PR 13150 at commit ff69f91.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Contributor

rxin commented May 25, 2016

Merging in master/2.0.

asfgit pushed a commit that referenced this pull request May 25, 2016
…metastore, we should fallback to HDFS

## What changes were proposed in this pull request?
Currently, if a table is used in a join, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS when statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of broadcast joins.

## How was this patch tested?
I have executed queries locally to test.

Author: Parth Brahmbhatt <[email protected]>

Closes #13150 from Parth-Brahmbhatt/SPARK-15365.

(cherry picked from commit 4acabab)
Signed-off-by: Reynold Xin <[email protected]>
@rxin
Contributor

rxin commented May 25, 2016

@Parth-Brahmbhatt you should add the email address you used in your commit to your GitHub profile, so the commit is associated with your account. Thanks.

@asfgit asfgit closed this in 4acabab May 25, 2016
@Parth-Brahmbhatt Parth-Brahmbhatt deleted the SPARK-15365 branch May 26, 2016 17:39
@Parth-Brahmbhatt
Contributor Author

@rxin Thanks for taking the time to review and merge the patch. I have added the email to my profile.
