
[SPARK-32985][SQL] Decouple bucket scan and bucket filter pruning for data source v1 #31413

Closed. c21 wants to merge 10 commits into apache:master from c21:bucket-pruning

Conversation

c21 (Contributor) commented Feb 1, 2021

What changes were proposed in this pull request?

As a followup from the discussion in #29804 (comment). Currently in the data source v1 file scan FileSourceScanExec, bucket filter pruning only takes effect with bucketed table scan. However this is unnecessary, as bucket filter pruning can also happen if we disable bucketed table scan. Reading files with bucket hash partitioning and bucket filter pruning are two orthogonal features, and do not need to be coupled together.
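As a minimal sketch of the decoupling idea, the snippet below (hypothetical helper names, a simplified file-name convention, not Spark's actual API) shows bucket filter pruning as nothing more than a per-file predicate derived from the filter on the bucket column, applicable whether or not the scan output is bucketed:

```scala
// Simplified stand-in for Spark's bucket file naming, e.g. "part-0_00003.parquet",
// where the 5-digit group before the extension is the bucket id.
def getBucketId(fileName: String): Option[Int] =
  "_(\\d{5})\\.".r.findFirstMatchIn(fileName).map(_.group(1).toInt)

// Bucket filter pruning: keep only files whose bucket id matches the bucket
// selected by the filter value. This needs no bucketed scan at all.
def pruneFiles(files: Seq[String], selectedBucket: Int): Seq[String] =
  files.filter(f => getBucketId(f).contains(selectedBucket))

val files = Seq("part-0_00000.parquet", "part-1_00001.parquet", "part-2_00002.parquet")
println(pruneFiles(files, 1)) // only the bucket-1 file survives
```

The point of the sketch: the predicate consults only file names and the selected bucket, so task parallelism (how the surviving files are packed into splits) can be decided independently.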

Why are the changes needed?

This helps queries leverage the benefit of bucket filter pruning, saving CPU/IO by not reading unnecessary bucket files, without being bound to bucketed table scan when task parallelism is a concern.

In addition, this also resolves the issue of reducing the number of tasks launched for a simple query with a bucket column filter - SPARK-33207 - because with bucketed scan, we launch as many tasks as there are buckets, and this is unnecessary.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added unit tests in BucketedReadSuite.scala to make all existing unit tests for bucket filtering work with this PR.

withTable("bucketed_table") {
  val numBuckets = NumBucketsForPruningDF
  val bucketSpec = BucketSpec(numBuckets, Seq("j"), Nil)
  // json does not support predicate push-down, and thus json is used here
Contributor Author (c21):

This is not true anymore as json filter push down was added in https://issues.apache.org/jira/browse/SPARK-30648 .

("SELECT j FROM t1", 0, 0),
// Filter on bucketed column
("SELECT * FROM t1 WHERE i = 1", 1, 1),
("SELECT * FROM t1 WHERE i = 1", 0, 1),
Contributor Author (c21):

This unit test change is expected, as we no longer need to do a bucketed scan for this kind of query. See the related change in DisableUnnecessaryBucketedScan.scala.

c21 commented Feb 1, 2021

cc @cloud-fan and @maropu could you guys take a look when you have time? Thanks.


SparkQA commented Feb 1, 2021

Test build #134712 has finished for PR 31413 at commit 3b72a6b.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

github-actions bot added the SQL label Feb 1, 2021

SparkQA commented Feb 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39298/

partitionValues = partition.values
)

if (filePruning(filePath)) {
Member:

Ah, nice improvement!

val filePruning: Path => Boolean = optionalBucketSet match {
  case Some(bucketSet) =>
    filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
      .getOrElse(sys.error(s"Invalid bucket file $filePath")))
Member:

Could you use IllegalStateException instead of sys.error?

Contributor Author (c21):

@maropu - I was following code path for creating bucketed RDD. Do we want to change this place as well? Just wondering why we prefer IllegalStateException here? Is it better error classification?

Member:

I was following code path for creating bucketed RDD. Do we want to change this place as well?

Yea, since the fix is trivial, changing it in this PR looks fine to me.

Just wondering why we prefer IllegalStateException here? Is it better error classification?

This is not a strict rule, but I think we tend to use IllegalStateException for an unexpected code path. For example, see this related previous comment:
#28810 (comment)

Contributor:

+1, sys.error crashes the JVM and should be avoided.

Contributor Author (c21):

@maropu , @cloud-fan - sure, updated.

s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
val filePruning: Path => Boolean = optionalBucketSet match {
Member:

nit: filePruning -> can[Bucket]Prune?

Contributor Author (c21):

@maropu - changed.


maropu commented Feb 1, 2021

I left minor comments and it looks fine otherwise. cc: @viirya , too.

val filePruning: Path => Boolean = optionalBucketSet match {
  case Some(bucketSet) =>
    filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
      .getOrElse(sys.error(s"Invalid bucket file $filePath")))
viirya (Member), Feb 1, 2021:

It doesn't look good to fail the query here. The scan node actually reads buckets implicitly because bucketed scan is disabled for this path. Instead of failing the query, maybe log a warning and skip pruning?

Contributor Author (c21):

The error here indicates data corruption (an invalid file name) in a Spark data source bucketed table. The benefit of logging a warning here is to unblock reading such corrupted bucketed tables by disabling bucketing. I feel this is dangerous. Users should not rely on disabling bucketing to read potentially wrong data from a bucketed table; they should correct the table. I prefer to fail loudly here with an exception, as a warning log would be very hard to debug. But I am open to other opinions as well, cc @maropu and @cloud-fan.

Member:

Your reasoning sounds reasonable. But it still sounds like a potential breaking change. I'm not sure if some users read tables this way, but it is indeed possible. It will be very confusing to them, as they didn't ask to read the table in a bucketed way and it worked previously.

If eventually we still decide to fail the query, then an easy-to-understand error message is necessary to let users know why we use the bucketed spec here and why Spark fails to read the table. Maybe we should also provide some hints for solving it, if possible.

Member:

We already have some options to control missing & corrupted files for data sources, so how about following those semantics?

val IGNORE_CORRUPT_FILES = buildConf("spark.sql.files.ignoreCorruptFiles")
  .doc("Whether to ignore corrupt files. If true, the Spark jobs will continue to run when " +
    "encountering corrupted files and the contents that have been read will still be returned. " +
    "This configuration is effective only when using file-based sources such as Parquet, JSON " +
    "and ORC.")
  .version("2.1.1")
  .booleanConf
  .createWithDefault(false)

val IGNORE_MISSING_FILES = buildConf("spark.sql.files.ignoreMissingFiles")
  .doc("Whether to ignore missing files. If true, the Spark jobs will continue to run when " +
    "encountering missing files and the contents that have been read will still be returned. " +
    "This configuration is effective only when using file-based sources such as Parquet, JSON " +
    "and ORC.")
  .version("2.3.0")
  .booleanConf
  .createWithDefault(false)

For the data corruption case above, how about throwing an exception by default, and stating in the exception message that you should use the specified option if you want to ignore it?
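The semantics proposed above can be sketched as follows (hypothetical names and a simplified bucket-file-name parser, not the actual patch): throw by default on an invalid bucket file name, but fall back to "do not prune" when spark.sql.files.ignoreCorruptFiles is enabled.

```scala
// Simplified parse of a bucket id from a file name like "part-0_00003.parquet".
def getBucketId(fileName: String): Option[Int] =
  "_(\\d{5})\\.".r.findFirstMatchIn(fileName).map(_.group(1).toInt)

// Decide whether a file survives bucket pruning.
def shouldProcess(fileName: String, bucketMatches: Int => Boolean,
                  ignoreCorruptFiles: Boolean): Boolean =
  getBucketId(fileName) match {
    case Some(id) => bucketMatches(id)
    case None if ignoreCorruptFiles => true // keep the file, skip pruning
    case None =>
      throw new IllegalStateException(
        s"Invalid bucket file $fileName when doing bucket pruning. " +
          "Enable spark.sql.files.ignoreCorruptFiles to ignore the exception.")
  }
```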

Member:

Yea, I think we had better avoid the case where there is no way to overcome bucket id corruption. Using the existing option to ignore the failure sounds good.

Contributor Author (c21):

@maropu , @viirya - This sounds reasonable to me. Changed to use spark.sql.files.ignoreCorruptFiles to handle invalid bucket file name case.

Member:

Nice!


SparkQA commented Feb 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39298/

viirya (Member) left a comment:

It looks okay overall.


SparkQA commented Feb 1, 2021

Test build #134717 has finished for PR 31413 at commit cd90f0a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c21 (Contributor Author) left a comment:

Addressed all comments, and ready to review again, thanks. cc @maropu , @viirya and @cloud-fan .

s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
val filePruning: Path => Boolean = optionalBucketSet match {
Contributor Author (c21):

@maropu - changed.

val filePruning: Path => Boolean = optionalBucketSet match {
  case Some(bucketSet) =>
    filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
      .getOrElse(sys.error(s"Invalid bucket file $filePath")))
Contributor Author (c21):

@maropu , @cloud-fan - sure, updated.

val filePruning: Path => Boolean = optionalBucketSet match {
  case Some(bucketSet) =>
    filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
      .getOrElse(sys.error(s"Invalid bucket file $filePath")))
Contributor Author (c21):

@maropu , @viirya - This sounds reasonable to me. Changed to use spark.sql.files.ignoreCorruptFiles to handle invalid bucket file name case.


SparkQA commented Feb 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39332/


SparkQA commented Feb 1, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39332/

s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
val ignoreCorruptFiles = fsRelation.sparkSession.sessionState.conf.ignoreCorruptFiles
Member:

nit: lazy val?

Contributor Author (c21):

@maropu - updated.

throw new IllegalStateException(
s"Invalid bucket file $filePath when doing bucket pruning. " +
s"Enable ${SQLConf.IGNORE_CORRUPT_FILES.key} to disable exception " +
"and read the file.")
Member:

nit: wrong indent?

Contributor Author (c21):

@maropu - updated.

} else {
throw new IllegalStateException(
s"Invalid bucket file $filePath when doing bucket pruning. " +
s"Enable ${SQLConf.IGNORE_CORRUPT_FILES.key} to disable exception " +
Member:

nit: disable -> ignore

Contributor Author (c21):

@maropu - updated.


SparkQA commented Feb 2, 2021

Test build #134746 has finished for PR 31413 at commit e63a8c3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39343/


// Filter files with bucket pruning if possible
lazy val ignoreCorruptFiles = fsRelation.sparkSession.sessionState.conf.ignoreCorruptFiles
val canPrune: Path => Boolean = optionalBucketSet match {
Member:

nit: perhaps rename this to shouldNotPrune or shouldProcess? canPrune sounds like the path should be ignored.

Contributor Author (c21):

I am very bad at naming :) This was suggested in #31413 (comment). Shall I change it again? cc @maropu.

Member:

yea, okay.

Contributor Author (c21):

Changed to shouldProcess as I feel shouldNotPrune is hard to reason about.

case Some(id) => bucketSet.get(id)
case None =>
if (ignoreCorruptFiles) {
// If ignoring corrupt file, do not prune when bucket file name is invalid
sunchao (Member), Feb 2, 2021:

Curious what the previous behavior of this is, or is this newly introduced? We may need to add the info to the PR description (user-facing change).

Also I'm not sure if this is the best choice: if a bucketed table is corrupted, should we read the corrupt file? It will likely lead to incorrect results. On the other hand we can choose to ignore the file, which seems more aligned with the name of the config, although the result could still be incorrect.

Contributor Author (c21):

@sunchao - this is newly introduced. Updated the PR description.

Also I'm not sure if this is the best choice: if a bucketed table is corrupted, should we read the corrupt file? It will likely lead to incorrect results. On the other hand we can choose to ignore the file, which seems more aligned with the name of the config, although the result could still be incorrect.

Note that by default the exception will be thrown here and the query will fail loudly. We allow a config here to help existing users work around it if they want. See the relevant discussion in #31413 (comment).

Member:

Cool, thanks for pointing to the discussion. I'm just not sure whether the corrupted file should be ignored or processed when the flag is turned on. ignoreCorruptFiles seems to indicate that the problematic file should be ignored, so it is a bit confusing that we still process it here. Also IMO ignoring it seems slightly safer (think of someone dumping garbage files into the bucketed partition dir)?

cc @maropu @viirya

Contributor Author (c21):

I feel neither skipping nor processing the file is perfect. There can be other corruption cases, e.g. a table specified with 1024 buckets that only has 500 files underneath. This could be due to other compute engines or users accidentally dumping data here without respecting Spark bucketing metadata. We have no efficient way to handle the case where the number of files is fewer than the number of buckets.

The existing usage of ignoreCorruptFiles skips reading some of the content of a file, so it's not completely ignoring the file either. But I am fine if we think we need another config name for this.

Contributor Author (c21):

Given that users explicitly disable bucketing here for reading the table, I would assume they want to read the table as a non-bucketed table, so they would like to read all of the input files, no? cc @viirya what's the use case you are thinking of here? Thanks.

Member:

I feel neither skipping nor processing the file is perfect.

Yes agreed.

Given that users explicitly disable bucketing here for reading the table, I would assume they want to read the table as a non-bucketed table, so they would like to read all of the input files

Good point. Although it seems a bit weird that someone would do this.

Contributor:

This seems unrelated to "decouple bucket scan and bucket filter pruning". Can we do it in a followup PR and discuss there? Let's not introduce extra behavior change when not necessary.

Contributor Author (c21):

@cloud-fan - sorry, which part are you suggesting to do in a followup PR? Here we anyway need to decide how we handle a file name that is not a valid bucket file name (process the file or not) for pruning. Did I miss anything?

Contributor:

Here we anyway need to decide how we handle a file name that is not a valid bucket file name (process the file or not) for pruning

We should follow whatever the behavior was before this PR, since the correct behavior is not obvious and has triggered discussion here.

Contributor Author (c21):

@cloud-fan - okay. The behavior before this PR was to process all files of a bucketed table when bucketing is disabled. I changed the code to not prune a file if its bucket file name is invalid, so this follows the previous behavior, and we can discuss whether to throw an exception / ignore the file / process the file in a followup PR. cc @maropu and @viirya for the code change here.
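The behavior settled on here can be sketched as follows (hypothetical names, simplified file-name parsing, not the actual patch): a file whose name carries no parseable bucket id is simply not pruned, matching the pre-PR behavior of processing every file.

```scala
// Simplified parse of a bucket id from a file name like "part-0_00003.parquet".
def getBucketId(fileName: String): Option[Int] =
  "_(\\d{5})\\.".r.findFirstMatchIn(fileName).map(_.group(1).toInt)

val wantedBuckets = Set(0, 2) // buckets surviving the filter, for illustration

val shouldProcess: String => Boolean = fileName =>
  getBucketId(fileName) match {
    case Some(id) => wantedBuckets.contains(id)
    case None     => true // invalid bucket file name: do not prune
  }
```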


SparkQA commented Feb 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39343/


SparkQA commented Feb 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39351/


SparkQA commented Feb 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39351/


SparkQA commented Feb 2, 2021

Test build #134757 has finished for PR 31413 at commit 3d348a6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Feb 2, 2021

Test build #134765 has finished for PR 31413 at commit 9a6999d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c21 (Contributor Author) left a comment:

@cloud-fan - addressed your comment, and it's ready for review again, thanks.

case Some(id) => bucketSet.get(id)
case None =>
if (ignoreCorruptFiles) {
// If ignoring corrupt file, do not prune when bucket file name is invalid
Contributor Author (c21):

@cloud-fan - okay. The behavior before this PR was to process all files of a bucketed table when bucketing is disabled. I changed the code to not prune a file if its bucket file name is invalid, so this follows the previous behavior, and we can discuss whether to throw an exception / ignore the file / process the file in a followup PR. cc @maropu and @viirya for the code change here.


SparkQA commented Feb 3, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39435/


SparkQA commented Feb 3, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39435/

}

withSQLConf(SQLConf.BUCKETING_ENABLED.key -> "false") {
// Bucket pruning should still work when bucketing is disabled
Contributor:

then do we have a config to turn on/off bucket file pruning, or is it always applied?

Contributor Author (c21):

@cloud-fan - always applied. I can add one if this would surprise existing users/queries, so they can turn it off if needed, e.g. spark.sql.sources.bucketing.pruning.enabled?

Contributor:

Or we can keep the previous behavior: when SQLConf.BUCKETING_ENABLED is off, don't do any bucket optimization, including bucketed scan and bucket pruning.

Contributor Author (c21):

@cloud-fan - makes sense to me. Updated the change.
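The gating agreed on above can be sketched as one function (illustrative names, not Spark's internals): when bucketing is disabled via spark.sql.sources.bucketing.enabled, drop the bucket set entirely so neither bucketed scan nor bucket pruning applies, preserving the previous behavior.

```scala
// If bucketing is disabled, pretend no bucket filter was pushed down at all,
// so every file of the table is read as a plain non-bucketed scan.
def effectiveBucketSet(bucketingEnabled: Boolean,
                       optionalBucketSet: Option[Set[Int]]): Option[Set[Int]] =
  if (bucketingEnabled) optionalBucketSet else None
```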


SparkQA commented Feb 4, 2021

Test build #134848 has finished for PR 31413 at commit 7d2f849.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
lazy val bucketingEnabled = fsRelation.sparkSession.sessionState.conf.bucketingEnabled
Contributor:

There is already a lazy val bucketedScan: Boolean in L261

Contributor Author (c21):

@cloud-fan - that's bucketedScan, but we need ...conf.bucketingEnabled; here we are already in the code path where bucketedScan is false - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L413 .

Contributor:

ah sorry, I missed that. Probably we can just make it a val, since reading a conf is very cheap, while lazy val has overhead.

Contributor Author (c21):

since reading conf is very cheap

That's what I felt too, but I got feedback earlier here - #31413 (comment). @maropu - could you help provide more context here? Thanks.

Member:

Yea, reverting it back looks okay.

Contributor Author (c21):

Removed lazy val.


SparkQA commented Feb 4, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39445/


SparkQA commented Feb 4, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39445/


SparkQA commented Feb 4, 2021

Test build #134857 has finished for PR 31413 at commit 066d5a4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

c21 (Contributor Author) left a comment:

@cloud-fan - addressed all comments, and this is ready for review again, thanks.

s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
lazy val bucketingEnabled = fsRelation.sparkSession.sessionState.conf.bucketingEnabled
Contributor Author (c21):

Removed lazy val.


SparkQA commented Feb 5, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39493/


SparkQA commented Feb 5, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39493/


SparkQA commented Feb 5, 2021

Test build #134910 has finished for PR 31413 at commit 03120af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

jaceklaskowski (Contributor) left a comment:

LGTM (non-binding)

cloud-fan (Contributor):

thanks, merging to master!

@cloud-fan cloud-fan closed this in 76baaf7 Feb 5, 2021

c21 commented Feb 5, 2021

Thank you all for review!

@c21 c21 deleted the bucket-pruning branch February 5, 2021 17:52
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
Member:

It's a bit odd that we name the method createNonBucketedReadRDD but do something with buckets here. I guess we could rename createNonBucketedReadRDD to just createReadRDD or createStandardReadRDD.

case Some(id) => bucketSet.get(id)
case None =>
// Do not prune the file if bucket file name is invalid
true
Member:

Hm, it could be a one-liner:

        filePath => BucketingUtils.getBucketId(filePath.getName).forall(bucketSet.get)

If it looks less readable we could:

        filePath => BucketingUtils.getBucketId(filePath.getName).map(bucketSet.get).getOrElse(true)

If we worry about the perf penalty from pattern matching, etc., we could do:

        filePath => {
          val bucketId = BucketingUtils.getBucketId(filePath.getName)
          if (bucketId.isEmpty) true else bucketSet.get(bucketId.get)
        }
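A note on why the forall variant above behaves correctly: Option.forall returns true for None, so an unparseable bucket file name is kept (not pruned), while Some(id) is checked against the selected buckets. The sketch below uses a plain Set in place of the BitSet in Spark's code:

```scala
val bucketSet = Set(1, 3) // buckets selected by the filter, for illustration

// Equivalent to the suggested one-liner's pruning decision per file.
def keep(maybeBucketId: Option[Int]): Boolean = maybeBucketId.forall(bucketSet.contains)

assert(keep(Some(1)))  // bucket 1 is selected
assert(!keep(Some(2))) // bucket 2 is pruned
assert(keep(None))     // invalid bucket file name: not pruned
```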

HyukjinKwon pushed a commit that referenced this pull request Mar 30, 2021
…r change in FileSourceScanExec

### What changes were proposed in this pull request?

This PR is a followup change to address comments in #31413 (comment) and #31413 (comment) . Minor change in `FileSourceScanExec`. No actual logic change here.

### Why are the changes needed?

Better readability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #32000 from c21/bucket-scan.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
8 participants