[SPARK-32985][SQL] Decouple bucket scan and bucket filter pruning for data source v1 #31413

c21 wants to merge 10 commits into apache:master from
Conversation
withTable("bucketed_table") {
  val numBuckets = NumBucketsForPruningDF
  val bucketSpec = BucketSpec(numBuckets, Seq("j"), Nil)
  // json does not support predicate push-down, and thus json is used here
This is not true anymore, as json filter push-down was added in https://issues.apache.org/jira/browse/SPARK-30648.
  ("SELECT j FROM t1", 0, 0),
  // Filter on bucketed column
- ("SELECT * FROM t1 WHERE i = 1", 1, 1),
+ ("SELECT * FROM t1 WHERE i = 1", 0, 1),
This unit test change is expected, as we no longer need to do bucketed scan for this kind of query. See the related change in DisableUnnecessaryBucketedScan.scala.
cc @cloud-fan and @maropu, could you take a look when you have time? Thanks.

Test build #134712 has finished for PR 31413 at commit

Kubernetes integration test starting
partitionValues = partition.values
)

if (filePruning(filePath)) {

val filePruning: Path => Boolean = optionalBucketSet match {
  case Some(bucketSet) =>
    filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
      .getOrElse(sys.error(s"Invalid bucket file $filePath")))
Could you use IllegalStateException instead of sys.error?

I was following the code path for creating the bucketed RDD. Do we want to change this place as well?

Yea, since the fix is trivial, changing it in this PR looks fine to me.

Just wondering why we prefer IllegalStateException here? Is it better error classification?

This is not a strict rule, but I think we tend to use IllegalStateException for an unexpected code path. For example, see the related previous comment:
#28810 (comment)

+1, sys.error throws a bare RuntimeException, so IllegalStateException classifies the error better.
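For context, Spark encodes the bucket id in the data file name, and BucketingUtils.getBucketId extracts it. A rough Python sketch of the parsing and the fail-loud behavior being discussed (the file-name pattern is an illustrative assumption, not Spark's exact regex):

```python
import re

# Bucket files look roughly like part-00000-<uuid>_00003.c000.snappy.parquet,
# where _00003 encodes bucket id 3 (pattern assumed for illustration).
_BUCKET_ID_RE = re.compile(r"_(\d+)(?:\.\w+)*$")

def get_bucket_id(file_name):
    """Return the bucket id parsed from the file name, or None if absent."""
    m = _BUCKET_ID_RE.search(file_name)
    return int(m.group(1)) if m else None

def require_bucket_id(file_name):
    """Fail loudly (akin to throwing IllegalStateException) on an invalid bucket file."""
    bucket_id = get_bucket_id(file_name)
    if bucket_id is None:
        raise RuntimeError("Invalid bucket file %s" % file_name)
    return bucket_id
```

The review point is only about which exception type signals the unexpected state; the parsing itself is unchanged.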
  s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
val filePruning: Path => Boolean = optionalBucketSet match {
nit: filePruning -> can[Bucket]Prune?
I left minor comments and it looks fine otherwise. cc: @viirya , too.
val filePruning: Path => Boolean = optionalBucketSet match {
  case Some(bucketSet) =>
    filePath => bucketSet.get(BucketingUtils.getBucketId(filePath.getName)
      .getOrElse(sys.error(s"Invalid bucket file $filePath")))
It doesn't look good to fail the query here. The scan node actually reads buckets implicitly because bucketed scan is disabled for this path. Instead of failing the query, maybe log a warning and skip pruning?
The error here indicates there is data corruption (an invalid file name) in a Spark data source bucketed table. The benefit of logging a warning here is to unblock reading these kinds of corrupted bucketed tables by disabling bucketing. I feel this is dangerous: users should not rely on disabling bucketing to read potentially wrong data from a bucketed table; they should correct the table. I prefer to fail loudly here with an exception, as a warning log would be very hard to debug. But I am open to other opinions as well, cc @maropu and @cloud-fan .
Your reasoning sounds fair, but it still sounds like a potential breaking change. I'm not sure if some users read tables this way, but it is indeed possible. It will be very confusing to them, as they never asked to read the table in a bucketed way and it worked previously.

If we eventually still decide to fail the query, then an easy-to-understand error message is necessary to let users know why we use the bucket spec here and why Spark fails to read the table. Maybe we should also provide some hints for solving it, if possible.
We already have some options to control missing & corrupted files for data sources, so how about following those semantics?
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala
Lines 1282 to 1298 in 4e7e7ee
For the data corruption case above, how about throwing an exception by default and then stating in the exception message that you should use a specified option if you want to ignore it?
Yea, I think it's better to avoid a situation where there is no way to overcome the bucket id corruption. Using the existing option to ignore the failure sounds good.
Kubernetes integration test status success

Test build #134717 has finished for PR 31413 at commit
c21 left a comment

Addressed all comments, ready for review again, thanks. cc @maropu , @viirya and @cloud-fan .
Kubernetes integration test status failure
  s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
val ignoreCorruptFiles = fsRelation.sparkSession.sessionState.conf.ignoreCorruptFiles
} else {
  throw new IllegalStateException(
    s"Invalid bucket file $filePath when doing bucket pruning. " +
    s"Enable ${SQLConf.IGNORE_CORRUPT_FILES.key} to disable exception " +
    "and read the file.")
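The control flow in this intermediate version can be sketched as follows (a simplified Python model of the Scala hunk above; the config key spark.sql.files.ignoreCorruptFiles is the real one behind SQLConf.IGNORE_CORRUPT_FILES, the rest of the shape is an assumption):

```python
IGNORE_CORRUPT_FILES_KEY = "spark.sql.files.ignoreCorruptFiles"

def should_process(bucket_id, bucket_set, ignore_corrupt_files):
    """Decide whether to read a file during bucket pruning.

    bucket_id is None when the file name is not a valid bucket file name.
    """
    if bucket_id is not None:
        # Valid bucket file: keep it only if its bucket survives pruning.
        return bucket_id in bucket_set
    if ignore_corrupt_files:
        # Invalid name, but corrupt files are ignored: skip pruning, read it.
        return True
    # Default: fail loudly, pointing the user at the escape-hatch config.
    raise RuntimeError(
        "Invalid bucket file when doing bucket pruning. "
        "Enable %s to disable exception and read the file." % IGNORE_CORRUPT_FILES_KEY)
```

This was later simplified again when the behavior change was deferred to a followup.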
Test build #134746 has finished for PR 31413 at commit
// Filter files with bucket pruning if possible
lazy val ignoreCorruptFiles = fsRelation.sparkSession.sessionState.conf.ignoreCorruptFiles
val canPrune: Path => Boolean = optionalBucketSet match {
nit: perhaps rename this to shouldNotPrune or shouldProcess? canPrune sounds like the path should be ignored.

I am very bad at naming :) This was suggested in #31413 (comment). Shall I change it again? cc @maropu .

Changed to shouldProcess, as I feel shouldNotPrune is hard to reason about.
case Some(id) => bucketSet.get(id)
case None =>
  if (ignoreCorruptFiles) {
    // If ignoring corrupt file, do not prune when bucket file name is invalid
Curious what the previous behavior of this was, or is this newly introduced? We may need to add the info to the PR description (user-facing change).

Also I'm not sure if this is the best choice: if a bucketed table is corrupted, should we read the corrupt file? It will likely lead to incorrect results. On the other hand we could choose to ignore the file, which seems more aligned with the name of the config, although the result could still be incorrect.
@sunchao - this is newly introduced. Updated the PR description.

> Also I'm not sure if this is the best choice: if a bucketed table is corrupted, should we read the corrupt file? It will likely lead to incorrect results. On the other hand we could choose to ignore the file, which seems more aligned with the name of the config, although the result could still be incorrect.

Note that by default the exception is thrown here and the query fails loudly. We allow a config here to help existing users work around it if they want. See the relevant discussion in #31413 (comment) .
Cool, thanks for pointing to the discussion. I'm just not sure whether the corrupted file should be ignored or processed if the flag is turned on. ignoreCorruptFiles seems to indicate that the problematic file should be ignored, so it is a bit confusing that we still process it here. Also IMO ignoring it seems slightly safer (think of someone dumping garbage files into the bucketed partition dir)?
I feel neither skipping nor processing the file is perfect. There can be other corruption cases, e.g. a table specified with 1024 buckets but only 500 files underneath. This could be due to other compute engines or users accidentally dumping data here without respecting Spark's bucketing metadata. We have no efficient way to handle the case when the number of files is fewer than the number of buckets.

The existing usage of ignoreCorruptFiles skips reading some of the content of a file, so it's also not completely ignoring it. But I am fine if we think we need another config name for this.
Given that users explicitly disable bucketing here for reading the table, I would assume they want to read the table as a non-bucketed table, so they would like to read all of the input files, no? cc @viirya what's the use case you are thinking of here? Thanks.
> I feel neither skipping nor processing the file is perfect.

Yes, agreed.

> Given users explicitly disable bucketing here for reading the table, I would assume they want to read the table as a non-bucketed table, so they would like to read all of the input files

Good point. Although it seems a bit weird that someone would do this.
This seems unrelated to "decouple bucket scan and bucket filter pruning". Can we do it in a followup PR and discuss it there? Let's not introduce an extra behavior change when it's not necessary.
@cloud-fan - sorry, which part are you suggesting to do in a followup PR? Here we anyway need to decide how to handle the case when the file name is not a valid bucket file name (process or not process the file) for pruning. Did I miss anything?
> Here we anyway need to decide how to handle the case when the file name is not a valid bucket file name (process or not process the file) for pruning

We should follow whatever the behavior was before this PR, since the correct behavior is not obvious and it triggered discussion here.
@cloud-fan - okay. The behavior before this PR is to process all files of a bucketed table if bucketing is disabled. I changed the code to not prune the file if the bucket file name is invalid, so this follows the previous behavior, and we can discuss whether to throw an exception / ignore the file / process the file in a followup PR. cc @maropu and @viirya for the code change here.
Test build #134757 has finished for PR 31413 at commit

Test build #134765 has finished for PR 31413 at commit
c21 left a comment

@cloud-fan - addressed your comment, and it's ready for review again, thanks.
}

withSQLConf(SQLConf.BUCKETING_ENABLED.key -> "false") {
  // Bucket pruning should still work when bucketing is disabled
Then do we have a config to turn on/off bucket file pruning, or is it always applied?
@cloud-fan - always applied. I can add one if this surprises existing users/queries, so they can turn it off if needed, e.g. spark.sql.sources.bucketing.pruning.enabled?
Or we can keep the previous behavior: when SQLConf.BUCKETING_ENABLED is off, don't do any bucket optimization, including bucket scan and bucket pruning.
@cloud-fan - makes sense to me. Updated the change.
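The agreed semantics - spark.sql.sources.bucketing.enabled acts as the master switch for both optimizations - can be modeled as below (a hypothetical helper, not Spark code; only the config name comes from the thread):

```python
def bucket_optimizations(bucketing_enabled, can_bucketed_scan, has_bucket_filter):
    """Return which bucket optimizations apply, per the semantics agreed above:
    when bucketing is disabled, neither bucketed scan nor bucket pruning runs."""
    if not bucketing_enabled:
        return {"bucketed_scan": False, "bucket_pruning": False}
    return {"bucketed_scan": can_bucketed_scan,
            "bucket_pruning": has_bucket_filter}
```

This preserves the pre-PR behavior of spark.sql.sources.bucketing.enabled=false while still letting pruning apply independently of bucketed scan when bucketing is on.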
Test build #134848 has finished for PR 31413 at commit
  s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
lazy val bucketingEnabled = fsRelation.sparkSession.sessionState.conf.bucketingEnabled
There is already a lazy val bucketedScan: Boolean in L261
@cloud-fan - that's bucketedScan, but we need ...conf.bucketingEnabled; here we are already in the code path where bucketedScan is false - https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala#L413 .
Ah sorry, I missed that. Probably we can just make it a val, since reading conf is very cheap, while lazy val has overhead.
> since reading conf is very cheap

That's what I feel too, but I got feedback earlier here - #31413 (comment). @maropu - could you help provide more context here? Thanks.
Yea, reverting it back looks okay.
sql/core/src/test/scala/org/apache/spark/sql/sources/BucketedReadSuite.scala
Test build #134857 has finished for PR 31413 at commit
c21 left a comment

@cloud-fan - addressed all comments, and this is ready for review again, thanks.
Test build #134910 has finished for PR 31413 at commit
jaceklaskowski left a comment

LGTM (non-binding)
thanks, merging to master!

Thank you all for review!
logInfo(s"Planning scan with bin packing, max size: $maxSplitBytes bytes, " +
  s"open cost is considered as scanning $openCostInBytes bytes.")

// Filter files with bucket pruning if possible
It's a bit odd that we call the method createNonBucketedReadRDD but do something with buckets. I guess we could rename createNonBucketedReadRDD to just createReadRDD or createStandardReadRDD.
case Some(id) => bucketSet.get(id)
case None =>
  // Do not prune the file if bucket file name is invalid
  true
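The merged behavior - prune by bucket id when the file name parses, otherwise keep the file - amounts to Scala's Option.forall. A Python sketch of the whole predicate (the file-name regex is an illustrative assumption, not Spark's exact parser):

```python
import re

# Assumed shape of Spark bucket file names, e.g. part-00000-<uuid>_00003.c000.parquet.
_BUCKET_ID_RE = re.compile(r"_(\d+)(?:\.\w+)*$")

def get_bucket_id(file_name):
    m = _BUCKET_ID_RE.search(file_name)
    return int(m.group(1)) if m else None

def make_should_process(optional_bucket_set):
    """Build the pruning predicate applied while splitting files into read tasks."""
    if optional_bucket_set is None:
        return lambda file_name: True  # no bucket filter: keep every file
    def should_process(file_name):
        bucket_id = get_bucket_id(file_name)
        # Option.forall semantics: an unparsable name is never pruned.
        return bucket_id is None or bucket_id in optional_bucket_set
    return should_process
```

Keeping unparsable names matches the pre-PR behavior of processing all files when bucketing is disabled.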
Hm, it could be a one-liner:

filePath => BucketingUtils.getBucketId(filePath.getName).forall(bucketSet.get)

If it looks less readable we could:

filePath => BucketingUtils.getBucketId(filePath.getName).map(bucketSet.get).getOrElse(true)

If we worry about perf penalty from pattern matching, etc. we could do:

filePath => {
  val bucketId = BucketingUtils.getBucketId(filePath.getName)
  if (bucketId.isEmpty) true else bucketSet.get(bucketId.get)
}

…r change in FileSourceScanExec

### What changes were proposed in this pull request?
This PR is a followup change to address comments in #31413 (comment) and #31413 (comment) . Minor change in FileSourceScanExec. No actual logic change here.

### Why are the changes needed?
Better readability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes #32000 from c21/bucket-scan.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
What changes were proposed in this pull request?

As a followup from the discussion in #29804 (comment) : currently in the data source v1 file scan FileSourceScanExec, bucket filter pruning only takes effect with bucketed table scan. However this is unnecessary, as bucket filter pruning can also happen if we disable bucketed table scan. Reading files with bucket hash partitioning and bucket filter pruning are two orthogonal features and do not need to be coupled together.

Why are the changes needed?

This helps queries leverage bucket filter pruning to save CPU/IO by not reading unnecessary bucket files, without being bound to bucketed table scan when the parallelism of tasks is a concern.

In addition, this also resolves the issue of reducing the number of tasks launched for a simple query with a bucket column filter - SPARK-33207 - because with bucketed scan, we launch a number of tasks equal to the number of buckets, and this is unnecessary.
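To make the SPARK-33207 point concrete: with bucketed scan the task count is tied to the number of buckets, while a non-bucketed scan bin-packs files by size. A much-simplified model (ignoring openCostInBytes and partition coalescing; 128 MB mirrors the default of spark.sql.files.maxPartitionBytes):

```python
import math

def num_scan_tasks(file_sizes, num_buckets, bucketed_scan,
                   max_split_bytes=128 * 1024 * 1024):
    """Very rough model of how many tasks a file scan launches."""
    if bucketed_scan:
        # One task per bucket, even if only a few buckets survive pruning.
        return num_buckets
    # Non-bucketed scan: size-based bin packing (simplified to one pass).
    return max(1, math.ceil(sum(file_sizes) / max_split_bytes))
```

So a small table bucketed into 1024 buckets launches 1024 tasks under bucketed scan, but only a handful under the decoupled non-bucketed scan with pruning.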
Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test in BucketedReadSuite.scala to make all existing unit tests for bucket filter pruning work with this PR.