[SPARK-33298][CORE] Decouple file naming from FileCommitProtocol #32881
Conversation
cc @cloud-fan could you help take a look when you have time?
Test build #139693 has finished for PR 32881 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #139777 has finished for PR 32881 at commit
Kubernetes integration test unable to build dist. exiting with code: 1
Test build #139796 has finished for PR 32881 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Test build #139805 has finished for PR 32881 at commit
I'd like to understand more about backward compatibility. If there is an old impl of the file commit protocol, in Spark 3.2 that impl will commit nothing, and this silent "noop" is even worse than an explicit compile error.
This is a developer API; if there is really no way to keep backward compatibility, let's just break the API and force people to correct their impl according to the new API.
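To make the hazard concrete, here is a minimal sketch assuming hypothetical old/new method shapes, not the actual Spark signatures: if the new overload ships with a default body for source compatibility, an old impl's override is never invoked.

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

abstract class FileCommitProtocol {
  // Old-shape method that an existing third-party impl overrides (hypothetical).
  def newTaskTempFile(ctx: TaskAttemptContext, dir: Option[String], ext: String): String =
    throw new UnsupportedOperationException("old API")

  // New-shape method given a default body for source compatibility. Spark 3.2
  // would call only this overload, so an impl that overrides just the old one
  // silently contributes nothing -- the "noop" described above.
  def newTaskTempFile(ctx: TaskAttemptContext, relativePath: String): String =
    relativePath
}
```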
@cloud-fan - I think it's impossible for me to keep backward compatibility cleanly. Updated here to just introduce the new method and break the API.
do we really need to allow users to customize the file naming?
@cloud-fan - yes, I think so. Before this PR, file naming is part of the commit protocol, and each commit protocol checked into the Spark code base has its own naming specification: HadoopMapReduceCommitProtocol, PathOutputCommitProtocol, and ManifestFileCommitProtocol. So we should expect that external commit protocols might already have their own custom naming, and we should allow them to implement their own naming protocol.
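For illustration, a hedged sketch of what such a pluggable naming protocol could look like; the trait name follows this PR, but the method shape and the Hive-style impl are hypothetical:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Context fields assumed from the FileContext discussion later in this thread.
final case class FileContext(ext: String, relativeDir: Option[String], prefix: Option[String])

trait FileNamingProtocol {
  def getTaskTempFilePath(ctx: TaskAttemptContext, fileContext: FileContext): String
}

// A hypothetical external protocol with its own convention, e.g. a
// Hive/Trino-style bucket file name such as "00003_0", which differs from
// Spark's default "part-00003-<uuid>" scheme.
class HiveStyleNamingProtocol extends FileNamingProtocol {
  override def getTaskTempFilePath(ctx: TaskAttemptContext, fc: FileContext): String = {
    val taskId = ctx.getTaskAttemptID.getTaskID.getId
    val name = f"${fc.prefix.getOrElse("")}$taskId%05d_0${fc.ext}"
    fc.relativeDir.map(d => s"$d/$name").getOrElse(name)
  }
}
```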
I agree that file naming shouldn't be part of the commit protocol, but I think the callers in Spark should decide the file name (e.g. the file format writer, or something similar for Hive tables), not users.
Updated per offline discussion: keep the naming protocol internal, not user-facing via config.
…ated as PathOutputCommitProtocol depends on it
Test build #139958 has finished for PR 32881 at commit
Test build #139959 has finished for PR 32881 at commit
Kubernetes integration test starting
Kubernetes integration test starting
Test build #139964 has finished for PR 32881 at commit
Kubernetes integration test status success
Kubernetes integration test status success
Kubernetes integration test starting
Test build #139968 has finished for PR 32881 at commit
Kubernetes integration test status success
Kubernetes integration test starting
Test build #139970 has finished for PR 32881 at commit
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Co-authored-by: Wenchen Fan <[email protected]>
@cloud-fan - thank you for the offline discussion; I updated the PR to use the discussed approach. Thanks.
Test build #139985 has finished for PR 32881 at commit
Kubernetes integration test starting
Kubernetes integration test status success
Kubernetes integration test starting
Kubernetes integration test status success
Test build #140012 has finished for PR 32881 at commit
  }
- def addFilesWithAbsolutePathUnsupportedError(commitProtocol: String): Throwable = {
+ def addFilesWithAbsolutePathUnsupportedError(protocol: String): Throwable = {
why this change?
@cloud-fan - sorry, this change isn't needed. I was calling it from the naming protocol as well in one iteration. Will change.
      plan: SparkPlan,
      fileFormat: FileFormat,
-     committer: FileCommitProtocol,
+     protocols: (FileCommitProtocol, FileNamingProtocol),
why not pass 2 parameters?
From my side, whenever I add an 11th parameter to a Scala method, IntelliJ marks it as a lint error. Do we have a rule on the number of parameters in Spark?
   */
  final case class FileContext(
      ext: String,
      relativeDir: Option[String],
I'm not very sure about it. It seems clearer to me to let FileNamingProtocol only generate the file name, and have the caller side construct the proper relative path w.r.t. the partition dir.
Same for ext and prefix: can we let the caller side prepend/append the prefix/ext to the generated file name?
All the information in FileContext is not something the impl can customize: the generated file name must have ext at the end, prefix at the beginning, and relativeDir as the parent dir. Then it's better to let the caller side guarantee it.
And the API can simply be named getTaskFileName.
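A sketch of that simplification, assuming (not quoting) a bare-name API where the caller assembles prefix, extension, and directory:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Hypothetical trait name; only the bare file name is protocol-specific.
trait TaskFileNaming {
  def getTaskFileName(ctx: TaskAttemptContext): String
}

object CallerSideNaming {
  // The caller (e.g. the file format writer) enforces the invariants itself:
  // prefix at the beginning, ext at the end, relativeDir as the parent dir.
  def buildTaskFilePath(
      naming: TaskFileNaming,
      ctx: TaskAttemptContext,
      prefix: String,
      ext: String,
      relativeDir: Option[String]): String = {
    val name = prefix + naming.getTaskFileName(ctx) + ext
    relativeDir.map(d => s"$d/$name").getOrElse(name)
  }
}
```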
This API is only to abstract the naming differences between batch and streaming file writing. If that's not necessary, maybe we can remove this abstraction entirely.
jobId is generated with a UUID as well; I don't see why streaming write needs to generate a new UUID per file instead of using the job id.
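For contrast, a sketch of the two naming schemes being compared here; the helper names are made up:

```scala
import java.util.UUID

object NamingSchemes {
  // Batch-style: a single job-level id (itself a UUID) shared by every file of the job.
  def batchName(split: Int, jobId: String, ext: String): String =
    f"part-$split%05d-$jobId$ext"

  // Streaming-style (as in the snippet below): a fresh UUID per file.
  def streamingName(split: Int, ext: String): String =
    f"part-$split%05d-${UUID.randomUUID.toString}$ext"
}
```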
I'm cleaning this up in #33002
  val split = taskContext.getTaskAttemptID.getTaskID.getId
  val uuid = UUID.randomUUID.toString
  val ext = fileContext.ext
  val filename = f"part-$split%05d-$uuid$ext"
shall we fail here if prefix is not None? or support prefix here?
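Both options from that question, sketched against an assumed FileContext shape (ext: String, prefix: Option[String]); neither is necessarily what the PR does:

```scala
import java.util.UUID
import org.apache.hadoop.mapreduce.TaskAttemptContext

object BatchNamingSketch {
  def batchFileName(ctx: TaskAttemptContext, ext: String, prefix: Option[String]): String = {
    // Option 2 (alternative): fail fast instead of silently dropping the prefix:
    // require(prefix.isEmpty, "prefix is not supported by the batch naming protocol")
    val split = ctx.getTaskAttemptID.getTaskID.getId
    val uuid = UUID.randomUUID.toString
    // Option 1: support the prefix by prepending it to the generated name.
    f"${prefix.getOrElse("")}part-$split%05d-$uuid$ext"
  }
}
```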
   */
  def newTaskTempFileAbsPath(
-     taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String
+     taskContext: TaskAttemptContext, relativePath: String, finalPath: String): String
note: I can see that this API makes the code simpler, but it makes the semantics a bit more complicated. What if the final path doesn't have the same file name as the relativePath? Maybe it's better to have fileName: String, targetDir: String. Then the semantics are clear: the impl should commit the new file to the target dir.
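A sketch of that suggested alternative, with a hypothetical trait name, just to make the proposed contract concrete:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

trait CommitProtocolSketch {
  /**
   * Contract: the impl stages a temp file for writing and, on commit, places
   * a file named `fileName` directly under `targetDir`. Returns the temp path.
   */
  def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext,
      fileName: String,
      targetDir: String): String
}
```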
Update: we decided to go with #33012 instead of this PR, as we know some other projects (
What changes were proposed in this pull request?

This PR proposes to decouple the file naming functionality from FileCommitProtocol. Currently FileCommitProtocol mainly does three things: committing task output, committing job output, and naming the output files. A FileCommitProtocol should cover the first two functionalities, but not the third (naming output files). The file commit protocol (by its name and design) should take care of committing output (e.g. renaming files and directories), but it should not also control the file names. It should leave callers the flexibility to specify file names (e.g. a Hive/Presto/Trino-compatible bucket file name is different from what Spark has now - #30003). So this PR decouples file naming from FileCommitProtocol.

The changes are:
- Introduce FileNamingProtocol to specify how output files are named, with the implementations BatchFileNamingProtocol for batch queries and StreamingFileNamingProtocol for streaming queries.
- Change newTaskTempFile and newTaskTempFileAbsPath in FileCommitProtocol so that the commit protocol is notified when a new task output file is added; the input is the relative file name/path, and the output is the full file path.
- Change FileFormatDataWriter to call FileNamingProtocol for the relative file path and FileCommitProtocol for the full file path (see the sketch below).
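Roughly, the write path described above would look like the following; the types and method names follow the PR description, but the exact signatures are assumptions:

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Minimal stand-ins shaped after the description above; the real signatures
// in the PR may differ.
final case class FileContext(ext: String, relativeDir: Option[String], prefix: Option[String])

trait FileNamingProtocol {
  def getTaskTempFilePath(ctx: TaskAttemptContext, fileContext: FileContext): String
}

trait FileCommitProtocol {
  def newTaskTempFile(ctx: TaskAttemptContext, relativePath: String): String
}

object WritePathSketch {
  // FileFormatDataWriter-style flow: the naming protocol yields the relative
  // path, and the commit protocol maps it to the actual path the writer opens.
  def newOutputFilePath(
      naming: FileNamingProtocol,
      committer: FileCommitProtocol,
      ctx: TaskAttemptContext,
      fileContext: FileContext): String = {
    val relativePath = naming.getTaskTempFilePath(ctx, fileContext)
    committer.newTaskTempFile(ctx, relativePath)
  }
}
```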
Why are the changes needed?

To make the commit protocol clearer and to allow future flexibility in specifying Spark output file names.
Pre-requisite of #30003.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing unit tests, e.g. DataFrameReaderWriterSuite.scala and InsertSuite.scala. Will add more unit tests if required. This PR is anyway a code refactoring.