
Conversation

@c21
Contributor

@c21 c21 commented Jun 11, 2021

What changes were proposed in this pull request?

This PR proposes to decouple file naming functionality from FileCommitProtocol. Currently FileCommitProtocol mainly does three things:

  • commits task output by renaming the staging file name to the final file name
  • commits job output by renaming directories for dynamic-partition queries or writes to custom locations
  • specifies the task staging output file name, and the final output file name if needed

A FileCommitProtocol should cover the first two responsibilities, but not the third (naming output files). The file commit protocol (by its name and design) should take care of committing output (e.g. renaming files and directories), but it should not also control what the file name is. It should leave callers the flexibility to specify file names (e.g. a Hive/Presto/Trino-compatible bucket file name differs from what Spark has now - #30003). So this PR decouples file naming from FileCommitProtocol.

The changes are:

  • Introduce an interface, FileNamingProtocol, to specify how output files are named. Add the implementations BatchFileNamingProtocol for batch queries and StreamingFileNamingProtocol for streaming queries.
  • Modify the existing methods newTaskTempFile and newTaskTempFileAbsPath in FileCommitProtocol so the commit protocol is notified when a new task output file is added. The input is the relative file name/path, and the output is the full file path.
  • Change FileFormatDataWriter to call FileNamingProtocol to get the relative file path and FileCommitProtocol to get the full file path.
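The split described above can be sketched roughly as follows. FileNamingProtocol, BatchFileNamingProtocol and newTaskTempFile are names from this PR; the FileContext fields and method shapes here are illustrative assumptions, not the actual patch:

```scala
import java.util.UUID

// Illustrative context handed to the naming protocol (fields assumed).
final case class FileContext(ext: String, relativeDir: Option[String])

// Decides only *what the file is called*, relative to the output root.
trait FileNamingProtocol {
  def getTaskStagingFilePath(ctx: FileContext): String
}

// Example batch naming, loosely mirroring Spark's part-file convention.
class BatchFileNamingProtocol(taskId: Int) extends FileNamingProtocol {
  override def getTaskStagingFilePath(ctx: FileContext): String = {
    val name = f"part-$taskId%05d-${UUID.randomUUID}${ctx.ext}"
    ctx.relativeDir.map(d => s"$d/$name").getOrElse(name)
  }
}

// Decides only *where the file goes and how it is committed*: it takes the
// relative path chosen by the naming protocol, records it for later commit,
// and returns the full staging path.
class FileCommitProtocolSketch(stagingRoot: String) {
  private val added = scala.collection.mutable.Buffer.empty[String]
  def newTaskTempFile(relativePath: String): String = {
    added += relativePath
    s"$stagingRoot/$relativePath"
  }
  def addedFiles: Seq[String] = added.toSeq
}
```

With this shape, FileFormatDataWriter would ask the naming protocol for the relative path and then hand it to the commit protocol, which is the call order the bullet list above describes.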

Why are the changes needed?

To make the commit protocol clearer and to allow future flexibility in specifying Spark output file names.
Prerequisite of #30003.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Existing unit tests, e.g. DataFrameReaderWriterSuite.scala and InsertSuite.scala.
Will add more unit tests if required; this PR is in any case a code refactoring.

@c21
Contributor Author

c21 commented Jun 11, 2021

cc @cloud-fan - could you help take a look when you have time?
Will craft more unit tests once we have consensus on the overall design, thanks.

@SparkQA

SparkQA commented Jun 11, 2021

Test build #139693 has finished for PR 32881 at commit fc60685.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DefaultNamingProtocol(
  • abstract class FileNamingProtocol
  • final case class FileContext(
  • class HadoopMapReduceNamingProtocol(

@SparkQA

SparkQA commented Jun 11, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44227/

@SparkQA

SparkQA commented Jun 11, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44227/

@SparkQA

SparkQA commented Jun 14, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44303/

@SparkQA

SparkQA commented Jun 14, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44303/

@SparkQA

SparkQA commented Jun 14, 2021

Test build #139777 has finished for PR 32881 at commit 6b0cae7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@c21 c21 force-pushed the commit-protocol branch from 6b0cae7 to 1bb6e16 Compare June 15, 2021 05:08
@SparkQA

SparkQA commented Jun 15, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44322/

@SparkQA

SparkQA commented Jun 15, 2021

Test build #139796 has finished for PR 32881 at commit 1bb6e16.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 15, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44332/

@SparkQA

SparkQA commented Jun 15, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44332/

@SparkQA

SparkQA commented Jun 15, 2021

Test build #139805 has finished for PR 32881 at commit 65346ab.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

I'd like to understand more about backward compatibility. If there is an old impl of the file commit protocol, in Spark 3.2 that impl will commit nothing, and this silent "noop" is even worse than an explicit compile error.

This is a developer API; if there is really no way to keep backward compatibility, let's just break the API and force people to correct their impls according to the new API.

Contributor Author

@cloud-fan - I think it's impossible to keep backward compatibility cleanly. Updated here to just introduce the new method and break the API.

Contributor

do we really need to allow users to customize the file naming?

Contributor Author

@cloud-fan - yes, I think so. Before this PR, file naming was part of the commit protocol, and each commit protocol checked into the Spark code base has its own naming specification - HadoopMapReduceCommitProtocol, PathOutputCommitProtocol and ManifestFileCommitProtocol. So we should expect that external commit protocols might already have their own custom way of naming, and we should allow them to implement their own naming protocol.

Contributor

I agree that file naming shouldn't be part of the commit protocol, but I think the callers in Spark (like the file format writer, or something similar for Hive tables) should decide the file name, not users.

Contributor Author

Updated per offline discussion: the naming protocol is kept internal and is not user-facing via config.

@c21 c21 force-pushed the commit-protocol branch from c2fd356 to e221b85 Compare June 18, 2021 03:16
@SparkQA

SparkQA commented Jun 18, 2021

Test build #139958 has finished for PR 32881 at commit c2fd356.

  • This patch fails to build.
  • This patch does not merge cleanly.
  • This patch adds the following public classes (experimental):
  • class PathOutputNamingProtocol(
  • .doc(\"The class name for output file naming protocol. This is used together with \" +
  • s\"$
  • .doc(\"The class name for streaming output file naming protocol. This is used together \" +
  • s\"with $
  • class ManifestFileNamingProtocol(

@SparkQA

SparkQA commented Jun 18, 2021

Test build #139959 has finished for PR 32881 at commit e221b85.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class PathOutputNamingProtocol(
  • .doc(\"The class name for output file naming protocol. This is used together with \" +
  • s\"$
  • .doc(\"The class name for streaming output file naming protocol. This is used together \" +
  • s\"with $
  • class ManifestFileNamingProtocol(

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44485/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44486/

@SparkQA

SparkQA commented Jun 18, 2021

Test build #139964 has finished for PR 32881 at commit 9878f2e.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44485/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44486/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44491/

@SparkQA

SparkQA commented Jun 18, 2021

Test build #139968 has finished for PR 32881 at commit ddc6955.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44491/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44495/

@SparkQA

SparkQA commented Jun 18, 2021

Test build #139970 has finished for PR 32881 at commit f716efe.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44495/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44497/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44497/

@c21
Contributor Author

c21 commented Jun 18, 2021

@cloud-fan - thank you for the offline discussion; I have updated the PR to use the discussed approach. Thanks.

@SparkQA

SparkQA commented Jun 18, 2021

Test build #139985 has finished for PR 32881 at commit db2594d.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44512/

@SparkQA

SparkQA commented Jun 18, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44512/

@SparkQA

SparkQA commented Jun 19, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44538/

@SparkQA

SparkQA commented Jun 19, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44538/

@SparkQA

SparkQA commented Jun 19, 2021

Test build #140012 has finished for PR 32881 at commit 0f3df0f.

  • This patch fails from timeout after a configured wait of 500m.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

def addFilesWithAbsolutePathUnsupportedError(commitProtocol: String): Throwable = {
def addFilesWithAbsolutePathUnsupportedError(protocol: String): Throwable = {
Contributor

why this change?

Contributor Author

@cloud-fan - sorry, this is not needed. I was calling it from the naming protocol as well in one iteration. Will change.

plan: SparkPlan,
fileFormat: FileFormat,
committer: FileCommitProtocol,
protocols: (FileCommitProtocol, FileNamingProtocol),
Contributor

why not pass 2 parameters?

Contributor Author

From my side, whenever I add the 11th parameter to a Scala method, IntelliJ marks it as a lint error. Do we have a rule in Spark about the number of parameters?

*/
final case class FileContext(
ext: String,
relativeDir: Option[String],
Contributor

I'm not very sure about this. It seems clearer to me to let FileNamingProtocol only generate the file name, and have the caller side construct the proper relative path w.r.t. the partition dir.

Contributor

Same for ext and prefix: can we let the caller side prepend/append prefix/ext to the generated file name?

Contributor

All the information in FileContext is not something the impl can customize: the generated file name must have ext at the end, prefix at the beginning, and relativeDir as the parent dir. Then it's better to let the caller side guarantee it.

Contributor

And the API can simply be named getTaskFileName.
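The suggestion above can be sketched like this. Only the name getTaskFileName comes from the review; everything else (the trait, PartFileNaming, buildRelativePath and its parameters) is a hypothetical illustration of the caller assembling prefix, extension and partition directory itself:

```scala
import java.util.UUID

// The naming protocol returns only a bare file name.
trait TaskFileNaming {
  def getTaskFileName(taskId: Int): String
}

object PartFileNaming extends TaskFileNaming {
  override def getTaskFileName(taskId: Int): String =
    f"part-$taskId%05d-${UUID.randomUUID}"
}

// Caller-side assembly: the impl can never get prefix/ext/relativeDir
// wrong, because it never sees them.
def buildRelativePath(
    naming: TaskFileNaming,
    taskId: Int,
    prefix: Option[String],
    ext: String,
    relativeDir: Option[String]): String = {
  val name = prefix.getOrElse("") + naming.getTaskFileName(taskId) + ext
  relativeDir.map(d => s"$d/$name").getOrElse(name)
}
```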

Contributor

This API only abstracts the naming differences between batch and streaming file writing. If that's not necessary, maybe we can remove this abstraction entirely.

Contributor

jobId is generated with a UUID as well; I don't see why streaming writes need to generate a new UUID per file instead of using the job id.

Contributor

I'm cleaning this up in #33002

val split = taskContext.getTaskAttemptID.getTaskID.getId
val uuid = UUID.randomUUID.toString
val ext = fileContext.ext
val filename = f"part-$split%05d-$uuid$ext"
Contributor

shall we fail here if prefix is not None? Or support prefix here?

*/
def newTaskTempFileAbsPath(
taskContext: TaskAttemptContext, absoluteDir: String, ext: String): String
taskContext: TaskAttemptContext, relativePath: String, finalPath: String): String
Contributor

@cloud-fan cloud-fan Jun 21, 2021

note: I can see that this API makes the code simpler, but it makes the semantics a bit more complicated. What if the final path doesn't have the same file name as the relativePath? Maybe it's better to have fileName: String, targetDir: String. Then the semantics are clear: the impl should commit the new file to the target dir.
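A minimal sketch of that alternative signature (illustrative only; the trait and class names here are hypothetical, and only the parameter shape fileName/targetDir comes from the comment above). Passing the name and the target directory separately means the commit protocol cannot return a final path whose file name disagrees with the staging one:

```scala
trait CommitsToTargetDir {
  /** Returns the staging path; the impl must commit `fileName` into `targetDir`. */
  def newTaskTempFileAbsPath(fileName: String, targetDir: String): String
}

// Example impl: stage under a hidden subdirectory, keeping the same file name,
// so the later commit is a simple rename into targetDir.
class StageUnderHiddenDir extends CommitsToTargetDir {
  override def newTaskTempFileAbsPath(fileName: String, targetDir: String): String =
    s"$targetDir/.spark-staging/$fileName"
}
```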

@c21
Contributor Author

c21 commented Jun 27, 2021

Update: we decided to go with #33012 instead of this PR, as we know some other projects (e.g. Delta Lake's DelayedCommitProtocol.newTaskTempFile() in delta/core/src/main/scala/org/apache/spark/sql/delta/files/DelayedCommitProtocol.scala) are using the existing APIs to customize file names, so we don't want to break them for now.

@c21 c21 closed this Jun 27, 2021