[SPARK-35843][SQL] Unify the file name between batch and streaming file writers #33002
Conversation
    d => new Path(workDir, d)
  }.getOrElse(workDir)
- val file = new Path(parent, getFilename(taskContext, ext))
+ val file = new Path(parent, getFilename(ext))
I'm not sure if this change affects this new committer, but I think it should be a positive change. The file name now uses the task attempt id instead of the partition id, which is "more unique".
Commit protocols MUST NOT contain any assumptions about filenames. It would be silly.
Well, almost none. Try creating a file with .pending or .pendingset in the magic committer and it'd get very confused. (Maybe we should change that to something really obscure...)
  // the file name is fine and won't overflow.
  val split = taskContext.getTaskAttemptID.getTaskID.getId
- f"part-$split%05d-$jobId$ext"
+ f"part-$taskId%05d-$jobId$ext"
A more aggressive way is to simply use a fresh UUID here, but I'm not sure if that's better. cc @zsxwing
Previously it used the task id after part-; now this is the taskAttemptId. Is it still the same format as before, e.g. part-xxxxx-?
The value will be very different. For one query, the partition id always starts from 0, but the task attempt id is unique within a Spark application and won't be reset for a new query.
If we do want to keep the part-00000 prefix for some reason, we can also apply the naming rule from streaming to batch. I don't know who cares about the final naming; the commit protocols I'm aware of only care about file listing.
The part # is handy for a bit of blame assignment. FWIW the name "part" can be configured via "mapreduce.output.basename". Don't know if anyone does.
Now, the v2 committer, whose lack of task commit idempotency is well known, is only going to be able to recover from a failure partway through task attempt commit if the second attempt creates files with the same name. This should not be a barrier to having better names as, well, it's still broken.
> But task attempt id is unique within a spark application and won't be reset for a new query.

Is this true? I think we really need to understand the differences between Spark job, task and attempt IDs and the YARN ones, which, as we know, had duplicate job IDs until SPARK-33402.
Spark has: job id -> stage id -> partition ID
job id is simply a UUID
stage id is an integer starting from 0, globally unique within the spark application
partition ID is an integer starting from 0, unique within a stage
Each task not only has a partition ID but also an attempt ID, which is an integer starting from 0 and globally unique within the Spark application. There is also an attempt number, which starts from 0 and increases by one for each attempt of this task.
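For reference, a minimal sketch of how these IDs can be read from inside a running task, using the public TaskContext accessors (illustrative only, not code from this PR):

```scala
import org.apache.spark.TaskContext

// Minimal sketch: reading the IDs discussed above from inside a running task.
val tc = TaskContext.get()
val stageId       = tc.stageId()        // Int, globally unique within the Spark application
val partitionId   = tc.partitionId()    // Int, starts from 0, unique only within a stage
val attemptNumber = tc.attemptNumber()  // Int, starts from 0, +1 for each retry of this task
val taskAttemptId = tc.taskAttemptId()  // Long, globally unique within the Spark application
```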
Kubernetes integration test unable to build dist. exiting with code: 1
  // the file name is fine and won't overflow.
  val split = taskContext.getTaskAttemptID.getTaskID.getId
  val uuid = UUID.randomUUID.toString
  val filename = f"part-$split%05d-$uuid$ext"
> 2. The file output committer for streaming does not use staging directories. It writes files to the final path directly and uses a manifest file to track the committed files. Thus, partition ID is not sufficient to avoid file name collision. That's why we add a fresh UUID to the file name.

Could you explain this? Currently ManifestFileCommitProtocol should always pick up a new uuid for each file.
Currently it does so to avoid file name collision, but I think it's overkill and we can use "task attempt id + job id" to avoid name collision as well, which is more consistent with the batch side.
It may also be useful to include the job id in the file name like the batch side does, so that people can see which files were written by the same job.
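As a rough sketch of the two naming approaches being compared (hedged; the helper names and parameters here are illustrative, not the actual ManifestFileCommitProtocol or HadoopMapReduceCommitProtocol code):

```scala
import java.util.UUID
import org.apache.spark.TaskContext

// What the streaming writer did: a fresh UUID per file, so names never collide
// even across retries, but files from the same write job are hard to relate.
def perFileUuidName(partitionId: Int, ext: String): String =
  f"part-$partitionId%05d-${UUID.randomUUID().toString}$ext"

// What is proposed here: one job-level UUID shared by all files of the write job,
// plus the task attempt id, which is already unique within the Spark application.
def unifiedName(jobId: String, ext: String): String = {
  val taskAttemptId = TaskContext.get.taskAttemptId()
  f"part-$taskAttemptId%05d-$jobId$ext"
}
```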
Test build #140084 has finished for PR 33002 at commit
viirya left a comment
Looks fine. If we'd like to have consistent filenames, how about using a common method to assign/generate the filenames?
ok, this gets into a world of fun; thanks for mentioning me.
steveloughran left a comment
I don't see any problems with this. We have had problems with the S3A staging committer (and apparently occasionally with the classic FileOutputCommitter) from >1 job starting in the same second and the generated YARN job ID being duplicated. Having #30319 in is a prerequisite here.
Now, the S3A committers (and I should do the same for the IntermediateManifestCommitter of apache/hadoop#2971) will use the value of "spark.sql.sources.writeJobUUID" as their unique ID in preference to anything else. That is, if it is set, they trust Spark. This is what Spark used to do, stopped doing for a bit, then had restored. Please can you keep this option and set it to the job UUID? Thanks.
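For illustration, a hedged sketch of how a committer might prefer the Spark-provided UUID over generating its own (the config key comes from the comment above; the helper method itself is hypothetical):

```scala
import java.util.UUID
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical helper: prefer Spark's write-job UUID if it was set on the job
// configuration, otherwise fall back to a self-generated one.
def resolveJobUuid(context: JobContext): String =
  Option(context.getConfiguration.get("spark.sql.sources.writeJobUUID"))
    .getOrElse(UUID.randomUUID().toString)
```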
  // To avoid file name collision, we should generate a new job ID for every write job, instead
  // of using batchId, as we may use the same batchId to write files again, if the streaming job
  // fails and we restore from the checkpoint.
  val jobId = java.util.UUID.randomUUID().toString
One change of SPARK-33402 was including some timestamp/version info. That's potentially quite handy later, just to see when things were created and in what order.
This is the job id for the Spark file commit protocol. In the Hadoop JobId, we do append the timestamp info: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L266
But that's a different story.
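For context, a hedged sketch of the kind of timestamp-based Hadoop job ID being referred to (the exact format Spark uses lives in its Hadoop writer utilities; this is only an approximation):

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale}
import org.apache.hadoop.mapreduce.JobID

// Approximate sketch: a job tracker ID derived from the current time, so the
// resulting Hadoop JobID carries creation-time information in its string form.
val jobTrackerId = new SimpleDateFormat("yyyyMMddHHmmss", Locale.US).format(new Date())
val hadoopJobId  = new JobID(jobTrackerId, 0)
println(hadoopJobId.toString)  // e.g. job_20240101123000_0000
```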
  // collide if the file name also includes job ID. The Hadoop task id is equivalent to Spark's
  // partitionId, which is not unique within the write job, for cases like task retry or
  // speculative tasks.
  val taskId = TaskContext.get.taskAttemptId()
Specifically: the Hadoop Task ID MUST be the same for all task attempts, so that committers can commit the output of more than one task attempt by renaming the task attempt dir to output/_temporary/jobAttempt/taskID; as only one task commit can do this (assuming the fs has atomic rename; Google GCS doesn't), you get unique output. My WiP manifest committer creates a JSON manifest with the task ID in the filename for the same reason: only one file can be committed by file rename (atomic on GCS as well as Azure).
> Hadoop Task ID MUST be the same for all task attempts

This doesn't change: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L268
This PR only unifies the file name generated by the builtin Spark file commit protocol, and doesn't change anything in the Hadoop Job/Task setting.
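To make that concrete, a hedged sketch of how the Hadoop IDs are typically composed on the Spark side, so the TaskID stays stable across attempts while each attempt gets its own TaskAttemptID (approximate, not the exact FileFormatWriter code):

```scala
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

// Sketch: the Hadoop TaskID is derived from the Spark partition id, so every
// attempt of the same partition shares the same TaskID; only the attempt number
// (the last component of the TaskAttemptID) differs between retries.
def hadoopTaskAttemptId(jobId: JobID, sparkPartitionId: Int, attemptNumber: Int): TaskAttemptID =
  new TaskAttemptID(new TaskID(jobId, TaskType.MAP, sparkPartitionId), attemptNumber)
```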
  // Spark task attempt ID here.
  val taskId = TaskContext.get.taskAttemptId()
  // The file name looks like part-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
  // Note that %05d does not truncate the taskId, so if we have more than 100000 tasks,
taskId -> taskAttemptId?
  // speculative tasks.
  val taskId = TaskContext.get.taskAttemptId()
  // The file name looks like part-00000-2dd664f9-d2c4-4ffe-878f-c6c70c1fb0cb_00003.gz.parquet
  // Note that %05d does not truncate the taskId, so if we have more than 100000 tasks,
same, taskId -> taskAttemptId?
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
What changes were proposed in this pull request?
Currently, the batch and streaming file writers generate the file name a bit differently: the batch writer names files with the partition id plus the job id, while the streaming writer adds a fresh UUID per file.
The reason for it is that the streaming commit protocol does not use staging directories; it writes files to the final path directly and uses a manifest file to track the committed files, so the partition ID alone is not sufficient to avoid file name collisions.
This PR proposes to unify the file name, by putting task attempt ID (which must be unique within the job) and the job ID in the file name.
Why are the changes needed?
Remove confusion when people try to understand how Spark generates file names. We can further refactor the code later to move the file name generation outside of the output committer.
Does this PR introduce any user-facing change?
No. The read side doesn't care about file names at all, only about how to list files. No backward compatibility issues.
How was this patch tested?
Existing tests.