[SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode #29000
Conversation
gentle ping @cloud-fan @dongjoon-hyun @xuanyuanking @turboFei @LuciferYang @advancedxy

ok to test

Test build #124977 has finished for PR 29000 at commit

Force-pushed from a4d99d0 to 6921f22

Test build #124990 has finished for PR 29000 at commit
val stagingDir = new File(d, ".spark-staging-jobId")
stagingDir.mkdirs()
val conflictTaskFile = new File(stagingDir, "part-00000-jobId-c000.snappy.parquet")
conflictTaskFile.createNewFile()
Sorry, the UT is wrong; I have fixed it:
val stagingDir = new File(d, ".spark-staging-jobId")
stagingDir.mkdirs()
val stagingPartDir = new File(stagingDir, "p1=2")
stagingPartDir.mkdirs()
val conflictTaskFile = new File(stagingPartDir, "part-00000-jobId.c000.snappy.parquet")
conflictTaskFile.createNewFile()
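For reference, a minimal sketch of how a test along these lines could drive the scenario end to end. This is not the exact UT in the PR; it assumes the test-only "test.jobId" option discussed later in this thread is honored by the commit protocol, and the temp path and values are illustrative.

import java.io.File
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[2]").getOrCreate()
import spark.implicits._
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

val d = new File("/tmp/spark-27194-demo")                          // illustrative output dir
// Simulate a leftover file from a partly-committed (e.g. preempted) task attempt.
val stagingPartDir = new File(new File(d, ".spark-staging-jobId"), "p1=2")
stagingPartDir.mkdirs()
new File(stagingPartDir, "part-00000-jobId.c000.snappy.parquet").createNewFile()

// Dynamic partition overwrite into the same output path; with the fix this should
// succeed instead of failing with FileAlreadyExistsException.
Seq((1, 2)).toDF("id", "p1").write
  .option("test.jobId", "jobId")                                   // test-only hook from this PR
  .partitionBy("p1")
  .mode("overwrite")
  .parquet(d.getCanonicalPath)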
I also recreated another PR, #28989. In that PR, I define a Spark staging output committer to leverage the OutputCommitCoordinator.
Thank you for the reminder; this UT is based on your PR #26339, I'll correct it.
@advancedxy thanks
Force-pushed from 6921f22 to 9b4f8f3

Test build #125006 has finished for PR 29000 at commit
Just left some comments. This PR does resolve the issue, but it also involves some costs. I prefer #28989; in that PR, I define a Spark staging output committer based on the current implementation of HadoopMapReduceCommitProtocol.

@turboFei Thanks for your comments. Actually I think there is no partition-explosion cost. The commit task output dir is generated by

Thanks for your reply @WinkerDu, I was wrong about that. In #28989 I thought taskAttemptContext.getTaskAttemptId.getId was the same as Spark's taskAttemptId, so it would create at most several (the largest task attempt number) staging partition dirs.
Thanks for your reply @turboFei

ok to test

Test build #125501 has finished for PR 29000 at commit

retest this please @cloud-fan @dongjoon-hyun

gentle ping @cloud-fan @dongjoon-hyun @SparkQA to retest this pr

You can push a new commit (e.g. two commits, with the second one reverting the first) to trigger the Jenkins job.

Thanks for your advice, will try it :)

Test build #126035 has finished for PR 29000 at commit

gentle ping @dongjoon-hyun @cloud-fan @xuanyuanking to review this PR :)
xuanyuanking left a comment:
The root cause of this issue is that the speculative task and the normal task share the final output dir in dynamic overwrite mode. Please emphasize this in the PR description.
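For readers following the thread, these are the settings that put a job in the affected configuration (both keys are standard Spark configs, and the table names are placeholders; an actual collision still requires a speculative attempt racing the original attempt of the same task):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("dynamic-overwrite-speculation")
  .config("spark.speculation", "true")                             // allow duplicate attempts of slow tasks
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")   // dynamic partition overwrite
  .getOrCreate()

// Both the original and the speculative attempt of a task compute the same destination
// file name (it contains the task id, not the attempt id), so pre-fix they can race on
// the same path under outputPath/.spark-staging-{jobId}.
spark.sql("INSERT OVERWRITE TABLE target PARTITION (p1) SELECT id, p1 FROM source")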
this.stagingDir
// For FileOutputCommitter it has its own staging path called "work path".
case f: FileOutputCommitter =>
  handleDynamicPartitionOverwrite(dir)
Since we changed the behavior, please also update the comment: https://github.com/apache/spark/pull/29000/files#diff-d97cfb5711116287a7655f32cd5675cbR43
OK
new Path(Option(f.getWorkPath).map(_.toString).getOrElse(path))
case _ => new Path(path)
case _ =>
  handleDynamicPartitionOverwrite(dir)
If both case branches need to call handleDynamicPartitionOverwrite, can we call it outside the case match?
Since this PR leverages FileOutputCommitter to deal with the commit collision, we only handle dynamic partition overwrite in the first branch.
BTW, it seems all committers practically used in Spark are derived from FileOutputCommitter.
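Putting the diff fragments above together, the selection logic being discussed has roughly this shape (a sketch only, written as a free-standing function; handleDynamicPartitionOverwrite stands in for the helper added in this PR and is passed in here just to keep the snippet self-contained):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.OutputCommitter
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

def taskStagingPath(
    committer: OutputCommitter,
    path: String,
    dynamicPartitionOverwrite: Boolean,
    handleDynamicPartitionOverwrite: () => Path): Path = committer match {
  // In dynamic partition overwrite mode, route FileOutputCommitter through the
  // PR's helper instead of its own "work path".
  case _: FileOutputCommitter if dynamicPartitionOverwrite =>
    handleDynamicPartitionOverwrite()
  // Otherwise FileOutputCommitter keeps using its work path as before.
  case f: FileOutputCommitter =>
    new Path(Option(f.getWorkPath).map(_.toString).getOrElse(path))
  // Other committers write straight to the given path.
  case _ =>
    new Path(path)
}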
// For SPARK-27194 unit test, we try to set constant jobId carried by options
val jobId = options.getOrElse("test.jobId", java.util.UUID.randomUUID().toString)
Can we try to reproduce the file collision without adding this extra option?
Good catch, I'll try to pre-create a partition file with a customized file commit protocol to keep this code clean.
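One way such a test-only commit protocol might look. This is entirely hypothetical (the class name and the hard-coded partition/file name are illustrative); it only sketches the idea of pre-creating a conflicting file from inside the protocol instead of threading a test.jobId option through the production code:

import java.io.File
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Hypothetical test-only protocol: before the task writes, drop a conflicting file
// into the job's staging partition dir to simulate leftovers from a failed attempt.
class ConflictingFileCommitProtocol(jobId: String, path: String, dynamicPartitionOverwrite: Boolean)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) {

  override def setupTask(taskContext: TaskAttemptContext): Unit = {
    super.setupTask(taskContext)
    val stagingPartDir = new File(new File(path, s".spark-staging-$jobId"), "p1=2")
    stagingPartDir.mkdirs()
    new File(stagingPartDir, s"part-00000-$jobId.c000.snappy.parquet").createNewFile()
  }
}

A test could then point spark.sql.sources.commitProtocolClass at this class for the duration of the test session.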
Test build #126618 has finished for PR 29000 at commit

Test build #126621 has finished for PR 29000 at commit

Test build #126629 has finished for PR 29000 at commit

Kubernetes integration test status failure

Test build #131350 has finished for PR 29000 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #131363 has finished for PR 29000 at commit

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #131368 has finished for PR 29000 at commit

retest this please

Kubernetes integration test starting

Kubernetes integration test status failure

Test build #131552 has finished for PR 29000 at commit

Kubernetes integration test starting

Kubernetes integration test status success

Test build #131560 has finished for PR 29000 at commit

Test build #131607 has finished for PR 29000 at commit

Test build #131634 has finished for PR 29000 at commit

Force-pushed from 45f8ea5 to 85aa12a

Test build #131682 has finished for PR 29000 at commit

GA passed, merging to master!
@WinkerDu do you have a JIRA account?

@cloud-fan yes, I have a JIRA account named 'duripeng'

Thanks all for the patch review!

Congrats on your first contribution, Ripeng! :)

Congrats on your first contribution, Ripeng! :) +1
* {appAttemptId}/{taskId}/a=1/b=1,
* then move them to
* /path/to/outputPath/.spark-staging-{jobId}/a=1/b=1.
* 2. When [[FileOutputCommitter]] algorithm version set to 2,
So this isn't the normal behavior of algorithm version 2, right? Normally it writes the task files directly to the final output location. The whole point of algorithm 2 is to prevent all of the extra moves on the driver at the end of the job; for large jobs that time can be huge. I'm not sure of the benefit of algorithm 2 here, because all of that is now happening distributed on each task?
v2 isn't safe in the presence of failures during task commit; at least here, if the entire job fails then, provided job ids are unique, the output doesn't become visible. It is essentially a second attempt at the v1 rename algorithm with (hopefully) smaller output datasets.
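For reference, the committer algorithm version being debated is controlled by a standard Hadoop setting that can be passed through Spark config (real keys, not introduced by this PR; which value is appropriate is exactly the trade-off above):

import org.apache.spark.sql.SparkSession

// v1: tasks commit into a job temporary dir and the driver moves everything once at job commit.
// v2: task commit moves files directly to the final output dir (faster job commit, but not
//     atomic across the job, which is the safety caveat raised above).
val spark = SparkSession.builder()
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()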
What changes were proposed in this pull request?
When using dynamic partition overwrite, each task has its working dir under a staging dir like stagingDir/.spark-staging-{jobId}, and each task commits to outputPath/.spark-staging-{jobId}/{partitionId}/part-{taskId}-{jobId}{ext}. When speculation is enabled, multiple task attempts may be set up for one task; they have the same task id and would commit to the same file concurrently. If the partly-committed files aren't cleaned up (e.g. due to a host going down or node preemption), a FileAlreadyExistsException is raised, resulting in job failure.
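To make the collision concrete, here is the destination path both attempts of one task resolve to (a small illustration; the values are placeholders, and the name format follows the description above):

// Hypothetical values for one task that has an original and a speculative attempt.
val outputPath = "/warehouse/db.db/t"
val jobId      = "6e1a3f0c"                     // in practice a per-job UUID
val partition  = "p1=2"
val taskId     = 0                              // identical for both attempts
val ext        = ".c000.snappy.parquet"

// Pre-fix, both attempts target the same file, because the name depends only on
// taskId and jobId, not on the attempt number.
val dest = f"$outputPath/.spark-staging-$jobId/$partition/part-$taskId%05d-$jobId$ext"
// => /warehouse/db.db/t/.spark-staging-6e1a3f0c/p1=2/part-00000-6e1a3f0c.c000.snappy.parquet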
I don't try to change the task commit process for dynamic partition overwrite (e.g. adding the attempt id to each attempt's task working dir and committing to the final output dir via a new outputCommitCoordinator), for these reasons:
- FileOutputCommitter already has a commit coordinator for task attempts; we can leverage it rather than build a new one.
- The FileAlreadyExistsException risk would still exist.

In this PR, I leverage FileOutputCommitter to solve the problem:

- Use outputPath/.spark-staging-{jobId} as the output dir.
- Task attempts write to outputPath/.spark-staging-{jobId}/_temporary/${appAttemptId}/_temporary/${taskAttemptId}/{partitionId}/part-{taskId}-{jobId}{ext}.
- With the FileOutputCommitter coordinator, the write job first commits output to outputPath/.spark-staging-{jobId}/{partitionId}.
- Then move outputPath/.spark-staging-{jobId}/{partitionId} to outputPath/{partitionId} (sketched below).
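A compact sketch of that final move step (illustrative only; the real logic lives in the commit protocol's job-commit path, and the paths follow the layout listed above):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// After FileOutputCommitter has committed task output into
// outputPath/.spark-staging-{jobId}/{partitionId}, move each committed partition
// dir to its final location and drop the staging dir.
def movePartitionsToFinal(outputPath: String, jobId: String, conf: Configuration): Unit = {
  val fs = FileSystem.get(conf)
  val stagingDir = new Path(outputPath, s".spark-staging-$jobId")
  fs.listStatus(stagingDir).filter(_.isDirectory).foreach { p =>
    val finalDir = new Path(outputPath, p.getPath.getName)   // e.g. outputPath/p1=2
    if (fs.exists(finalDir)) fs.delete(finalDir, true)        // overwrite the old partition
    fs.rename(p.getPath, finalDir)
  }
  fs.delete(stagingDir, true)
}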
Why are the changes needed?
Without this PR, dynamic partition overwrite can fail with a FileAlreadyExistsException when task attempts collide (e.g. when speculation is enabled).
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added a UT.