
Conversation

@WinkerDu (Contributor) commented Jul 5, 2020

What changes were proposed in this pull request?

When using dynamic partition overwrite, each task works under a staging dir like outputPath/.spark-staging-{jobId}, and each task commits its output to outputPath/.spark-staging-{jobId}/{partitionId}/part-{taskId}-{jobId}{ext}.
When speculation is enabled, multiple task attempts may be set up for one task; they share the same task id, so they commit to the same file concurrently. If an attempt is lost due to host failure or node preemption, its partly-committed files are not cleaned up, and a later attempt then hits a FileAlreadyExistsException, resulting in job failure.

This PR does not change the task commit process for dynamic partition overwrite itself (e.g. adding the attempt id to each attempt's working dir and committing to the final output dir via a new OutputCommitCoordinator), for two reasons:

  1. FileOutputCommitter already coordinates commits among task attempts, so we can leverage it rather than build a new coordinator.
  2. Even with a new coordinator resolving task attempt commit conflicts, a severe case such as application master failover can still produce tasks with the same attempt id and task id committing to the same files, so the FileAlreadyExistsException risk would remain.

In this PR, I leverage FileOutputCommitter to solve the problem:

  1. When initializing the write job description, set outputPath/.spark-staging-{jobId} as the output dir.
  2. Each task attempt writes its output to outputPath/.spark-staging-{jobId}/_temporary/{appAttemptId}/_temporary/{taskAttemptId}/{partitionId}/part-{taskId}-{jobId}{ext}.
  3. Leveraging the FileOutputCommitter coordination, the write job first commits the output to outputPath/.spark-staging-{jobId}/{partitionId}.
  4. For dynamic partition overwrite, the write job finally moves outputPath/.spark-staging-{jobId}/{partitionId} to outputPath/{partitionId} (see the path sketch below).
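
A minimal sketch of the path layout described above, written against the Hadoop Path API; the concrete output path and the partition value p1=2 are illustrative only, not the exact Spark internals:

```scala
import org.apache.hadoop.fs.Path

// Illustrative values; in Spark these come from the write job description.
val outputPath  = new Path("/path/to/outputPath")
val jobId       = java.util.UUID.randomUUID().toString
val partitionId = "p1=2"

// 1. Job-level staging dir handed to FileOutputCommitter as its output dir.
val jobStagingDir = new Path(outputPath, s".spark-staging-$jobId")

// 2. Each task attempt writes under the committer's attempt-scoped temp dir:
//    {jobStagingDir}/_temporary/{appAttemptId}/_temporary/{taskAttemptId}/{partitionId}/part-...
// 3. On task commit, FileOutputCommitter promotes the attempt output to:
val committedPartitionDir = new Path(jobStagingDir, partitionId)

// 4. On job commit with dynamic partition overwrite, the staged partition dir
//    is moved to its final location:
val finalPartitionDir = new Path(outputPath, partitionId)
// e.g. fs.rename(committedPartitionDir, finalPartitionDir)  -- done by the commit protocol
```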

Why are the changes needed?

Without this PR, dynamic partition overwrite can fail with FileAlreadyExistsException when a speculative or retried task attempt collides with a previous, partly-committed attempt.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a unit test.

@WinkerDu WinkerDu changed the title [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic parti… [SPARK-27194][SPARK-29302][SQL] Fix commit collision in dynamic partition overwrite mode Jul 5, 2020
@WinkerDu (Contributor, Author) commented Jul 5, 2020

Gentle ping @cloud-fan @dongjoon-hyun @xuanyuanking @turboFei @LuciferYang @advancedxy
Please have a review, thanks.

@dongjoon-hyun (Member)

ok to test

@SparkQA commented Jul 5, 2020

Test build #124977 has finished for PR 29000 at commit a4d99d0.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WinkerDu force-pushed the master-fix-dynamic-partition-multi-commit branch from a4d99d0 to 6921f22 on July 6, 2020 01:03
@SparkQA commented Jul 6, 2020

Test build #124990 has finished for PR 29000 at commit 6921f22.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val stagingDir = new File(d, ".spark-staging-jobId")
stagingDir.mkdirs()
val conflictTaskFile = new File(stagingDir, "part-00000-jobId-c000.snappy.parquet")
conflictTaskFile.createNewFile()
@turboFei (Member) commented Jul 6, 2020

Sorry, the UT is wrong; I have fixed it:
val stagingDir = new File(d, ".spark-staging-jobId")
stagingDir.mkdirs()
val stagingPartDir = new File(stagingDir, "p1=2")
stagingPartDir.mkdirs()
val conflictTaskFile = new File(stagingPartDir, "part-00000-jobId.c000.snappy.parquet")
conflictTaskFile.createNewFile()

I also re-created another PR, #28989.
In that PR, I define a Spark staging output committer that leverages the OutputCommitCoordinator.

@WinkerDu (Contributor, Author) replied:

Thanks for the reminder. This UT is based on your PR #26339; I'll correct it.

@advancedxy (Contributor) commented:

So, @turboFei, would you prefer #28989? If so, maybe a bit of clarification should be added in this PR thread.

@turboFei (Member) replied:

@advancedxy thanks

@WinkerDu force-pushed the master-fix-dynamic-partition-multi-commit branch from 6921f22 to 9b4f8f3 on July 6, 2020 02:48
@SparkQA commented Jul 6, 2020

Test build #125006 has finished for PR 29000 at commit 9b4f8f3.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@LuciferYang (Contributor) commented:

```
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-install-plugin:3.0.0-M1:install (default-cli) on project spark-parent_2.12: ArtifactInstallerException: Failed to install metadata org.apache.spark:spark-parent_2.12/maven-metadata.xml: Could not parse metadata /home/jenkins/.m2/repository/org/apache/spark/spark-parent_2.12/maven-metadata-local.xml: in epilog non whitespace content is not allowed but got t (position: END_TAG seen ...\nt... @26:2) -> [Help 1]
```
Is there any problem with the CI environment, @dongjoon-hyun?

@turboFei (Member) commented Jul 6, 2020

Just left some comments.

This PR does resolve the issue, but it also involves some costs.
In this PR, under dynamic partition overwrite mode, each task might create multiple partition paths under a unique task attempt output path.
In fact, dynamic partition overwrite already tends to produce too many small files if the user does not repartition by the dynamic partition columns.
So I am afraid that this PR might create a large number of directories at runtime.

I prefer #28989, where I define a Spark staging output committer based on the current implementation of HadoopMapReduceCommitProtocol.

@WinkerDu (Contributor, Author) commented Jul 7, 2020

@turboFei Thanks for your comments. Actually, I think there is no partition-explosion cost: the commit task output dir is generated by FileFormatDataWriter.newOutputWriter, and no matter whether the task is static-partitioned or dynamic-partitioned, one commit task only deals with one partition dir.

@turboFei (Member) commented Jul 9, 2020

Thanks for your reply @WinkerDu

I was wrong about that. In #28989 I thought taskAttemptContext.getTaskAttemptID.getId was the same as Spark's task attempt number, so it would create at most a few (up to the largest task attempt number) staging partition dirs.
But taskAttemptContext.getTaskAttemptID.getId is also a unique id, so #28989 would also create multiple staging partition dirs for each task. I will close that PR.

@WinkerDu (Contributor, Author) commented Jul 9, 2020

Thanks for your reply @turboFei
Gentle ping @dongjoon-hyun @cloud-fan @xuanyuanking to review this PR, thank you.

@cloud-fan (Contributor)

ok to test

@SparkQA commented Jul 9, 2020

Test build #125501 has finished for PR 29000 at commit 9b4f8f3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WinkerDu (Contributor, Author)

retest this please @cloud-fan @dongjoon-hyun

@WinkerDu (Contributor, Author)

gentle ping @cloud-fan @dongjoon-hyun @SparkQA to retest this PR

@turboFei (Member)

> gentle ping @cloud-fan @dongjoon-hyun @SparkQA to retest this PR

You can push a new commit (e.g. two commits, where the second reverts the first) to trigger the Jenkins job.

@WinkerDu (Contributor, Author)

> You can push a new commit (e.g. two commits, where the second reverts the first) to trigger the Jenkins job.

thanks for your advice, will try it :)

@SparkQA commented Jul 17, 2020

Test build #126035 has finished for PR 29000 at commit 4766830.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WinkerDu (Contributor, Author)

Gentle ping @dongjoon-hyun @cloud-fan @xuanyuanking to review this PR :)

@xuanyuanking (Member) left a comment

The root cause of this issue is that the speculative task and the normal task share the final output dir in dynamic overwrite mode. Please emphasize this in the PR description.

this.stagingDir
// For FileOutputCommitter it has its own staging path called "work path".
case f: FileOutputCommitter =>
handleDynamicPartitionOverwrite(dir)

@WinkerDu (Contributor, Author) replied:

OK

new Path(Option(f.getWorkPath).map(_.toString).getOrElse(path))
case _ => new Path(path)
case _ =>
handleDynamicPartitionOverwrite(dir)
A reviewer (Member) commented:

If both case branches need to call handleDynamicPartitionOverwrite, can we call it outside the match instead?

@WinkerDu (Contributor, Author) replied:

Since this PR leverages FileOutputCommitter to deal with the commit collision, dynamic partition overwrite is only handled in the first branch.
BTW, it seems all committers practically used in Spark are derived from FileOutputCommitter.
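
Pulling the two diff fragments above together, the branching under discussion looks roughly like this; it is a reconstruction for readability, not the exact merged code, and the helper name is taken from the fragments above:

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.OutputCommitter
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Resolve the base path a task attempt writes under. FileOutputCommitter
// exposes its own staging location ("work path"); other committers fall back
// to the configured job path.
def taskAttemptBasePath(committer: OutputCommitter, path: String): Path = committer match {
  case f: FileOutputCommitter =>
    // dynamic partition overwrite is handled only in this branch (see reply above)
    new Path(Option(f.getWorkPath).map(_.toString).getOrElse(path))
  case _ =>
    new Path(path)
}
```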

}

// For the SPARK-27194 unit test, allow a constant jobId to be passed in via options
val jobId = options.getOrElse("test.jobId", java.util.UUID.randomUUID().toString)
A reviewer (Member) commented:

Can we try to reproduce the file collision without adding this extra option?

@WinkerDu (Contributor, Author) replied:

Good catch. I'll try to pre-create a partition file with a customized file commit protocol to keep this code clean.
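
For illustration, a hypothetical test-only commit protocol along those lines could look like the sketch below; the class name, the pre-created file path, and the partition value are invented for this example and are not the code that was eventually committed:

```scala
import java.io.File
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Hypothetical test helper: before the task writes anything, pre-create the
// file a failed speculative attempt would have left behind, so the test can
// check that the write path tolerates an already-existing output file.
class PreCreateConflictCommitProtocol(jobId: String, path: String, dynamicPartitionOverwrite: Boolean)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite) {

  override def setupTask(taskContext: TaskAttemptContext): Unit = {
    super.setupTask(taskContext)
    // Illustrative conflicting file under the job's staging dir.
    val conflict = new File(path, s".spark-staging-$jobId/p1=2/part-00000-$jobId.c000.snappy.parquet")
    conflict.getParentFile.mkdirs()
    conflict.createNewFile()
  }
}
```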

@SparkQA commented Jul 27, 2020

Test build #126618 has finished for PR 29000 at commit 2e3d03c.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2020

Test build #126621 has finished for PR 29000 at commit 84b7093.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 27, 2020

Test build #126629 has finished for PR 29000 at commit 5865f51.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 19, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35954/

@SparkQA commented Nov 19, 2020

Test build #131350 has finished for PR 29000 at commit 85aa12a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35967/

@SparkQA commented Nov 19, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35967/

@SparkQA commented Nov 19, 2020

Test build #131363 has finished for PR 29000 at commit 6efba79.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 19, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35972/

@SparkQA commented Nov 19, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/35972/

@SparkQA commented Nov 19, 2020

Test build #131368 has finished for PR 29000 at commit b50ca37.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

retest this please

@SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36155/

@SparkQA commented Nov 23, 2020

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36155/

@SparkQA commented Nov 23, 2020

Test build #131552 has finished for PR 29000 at commit b50ca37.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 23, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36162/

@SparkQA commented Nov 23, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36162/

@SparkQA commented Nov 23, 2020

Test build #131560 has finished for PR 29000 at commit 0d57d92.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 24, 2020

Test build #131607 has finished for PR 29000 at commit 0d57d92.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 24, 2020

Test build #131634 has finished for PR 29000 at commit 45f8ea5.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@WinkerDu force-pushed the master-fix-dynamic-partition-multi-commit branch from 45f8ea5 to 85aa12a on November 24, 2020 16:38
@SparkQA commented Nov 24, 2020

Test build #131682 has finished for PR 29000 at commit 85aa12a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

GA passed, merging to master!

@cloud-fan closed this in 7c59aee on Nov 25, 2020
@cloud-fan (Contributor)

@WinkerDu do you have a JIRA account?

@WinkerDu (Contributor, Author)

@cloud-fan yes, I have a JIRA account named 'duripeng'

@WinkerDu (Contributor, Author)

Thank you all for the patch review!

@xuanyuanking (Member)

Congrats on your first contribution, Ripeng! :)

@LuciferYang (Contributor)

Congrats on your first contribution, Ripeng! :) +1

* {appAttemptId}/{taskId}/a=1/b=1,
* then move them to
* /path/to/outputPath/.spark-staging-{jobId}/a=1/b=1.
* 2. When [[FileOutputCommitter]] algorithm version set to 2,
A reviewer (Contributor) commented:

So this isn't the normal behavior of algorithm version 2, right? Normally it writes the task files directly to the final output location. The whole point of algorithm 2 is to avoid all of the extra moves on the driver at the end of the job; for large jobs that time can be huge. I'm not sure of the benefit of algorithm 2 here, since all of that is now happening distributed on each task?

Another reviewer (Contributor) replied:

v2 isn't safe in the presence of failures during task commit. At least here, if the entire job fails then, provided job ids are unique, the output doesn't become visible; it is essentially a second attempt at the v1 rename algorithm with (hopefully) smaller output datasets.
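
For context, the committer algorithm version being discussed is controlled by a Hadoop setting that Spark forwards via the spark.hadoop.* prefix; a minimal example of pinning the v1 (rename-on-job-commit) algorithm for a session:

```scala
import org.apache.spark.sql.SparkSession

// Force the FileOutputCommitter v1 algorithm (v2 commits task output directly
// to the final location at task-commit time instead of at job commit).
val spark = SparkSession.builder()
  .appName("committer-algorithm-example")
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "1")
  .getOrCreate()
```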
