@sharkdtu
Contributor

writeShard in saveAsNewAPIHadoopDataset always commits its tasks unconditionally. When speculation is enabled, this can result in multiple attempts of the same task committing their output to the same path, which may leave task temporary paths in the output path after saveAsNewAPIHadoopFile completes:

-rw-r--r--    3   user group       0   2017-02-11 19:36 hdfs://.../output/_SUCCESS
drwxr-xr-x    -   user group       0   2017-02-11 19:36 hdfs://.../output/attempt_201702111936_32487_r_000044_0
-rw-r--r--    3   user group    8952   2017-02-11 19:36 hdfs://.../output/part-r-00000
-rw-r--r--    3   user group    7878   2017-02-11 19:36 hdfs://.../output/part-r-00001

Assume two attempts of the same task commit at the same time. Both may try to rename their task attempt paths to the task committed path simultaneously. When one attempt's rename completes first, the other attempt's rename moves its task attempt path underneath the task committed path.
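The interleaving described above can be replayed on a local filesystem. This is only a rough sketch: `shutil.move` stands in for the HDFS rename (both move the source inside the destination when the destination is an existing directory), and the directory names and the `simulate_double_commit` helper are hypothetical, not Spark or Hadoop code.

```python
import os
import shutil
import tempfile


def simulate_double_commit(root):
    """Replay the racy interleaving: two attempts rename to one committed path."""
    # Hypothetical committed task path shared by both attempts.
    committed = os.path.join(root, "task_000044")
    attempts = []
    for attempt_id in (0, 1):
        attempt_dir = os.path.join(root, f"attempt_000044_{attempt_id}")
        os.makedirs(attempt_dir)
        with open(os.path.join(attempt_dir, "part-r-00044"), "w") as f:
            f.write(f"data from attempt {attempt_id}\n")
        attempts.append(attempt_dir)

    # Assumption: both attempts already passed any "does the committed path
    # exist?" check, so neither cleans up before renaming.
    for attempt_dir in attempts:
        # First move creates `committed`; second move lands inside it.
        shutil.move(attempt_dir, committed)

    return sorted(os.listdir(committed))


with tempfile.TemporaryDirectory() as root:
    print(simulate_double_commit(root))
```

The second attempt's directory ends up nested inside the committed path next to the first attempt's part file, which is how a stray `attempt_...` directory could later surface in the final output.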

In any case, writeShard in saveAsNewAPIHadoopDataset should not commit its tasks unconditionally. A similar issue, SPARK-4879, triggered by calling saveAsHadoopFile, has already been solved, and the newest master has solved this one too. This PR only fixes 2.1.

@AmplabJenkins

Can one of the admins verify this patch?

@lw-lin
Contributor

lw-lin commented Feb 14, 2017

To me, this PR aims to also use the driver to coordinate Hadoop output committing for saveAsNewAPIHadoopFile -- actually the same was added for saveAsHadoopFile back in #4066.
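As a rough illustration of what driver-side commit coordination means, the sketch below lets only the first attempt that asks for a given partition commit. The `CommitCoordinator` class and its method names are hypothetical and are not Spark's actual implementation; it only shows the "first asker wins, others are denied" idea.

```python
import threading


class CommitCoordinator:
    """Hypothetical driver-side arbiter: one winning attempt per partition."""

    def __init__(self):
        self._lock = threading.Lock()
        self._winners = {}  # partition -> attempt number allowed to commit

    def can_commit(self, partition, attempt):
        # The first attempt to ask becomes the winner; later ones are refused.
        with self._lock:
            winner = self._winners.setdefault(partition, attempt)
            return winner == attempt

    def attempt_failed(self, partition, attempt):
        # If the winner fails before committing, free the slot for a retry.
        with self._lock:
            if self._winners.get(partition) == attempt:
                del self._winners[partition]


coordinator = CommitCoordinator()
print(coordinator.can_commit(44, 0))  # first attempt is authorized
print(coordinator.can_commit(44, 1))  # speculative duplicate is refused
```

With such a check in front of the commit, a speculative duplicate would skip the rename entirely instead of racing the winning attempt.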

It seems issues have been reported with the current saveAsNewAPIHadoopFile -- like in #4066 by @matrixlibing. But this issue only exists prior to 2.2.0.

So @JoshRosen would you share your thoughts on this? Thanks!

@HyukjinKwon
Member

Hi @sharkdtu, do you have any opinion on the comment above? If this PR is inactive, I would propose closing it.

@sharkdtu sharkdtu closed this May 13, 2017

4 participants