[WIP][SPARK-29037][Core] Spark gives duplicate result when an application was killed #25795
Conversation
ok to test
Test build #110617 has finished for PR 25795 at commit
| test("SPARK-29037: Spark gives duplicate result when an application was killed") { |
Thanks for adding this. This looks like a good IT. If possible,
- Can we have a concise UT in the core module, because you are touching HadoopMapReduceCommitProtocol?
- Otherwise, can we move this to the sql/core module instead of sql/hive, because you are using Parquet?
dongjoon-hyun left a comment
Ur, @turboFei. Unfortunately, the UT seems to pass without your PR. Could you check that?
$ git diff master --stat
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
$ build/sbt "hive/testOnly *.HiveQuerySuite" -Phive
...
[info] - SPARK-29037: Spark gives duplicate result when an application was killed (1 second, 177 milliseconds)
[info] ScalaTest
[info] Run completed in 51 seconds, 894 milliseconds.
[info] Total number of tests run: 137
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 137, failed 0, canceled 0, ignored 2, pending 0
[info] All tests passed.
[info] Passed: Total 137, Failed 0, Errors 0, Passed 137, Ignored 2
[success] Total time: 233 s, completed Sep 15, 2019 2:21:42 PM
retest this please
@dongjoon-hyun
Test build #110637 has finished for PR 25795 at commit
The failed test
Test build #110725 has finished for PR 25795 at commit
Hi, @cloud-fan and @gatorsmile.
val taskAttemptContext = new TaskAttemptContextImpl(jobContext.getConfiguration, taskAttemptId)
committer = setupCommitter(taskAttemptContext)
committer.setupJob(jobContext)
if (!dynamicPartitionOverwrite) {
We need to add comments to explain it. It looks to me that the Hadoop output committer doesn't support concurrent writing to the same directory by design, so there is nothing we can do on the Spark side.
The fix here is to avoid using the Hadoop output committer when dynamicPartitionOverwrite=true. I'm fine with this fix.
BTW, when writing a partitioned table with dynamicPartitionOverwrite=false, can we support it as well?
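A minimal sketch of what avoiding the committer for dynamicPartitionOverwrite=true could look like, assuming the job-level calls are simply guarded by that flag (the class and field names here are illustrative, not the exact PR diff):

import org.apache.hadoop.mapreduce.{JobContext, JobStatus, OutputCommitter}

// Hedged sketch: forward job-level calls to the Hadoop committer only when
// dynamicPartitionOverwrite is false; in the dynamic case Spark manages its
// own staging directory and moves files per partition itself.
class CommitProtocolSketch(
    committer: OutputCommitter,
    dynamicPartitionOverwrite: Boolean) {

  def setupJob(jobContext: JobContext): Unit = {
    if (!dynamicPartitionOverwrite) {
      committer.setupJob(jobContext) // creates <output>/_temporary/0
    }
  }

  def commitJob(jobContext: JobContext): Unit = {
    if (!dynamicPartitionOverwrite) {
      committer.commitJob(jobContext) // promotes whatever is under _temporary/0, including leftovers
    }
  }

  def abortJob(jobContext: JobContext): Unit = {
    if (!dynamicPartitionOverwrite) {
      committer.abortJob(jobContext, JobStatus.State.FAILED)
    }
  }
}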
also cc @advancedxy
and for a non-partitioned table, can we clean up the staging dir when the job is killed?
@advancedxy discussed with me offline about writing a partitioned table with dynamicPartitionOverwrite=false.
He suggested that we can add a JobAttemptPath (_temporary/0) existence check when dynamicPartitionOverwrite=false.
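A rough sketch of such a check, assuming the job attempt path follows the default FileOutputCommitter layout of <output>/_temporary/0 (the method name is made up for illustration):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: before a static overwrite starts, fail fast if another
// (possibly killed) job already left a job attempt directory behind,
// instead of silently sharing and later committing its _temporary/0 output.
def assertNoPendingJobAttempt(outputPath: Path, conf: Configuration): Unit = {
  val fs = outputPath.getFileSystem(conf)
  val jobAttemptPath = new Path(outputPath, "_temporary/0")
  if (fs.exists(jobAttemptPath)) {
    throw new IllegalStateException(
      s"Found existing job attempt path $jobAttemptPath; " +
        "another write to this location may be in progress or was killed.")
  }
}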
For a non-partitioned table, dynamicPartitionOverwrite is false and the staging dir is under the JobAttemptPath (_temporary/0), so I think the staging dir will be cleaned up by FileOutputCommitter.abortJob().
I think the staging dir will be cleaned up by FileOutputCommitter.abortJob().
Why can't it be cleaned up when dynamicPartitionOverwrite=true?
When a job is killed, its staging dir can be cleaned up by the abortJob method.
But when an application is killed, its job's staging dir will not be cleaned up gracefully.
I think the staging dir will be cleaned up by FileOutputCommitter.abortJob().
Why can't it be cleaned up when dynamicPartitionOverwrite=true?
For the case in the PR description, it happens when appA (static partition overwrite) is killed and its staging dir is not cleaned up gracefully; then appB commits partial results of appA.
OK, so we can't rely on the job cleanup, and ideally we should use a different staging dir for each job.
That said, it seems we can't fix the problem for non-partitioned tables if we continue to use the Hadoop output committer.
Yes, the solution proposed by advancedxy is adding a job attempt path existence check for non-partitioned tables and static partition overwrite operations.
And the implementation of InsertIntoHiveTable uses a different staging dir for each job.
Test build #110760 has finished for PR 25795 at commit
Thanks @steveloughran for the detailed explanation! It seems very complicated to make Spark support concurrent and object-store-friendly file writing. We should ask users to try object-store-first data structures such as Delta and Iceberg. I'd like to narrow down the scope. Spark should at least guarantee data correctness; duplicated data should not happen. @turboFei @advancedxy is it possible to detect concurrent writes and fail the query?
I think it is feasible to detect concurrent writes.
when
BTW
No, setupCommitter will not create the
Yes, UUID is not a concern, but we can check whether the directory name starts with
Yeah, this is what we discussed offline. To fail fast, Spark should check output existence to detect concurrent writes to the same partitions.
Yes, it's hard for Spark to support concurrent writes. But the scope of #25739 and this one is to support concurrent writes to different partitions, which can be a nice feature and not that hard to achieve.
@advancedxy can you give a complete proposal for it? Basically the requirements:
Test build #110799 has finished for PR 25795 at commit
All right, I think the requirements can be split into two parts:
I believe the proposed approach should cover concurrent writes and the case in this PR. WDYT @cloud-fan, @turboFei and @wangyum?
SGTM
So #25739 is to support concurrent writes to different locations, and this PR is to detect concurrent writes to the same location and fail fast?
With my knowledge.
I have discussed with advancedxy offline, and I am clear about the solution now.
How does it work for
Looks like there is no easy way to detect concurrent writes with
The user won't get a duplicated result in this case, but the result could be messed up (some parts replaced by other parts, while some parts remain) when writing concurrently, since no transaction is involved.
@cloud-fan wrote
Mostly it's about S3, that being the hard one to work with. GCS is consistent; file rename O(1); dir rename atomic O(files). Azure Datalake Gen2 (ABFS connector) works exactly like a real filesystem. But they all have latency issues, directory listing overheads, and the other little costs that make switching to them tricky. S3 is really a fault injection framework for all your assumptions about what a store is.
+1. What those object-store-first data structures promise is not just correctness in the presence of that fault injection layer; they also scale well in a world where enumerating files under a directory tree is no longer a low-to-medium-cost operation. W.r.t. dynamic partition updates, for AWS S3 the best thing to do may be to say "don't", at least not this way. But you still have to get it right everywhere else.
I concur. Now, one thing stepping through the FileOutputFormat code in a debugger while taking notes showed me is that the critical commit algorithms everyone relies on are barely documented and under-tested when it comes to all their failure paths. And once you realise that the v2 commit algorithm doesn't meet the expectations of MR or Spark, you worry about the correctness of any application which has used it. Now seems a good time not just to get the dynamic partition overwrite code to work, but to specify that algorithm well enough to be able to convince others of its correctness on a consistent file system. I did try to start with a TLA+ spec of what S3 did, but after Lamport told me I misunderstood the nature of "eventually" I gave up. I hear nice things about Isabelle/HOL these days; maybe someone sufficiently motivated could try to use it. BTW, if you haven't read it, look at Stocator: A High Performance Object Store Connector for Spark by @gilv et al, who use task ID information in the file names to know what to roll back in the presence of failure. That is: rather than trying to get commit right, they get abort right. (Returns to debugging a console log where, somehow, cached 404s in the S3 load balancer are still causing problems despite HADOOP-16490 fixing it.)
@cloud-fan We can unify the staging dir for both dynamic and static insert overwrite, based on the specified static partition key-value pairs:

private def stagingDir = {
  val stagingPath = ".spark-staging-" + jobId + "/" +
    staticPartitionKVS.map(kv => "sp_" + kv._1 + "=" + kv._2).mkString(File.separator)
  new Path(path, stagingPath)
}

For example:

insert overwrite table ta partition(p1=1,p2,p3) select ...
// stagingDir: .spark-staging-${UUID}/sp_p1=1
insert overwrite table ta partition(p1=1,p2=2,p3) select ...
// stagingDir: .spark-staging-${UUID}/sp_p1=1/sp_p2=2
insert overwrite table ta select ...
// stagingDir: .spark-staging-${UUID}

The stagingDir will be cleaned up after the job finishes. Before each insert, we should check the paths whose names start with the .spark-staging prefix. For two paths A and B, if A is fully contained by B or B is fully contained by A, they cannot write concurrently. For example:
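A small sketch of that containment rule, assuming the check runs over the encoded target paths of the two jobs (purely illustrative, not the PR's code):

import org.apache.hadoop.fs.Path

// Hedged sketch: two overwrite jobs conflict if one job's target path is an
// ancestor of (or equal to) the other's, i.e. one is fully contained by the other.
def conflicts(a: Path, b: Path): Boolean = {
  def isAncestorOrSelf(x: Path, y: Path): Boolean = {
    var cur = y
    while (cur != null) {
      if (cur == x) return true
      cur = cur.getParent
    }
    false
  }
  isAncestorOrSelf(a, b) || isAncestorOrSelf(b, a)
}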
About whether to use a UUID in the staging dir, it is determined by the implementation of #25739. In addition, this PR was originally dedicated to resolving the duplicate result issue. I have created a new PR #25863 for the duplicate result issue; hope it makes sense. Thanks.
The problem here is: how do you detect
We will find a longest path with
Is there any way to block on the commit process to ensure that exactly one job can be committing at the same time, e.g. some lease file with a timestamp inside whose (unexpired) presence is a sign someone else is committing? For all filesystems where (I'm ignoring inconsistent storage without CRUD consistency or atomic creates, obviously).
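For filesystems with an atomic create-no-overwrite, such a lease might look roughly like this (the marker file name and timeout are invented for illustration):

import java.io.IOException
import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: acquire a commit "lease" by atomically creating a marker file.
// If the marker already exists and is younger than the timeout, someone else is
// committing; if it is older, treat it as stale and retry after removing it.
def tryAcquireCommitLease(
    outputPath: Path,
    conf: Configuration,
    leaseTimeoutMs: Long = 10 * 60 * 1000L): Boolean = {
  val fs = outputPath.getFileSystem(conf)
  val lease = new Path(outputPath, "_COMMIT_LEASE")
  try {
    val out = fs.create(lease, false) // fails if the lease file already exists
    out.write(System.currentTimeMillis().toString.getBytes(StandardCharsets.UTF_8))
    out.close()
    true
  } catch {
    case _: IOException =>
      val age = System.currentTimeMillis() - fs.getFileStatus(lease).getModificationTime
      if (age > leaseTimeoutMs) {
        fs.delete(lease, false) // stale lease from a dead committer
        tryAcquireCommitLease(outputPath, conf, leaseTimeoutMs)
      } else {
        false
      }
  }
}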
Thanks @steveloughran for mentioning our paper on Stocator. (This is a better reference: Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage.) It is better to characterize Stocator w.r.t. commit rather than abort. Stocator only commits. The last commit wins, and any parts that do not belong are ignored when the partition is read (they can also be deleted at that time to reclaim space, or space can be reclaimed by an independent garbage collection process). This can be achieved by writing a manifest at commit time that lists the parts belonging to the commit, e.g. by extending the _SUCCESS file/object. Then, when the partition is read, only the parts listed in the manifest of the most recent commit are read. Maybe a similar solution would be easier here, rather than trying to ensure only one commit at a time.
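The read side of that manifest idea might look roughly like this (storing the committed part names in _SUCCESS is an assumption for illustration, not Stocator's actual format):

import scala.io.Source
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hedged sketch: at read time, keep only the part files named in the manifest
// of the most recent commit, so orphaned parts from killed or concurrent
// writers are ignored (and could be garbage-collected later).
def listCommittedParts(partitionDir: Path, conf: Configuration): Seq[Path] = {
  val fs = partitionDir.getFileSystem(conf)
  val manifest = new Path(partitionDir, "_SUCCESS")
  val committed = Source.fromInputStream(fs.open(manifest)).getLines().toSet
  fs.listStatus(partitionDir)
    .map(_.getPath)
    .filter(p => committed.contains(p.getName))
    .toSeq
}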
@hkkolodner, thanks for the clarification. FWIW, in apache/hadoop#1442 I am actually stripping back on enumerating all files in the _SUCCESS manifest I've been creating (for test validation only), because at a sufficiently large terasort and TPC-DS scale you end up with memory issues in job commits, which makes my colleagues trying to do these things unhappy. I may be overreacting (it's happening at an earlier stage), but I'm just being thorough in reviewing data structures built in a commit. And even if there's enough heap, I don't want to force a many-MB upload at the end of every query, as that becomes a bottleneck of its own. Like I say, the post-directory-listing table layouts are the inevitable future. The directory tree has been great: tool neutral, easy to navigate by hand, etc., but it hits scale limits even in HDFS, serious perf limits in higher-latency stores, and the commit-by-rename mechanism is running out of steam. I don't have enough experience with any of the alternatives to have strong opinions, except to conclude that yes, they are inevitable.
We're closing this PR because it hasn't been updated in a while. If you'd like to revive this PR, please reopen it!
What changes were proposed in this pull request?
Case:
Application appA runs an insert overwrite on table table_a with static partition overwrite.
But it was killed while committing tasks, because one task hung,
and parts of its committed task output were left under /path/table_a/_temporary/0/.
Then we run application appB, an insert overwrite on table table_a with dynamic partition overwrite.
It executes successfully.
But it also commits the data left under /path/table_a/_temporary/0/ to the destination dir.
In this PR, we skip the FileOutputCommitter.{setupJob, commitJob, abortJob} operations when dynamicPartitionOverwrite is enabled.
Why are the changes needed?
Data may be corrupted without this PR.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Existing Unit Test.