[SPARK-28945][SPARK-29037][CORE][SQL] Fix the issue that spark gives duplicate result and support concurrent file source write operations write to different partitions in the same table. #25863
Conversation
Force-pushed from e5deb55 to fa60fc9.
How about just cleaning up the staging dir before the write job starts?
ok to test
Test build #111047 has finished for PR 25863 at commit
@turboFei it seems your description does not reflect the changes in this PR.
The staging dir may be used by another job, so we cannot clean it up directly.
Test build #111061 has finished for PR 25863 at commit
Force-pushed from c3791f7 to eedad23.
Test build #111073 has finished for PR 25863 at commit
Test build #111075 has finished for PR 25863 at commit
Test build #111109 has finished for PR 25863 at commit
Force-pushed from a4f33d9 to 98aa212.
In this PR, for the partition overwrite operation, I set a unique output path. I have run the previously failing UTs locally, and they all passed.
retest this please.
Test build #111116 has finished for PR 25863 at commit
In this PR, I set a unique output dir for the partition overwrite operation, covering both dynamic and static partition overwrite.
With this PR, the InsertOverwrite operation for a partitioned table will not reuse the InsertInto operation for a partitioned table, and InsertOverwrite/InsertInto for a non-partitioned table will still use
In the new commit.
Test build #111125 has finished for PR 25863 at commit
Force-pushed from f87aeb3 to 8098a8f.
Test build #111516 has finished for PR 25863 at commit
* @param isInsertIntoHadoopFsRelation whether is a InsertIntoHadoopFsRelation operation
* @param escapedStaticPartitionKVs static partition key and value pairs, which have been escaped
*/
case class FileSourceWriteDesc(
Since this might be used/subclassed by users, we may just use a normal class.
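A minimal sketch of that suggestion, using the two fields from the quoted doc comment; the field types and the `Serializable` mixin are assumptions of this illustration, not what the PR ships:

```scala
// Hypothetical sketch: a plain (non-case) class, so user code can subclass it.
// Field names follow the quoted doc comment; types are assumptions.
class FileSourceWriteDesc(
    val isInsertIntoHadoopFsRelation: Boolean,
    val escapedStaticPartitionKVs: Seq[(String, String)])
  extends Serializable {
  // Subclasses can carry extra write metadata without breaking the base contract.
}
```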
| " dynamicPartitionOverwrite)") | ||
| instantiate(className, jobId, outputPath, dynamicPartitionOverwrite) | ||
| } | ||
| } |
Kind of, but I think we can also add dynamicPartitionOverwrite to fileSourceWriteDesc.
For a user-defined FileCommitProtocol class, we can extract it from fileSourceWriteDesc and pass it to the old instantiate method.
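A hedged sketch of that fallback: only the existing four-argument FileCommitProtocol.instantiate is real Spark API here, while FileSourceWriteDesc and its dynamicPartitionOverwrite field are the additions this comment proposes:

```scala
import org.apache.spark.internal.io.FileCommitProtocol

// Hypothetical helper: for user-defined FileCommitProtocol classes that only
// know the old 2-/3-argument constructors, extract the flag from the descriptor
// and delegate to the existing instantiate entry point.
def instantiateWithDesc(
    className: String,
    jobId: String,
    outputPath: String,
    desc: FileSourceWriteDesc): FileCommitProtocol = {
  FileCommitProtocol.instantiate(
    className, jobId, outputPath, desc.dynamicPartitionOverwrite)
}
```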
context.getConfiguration.set(FileOutputFormat.OUTDIR, stagingOutputPath.toString)
logWarning("Set file output committer algorithm version to 2 implicitly," +
  " for that the task output would be committed to staging output path firstly," +
  " which is equivalent to algorithm 1.")
15843 is a lot; however, it would not be that much within one Spark application.
One way to solve this is to use an object-level counter so that only the first warning (or the first few warnings) is logged.
But I am not sure that's worth it. Also, the head of the logs may get rotated and discarded...
Using logDebug is fine too, but normally users won't set the log level to DEBUG.
I am not sure which one is better. It's up to you then.
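One way the "log only once" idea could look, as a rough sketch; the object name and the wiring through a passed-in logging function are illustrative only, not what the PR does:

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Hypothetical guard: logs the implicit-algorithm warning only once per JVM,
// instead of once per committer instance.
object StagingOutputWarning {
  private val warned = new AtomicBoolean(false)

  def warnOnce(logWarning: String => Unit): Unit = {
    if (warned.compareAndSet(false, true)) {
      logWarning("Set file output committer algorithm version to 2 implicitly, " +
        "for that the task output would be committed to staging output path firstly, " +
        "which is equivalent to algorithm 1.")
    }
  }
}
```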
checkAnswer(spark.sql("select a, b from t where b = 1"), df1)
checkAnswer(spark.sql("select a, b from t where b = 2"), df2)
}
}
Ah, that's a limitation of DataFrameWriter; we may need to extend DataFrameWriter to support that.
But currently, I think we can simply use the SQL syntax, since we can use spark.sql and get the same behaviour.
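For illustration, the SQL route could look like this; the table and column names are only examples, and `spark` is an in-scope SparkSession as in the quoted test:

```scala
// Static partition overwrite expressed through SQL instead of DataFrameWriter.
spark.sql("CREATE TABLE t (a INT, b INT) USING parquet PARTITIONED BY (b)")
spark.sql("INSERT OVERWRITE TABLE t PARTITION (b = 1) SELECT a FROM source_data")

// Dynamic partition overwrite through SQL; for file sources this also needs
// spark.sql.sources.partitionOverwriteMode=dynamic.
spark.sql("INSERT OVERWRITE TABLE t PARTITION (b) SELECT a, b FROM source_data")
```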
advancedxy left a comment
Sorry for the late reply.
Also, the comments should be updated.
spark/core/src/main/scala/org/apache/spark/internal/io/FileCommitProtocol.scala
Lines 34 to 37 in 8098a8f
* 2. Implementations should have a constructor with 2 or 3 arguments:
*    (jobId: String, path: String) or
*    (jobId: String, path: String, dynamicPartitionOverwrite: Boolean)
* 3. A committer should not be reused across multiple Spark jobs.
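For reference, a user committer honouring the documented 3-argument constructor could look roughly like this; the class name and the empty body are illustrative:

```scala
import org.apache.spark.internal.io.HadoopMapReduceCommitProtocol

// Illustrative subclass exposing the documented
// (jobId: String, path: String, dynamicPartitionOverwrite: Boolean) constructor,
// so FileCommitProtocol.instantiate can create it reflectively.
class MyCommitProtocol(
    jobId: String,
    path: String,
    dynamicPartitionOverwrite: Boolean)
  extends HadoopMapReduceCommitProtocol(jobId, path, dynamicPartitionOverwrite)
```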
refactor the code.

retest this please.
Test build #111569 has finished for PR 25863 at commit
This scares me.

You are proposing using reflection to get at private superclass methods that are not part of the public API, where we aren't even going to be aware of the fact that they are being used this way and have no guarantees of stability whatsoever. And of course, the same applies for anything which implements the new superclass. Overall then, you are doing things during commit operations for which it is going to be very hard to prove correctness, and which are vulnerable to people doing things in the MR codebase: we could fundamentally break things, either at link time, or worse, in the semantics of the operations you are trying to perform.

I don't see an easy solution here, but as someone with write access to the MR codebase, I'm happy to discuss how we could make things easier. For example, someone could actually pull out the

In an ideal world, how we commit should be opaque to the caller. You commit a task and its work is promoted into the set of committed tasks' work, or you don't get a response from the task's process, in which case you can safely commit another task, confident that irrespective of what has gone wrong with the task which failed/partitioned/hung, only one task's output will ever become visible, and after the job has been committed or aborted the output of a partitioned task will never appear. V2 doesn't offer those guarantees, which is why I consider it broken.

One thing I have thought about adding to
I've looked at the code a bit more. As I noted earlier, this scares me. In the FileOutputCommitter, the FS itself is the synchronization point, with assumptions about atomicity and performance implicitly implemented in the code. The applications which use the committer have their own assumptions about atomicity and performance, which are derived transitively from those of the file system and then extended by assumptions about the correctness of the algorithms. Things are bad enough as they are. I am not convinced that relying on the internals of FileOutputCommitter versions is the safe way to do this.

I think you are better off specifying a commit protocol which is explicitly designed for writing files into the destination, and then implementing it. For S3A, knowing the destination path lets us initiate but not complete the upload to the final path; we would then propagate the information needed to manifest that file to the job committer. The "Magic" committer does exactly this by recognising that when someone writes to

That is very much black magic in the FS connector. Having a commit protocol where you could ask the committer for an output path to use when writing to a specific destination would let you eliminate that trick completely and possibly help with this problem.

The other thing to consider is that Spark permits committed tasks to pass serialized data back to the driver, as an alternative to relying on the file system. Scale issues notwithstanding, task committers should be able to provide the information needed to include the task in the job's final output. For example, rather than have task commit rename files, it could just return a path to where it has stored its list of files to commit. Job commit then becomes a matter of loading those files and moving the output into the final destination. Again, this is essentially what we do with the S3A committer: we just save, propagate and reload those files within the committer.

This all gets complicated fast, even without worrying about concurrent jobs. But trying to use
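To make the "pass serialized data back to the driver" point concrete: in Spark's FileCommitProtocol, commitTask already returns a TaskCommitMessage that the driver hands to commitJob. A rough sketch under that API, with the manifest payload and the promotion step purely illustrative:

```scala
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.spark.internal.io.FileCommitProtocol
import org.apache.spark.internal.io.FileCommitProtocol.TaskCommitMessage

// Illustrative only: tasks report the files they wrote instead of renaming them,
// and all promotion into the final destination happens at job commit.
abstract class ManifestCommitProtocol(jobId: String, path: String)
  extends FileCommitProtocol {

  // In a real committer this list would be collected while the task writes files.
  protected def filesWrittenByThisTask: Seq[String]

  override def commitTask(taskContext: TaskAttemptContext): TaskCommitMessage = {
    new TaskCommitMessage(filesWrittenByThisTask)
  }

  override def commitJob(
      jobContext: JobContext,
      taskCommits: Seq[TaskCommitMessage]): Unit = {
    // The driver sees every committed task's manifest in one place and can
    // move or publish those files into the final destination.
    val allFiles = taskCommits.flatMap(_.obj.asInstanceOf[Seq[String]])
    // ... rename/publish allFiles under `path` ...
  }
}
```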
Thanks @steveloughran
Test build #112119 has finished for PR 25863 at commit
Force-pushed from fac7b7b to fa66a5b (add concurrent ut, extract doInsertion, set logDebug and modify comment).
Test build #114890 has finished for PR 25863 at commit
Test build #114891 has finished for PR 25863 at commit
Test build #114901 has finished for PR 25863 at commit
Test build #115045 has finished for PR 25863 at commit
What changes were proposed in this pull request?
For InsertIntoHadoopFsRelation operations:
Case A:
Application appA runs insert overwrite on table table_a with static partition overwrite.
It is killed while committing tasks, because one task hangs,
and parts of its committed task output are left under /path/table_a/_temporary/0/.
Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/ and executes successfully,
but it also commits the data left behind by the killed application to the destination dir.
Case B:
Application appA runs insert overwrite on table table_a.
Application appB runs insert overwrite on table table_a, too.
They execute concurrently, and they may both use /path/table_a/_temporary/0/ as the work path,
so their results may be corrupted.
In this PR, we set a staging output path for the InsertIntoHadoopFsRelation operation.
The output path is a multi-level path composed of the specified partition key-value pairs, formatted as
.spark-staging-${depth}/p1=v1/p2=v2/.../pn=vn. We can detect a conflict by checking the existence of the corresponding staging output path.
For example:
table_a(c1 string, a string, b string, ...) partitioned by (a, b, ...)
When writing to partition a=1, we need to check the existence of...
When the corresponding staging output path already exists, an exception is thrown.
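A sketch of what the conflict check described above might look like; the helper name, the exception type, and rooting the staging dir under the table location are assumptions of this illustration, not necessarily the PR's exact code:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Hypothetical helper: build .spark-staging-${depth}/p1=v1/.../pn=vn from the
// escaped static partition key-value pairs and fail fast if it already exists.
def checkAndGetStagingDir(
    tableLocation: Path,
    staticPartitionKVs: Seq[(String, String)],
    hadoopConf: Configuration): Path = {
  val depth = staticPartitionKVs.length
  val stagingRoot = new Path(tableLocation, s".spark-staging-$depth")
  val stagingDir = staticPartitionKVs.foldLeft(stagingRoot) {
    case (parent, (k, v)) => new Path(parent, s"$k=$v")
  }

  val fs = stagingDir.getFileSystem(hadoopConf)
  if (fs.exists(stagingDir)) {
    // Another concurrent (or killed and not cleaned up) write to the same
    // partition is detected by the existence of its staging output path.
    throw new IllegalStateException(s"Staging output path $stagingDir already exists")
  }
  stagingDir
}
```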
Thanks @advancedxy for providing the prototype of solution and valuable suggestions.
Why are the changes needed?
Data may be corrupted without this PR.
Does this PR introduce any user-facing change?
Users can see the staging output paths and may need to clean them up manually when a path belongs to a killed application and has not been cleaned up gracefully.
How was this patch tested?
Added unit test.