Conversation

@turboFei
Member

@turboFei turboFei commented Sep 20, 2019

What changes were proposed in this pull request?

For InsertIntoHadoopFsRelation operations.

Case A:
Application appA ran an insert overwrite on table table_a with static partition overwrite.
It was killed while committing tasks because one task hung,
and part of its committed task output was left under /path/table_a/_temporary/0/.

Then we rerun appA. It reuses the staging dir /path/table_a/_temporary/0/.
It executes successfully,
but it also commits the data left behind by the killed application to the destination dir.

Case B:

Application appA runs an insert overwrite on table table_a.

Application appB runs an insert overwrite on table table_a, too.

They execute concurrently, and they may both use /path/table_a/_temporary/0/ as their work path.

Their results may be corrupted.

In this PR, we set a staging output path for each InsertIntoHadoopFsRelation operation.

The output path is a multi-level path composed of the specified static partition key-value pairs, formatted as .spark-staging-${depth}/p1=v1/p2=v2/.../pn=vn.

We can detect conflicts by checking for the existence of the related staging output paths.

For example:

table_a(c1 string, a string, b string, ...) partitioned by (a, b, ...)

When writing to partition a=1, we need to check the existence of:

  1. .spark-staging-0
  2. .spark-staging-1/a=1
  3. .spark-staging-2/a=1
     ...
  n. .spark-staging-numPartitions/a=1

If a related staging output path already exists, an exception is thrown.
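For illustration, a minimal sketch of this check; the helper names (stagingPath, checkConflict) are hypothetical, not the PR's actual code:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

object StagingPathConflicts {

  private def joinKVs(base: Path, kvs: Seq[(String, String)]): Path =
    kvs.foldLeft(base) { case (p, (k, v)) => new Path(p, s"$k=$v") }

  // The staging dir a write with the given static partition KVs would use,
  // e.g. Seq("a" -> "1") yields <tablePath>/.spark-staging-1/a=1.
  def stagingPath(tablePath: Path, kvs: Seq[(String, String)]): Path =
    joinKVs(new Path(tablePath, s".spark-staging-${kvs.length}"), kvs)

  // Check .spark-staging-0 up to .spark-staging-<numPartitionCols>, each
  // joined with as many of this write's KVs as that depth can share.
  def checkConflict(
      fs: FileSystem,
      tablePath: Path,
      kvs: Seq[(String, String)],
      numPartitionCols: Int): Unit = {
    (0 to numPartitionCols).foreach { depth =>
      val candidate =
        joinKVs(new Path(tablePath, s".spark-staging-$depth"), kvs.take(depth))
      if (fs.exists(candidate)) {
        throw new IllegalStateException(
          s"Staging path $candidate exists; a concurrent or killed write may own it.")
      }
    }
  }
}
```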

Thanks @advancedxy for providing the prototype of the solution and for valuable suggestions.

Why are the changes needed?

Data may be corrupted without this PR.

Does this PR introduce any user-facing change?

Users may see leftover staging output paths and may need to clean them up manually, when those paths belong to a killed application and have not been cleaned up gracefully.

How was this patch tested?

Added unit test.

@turboFei turboFei changed the title [SPARK-29037] For static partition overwrite, spark may give duplicate result. [WIP][SPARK-29037] For static partition overwrite, spark may give duplicate result. Sep 20, 2019
@turboFei turboFei changed the title [WIP][SPARK-29037] For static partition overwrite, spark may give duplicate result. [SPARK-29037] For static partition overwrite, spark may give duplicate result. Sep 20, 2019
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-29037] For static partition overwrite, spark may give duplicate result. [SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. Sep 20, 2019
@adrian-wang
Contributor

How about just cleaning up the staging dir before the write job starts?

@wangyum
Member

wangyum commented Sep 20, 2019

ok to test

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111047 has finished for PR 25863 at commit fa60fc9.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@manuzhang
Member

@turboFei it seems your description does not reflect the changes in this PR.

@turboFei turboFei changed the title [SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. [WIP][SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. Sep 20, 2019
@turboFei
Member Author

turboFei commented Sep 20, 2019

How about just cleaning up the staging dir before the write job starts?

The staging dir may be used by another job, so we cannot clean it up directly.

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111061 has finished for PR 25863 at commit 0b1323d.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei turboFei force-pushed the SPARK-29037-static branch 3 times, most recently from c3791f7 to eedad23 on September 20, 2019 at 14:47
@SparkQA

SparkQA commented Sep 20, 2019

Test build #111073 has finished for PR 25863 at commit 9fb68e7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 20, 2019

Test build #111075 has finished for PR 25863 at commit eedad23.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 21, 2019

Test build #111109 has finished for PR 25863 at commit a4f33d9.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei turboFei changed the title [WIP][SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. [SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. Sep 21, 2019
@turboFei
Member Author

In this PR, I set a unique output path for each partition overwrite operation.

I ran the previously failing UTs locally; they all passed.

@turboFei
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 21, 2019

Test build #111116 has finished for PR 25863 at commit 98aa212.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei
Member Author

turboFei commented Sep 21, 2019

@cloud-fan

In this PR, I set a unique output dir for partition overwrite operations, both dynamic and static.

cc @advancedxy @wangyum

@turboFei
Member Author

turboFei commented Sep 21, 2019

With this PR, InsertOverwrite operations on partitioned tables will no longer reuse _temporary/0.

InsertInto operations on partitioned tables and InsertOverwrite/InsertInto on non-partitioned tables will still use _temporary/0 as the job attempt path. We can fix that in the future, which needs mergePaths (refer to FileOutputCommitter.mergePaths()).
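For context, a conceptual sketch of what a mergePaths-style merge does; this is not the actual Hadoop implementation, which lives in FileOutputCommitter and handles more cases:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Move `from` into `to`, recursing when a directory already exists at the
// destination, so that two committed trees end up merged rather than nested.
def mergePaths(fs: FileSystem, from: Path, to: Path): Unit = {
  if (fs.getFileStatus(from).isFile) {
    // A file replaces whatever is currently at the destination.
    if (fs.exists(to)) fs.delete(to, true)
    fs.rename(from, to)
  } else if (!fs.exists(to)) {
    // Fast path: destination absent, rename the whole directory at once.
    fs.rename(from, to)
  } else if (fs.getFileStatus(to).isFile) {
    // Destination is a file but source is a directory: the directory wins.
    fs.delete(to, true)
    fs.rename(from, to)
  } else {
    // Both are directories: merge children one by one, then drop the source.
    fs.listStatus(from).foreach { child =>
      mergePaths(fs, child.getPath, new Path(to, child.getPath.getName))
    }
    fs.delete(from, true)
  }
}
```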

@turboFei
Member Author

In the new commit, I set a unique output path not only for partition overwrite operations but for all InsertIntoHadoopFsRelation operations.

@SparkQA

SparkQA commented Sep 21, 2019

Test build #111125 has finished for PR 25863 at commit 6ee11b8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@turboFei turboFei changed the title [SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. [WIP][SPARK-29037][CORE][SQL] For static partition overwrite, spark may give duplicate result. Sep 21, 2019
@SparkQA

SparkQA commented Sep 28, 2019

Test build #111516 has finished for PR 25863 at commit 8098a8f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • case class FileSourceWriteDesc(

* @param isInsertIntoHadoopFsRelation whether this is an InsertIntoHadoopFsRelation operation
* @param escapedStaticPartitionKVs static partition key and value pairs, which have been escaped
*/
case class FileSourceWriteDesc(
Contributor


Since this might be used/subclassed by users, we may just use a normal class.

" dynamicPartitionOverwrite)")
instantiate(className, jobId, outputPath, dynamicPartitionOverwrite)
}
}
Contributor


Kind of, but I think we can also add dynamicPartitionOverwrite to fileSourceWriteDesc.

For user-defined FileCommitProtocol classes, we can extract fileSourceWriteDesc and pass it to the old instantiate method.
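A sketch of that constructor fallback, loosely following the probing style of FileCommitProtocol.instantiate; the exact signatures in the PR may differ:

```scala
// Prefer the 3-arg constructor (jobId, path, dynamicPartitionOverwrite);
// fall back to the legacy 2-arg (jobId, path) constructor for user-defined
// committers that predate the extra flag.
def instantiate[T](
    clazz: Class[T],
    jobId: String,
    outputPath: String,
    dynamicPartitionOverwrite: Boolean): T = {
  try {
    clazz.getConstructor(classOf[String], classOf[String], classOf[Boolean])
      .newInstance(jobId, outputPath,
        java.lang.Boolean.valueOf(dynamicPartitionOverwrite))
  } catch {
    case _: NoSuchMethodException =>
      // Legacy committers cannot be asked to do dynamic partition overwrite.
      require(!dynamicPartitionOverwrite,
        s"${clazz.getName} does not support dynamic partition overwrite")
      clazz.getConstructor(classOf[String], classOf[String])
        .newInstance(jobId, outputPath)
  }
}
```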

context.getConfiguration.set(FileOutputFormat.OUTDIR, stagingOutputPath.toString)
logWarning("Set file output committer algorithm version to 2 implicitly," +
" for that the task output would be committed to staging output path firstly," +
" which is equivalent to algorithm 1.")
Contributor


15843 is a lot; however, it would not be that much within one Spark application.
One way to solve this is to use an object-level counter and only log the first warning (or first few warnings).
But I am not sure that's worth it. Also, the head of the logs may get rotated and discarded...

Or using logDebug is fine, but normally users won't set the log level to DEBUG.

I am not sure which one is better. It's up to you.
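A minimal sketch of the object-level guard idea (names are illustrative):

```scala
import java.util.concurrent.atomic.AtomicBoolean

// Emit the warning only once per JVM instead of once per task attempt.
object CommitterWarnings {
  private val warned = new AtomicBoolean(false)

  def warnOnce(log: String => Unit): Unit = {
    if (warned.compareAndSet(false, true)) {
      log("Set file output committer algorithm version to 2 implicitly; " +
        "task output is committed to the staging output path first, " +
        "which is equivalent to algorithm 1.")
    }
  }
}
```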

checkAnswer(spark.sql("select a, b from t where b = 1"), df1)
checkAnswer(spark.sql("select a, b from t where b = 2"), df2)
}
}
Contributor


Ah, that's a limitation of DataFrameWriter; we may need to extend DataFrameWriter to support that.

But currently, I think we can simply use the SQL syntax, since we can use spark.sql and get the same behaviour.
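For example, assuming a table t partitioned by b as in the test above (the source view name is illustrative):

```scala
df1.createOrReplaceTempView("src1")
spark.sql("INSERT OVERWRITE TABLE t PARTITION (b = 1) SELECT a FROM src1")
```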

Contributor

@advancedxy advancedxy left a comment


Sorry for the late reply.

Also, the comments should be updated.

* 2. Implementations should have a constructor with 2 or 3 arguments:
* (jobId: String, path: String) or
* (jobId: String, path: String, dynamicPartitionOverwrite: Boolean)
* 3. A committer should not be reused across multiple Spark jobs.

@turboFei
Member Author

Refactored the code.

@turboFei
Member Author

retest this please.

@SparkQA

SparkQA commented Sep 29, 2019

Test build #111569 has finished for PR 25863 at commit 81aff54.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class FileSourceWriteDesc(

@steveloughran
Contributor

steveloughran commented Sep 30, 2019

This scares me. The FileOutputCommitter code scares me already. It's why when I added a new committer extension point I chose not to go near that code and instead add a new superclass. It's scary code because

  • it's co-recursive in a way that's hard to follow.
  • it's used in both the MRv1 and MRv2 APIs.
  • it currently implements two different commit algorithms in those co-recursive methods.
  • there is no documentation other than the code itself.
  • one of those commit algorithms doesn't have the failure semantics Spark expects (network partition during task commit).
  • it's been widely subclassed in ways people aren't aware of.
  • there are lots of references to static methods, which you can't override when writing your own committer.

mergePaths() was one of the most obtuse bits of code I've ever had to step through with a debugger and a pen and paper. And I've stepped through how Windows NT boots.

You are proposing using reflection to get at private superclass methods that are not part of the public API, where we aren't even going to be aware of the fact that they are being used this way and have no guarantees of stability whatsoever.

And of course, for anything which implements the new superclass, PathOutputCommitProtocol, it's not going to work.

Overall, then, you are doing things during commit operations for which it is going to be very hard to prove correctness, and which are vulnerable to changes in the MR codebase that could fundamentally break things, either at link time or, worse, in the semantics of the operations you are trying to perform.

I don't see an easy solution here, but as someone with write access to the MR codebase, I'm happy to discuss how we could make things easier.

For example, someone could actually pull out the mergePaths code into its own class, with its own tests, which could then be reused. We can also look at whether we could extend PathOutputCommitter to offer more here. However, mergePaths contains some big assumptions about the nature of the destination and what the commit algorithms do.

In an ideal world, how we commit should be opaque to the caller. You commit a task and its work is promoted into the set of committed tasks' work, or you don't get a response from the task's process, in which case you can safely commit another task, confident that irrespective of what has gone wrong with the task which failed/partitioned/hung, only one task's output will ever become visible, and after the job has been committed or aborted the output of a partitioned task will never appear. V2 doesn't offer those guarantees, which is why I consider it broken.

One thing I have thought about adding to PathOutputCommitter is a predicate isTaskCommitRepeatable(), which would at least tell the Spark driver, MR AM, etc. whether it is safe to attempt to retry a task if it failed or timed out during that commit. The V2 protocol would say false, as would the new EMR Spark committer. Not sure if that would help here.
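A sketch of what such a predicate and its use might look like; isTaskCommitRepeatable does not exist in PathOutputCommitter today, it is only the idea floated above:

```scala
trait RepeatableTaskCommit {
  // Whether a task whose commit failed or timed out may safely be retried.
  def isTaskCommitRepeatable: Boolean
}

// How a scheduler might use it: retry only when the committer declares it
// safe; otherwise fail rather than risk two task attempts' output appearing.
def onTaskCommitTimeout(committer: RepeatableTaskCommit, retry: () => Unit): Unit = {
  if (committer.isTaskCommitRepeatable) {
    retry()
  } else {
    throw new IllegalStateException(
      "Task commit timed out and this committer cannot repeat task commits")
  }
}
```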

@steveloughran
Contributor

I've looked at the code a bit more. As I noted earlier, this scares me.

In the FileOutputCommitter, the FS itself is the synchronization point, with assumptions about atomicity and performance implicitly implemented in the code. The applications which use the committer have their own assumptions about atomicity and performance, derived transitively from those of the file system and then extended by assumptions about the correctness of the algorithms.

Things are bad enough as they are.

I am not convinced that relying on the internals of the FileOutputCommitter versions is a safe way to do this.

I think you'd be better off specifying a commit protocol which is explicitly designed for writing files into the destination, and then implementing it. For S3A, knowing the destination path lets us initiate, but not complete, the upload to the final path; we would then propagate the information needed to manifest that file to the job committer. The "Magic" committer does exactly this by recognising that a write to dest/__magic/$job_attemptID/$task-attempt-id/__basepath/year=2019/month=10/day=1/part-0001.csv has a final destination of dest/year=2019/month=10/day=1/part-0001.csv, and that the output stream, rather than creating that file on close(), should only save the metadata for the task commit to find.

That is very much black magic in the FS connector. Having a commit protocol where you could ask the committer for an output path to use when writing to a specific destination would let you eliminate that trick completely and possibly help with this problem.
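For illustration, the path mapping described above as a sketch; marker names follow the comment's example, and the real S3A magic committer's details differ:

```scala
// dest/__magic/$jobAttemptId/$taskAttemptId/__basepath/year=2019/... becomes
// dest/year=2019/...: drop everything from the magic marker through the base
// marker.
def finalDestination(magicPath: String): String = {
  val parts = magicPath.split('/')
  val magicIdx = parts.indexOf("__magic")
  val baseIdx = parts.indexOf("__basepath")
  require(magicIdx >= 0 && baseIdx > magicIdx, s"not a magic path: $magicPath")
  (parts.take(magicIdx) ++ parts.drop(baseIdx + 1)).mkString("/")
}
```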

The other thing to consider is that Spark permits committed tasks to pass serialized data back to the driver as an alternative to relying on the file system. Scale issues notwithstanding, task committers should be able to provide the information needed to include a task in the job's final output. For example, rather than have task commit rename files, it could just return a path to where it has stored its list of files to commit. Job commit then becomes a matter of loading those files and moving the output into the final destination. Again, this is essentially what we do with the S3A committer: we just save, propagate and reload those files within the committer.
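A sketch of that manifest flow, with illustrative names and a plain-text manifest format:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Task commit: write the list of produced files and return only the manifest
// path to the driver, instead of renaming anything.
def commitTask(fs: FileSystem, taskDir: Path, manifest: Path): Path = {
  val out = fs.create(manifest, true)
  try {
    fs.listStatus(taskDir).foreach(f => out.writeBytes(f.getPath.toString + "\n"))
  } finally out.close()
  manifest
}

// Job commit: load every manifest and promote the listed files into the
// final destination.
def commitJob(fs: FileSystem, manifests: Seq[Path], dest: Path): Unit = {
  manifests.foreach { m =>
    val in = fs.open(m)
    try {
      Source.fromInputStream(in).getLines().foreach { line =>
        val src = new Path(line)
        fs.rename(src, new Path(dest, src.getName))
      }
    } finally in.close()
  }
}
```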

This all gets complicated fast, even without worrying about concurrent jobs. But trying to use FileOutputCommitter internals to do what you want to do here strikes me as very, very dangerous. I already have issues with one of the commit algorithms implemented in it; trying to use it for other purposes is only going to make things worse.

@turboFei
Member Author

Thanks @steveloughran.
I will try my best to understand the content of your comments, thanks a lot.

@SparkQA

SparkQA commented Oct 15, 2019

Test build #112119 has finished for PR 25863 at commit fac7b7b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

…duplicate result and support concurrent file source write operations writing to different partitions in the same table.
add concurrent ut

extract doInsertion and set logDebug and modify comment
@SparkQA

SparkQA commented Dec 5, 2019

Test build #114890 has finished for PR 25863 at commit fa66a5b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114891 has finished for PR 25863 at commit 79d59bd.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 5, 2019

Test build #114901 has finished for PR 25863 at commit 0433977.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 9, 2019

Test build #115045 has finished for PR 25863 at commit f45ca9b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@turboFei turboFei closed this Jan 6, 2020