[WIP][SPARK-28945][CORE][SQL] Support concurrent dynamic partition writes to different partitions in the same table #25739
Changes to core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala:
```diff
@@ -26,7 +26,7 @@ import scala.util.Try
 import org.apache.hadoop.conf.Configurable
 import org.apache.hadoop.fs.Path
 import org.apache.hadoop.mapreduce._
-import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
+import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat}
 import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl

 import org.apache.spark.internal.Logging
```
```diff
@@ -91,7 +91,31 @@ class HadoopMapReduceCommitProtocol(
    */
   private def stagingDir = new Path(path, ".spark-staging-" + jobId)

+  /**
+   * Get the desired output path for the job. The output will be [[path]] when
+   * dynamicPartitionOverwrite is disabled; otherwise it will be [[stagingDir]]. We choose
+   * [[stagingDir]] over [[path]] to avoid potential collisions between concurrent write
+   * jobs, which would otherwise all specify the same output when writing to the same
+   * table dynamically.
+   *
+   * @return Path the desired output path.
+   */
+  protected def getOutputPath(context: TaskAttemptContext): Path = {
+    if (dynamicPartitionOverwrite) {
+      val conf = context.getConfiguration
+      val outputPath = stagingDir.getFileSystem(conf).makeQualified(stagingDir)
+      outputPath
+    } else {
+      new Path(path)
+    }
+  }
+
   protected def setupCommitter(context: TaskAttemptContext): OutputCommitter = {
+    // Set output path to stagingDir to avoid potential collisions of multiple concurrent write tasks.
```
Review thread on the comment line above:

Contributor: In fact, I don't see how the committer is related to the staging dir. If you look at […]

Contributor (author): Yes, we manually commit files in the staging dir. The problem is in the embedded snippet: spark/core/src/main/scala/org/apache/spark/internal/io/HadoopMapReduceCommitProtocol.scala, lines 190 to 198 in 1de7d30.

The `OutputCommitter` cannot work correctly if multiple `OutputCommitter`s are working on the same output path (concurrent writes to different partitions of the same table all specify the same output: the table's output location). After changing the output path to the staging dir, concurrent jobs can have different output dirs.
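To make the reply concrete, here is a minimal standalone sketch (not part of the PR; the `demo` job ids, `/tmp` paths, and `committerFor` helper are made up) of why two Hadoop `FileOutputCommitter`s pointed at the same output path collide: both stage their work under the same `<output>/_temporary` tree, and either job's cleanup deletes that tree wholesale.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

object CommitterCollisionSketch {
  // Hypothetical helper: build a FileOutputCommitter for a job writing to `out`.
  private def committerFor(out: Path, jobSeq: Int): FileOutputCommitter = {
    val conf = new Configuration()
    conf.set(FileOutputFormat.OUTDIR, out.toString)
    val jobId = new JobID("demo", jobSeq)
    val attemptId = new TaskAttemptID(new TaskID(jobId, TaskType.MAP, 0), 0)
    new FileOutputCommitter(out, new TaskAttemptContextImpl(conf, attemptId))
  }

  def main(args: Array[String]): Unit = {
    val tableLocation = new Path("/tmp/warehouse/t1") // both jobs target the table path
    val job1 = committerFor(tableLocation, 1)
    val job2 = committerFor(tableLocation, 2)
    // Both work paths live under the single shared /tmp/warehouse/t1/_temporary
    // tree; FileOutputCommitter's job cleanup deletes that tree wholesale, so
    // whichever job finishes first can wipe out the other's in-flight output.
    println(job1.getWorkPath)
    println(job2.getWorkPath)
  }
}
```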
The hunk continues:

```diff
+    if (dynamicPartitionOverwrite) {
+      val newOutputPath = getOutputPath(context)
+      context.getConfiguration.set(FileOutputFormat.OUTDIR, newOutputPath.toString)
+    }
+
     val format = context.getOutputFormatClass.getConstructor().newInstance()
     // If OutputFormat is Configurable, we should set conf to it.
     format match {
```
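And the effect of this change, as a sketch (again outside the PR; the staging path below is fabricated to mirror the `.spark-staging-<jobId>` convention): once `FileOutputFormat.OUTDIR` points at the per-job staging dir before the output format is instantiated, the committer derives all of its temporary paths from that dir, so concurrent jobs no longer share a `_temporary` tree.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{NullWritable, Text}
import org.apache.hadoop.mapreduce.lib.output.{FileOutputCommitter, FileOutputFormat, TextOutputFormat}
import org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
import org.apache.hadoop.mapreduce.{JobID, TaskAttemptID, TaskID, TaskType}

object OutDirRedirectSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    // Mirrors what setupCommitter now does under dynamicPartitionOverwrite:
    // point OUTDIR at the job-unique staging dir rather than the table path.
    conf.set(FileOutputFormat.OUTDIR, "/tmp/warehouse/t1/.spark-staging-job_demo_0001")

    val jobId = new JobID("demo", 1)
    val attemptId = new TaskAttemptID(new TaskID(jobId, TaskType.MAP, 0), 0)
    val context = new TaskAttemptContextImpl(conf, attemptId)

    // The committer returned by the output format derives its paths from OUTDIR,
    // so everything it stages now lives under the staging dir.
    val committer = new TextOutputFormat[NullWritable, Text]()
      .getOutputCommitter(context).asInstanceOf[FileOutputCommitter]
    println(committer.getWorkPath) // .../.spark-staging-job_demo_0001/_temporary/...
  }
}
```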
Review thread on the Scaladoc line "The output will be [[path]]":

Contributor: What does `path` mean here?

Contributor (author): The `path` is defined in the class parameter, and the comment for that is: […]
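For readers without the file open, this is approximately the declaration being referenced (reconstructed from Spark sources of that era, not quoted from this page; treat the wording as approximate):

```scala
// Reconstructed for context; not part of this PR's diff.
/**
 * @param jobId the job's or stage's id
 * @param path the job's output path, or null if committer acts as a noop
 * @param dynamicPartitionOverwrite if true, Spark will overwrite partition
 *                                  directories at runtime dynamically
 */
class HadoopMapReduceCommitProtocol(
    jobId: String,
    path: String,
    dynamicPartitionOverwrite: Boolean = false)
  extends FileCommitProtocol with Serializable with Logging
```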