[SPARK-10063][SQL] Remove DirectParquetOutputCommitter #12229
```diff
@@ -129,16 +129,17 @@ private[sql] abstract class BaseWriterContainer(
       outputWriterFactory.newInstance(path, bucketId, dataSchema, taskAttemptContext)
     } catch {
       case e: org.apache.hadoop.fs.FileAlreadyExistsException =>
-        if (outputCommitter.isInstanceOf[parquet.DirectParquetOutputCommitter]) {
-          // Spark-11382: DirectParquetOutputCommitter is not idempotent, meaning on retry
+        if (outputCommitter.getClass.getName.contains("Direct")) {
+          // SPARK-11382: DirectParquetOutputCommitter is not idempotent, meaning on retry
           // attempts, the task will fail because the output file is created from a prior attempt.
           // This often means the most visible error to the user is misleading. Augment the error
           // to tell the user to look for the actual error.
           throw new SparkException("The output file already exists but this could be due to a " +
             "failure from an earlier attempt. Look through the earlier logs or stage page for " +
-            "the first error.\n File exists error: " + e)
+            "the first error.\n File exists error: " + e.getLocalizedMessage, e)
```
Contributor (inline review comment):

In #12004 I'm proposing a module to add optional cloud tests and documentation, and to make sure the relevant Hadoop artifacts for S3, OpenStack, and Azure get pulled into releases. It has a doc page too. Assuming that gets in, the exception text could include a link to it (or, better, the ASF link redirector). I'd add a section on commit problems there.
```diff
+        } else {
+          throw e
+        }
-        throw e
       }
     }
```
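The tail of that change is easy to miss: the original exception used to be concatenated into the message string, while the patched code passes it as the constructor's cause. Below is a minimal sketch of the pattern, assuming only the two-argument `SparkException(message, cause)` constructor from spark-core; the helper name is illustrative and not part of the patch.

```scala
import org.apache.spark.SparkException

object AugmentErrorSketch {
  // Illustrative only: mirrors the pattern in the diff above. Passing `e` as the
  // cause keeps its stack trace attached, instead of flattening it into the message
  // string the way "... " + e did before the patch.
  def augmentAndRethrow(e: Throwable): Nothing = {
    throw new SparkException(
      "The output file already exists but this could be due to a failure from an " +
        "earlier attempt. Look through the earlier logs or stage page for the first " +
        "error.\n File exists error: " + e.getLocalizedMessage,
      e)
  }
}
```

With the cause attached, logs and the stage page show the original FileAlreadyExistsException as a "Caused by" trace under the augmented message rather than only its string form.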
```diff
@@ -156,15 +157,6 @@ private[sql] abstract class BaseWriterContainer(
         s"Using default output committer ${defaultOutputCommitter.getClass.getCanonicalName} " +
           "for appending.")
         defaultOutputCommitter
-      } else if (speculationEnabled) {
-        // When speculation is enabled, it's not safe to use customized output committer classes,
-        // especially direct output committers (e.g. `DirectParquetOutputCommitter`).
-        //
-        // See SPARK-9899 for more details.
-        logInfo(
-          s"Using default output committer ${defaultOutputCommitter.getClass.getCanonicalName} " +
-          "because spark.speculation is configured to be true.")
-        defaultOutputCommitter
       } else {
         val configuration = context.getConfiguration
         val committerClass = configuration.getClass(
```
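For readers wondering what happens to custom committers once the speculation branch above is gone: as the remaining `else` branch shows, the committer class is still resolved from the Hadoop configuration. Below is a minimal sketch of selecting a committer that way, assuming the internal key `spark.sql.sources.outputCommitterClass` used by Spark's SQL write path; treat the exact key name as an assumption rather than a stable public option.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.mapreduce.OutputCommitter
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

object CommitterConfigSketch {
  def main(args: Array[String]): Unit = {
    val hadoopConf = new Configuration()

    // Ask the write path to use a specific committer class. The key below is an
    // internal Spark SQL option name and may change between releases.
    hadoopConf.setClass(
      "spark.sql.sources.outputCommitterClass",
      classOf[FileOutputCommitter],
      classOf[OutputCommitter])

    // The writer container resolves it roughly like this:
    val committerClass = hadoopConf.getClass(
      "spark.sql.sources.outputCommitterClass", null, classOf[OutputCommitter])
    println(s"Configured committer: ${committerClass.getName}")
  }
}
```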
This file was deleted (the DirectParquetOutputCommitter removed by this PR).
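The deleted committer wrote task output straight to the final destination rather than to a task-attempt directory, which is exactly why it is not idempotent: a retried attempt finds the file created by the earlier attempt. A rough, self-contained sketch of that failure mode follows; this is not the deleted class, and the helper and paths are made up for illustration.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object DirectWriteSketch {
  // A retried task attempt hits the file created by the earlier attempt, so the write
  // path surfaces org.apache.hadoop.fs.FileAlreadyExistsException instead of whatever
  // error actually failed the first attempt.
  def writeDirectly(fs: FileSystem, finalPath: Path): Unit = {
    val out = fs.create(finalPath, /* overwrite = */ false) // fails if the path exists
    try out.writeBytes("row data\n") finally out.close()
  }

  def main(args: Array[String]): Unit = {
    val fs = FileSystem.getLocal(new Configuration())
    val target = new Path("/tmp/direct-commit-sketch/part-00000")
    writeDirectly(fs, target) // first attempt succeeds
    writeDirectly(fs, target) // "retry" fails: the output already exists
  }
}
```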
Review comment on the name-based committer check:

This is pretty brittle/ugly. Is there any other way, such as having an interface covering commit semantics? Then the code could just be `.isInstanceOf[NonAtomicCommitter]`.
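A sketch of what that marker interface could look like. Nothing below exists in Spark; `NonAtomicCommitter` and `DirectExampleCommitter` are hypothetical names used only to show how a type test could replace the string match on the class name.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical marker trait: implemented by committers that write task output
// directly to the final location and therefore cannot be retried safely.
trait NonAtomicCommitter

// A hypothetical direct committer would simply mix the marker in.
class DirectExampleCommitter(outputPath: Path, context: TaskAttemptContext)
  extends FileOutputCommitter(outputPath, context) with NonAtomicCommitter

object NonAtomicCommitterSketch {
  // The catch block in the diff could then test the type instead of the class name:
  //   if (outputCommitter.isInstanceOf[NonAtomicCommitter]) { ... }
  def isNonAtomic(committer: AnyRef): Boolean =
    committer.isInstanceOf[NonAtomicCommitter]
}
```

One likely reason the patch uses the name match instead: it also flags third-party committers whose class name contains "Direct" without requiring them to depend on a Spark trait.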