[SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions #37468
Conversation
…ite.

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; really needs to test the SQL commands through the manifest committer into gcs/abfs, or at least local fs. That would be possible once hadoop 3.3.5 is out...

Change-Id: I5cbc391bc021b4dd177374e82de9fc33137ac319
Change-Id: I772caf861d6c92f0da6d9a02d9f899236ddaddf9
Force-pushed from 878fedd to 47bc229
…nsupported

I believe this was always implicit; only committers with dynamic partition overwrite would be asked for absolute path temp files(*). With this change it is explicit, with tests.

(*) certainly nobody has ever complained about it not working with the s3a committers

Change-Id: I57c2a02ad799f7ab5d9d0a3053da24f960bad289
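To illustrate what "explicit" means here, a minimal sketch (hypothetical class and flag names, not the actual patch): a commit protocol that has not confirmed dynamic partition overwrite support refuses requests for absolute-path temp files rather than producing incorrect output later.

```scala
import org.apache.hadoop.mapreduce.TaskAttemptContext

// Simplified sketch only. A real implementation would extend Spark's
// FileCommitProtocol / PathOutputCommitProtocol; names here are hypothetical.
class ExampleCommitProtocol(jobId: String, destination: String) extends Serializable {

  // Assumed flag: set true only when the bound committer has declared
  // support for dynamic partition overwrite.
  private val supportsDynamicPartitions: Boolean = false

  def newTaskTempFileAbsPath(
      taskContext: TaskAttemptContext,
      absoluteDir: String,
      ext: String): String = {
    if (!supportsDynamicPartitions) {
      // Make the limitation explicit rather than implicit.
      throw new UnsupportedOperationException(
        s"Absolute-path output files (dynamic partition overwrite) are not " +
        s"supported by the committer for job $jobId writing to $destination")
    }
    // Otherwise hand back a temp file under a staging directory (details elided).
    s"$absoluteDir/.spark-staging-$jobId/part$ext"
  }
}
```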
Force-pushed from 06f3853 to 545f294
…p versions

If mapreduce-core BindingPathOutputCommitter doesn't implement StreamCapabilities, the probes for dynamic commit support through the parquet committer don't work. So skip that bit of the test case.

Change-Id: I5225c70a54c63adf858a9f429fddad251b79783e
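A rough sketch of that skip logic, assuming a ScalaTest suite; the helper name is hypothetical, and the capability string is the one MAPREDUCE-7403 is understood to define:

```scala
import org.apache.hadoop.fs.StreamCapabilities
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter
import org.scalatest.Assertions.{assert, assume}

object DynamicCapabilityTestHelper {
  // Only assert on the capability when the (possibly wrapping) committer
  // implements StreamCapabilities; older mapreduce-core releases hand back a
  // BindingPathOutputCommitter which does not, so the probe is skipped.
  def assertDynamicCapabilityIfProbeable(committer: PathOutputCommitter): Unit = {
    assume(committer.isInstanceOf[StreamCapabilities],
      "committer does not implement StreamCapabilities; skipping the probe")
    assert(committer.asInstanceOf[StreamCapabilities]
      .hasCapability("mapreduce.job.committer.dynamic.partitioning"))
  }
}
```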
This should interest @sunchao and @dongjoon-hyun. Note that this doesn't add support to the s3a committers; S3 itself doesn't do the right thing (rename()). It does work for abfs and gcs through the manifest committer.
attilapiros left a comment
Mostly a bunch of Nits so far.
Change-Id: I6ddc92f56d8762cebb76857a30a5b9dd4fe4948d
* Section in cloud-integration docs
* Add references to committers and papers related to them.
* Remove hadoop openstack reference (it's going to be cut soon)

No mention of the Intermediate Manifest Committer until it is shipped in an ASF release. It is in Cloudera CDH and has been trouble free, unlike FileOutputCommitter with abfs (scale) and gcs (correctness).

Change-Id: I97bf56336f6fd6cbd6d56e87c911e62a6deff9c8
The Hadoop side of this change is now merged in. @attilapiros, do you have any time to review this again?
1. docs
2. tests

Change-Id: Ia31cf91999157057f1a85061826da74db7f1713e
Added a new test case and updated the docs. I've not yet rebased/merged it with your committer work, but the docs shouldn't clash. Once this PR is in, I will set up my local build to run your tests against S3 London.
attilapiros left a comment
Some checkstyle issues:
- error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/PathOutputCommitProtocol.scala message=File line length exceeds 100 characters line=213
- error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/PathOutputCommitProtocol.scala message=No space after token = line=177 column=32
- error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/CommitterBindingSuite.scala message=File line length exceeds 100 characters line=202
Otherwise LGTM.
This deliviers performance and scalability on the object stores.

It is not critical for job correctness to use this with Azure storage; the
classic FileOutputCommitter is safe there -however this new committer scales
nit. -however -> - however
dongjoon-hyun left a comment
cc @sunchao, @viirya, @cloud-fan
Change-Id: I71a15ba3909a8912351987ad2dfbba8dca83b5b8
BTW, when is the ETA for Apache Hadoop 3.3.5, @steveloughran?

@dongjoon-hyun I'm off on vacation next week; we will fork off the branch the week after. Things I'd like in if anyone has the time
Merged to master for Apache Spark 3.4.0.

Thank you, @steveloughran and @attilapiros.

@attilapiros FYI, there is a new PR for this area.
This is reverted from |
What changes were proposed in this pull request?
Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a
PathOutputCommitter is compatible with dynamic partition overwrite.
This patch has unit tests but not integration tests; really needs
to test the SQL commands through the manifest committer into gcs/abfs,
or at least local fs. That would be possible once hadoop 3.3.5 is out...
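As a rough sketch of how such a probe can look (illustrative only; the capability constant and the error message are assumptions, not the exact code in this PR):

```scala
import org.apache.hadoop.fs.StreamCapabilities
import org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter

object DynamicPartitionProbe {
  // Capability string defined by MAPREDUCE-7403 (assumed here).
  val DynamicPartitioningCapability = "mapreduce.job.committer.dynamic.partitioning"

  /** Fail fast if dynamic partition overwrite was requested but the committer
   *  has not declared support for it via StreamCapabilities. */
  def verify(committer: PathOutputCommitter, dynamicOverwriteRequested: Boolean): Unit = {
    val supported = committer match {
      case caps: StreamCapabilities => caps.hasCapability(DynamicPartitioningCapability)
      case _ => false // no probe available, treat as unsupported
    }
    if (dynamicOverwriteRequested && !supported) {
      throw new UnsupportedOperationException(
        s"${committer.getClass.getName} does not declare support for " +
        "dynamic partition overwrite")
    }
  }
}
```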
Why are the changes needed?
Hadoop 3.3.5 adds a new committer in mapreduce-core which works fast and correctly on Azure and GCS (it would also work on HDFS, but it is optimised for the cloud stores).
The stores and the committer do meet the requirements of Spark SQL dynamic partition overwrite, so it is OK for Spark to work through it.
Spark does not know this; MAPREDUCE-7403 adds a way for any PathOutputCommitter to declare that it is compatible, and the IntermediateManifestCommitter will do so.
(apache/hadoop#4728)
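For illustration, a committer can opt in by implementing StreamCapabilities and answering the probe. This is a hypothetical example, not the IntermediateManifestCommitter; the capability string is again an assumption:

```scala
import org.apache.hadoop.fs.{Path, StreamCapabilities}
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical committer that declares it is safe to use for Spark's dynamic
// partition overwrite. FileOutputCommitter is used only as a convenient
// concrete PathOutputCommitter base class for the sketch.
class DynamicCapableCommitter(output: Path, context: TaskAttemptContext)
  extends FileOutputCommitter(output, context) with StreamCapabilities {

  override def hasCapability(capability: String): Boolean =
    capability == "mapreduce.job.committer.dynamic.partitioning"
}
```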
Does this PR introduce any user-facing change?
No.
There is documentation on the feature in the hadoop manifest committer docs.
How was this patch tested?
New unit tests; also tested against Azure Cardiff with the manifest committer and against S3 London (the s3a committers reject dynamic partition overwrite).