Conversation

@steveloughran (Contributor) commented Aug 10, 2022

What changes were proposed in this pull request?

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a
PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; really needs
to test the SQL commands through the manifest committer into gcs/abfs,
or at least local fs. That would be possible once hadoop 3.3.5 is out...

Why are the changes needed?

Hadoop 3.3.5 adds a new committer in mapreduce-core which works quickly and correctly on Azure and GCS. (It would also work on HDFS, but it is optimised for the cloud stores.)

The stores and the committer do meet the requirements of Spark SQL Dynamic Partition Overwrite, so it is safe for Spark to commit work through it.

Spark does not know this; MAPREDUCE-7403 adds a way for any PathOutputCommitter to declare that it is compatible; the IntermediateManifestCommitter will do so.
(apache/hadoop#4728)
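The probe itself is just an instanceof check plus a capability-string query. Below is a self-contained Java sketch of that pattern (the actual PR code is Scala); the interface and committer classes are local stand-ins for the real Hadoop types, and the capability string is the one MAPREDUCE-7403 is understood to define.

```java
// Self-contained sketch of the StreamCapabilities probe pattern.
// StreamCapabilities here is a stand-in for org.apache.hadoop.fs.StreamCapabilities;
// the committer classes are stand-ins, not real Hadoop committers.
public class CapabilityProbeSketch {

    /** Stand-in for the Hadoop StreamCapabilities interface. */
    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Stand-in for a committer which declares dynamic partition support. */
    static class ManifestLikeCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return "mapreduce.job.committer.dynamic.partitioning".equals(capability);
        }
    }

    /** Stand-in for a committer which predates the probe. */
    static class LegacyCommitter { }

    /** The probe: only trust committers that explicitly opt in. */
    static boolean supportsDynamicPartitionOverwrite(Object committer) {
        return committer instanceof StreamCapabilities
            && ((StreamCapabilities) committer)
                .hasCapability("mapreduce.job.committer.dynamic.partitioning");
    }

    public static void main(String[] args) {
        // Opted-in committer passes; anything else is rejected by default.
        System.out.println(supportsDynamicPartitionOverwrite(new ManifestLikeCommitter())); // true
        System.out.println(supportsDynamicPartitionOverwrite(new LegacyCommitter()));       // false
    }
}
```

The key design point is that the default is "unsupported": a committer which does not implement the interface, or does not report the capability, is treated exactly as before this change.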

Does this PR introduce any user-facing change?

No.

There is documentation on the feature in the hadoop manifest committer docs.

How was this patch tested?

  1. Unit tests in hadoop-cloud which work with hadoop versions with/without the matching change.
  2. New integration tests in https://github.com/hortonworks-spark/cloud-integration which require Spark to be built against a Hadoop version whose manifest committer declares compatibility.

Those new integration tests were run against Azure Cardiff with the manifest committer, and against S3 London (the s3a committers reject dynamic partition overwrite).

@steveloughran steveloughran marked this pull request as draft August 11, 2022 12:05
@steveloughran steveloughran changed the title [WIP][SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions [SPARK-40034][SQL][WIP] PathOutputCommitters to support dynamic partitions Aug 11, 2022
…ite.

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a
PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; really needs
to test the SQL commands through the manifest committer into gcs/abfs,
or at least local fs. That would be possible once hadoop 3.3.5 is out...
Change-Id: I5cbc391bc021b4dd177374e82de9fc33137ac319

Change-Id: I772caf861d6c92f0da6d9a02d9f899236ddaddf9
@steveloughran steveloughran force-pushed the SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning branch from 878fedd to 47bc229 Compare August 11, 2022 19:32
…nsupported

I believe this was always implicit; only committers with dynamic partition
overwrite would be asked for absolute path temp files(*). With this change
it is explicit, with tests.

(*) certainly nobody has ever complained about it not working with
the s3a committers

Change-Id: I57c2a02ad799f7ab5d9d0a3053da24f960bad289
@steveloughran steveloughran force-pushed the SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning branch from 06f3853 to 545f294 Compare August 15, 2022 19:37
…p versions

If the mapreduce-core BindingPathOutputCommitter doesn't implement StreamCapabilities,
the probes for dynamic commit support through the Parquet committer
don't work, so that part of the test case is skipped.

Change-Id: I5225c70a54c63adf858a9f429fddad251b79783e
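The guard described in that commit can be sketched as follows (self-contained Java; the class and interface names are local stand-ins, not the real Hadoop/Spark types): before asserting anything about the capability, the test first checks that the wrapper committer implements the capabilities interface at all, and skips the assertions otherwise.

```java
// Sketch of the test guard for Hadoop releases without MAPREDUCE-7403.
// StreamCapabilities is a stand-in for org.apache.hadoop.fs.StreamCapabilities;
// the committer objects stand in for BindingPathOutputCommitter instances.
public class GuardedProbeSketch {

    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Shape of the binding committer on Hadoop releases with MAPREDUCE-7403. */
    static class NewBindingCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return "mapreduce.job.committer.dynamic.partitioning".equals(capability);
        }
    }

    /**
     * The probe is only meaningful if the committer implements
     * StreamCapabilities at all; otherwise the capability assertions
     * must be skipped rather than failed.
     */
    static boolean probeIsMeaningful(Object committer) {
        return committer instanceof StreamCapabilities;
    }

    public static void main(String[] args) {
        Object oldBinding = new Object(); // pre-MAPREDUCE-7403 shape
        if (!probeIsMeaningful(oldBinding)) {
            System.out.println("capability assertions skipped on old binding");
        }
    }
}
```

This mirrors how the hadoop-cloud unit tests can pass against Hadoop versions both with and without the matching change.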
@steveloughran steveloughran changed the title [SPARK-40034][SQL][WIP] PathOutputCommitters to support dynamic partitions [SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions Aug 16, 2022
@steveloughran (Contributor Author)

This should interest @sunchao and @dongjoon-hyun. Note that this doesn't add support to the s3a committers; S3 itself doesn't do the right thing (rename()). It does work for ABFS and GCS through the manifest committer.

@steveloughran steveloughran marked this pull request as ready for review August 18, 2022 10:56
@attilapiros (Contributor) left a comment

Mostly a bunch of nits so far.

Change-Id: I6ddc92f56d8762cebb76857a30a5b9dd4fe4948d
* Section in cloud-integration docs
* Add references to committers and papers related to them.
* Remove hadoop openstack reference (it's going to be cut soon)

No mention of the Intermediate Manifest Committer until it is shipped
in an ASF release. It is in Cloudera CDH and has been trouble free,
unlike FileOutputCommitter with abfs (scale) and gcs (correctness).

Change-Id: I97bf56336f6fd6cbd6d56e87c911e62a6deff9c8
@github-actions github-actions bot added the DOCS label Aug 24, 2022
@steveloughran (Contributor Author)

The Hadoop side of this change is now merged in.

@attilapiros do you have any time to review this again?

1. docs
2. tests

Change-Id: Ia31cf91999157057f1a85061826da74db7f1713e
@steveloughran (Contributor Author)

Added a new test case and updated the docs. I've not yet rebased/merged it with your committer work, but the docs shouldn't clash. Once this PR is in, I will set my local build up to run your tests against S3 London.

@attilapiros (Contributor) left a comment

Some checkstyle issues:

  • error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/PathOutputCommitProtocol.scala message=File line length exceeds 100 characters line=213
  • error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/PathOutputCommitProtocol.scala message=No space after token = line=177 column=32
  • error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/CommitterBindingSuite.scala message=File line length exceeds 100 characters line=202

Otherwise LGTM.

This delivers performance and scalability on the object stores.

It is not critical for job correctness to use this with Azure storage; the
classic FileOutputCommitter is safe there -however this new committer scales
Member


nit. -however -> - however

@dongjoon-hyun (Member) left a comment

Change-Id: I71a15ba3909a8912351987ad2dfbba8dca83b5b8
@dongjoon-hyun (Member)

BTW, when is the ETA for Apache Hadoop 3.3.5, @steveloughran ?

@steveloughran (Contributor Author)

@dongjoon-hyun I'm off on vacation next week; we will fork off the branch the week after.

Things I'd like in, if anyone has the time:

  1. upgraded shaded parquet
  2. the shaded avro PR
  3. get that arm64 docker image working for a release there

@dongjoon-hyun (Member)

Merged to master for Apache Spark 3.4.0.

@dongjoon-hyun (Member)

Thank you, @steveloughran and @attilapiros .

@dongjoon-hyun (Member)

@attilapiros FYI, there is a new PR for this area.

@dongjoon-hyun (Member)

This is reverted from branch-3.4 only while being kept in master branch for Apache Spark 3.5.
