Conversation

@steveloughran (Contributor) commented Aug 10, 2022

What changes were proposed in this pull request?

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a
PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; really needs
to test the SQL commands through the manifest committer into gcs/abfs,
or at least local fs. That would be possible once hadoop 3.3.5 is out...

Why are the changes needed?

Hadoop 3.3.5 adds a new committer in mapreduce-core which works quickly and correctly on Azure and GCS. (It would also work on HDFS, but it is optimised for the cloud stores.)

The stores and the committer do meet the requirements of Spark SQL Dynamic Partition Overwrite, so it is safe for Spark to commit work through it.

Spark does not know this; MAPREDUCE-7403 adds a way for any PathOutputCommitter to declare that it is compatible; the IntermediateManifestCommitter will do so.
(apache/hadoop#4728)
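The probe itself is just an instanceof check plus a capability-string query. Below is a self-contained Java sketch of that pattern (the actual PR code is Scala); the interface and committer classes are local stand-ins for the real Hadoop types, and the capability string is the one MAPREDUCE-7403 is understood to define.

```java
// Self-contained sketch of the StreamCapabilities probe pattern.
// StreamCapabilities here is a stand-in for org.apache.hadoop.fs.StreamCapabilities;
// the committer classes are stand-ins, not real Hadoop committers.
public class CapabilityProbeSketch {

    /** Stand-in for the Hadoop StreamCapabilities interface. */
    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Stand-in for a committer which declares dynamic partition support. */
    static class ManifestLikeCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return "mapreduce.job.committer.dynamic.partitioning".equals(capability);
        }
    }

    /** Stand-in for a committer which predates the probe. */
    static class LegacyCommitter { }

    /** The probe: only trust committers that explicitly opt in. */
    static boolean supportsDynamicPartitionOverwrite(Object committer) {
        return committer instanceof StreamCapabilities
            && ((StreamCapabilities) committer)
                .hasCapability("mapreduce.job.committer.dynamic.partitioning");
    }

    public static void main(String[] args) {
        // Opted-in committer passes; anything else is rejected by default.
        System.out.println(supportsDynamicPartitionOverwrite(new ManifestLikeCommitter())); // true
        System.out.println(supportsDynamicPartitionOverwrite(new LegacyCommitter()));       // false
    }
}
```

The key design point is that the default is "unsupported": a committer which does not implement the interface, or does not report the capability, is treated exactly as before this change.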

Does this PR introduce any user-facing change?

No.

There is documentation on the feature in the hadoop manifest committer docs.

How was this patch tested?

  1. Unit tests in hadoop-cloud which work with hadoop versions with/without the matching change.
  2. New integration tests in https://github.com/hortonworks-spark/cloud-integration which require Spark to be built against a Hadoop version whose manifest committer declares compatibility.

Those new integration tests were run against Azure Cardiff with the manifest committer, and against S3 London (the s3a committers reject dynamic partition overwrite).

@steveloughran steveloughran marked this pull request as draft August 11, 2022 12:05
@steveloughran steveloughran changed the title [WIP][SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions [SPARK-40034][SQL][WIP] PathOutputCommitters to support dynamic partitions Aug 11, 2022
…ite.

Uses the StreamCapabilities probe in MAPREDUCE-7403 to identify when a
PathOutputCommitter is compatible with dynamic partition overwrite.

This patch has unit tests but not integration tests; really needs
to test the SQL commands through the manifest committer into gcs/abfs,
or at least local fs. That would be possible once hadoop 3.3.5 is out...
Change-Id: I5cbc391bc021b4dd177374e82de9fc33137ac319

Change-Id: I772caf861d6c92f0da6d9a02d9f899236ddaddf9
@steveloughran steveloughran force-pushed the SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning branch from 878fedd to 47bc229 Compare August 11, 2022 19:32
…nsupported

I believe this was always implicit; only committers with dynamic partition
overwrite would be asked for absolute path temp files(*). With this change
it is explicit, with tests.

(*) certainly nobody has ever complained about it not working with
the s3a committers

Change-Id: I57c2a02ad799f7ab5d9d0a3053da24f960bad289
@steveloughran steveloughran force-pushed the SPARK-40034-MAPREDUCE-7403-manifest-committer-partitioning branch from 06f3853 to 545f294 Compare August 15, 2022 19:37
…p versions

If the mapreduce-core BindingPathOutputCommitter doesn't implement StreamCapabilities,
the probes for dynamic commit support through the Parquet committer
don't work, so that part of the test case is skipped.

Change-Id: I5225c70a54c63adf858a9f429fddad251b79783e
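The guard described in that commit can be sketched as follows (self-contained Java; the class and interface names are local stand-ins, not the real Hadoop/Spark types): before asserting anything about the capability, the test first checks that the wrapper committer implements the capabilities interface at all, and skips the assertions otherwise.

```java
// Sketch of the test guard for Hadoop releases without MAPREDUCE-7403.
// StreamCapabilities is a stand-in for org.apache.hadoop.fs.StreamCapabilities;
// the committer objects stand in for BindingPathOutputCommitter instances.
public class GuardedProbeSketch {

    interface StreamCapabilities {
        boolean hasCapability(String capability);
    }

    /** Shape of the binding committer on Hadoop releases with MAPREDUCE-7403. */
    static class NewBindingCommitter implements StreamCapabilities {
        @Override
        public boolean hasCapability(String capability) {
            return "mapreduce.job.committer.dynamic.partitioning".equals(capability);
        }
    }

    /**
     * The probe is only meaningful if the committer implements
     * StreamCapabilities at all; otherwise the capability assertions
     * must be skipped rather than failed.
     */
    static boolean probeIsMeaningful(Object committer) {
        return committer instanceof StreamCapabilities;
    }

    public static void main(String[] args) {
        Object oldBinding = new Object(); // pre-MAPREDUCE-7403 shape
        if (!probeIsMeaningful(oldBinding)) {
            System.out.println("capability assertions skipped on old binding");
        }
    }
}
```

This mirrors how the hadoop-cloud unit tests can pass against Hadoop versions both with and without the matching change.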
@steveloughran steveloughran changed the title [SPARK-40034][SQL][WIP] PathOutputCommitters to support dynamic partitions [SPARK-40034][SQL] PathOutputCommitters to support dynamic partitions Aug 16, 2022
@steveloughran (Contributor Author)

This should interest @sunchao and @dongjoon-hyun. Note that this doesn't add support to the s3a committers; S3 itself doesn't do the right thing (rename()). It does work for ABFS and GCS through the manifest committer.

@steveloughran steveloughran marked this pull request as ready for review August 18, 2022 10:56
@attilapiros (Contributor) left a comment

Mostly a bunch of nits so far.

Change-Id: I6ddc92f56d8762cebb76857a30a5b9dd4fe4948d
* Section in cloud-integration docs
* Add references to committers and papers related to them.
* Remove hadoop openstack reference (it's going to be cut soon)

No mention of the Intermediate Manifest Committer until it is shipped
in an ASF release. It is in Cloudera CDH and has been trouble free,
unlike FileOutputCommitter with abfs (scale) and gcs (correctness).

Change-Id: I97bf56336f6fd6cbd6d56e87c911e62a6deff9c8
@github-actions github-actions bot added the DOCS label Aug 24, 2022
@steveloughran (Contributor Author)

The Hadoop side of this change is now merged in.

@attilapiros do you have any time to review this again?

1. docs
2. tests

Change-Id: Ia31cf91999157057f1a85061826da74db7f1713e
@steveloughran (Contributor Author)

Added a new test case and updated the docs. I've not yet rebased/merged it with your committer work, but the docs shouldn't clash. Once this PR is in, I will set my local build up to run your tests against S3 London.

@attilapiros (Contributor) left a comment

Some checkstyle issues:

  • error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/PathOutputCommitProtocol.scala message=File line length exceeds 100 characters line=213
  • error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/PathOutputCommitProtocol.scala message=No space after token = line=177 column=32
  • error file=/Users/attilazsoltpiros/git/attilapiros/spark-review/hadoop-cloud/src/hadoop-3/test/scala/org/apache/spark/internal/io/cloud/CommitterBindingSuite.scala message=File line length exceeds 100 characters line=202

Otherwise LGTM.

This delivers performance and scalability on the object stores.

It is not critical for job correctness to use this with Azure storage; the
classic FileOutputCommitter is safe there -however this new committer scales
Member


nit. -however -> - however

@dongjoon-hyun (Member) left a comment

Change-Id: I71a15ba3909a8912351987ad2dfbba8dca83b5b8
@dongjoon-hyun (Member)

BTW, when is the ETA for Apache Hadoop 3.3.5, @steveloughran ?

@steveloughran (Contributor Author)

@dongjoon-hyun I'm off on vacation next week; we will fork off the branch the week after.

Things I'd like in, if anyone has the time:

  1. upgraded shaded parquet
  2. the shaded avro PR
  3. get that arm64 docker image working for a release there

@dongjoon-hyun (Member)

Merged to master for Apache Spark 3.4.0.

@dongjoon-hyun (Member)

Thank you, @steveloughran and @attilapiros .

@dongjoon-hyun (Member)

@attilapiros FYI, there is a new PR for this area.

@dongjoon-hyun (Member)

This is reverted from branch-3.4 only while being kept in master branch for Apache Spark 3.5.
