
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 22, 2020

What changes were proposed in this pull request?

According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 in December 2020.

| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |

In the Apache Spark 3.0.0 release, we focused on other features. This PR targets Apache Spark 3.1.0, scheduled for December 2020.

Why are the changes needed?

Apache Hadoop 3.2 has many fixes and new cloud-friendly features.

Reference
  • 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
  • 2019-01-16: https://hadoop.apache.org/release/3.2.0.html

Does this PR introduce any user-facing change?

Since the default Hadoop dependency changes, users will get better support in cloud environments.

How was this patch tested?

Pass the Jenkins.

@SparkQA

SparkQA commented Jun 23, 2020

Test build #124371 has finished for PR 28897 at commit 9663de5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 23, 2020

Test build #124372 has finished for PR 28897 at commit caf50d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun marked this pull request as ready for review June 23, 2020 02:52
@dongjoon-hyun
Member Author

Hi, @srowen , @HyukjinKwon , @cloud-fan , @gatorsmile .
Could you review this please?

@gatorsmile
Member

> users will get better support in cloud environments.

Can you explain the details?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 23, 2020

For that, I'm thinking about new features like the following (see the spark-submit sketch after this list), but the required items vary based on each user's situation.

  • HADOOP-13786 Add S3A committers for zero-rename commits to S3 endpoints
  • HADOOP-13075 Add support for SSE-KMS and SSE-C in s3a filesystem
  • HADOOP-13578 Add Codec for ZStandard Compression (This is not cloud-specific)
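To make this concrete, here is a minimal spark-submit sketch of how a user could turn on two of these features (the S3A magic committer and SSE-KMS) once the build is on Hadoop 3.2. The bucket, KMS key, and job class are placeholders, and it assumes hadoop-aws and its AWS SDK dependency are on the classpath; the property names follow Hadoop's S3A documentation.

```bash
# Illustrative only: the bucket, KMS key ARN, and job class are placeholders.
# Requires hadoop-aws (and its AWS SDK dependency) on the classpath.
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=SSE-KMS \
  --conf spark.hadoop.fs.s3a.server-side-encryption.key=arn:aws:kms:us-west-2:111122223333:key/EXAMPLE \
  --class org.example.MyJob \
  my-job.jar s3a://my-bucket/output
```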

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 23, 2020

BTW, please note that the default version is very important. For example, PySpark was downloaded 1,333,883 times last week, but it's the only Spark distribution with Hadoop 2.7.4.

@HyukjinKwon
Member

^ FWIW, I'm targeting a way to control it in Spark 3.1 via SPARK-32017.
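To illustrate the kind of control meant here (purely hypothetical; the concrete mechanism is whatever SPARK-32017 ends up delivering, so check that JIRA and the PySpark installation docs), an install-time switch could look like this:

```bash
# Hypothetical sketch of install-time selection of the bundled Hadoop version;
# the real mechanism is tracked in SPARK-32017.
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark
```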

@gatorsmile
Member

gatorsmile commented Jun 23, 2020

Yes. As you said, the default version is very important for PySpark users. I am afraid there are breaking changes in Hadoop 3.x releases.

We should avoid making this change until we can resolve https://issues.apache.org/jira/browse/SPARK-32017

@dongjoon-hyun
Member Author

@gatorsmile . Why does that block this? Technically, this supersedes it, doesn't it?

> We should avoid making this change until we can resolve https://issues.apache.org/jira/browse/SPARK-32017

Switching the default is what really matters. For example, we shipped Scala 2.12 in the Spark 2.4.x line for a while, but we didn't notice the Scala function issue until the 3.0.0 release.

Also, we can switch back to Hadoop 2.7 before December if we want.

@dongjoon-hyun
Member Author

BTW, if you want to have a Hadoop 2.7 variant in the Hadoop 3.2 (default) environment, we had better revise the JIRA issue.

@gatorsmile
Member

We should avoid forcing current PySpark users to upgrade their Hadoop versions. If we change the default, will it impact them? If YES, I think we should not do it until it is ready and they have a workaround.

@dongjoon-hyun
Member Author

I'm wondering what impact you are worried about specifically, @gatorsmile?

@gatorsmile
Member

Will PySpark users hit migration issues if they upgrade from Spark 3.0 to 3.1 due to this PR? For example, incompatibility issues introduced by Hadoop 3.x.

This PR did not answer this important question in its description. We need to answer it before taking any further action.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 23, 2020

For PySpark, if we need to stay on the Hadoop 2.7 distribution, that's very easy. I can remove the following one-line change from this PR. Since PyPI uploading is a manual process, we can keep PySpark with Hadoop 2.7 on PyPI.

- BINARY_PKGS_EXTRA["hadoop2.7"]="withpip,withr"
+ BINARY_PKGS_EXTRA["hadoop3.2"]="withpip,withr"

I didn't see specific complaints about the following. Instead, I've seen many complaints about the Hadoop 2.7.4 dependency for a long time.

> Will PySpark users hit migration issues if they upgrade from Spark 3.0 to 3.1 due to this PR?

I'm wondering if I missed anything in the mailing thread. It would be great if you could answer my question, too. Do you have a specific issue? Could you share it with the community, if possible on the dev mailing list? Then we can try to fix it together in order to move forward.

> We need to answer it before taking any further action.

In short, let's focus on the non-PyPI scope because I've provided a workaround with BINARY_PKGS_EXTRA. Do we need to stick with Hadoop 2.7 in Apache Spark 3.1.0 when we have both Hadoop 2.7 and Hadoop 3.2 distributions and PySpark can stay the same as in Spark 3.0.0?

@SparkQA

SparkQA commented Jun 24, 2020

Test build #124445 has finished for PR 28897 at commit 2434365.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@dongjoon-hyun
Member Author

Hi, @srowen , @HyukjinKwon , @gatorsmile , @holdenk , @dbtsai .
Based on your comments and advice, I updated the PR description to be clearer and focused only on the Apache side. Can we move Apache Spark 3.1 forward? Thank you in advance.

@SparkQA

SparkQA commented Jun 26, 2020

Test build #124523 has finished for PR 28897 at commit 2434365.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Gentle ping once again.

@dbtsai
Member

dbtsai commented Jun 26, 2020

+1 from me. Users still have the option to use Hadoop 2.7, so I feel it's safe.
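For anyone who does need Hadoop 2.7, a hedged sketch of building a distribution from source with the existing Maven profile (flags shown are illustrative; confirm them against the "Building Spark" documentation for your release):

```bash
# Sketch: build a Spark distribution against Hadoop 2.7 using the existing
# -Phadoop-2.7 profile. Profile and flag names are illustrative; confirm
# against the "Building Spark" documentation.
./dev/make-distribution.sh --name hadoop2.7 --tgz \
  -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
```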

@dongjoon-hyun
Member Author

Thank you so much, @dbtsai !

@holdenk
Contributor

holdenk commented Jun 26, 2020

LGTM, we can continue the PyPI discussion separately.

@dongjoon-hyun
Member Author

Thank you so much, @holdenk ! Yes, we can discuss and improve it separately later.

@dongjoon-hyun
Member Author

Thank you all. Merged to master.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 in December 2020.

| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |

In the Apache Spark 3.0.0 release, we focused on other features. This PR targets [Apache Spark 3.1.0, scheduled for December 2020](https://spark.apache.org/versioning-policy.html).

Apache Hadoop 3.2 has many fixes and new cloud-friendly features.

**Reference**
- 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
- 2019-01-16: https://hadoop.apache.org/release/3.2.0.html

Since the default Hadoop dependency changes, users will get better support in cloud environments.

Pass the Jenkins.

Closes apache#28897 from dongjoon-hyun/SPARK-32058.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>