
Conversation

@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Jun 22, 2020

What changes were proposed in this pull request?

According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 in December 2020.

| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |

In the Apache Spark 3.0.0 release, we focused on other features. This PR targets Apache Spark 3.1.0, scheduled for December 2020.

Why are the changes needed?

Apache Hadoop 3.2 has many fixes and new cloud-friendly features.

Reference
  • 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
  • 2019-01-16: https://hadoop.apache.org/release/3.2.0.html

Does this PR introduce any user-facing change?

Since the default Hadoop dependency changes, users will get better support in cloud environments.

How was this patch tested?

Pass the Jenkins.

@SparkQA

SparkQA commented Jun 23, 2020

Test build #124371 has finished for PR 28897 at commit 9663de5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 23, 2020

Test build #124372 has finished for PR 28897 at commit caf50d1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun marked this pull request as ready for review June 23, 2020 02:52
@dongjoon-hyun
Member Author

Hi, @srowen , @HyukjinKwon , @cloud-fan , @gatorsmile .
Could you review this please?

@gatorsmile
Member

> users will get better support in cloud environments.

Can you explain the details?

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 23, 2020

For that, I'm thinking about new features like the following (see the spark-submit sketch after this list), but the required items vary based on each user's situation.

  • HADOOP-13786 Add S3A committers for zero-rename commits to S3 endpoints
  • HADOOP-13075 Add support for SSE-KMS and SSE-C in s3a filesystem
  • HADOOP-13578 Add Codec for ZStandard Compression (This is not cloud-specific)
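To make this concrete, here is a minimal spark-submit sketch of how a user could turn on two of these features (the S3A magic committer and SSE-KMS) once the build is on Hadoop 3.2. The bucket, KMS key, and job class are placeholders, and it assumes hadoop-aws and its AWS SDK dependency are on the classpath; the property names follow Hadoop's S3A documentation.

```bash
# Illustrative only: the bucket, KMS key ARN, and job class are placeholders.
# Requires hadoop-aws (and its AWS SDK dependency) on the classpath.
spark-submit \
  --conf spark.hadoop.fs.s3a.committer.name=magic \
  --conf spark.hadoop.fs.s3a.committer.magic.enabled=true \
  --conf spark.hadoop.fs.s3a.server-side-encryption-algorithm=SSE-KMS \
  --conf spark.hadoop.fs.s3a.server-side-encryption.key=arn:aws:kms:us-west-2:111122223333:key/EXAMPLE \
  --class org.example.MyJob \
  my-job.jar s3a://my-bucket/output
```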

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 23, 2020

BTW, please note that the default version is very important. For example, PySpark was downloaded 1,333,883 times last week, but it's the only Spark distribution with Hadoop 2.7.4.

@HyukjinKwon
Member

^ FWIW, I'm targeting a way to control it in Spark 3.1 via SPARK-32017.
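To illustrate the kind of control meant here (purely hypothetical; the concrete mechanism is whatever SPARK-32017 ends up delivering, so check that JIRA and the PySpark installation docs), an install-time switch could look like this:

```bash
# Hypothetical sketch of install-time selection of the bundled Hadoop version;
# the real mechanism is tracked in SPARK-32017.
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark
```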

@gatorsmile
Member

gatorsmile commented Jun 23, 2020

Yes. As you said, the default version is very important for PySpark users. I am afraid there are breaking changes in Hadoop 3.x releases.

We should avoid making this change until we can resolve https://issues.apache.org/jira/browse/SPARK-32017

@dongjoon-hyun
Member Author

@gatorsmile . Why does that block this? Technically, this supersedes it, doesn't it?

> We should avoid making this change until we can resolve https://issues.apache.org/jira/browse/SPARK-32017

Switching the default is what really matters. For example, we shipped Scala 2.12 in the Spark 2.4.x line for a while, but we didn't notice the Scala function issue until the 3.0.0 release.

Also, we can switch back to Hadoop 2.7 before December if we want.

@dongjoon-hyun
Member Author

BTW, if you want to have a Hadoop 2.7 variant in the Hadoop 3.2 (default) environment, we had better revise the JIRA issue.

@gatorsmile
Member

We should avoid forcing current PySpark users to upgrade their Hadoop versions. If we change the default, will it impact them? If YES, I think we should not do it until it is ready and they have a workaround.

@dongjoon-hyun
Member Author

I'm wondering what impact you are worried about specifically, @gatorsmile?

@gatorsmile
Member

Will PySpark users hit migration issues if they upgrade from Spark 3.0 to 3.1 due to this PR? For example, incompatibility issues introduced by Hadoop 3.x.

This PR did not answer this important question in its description. We need to answer it before taking any further action.

@dongjoon-hyun
Member Author

dongjoon-hyun commented Jun 23, 2020

For PySpark, if we need to stay on the Hadoop 2.7 distribution, that's very easy. I can remove the following one-line change from this PR. Since PyPI uploading is a manual process, we can keep PySpark with Hadoop 2.7 on PyPI.

- BINARY_PKGS_EXTRA["hadoop2.7"]="withpip,withr"
+ BINARY_PKGS_EXTRA["hadoop3.2"]="withpip,withr"

I didn't see specific complaints about the following. Instead, I've seen many complaints about the Hadoop 2.7.4 dependency for a long time.

> Will PySpark users hit migration issues if they upgrade from Spark 3.0 to 3.1 due to this PR?

I'm wondering if I missed anything in the mailing thread. It would be great if you could answer my question, too. Do you have a specific issue? Could you share it with the community, if possible on the dev mailing list? Then we can try to fix it together in order to move forward.

> We need to answer it before taking any further action.

In short, let's focus on the non-PyPI scope because I've provided a workaround with BINARY_PKGS_EXTRA. Do we need to stick with Hadoop 2.7 in Apache Spark 3.1.0 when we have both Hadoop 2.7 and Hadoop 3.2 distributions and PySpark can stay the same as in Spark 3.0.0?

@SparkQA

SparkQA commented Jun 24, 2020

Test build #124445 has finished for PR 28897 at commit 2434365.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Retest this please.

@dongjoon-hyun
Member Author

Hi, @srowen , @HyukjinKwon , @gatorsmile , @holdenk , @dbtsai .
Based on your comments and advice, I updated the PR description to be clearer and focused only on the Apache side. Can we move Apache Spark 3.1 forward? Thank you in advance.

@SparkQA

SparkQA commented Jun 26, 2020

Test build #124523 has finished for PR 28897 at commit 2434365.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member Author

Gentle ping once again.

@dbtsai
Member

dbtsai commented Jun 26, 2020

+1 from me. Users still have the option to use Hadoop 2.7, so I feel it's safe.
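For anyone who does need Hadoop 2.7, a hedged sketch of building a distribution from source with the existing Maven profile (flags shown are illustrative; confirm them against the "Building Spark" documentation for your release):

```bash
# Sketch: build a Spark distribution against Hadoop 2.7 using the existing
# -Phadoop-2.7 profile. Profile and flag names are illustrative; confirm
# against the "Building Spark" documentation.
./dev/make-distribution.sh --name hadoop2.7 --tgz \
  -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn
```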

@dongjoon-hyun
Member Author

Thank you so much, @dbtsai !

@holdenk
Contributor

holdenk commented Jun 26, 2020

LGTM, we can continue the PyPI discussion separately.

@dongjoon-hyun
Member Author

Thank you so much, @holdenk ! Yes, we can discuss and improve it separately later.

@dongjoon-hyun
Member Author

Thank you all. Merged to master.

holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020
According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 in December 2020.

| Item | Default Hadoop Dependency |
|------|-----------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |

In the Apache Spark 3.0.0 release, we focused on other features. This PR targets [Apache Spark 3.1.0, scheduled for December 2020](https://spark.apache.org/versioning-policy.html).

Apache Hadoop 3.2 has many fixes and new cloud-friendly features.

**Reference**
- 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
- 2019-01-16: https://hadoop.apache.org/release/3.2.0.html

Since the default Hadoop dependency changes, users will get better support in cloud environments.

Pass the Jenkins.

Closes apache#28897 from dongjoon-hyun/SPARK-32058.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>