[SPARK-32058][BUILD] Use Apache Hadoop 3.2.0 dependency by default #28897
Conversation
Test build #124371 has finished for PR 28897 at commit

Test build #124372 has finished for PR 28897 at commit

Hi, @srowen, @HyukjinKwon, @cloud-fan, @gatorsmile.

Can you explain the details?
For that, I'm thinking about new features like the following, but the required items vary based on the user's situation.
BTW, please note that the default version is very important. For example, PySpark was downloaded 1,333,883 times last week, but it's the only Spark distribution with
^ FWIW, I'm targeting a way to control it in Spark 3.1 at SPARK-32017.
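For context, the idea in SPARK-32017 is to let PyPI users pick the Hadoop variant at install time instead of being tied to whatever the default build ships with. A minimal sketch of what that could look like (the `PYSPARK_HADOOP_VERSION` variable name and its accepted values are assumptions here, since SPARK-32017 is not implemented yet at this point in the discussion):

```bash
# Hypothetical install-time switch for the PyPI distribution.
# Without the variable, pip keeps installing the current default build.
PYSPARK_HADOOP_VERSION=2.7 pip install pyspark   # stay on the Hadoop 2.7 build
PYSPARK_HADOOP_VERSION=3.2 pip install pyspark   # opt in to the Hadoop 3.2 build
```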
Yes. As you said, the default version is very important for PySpark users. I am afraid there are breaking changes in Hadoop 3.x releases. We should avoid making this change until we can resolve https://issues.apache.org/jira/browse/SPARK-32017
@gatorsmile. Why does that block this? Technically, this supersedes it, doesn't it?
Switching the default is the real one. For example, we shipped Scala 2.12 in the Spark 2.4.x line for a while, but we didn't notice the Scala function issue until the 3.0.0 release. Also, we can switch back to
BTW, if you want to have
We should avoid forcing the current PySpark users to upgrade their Hadoop versions. If we change the default, will it impact them? If YES, I think we should not do it until it is ready and they have a workaround.
I'm wondering what impact you are worried about specifically, @gatorsmile?
Will PySpark users hit migration issues if they upgrade from Spark 3.0 to 3.1 due to this PR? For example, some incompatibility issues introduced by Hadoop 3.x. This PR did not answer this important question in the PR description. We need to answer this before taking any further action.
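For reference, users who want to assess the impact can first check which Hadoop version their existing installation bundles; the version is encoded in the jar file names shipped with the distribution. A minimal sketch (assuming a standard binary or pip-installed layout where `SPARK_HOME` points at the distribution root):

```bash
# The bundled Hadoop jars carry the version in their file names,
# e.g. hadoop-common-2.7.4.jar for a Hadoop 2.7 build.
ls "$SPARK_HOME"/jars | grep '^hadoop-'
```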
For PySpark, if we need to keep in
I didn't see specific complaints about the following. Instead, I've seen many complaints about the Hadoop 2.7.4 dependency for a long time.
I'm wondering if I missed anything in the mailing list thread. It would be great if you could answer my question, too. Do you have a specific issue? Could you share it with the community, if possible on the dev mailing list? Then we can try to fix it together in order to move forward.
In short, let's focus on
Test build #124445 has finished for PR 28897 at commit

Retest this please.
Hi, @srowen, @HyukjinKwon, @gatorsmile, @holdenk, @dbtsai.

Test build #124523 has finished for PR 28897 at commit

Gentle ping once again.
+1 from me. Users still have the option to use Hadoop 2.7, so I feel it's safe.
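For reference, even after the default flips, a user or downstream packager can still build a distribution pinned to Hadoop 2.7 with the existing Maven profile. A minimal sketch (profile and script names as used in the current Spark build; the extra profiles are illustrative and can be adjusted):

```bash
# Build a Spark distribution against the Hadoop 2.7 profile instead of the new default.
./dev/make-distribution.sh --name hadoop2.7 --tgz \
  -Phadoop-2.7 -Pyarn -Phive -Phive-thriftserver
```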
Thank you so much, @dbtsai!

LGTM, we can continue the PyPI discussion separately.

Thank you so much, @holdenk! Yes, we can discuss and improve it separately later.

Thank you all. Merged to master.
What changes were proposed in this pull request?

According to the dev mailing list discussion, this PR aims to switch the default Apache Hadoop dependency from 2.7.4 to 3.2.0 for Apache Spark 3.1.0 in December 2020.

| Item | Default Hadoop Dependency |
|------|---------------------------|
| Apache Spark Website | 3.2.0 |
| Apache Download Site | 3.2.0 |
| Apache Snapshot | 3.2.0 |
| Maven Central | 3.2.0 |
| PyPI | 2.7.4 (We will switch later) |
| CRAN | 2.7.4 (We will switch later) |
| Homebrew | 3.2.0 (already) |

In the Apache Spark 3.0.0 release, we focused on the other features. This PR targets [Apache Spark 3.1.0, scheduled for December 2020](https://spark.apache.org/versioning-policy.html).

Why are the changes needed?

Apache Hadoop 3.2 has many fixes and new cloud-friendly features.

**Reference**
- 2017-08-04: https://hadoop.apache.org/release/2.7.4.html
- 2019-01-16: https://hadoop.apache.org/release/3.2.0.html

Does this PR introduce any user-facing change?

Since the default Hadoop dependency changes, users will get better support in cloud environments.

How was this patch tested?

Pass the Jenkins.

Closes apache#28897 from dongjoon-hyun/SPARK-32058.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>