
@LuciferYang (Contributor) commented Apr 19, 2023

What changes were proposed in this pull request?

SPARK-36835 introduced `hadoop-client-api.artifact`, `hadoop-client-runtime.artifact` and `hadoop-client-minicluster.artifact` to be compatible with the dependency definitions of both Hadoop 2 and Hadoop 3.

After SPARK-42452, Spark no longer supports Hadoop 2, so this PR inlines these properties to simplify the dependency definitions.
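For illustration, a minimal sketch of the kind of change (my reconstruction, not the exact diff; it assumes the properties were referenced as artifact IDs, which matches the -Dhadoop-client-api.artifact override shown later in this thread). Before, a dependency resolved the artifact name through a property:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <!-- resolved via the hadoop-client-api.artifact property -->
      <artifactId>${hadoop-client-api.artifact}</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

After inlining, the artifact name is written directly:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>${hadoop.version}</version>
    </dependency>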

Why are the changes needed?

Compatibility with Hadoop 2 is no longer required.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions

@HyukjinKwon (Member)

cc @sunchao

@pan3793 (Member) commented Apr 20, 2023

Does this mean we drop support for building against the vanilla Hadoop 3 client?

def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  VersionUtils.majorMinorPatchVersion(hadoopVersion).exists {
    case (3, 2, v) if v >= 2 => true
    case (3, 3, v) if v >= 1 => true
    case _ => false
  }
}
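
For example, the check yields:

supportsHadoopShadedClient("3.2.2") // true
supportsHadoopShadedClient("3.3.1") // true
supportsHadoopShadedClient("3.1.4") // false, no shaded client support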

Update: leave it; #33160 didn't get in, so Spark does not support building against the vanilla Hadoop 3 client.

@LuciferYang (Contributor, Author)

Does this mean we drop support for building against the vanilla Hadoop 3 client?

def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  VersionUtils.majorMinorPatchVersion(hadoopVersion).exists {
    case (3, 2, v) if v >= 2 => true
    case (3, 3, v) if v >= 1 => true
    case _ => false
  }
}

Update: leave it; #33160 didn't get in, so Spark does not support building against the vanilla Hadoop 3 client.

friendly ping @sunchao

@sunchao (Member) left a comment

LGTM.

Note this also makes it impossible for users to pick a Hadoop version without shaded client support, like Hadoop 3.1.x. Previously they could do:

./mvn package -Dhadoop.version=3.1.2 -Dhadoop-client-api.artifact=hadoop-client ...

cc @xkrogen too (I vaguely remember you did something similar).

@sunchao (Member) commented Apr 20, 2023

A related JIRA: https://issues.apache.org/jira/browse/SPARK-37994

@pan3793 (Member) commented Apr 20, 2023

@sunchao so the currently supported Hadoop versions are 3.2.2+ and 3.3.1+? There is some code for Hadoop 3.0 and 3.1; should we remove it then?

@sunchao (Member) commented Apr 20, 2023

Yea, the shaded Hadoop client only works for Hadoop 3.2.2+ and 3.3.1+. I'm not sure if there are people who still use Hadoop 3.0/3.1 with Spark, though.

I'm not aware of any code in Spark that specifically depends on Hadoop 3.0/3.1. Could you point to it for me?

@pan3793 (Member) commented Apr 20, 2023

I'm not aware of any code in Spark that specifically depends on Hadoop 3.0/3.1. Could you point to it for me?

@sunchao I found two examples

@xkrogen (Contributor) commented Apr 20, 2023

No strong opinion on this, but we should make it clear that this PR is explicitly dropping support for Hadoop 3.0/3.1 and earlier versions of 3.2.

cc @mridulm

@LuciferYang (Contributor, Author) commented Apr 21, 2023

@xkrogen @sunchao @pan3793 I would like to clarify: dropping support for building and testing with Hadoop 3.0/3.1 is actually not the original intention of this PR.

So if there was a way to build and test with Hadoop 3.0/3.1 successfully before this PR, but it is lost after this PR, I think we should stop this work, because Apache Spark has not previously stated on any occasion that it no longer supports Hadoop 3.0/3.1, right?

@xkrogen @sunchao @pan3793 Can you give a command that can be used to build and test with Hadoop 3.0/3.1? I want to check it manually, thanks ~

@sunchao (Member) commented Apr 21, 2023

So if there was a way to build and test with Hadoop 3.0/3.1 successfully before this PR, but it is lost after this PR, I think we should stop this work, because Apache Spark has not previously stated on any occasion that it no longer supports Hadoop 3.0/3.1, right?

Yes, I think that's probably a sensible thing to do.

@xkrogen @sunchao @pan3793 Can you give a command that can be used to build and test with Hadoop 3.0/3.1? I want to check it manually, thanks ~

You can check this JIRA for the command to build: https://issues.apache.org/jira/browse/SPARK-37994

@LuciferYang (Contributor, Author)

So if there was a way to build and test with Hadoop 3.0/3.1 successfully before this PR, but it is lost after this PR, I think we should stop this work, because Apache Spark has not previously stated on any occasion that it no longer supports Hadoop 3.0/3.1, right?

Yes, I think that's probably a sensible thing to do.

@xkrogen @sunchao @pan3793 Can you give a command that can be used to build and test with Hadoop 3.0/3.1? I want to check it manually, thanks ~

You can check this JIRA for the command to build: https://issues.apache.org/jira/browse/SPARK-37994

I encountered the following errors while compiling the hadoop-cloud module when building with Hadoop 3.1.x:

[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:34: value hasPathCapability is not a member of org.apache.hadoop.fs.FileContext
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:34: not found: value CommonPathCapabilities
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:52: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:66: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream

Since the fix versions of HADOOP-15691 are 3.3.0, 3.2.2 and 3.2.3, and the fix version of HADOOP-16906 is 3.3.1, it is definitely not possible to build the `hadoop-cloud` module using Hadoop 3.1.x. I will remove this module to continue my experiment.
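
For context, a minimal sketch (my own, not the actual AbortableStreamBasedCheckpointFileManager code; the abortOrClose helper is hypothetical) of the API usage that only resolves against Hadoop 3.3.1+:

import org.apache.hadoop.fs.{CommonPathCapabilities, FSDataOutputStream, FileContext, Path}

def abortOrClose(fc: FileContext, path: Path, out: FSDataOutputStream): Unit = {
  // FileContext#hasPathCapability comes from HADOOP-15691 (fixed in 3.3.0/3.2.2+).
  if (fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM)) {
    // CommonPathCapabilities.ABORTABLE_STREAM and FSDataOutputStream#abort()
    // come from HADOOP-16906 (fixed in 3.3.1), so none of this compiles
    // against Hadoop 3.1.x clients.
    out.abort() // abandon the in-progress write instead of completing it
  } else {
    out.close()
  }
}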

LuciferYang marked this pull request as draft on April 23, 2023 16:12
@LuciferYang (Contributor, Author)

Converted to draft to avoid accidental merging.

LuciferYang marked this pull request as ready for review on April 24, 2023 14:01
@LuciferYang (Contributor, Author)

@xkrogen @sunchao @pan3793 Sharing my experimental results:

  1. Before building, we need to add the following content to resource-managers/yarn/pom.xml, referring to https://github.com/apache/spark/pull/33160/files:
    <profile>
      <id>no-shaded-hadoop-client</id>
      <dependencies>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-api</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-common</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-server-web-proxy</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-client</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
          <scope>test</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-server-tests</artifactId>
          <classifier>tests</classifier>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </profile>

Otherwise, the following compilation errors will occur with -Pyarn:

[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:30: object MiniYARNCluster is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:65: not found: type MiniYARNCluster
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:111: not found: type MiniYARNCluster
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:38: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:39: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:40: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:41: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:249: not found: type RMContext
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:251: not found: type RMApp
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:260: not found: type RMApplicationHistoryWriter
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:262: not found: type SystemMetricsPublisher
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:266: not found: type RMAppManager
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:271: not found: type ClientRMService
  2. With the above change, master can be built successfully with Hadoop 3.1.x as follows:
build/mvn clean install -Dhadoop.version=3.1.4 -Dhadoop-client-api.artifact=hadoop-client -Dhadoop-client-runtime.artifact=hadoop-yarn-api -Dhadoop-client-minicluster.artifact=hadoop-client -DskipTests -Pno-shaded-hadoop-client -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive

Otherwise, the yarn module cannot be built with Hadoop 3.1.x.

  3. hadoop-cloud can't be built with Hadoop 3.1.x due to [SPARK-43185][BUILD] Inline hadoop-client related properties in pom.xml #40847 (comment)

Overall, the current master cannot compile the yarn and hadoop-cloud modules using Hadoop 3.1.x without changes; all other modules are OK.

@LuciferYang (Contributor, Author) commented Apr 24, 2023

More findings:

  1. The conclusion with Hadoop 3.0.x is the same as with Hadoop 3.1.x.
  2. Hadoop 3.2.x also can't build the hadoop-cloud module:
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:34: value ABORTABLE_STREAM is not a member of object org.apache.hadoop.fs.CommonPathCapabilities
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:52: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:66: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream
[ERROR] three errors found
  3. Currently, only Hadoop 3.3.x can build all modules.

@sunchao (Member) commented Apr 24, 2023

Interesting, thanks for the detailed analysis @LuciferYang!

Hadoop 3.2.x also can't build the hadoop-cloud module

Is this Hadoop 3.2.2? I remember at some point we started to enable hadoop-cloud in Spark releases, so I wonder why this didn't cause any error back then...

@LuciferYang (Contributor, Author) commented Apr 25, 2023

Interesting, thanks for the detailed analysis @LuciferYang!

Hadoop 3.2.x also can't build the hadoop-cloud module

Is this Hadoop 3.2.2? I remember at some point we started to enable hadoop-cloud in Spark releases, so I wonder why this didn't cause any error back then...

I tested with Hadoop 3.2.4. AbortableStreamBasedCheckpointFileManager was introduced in SPARK-40039, and it uses APIs that are only available in Hadoop 3.3.1+ (HADOOP-16906, FSDataOutputStream#abort()).

@srowen (Member) commented May 5, 2023

Just to be clear, are we saying this is OK to merge, or are there issues with hadoop-cloud?

@pan3793 (Member) left a comment

I think it's OK to get in, because

  1. Spark does not officially claim to support building against the vanilla Hadoop 3 client, and it did not work before this change, so this PR breaks nothing.
  2. Both before and after this PR, Spark supports building against the Hadoop 3.3.1+ shaded client.
  3. Both before and after this PR, Spark can NOT build against the Hadoop 3.2.x shaded client because of SPARK-40039; restoring support for the Hadoop 3.2.x shaded client is a separate issue.

@pan3793 (Member) commented May 5, 2023

No strong opinion on this, but we should make it clear that this PR is explicitly dropping support for Hadoop 3.0/3.1 and earlier versions of 3.2.

@srowen I'm also +1 that we should clearly document the Hadoop client version support strategy.

@sunchao (Member) left a comment

I'm OK with the PR too, given that Spark already doesn't support most other Hadoop versions before 3.3.1.

srowen closed this in 1b54b01 on May 5, 2023
@srowen (Member) commented May 5, 2023

Merged to master

@dongjoon-hyun (Member) left a comment

+1, LGTM.

LuciferYang added a commit to LuciferYang/spark that referenced this pull request on May 10, 2023

[SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

### What changes were proposed in this pull request?
SPARK-36835 introduced `hadoop-client-api.artifact`, `hadoop-client-runtime.artifact` and `hadoop-client-minicluster.artifact` to be compatible with the dependency definitions of both Hadoop 2 and Hadoop 3.

After [SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452), Spark no longer supports Hadoop 2, so this PR inlines these properties to simplify the dependency definitions.

### Why are the changes needed?
Compatibility with Hadoop 2 is no longer required.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes apache#40847 from LuciferYang/SPARK-43185.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
