
@LuciferYang (Contributor) commented Apr 19, 2023

What changes were proposed in this pull request?

SPARK-36835 introduced `hadoop-client-api.artifact`, `hadoop-client-runtime.artifact` and `hadoop-client-minicluster.artifact` to be compatible with the dependency definitions of both Hadoop 2 and Hadoop 3.

After SPARK-42452, Spark no longer supports Hadoop 2, so this PR inlines these properties to simplify the dependency definitions.
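For illustration, a minimal sketch of the kind of change (my reconstruction, not the exact diff; it assumes the properties were referenced as artifact IDs, which matches the -Dhadoop-client-api.artifact override shown later in this thread). Before, a dependency resolved the artifact name through a property:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <!-- resolved via the hadoop-client-api.artifact property -->
      <artifactId>${hadoop-client-api.artifact}</artifactId>
      <version>${hadoop.version}</version>
    </dependency>

After inlining, the artifact name is written directly:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client-api</artifactId>
      <version>${hadoop.version}</version>
    </dependency>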

Why are the changes needed?

Compatibility with Hadoop 2 is no longer required.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Pass GitHub Actions

@HyukjinKwon (Member)

cc @sunchao

@pan3793 (Member) commented Apr 20, 2023

Does this mean we drop support for building against the vanilla Hadoop 3 client?

def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  VersionUtils.majorMinorPatchVersion(hadoopVersion).exists {
    case (3, 2, v) if v >= 2 => true
    case (3, 3, v) if v >= 1 => true
    case _ => false
  }
}
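
For example, the check yields:

supportsHadoopShadedClient("3.2.2") // true
supportsHadoopShadedClient("3.3.1") // true
supportsHadoopShadedClient("3.1.4") // false, no shaded client support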

Update: leave it; #33160 didn't get in, so Spark does not support building against the vanilla Hadoop 3 client.

@LuciferYang (Contributor, Author)

Does this mean we drop support for building against the vanilla Hadoop 3 client?

def supportsHadoopShadedClient(hadoopVersion: String): Boolean = {
  VersionUtils.majorMinorPatchVersion(hadoopVersion).exists {
    case (3, 2, v) if v >= 2 => true
    case (3, 3, v) if v >= 1 => true
    case _ => false
  }
}

Update: leave it; #33160 didn't get in, so Spark does not support building against the vanilla Hadoop 3 client.

friendly ping @sunchao

@sunchao (Member) left a comment

LGTM.

Note this also makes it impossible for users to pick a Hadoop version without shaded client support, like Hadoop 3.1.x. Previously they could do:

./mvn package -Dhadoop.version=3.1.2 -Dhadoop-client-api.artifact=hadoop-client ...

cc @xkrogen too (I vaguely remember you did something similar).

@sunchao (Member) commented Apr 20, 2023

A related JIRA: https://issues.apache.org/jira/browse/SPARK-37994

@pan3793 (Member) commented Apr 20, 2023

@sunchao so the currently supported Hadoop versions are 3.2.2+ and 3.3.1+? There is some code for Hadoop 3.0 and 3.1; should we remove it then?

@sunchao (Member) commented Apr 20, 2023

Yea, the shaded Hadoop client only works for Hadoop 3.2.2+ and 3.3.1+. I'm not sure if there are people who still use Hadoop 3.0/3.1 with Spark, though.

I'm not aware of any code in Spark that specifically depends on Hadoop 3.0/3.1. Could you point to it for me?

@pan3793 (Member) commented Apr 20, 2023

I'm not aware of any code in Spark that specifically depends on Hadoop 3.0/3.1. Could you point to it for me?

@sunchao I found two examples

@xkrogen (Contributor) commented Apr 20, 2023

No strong opinion on this, but we should make it clear that this PR is explicitly dropping support for Hadoop 3.0/3.1 and earlier versions of 3.2.

cc @mridulm

@LuciferYang (Contributor, Author) commented Apr 21, 2023

@xkrogen @sunchao @pan3793 I would like to clarify: dropping support for building and testing with Hadoop 3.0/3.1 is actually not the original intention of this PR.

So if there was a way to build and test with Hadoop 3.0/3.1 successfully before this PR, but it is lost after this PR, I think we should stop this work, because Apache Spark has not previously stated on any occasion that it no longer supports Hadoop 3.0/3.1, right?

@xkrogen @sunchao @pan3793 Can you give a command that can be used to build and test with Hadoop 3.0/3.1? I want to check it manually, thanks ~

@sunchao (Member) commented Apr 21, 2023

So if there was a way to build and test with Hadoop 3.0/3.1 successfully before this PR, but it is lost after this PR, I think we should stop this work, because Apache Spark has not previously stated on any occasion that it no longer supports Hadoop 3.0/3.1, right?

Yes, I think that's probably a sensible thing to do.

@xkrogen @sunchao @pan3793 Can you give a command that can be used to build and test with Hadoop 3.0/3.1? I want to check it manually, thanks ~

You can check this JIRA for the command to build: https://issues.apache.org/jira/browse/SPARK-37994

@LuciferYang (Contributor, Author)

So if there was a way to build and test with Hadoop 3.0/3.1 successfully before this PR, but it is lost after this PR, I think we should stop this work, because Apache Spark has not previously stated on any occasion that it no longer supports Hadoop 3.0/3.1, right?

Yes, I think that's probably a sensible thing to do.

@xkrogen @sunchao @pan3793 Can you give a command that can be used to build and test with Hadoop 3.0/3.1? I want to check it manually, thanks ~

You can check this JIRA for the command to build: https://issues.apache.org/jira/browse/SPARK-37994

I encountered the following errors while compiling the hadoop-cloud module when building with Hadoop 3.1.x:

[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:34: value hasPathCapability is not a member of org.apache.hadoop.fs.FileContext
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:34: not found: value CommonPathCapabilities
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:52: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:66: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream

Since the fix versions of HADOOP-15691 are 3.3.0, 3.2.2 and 3.2.3, and the fix version of HADOOP-16906 is 3.3.1, it is definitely not possible to build the `hadoop-cloud` module using Hadoop 3.1.x. I will remove this module to continue my experiment.
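
For context, a minimal sketch (my own, not the actual AbortableStreamBasedCheckpointFileManager code; the abortOrClose helper is hypothetical) of the API usage that only resolves against Hadoop 3.3.1+:

import org.apache.hadoop.fs.{CommonPathCapabilities, FSDataOutputStream, FileContext, Path}

def abortOrClose(fc: FileContext, path: Path, out: FSDataOutputStream): Unit = {
  // FileContext#hasPathCapability comes from HADOOP-15691 (fixed in 3.3.0/3.2.2+).
  if (fc.hasPathCapability(path, CommonPathCapabilities.ABORTABLE_STREAM)) {
    // CommonPathCapabilities.ABORTABLE_STREAM and FSDataOutputStream#abort()
    // come from HADOOP-16906 (fixed in 3.3.1), so none of this compiles
    // against Hadoop 3.1.x clients.
    out.abort() // abandon the in-progress write instead of completing it
  } else {
    out.close()
  }
}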

LuciferYang marked this pull request as draft on April 23, 2023 16:12
@LuciferYang (Contributor, Author)

Converted to draft to avoid accidental merging.

LuciferYang marked this pull request as ready for review on April 24, 2023 14:01
@LuciferYang (Contributor, Author)

@xkrogen @sunchao @pan3793 Sharing my experimental results:

  1. Before building, we need to add the following content to resource-managers/yarn/pom.xml, referring to https://github.com/apache/spark/pull/33160/files:
    <profile>
      <id>no-shaded-hadoop-client</id>
      <dependencies>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-api</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-common</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-server-web-proxy</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-client</artifactId>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-server-resourcemanager</artifactId>
          <scope>test</scope>
        </dependency>
        <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-yarn-server-tests</artifactId>
          <classifier>tests</classifier>
          <scope>test</scope>
        </dependency>
      </dependencies>
    </profile>

Otherwise, the following compilation errors will occur with -Pyarn:

[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:30: object MiniYARNCluster is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:65: not found: type MiniYARNCluster
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/BaseYarnClusterSuite.scala:111: not found: type MiniYARNCluster
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:38: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:39: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:40: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:41: object resourcemanager is not a member of package org.apache.hadoop.yarn.server
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:249: not found: type RMContext
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:251: not found: type RMApp
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:260: not found: type RMApplicationHistoryWriter
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:262: not found: type SystemMetricsPublisher
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:266: not found: type RMAppManager
[ERROR] [Error] /${spark-source-dir}/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/ClientSuite.scala:271: not found: type ClientRMService
  2. With the above change, master can be built successfully with Hadoop 3.1.x as follows:
build/mvn clean install -Dhadoop.version=3.1.4 -Dhadoop-client-api.artifact=hadoop-client -Dhadoop-client-runtime.artifact=hadoop-yarn-api -Dhadoop-client-minicluster.artifact=hadoop-client -DskipTests -Pno-shaded-hadoop-client -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive

Otherwise, the yarn module cannot be built with Hadoop 3.1.x.

  3. hadoop-cloud can't be built with Hadoop 3.1.x due to [SPARK-43185][BUILD] Inline hadoop-client related properties in pom.xml #40847 (comment)

Overall, the current master cannot compile the yarn and hadoop-cloud modules using Hadoop 3.1.x without changes; all other modules are OK.

@LuciferYang (Contributor, Author) commented Apr 24, 2023

More findings:

  1. The conclusion with Hadoop 3.0.x is the same as with Hadoop 3.1.x.
  2. Hadoop 3.2.x also can't build the hadoop-cloud module:
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:34: value ABORTABLE_STREAM is not a member of object org.apache.hadoop.fs.CommonPathCapabilities
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:52: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream
[ERROR] [Error] /${spark-source-dir}/hadoop-cloud/src/hadoop-3/main/scala/org/apache/spark/internal/io/cloud/AbortableStreamBasedCheckpointFileManager.scala:66: value abort is not a member of org.apache.hadoop.fs.FSDataOutputStream
[ERROR] three errors found
  3. Currently, only Hadoop 3.3.x can build all modules.

@sunchao (Member) commented Apr 24, 2023

Interesting, thanks for the detailed analysis @LuciferYang!

Hadoop 3.2.x also can't build the hadoop-cloud module

Is this Hadoop 3.2.2? I remember at some point we started to enable hadoop-cloud in Spark releases, so I wonder why this didn't cause any error back then...

@LuciferYang (Contributor, Author) commented Apr 25, 2023

Interesting, thanks for the detailed analysis @LuciferYang!

Hadoop 3.2.x also can't build the hadoop-cloud module

Is this Hadoop 3.2.2? I remember at some point we started to enable hadoop-cloud in Spark releases, so I wonder why this didn't cause any error back then...

I tested with Hadoop 3.2.4. AbortableStreamBasedCheckpointFileManager was introduced in SPARK-40039, and it uses APIs that are only available in Hadoop 3.3.1+ (HADOOP-16906, FSDataOutputStream#abort()).

@srowen (Member) commented May 5, 2023

Just to be clear, are we saying this is OK to merge, or are there issues with hadoop-cloud?

@pan3793 (Member) left a comment

I think it's OK to get in, because

  1. Spark does not officially claim to support building against the vanilla Hadoop 3 client, and it did not work before this change, so this PR breaks nothing.
  2. Both before and after this PR, Spark supports building against the Hadoop 3.3.1+ shaded client.
  3. Both before and after this PR, Spark can NOT build against the Hadoop 3.2.x shaded client because of SPARK-40039; restoring support for the Hadoop 3.2.x shaded client is a separate issue.

@pan3793 (Member) commented May 5, 2023

No strong opinion on this, but we should make it clear that this PR is explicitly dropping support for Hadoop 3.0/3.1 and earlier versions of 3.2.

@srowen I'm also +1 that we should clearly document the Hadoop client version support strategy.

@sunchao (Member) left a comment

I'm OK with the PR too, given that Spark already doesn't support most other Hadoop versions before 3.3.1.

srowen closed this in 1b54b01 on May 5, 2023
@srowen (Member) commented May 5, 2023

Merged to master

@dongjoon-hyun (Member) left a comment

+1, LGTM.

LuciferYang added a commit to LuciferYang/spark that referenced this pull request on May 10, 2023

[SPARK-43185][BUILD] Inline `hadoop-client` related properties in `pom.xml`

### What changes were proposed in this pull request?
SPARK-36835 introduced `hadoop-client-api.artifact`, `hadoop-client-runtime.artifact` and `hadoop-client-minicluster.artifact` to be compatible with the dependency definitions of both Hadoop 2 and Hadoop 3.

After [SPARK-42452](https://issues.apache.org/jira/browse/SPARK-42452), Spark no longer supports Hadoop 2, so this PR inlines these properties to simplify the dependency definitions.

### Why are the changes needed?
Compatibility with Hadoop 2 is no longer required.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

Closes apache#40847 from LuciferYang/SPARK-43185.

Lead-authored-by: yangjie01 <[email protected]>
Co-authored-by: YangJie <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
