
Conversation

@sunchao
Member

@sunchao sunchao commented Jun 30, 2021

What changes were proposed in this pull request?

Add a new Maven profile, no-shaded-hadoop-client, that, when activated, switches Spark to the non-shaded Hadoop client artifacts (e.g., hadoop-client, hadoop-yarn-client).

Why are the changes needed?

Currently, Spark uses the shaded Hadoop client by default. However, if users want to build Spark against an older version of Hadoop, such as 3.1.x, the shaded client cannot be used, since it is currently only supported for Hadoop 3.2.2+ and 3.3.1+. Therefore, this PR proposes a new Maven profile, "no-shaded-hadoop-client", for that use case.

Does this PR introduce any user-facing change?

Yes. Users can now choose to build Apache Spark with the non-shaded Hadoop client, e.g.:

build/mvn package -DskipTests -Dhadoop.version=3.1.1 -Pno-shaded-hadoop-client
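
A quick way to confirm which Hadoop client artifacts such a build resolves is to inspect the dependency tree of a module such as core; the expected artifact names below are an assumption about the profile's effect, not output taken from this PR:

build/mvn dependency:tree -Phadoop-3.2 -Dhadoop.version=3.1.1 -Pno-shaded-hadoop-client -pl core | grep hadoop
# With the profile active, this should list org.apache.hadoop:hadoop-client and
# org.apache.hadoop:hadoop-yarn-client instead of hadoop-client-api / hadoop-client-runtime.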

How was this patch tested?

Existing tests.

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44983/

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44983/

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140469 has finished for PR 33160 at commit 0bf197b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao sunchao changed the title [SPARK-35959][BUILD] Add a new Maven profile "no-shaded-client" for older Hadoop 3.x versions [SPARK-35959][BUILD] Add a new Maven profile "no-shaded-hadoop-client" for older Hadoop 3.x versions Jul 1, 2021
@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/44999/

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140487 has finished for PR 33160 at commit 37130a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@dongjoon-hyun dongjoon-hyun left a comment


Hi, @sunchao .

To verify via CI, could you make the profile active by default? After testing, we should remove it.

@dongjoon-hyun
Member

FYI, if you enable it by default, the dependency manifest files need to be updated accordingly.
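
If the profile is made the default, even temporarily, the pinned manifests under dev/deps would typically be regenerated with the dependency script; the exact invocation below is assumed from Spark's build tooling rather than taken from this PR:

./dev/test-dependencies.sh --replace-manifest
# Regenerates the dev/deps/spark-deps-hadoop-* manifests from the resolved dependency tree.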

@sunchao sunchao changed the title [SPARK-35959][BUILD] Add a new Maven profile "no-shaded-hadoop-client" for older Hadoop 3.x versions [SPARK-35959][BUILD] Add a new Maven profile "no-shaded-hadoop-client" for Hadoop versions older than 3.2.2/3.3.1 Jul 1, 2021
@sunchao
Member Author

sunchao commented Jul 1, 2021

To verify via CI, could you make the profile active by default? After testing, we should remove it.

Thanks @dongjoon-hyun . Will do.

@sunchao sunchao marked this pull request as draft July 1, 2021 17:51
@SparkQA

SparkQA commented Jul 1, 2021

Test build #140531 has finished for PR 33160 at commit bf51f50.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45044/

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45044/

@SparkQA

SparkQA commented Jul 1, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45049/

@dongjoon-hyun
Member

dongjoon-hyun commented Jul 1, 2021

For the Hadoop 2 build, I noticed that the GitHub Actions job uses sbt directly. So, I verified that combination's compilation manually on top of that GitHub Actions job command.

$ ./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Phadoop-2.7 -Pno-shaded-hadoop-client compile test:compile
...
[info] compiling 18 Scala sources to /Users/dongjoon/APACHE/spark-merge/sql/hive-thriftserver/target/scala-2.12/test-classes ...
[success] Total time: 142 s (02:22), completed Jul 1, 2021 1:48:49 PM

For the rest, it looks good to me. If GitHub Actions passes in about an hour, let's revert the dev/run-tests.py change and merge this PR.

@SparkQA

SparkQA commented Jul 1, 2021

Test build #140537 has finished for PR 33160 at commit 13932bf.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@sunchao
Member Author

sunchao commented Jul 1, 2021

Cause: java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
[info]   at org.apache.hadoop.http.HttpServer2.initializeWebServer(HttpServer2.java:707)
[info]   at org.apache.hadoop.http.HttpServer2.<init>(HttpServer2.java:687)

Hmm, for some reason it is still using Hadoop 3.3.1 classes, which are only compatible with Jetty 9.4+. Let me check why this happens.
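
One way to check which Jetty and Hadoop client versions the Maven build actually resolves is to grep the dependency tree, mirroring the checks done later in this thread; this is a diagnostic sketch, not a command taken from the PR:

build/mvn dependency:tree -Phadoop-3.2 -Dhadoop.version=3.1.1 -Pno-shaded-hadoop-client -pl core | grep -E 'jetty-server|hadoop-client'
# If the hadoop-client artifacts still resolve to 3.3.1 here, the version override is not taking effect.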

@dongjoon-hyun
Member

Let me check the PR builder.

@dongjoon-hyun
Member

So, SBT with the following conf still fails?

-Phadoop-3.2 -Dhadoop.version=3.1.1 -Pno-shaded-hadoop-client -Phive-2.3 
-Pmesos -Pspark-ganglia-lgpl -Pyarn -Pdocker-integration-tests -Pkubernetes
-Phive-thriftserver -Pkinesis-asl -Phive -Phadoop-cloud test:package streaming-kinesis-asl-assembly/assembly

@sunchao
Member Author

sunchao commented Jul 1, 2021

Yeah, but somehow it references Hadoop 3.3.1 classes like org.apache.hadoop.http.HttpServer2, judging from the line numbers.

@dongjoon-hyun
Member

It seems to be a Spark build issue: overriding the Hadoop version works only with Maven, not with sbt.

$ build/mvn dependency:tree -Phadoop-3.2 -Dhadoop.version=3.1.1 -pl core | grep hadoop.client
exec: curl --silent --show-error -L https://downloads.lightbend.com/scala/2.12.14/scala-2.12.14.tgz
Using `mvn` from path: /opt/homebrew/bin/mvn
[INFO] +- org.apache.hadoop:hadoop-client-api:jar:3.1.1:compile
[INFO] +- org.apache.hadoop:hadoop-client-runtime:jar:3.1.1:compile
$ build/sbt "core/dependencyTree" -Phadoop-3.2 | grep hadoop.client
[info]   +-org.apache.hadoop:hadoop-client-api:3.3.1
[info]   +-org.apache.hadoop:hadoop-client-runtime:3.3.1
[info]   | +-org.apache.hadoop:hadoop-client-api:3.3.1

@dongjoon-hyun
Member

In this case, we used to use the [test-hadoop3.2][test-java11] combination, which still works in Jenkins. However, the Jenkins PR builder seems to have been broken for [test-java11] for a while. Only the backend Jenkins job is working.

@dongjoon-hyun
Member

Let me try.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-35959][BUILD] Add a new Maven profile "no-shaded-hadoop-client" for Hadoop versions older than 3.2.2/3.3.1 [SPARK-35959][BUILD][test-hadoop3.2][test-java11] Add a new Maven profile "no-shaded-hadoop-client" for Hadoop versions older than 3.2.2/3.3.1 Jul 2, 2021
@dongjoon-hyun
Member

Retest this please

@sunchao
Member Author

sunchao commented Jul 2, 2021

@dongjoon-hyun Ahh, I think you are right! It seems sbt doesn't pick up the -Dhadoop.version parameter:

build/sbt "core/dependencyTree" -Phadoop-3.2 -Dhadoop.version=3.1.1 | grep hadoop.client

[info]   +-org.apache.hadoop:hadoop-client-api:3.3.1
[info]   +-org.apache.hadoop:hadoop-client-runtime:3.3.1
[info]   | +-org.apache.hadoop:hadoop-client-api:3.3.1
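
Since sbt ignores the -Dhadoop.version override here, the 3.3.1 artifacts above simply reflect the default hadoop.version property in the root pom. A quick way to confirm (the property name is assumed from Spark's root pom.xml):

grep -m1 '<hadoop.version>' pom.xml
# Expected to show 3.3.1, matching what sbt resolves above.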

@SparkQA

SparkQA commented Jul 2, 2021

Test build #140560 has finished for PR 33160 at commit b1e0583.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 2, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45072/

@github-actions github-actions bot added the SQL label Jul 6, 2021
@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45225/

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45225/

@SparkQA

SparkQA commented Jul 6, 2021

Test build #140714 has finished for PR 33160 at commit a22c1e7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45232/

@SparkQA

SparkQA commented Jul 6, 2021

Test build #140721 has finished for PR 33160 at commit 3ffecf8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jul 6, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/45232/

@sunchao
Member Author

sunchao commented Jul 7, 2021

It seems Spark can't use the non-shaded Hadoop 3.3.1 client as-is because of a jetty-server incompatibility: Hadoop 3.3.1 uses Jetty 9.4.40, while Spark master uses 9.4.42 (upgraded via #33053). The method SessionHandler.setHttpOnly was removed in 9.4.42, so we get an exception when trying to use the non-shaded Hadoop client:

sbt.ForkMain$ForkError: org.apache.hadoop.yarn.exceptions.YarnRuntimeException: java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.setHttpOnly(Z)V
	at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:384)
	at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:129)
	at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:500)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
	at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:122)
	at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:333)
	at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
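
To verify whether a given jetty-server jar still provides the method Hadoop calls, one can disassemble the class with javap; the jar path below is a placeholder, so adjust it to the jar actually on the classpath:

javap -cp /path/to/jetty-server-<version>.jar org.eclipse.jetty.server.session.SessionHandler | grep setHttpOnly || echo 'setHttpOnly not found'
# Present in the 9.4.40 jar that Hadoop 3.3.1 expects; absent in 9.4.42 per the analysis above.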

@SparkQA

SparkQA commented Aug 12, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/46877/

@SparkQA

SparkQA commented Aug 12, 2021

Test build #142369 has finished for PR 33160 at commit 4bf4533.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@steveloughran
Contributor

Jetty 9.4.40 while Spark master uses 9.4.42

We could move Hadoop 3.3.2 to the same Jetty version; if we get that release out, then things will briefly be in sync.

@sunchao
Member Author

sunchao commented Aug 19, 2021

@steveloughran Yes, we can. This is only an issue when Spark uses the non-shaded client, though, so I think it's OK, since it's better to just use the shaded client.

@SparkQA

SparkQA commented Sep 7, 2021

Kubernetes integration test unable to build dist.

exiting with code: 1
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47538/

@steveloughran
Contributor

Do you have any plans to update that Hadoop Jetty version alongside this?

@sunchao
Member Author

sunchao commented Sep 9, 2021

@steveloughran You mean upgrade the Jetty version in Hadoop? Yeah, I can check, but in any case Spark is not blocked by the Jetty issue.

@github-actions

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
