@LuciferYang
Contributor

What changes were proposed in this pull request?

This PR aims to use Utils.localHostNameForURI instead of Utils.localCanonicalHostName in the following suites, which were changed in #36866:

  • MasterSuite
  • MasterWebUISuite
  • RocksDBBackendHistoryServerSuite

Why are the changes needed?

These test cases fail when run with SPARK_LOCAL_IP=::1 and -Djava.net.preferIPv6Addresses=true.

Does this PR introduce any user-facing change?

No

How was this patch tested?

  • Pass GA
  • Manual test:
  1. export SPARK_LOCAL_IP=::1
echo $SPARK_LOCAL_IP
::1
  2. Add -Djava.net.preferIPv6Addresses=true to MAVEN_OPTS, for example:
diff --git a/pom.xml b/pom.xml
index 1ce3b43faf..3356622985 100644
--- a/pom.xml
+++ b/pom.xml
@@ -2943,7 +2943,7 @@
               <include>**/*Suite.java</include>
             </includes>
             <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
-            <argLine>-ea -Xmx4g -Xss4m -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=${CodeCacheSize} ${extraJavaTestArgs} -Dio.netty.tryReflectionSetAccessible=true</argLine>
+            <argLine>-ea -Xmx4g -Xss4m -XX:MaxMetaspaceSize=2g -XX:ReservedCodeCacheSize=${CodeCacheSize} ${extraJavaTestArgs} -Dio.netty.tryReflectionSetAccessible=true -Djava.net.preferIPv6Addresses=true</argLine>
             <environmentVariables>
               <!--
                 Setting SPARK_DIST_CLASSPATH is a simple way to make sure any child processes
  3. Run the Maven tests for RocksDBBackendHistoryServerSuite, MasterSuite and MasterWebUISuite:
mvn clean install -DskipTests -pl core -am
mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite
mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.MasterSuite
mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.ui.MasterWebUISuite
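
The three per-suite commands above differ only in the suite name, so they can be generated in a loop. The following is a minimal sketch, not part of the patch: `suite_cmd` and the `RUN` switch are hypothetical names introduced here, and it assumes that the install step has already been run and that -Djava.net.preferIPv6Addresses=true is already in effect for the test JVM. By default it only prints the commands.

```shell
#!/bin/sh
# Sketch only: suite_cmd and RUN are hypothetical names, not Spark tooling.
# Assumes "mvn clean install -DskipTests -pl core -am" has already been run
# and that -Djava.net.preferIPv6Addresses=true is already configured.
SUITES="org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite
org.apache.spark.deploy.master.MasterSuite
org.apache.spark.deploy.master.ui.MasterWebUISuite"

# Print the Maven command that runs a single suite in the core module.
suite_cmd() {
  printf 'mvn clean test -pl core -Dtest=none -DwildcardSuites=%s\n' "$1"
}

for suite in $SUITES; do
  cmd=$(suite_cmd "$suite")
  if [ "${RUN:-0}" = "1" ]; then
    # Execute the command with the IPv6 loopback address from step 1.
    SPARK_LOCAL_IP=::1 sh -c "$cmd"
  else
    # Dry run: show what would be executed.
    echo "SPARK_LOCAL_IP=::1 $cmd"
  fi
done
```

Running the script with RUN=1 set executes the suites one after another instead of printing the commands.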

Before

RocksDBBackendHistoryServerSuite:

- Redirect to the root page when accessed to /history/ *** FAILED ***
  java.net.ConnectException: Connection refused (Connection refused)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:613)
  at java.net.Socket.connect(Socket.java:561)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
  ...
Run completed in 31 seconds, 745 milliseconds.
Total number of tests run: 73
Suites: completed 2, aborted 0
Tests: succeeded 3, failed 70, canceled 0, ignored 0, pending 0
*** 70 TESTS FAILED ***

MasterSuite:

- master/worker web ui available behind front-end reverseProxy *** FAILED ***
  The code passed to eventually never returned normally. Attempted 487 times over 50.079685917 seconds. Last failure message: Connection refused (Connection refused). (MasterSuite.scala:405)
Run completed in 3 minutes, 48 seconds.
Total number of tests run: 32
Suites: completed 2, aborted 0
Tests: succeeded 29, failed 3, canceled 0, ignored 0, pending 0
*** 3 TESTS FAILED *** 

MasterWebUISuite:

- Kill multiple hosts *** FAILED ***
  java.net.ConnectException: Connection refused (Connection refused)
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
  at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
  at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
  at java.net.Socket.connect(Socket.java:613)
  at java.net.Socket.connect(Socket.java:561)
  at sun.net.NetworkClient.doConnect(NetworkClient.java:180)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:463)
  at sun.net.www.http.HttpClient.openServer(HttpClient.java:558)
  ...
Run completed in 7 seconds, 83 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 4, canceled 0, ignored 0, pending 0
*** 4 TESTS FAILED ***

After

RocksDBBackendHistoryServerSuite:

Run completed in 38 seconds, 205 milliseconds.
Total number of tests run: 73
Suites: completed 2, aborted 0
Tests: succeeded 73, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

MasterSuite:

Run completed in 1 minute, 10 seconds.
Total number of tests run: 32
Suites: completed 2, aborted 0
Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

MasterWebUISuite:

Run completed in 6 seconds, 330 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
All tests passed.

@LuciferYang
Contributor Author

LuciferYang commented Jun 15, 2022

@dongjoon-hyun Are the following steps for manually verifying SPARK-39464 correct?

  1. export SPARK_LOCAL_IP=::1
  2. add -Djava.net.preferIPv6Addresses=true to MAVEN_OPTS
  3. mvn clean install -DskipTests -pl core -am
  4. mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.RocksDBBackendHistoryServerSuite
  5. mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.MasterSuite
  6. mvn clean test -pl core -Dtest=none -DwildcardSuites=org.apache.spark.deploy.master.ui.MasterWebUISuite

I found that the tests still fail on the master branch with the above commands.

Member

@dongjoon-hyun dongjoon-hyun left a comment

I found that the tests still fail on the master branch with the above commands.

To answer your question: as I wrote in #36868, I verified it with the following command.

$ SERIAL_SBT_TESTS=1 SPARK_LOCAL_HOSTNAME='[2600:.(omitted)..:60cd]' build/sbt "core/test" -Djava.net.preferIPv6Addresses=true -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest
...
[info] Run completed in 18 minutes, 43 seconds.
[info] Total number of tests run: 2950
[info] Suites: completed 284, aborted 0
[info] Tests: succeeded 2950, failed 0, canceled 4, ignored 8, pending 0
[info] All tests passed.
[info] Passed: Total 3214, Failed 0, Errors 0, Passed 3214, Ignored 8, Canceled 4
[success] Total time: 1189 s (19:49), completed Jun 14, 2022, 4:45:55 PM

Here is my Jenkins result on an IPv6-only machine.
[Screenshot 2022-06-15 at 8 43 03 AM]

@LuciferYang
Contributor Author

Thank you for your reply. Let me give it a try.


 test("Kill one host") {
-  testKillWorkers(Seq("${Utils.localCanonicalHostName()}"))
+  testKillWorkers(Seq(s"${Utils.localHostNameForURI()}"))
Contributor Author

@LuciferYang LuciferYang Jun 15, 2022

Should this line be changed to Seq(s"${Utils.localCanonicalHostName()}")?

@dongjoon-hyun
Member

dongjoon-hyun commented Jun 15, 2022

I'll revisit this PR today, @LuciferYang.

And, since you are interested in this work, here are the details of my Jenkins configuration on Java 17.

export SPARK_LOCAL_HOSTNAME="[$(ifconfig | grep temporary | awk '{print $2}')]"
export DEFAULT_ARTIFACT_REPOSITORY=https://ipv6.repo1.maven.org/maven2/
export MAVEN_OPTS="-Djava.net.preferIPv6Addresses=true -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest"
export SBT_OPTS="-Djava.net.preferIPv6Addresses=true -Dtest.exclude.tags=org.apache.spark.tags.ExtendedLevelDBTest"
export SKIP_MIMA=true
export SKIP_UNIDOC=true
export SERIAL_SBT_TESTS=1
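
The SPARK_LOCAL_HOSTNAME line above relies on ifconfig reporting exactly one address flagged `temporary`; if several lines match, the command substitution would contain more than one address. A hedged sketch of a stricter variant follows. `first_temporary_ipv6` is a hypothetical helper, and the sample input is made up (it uses the 2001:db8::/32 documentation prefix) so that the block runs without calling ifconfig.

```shell
# first_temporary_ipv6 is a hypothetical helper, not part of Spark.
# It reads ifconfig-style text on stdin and prints the address (field 2)
# of the first line that contains the word "temporary".
first_temporary_ipv6() {
  awk '/temporary/ { print $2; exit }'
}

# Made-up sample in the assumed ifconfig layout.
sample='  inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1
  inet6 2001:db8::1 prefixlen 64 autoconf temporary
  inet6 2001:db8::2 prefixlen 64 autoconf temporary'

addr=$(printf '%s\n' "$sample" | first_temporary_ipv6)
echo "SPARK_LOCAL_HOSTNAME=[$addr]"
```

On a real machine the equivalent would be export SPARK_LOCAL_HOSTNAME="[$(ifconfig | first_temporary_ipv6)]", assuming the ifconfig output has the layout shown in the sample.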

The following modules already pass.

  • core
  • unsafe
  • kvstore
  • network-common
  • network-shuffle
  • repl,launcher,examples,sketch
  • graphx
  • catalyst
  • avro
  • mllib-local,mllib
  • yarn
  • mesos
  • kubernetes
  • hadoop-cloud
  • spark-ganglia-lgpl

@dongjoon-hyun
Member

@LuciferYang . Here is a doc PR for your question.

@dongjoon-hyun
Member

BTW, if you are using a Mac, please disable the firewall during testing.

@LuciferYang
Contributor Author

Firewall

Thanks, I fell asleep yesterday... I'll try this today.

@LuciferYang
Contributor Author

LuciferYang commented Jun 16, 2022

Yes, with export SPARK_LOCAL_HOSTNAME="[fe80::e63d:1aff:fe28:...]" the suites pass on Linux, but with export SPARK_LOCAL_IP=::1 they still fail. Is export SPARK_LOCAL_IP=::1 the wrong approach?

Member

@dongjoon-hyun dongjoon-hyun left a comment

Do we still need this patch, @LuciferYang?

On the master branch, could you confirm the following again? It works for me.

$ SPARK_LOCAL_IP=::1 SBT_OPTS=-Djava.net.preferIPv6Addresses=true build/sbt "core/testOnly org.apache.spark.deploy.master.MasterSuite"
[info] MasterSuite:
[info] - can use a custom recovery mode factory (524 milliseconds)
[info] - master correctly recover the application (56 milliseconds)
[info] - master/worker web ui available (392 milliseconds)
[info] - master/worker web ui available with reverseProxy (30 seconds, 160 milliseconds)
[info] - master/worker web ui available behind front-end reverseProxy (30 seconds, 127 milliseconds)
[info] - basic scheduling - spread out (22 milliseconds)
[info] - basic scheduling - no spread out (12 milliseconds)
[info] - basic scheduling with more memory - spread out (10 milliseconds)
[info] - basic scheduling with more memory - no spread out (9 milliseconds)
[info] - scheduling with max cores - spread out (9 milliseconds)
[info] - scheduling with max cores - no spread out (29 milliseconds)
[info] - scheduling with cores per executor - spread out (9 milliseconds)
[info] - scheduling with cores per executor - no spread out (9 milliseconds)
[info] - scheduling with cores per executor AND max cores - spread out (10 milliseconds)
[info] - scheduling with cores per executor AND max cores - no spread out (9 milliseconds)
[info] - scheduling with executor limit - spread out (9 milliseconds)
[info] - scheduling with executor limit - no spread out (8 milliseconds)
[info] - scheduling with executor limit AND max cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND max cores - no spread out (9 milliseconds)
[info] - scheduling with executor limit AND cores per executor - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor - no spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max cores - spread out (8 milliseconds)
[info] - scheduling with executor limit AND cores per executor AND max cores - no spread out (8 milliseconds)
[info] - SPARK-13604: Master should ask Worker kill unknown executors and drivers (23 milliseconds)
[info] - SPARK-20529: Master should reply the address received from worker (19 milliseconds)
[info] - SPARK-27510: Master should avoid dead loop while launching executor failed in Worker (46 milliseconds)
[info] - All workers on a host should be decommissioned (37 milliseconds)
[info] - No workers should be decommissioned with invalid host (33 milliseconds)
[info] - Only worker on host should be decommissioned (23 milliseconds)
[info] - SPARK-19900: there should be a corresponding driver for the app after relaunching driver (2 seconds, 50 milliseconds)
[info] - assign/recycle resources to/from driver (30 milliseconds)
[info] - assign/recycle resources to/from executor (29 milliseconds)
[info] Run completed in 1 minute, 5 seconds.
[info] Total number of tests run: 32
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 32, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 71 s (01:11), completed Jun 17, 2022, 4:20:08 PM

@LuciferYang
Contributor Author

LuciferYang commented Jun 18, 2022

@dongjoon-hyun With SPARK_LOCAL_IP=::1, SBT passes, but Maven still fails.

@LuciferYang
Contributor Author

In addition, I have disabled the firewall. If the Maven failure is expected, I will close this PR for now.

@dongjoon-hyun
Member

I'm going to revisit this later after stabilizing SBT first.

@dongjoon-hyun
Member

Just allow me more time to validate your patch, @LuciferYang . Thank you for your patience.

@dongjoon-hyun
Member

BTW, may I ask whether you tried SPARK_LOCAL_IP=[::1]? Technically, the brackets are required for the literal form of an IPv6 address in a URL (https://en.wikipedia.org/wiki/IPv6, https://datatracker.ietf.org/doc/html/rfc2732).

With SPARK_LOCAL_IP=::1, SBT passes, but Maven still fails.
In addition, I have disabled the firewall. If the Maven failure is expected, I will close this PR for now.
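
To illustrate the bracket rule from RFC 2732 mentioned above, here is a minimal sketch of how a host value could be normalized before it is placed in a URL. `host_for_uri` is a hypothetical helper written for this example, not a Spark API, and port 8080 is only an example value. It assumes the input is a bare host with no port attached.

```shell
# host_for_uri is a hypothetical helper, not part of Spark.
# Input is assumed to be a bare host (no ":port" suffix).
host_for_uri() {
  case "$1" in
    \[*\]) printf '%s\n' "$1" ;;   # already bracketed: leave as-is
    *:*)   printf '[%s]\n' "$1" ;; # contains a colon: treat as IPv6 literal
    *)     printf '%s\n' "$1" ;;   # hostname or IPv4 address: unchanged
  esac
}

# Port 8080 is only an example value.
echo "http://$(host_for_uri ::1):8080/"        # prints http://[::1]:8080/
echo "http://$(host_for_uri localhost):8080/"  # prints http://localhost:8080/
```

Without the brackets, the colons of the address would be indistinguishable from the separator before the port.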

@LuciferYang
Contributor Author

That also fails. I am going to add some logs today to investigate the difference between Maven and SBT.

@dongjoon-hyun
Member

Sorry for the long delay. Let me merge this. Thanks, @LuciferYang.

@LuciferYang
Contributor Author

Thanks, @dongjoon-hyun.
