
Conversation


@Yikun Yikun commented Feb 7, 2022

What changes were proposed in this pull request?

This patch adds a Volcano feature step to help users integrate Spark with the Volcano scheduler.

  • Add a VolcanoFeatureStep that can be used on both the driver and executor side.

After this patch, users can enable this feature step when submitting a job with:

--conf spark.kubernetes.driver.scheduler.name=volcano \
--conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep

A PodGroup is created before the driver starts, and an annotation is set on the driver pod to add it to this PodGroup. The Volcano scheduler then handles the driver pod's scheduling instead of the default Kubernetes scheduler.
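For context, here is a minimal sketch of what such a feature step can look like, assuming Spark's KubernetesFeatureConfigStep trait and the fabric8 Volcano client model; the class name, PodGroup naming, and annotation key below are illustrative, not necessarily this PR's exact code:

```scala
import io.fabric8.kubernetes.api.model.{HasMetadata, PodBuilder}
import io.fabric8.volcano.scheduling.v1beta1.PodGroupBuilder

import org.apache.spark.deploy.k8s.{KubernetesConf, SparkPod}
import org.apache.spark.deploy.k8s.features.KubernetesFeatureConfigStep

// Hypothetical sketch: one PodGroup per application, created before the
// driver pod, plus the annotation Volcano uses to tie pods to that group.
class VolcanoFeatureStepSketch(conf: KubernetesConf)
  extends KubernetesFeatureConfigStep {

  private val podGroupName = s"${conf.appId}-podgroup"

  // Resources returned here are created before the driver pod, so the
  // PodGroup already exists when Volcano first sees the pod.
  override def getAdditionalPreKubernetesResources(): Seq[HasMetadata] = Seq(
    new PodGroupBuilder()
      .editOrNewMetadata()
        .withName(podGroupName)
      .endMetadata()
      .build())

  // Annotate the pod so Volcano schedules it as part of the PodGroup.
  override def configurePod(pod: SparkPod): SparkPod = {
    val annotated = new PodBuilder(pod.pod)
      .editOrNewMetadata()
        .addToAnnotations("scheduling.k8s.io/group-name", podGroupName)
      .endMetadata()
      .build()
    SparkPod(annotated, pod.container)
  }
}
```

With a step like this on the classpath, the spark.kubernetes.driver.pod.featureSteps setting above is what wires it into pod construction.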

Why are the changes needed?

This PR helps users integrate Spark with the Volcano scheduler.

See also: SPARK-36057

Does this PR introduce any user-facing change?

Yes, it introduces a user-facing feature step.

The new hooks are used by VolcanoFeatureStep, and will also be used by a YuniKorn feature step in the future.

How was this patch tested?

  • UT
  • Integration test without -Pvolcano (make sure the existing integration tests pass):
# SBT
build/sbt -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
# Maven
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --exclude-tags minikube,r
  • Integration test with -Pvolcano: run all of VolcanoSuite (all Kubernetes tests with Volcano + a new PodGroup test) and KubernetesSuite:
# Deploy Volcano (x86)
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
# Deploy Volcano (arm64)
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml
# Run VolcanoSuite and KubernetesSuite
build/sbt -Pvolcano -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"

@Yikun Yikun force-pushed the SPARK-36061-vc-step branch from 74b56c5 to 1126602 on February 7, 2022 13:11

Yikun commented Feb 7, 2022

also cc @dongjoon-hyun @attilapiros @holdenk

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for pinging me, @Yikun.

@dongjoon-hyun dongjoon-hyun left a comment

Could you provide us some integration tests for now, @Yikun? At least some procedure for how to verify your contributions?

@dongjoon-hyun dongjoon-hyun left a comment

In addition, please propose the full config namespace design. Otherwise, the current proposal doesn't conform to our policy.

spark.kubernetes.job.min.cpu
spark.kubernetes.job.min.memory


Yikun commented Feb 8, 2022

In addition, please propose the full config namespace design.

@dongjoon-hyun Thanks for the reminder. There are 5 configurations to be introduced:

spark.kubernetes.job.min.cpu: the minimum CPU resources for running
spark.kubernetes.job.min.memory: the minimum memory resources for running
spark.kubernetes.job.min.member: the minimum number of pods for running
spark.kubernetes.job.priorityClassName: the priority of the running job
spark.kubernetes.job.queue: the queue to which the running job belongs

Should I put all of these in this PR, or make a separate PR first?
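For illustration, a rough sketch of how these proposed configs could map onto the Volcano PodGroup spec, assuming the fabric8 Volcano model; the object name, hard-coded values, and exact builder calls below are assumptions for demonstration, not code from this PR:

```scala
import java.util.{HashMap => JMap}

import io.fabric8.kubernetes.api.model.Quantity
import io.fabric8.volcano.scheduling.v1beta1.{PodGroup, PodGroupBuilder}

// Hypothetical mapping of the proposed configs onto PodGroup spec fields:
//   spark.kubernetes.job.min.cpu / min.memory -> spec.minResources
//   spark.kubernetes.job.min.member           -> spec.minMember
//   spark.kubernetes.job.priorityClassName    -> spec.priorityClassName
//   spark.kubernetes.job.queue                -> spec.queue
object PodGroupMappingSketch {
  def podGroupFor(appId: String): PodGroup = {
    // min.cpu / min.memory become the PodGroup's minResources map.
    val minResources = new JMap[String, Quantity]()
    minResources.put("cpu", new Quantity("2"))
    minResources.put("memory", new Quantity("2Gi"))
    new PodGroupBuilder()
      .editOrNewMetadata()
        .withName(s"$appId-podgroup")
      .endMetadata()
      .editOrNewSpec()
        .withMinMember(2)                        // min.member
        .withQueue("default")                    // queue
        .withPriorityClassName("high-priority")  // priorityClassName
        .withMinResources(minResources)
      .endSpec()
      .build()
  }
}
```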


dongjoon-hyun commented Feb 8, 2022

Thanks. You can make separate PRs, but you cannot introduce an intermediate namespace min here according to your plan.
You need changes like min.cpu -> minCpu and min.memory -> minMemory. In addition, member looks too general to be used in that context. Please revise it into a more meaningful name.

@william-wang

william-wang commented Feb 8, 2022

Thanks. You can make separate PRs, but you cannot introduce an intermediate namespace min here according to your plan. You need changes like min.cpu -> minCpu and min.memory -> minMemory. In addition, member looks too general to be used in that context. Please revise it into a more meaningful name.

@dongjoon-hyun Thanks for your comments. It's reasonable to change min.cpu -> minCPU.
How about changing min.member -> minMember, which keeps the same style as minCPU? It is easier to understand.

@Yikun Yikun marked this pull request as draft February 8, 2022 10:58

Yikun commented Feb 8, 2022

It works with Maven, but there are some failures in the sbt build: the scala-maven-plugin-related exclude and dependency configuration can't be recognized by sbt. Converted to WIP first.


Yikun commented Feb 8, 2022

Configuration-related code has been moved to #35436


martin-g commented Feb 8, 2022

It works with Maven, but there are some failures in the sbt build: the scala-maven-plugin-related exclude and dependency configuration can't be recognized by sbt

How about this:

diff --git project/SparkBuild.scala project/SparkBuild.scala
index 3d3a65f3d2..63ce6cbbf5 100644
--- project/SparkBuild.scala
+++ project/SparkBuild.scala
@@ -429,6 +429,12 @@ object SparkBuild extends PomBuild {
     enable(SparkR.settings)(core)
   }
 
+  if (!profiles.contains("volcano")) {
+    enable(Seq(
+      Compile / unmanagedSources / excludeFilter := HiddenFileFilter || "VolcanoFeatureStep.scala"
+    ))(kubernetes)
+  }
+


martin-g commented Feb 8, 2022

In addition, I think it would be good to add -Pvolcano to the CI config:

diff --git .github/workflows/build_and_test.yml .github/workflows/build_and_test.yml
index 4529cd9ba4..9edf5efd35 100644
--- .github/workflows/build_and_test.yml
+++ .github/workflows/build_and_test.yml
@@ -614,7 +614,7 @@ jobs:
         export MAVEN_CLI_OPTS="--no-transfer-progress"
         export JAVA_VERSION=${{ matrix.java }}
        # It uses Maven's 'install' intentionally, see https://github.com/apache/spark/pull/26414.
-        ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Djava.version=${JAVA_VERSION/-ea} install
+        ./build/mvn $MAVEN_CLI_OPTS -DskipTests -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Djava.version=${JAVA_VERSION/-ea} install
         rm -rf ~/.m2/repository/org/apache/spark
 
   scala-213:
@@ -660,7 +660,7 @@ jobs:
     - name: Build with SBT
       run: |
         ./dev/change-scala-version.sh 2.13
-        ./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile
+        ./build/sbt -Pyarn -Pmesos -Pkubernetes -Pvolcano -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pspark-ganglia-lgpl -Pscala-2.13 compile test:compile

@Yikun Yikun force-pushed the SPARK-36061-vc-step branch from facde87 to f38137a on February 8, 2022 15:08

Yikun commented Feb 8, 2022

@martin-g Thanks! I added the sbt-related code, and I will add -Pvolcano to the GitHub Actions config after I make sure nothing breaks.

@dongjoon-hyun dongjoon-hyun left a comment

To @Yikun:

  1. Thank you for adding the profile.
  2. No, we do not accept empty-configuration PRs. All code should be working and have test coverage.

To @william-wang, please don't use member.

And +1 for @martin-g's comment about adding new profiles to the CIs.


Yikun commented Feb 9, 2022

No, we do not accept empty-configuration PRs. All code should be working and have test coverage.

@dongjoon-hyun lol, I just misunderstood before; it felt a little odd to me too, but it has no effect on this PR. This PR will only add the VolcanoFeatureStep and introduce the volcano module (enabling PodGroup). The next PRs will then introduce all configurations and the Volcano implementations.

And +1 for @martin-g's comment about adding new profiles to the CIs.

Sure, will address soon.

Thanks for the help!

@Yikun Yikun force-pushed the SPARK-36061-vc-step branch from f38137a to c8d7f5c on February 9, 2022 03:59

Yikun commented Feb 11, 2022

Thanks for the review, will update soon.

The current design is valid, but we may need to rename the module name itself from volcano to a more neutral name.

The original thought was to have volcano as a separate module: volcano is enabled when -Pvolcano is set, and yunikorn would be enabled when -Pyunikorn is set.

@dongjoon-hyun IIUC, you meant putting them together into a module like kubernetes-custom-scheduler?

@dongjoon-hyun

No, I didn't mean anything specific yet at this stage~ I must be clear on that. Sorry.
I said the above because you may get some community feedback later asking to rename it. :)


Yikun commented Feb 11, 2022

@dongjoon-hyun Fine, thanks for clarifying.

@dongjoon-hyun

Thank you for the updates, @Yikun.


Yikun commented Feb 11, 2022

Test result:

  1. All VolcanoSuite and KubernetesSuite tests with -Pvolcano passed:
$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml
$ build/sbt -Pvolcano -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r "kubernetes-integration-tests/test"
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (11 seconds, 31 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (10 seconds, 898 milliseconds)
[info] - Run SparkPi with a very long application name. (11 seconds, 820 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 666 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (11 seconds, 735 milliseconds)
[info] - Run SparkPi with an argument. (11 seconds, 746 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 809 milliseconds)
[info] - All pods have the same service account by default (11 seconds, 739 milliseconds)
[info] - Run extraJVMOptions check on driver (6 seconds, 784 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (11 seconds, 911 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (19 seconds, 226 milliseconds)
[info] - Run SparkPi with env and mount secrets. (19 seconds, 579 milliseconds)
[info] - Run PySpark on simple pi.py example (12 seconds, 792 milliseconds)
[info] - Run PySpark to test a pyfiles example (14 seconds, 815 milliseconds)
[info] - Run PySpark with memory customization (11 seconds, 773 milliseconds)
[info] - Run in client mode. (8 seconds, 177 milliseconds)
[info] - Start pod creation from template (11 seconds, 917 milliseconds)
[info] - Test basic decommissioning (47 seconds, 37 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (46 seconds, 139 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 45 seconds)
[info] - Test decommissioning timeouts (46 seconds, 329 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 7 seconds)
[info] VolcanoSuite:
[info] - Run SparkPi with no resources (12 seconds, 797 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 734 milliseconds)
[info] - Run SparkPi with a very long application name. (11 seconds, 753 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 768 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (12 seconds, 705 milliseconds)
[info] - Run SparkPi with an argument. (11 seconds, 740 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 710 milliseconds)
[info] - All pods have the same service account by default (12 seconds, 724 milliseconds)
[info] - Run extraJVMOptions check on driver (5 seconds, 673 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (11 seconds, 746 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (20 seconds, 247 milliseconds)
[info] - Run SparkPi with env and mount secrets. (21 seconds, 711 milliseconds)
[info] - Run PySpark on simple pi.py example (13 seconds, 807 milliseconds)
[info] - Run PySpark to test a pyfiles example (15 seconds, 887 milliseconds)
[info] - Run PySpark with memory customization (13 seconds, 869 milliseconds)
[info] - Run in client mode. (8 seconds, 161 milliseconds)
[info] - Start pod creation from template (12 seconds, 922 milliseconds)
[info] - Test basic decommissioning (46 seconds, 836 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (49 seconds, 179 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 54 seconds)
[info] - Test decommissioning timeouts (48 seconds, 456 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 8 seconds)
[info] - Run SparkPi with volcano scheduler (16 seconds, 328 milliseconds)
[info] Run completed in 24 minutes, 52 seconds.
[info] Total number of tests run: 45
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 45, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 1609 s (26:49), completed Feb 11, 2022 1:46:17 PM
  2. All CI tests without -Pvolcano passed: https://github.com/Yikun/spark/actions/runs/1827439940

@dongjoon-hyun dongjoon-hyun left a comment

It seems that the new tests verify the custom scheduler feature.

Do you think we can verify further Volcano scheduler features like the following?

  • Gang scheduling
  • Fair-share scheduling
  • Queue scheduling
  • Preemption scheduling
  • Topology-based scheduling
  • Reclaim
  • Backfill
  • Resource reservation

@dongjoon-hyun

It seems that the following is not working on ARM64 because the images are not multi-arch images. Could you elaborate on how we can verify this on ARM64?

$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development.yaml

$ docker manifest inspect volcanosh/vc-scheduler:latest
{
	"schemaVersion": 2,
	"mediaType": "application/vnd.docker.distribution.manifest.v2+json",
	"config": {
		"mediaType": "application/vnd.docker.container.image.v1+json",
		"size": 1885,
		"digest": "sha256:17e4e3e09cf79245501745cd254f35afa59e9a948d17da248e6c77757ae95ec7"
	},
	"layers": [
		{
			"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
			"size": 2818413,
			"digest": "sha256:59bf1c3509f33515622619af21ed55bbe26d24913cedbca106468a5fb37a50c3"
		},
		{
			"mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
			"size": 25563591,
			"digest": "sha256:bcfdb76fd5f9ec35cb800c64e9471f5d94656b15ac4785e3bc36d8887463f060"
		}
	]
}


Yikun commented Feb 11, 2022

@dongjoon-hyun Try this:

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml

And I will also do validations on arm64. : )

@dongjoon-hyun

Thanks! Please add that into the PR description too.

@dongjoon-hyun

Ur, it seems to fail.

$ kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml
namespace/volcano-system created
configmap/volcano-scheduler-configmap created
serviceaccount/volcano-scheduler created
clusterrole.rbac.authorization.k8s.io/volcano-scheduler created
clusterrolebinding.rbac.authorization.k8s.io/volcano-scheduler-role created
deployment.apps/volcano-scheduler created
serviceaccount/volcano-admission created
clusterrole.rbac.authorization.k8s.io/volcano-admission created
clusterrolebinding.rbac.authorization.k8s.io/volcano-admission-role created
deployment.apps/volcano-admission created
service/volcano-admission-service created
job.batch/volcano-admission-init created
serviceaccount/volcano-controllers created
clusterrole.rbac.authorization.k8s.io/volcano-controllers created
clusterrolebinding.rbac.authorization.k8s.io/volcano-controllers-role created
deployment.apps/volcano-controllers created
error: error validating "https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml": error validating data: [ValidationError(CustomResourceDefinition.spec): unknown field "subresources" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "validation" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "version" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): missing required field "versions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec]; if you choose to ignore these errors, turn validation off with --validate=false


Yikun commented Feb 11, 2022

Ur, it seems to fail.

@dongjoon-hyun yep, there are some failures in the Volcano arm64 deployment YAML, but they will be addressed soon; cc @william-wang volcano-sh/volcano#2010

Do you think we can verify further Volcano scheduler features like the following?

Yep, most of them are configuration introduction plus testing work. The cases below will be added to the integration tests in Spark 3.3; I'm working on these (it's on my roadmap for next week):

  1. SPARK-38187: Resource reservation (this should be a separate PR introducing minCPU/minMemory + an integration test): either the driver and all executors start, or nothing starts; this is also often confused with gang scheduling.
  2. SPARK-38188: Queue scheduling (a separate PR introducing queue + an integration test)
  3. SPARK-38189: Preemption scheduling (a separate PR introducing priorityClass + an integration test)
  4. SPARK-38190: Fair-share scheduling (a separate PR with an integration test)

(later: I also updated the corresponding JIRAs)

Topology-based, reclaim, and backfill scheduling should also be supported in theory, but their integration tests would be lower priority for Spark 3.3 because, in actual Spark-on-K8s usage, these scenarios are somewhat less common. I could try to validate them if I have time.


Yikun commented Feb 11, 2022

The failed test is unrelated: - job with fetch failure *** FAILED *** (51 milliseconds). Rerunning.

@william-wang

Ur, it seems to fail.
@dongjoon-hyun The issue is fixed :)

@martin-g

The issue is fixed :)

Confirmed! Works here!


Yikun commented Feb 12, 2022

@dongjoon-hyun @william-wang @martin-g Thanks for all the help; it also works and passed all integration tests on my arm64 env.

Env info:
ubuntu@yikun-aarch64:~$ uname -a
Linux yikun-aarch64 5.4.0-91-generic #102-Ubuntu SMP Fri Nov 5 16:30:45 UTC 2021 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@yikun-aarch64:~$ kubectl version
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:41:01Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/arm64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.1", GitCommit:"86ec240af8cbd1b60bcc4c03c20da9b98005b92e", GitTreeState:"clean", BuildDate:"2021-12-16T11:34:54Z", GoVersion:"go1.17.5", Compiler:"gc", Platform:"linux/arm64"}
Test result on arm64 env:
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (18 seconds, 751 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (18 seconds, 422 milliseconds)
[info] - Run SparkPi with a very long application name. (18 seconds, 151 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (18 seconds, 450 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (18 seconds, 180 milliseconds)
[info] - Run SparkPi with an argument. (18 seconds, 295 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (18 seconds, 372 milliseconds)
[info] - All pods have the same service account by default (18 seconds, 694 milliseconds)
[info] - Run extraJVMOptions check on driver (10 seconds, 331 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (19 seconds, 330 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (31 seconds, 586 milliseconds)
[info] - Run SparkPi with env and mount secrets. (32 seconds, 28 milliseconds)
[info] - Run PySpark on simple pi.py example (20 seconds, 393 milliseconds)
[info] - Run PySpark to test a pyfiles example (23 seconds, 430 milliseconds)
[info] - Run PySpark with memory customization (19 seconds, 258 milliseconds)
[info] - Run in client mode. (12 seconds, 282 milliseconds)
[info] - Start pod creation from template (18 seconds, 549 milliseconds)
[info] - Test basic decommissioning (52 seconds, 144 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (52 seconds, 793 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 53 seconds)
[info] - Test decommissioning timeouts (51 seconds, 503 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 12 seconds)
[info] VolcanoSuite:
[info] - Run SparkPi with no resources (18 seconds, 506 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (18 seconds, 262 milliseconds)
[info] - Run SparkPi with a very long application name. (18 seconds, 283 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (18 seconds, 467 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (18 seconds, 303 milliseconds)
[info] - Run SparkPi with an argument. (19 seconds, 254 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (18 seconds, 223 milliseconds)
[info] - All pods have the same service account by default (18 seconds, 397 milliseconds)
[info] - Run extraJVMOptions check on driver (9 seconds, 304 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (19 seconds, 217 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (32 seconds, 603 milliseconds)
[info] - Run SparkPi with env and mount secrets. (35 seconds, 860 milliseconds)
[info] - Run PySpark on simple pi.py example (20 seconds, 385 milliseconds)
[info] - Run PySpark to test a pyfiles example (24 seconds, 638 milliseconds)
[info] - Run PySpark with memory customization (19 seconds, 467 milliseconds)
[info] - Run in client mode. (11 seconds, 180 milliseconds)
[info] - Start pod creation from template (18 seconds, 587 milliseconds)
[info] - Test basic decommissioning (53 seconds, 381 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (52 seconds, 668 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 53 seconds)
[info] - Test decommissioning timeouts (52 seconds, 605 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 12 seconds)
[info] - Run SparkPi with volcano scheduler (18 seconds, 529 milliseconds)

@dongjoon-hyun dongjoon-hyun left a comment

Thank you for the update, @Yikun. The plan looks reasonable to me.

To @william-wang, could you publish the images in multi-arch format? Since we need to document this officially in the Apache Spark 3.3 documentation, it would be great if we could drop complexity like the -arm64 postfix.

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml

@william-wang

To @william-wang, could you publish the images in multi-arch format? Since we need to document this officially in the Apache Spark 3.3 documentation, it would be great if we could drop complexity like the -arm64 postfix.

kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/master/installer/volcano-development-arm64.yaml

@dongjoon-hyun That's a good idea :). Multi-arch support is already on the Volcano community roadmap. In particular, we are going to support it before the Apache Spark 3.3 release; there will be a separate PR to tell users how to deploy Volcano for Spark on K8s.

@dongjoon-hyun

Got it. Thank you for the confirmation.

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. Thank you everyone, @Yikun, @HyukjinKwon, @martin-g, and @william-wang.
Although there are many things we need to do in the future, I believe we can merge this PR to make the upcoming PRs easier. I'm on your side to help this effort.


Yikun commented Feb 16, 2022

@dongjoon-hyun Many thanks for your efforts and key input.

And thanks for all the help! @HyukjinKwon, @martin-g, and @william-wang

dongjoon-hyun pushed a commit that referenced this pull request Sep 5, 2024
… to manage the code for `volcano`

### What changes were proposed in this pull request?
The main changes in this PR are as follows:

1. In `resource-managers/kubernetes/core/pom.xml` and `resource-managers/kubernetes/integration-tests/pom.xml`, a `build-helper-maven-plugin` configuration has been added for the `volcano` profile to ensure that when the profile is activated with `-Pvolcano`, the `volcano/src/main/scala` directory is treated as an additional source path and the `volcano/src/test/scala` directory is treated as an additional test source path.

- `resource-managers/kubernetes/core/pom.xml`

```xml
      <build>
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>build-helper-maven-plugin</artifactId>
            <executions>
              <execution>
                <id>add-volcano-source</id>
                <phase>generate-sources</phase>
                <goals>
                  <goal>add-source</goal>
                </goals>
                <configuration>
                  <sources>
                    <source>volcano/src/main/scala</source>
                  </sources>
                </configuration>
              </execution>
              <execution>
                <id>add-volcano-test-sources</id>
                <phase>generate-test-sources</phase>
                <goals>
                  <goal>add-test-source</goal>
                </goals>
                <configuration>
                  <sources>
                    <source>volcano/src/test/scala</source>
                  </sources>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
```

- `resource-managers/kubernetes/integration-tests/pom.xml`

```xml
      <build>
        <plugins>
          <plugin>
            <groupId>org.codehaus.mojo</groupId>
            <artifactId>build-helper-maven-plugin</artifactId>
            <executions>
              <execution>
                <id>add-volcano-test-sources</id>
                <phase>generate-test-sources</phase>
                <goals>
                  <goal>add-test-source</goal>
                </goals>
                <configuration>
                  <sources>
                    <source>volcano/src/test/scala</source>
                  </sources>
                </configuration>
              </execution>
            </executions>
          </plugin>
        </plugins>
      </build>
```

2. Removed the source/test code management configuration for `volcano` that was introduced in SPARK-36061 | #35422.

Since Spark uses the `sbt-pom-reader` plugin in its sbt configuration, the behavior of the `build-helper-maven-plugin` also propagates to the sbt build process. Therefore, no additional configuration is required in `SparkBuild.scala` after this PR.

### Why are the changes needed?
The previous configuration was not very friendly to IntelliJ developers: when debugging code in IntelliJ, the `volcano` profile had to be activated regardless of whether it was needed; otherwise, compilation errors would occur when running tests that depended on the `kubernetes` module's source code, for example `org.apache.spark.shuffle.ShuffleChecksumUtilsSuite`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
1. Pass GitHub Actions

- https://github.com/LuciferYang/spark/actions/runs/10714343021/job/29707907464

<img width="1109" alt="image" src="https://github.com/user-attachments/assets/d893accb-508f-47f5-b19e-e178f6eff128">

- https://github.com/LuciferYang/spark/actions/runs/10714343021/job/29707906573

<img width="1183" alt="image" src="https://github.com/user-attachments/assets/735e0dc7-7d2c-418f-8fcd-200ee10eda0d">

It can be seen that the test cases `VolcanoFeatureStepSuite` and `VolcanoSuite` have been successfully executed.

2. Manual Testing Using sbt

- Run `build/sbt clean "kubernetes/testOnly *VolcanoFeatureStepSuite" -Pkubernetes`, and without `-Pvolcano`, no tests will be executed:

```
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[info] No tests to run for kubernetes / Test / testOnly
```

- Run `build/sbt clean "kubernetes/testOnly *VolcanoFeatureStepSuite" -Pkubernetes -Pvolcano`, and with `-Pvolcano`, `VolcanoFeatureStepSuite` will pass the tests:

```
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup (74 milliseconds)
[info] - SPARK-36061: Executor Pod with Volcano PodGroup (8 milliseconds)
[info] - SPARK-38455: Support driver podgroup template (224 milliseconds)
[info] - SPARK-38503: return empty for executor pre resource (1 millisecond)
[info] Run completed in 1 second, 268 milliseconds.
[info] Total number of tests run: 4
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
```

- Run `build/sbt clean "kubernetes/package" -Pkubernetes -Pvolcano`, and with `-Pvolcano`, confirm that `spark-kubernetes_2.13-4.0.0-SNAPSHOT.jar` contains `VolcanoFeatureStep.class`

- Run `build/sbt clean "kubernetes/package" -Pkubernetes`, and without `-Pvolcano`, confirm that `spark-kubernetes_2.13-4.0.0-SNAPSHOT.jar` does not contain `VolcanoFeatureStep.class`

3. Manual Testing Using Maven

- Run `build/mvn clean test -pl resource-managers/kubernetes/core -am -Dtest=none -DwildcardSuites=org.apache.spark.deploy.k8s.features.VolcanoFeatureStepSuite -Pkubernetes`, and without `-Pvolcano`, no tests will be executed:

```
Discovery starting.
Discovery completed in 80 milliseconds.
Run starting. Expected test count is: 0
DiscoverySuite:
Run completed in 99 milliseconds.
Total number of tests run: 0
Suites: completed 1, aborted 0
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
No tests were executed.
```

- Run `build/mvn clean test -pl resource-managers/kubernetes/core -am -Dtest=none -DwildcardSuites=org.apache.spark.deploy.k8s.features.VolcanoFeatureStepSuite -Pkubernetes -Pvolcano`, and with `-Pvolcano`, `VolcanoFeatureStepSuite` will pass the tests:

```
Discovery starting.
Discovery completed in 263 milliseconds.
Run starting. Expected test count is: 4
VolcanoFeatureStepSuite:
- SPARK-36061: Driver Pod with Volcano PodGroup
- SPARK-36061: Executor Pod with Volcano PodGroup
- SPARK-38455: Support driver podgroup template
- SPARK-38503: return empty for executor pre resource
Run completed in 624 milliseconds.
Total number of tests run: 4
Suites: completed 2, aborted 0
Tests: succeeded 4, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

- Run `build/mvn clean package -pl resource-managers/kubernetes/core -am -DskipTests -Pkubernetes -Pvolcano`, and with `-Pvolcano`, confirm that `spark-kubernetes_2.13-4.0.0-SNAPSHOT.jar` contains `VolcanoFeatureStep.class`

- Run `build/mvn clean package -pl resource-managers/kubernetes/core -am -DskipTests -Pkubernetes`, and without `-Pvolcano`, confirm that `spark-kubernetes_2.13-4.0.0-SNAPSHOT.jar` does not contain `VolcanoFeatureStep.class`

4. Testing in IntelliJ (both imported as a Maven project and as an sbt project):
- By default, do not activate `volcano`, and confirm that `volcano`-related code is not recognized as source/test code, and does not affect the compilation and testing of other code.
- Manually activate `volcano`, and confirm that `volcano`-related code is recognized as source/test code, and can be compiled and tested normally.

5. Similar tests were conducted on the `kubernetes-integration-tests` module to confirm the validity of the `volcano` profile.

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #47997 from LuciferYang/refactor-volcano.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
