@Yikun
Member

@Yikun Yikun commented Mar 11, 2022

What changes were proposed in this pull request?

Why are the changes needed?

In Volcano, a queue's weight must be a positive integer, so a weight of 0 is invalid usage. As the queue description says:

  • weight is a soft constraint.
  • capability is a hard constraint.

We had better use capability to limit (effectively disable) the queue. This also fixes the error requestBody.spec.weight: Invalid value: 0: queue weight must be a positive integer when running the latest Volcano image.
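
As a rough illustration of the weight/capability distinction above, the sketch below prints a Queue manifest and refuses a non-positive weight before anything reaches the cluster. The helper name render_queue, the queue name, and the CPU value are hypothetical, and the apiVersion scheduling.volcano.sh/v1beta1 is an assumption about the installed Volcano CRDs, not something taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Volcano Queue manifest on stdout.
# Assumption: the Queue CRD is served as scheduling.volcano.sh/v1beta1.
render_queue() {
  local name="$1" weight="$2" cpu="$3"

  # Reject anything that is not a plain decimal integer.
  case "$weight" in
    ''|*[!0-9]*)
      echo "queue weight must be a positive integer: '${weight}'" >&2
      return 1
      ;;
  esac
  # Reject 0, mirroring the validation error quoted above.
  if [ "$weight" -le 0 ]; then
    echo "queue weight must be a positive integer: '${weight}'" >&2
    return 1
  fi

  cat <<EOF
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: ${name}
spec:
  weight: ${weight}
  capability:
    cpu: "${cpu}"
EOF
}

# Example: the queue keeps a valid weight but is capped to a tiny CPU share.
render_queue queue-disabled 1 0.001
```

On a cluster with Volcano installed, the output could be piped to kubectl apply -f - instead of being printed.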

Does this PR introduce any user-facing change?

No

How was this patch tested?

# arm64
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.5.1/installer/volcano-development-arm64.yaml

# x86
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.5.1/installer/volcano-development.yaml

build/sbt -Pvolcano -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r  -Dtest.include.tags=volcano  -Dspark.kubernetes.test.namespace=default "kubernetes-integration-tests/testOnly"
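
The two install commands above differ only in the manifest file name, so a small helper can pick the right one. This is a sketch: volcano_manifest_url is a hypothetical name and the accepted architecture strings are assumptions; the URL layout is copied from the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the Volcano development manifest URL
# for a given tag or branch and CPU architecture.
volcano_manifest_url() {
  local version="$1"
  # Default to the local machine's architecture when none is given.
  local arch="${2:-$(uname -m)}"
  local base="https://raw.githubusercontent.com/volcano-sh/volcano/${version}/installer"

  case "$arch" in
    arm64|aarch64)
      echo "${base}/volcano-development-arm64.yaml"
      ;;
    x86|x86_64|amd64)
      echo "${base}/volcano-development.yaml"
      ;;
    *)
      echo "unsupported architecture: ${arch}" >&2
      return 1
      ;;
  esac
}

# Example: print the arm64 manifest URL for the v1.5.1 tag.
volcano_manifest_url v1.5.1 arm64
```

The printed URL can be passed to kubectl apply -f in place of the hard-coded ones.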

@Yikun
Member Author

Yikun commented Mar 11, 2022

cc @dongjoon-hyun

Member

@dongjoon-hyun dongjoon-hyun left a comment

What is the Volcano versioning? I don't believe this is sustainable to depend on latest. Please be specific about

  • which Volcano version has these enforcements
  • which Volcano version can delete the Open queue.

Without a specific volcano version, everything is meaningless because it's fragile.

@Yikun
Member Author

Yikun commented Mar 12, 2022

Sorry for the trouble.

What is the Volcano versioning?

  • Volcano releases Docker images with specific tags, such as v1.5.0; the next release will be v1.5.1.
  • Fixes are merged to latest first and then backported to v1.5.x.
  • Use release-1.5/installer/volcano-development-*.yaml with the v1.5.x.beta.x images for continuous validation, as release candidates.

I don't believe this is sustainable to depend on latest

  • We used the latest images from master for the Spark ITs before. I am sorry about this; that decision was my fault. I considered continuous validation very important for us but overlooked stability.
  • We need to use release-1.5/installer/volcano-development-*.yaml to validate Spark 3.3. I am also tracking the issue "Add beta release for v1.5.1" (volcano-sh/volcano#2076), which asks for a v1.5.1 beta release for continuous validation; it will be resolved today.
  • I think most of this is due to my lack of consideration and communication, but the Volcano community is also actively helping to improve things; I hope you can understand. Thanks for all your help, @william-wang @kevin-wangzefeng @Thor-wl

which Volcano version has these enforcements?

Always. But as I mentioned before, v1.5.0 has a compatibility problem: the webhook features (admission control, input parameter validation) do not work in v1.5.0 with Kubernetes v1.22+, because admissionregistration.k8s.io/v1beta1 was dropped there, so we did not see this enforcement. This is already fixed in Volcano master (volcano-sh/volcano#2063) and will be backported to v1.5.x soon.

which Volcano version can delete the Open queue?

  • Since v1.5.1.beta.0, Volcano allows deleting an Open queue, as an ease-of-use improvement until the queue close API is supported.
  • Volcano will also support the close queue API in a following release.
  • To some extent, there is no further impact on Spark. For Spark 3.3, we can delete the queue directly. For follow-up Spark releases, I need to add a queue close operation before deleting the queue in the ITs, and also bump Volcano to the new version.

TLDR:

@dongjoon-hyun
Member

Thank you for the detail, @Yikun !

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38524][K8S][TESTS] Use capability to limit disable queue [SPARK-38524][K8S][TESTS] Fix Volcano weight to be positive integer and use capability instead Mar 12, 2022
@dongjoon-hyun dongjoon-hyun changed the title [SPARK-38524][K8S][TESTS] Fix Volcano weight to be positive integer and use capability instead [SPARK-38524][K8S][TESTS] Fix Volcano weight to be positive integer and use cpu capability instead Mar 12, 2022
@dongjoon-hyun
Member

Hi, @Yikun . Is there a pre-defined release cadence in Volcano community?

@Yikun
Member Author

Yikun commented Mar 13, 2022

@dongjoon-hyun Thanks for the notification. The beta image will be published in 1-2 hours. I also did a pre-validation on x86 and arm based on the Volcano latest branch, and all ITs passed as expected with this PR. I will ping you once it's ready.

@Thor-wl and @william-wang are doing the final check and will then release the beta image. Thanks for the help and effort over the weekend.

Is there a pre-defined release cadence in Volcano community?

Major versions have no fixed schedule, minor releases come about every 3-6 months, and patch releases happen on demand.

@dongjoon-hyun
Member

Thank you for the confirmation. Please note that Apache Spark cannot depend on SNAPSHOT or beta stuff.

@Yikun
Member Author

Yikun commented Mar 13, 2022

@dongjoon-hyun Yes for sure, the target version would be v1.5.1.

As we discussed before, the release would land after the Spark 3.3 branch cut (March 15) and before the Spark 3.3 release RC (April).

@Yikun
Member Author

Yikun commented Mar 13, 2022

All tests passed with Volcano release-1.5 (v1.5.1-beta.0 images) on x86 and arm64.

test on arm64 details
# arm64
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/release-1.5/installer/volcano-development-arm64.yaml

build/sbt -Pvolcano -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r  -Dtest.include.tags=volcano  -Dspark.kubernetes.test.namespace=default "kubernetes-integration-tests/testOnly"

[info] VolcanoSuite:
[info] - Run SparkPi with no resources (12 seconds, 370 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 520 milliseconds)
[info] - Run SparkPi with a very long application name. (11 seconds, 927 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 1 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (10 seconds, 899 milliseconds)
[info] - Run SparkPi with an argument. (11 seconds, 943 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 979 milliseconds)
[info] - All pods have the same service account by default (10 seconds, 962 milliseconds)
[info] - Run extraJVMOptions check on driver (6 seconds, 105 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (12 seconds, 17 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (19 seconds, 699 milliseconds)
[info] - Run SparkPi with env and mount secrets. (22 seconds, 314 milliseconds)
[info] - Run PySpark on simple pi.py example (13 seconds, 57 milliseconds)
[info] - Run PySpark to test a pyfiles example (15 seconds, 144 milliseconds)
[info] - Run PySpark with memory customization (12 seconds, 943 milliseconds)
[info] - Run in client mode. (8 seconds, 259 milliseconds)
[info] - Start pod creation from template (13 seconds, 95 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (11 seconds, 987 milliseconds)
[info] - Test basic decommissioning (46 seconds, 368 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (46 seconds, 699 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 46 seconds)
[info] - Test decommissioning timeouts (47 seconds, 204 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 8 seconds)
[info] - Run SparkPi with volcano scheduler (13 seconds, 27 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (38 seconds, 825 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (38 seconds, 911 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (19 seconds, 139 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (34 seconds, 263 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (18 seconds, 491 milliseconds)
[info] Run completed in 13 minutes, 16 seconds.
[info] Total number of tests run: 29
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 829 s (13:49), completed 2022-3-13 20:29:27
test on x86 details
# x86
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/release-1.5/installer/volcano-development.yaml

build/sbt -Pvolcano -Pkubernetes -Pkubernetes-integration-tests -Dtest.exclude.tags=minikube,r  -Dtest.include.tags=volcano  -Dspark.kubernetes.test.namespace=default "kubernetes-integration-tests/testOnly"

[info] VolcanoSuite:
[info] - Run SparkPi with no resources (12 seconds, 171 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (11 seconds, 741 milliseconds)
[info] - Run SparkPi with a very long application name. (12 seconds, 794 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (11 seconds, 740 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (11 seconds, 711 milliseconds)
[info] - Run SparkPi with an argument. (12 seconds, 789 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (11 seconds, 848 milliseconds)
[info] - All pods have the same service account by default (12 seconds, 756 milliseconds)
[info] - Run extraJVMOptions check on driver (6 seconds, 661 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (12 seconds, 881 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (19 seconds, 192 milliseconds)
[info] - Run SparkPi with env and mount secrets. (20 seconds, 721 milliseconds)
[info] - Run PySpark on simple pi.py example (13 seconds, 776 milliseconds)
[info] - Run PySpark to test a pyfiles example (15 seconds, 767 milliseconds)
[info] - Run PySpark with memory customization (13 seconds, 738 milliseconds)
[info] - Run in client mode. (9 seconds, 176 milliseconds)
[info] - Start pod creation from template (11 seconds, 792 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (12 seconds, 831 milliseconds)
[info] - Test basic decommissioning (47 seconds, 65 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (48 seconds, 200 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 46 seconds)
[info] - Test decommissioning timeouts (47 seconds, 423 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 10 seconds)
[info] - Run SparkPi with volcano scheduler (11 seconds, 695 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (34 seconds, 328 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (32 seconds, 235 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (18 seconds, 230 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (29 seconds, 370 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (19 seconds, 213 milliseconds)
[info] Run completed in 13 minutes, 13 seconds.
[info] Total number of tests run: 29
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 29, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 837 s (13:57), completed Mar 13, 2022 8:29:45 PM

Ready to go! @dongjoon-hyun

@Yikun Yikun marked this pull request as ready for review March 13, 2022 13:46
@dongjoon-hyun
Member

@dongjoon-hyun Yes for sure, the target version would be v1.5.1.

As we discussed before, the release would land after the Spark 3.3 branch cut (March 15) and before the Spark 3.3 release RC (April).

Yes. For the test PR, that sounds good because March 15 is the deadline for Feature Freeze.

@dongjoon-hyun dongjoon-hyun marked this pull request as draft March 14, 2022 18:40
@dongjoon-hyun dongjoon-hyun marked this pull request as ready for review March 14, 2022 18:42
weight: 1
capability:
cpu: "1"
cpu: "0.1"
Member

cpu: 0 is also invalid by definition?

Member Author

A capability of 0 is valid: a capability <= 0 represents no limit in a Volcano queue.

This differs from weight. Weight is a soft constraint used to calculate proportions from the weight values, so 0 would cause unexpected behavior and is therefore invalid.

Member

Got it. In that case, could you use a smaller value than 0.1, like 0.0001?

Member Author

@Yikun Yikun Mar 15, 2022

Sure. The smallest unit of precision is 0.001, as in Kubernetes, so we'd better just set it to 0.001.

PR updated.
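
To make the 0.001 precision concrete: Kubernetes expresses CPU in millicores, where 0.001 CPU equals 1m. The sketch below converts a CPU quantity to whole millicores and fails when the value is finer than that. The function name cpu_to_millicores is hypothetical, and it only handles plain decimals and the m suffix, not every quantity format Kubernetes accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a CPU quantity ("0.001", "1", "250m")
# to whole millicores; fail if it is finer than 1m.
cpu_to_millicores() {
  local q="$1"
  case "$q" in
    *m)
      # Already in millicores: strip the suffix.
      echo "${q%m}"
      ;;
    *)
      # Scale to millicores and require a (near) whole number.
      awk -v q="$q" 'BEGIN {
        m = q * 1000
        r = int(m + 0.5)
        if ((m - r) > 1e-9 || (r - m) > 1e-9) exit 1
        print r
      }' || {
        echo "CPU quantity finer than 0.001: ${q}" >&2
        return 1
      }
      ;;
  esac
}

# Example: the value used in this PR maps to a single millicore.
cpu_to_millicores 0.001
```

With this check, 0.0001 is rejected because it does not map to a whole number of millicores, while 0.001 is the smallest positive value that passes.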

@Yikun
Member Author

Yikun commented Mar 15, 2022

[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (11 seconds, 670 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minCPU (33 seconds, 564 milliseconds)
[info] - SPARK-38187: Run SparkPi Jobs with minMemory (32 seconds, 829 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (15 seconds, 918 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (31 seconds, 833 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (18 seconds, 571 milliseconds)
[info] Run completed in 2 minutes, 29 seconds.
[info] Total number of tests run: 6
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 6, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 303 s (05:03), completed 2022-3-15 11:28:21

@dongjoon-hyun
Member

It seems that v1.5.1-beta.0 has another installation issue.

          image: volcanosh/vc-controller-manager-arm64:v1.5.1-beta.0
...
      message: 'Internal error occurred: failed calling webhook "mutatepod.volcano.sh":
        Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s":
        no endpoints available for service "volcano-admission-service"'

@dongjoon-hyun
Member

I'm not sure what is going on in the Volcano project, but Volcano installation and cleanup seem incomplete.
It screwed the EKS cluster itself.

22/03/14 23:32:49 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/03/14 23:32:49 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
22/03/14 23:32:50 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
22/03/14 23:32:57 ERROR Client: Please check "kubectl auth can-i create pod" first. It should be yes.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: .... Message: 
Internal error occurred: failed calling webhook "mutatepod.volcano.sh": Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s": service "volcano-admission-service" not found. Received status: Status(apiVersion=v1, code=500, details=StatusDetails(causes=[StatusCause(field=null, message=failed calling webhook "mutatepod.volcano.sh": Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s": service "volcano-admission-service" not found, reason=null, additionalProperties={})], group=null, kind=null, name=null, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Internal error occurred: failed calling webhook "mutatepod.volcano.sh": Post "https://volcano-admission-service.volcano-system.svc:443/pods/mutate?timeout=10s": service "volcano-admission-service" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=InternalError, status=Failure, additionalProperties={}).

@dongjoon-hyun
Member

It was my bad. I made a follow-up to fix it.

@Yikun
Member Author

Yikun commented Mar 15, 2022

@dongjoon-hyun See also volcano-sh/volcano#2079, which notes how to clean up Volcano completely.

@Yikun
Member Author

Yikun commented Mar 15, 2022

Then, after the above cleanup operations, you can just use release-1.5 (v1.5.1-beta.0) to test.

@dongjoon-hyun
Member

@Yikun . I cannot verify your PR because surprisingly Volcano v1.5.0 fails to install on a new vanilla EKS cluster from the start.

$ k apply -f https://raw.githubusercontent.com/volcano-sh/volcano/v1.5.0/installer/volcano-development-arm64.yaml
namespace/volcano-system created
configmap/volcano-scheduler-configmap created
serviceaccount/volcano-scheduler created
clusterrole.rbac.authorization.k8s.io/volcano-scheduler created
clusterrolebinding.rbac.authorization.k8s.io/volcano-scheduler-role created
deployment.apps/volcano-scheduler created
serviceaccount/volcano-admission created
clusterrole.rbac.authorization.k8s.io/volcano-admission created
clusterrolebinding.rbac.authorization.k8s.io/volcano-admission-role created
deployment.apps/volcano-admission created
service/volcano-admission-service created
job.batch/volcano-admission-init created
serviceaccount/volcano-controllers created
clusterrole.rbac.authorization.k8s.io/volcano-controllers created
clusterrolebinding.rbac.authorization.k8s.io/volcano-controllers-role created
deployment.apps/volcano-controllers created
error: error validating "https://raw.githubusercontent.com/volcano-sh/volcano/v1.5.0/installer/volcano-development-arm64.yaml": error validating data: [ValidationError(CustomResourceDefinition.spec): unknown field "subresources" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "validation" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): unknown field "version" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec, ValidationError(CustomResourceDefinition.spec): missing required field "versions" in io.k8s.apiextensions-apiserver.pkg.apis.apiextensions.v1.CustomResourceDefinitionSpec]; if you choose to ignore these errors, turn validation off with --validate=false

@dongjoon-hyun
Member

I'm trying to rebuild the cluster. I'll let you know if I succeed again.

@Yikun
Member Author

Yikun commented Mar 15, 2022

I cannot verify your PR because surprisingly Volcano v1.5.0 fails to install on a new vanilla EKS cluster from the start.

This is the issue we met before in #35422 (comment), which is fixed in master and backported to v1.5.x in volcano-sh/volcano@42fd488.

k delete validatingwebhookconfigurations volcano-admission-service-jobs-validate volcano-admission-service-pods-validate volcano-admission-service-queues-validate
k delete mutatingwebhookconfigurations volcano-admission-service-jobs-mutate volcano-admission-service-podgroups-mutate volcano-admission-service-pods-mutate volcano-admission-service-queues-mutate
# Using beta
kubectl apply -f https://raw.githubusercontent.com/volcano-sh/volcano/release-1.5/installer/volcano-development-arm64.yaml

Try this. As I mentioned before, some bugs were fixed after the Volcano v1.5.0 release, such as:

Or do you mean we have to bump Volcano to v1.5.1 first and then merge this PR?

@dongjoon-hyun
Member

Ya, I found that I was hitting the original issue of lack of ARM64 support of Volcano.

@Yikun
Member Author

Yikun commented Mar 15, 2022

I found that I was hitting the original issue of lack of ARM64 support of Volcano

@dongjoon-hyun So my suggestion is that we first validate on release-1.5; if all is ready, Volcano will release v1.5.1, and then we bump the minimum verified version to v1.5.1. WDYT?

@dongjoon-hyun
Member

I don't think that's Apache Spark's role. The Apache Spark community will mark this as a known issue in the release notes if Volcano doesn't provide a release by our release date, @Yikun .

@Yikun
Member Author

Yikun commented Mar 15, 2022

@dongjoon-hyun OK, so now there is only one way to solve this issue: Volcano releases v1.5.1 before the branch cut?

If we find any other issues before the Spark 3.3 RC, Volcano could release an on-demand patch version like v1.5.2.

@dongjoon-hyun
Member

Unfortunately, I don't know what is inside the Volcano code, @Yikun . It's the role of the Volcano community to provide a reliable, tested release. Given what I have observed so far while helping in this area, I don't know whether v1.5.1 will be healthy or not.

We will test the stability again when it's available publicly.

@Yikun
Member Author

Yikun commented Mar 15, 2022

@dongjoon-hyun OK, thanks for your suggestion, I will discuss with Volcano community! Thanks!

@github-actions github-actions bot added the DOCS label Mar 15, 2022
@Yikun Yikun changed the title [SPARK-38524][K8S][TESTS] Fix Volcano weight to be positive integer and use cpu capability instead [SPARK-38524][SPARK-38553][K8S] Bump Volcano to v1.5.1 and fix Volcano weight to be positive integer and use cpu capability instead Mar 15, 2022
@Yikun Yikun marked this pull request as draft March 15, 2022 09:47
@Yikun
Member Author

Yikun commented Mar 15, 2022

@dongjoon-hyun Volcano is considering releasing v1.5.1 (same code base as v1.5.1.beta.0). The release will be ready soon. volcano-sh/volcano#2090

I also updated this PR to bump to v1.5.1 and fix the test issue. I couldn't separate the version bump from the test case change, so I squashed them into this PR.

@Yikun Yikun marked this pull request as ready for review March 15, 2022 15:21
@Yikun
Member Author

Yikun commented Mar 15, 2022

Ready for review.

Member

@dongjoon-hyun dongjoon-hyun left a comment

+1, LGTM. It's great, @Yikun .
Merged to master.
