@dongjoon-hyun
Member

@dongjoon-hyun dongjoon-hyun commented Mar 9, 2022

What changes were proposed in this pull request?

This PR aims to remove spark.kubernetes.job.queue in favor of spark.kubernetes.driver.podGroupTemplateFile for Apache Spark 3.3.
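
For illustration only, here is a minimal sketch of how a queue might be expressed through a PodGroup template rather than through the removed spark.kubernetes.job.queue setting. It assumes the Volcano scheduler's PodGroup resource. The file name and the queue name `queue1` are hypothetical and do not come from this PR.

```yaml
# podgroup-template.yaml (hypothetical file name)
# Assumes Volcano's PodGroup API; another scheduler would use its own template format.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
spec:
  # Hypothetical queue name; this value was previously passed via spark.kubernetes.job.queue.
  queue: queue1
```

Under the same assumptions, the template would be referenced at submit time with something like `--conf spark.kubernetes.driver.podGroupTemplateFile=/path/to/podgroup-template.yaml`, where the path is a placeholder.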

Why are the changes needed?

There are several batch scheduler options in K8s environments, including custom schedulers.
We should isolate scheduler-specific settings instead of introducing new configurations.

Does this PR introduce any user-facing change?

No. The previous configuration has not been released yet.

How was this patch tested?

Pass the CIs and K8s IT.

[info] KubernetesSuite:
[info] - Run SparkPi with no resources (8 seconds, 548 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (8 seconds, 419 milliseconds)
[info] - Run SparkPi with a very long application name. (8 seconds, 360 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (8 seconds, 386 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (8 seconds, 589 milliseconds)
[info] - Run SparkPi with an argument. (8 seconds, 361 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (8 seconds, 363 milliseconds)
[info] - All pods have the same service account by default (8 seconds, 332 milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 331 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 392 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (13 seconds, 915 milliseconds)
[info] - Run SparkPi with env and mount secrets. (18 seconds, 172 milliseconds)
[info] - Run PySpark on simple pi.py example (9 seconds, 368 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 489 milliseconds)
[info] - Run PySpark with memory customization (9 seconds, 378 milliseconds)
[info] - Run in client mode. (6 seconds, 296 milliseconds)
[info] - Start pod creation from template (8 seconds, 465 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (9 seconds, 460 milliseconds)
[info] - Test basic decommissioning (40 seconds, 795 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (41 seconds, 16 milliseconds)
[info] *** Test still running after 2 minutes, 19 seconds: suite name: KubernetesSuite, test name: Test decommissioning with dynamic allocation & shuffle cleanups.
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 40 seconds)
[info] - Test decommissioning timeouts (40 seconds, 446 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 5 seconds)
[info] - Run SparkR on simple dataframe.R example (12 seconds, 562 milliseconds)
[info] VolcanoSuite:
[info] - Run SparkPi with no resources (10 seconds, 339 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 346 milliseconds)
[info] - Run SparkPi with a very long application name. (9 seconds, 306 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 361 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (9 seconds, 344 milliseconds)
[info] - Run SparkPi with an argument. (9 seconds, 421 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 365 milliseconds)
[info] - All pods have the same service account by default (9 seconds, 337 milliseconds)
[info] - Run extraJVMOptions check on driver (5 seconds, 348 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 310 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (15 seconds, 13 milliseconds)
[info] - Run SparkPi with env and mount secrets. (18 seconds, 466 milliseconds)
[info] - Run PySpark on simple pi.py example (10 seconds, 558 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 445 milliseconds)
[info] - Run PySpark with memory customization (10 seconds, 395 milliseconds)
[info] - Run in client mode. (6 seconds, 239 milliseconds)
[info] - Start pod creation from template (10 seconds, 415 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (9 seconds, 440 milliseconds)
[info] - Test basic decommissioning (42 seconds, 799 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (42 seconds, 836 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 41 seconds)
[info] - Test decommissioning timeouts (42 seconds, 375 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 7 seconds)
[info] - Run SparkR on simple dataframe.R example (12 seconds, 441 milliseconds)
[info] - Run SparkPi with volcano scheduler (10 seconds, 421 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (13 seconds, 256 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (16 seconds, 216 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (14 seconds, 264 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 325 milliseconds)
[info] Run completed in 28 minutes, 9 seconds.
[info] Total number of tests run: 53
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 53, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 1785 s (29:45), completed Mar 8, 2022 11:15:23 PM

@dongjoon-hyun
Member Author

cc @viirya and @Yikun

Member

@yaooqinn yaooqinn left a comment

LGTM

Member

@Yikun Yikun left a comment

@viirya @martin-g @yaooqinn Thanks for your review, and sorry for the late reply. Frankly, I was a little concerned about flexibility before, but now I'm +1 on this.

If needed, we can still carefully select some configurations to override in the future.

I also took some time to gather more feedback from our internal and local users/developers (@yaooqinn @aidaizyy @william-wang @k82cn) who use Kubernetes or Spark with Volcano. They also think it's a good approach.

Thanks @dongjoon-hyun for your help! LGTM!

Member

@Yikun Yikun left a comment

Did you run into a problem that made you change val to var?

All related tests pass with the change below, so maybe val is enough here.

[info] VolcanoSuite:
[info] - Run SparkPi with volcano scheduler (12 seconds, 439 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (14 seconds, 336 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (25 seconds, 422 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (18 seconds, 373 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 409 milliseconds)

Of course, it's fine with me to address this in a follow-up.

@k82cn

k82cn commented Mar 9, 2022

Overall, that's OK with me :) But it would be better to have related parameters to make it easier.

.stringConf
.createOptional

val KUBERNETES_JOB_QUEUE = ConfigBuilder("spark.kubernetes.job.queue")
Member

Member Author

Thanks

@github-actions github-actions bot added the DOCS label Mar 9, 2022
@dongjoon-hyun
Member Author

No, @k82cn. That's not better because this is Volcano-specific. Apache Spark wants to be open and extensible to all custom schedulers. To do that, we need clear isolation between schedulers.


@dongjoon-hyun
Member Author

The last commit updates documentation and changes var to val in the test case.

  • UT passed.
[info] BasicDriverFeatureStepSuite:
[info] - Check the pod respects all configurations from the user. (200 milliseconds)
[info] - Check driver pod respects kubernetes driver request cores (9 milliseconds)
[info] - Check appropriate entrypoint rerouting for various bindings (3 milliseconds)
[info] - memory overhead factor: java (2 milliseconds)
[info] - memory overhead factor: python default (2 milliseconds)
[info] - memory overhead factor: python w/ override (2 milliseconds)
[info] - memory overhead factor: r default (1 millisecond)
[info] - SPARK-35493: make spark.blockManager.port be able to be fallen back to in driver pod (3 milliseconds)
[info] - SPARK-36075: Check driver pod respects nodeSelector/driverNodeSelector (2 milliseconds)
[info] EnvSecretsFeatureStepSuite:
[info] - sets up all keyRefs (3 milliseconds)
[info] ExecutorPodsPollingSnapshotSourceSuite:
[info] - Items returned by the API should be pushed to the event queue (17 milliseconds)
[info] - SPARK-36334: Support pod listing with resource version (7 milliseconds)
[info] VolcanoFeatureStepSuite:
[info] - SPARK-36061: Driver Pod with Volcano PodGroup (329 milliseconds)
[info] - SPARK-36061: Executor Pod with Volcano PodGroup (2 milliseconds)
[info] - SPARK-38423: Support priorityClassName (31 milliseconds)
[info] - SPARK-38455: Support driver podgroup template (77 milliseconds)
[info] - SPARK-38455: Support executor podgroup template (8 milliseconds)
[info] ExecutorPodsSnapshotSuite:
[info] - States are interpreted correctly from pod metadata. (14 milliseconds)
[info] - SPARK-30821: States are interpreted correctly from pod metadata when configured to check all containers. (3 milliseconds)
[info] - Updates add new pods for non-matching ids and edit existing pods for matching ids (1 millisecond)
[info] ExecutorKubernetesCredentialsFeatureStepSuite:
[info] - configure spark pod with executor service account (2 milliseconds)
[info] - configure spark pod with with driver service account and without executor service account (0 milliseconds)
[info] - configure spark pod with with driver service account and with executor service account (1 millisecond)
[info] DriverKubernetesCredentialsFeatureStepSuite:
[info] - Don't set any credentials (4 milliseconds)
[info] - Only set credentials that are manually mounted. (1 millisecond)
[info] - Mount credentials from the submission client as a secret. (31 milliseconds)
[info] PodTemplateConfigMapStepSuite:
[info] - Do nothing when executor template is not specified (1 millisecond)
[info] - Mounts executor template volume if config specified (45 milliseconds)
[info] KubernetesExecutorBuilderSuite:
[info] - use empty initial pod if template is not specified (45 milliseconds)
[info] - SPARK-36059: set custom scheduler (70 milliseconds)
[info] - load pod template if specified (18 milliseconds)
[info] - configure a custom test step (17 milliseconds)
[info] - SPARK-37145: configure a custom test step with base config (14 milliseconds)
[info] - SPARK-37145: configure a custom test step with driver or executor config (20 milliseconds)
[info] - SPARK-37145: configure a custom test step with wrong type config (12 milliseconds)
[info] - SPARK-37145: configure a custom test step with wrong name (12 milliseconds)
[info] - complain about misconfigured pod template (11 milliseconds)
[info] KubernetesConfSuite:
[info] - Resolve driver labels, annotations, secret mount paths, envs, and memory overhead (2 milliseconds)
[info] - Basic executor translated fields. (0 milliseconds)
[info] - resource profile not default. (0 milliseconds)
[info] - Image pull secrets. (0 milliseconds)
[info] - Set executor labels, annotations, and secrets (1 millisecond)
[info] - Verify that executorEnv key conforms to the regular specification (1 millisecond)
[info] - SPARK-36075: Set nodeSelector, driverNodeSelector, executorNodeSelect (1 millisecond)
[info] - SPARK-36059: Set driver.scheduler and executor.scheduler (1 millisecond)
[info] - SPARK-37735: access appId in KubernetesConf (1 millisecond)
[info] - SPARK-36566: get app name label (1 millisecond)
[info] BasicExecutorFeatureStepSuite:
[info] - test spark resource missing vendor (7 milliseconds)
[info] - test spark resource missing amount (1 millisecond)
[info] - basic executor pod with resources (8 milliseconds)
[info] - basic executor pod has reasonable defaults (8 milliseconds)
[info] - executor pod hostnames get truncated to 63 characters (8 milliseconds)
[info] - SPARK-35460: invalid PodNamePrefixes (1 millisecond)
[info] - hostname truncation generates valid host names (18 milliseconds)
[info] - classpath and extra java options get translated into environment variables (7 milliseconds)
[info] - SPARK-32655 Support appId/execId placeholder in SPARK_EXECUTOR_DIRS (6 milliseconds)
[info] - test executor pyspark memory (6 milliseconds)
[info] - auth secret propagation (8 milliseconds)
[info] - Auth secret shouldn't propagate if files are loaded. (9 milliseconds)
[info] - SPARK-32661 test executor offheap memory (7 milliseconds)
[info] - basic resourceprofile (7 milliseconds)
[info] - resourceprofile with gpus (7 milliseconds)
[info] - Verify spark conf dir is mounted as configmap volume on executor pod's container. (7 milliseconds)
[info] - SPARK-34316 Disable configmap volume on executor pod's container (6 milliseconds)
[info] - SPARK-35482: user correct block manager port for executor pods (8 milliseconds)
[info] - SPARK-35969: Make the pod prefix more readable and tallied with K8S DNS Label Names (11 milliseconds)
[info] - SPARK-36075: Check executor pod respects nodeSelector/executorNodeSelector (6 milliseconds)
[info] KubernetesVolumeUtilsSuite:
[info] - Parses hostPath volumes correctly (1 millisecond)
[info] - Parses subPath correctly (0 milliseconds)
[info] - Parses persistentVolumeClaim volumes correctly (1 millisecond)
[info] - Parses emptyDir volumes correctly (1 millisecond)
[info] - Parses emptyDir volume options can be optional (0 milliseconds)
[info] - Defaults optional readOnly to false (0 milliseconds)
[info] - Fails on missing mount key (0 milliseconds)
[info] - Fails on missing option key (1 millisecond)
[info] - SPARK-33063: Fails on missing option key in persistentVolumeClaim (0 milliseconds)
[info] - Parses read-only nfs volumes correctly (1 millisecond)
[info] - Parses read/write nfs volumes correctly (0 milliseconds)
[info] - Fails on missing path option (0 milliseconds)
[info] - Fails on missing server option (1 millisecond)
[info] ExecutorRollPluginSuite:
[info] - Empty executor list (9 milliseconds)
[info] - Driver summary should be ignored (3 milliseconds)
[info] - A one-item executor list (4 milliseconds)
[info] - SPARK-37806: All policy should ignore executor if totalTasks < minTasks (1 millisecond)
[info] - Policy: ID (1 millisecond)
[info] - Policy: ADD_TIME (1 millisecond)
[info] - Policy: TOTAL_GC_TIME (0 milliseconds)
[info] - Policy: TOTAL_DURATION (0 milliseconds)
[info] - Policy: FAILED_TASKS (0 milliseconds)
[info] - Policy: AVERAGE_DURATION (1 millisecond)
[info] - Policy: OUTLIER - Work like TOTAL_DURATION if there is no outlier (0 milliseconds)
[info] - Policy: OUTLIER - Detect an average task duration outlier (0 milliseconds)
[info] - Policy: OUTLIER - Detect a total task duration outlier (1 millisecond)
[info] - Policy: OUTLIER - Detect a total GC time outlier (1 millisecond)
[info] - Policy: OUTLIER_NO_FALLBACK - Return None if there are no outliers (0 milliseconds)
[info] - Policy: OUTLIER_NO_FALLBACK - Detect an average task duration outlier (1 millisecond)
[info] - Policy: OUTLIER_NO_FALLBACK - Detect a total task duration outlier (0 milliseconds)
[info] - Policy: OUTLIER_NO_FALLBACK - Detect a total GC time outlier (1 millisecond)
[info] KubernetesClusterSchedulerBackendSuite:
[info] - Start all components (4 milliseconds)
[info] - Stop all components (69 milliseconds)
[info] - Remove executor (26 milliseconds)
[info] - Kill executors (50 milliseconds)
[info] - SPARK-34407: CoarseGrainedSchedulerBackend.stop may throw SparkException (5 milliseconds)
[info] - SPARK-34469: Ignore RegisterExecutor when SparkContext is stopped (1 millisecond)
[info] - Dynamically fetch an executor ID (1 millisecond)
[info] KubernetesDriverBuilderSuite:
[info] - use empty initial pod if template is not specified (32 milliseconds)
[info] - SPARK-36059: set custom scheduler (34 milliseconds)
[info] - load pod template if specified (18 milliseconds)
[info] - configure a custom test step (19 milliseconds)
[info] - SPARK-37145: configure a custom test step with base config (19 milliseconds)
[info] - SPARK-37145: configure a custom test step with driver or executor config (18 milliseconds)
[info] - SPARK-37145: configure a custom test step with wrong type config (6 milliseconds)
[info] - SPARK-37145: configure a custom test step with wrong name (6 milliseconds)
[info] - complain about misconfigured pod template (6 milliseconds)
[info] - SPARK-37331: check driver pre kubernetes resource, empty by default (13 milliseconds)
[info] - SPARK-37331: check driver pre kubernetes resource as expected (12 milliseconds)
[info] LocalDirsFeatureStepSuite:
[info] - Resolve to default local dir if neither env nor configuration are set (0 milliseconds)
[info] - Use configured local dirs split on comma if provided. (1 millisecond)
[info] - Use tmpfs to back default local dir (1 millisecond)
[info] - local dir on mounted volume (1 millisecond)
[info] ExecutorPodsWatchSnapshotSourceSuite:
[info] - Watch events should be pushed to the snapshots store as snapshot updates. (1 millisecond)
[info] ExecutorPodsAllocatorSuite:
[info] - SPARK-36052: test splitSlots (1 millisecond)
[info] - SPARK-36052: pending pod limit with multiple resource profiles (20 milliseconds)
[info] - Initially request executors in batches. Do not request another batch if the first has not finished. (3 milliseconds)
[info] - Request executors in batches. Allow another batch to be requested if all pending executors start running. (4 milliseconds)
[info] - When a current batch reaches error states immediately, re-request them on the next batch. (3 milliseconds)
[info] - Verify stopping deletes the labeled pods (0 milliseconds)
[info] - When an executor is requested but the API does not report it in a reasonable time, retry requesting that executor. (4 milliseconds)
[info] - SPARK-28487: scale up and down on target executor count changes (4 milliseconds)
[info] - SPARK-34334: correctly identify timed out pending pod requests as excess (2 milliseconds)
[info] - SPARK-33099: Respect executor idle timeout configuration (2 milliseconds)
[info] - SPARK-34361: scheduler backend known pods with multiple resource profiles at downscaling (7 milliseconds)
[info] - SPARK-33288: multiple resource profiles (5 milliseconds)
[info] - SPARK-33262: pod allocator does not stall with pending pods (3 milliseconds)
[info] - SPARK-35416: Support PersistentVolumeClaim Reuse (10 milliseconds)
[info] - print the pod name instead of Some(name) if pod is absent (1 millisecond)
[info] ExecutorPodsSnapshotsStoreSuite:
[info] - Subscribers get notified of events periodically. (2 milliseconds)
[info] - Even without sending events, initially receive an empty buffer. (1 millisecond)
[info] - Replacing the snapshot passes the new snapshot to subscribers. (0 milliseconds)
[info] ExecutorPodsLifecycleManagerSuite:
[info] - When an executor reaches error states immediately, remove from the scheduler backend. (14 milliseconds)
[info] - Don't remove executors twice from Spark but remove from K8s repeatedly. (1 millisecond)
[info] - When the scheduler backend lists executor ids that aren't present in the cluster, remove those executors from Spark. (2 milliseconds)
[info] - Keep executor pods in k8s if configured. (2 milliseconds)
[info] StatefulSetAllocatorSuite:
[info] - Validate initial statefulSet creation & cleanup with two resource profiles (12 milliseconds)
[info] - Validate statefulSet scale up (1 millisecond)
[info] HadoopConfDriverFeatureStepSuite:
[info] - mount hadoop config map if defined (1 millisecond)
[info] - create hadoop config map if config dir is defined (2 milliseconds)
[info] KubernetesClusterManagerSuite:
[info] - constructing a AbstractPodsAllocator works (2 milliseconds)
[info] KubernetesClientUtilsSuite:
[info] - verify load files, loads only allowed files and not the disallowed files. (11 milliseconds)
[info] - verify load files, truncates the content to maxSize, when keys are very large in number. (1 second, 283 milliseconds)
[info] - verify load files, truncates the content to maxSize, when keys are equal in length. (2 milliseconds)
[info] - verify that configmap built as expected (1 millisecond)
[info] MountVolumesFeatureStepSuite:
[info] - Mounts hostPath volumes (0 milliseconds)
[info] - Mounts persistentVolumeClaims (1 millisecond)
[info] - SPARK-32713 Mounts parameterized persistentVolumeClaims in executors (1 millisecond)
[info] - Create and mounts persistentVolumeClaims in driver (1 millisecond)
[info] - Create and mount persistentVolumeClaims in executors (0 milliseconds)
[info] - Mounts emptyDir (2 milliseconds)
[info] - Mounts emptyDir with no options (0 milliseconds)
[info] - Mounts read/write nfs volumes (2 milliseconds)
[info] - Mounts read-only nfs volumes (0 milliseconds)
[info] - Mounts multiple volumes (1 millisecond)
[info] - mountPath should be unique (1 millisecond)
[info] - Mounts subpath on emptyDir (0 milliseconds)
[info] - Mounts subpath on persistentVolumeClaims (1 millisecond)
[info] - Mounts multiple subpaths (1 millisecond)
[info] ClientSuite:
[info] - The client should configure the pod using the builder. (4 milliseconds)
[info] - The client should create Kubernetes resources (1 millisecond)
[info] - SPARK-37331: The client should create Kubernetes resources with pre resources (2 milliseconds)
[info] - All files from SPARK_CONF_DIR, except templates, spark config, binary files and are within size limit, should be populated to pod's configMap. (6 milliseconds)
[info] - Waiting for app completion should stall on the watcher (0 milliseconds)
[info] K8sSubmitOpSuite:
[info] - List app status (3 milliseconds)
[info] - List status for multiple apps with glob (1 millisecond)
[info] - Kill app (0 milliseconds)
[info] - Kill app with gracePeriod (1 millisecond)
[info] - Kill multiple apps with glob without gracePeriod (0 milliseconds)
[info] KubernetesLocalDiskShuffleDataIOSuite:
[info] - recompute is not blocked by the recovery (5 seconds, 406 milliseconds)
[info] - Partial recompute shuffle data (6 seconds, 91 milliseconds)
[info] - A new rdd and full recovery of old data (6 seconds, 98 milliseconds)
[info] - Multi stages (4 seconds, 779 milliseconds)
[info] KerberosConfDriverFeatureStepSuite:
[info] - mount krb5 config map if defined (13 milliseconds)
[info] - create krb5.conf config map if local config provided (11 milliseconds)
[info] - create keytab secret if client keytab file used (7 milliseconds)
[info] - do nothing if container-local keytab used (5 milliseconds)
[info] - mount delegation tokens if provided (6 milliseconds)
[info] - create delegation tokens if needed (17 milliseconds)
[info] - do nothing if no config and no tokens (10 milliseconds)
[info] MountSecretsFeatureStepSuite:
[info] - mounts all given secrets (2 milliseconds)
[info] DriverServiceFeatureStepSuite:
[info] - Headless service has a port for the driver RPC, the block manager and driver ui. (2 milliseconds)
[info] - Hostname and ports are set according to the service name. (0 milliseconds)
[info] - Ports should resolve to defaults in SparkConf and in the service. (0 milliseconds)
[info] - Long prefixes should switch to using a generated unique name. (3 milliseconds)
[info] - Disallow bind address and driver host to be set explicitly. (1 millisecond)
[info] DriverCommandFeatureStepSuite:
[info] - java resource (0 milliseconds)
[info] - python resource (1 millisecond)
[info] - python executable precedence (1 millisecond)
[info] - R resource (0 milliseconds)
[info] - SPARK-25355: java resource args with proxy-user (0 milliseconds)
[info] - SPARK-25355: python resource args with proxy-user (0 milliseconds)
[info] - SPARK-25355: R resource args with proxy-user (0 milliseconds)
[info] KubernetesUtilsSuite:
[info] - Selects the given container as spark container. (1 millisecond)
[info] - Selects the first container if no container name is given. (0 milliseconds)
[info] - Falls back to the first container if given container name does not exist. (0 milliseconds)
[info] - constructs spark pod correctly with pod template with no containers (0 milliseconds)
[info] - SPARK-38201: check uploadFileToHadoopCompatibleFS with different delSrc and overwrite (76 milliseconds)
[info] Run completed in 28 seconds, 304 milliseconds.
[info] Total number of tests run: 205
[info] Suites: completed 33, aborted 0
[info] Tests: succeeded 205, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 37 s, completed Mar 9, 2022 9:18:17 AM
  • IT passed.
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (10 seconds, 874 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 705 milliseconds)
[info] - Run SparkPi with a very long application name. (9 seconds, 724 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 648 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (9 seconds, 689 milliseconds)
[info] - Run SparkPi with an argument. (9 seconds, 632 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 749 milliseconds)
[info] - All pods have the same service account by default (9 seconds, 646 milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 502 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (9 seconds, 746 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (15 seconds, 131 milliseconds)
[info] - Run SparkPi with env and mount secrets. (18 seconds, 590 milliseconds)
[info] - Run PySpark on simple pi.py example (10 seconds, 808 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 814 milliseconds)
[info] - Run PySpark with memory customization (9 seconds, 655 milliseconds)
[info] - Run in client mode. (7 seconds, 306 milliseconds)
[info] - Start pod creation from template (9 seconds, 788 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (9 seconds, 745 milliseconds)
[info] - Test basic decommissioning (42 seconds, 121 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (42 seconds, 274 milliseconds)
[info] *** Test still running after 2 minutes, 13 seconds: suite name: KubernetesSuite, test name: Test decommissioning with dynamic allocation & shuffle cleanups.
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 41 seconds)
[info] - Test decommissioning timeouts (41 seconds, 775 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 6 seconds)
[info] - Run SparkR on simple dataframe.R example (12 seconds, 699 milliseconds)
[info] VolcanoSuite:
[info] - Run SparkPi with no resources (10 seconds, 600 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (10 seconds, 690 milliseconds)
[info] - Run SparkPi with a very long application name. (10 seconds, 663 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (10 seconds, 665 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (10 seconds, 736 milliseconds)
[info] - Run SparkPi with an argument. (10 seconds, 705 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (10 seconds, 645 milliseconds)
[info] - All pods have the same service account by default (10 seconds, 669 milliseconds)
[info] - Run extraJVMOptions check on driver (5 seconds, 591 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (10 seconds, 664 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (16 seconds, 375 milliseconds)
[info] - Run SparkPi with env and mount secrets. (20 seconds, 707 milliseconds)
[info] - Run PySpark on simple pi.py example (11 seconds, 680 milliseconds)
[info] - Run PySpark to test a pyfiles example (12 seconds, 783 milliseconds)
[info] - Run PySpark with memory customization (10 seconds, 708 milliseconds)
[info] - Run in client mode. (7 seconds, 222 milliseconds)
[info] - Start pod creation from template (10 seconds, 765 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (10 seconds, 772 milliseconds)
[info] - Test basic decommissioning (42 seconds, 213 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (43 seconds, 377 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 42 seconds)
[info] - Test decommissioning timeouts (42 seconds, 791 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 8 seconds)
[info] - Run SparkR on simple dataframe.R example (12 seconds, 764 milliseconds)
[info] - Run SparkPi with volcano scheduler (10 seconds, 742 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (13 seconds, 462 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (21 seconds, 397 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (15 seconds, 291 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 398 milliseconds)
[info] Run completed in 28 minutes, 15 seconds.
[info] Total number of tests run: 53
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 53, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 1805 s (30:05), completed Mar 9, 2022 9:56:20 AM

@dongjoon-hyun
Member Author

Thank you, @viirya, @yaooqinn, @Yikun, @martin-g, @k82cn.
Merged to master for Apache Spark 3.3.

@dongjoon-hyun dongjoon-hyun deleted the SPARK-38480 branch March 9, 2022 18:12
LuciferYang pushed a commit to LuciferYang/spark that referenced this pull request Mar 10, 2022
…park.kubernetes.driver.podGroupTemplateFile`

### What changes were proposed in this pull request?

This PR aims to remove `spark.kubernetes.job.queue` in favor of `spark.kubernetes.driver.podGroupTemplateFile` for Apache Spark 3.3.

### Why are the changes needed?

There are several batch execution scheduler options including custom schedulers in K8s environment.
We had better isolate scheduler specific settings instead of introducing a new configuration.

### Does this PR introduce _any_ user-facing change?

No, the previous configuration is not released yet.

### How was this patch tested?

Pass the CIs and K8s IT.

```
[info] KubernetesSuite:
[info] - Run SparkPi with no resources (8 seconds, 548 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (8 seconds, 419 milliseconds)
[info] - Run SparkPi with a very long application name. (8 seconds, 360 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (8 seconds, 386 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (8 seconds, 589 milliseconds)
[info] - Run SparkPi with an argument. (8 seconds, 361 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (8 seconds, 363 milliseconds)
[info] - All pods have the same service account by default (8 seconds, 332 milliseconds)
[info] - Run extraJVMOptions check on driver (4 seconds, 331 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 392 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (13 seconds, 915 milliseconds)
[info] - Run SparkPi with env and mount secrets. (18 seconds, 172 milliseconds)
[info] - Run PySpark on simple pi.py example (9 seconds, 368 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 489 milliseconds)
[info] - Run PySpark with memory customization (9 seconds, 378 milliseconds)
[info] - Run in client mode. (6 seconds, 296 milliseconds)
[info] - Start pod creation from template (8 seconds, 465 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (9 seconds, 460 milliseconds)
[info] - Test basic decommissioning (40 seconds, 795 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (41 seconds, 16 milliseconds)
[info] *** Test still running after 2 minutes, 19 seconds: suite name: KubernetesSuite, test name: Test decommissioning with dynamic allocation & shuffle cleanups.
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 40 seconds)
[info] - Test decommissioning timeouts (40 seconds, 446 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 5 seconds)
[info] - Run SparkR on simple dataframe.R example (12 seconds, 562 milliseconds)
[info] VolcanoSuite:
[info] - Run SparkPi with no resources (10 seconds, 339 milliseconds)
[info] - Run SparkPi with no resources & statefulset allocation (9 seconds, 346 milliseconds)
[info] - Run SparkPi with a very long application name. (9 seconds, 306 milliseconds)
[info] - Use SparkLauncher.NO_RESOURCE (9 seconds, 361 milliseconds)
[info] - Run SparkPi with a master URL without a scheme. (9 seconds, 344 milliseconds)
[info] - Run SparkPi with an argument. (9 seconds, 421 milliseconds)
[info] - Run SparkPi with custom labels, annotations, and environment variables. (9 seconds, 365 milliseconds)
[info] - All pods have the same service account by default (9 seconds, 337 milliseconds)
[info] - Run extraJVMOptions check on driver (5 seconds, 348 milliseconds)
[info] - Run SparkRemoteFileTest using a remote data file (8 seconds, 310 milliseconds)
[info] - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j2.properties (15 seconds, 13 milliseconds)
[info] - Run SparkPi with env and mount secrets. (18 seconds, 466 milliseconds)
[info] - Run PySpark on simple pi.py example (10 seconds, 558 milliseconds)
[info] - Run PySpark to test a pyfiles example (11 seconds, 445 milliseconds)
[info] - Run PySpark with memory customization (10 seconds, 395 milliseconds)
[info] - Run in client mode. (6 seconds, 239 milliseconds)
[info] - Start pod creation from template (10 seconds, 415 milliseconds)
[info] - SPARK-38398: Schedule pod creation from template (9 seconds, 440 milliseconds)
[info] - Test basic decommissioning (42 seconds, 799 milliseconds)
[info] - Test basic decommissioning with shuffle cleanup (42 seconds, 836 milliseconds)
[info] - Test decommissioning with dynamic allocation & shuffle cleanups (2 minutes, 41 seconds)
[info] - Test decommissioning timeouts (42 seconds, 375 milliseconds)
[info] - SPARK-37576: Rolling decommissioning (1 minute, 7 seconds)
[info] - Run SparkR on simple dataframe.R example (12 seconds, 441 milliseconds)
[info] - Run SparkPi with volcano scheduler (10 seconds, 421 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (only 1 enabled) (13 seconds, 256 milliseconds)
[info] - SPARK-38188: Run SparkPi jobs with 2 queues (all enabled) (16 seconds, 216 milliseconds)
[info] - SPARK-38423: Run SparkPi Jobs with priorityClassName (14 seconds, 264 milliseconds)
[info] - SPARK-38423: Run driver job to validate priority order (16 seconds, 325 milliseconds)
[info] Run completed in 28 minutes, 9 seconds.
[info] Total number of tests run: 53
[info] Suites: completed 2, aborted 0
[info] Tests: succeeded 53, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[success] Total time: 1785 s (29:45), completed Mar 8, 2022 11:15:23 PM
```

Closes apache#35783 from dongjoon-hyun/SPARK-38480.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>