Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-24793][K8s] Enhance spark-submit for app management #23599

Closed
wants to merge 1 commit into from

Conversation

skonto
Copy link
Contributor

@skonto skonto commented Jan 20, 2019

What changes were proposed in this pull request?

  • supports --kill & --status flags.
  • supports globs which is useful in general check this long standing issue for kubectl.

How was this patch tested?

Manually against running apps. Example output:

Submission Id reported at launch time:

2019-01-20 23:47:56 INFO  Client:58 - Waiting for application spark-pi with submissionId spark:spark-pi-1548020873671-driver to finish...

Killing the app:

./bin/spark-submit --kill spark:spark-pi-1548020873671-driver --master  k8s://https://192.168.2.8:8443
2019-01-20 23:48:07 WARN  Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0)
2019-01-20 23:48:07 WARN  Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address

App terminates with 143 (SIGTERM, since we have tiny this should lead to graceful shutdown):

2019-01-20 23:48:08 INFO  LoggingPodStatusWatcherImpl:58 - State changed, new state: 
	 pod name: spark-pi-1548020873671-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver
	 pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T21:47:55Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T21:47:55Z
	 phase: Running
	 container status: 
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: running
		 container started at: 2019-01-20T21:48:00Z
2019-01-20 23:48:09 INFO  LoggingPodStatusWatcherImpl:58 - State changed, new state: 
	 pod name: spark-pi-1548020873671-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver
	 pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T21:47:55Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T21:47:55Z
	 phase: Failed
	 container status: 
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: terminated
		 container started at: 2019-01-20T21:48:00Z
		 container finished at: 2019-01-20T21:48:08Z
		 exit code: 143
		 termination reason: Error
2019-01-20 23:48:09 INFO  LoggingPodStatusWatcherImpl:58 - Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: skonto/spark:k8s-3.0.0
	 container state: terminated
	 container started at: 2019-01-20T21:48:00Z
	 container finished at: 2019-01-20T21:48:08Z
	 exit code: 143
	 termination reason: Error
2019-01-20 23:48:09 INFO  Client:58 - Application spark-pi with submissionId spark:spark-pi-1548020873671-driver finished.
2019-01-20 23:48:09 INFO  ShutdownHookManager:58 - Shutdown hook called
2019-01-20 23:48:09 INFO  ShutdownHookManager:58 - Deleting directory /tmp/spark-f114b2e0-5605-4083-9203-a4b1c1f6059e

Glob scenario:

./bin/spark-submit --status spark:spark-pi* --master  k8s://https://192.168.2.8:8443
2019-01-20 22:27:44 WARN  Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0)
2019-01-20 22:27:44 WARN  Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address
Application status (driver): 
	 pod name: spark-pi-1547948600328-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-f13f01702f0b4503975ce98252d59b94, spark-role -> driver
	 pod uid: c576e1c6-1c54-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T01:43:22Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T01:43:22Z
	 phase: Running
	 container status: 
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: running
		 container started at: 2019-01-20T01:43:27Z
Application status (driver): 
	 pod name: spark-pi-1547948792539-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-006d252db9b24f25b5069df357c30264, spark-role -> driver
	 pod uid: 38375b4b-1c55-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T01:46:35Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T01:46:35Z
	 phase: Succeeded
	 container status: 
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: terminated
		 container started at: 2019-01-20T01:46:39Z
		 container finished at: 2019-01-20T01:46:56Z
		 exit code: 0
		 termination reason: Completed

@skonto skonto changed the title [Spark][K8s] Enhance spark-submit for app management [SPARK-24793][K8s] Enhance spark-submit for app management Jan 20, 2019
@SparkQA
Copy link

SparkQA commented Jan 20, 2019

Test build #101454 has finished for PR 23599 at commit 42cc505.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

Test build #101455 has finished for PR 23599 at commit 335abc1.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7252/

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7253/

@SparkQA
Copy link

SparkQA commented Jan 20, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7254/

Copy link
Contributor

@HeartSaVioR HeartSaVioR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know about k8s honestly but there're other than k8s which I could comment.

.killSubmission(args.submissionToKill)
if (args.master.startsWith("k8s://")) {
try {
val ops = Utils.classForName("org.apache.spark.deploy.k8s.submit.K8sSubmitOps$")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, is there any reason to leverage reflection instead of the approach standalone and mesos scheduler take? I guess Spark put the good effort to decouple resource manager(s) to core.

Copy link
Contributor Author

@skonto skonto Jan 22, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are two approaches for cluster mode, one that uses a rest server/client (Spark has such an implementation for Standalone and Mesos implements that as part of the Dispatcher code) and the other is the one that uses whatever the env provides. For k8s there is no rest api to use directly. We communicate to the k8s api server via the fabric8io client lib later on. Reflection is needed because we need to access classes which are not by default available in every Spark build. For example if a build does not target k8s then these classes will not be available. Also Spark-submit code is independent of the resource manager dependencies so we should not reference in there any types that come from the K8s backend.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why the code needs to be different, but I agree with the sentiment that adding this kind of code to SparkSubmit is what we should be trying to avoid at this point.

This is the kind of thing that belongs behind an interface that each RM backend should then implement, like ExternalClusterManager does for the scheduler. So if you have an interface e.g. SparkSubmitBackend which k8s then implements, this code can all be encapsulated in the k8s module, can be more easily tested, and moves SparkSubmit in the right direction, where all RM-specific functionality would eventually be behind that interface (instead of the current mess).

Copy link
Contributor Author

@skonto skonto Feb 1, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanzin
If you have an interface then you need to initialize an instance out of it. That means you need to use some form of class loading. Spark context is initializing the right cluster manager by using:

  private def getClusterManager(url: String): Option[ExternalClusterManager] = {
    val loader = Utils.getContextOrSparkClassLoader
    val serviceLoaders =
      ServiceLoader.load(classOf[ExternalClusterManager], loader).asScala.filter(_.canCreate(url))
....

and picks up the META-INF/Services aggregated at build time.
Do you mean that I should use something like that? If so I can refactor it this way and also hide the additional mess behind the interface (there is mess in there already). Are you ok with that plan?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean that I should use something like that?

Yes. That would make submission pluggable like the scheduler is, using a similar mechanism. Eventually I hope we can hide all of the details that SparkSubmit handles behind that interface...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok @vanzin could you also have a look at #23546. Although a bit different case, also affects Spark submit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think parts of that change actually pose a similar issue. Basically, the part that deals with how to stage dependencies for cluster-mode submissions, since each backend may do something slightly different (like YARN does its own thing, even if the "copy files to HDFS" part is somewhat similar).

But in this particular case, this big chunk of reflection code is begging to just be turned into a plugin. You don't need fully hash out that plugin interface, but at least add the basics of what you're trying to achieve in this PR to it. We can look at adding more things later as needed.

Copy link
Contributor Author

@skonto skonto Feb 5, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok for this PR makes 100% sense, for the deps stuff it can be kept as is since it is k8s specific and I need to parse some maven coordinates and skip others to make things work.
Yes I dont need to make an abstraction for everything. I will fix it.

.requestSubmissionStatus(args.submissionToRequestStatusFor)
if (args.master.startsWith("k8s://")) {
try {
val ops = Utils.classForName("org.apache.spark.deploy.k8s.submit.K8sSubmitOps$")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

@@ -304,7 +304,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
}

private def validateKillArguments(): Unit = {
if (!master.startsWith("spark://") && !master.startsWith("mesos://")) {
if (!master.startsWith("spark://") && !master.startsWith("mesos://")
&& !master.startsWith("k8s://")) {
error("Killing submissions is only supported in standalone or Mesos mode!")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add k8s

@@ -313,7 +314,8 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
}

private def validateStatusRequestArguments(): Unit = {
if (!master.startsWith("spark://") && !master.startsWith("mesos://")) {
if (!master.startsWith("spark://") && !master.startsWith("mesos://") &&
!master.startsWith("k8s://")) {
error(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add k8s

@@ -25,8 +25,12 @@ import org.apache.spark.SparkException
* Contains basic command line parsing functionality and methods to parse some common Spark CLI
* options.
*/
private[spark] trait CommandLineUtils {
private[spark] trait CommandLineUtils extends CommandLineLoggingUtils{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: space before {

Like Mesos and Standalone, Kubernetes provides simple application management via th spark-submit cli tool in cluster mode.
Users can kill a job by providing the submission id that is printed when submitting their job.
The submission id follows the following format namespace:driver-pod-name and is similar to the driverId found in Standalone mode.
If user omits the namespace then the `deault` namespace is used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: default

@@ -393,6 +393,27 @@ RBAC authorization and how to configure Kubernetes service accounts for pods, pl
[Using RBAC Authorization](https://kubernetes.io/docs/admin/authorization/rbac/) and
[Configure Service Accounts for Pods](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).

## Spark Application Management

Like Mesos and Standalone, Kubernetes provides simple application management via th spark-submit cli tool in cluster mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the, or remove th?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cli -> CLI


Like Mesos and Standalone, Kubernetes provides simple application management via th spark-submit cli tool in cluster mode.
Users can kill a job by providing the submission id that is printed when submitting their job.
The submission id follows the following format namespace:driver-pod-name and is similar to the driverId found in Standalone mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: might be better to wrap format with `` to highlight.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id -> ID
follows the following format -> follows the format
Standalone -> standalone

@SparkQA
Copy link

SparkQA commented Jan 21, 2019

Test build #101456 has finished for PR 23599 at commit ed35a4f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@felixcheung
Copy link
Member

@@ -393,6 +393,27 @@ RBAC authorization and how to configure Kubernetes service accounts for pods, pl
[Using RBAC Authorization](https://kubernetes.io/docs/admin/authorization/rbac/) and
[Configure Service Accounts for Pods](https://kubernetes.io/docs/tasks/configure-pod-container/configure-service-account/).

## Spark Application Management

Like Mesos and Standalone, Kubernetes provides simple application management via th spark-submit cli tool in cluster mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cli -> CLI


Like Mesos and Standalone, Kubernetes provides simple application management via th spark-submit cli tool in cluster mode.
Users can kill a job by providing the submission id that is printed when submitting their job.
The submission id follows the following format namespace:driver-pod-name and is similar to the driverId found in Standalone mode.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id -> ID
follows the following format -> follows the format
Standalone -> standalone

```bash
$ spark-submit --status spark:spark-pi-1547948636094-driver --master k8s://https://192.168.2.8:8443
```
Both operations support glob names for example user can run:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"support glob patterns. For example:"

(implicit client: KubernetesClient): Unit = {
if (pods.nonEmpty) {
for (pod <- pods) {
printMessage("Application status (driver): " +
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indent this block

.filter { pod =>
val meta = pod.getMetadata
meta.getName.startsWith(pName.stripSuffix("*")) &&
meta
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs to unindent one step

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it 4 spaces for logical expressions (these are 2)?

@LucaCanali
Copy link
Contributor

Minor comment: I would also update the help message for spark-submit, at printUsageAndExit in SparkSubmitArguments.scala


Like Mesos and Standalone, Kubernetes provides simple application management via th spark-submit cli tool in cluster mode.
Users can kill a job by providing the submission id that is printed when submitting their job.
The submission id follows the following format namespace:driver-pod-name and is similar to the driverId found in Standalone mode.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: put namespace:driver-pod-name in a `` block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@skonto
Copy link
Contributor Author

skonto commented Feb 5, 2019

@LucaCanali now that I checked SparkSubmitArguments I dont think I need to change the printed message as I dont modify the api. I just have a specific format for the submissionId. So the user just needs to record the submissionId and that will work.

@LucaCanali
Copy link
Contributor

@skonto What I had in mind was a minor change: just adding the info in the help message that --kill and --status flags are now also valid for Kubernetes. Currently, the help message reports that --kill, --status (and also the --supervise option) are valid for "Spark standalone or Mesos with cluster deploy mode only")

@skonto skonto force-pushed the submit_ops_extension branch 2 times, most recently from 5adf812 to 66186cb Compare February 5, 2019 17:27
@SparkQA
Copy link

SparkQA commented Feb 5, 2019

@SparkQA
Copy link

SparkQA commented Feb 5, 2019

@SparkQA
Copy link

SparkQA commented Feb 5, 2019

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7716/

@SparkQA
Copy link

SparkQA commented Feb 5, 2019

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/7717/

@skonto
Copy link
Contributor Author

skonto commented Feb 5, 2019

@shaneknapp something is wrong with the k8s tests. The one that tries to access a remote file, fails consistently.

@shaneknapp
Copy link
Contributor

ill take a look now.

@shaneknapp
Copy link
Contributor

test this please

@skonto
Copy link
Contributor Author

skonto commented Mar 19, 2019

@vanzin updated the PR, gentle ping. A changed the name for the grace period property to spark.kubernetes.appKillPodDeletionGracePeriod if it is too much I can fall back to killGracePeriod.

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

@skonto skonto force-pushed the submit_ops_extension branch 4 times, most recently from ba3e1a9 to fdb00ff Compare March 19, 2019 02:09
@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/9048/

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/9049/

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/9051/

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Test build #103651 has finished for PR 23599 at commit fdb00ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Test build #103647 has finished for PR 23599 at commit 92b075c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Test build #103648 has finished for PR 23599 at commit 9eb1052.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@SparkQA
Copy link

SparkQA commented Mar 19, 2019

Test build #103650 has finished for PR 23599 at commit ba3e1a9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@skonto
Copy link
Contributor Author

skonto commented Mar 19, 2019

@vanzin I think it is ready to be merged.

Copy link
Contributor

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly style suggestions.

* Request the status of an existing submission using the REST protocol.
* Standalone and Mesos cluster mode only.
* Request the status of an existing submission.
* Standalone, Kubernetes and Mesos cluster mode only.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line.

.killSubmission(args.submissionToKill)
} else {
val sparkConf = new SparkConf()
args.sparkProperties.foreach { case (k, v) => sparkConf.set(k, v) }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd turn this into a new method args.toSparkConf(), since it's not copy & pasted in multiple places.

case 1 => serviceLoaders.headOption.get
case _ =>
throw new IllegalArgumentException(s"No external SparkSubmitOperations " +
s"clients found for master url: '$master'")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: indent more

}

val gracePeriodSeconds = getGracePeriod(sparkConf)
gracePeriodSeconds match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getGracePeriod(sparkConf) match { and save one line.

}.withName(pName)

if (Option(podToDelete).isDefined) {
gracePeriodSeconds match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here re: avoiding the gracePeriodSeconds variable.

(implicit client: KubernetesClient): Unit = {
val gracePeriodSeconds = getGracePeriod(sparkConf)
val podToDelete = {
namespace match {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They must have some common parent type since otherwise this wouldn't compile. A quick test tells me that type is MixedOperation.

Or if the helper method is non-public you can cheat and not add a return type:

scala> def foo = { 1 }
foo: Int

(Spark's style checker will only complain about that if the method is public, IIRC.)

}

after {
System.setErr(oldErr)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remind me again why listStatus.printStream = err does not work and you have to mess with System.err instead?

Copy link
Contributor Author

@skonto skonto Mar 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vanzin one is MixedOperation<T, L, D, R extends Resource<T, D>> extends Operation<T, L, D, R>, NonNamespaceOperation<T, L, D, R> and the other is NonNamespaceOperation, so NonNamespaceOperation is the common one. I noticed when refactored it, so I will make this a private function.

Copy link
Contributor Author

@skonto skonto Mar 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

listStatus.printStream = err does not work because It is defined in:

trait CommandLineLoggingUtils {
  // Exposed for testing
  private[spark] var exitFn: Int => Unit = (exitCode: Int) => System.exit(exitCode)

  private[spark] var printStream: PrintStream = System.err
...

I dont want to mock the ops (which is the subject of testing) which inherit from K8sSubmitOp which inherit from CommandLineLoggingUtils I would like to test the real op. So I set a mocked stream from which I grab the println calls. CommandLineLoggingUtils is used by SparkSubmit so didnt want to touch it (could use a def for the val). The val there is already set when the trait is mixed in. I dont think it is a mess, it is just local in the tests.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't follow you.

ListStatus inherits from CommandLineLoggingUtils through CommandLineLoggingUtils. So it has a var named printStream. Why exactly can't you set it to whatever you want?

Can you post the exact compiler error you see when you do that?

Copy link
Contributor Author

@skonto skonto Mar 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah ok I misread, it is a var (thought it was val).

@skonto
Copy link
Contributor Author

skonto commented Mar 24, 2019

@vanzin gentle ping for merging.

@SparkQA
Copy link

SparkQA commented Mar 24, 2019

@SparkQA
Copy link

SparkQA commented Mar 24, 2019

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/9242/

@SparkQA
Copy link

SparkQA commented Mar 25, 2019

Test build #103877 has finished for PR 23599 at commit dc50774.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@SparkQA
Copy link

SparkQA commented Mar 25, 2019

Test build #103878 has finished for PR 23599 at commit 119f5a3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • trait CommandLineLoggingUtils

@@ -320,7 +336,8 @@ private[spark] class SparkSubmit extends Logging {
}
}

args.sparkProperties.foreach { case (k, v) => sparkConf.set(k, v) }
// update spark config from args
args.toSparkConf(Option(sparkConf))
Copy link
Contributor Author

@skonto skonto Mar 25, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to keep this here as args are being updated within the body of this method. We should clean it up in another PR. args should be immutable and we should only modify SparkConf.

Copy link
Contributor

@vanzin vanzin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Merging to master.

I'm not a super fan of the internal op abstraction in the k8s side, but hey, it works, and it can be changed later if needed. I'll remove the unused variable while merging.

import org.apache.spark.scheduler.cluster.k8s.ExecutorLifecycleTestUtils.TEST_SPARK_APP_ID

class K8sSubmitOpSuite extends SparkFunSuite with BeforeAndAfter {
private val oldErr = System.err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not used.

@vanzin vanzin closed this in 05168e7 Mar 26, 2019
IceMimosa added a commit to growingio/spark that referenced this pull request Apr 12, 2019
…TA/pages/864879077/on+K8S

Fix ImplicitCastInputTypes

[SPARK-25222][K8S] Improve container status logging

[SPARK-25262][K8S] Allow SPARK_LOCAL_DIRS to be tmpfs backed on K8S

[SPARK-25021][K8S] Add spark.executor.pyspark.memory limit for K8S

[SPARK-25415][SQL] Make plan change log in RuleExecutor configurable by SQLConf

In RuleExecutor, after applying a rule, if the plan has changed, the before and after plan will be logged using level "trace". At times, however, such information can be very helpful for debugging. Hence, making the log level configurable in SQLConf would allow users to turn on the plan change log independently and save the trouble of tweaking log4j settings. Meanwhile, filtering plan change log for specific rules can also be very useful.
So this PR adds two SQL configurations:
1. spark.sql.optimizer.planChangeLog.level - set a specific log level for logging plan changes after a rule is applied.
2. spark.sql.optimizer.planChangeLog.rules - enable plan change logging only for a set of specified rules, separated by commas.

Added UT.

Closes apache#22406 from maryannxue/spark-25415.

Authored-by: maryannxue <[email protected]>
Signed-off-by: gatorsmile <[email protected]>

[SPARK-25338][TEST] Ensure to call super.beforeAll() and super.afterAll() in test cases

This PR ensures to call `super.afterAll()` in `override afterAll()` method for test suites.

* Some suites did not call `super.afterAll()`
* Some suites may call `super.afterAll()` only under certain condition
* Others never call `super.afterAll()`.

This PR also ensures to call `super.beforeAll()` in `override beforeAll()` for test suites.

Existing UTs

Closes apache#22337 from kiszk/SPARK-25338.

Authored-by: Kazuaki Ishizaki <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

[SPARK-25415][SQL][FOLLOW-UP] Add Locale.ROOT when toUpperCase

Add `Locale.ROOT` when `toUpperCase`.

manual tests

Closes apache#22531 from wangyum/SPARK-25415.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

[SPARK-25514][SQL] Generating pretty JSON by to_json

The PR introduces new JSON option `pretty` which allows to turn on `DefaultPrettyPrinter` of `Jackson`'s Json generator. New option is useful in exploring of deep nested columns and in converting of JSON columns in more readable representation (look at the added test).

Added rount trip test which convert an JSON string to pretty representation via `from_json()` and `to_json()`.

Closes apache#22534 from MaxGekk/pretty-json.

Lead-authored-by: Maxim Gekk <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

[SPARK-25262][DOC][FOLLOWUP] Fix missing markup tag

This adds a missing end markup tag. This should go `master` branch only.

This is a doc-only change. Manual via `SKIP_API=1 jekyll build`.

Closes apache#22584 from dongjoon-hyun/SPARK-25262.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

[SPARK-23257][K8S] Kerberos Support for Spark on K8S

[SPARK-25682][K8S] Package example jars in same target for dev and distro images.

This way the image generated from both environments has the same layout,
with just a difference in contents that should not affect functionality.

Also added some minor error checking to the image script.

Closes apache#22681 from vanzin/SPARK-25682.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-25745][K8S] Improve docker-image-tool.sh script

Adds error checking and handling to `docker` invocations ensuring the script terminates early in the event of any errors.  This avoids subtle errors that can occur e.g. if the base image fails to build the Python/R images can end up being built from outdated base images and makes it more explicit to the user that something went wrong.

Additionally the provided `Dockerfiles` assume that Spark was first built locally or is a runnable distribution however it didn't previously enforce this.  The script will now check the JARs folder to ensure that Spark JARs actually exist and if not aborts early reminding the user they need to build locally first.

- Tested with a `mvn clean` working copy and verified that the script now terminates early
- Tested with bad `Dockerfiles` that fail to build to see that early termination occurred

Closes apache#22748 from rvesse/SPARK-25745.

Authored-by: Rob Vesse <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-25730][K8S] Delete executor pods from kubernetes after figuring out why they died

`removeExecutorFromSpark` tries to fetch the reason the executor exited from Kubernetes, which may be useful if the pod was OOMKilled. However, the code previously deleted the pod from Kubernetes first which made retrieving this status impossible. This fixes the ordering.

On a separate but related note, it would be nice to wait some time before removing the pod - to let the operator examine logs and such.

Running on my local cluster.

Author: Mike Kaplinskiy <[email protected]>

Closes apache#22720 from mikekap/patch-1.

[SPARK-25828][K8S] Bumping Kubernetes-Client version to 4.1.

[SPARK-24434][K8S] pod template files

[SPARK-25809][K8S][TEST] New K8S integration testing backends

[SPARK-25875][K8S] Merge code to set up driver command into a single step.

Right now there are 3 different classes dealing with building the driver
command to run inside the pod, one for each "binding" supported by Spark.
This has two main shortcomings:

- the code in the 3 classes is very similar; changing things in one place
  would probably mean making a similar change in the others.

- it gives the false impression that the step implementation is the only
  place where binding-specific logic is needed. That is not true; there
  was code in KubernetesConf that was binding-specific, and there's also
  code in the executor-specific config step. So the 3 classes weren't really
  working as a language-specific abstraction.

On top of that, the current code was propagating command line parameters in
a different way depending on the binding. That doesn't seem necessary, and
in fact using environment variables for command line parameters is in general
a really bad idea, since you can't handle special characters (e.g. spaces)
that way.

This change merges the 3 different code paths for Java, Python and R into
a single step, and also merges the 3 code paths to start the Spark driver
in the k8s entry point script. This increases the amount of shared code,
and also moves more feature logic into the step itself, so it doesn't live
in KubernetesConf.

Note that not all logic related to setting up the driver lives in that
step. For example, the memory overhead calculation still lives separately,
except it now happens in the driver config step instead of outside the
step hierarchy altogether.

Some of the noise in the diff is because of changes to KubernetesConf, which
will be addressed in a separate change.

Tested with new and updated unit tests + integration tests.

Author: Marcelo Vanzin <[email protected]>

Closes apache#22897 from vanzin/SPARK-25875.

[SPARK-25897][K8S] Hook up k8s integration tests to sbt build.

The integration tests can now be run in sbt if the right profile
is enabled, using the "test" task under the respective project.

This avoids having to fall back to maven to run the tests, which
invalidates all your compiled stuff when you go back to sbt, making
development way slower than it should.

There's also a task to run the tests directly without refreshing
the docker images, which is helpful if you just made a change to
the submission code which should not affect the code in the images.

The sbt tasks currently are not very customizable; there's some
very minor things you can set in the sbt shell itself, but otherwise
it's hardcoded to run on minikube.

I also had to make some slight adjustments to the IT code itself,
mostly to remove assumptions about the existing harness.

Tested on sbt and maven.

Closes apache#22909 from vanzin/SPARK-25897.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-25957][K8S] Make building alternate language binding docker images optional

bin/docker-image-tool.sh tries to build all docker images (JVM, PySpark
and SparkR) by default. But not all spark distributions are built with
SparkR and hence this script will fail on such distros.

With this change, we make building alternate language binding docker images (PySpark and SparkR) optional. User has to specify dockerfile for those language bindings using -p and -R flags accordingly, to build the binding docker images.

Tested following scenarios.
*bin/docker-image-tool.sh -r <repo> -t <tag> build* --> Builds only JVM docker image (default behavior)

*bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile build* --> Builds both JVM and PySpark docker images

*bin/docker-image-tool.sh -r <repo> -t <tag> -p kubernetes/dockerfiles/spark/bindings/python/Dockerfile -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build* --> Builds JVM, PySpark and SparkR docker images.

Author: Nagaram Prasad Addepally <[email protected]>

Closes apache#23053 from ramaddepally/SPARK-25957.

[SPARK-25960][K8S] Support subpath mounting with Kubernetes

This PR adds configurations to use subpaths with Spark on k8s. Subpaths (https://kubernetes.io/docs/concepts/storage/volumes/#using-subpath) allow the user to specify a path within a volume to use instead of the volume's root.

Added unit tests. Ran SparkPi on a cluster with event logging pointed at a subpath-mount and verified the driver host created and used the subpath.

Closes apache#23026 from NiharS/k8s_subpath.

Authored-by: Nihar Sheth <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26025][K8S] Speed up docker image build on dev repo.

[SPARK-26015][K8S] Set a default UID for Spark on K8S Images

Adds USER directives to the Dockerfiles which is configurable via build argument (`spark_uid`) for easy customisation.  A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g.

```
> bin/docker-image-tool.sh -r rvesse -t uid -u 185 build
> bin/docker-image-tool.sh -r rvesse -t uid push
```

If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps (
https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html)

Notes:
- We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException`

To Do:
- [x] Debug and resolve issue with client mode test
- [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry
- [x] Rebase once PR apache#23013 is merged and update documentation accordingly

Built the Docker images with the new Dockerfiles that include the `USER` directives.  Ran the Spark on K8S integration tests against the new images.  All pass except client mode which I am currently debugging further.

Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected.

Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID.

cc felixcheung skonto vanzin

Closes apache#23017 from rvesse/SPARK-26015.

Authored-by: Rob Vesse <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-25876][K8S] Simplify kubernetes configuration types.

[SPARK-23781][CORE] Merge token renewer functionality into HadoopDelegationTokenManager.

[SPARK-25515][K8S] Adds a config option to keep executor pods for debugging

[SPARK-26083][K8S] Add Copy pyspark into corresponding dir cmd in pyspark Dockerfile

When I try to run `./bin/pyspark` cmd in a pod in Kubernetes(image built without change from pyspark Dockerfile), I'm getting an error:
```
$SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ...
Python 2.7.15 (default, Aug 22 2018, 13:24:18)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'
```
This is because `pyspark` folder doesn't exist under `/opt/spark/python/`

Added `COPY python/pyspark ${SPARK_HOME}/python/pyspark` to pyspark Dockerfile to resolve issue above.

Google Kubernetes Engine

Closes apache#23037 from AzureQ/master.

Authored-by: Qi Shao <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26194][K8S] Auto generate auth secret for k8s apps.

This change modifies the logic in the SecurityManager to do two
things:

- generate unique app secrets also when k8s is being used
- only store the secret in the user's UGI on YARN

The latter is needed so that k8s won't unnecessarily create
k8s secrets for the UGI credentials when only the auth token
is stored there.

On the k8s side, the secret is propagated to executors using
an environment variable instead. This ensures it works in both
client and cluster mode.

Security doc was updated to mention the feature and clarify that
proper access control in k8s should be enabled for it to be secure.

Author: Marcelo Vanzin <[email protected]>

Closes apache#23174 from vanzin/SPARK-26194.

[SPARK-25877][K8S] Move all feature logic to feature classes.

[SPARK-25922][K8] Spark Driver/Executor "spark-app-selector" label mismatch

In K8S Cluster mode, the algorithm to generate spark-app-selector/spark.app.id of spark driver is different with spark executor.
This patch makes sure spark driver and executor to use the same spark-app-selector/spark.app.id if spark.app.id is set, otherwise it will use superclass applicationId.

In K8S Client mode, spark-app-selector/spark.app.id for executors will use superclass applicationId.

Manually run."

Closes apache#23322 from suxingfate/SPARK-25922.

Lead-authored-by: suxingfate <[email protected]>
Co-authored-by: xinglwang <[email protected]>
Signed-off-by: Yinan Li <[email protected]>

[SPARK-26642][K8S] Add --num-executors option to spark-submit for Spark on K8S.

[SPARK-25887][K8S] Configurable K8S context support

This enhancement allows for specifying the desired context to use for the initial K8S client auto-configuration.  This allows users to more easily access alternative K8S contexts without having to first
explicitly change their current context via kubectl.

Explicitly set my K8S context to a context pointing to a non-existent cluster, then launched Spark jobs with explicitly specified contexts via the new `spark.kubernetes.context` configuration property.

Example Output:

```
> kubectl config current-context
minikube
> minikube status
minikube: Stopped
cluster:
kubectl:
> ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.context=docker-for-desktop --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4
18/10/31 11:57:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/31 11:57:51 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using context docker-for-desktop from users K8S config file
18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: N/A
	 start time: N/A
	 phase: Pending
	 container status: N/A
18/10/31 11:57:52 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: docker-for-desktop
	 start time: N/A
	 phase: Pending
	 container status: N/A
...
18/10/31 11:58:03 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1540987071845-driver
	 namespace: default
	 labels: spark-app-selector -> spark-2c4abc226ed3415986eb602bd13f3582, spark-role -> driver
	 pod uid: 32462cac-dd04-11e8-b6c6-025000000001
	 creation time: 2018-10-31T11:57:52Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-glpfv
	 node name: docker-for-desktop
	 start time: 2018-10-31T11:57:52Z
	 phase: Succeeded
	 container status:
		 container name: spark-kubernetes-driver
		 container image: rvesse/spark:debian
		 container state: terminated
		 container started at: 2018-10-31T11:57:54Z
		 container finished at: 2018-10-31T11:58:02Z
		 exit code: 0
		 termination reason: Completed
```

Without the `spark.kubernetes.context` setting this will fail because the current context - `minikube` - is pointing to a non-running cluster e.g.

```
> ./spark-submit --master k8s://https://localhost:6443 --deploy-mode cluster --name spark-pi --class org.apache.spark.examples.SparkPi --conf spark.executor.instances=2 --conf spark.kubernetes.container.image=rvesse/spark:debian local:///opt/spark/examples/jars/spark-examples_2.11-3.0.0-SNAPSHOT.jar 4
18/10/31 12:02:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/31 12:02:30 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
18/10/31 12:02:31 WARN WatchConnectionManager: Exec Failure
javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 39 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 45 more
Exception in thread "kubernetes-dispatcher-0" Exception in thread "main" java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask611a9c09 rejected from java.util.concurrent.ScheduledThreadPoolExecutor404819e4[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 0]
	at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
	at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
	at java.util.concurrent.ScheduledThreadPoolExecutor.delayedExecute(ScheduledThreadPoolExecutor.java:326)
	at java.util.concurrent.ScheduledThreadPoolExecutor.schedule(ScheduledThreadPoolExecutor.java:533)
	at java.util.concurrent.ScheduledThreadPoolExecutor.submit(ScheduledThreadPoolExecutor.java:632)
	at java.util.concurrent.Executors$DelegatedExecutorService.submit(Executors.java:678)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.scheduleReconnect(WatchConnectionManager.java:300)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$800(WatchConnectionManager.java:48)
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:213)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
io.fabric8.kubernetes.client.KubernetesClientException: Failed to start websocket
	at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$2.onFailure(WatchConnectionManager.java:204)
	at okhttp3.internal.ws.RealWebSocket.failWebSocket(RealWebSocket.java:543)
	at okhttp3.internal.ws.RealWebSocket$2.onFailure(RealWebSocket.java:208)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:148)
	at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.ssl.Alerts.getSSLException(Alerts.java:192)
	at sun.security.ssl.SSLSocketImpl.fatal(SSLSocketImpl.java:1949)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:302)
	at sun.security.ssl.Handshaker.fatalSE(Handshaker.java:296)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1509)
	at sun.security.ssl.ClientHandshaker.processMessage(ClientHandshaker.java:216)
	at sun.security.ssl.Handshaker.processLoop(Handshaker.java:979)
	at sun.security.ssl.Handshaker.process_record(Handshaker.java:914)
	at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1062)
	at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
	at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
	at okhttp3.internal.connection.RealConnection.connectTls(RealConnection.java:281)
	at okhttp3.internal.connection.RealConnection.establishProtocol(RealConnection.java:251)
	at okhttp3.internal.connection.RealConnection.connect(RealConnection.java:151)
	at okhttp3.internal.connection.StreamAllocation.findConnection(StreamAllocation.java:195)
	at okhttp3.internal.connection.StreamAllocation.findHealthyConnection(StreamAllocation.java:121)
	at okhttp3.internal.connection.StreamAllocation.newStream(StreamAllocation.java:100)
	at okhttp3.internal.connection.ConnectInterceptor.intercept(ConnectInterceptor.java:42)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.cache.CacheInterceptor.intercept(CacheInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.internal.http.BridgeInterceptor.intercept(BridgeInterceptor.java:93)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RetryAndFollowUpInterceptor.intercept(RetryAndFollowUpInterceptor.java:120)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.BackwardsCompatibilityInterceptor.intercept(BackwardsCompatibilityInterceptor.java:119)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:66)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at io.fabric8.kubernetes.client.utils.HttpClientUtils$2.intercept(HttpClientUtils.java:109)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:92)
	at okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:67)
	at okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:185)
	at okhttp3.RealCall$AsyncCall.execute(RealCall.java:135)
	... 4 more
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:387)
	at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:292)
	at sun.security.validator.Validator.validate(Validator.java:260)
	at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:324)
	at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:229)
	at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:124)
	at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1491)
	... 39 more
Caused by: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
	at sun.security.provider.certpath.SunCertPathBuilder.build(SunCertPathBuilder.java:141)
	at sun.security.provider.certpath.SunCertPathBuilder.engineBuild(SunCertPathBuilder.java:126)
	at java.security.cert.CertPathBuilder.build(CertPathBuilder.java:280)
	at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:382)
	... 45 more
18/10/31 12:02:31 INFO ShutdownHookManager: Shutdown hook called
18/10/31 12:02:31 INFO ShutdownHookManager: Deleting directory /private/var/folders/6b/y1010qp107j9w2dhhy8csvz0000xq3/T/spark-5e649891-8a0f-4f17-bf3a-33b34082eba8
```

Suggested reviews: mccheah liyinan926 - this is the follow up fix to the bug discovered while working on SPARK-25809 (PR apache#22805)

Closes apache#22904 from rvesse/SPARK-25887.

Authored-by: Rob Vesse <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26685][K8S] Correct placement of ARG declaration

Latest Docker releases are stricter in their enforcement of build argument scope.  The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID.

Also with some of the refactoring of the script that has happened since my PR that introduced the configurable UID it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into

This commit moves the `ARG` declaration to just before the argument is used such that it is in scope.  It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant

Prior to the patch images are produced where the Python and R images ignore the default/configured UID:

```
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4# whoami
root
bash-4.4# id -u
0
bash-4.4# exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```

Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope.

After the patch the correct UID is observed:

```
> docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```

Closes apache#23611 from rvesse/SPARK-26685.

Authored-by: Rob Vesse <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26687][K8S] Fix handling of custom Dockerfile paths

With the changes from vanzin's PR apache#23019 (SPARK-26025) we use a pared down temporary Docker build context which significantly improves build times.  However the way this is implemented leads to non-intuitive behaviour when supplying custom Docker file paths.  This is because of the following code snippets:

```
(cd $(img_ctx_dir base) && docker build $NOCACHEARG "${BUILD_ARGS[]}" \
    -t $(image_ref spark) \
    -f "$BASEDOCKERFILE" .)
```

Since the script changes to the temporary build context directory and then runs `docker build` there any path given for the Docker file is taken as relative to the temporary build context directory rather than to the directory where the user invoked the script.  This is rather unintuitive and produces somewhat unhelpful errors e.g.

```
> ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon  218.4MB
Step 1/15 : FROM openjdk:8-alpine
 ---> 5801f7d008e5
Step 2/15 : ARG spark_uid=185
 ---> Using cache
 ---> 5fd63df1ca39
...
Successfully tagged rvesse/spark:badpath
unable to prepare context: unable to evaluate symlinks in Dockerfile path: lstat /Users/rvesse/Documents/Work/Code/spark/target/tmp/docker/pyspark/resource-managers: no such file or directory
Failed to build PySpark Docker image, please refer to Docker build output for details.
```

Here we can see that the relative path that was valid where the user typed the command was not valid inside the build context directory.

To resolve this we need to ensure that we are resolving relative paths to Docker files appropriately which we do by adding a `resolve_file` function to the script and invoking that on the supplied Docker file paths

Validated that relative paths now work as expected:

```
> ./bin/docker-image-tool.sh -r rvesse -t badpath -p resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/python/Dockerfile build
Sending build context to Docker daemon  218.4MB
Step 1/15 : FROM openjdk:8-alpine
 ---> 5801f7d008e5
Step 2/15 : ARG spark_uid=185
 ---> Using cache
 ---> 5fd63df1ca39
Step 3/15 : RUN set -ex &&     apk upgrade --no-cache &&     apk add --no-cache bash tini libc6-compat linux-pam krb5 krb5-libs &&     mkdir -p /opt/spark &&     mkdir -p /opt/spark/examples &&     mkdir -p /opt/spark/work-dir &&     touch /opt/spark/RELEASE &&     rm /bin/sh &&     ln -sv /bin/bash /bin/sh &&     echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su &&     chgrp root /etc/passwd && chmod ug+rw /etc/passwd
 ---> Using cache
 ---> eb0a568e032f
Step 4/15 : COPY jars /opt/spark/jars
...
Successfully tagged rvesse/spark:badpath
Sending build context to Docker daemon  6.599MB
Step 1/13 : ARG base_img
Step 2/13 : ARG spark_uid=185
Step 3/13 : FROM $base_img
 ---> 8f4fff16f903
Step 4/13 : WORKDIR /
 ---> Running in 25466e66f27f
Removing intermediate container 25466e66f27f
 ---> 1470b6efae61
Step 5/13 : USER 0
 ---> Running in b094b739df37
Removing intermediate container b094b739df37
 ---> 6a27eb4acad3
Step 6/13 : RUN mkdir ${SPARK_HOME}/python
 ---> Running in bc8002c5b17c
Removing intermediate container bc8002c5b17c
 ---> 19bb12f4286a
Step 7/13 : RUN apk add --no-cache python &&     apk add --no-cache python3 &&     python -m ensurepip &&     python3 -m ensurepip &&     rm -r /usr/lib/python*/ensurepip &&     pip install --upgrade pip setuptools &&     rm -r /root/.cache
 ---> Running in 12dcba5e527f
...
Successfully tagged rvesse/spark-py:badpath
```

Closes apache#23613 from rvesse/SPARK-26687.

Authored-by: Rob Vesse <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26794][SQL] SparkSession enableHiveSupport does not point to hive but in-memory while the SparkContext exists

```java
public class SqlDemo {
    public static void main(final String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("spark-sql-demo");
        JavaSparkContext sc = new JavaSparkContext(conf);
        SparkSession ss = SparkSession.builder().enableHiveSupport().getOrCreate();
        ss.sql("show databases").show();
    }
}
```
Before https://issues.apache.org/jira/browse/SPARK-20946, the demo above point to the right hive metastore if the hive-site.xml is present. But now it can only point to the default in-memory one.

Catalog is now as a variable shared across SparkSessions, it is instantiated with SparkContext's conf. After https://issues.apache.org/jira/browse/SPARK-20946, Session level configs are not pass to SparkContext's conf anymore, so the enableHiveSupport API takes no affect on the catalog instance.

You can set spark.sql.catalogImplementation=hive application wide to solve the problem, or never create a sc before you call SparkSession.builder().enableHiveSupport().getOrCreate()

Here we respect the SparkSession level configuration at the first time to generate catalog within SharedState

1. add ut
2. manually
```scala
test("enableHiveSupport has right to determine the catalog while using an existing sc") {
    val conf = new SparkConf().setMaster("local").setAppName("SharedState Test")
    val sc = SparkContext.getOrCreate(conf)
    val ss = SparkSession.builder().enableHiveSupport().getOrCreate()
    assert(ss.sharedState.externalCatalog.unwrapped.isInstanceOf[HiveExternalCatalog],
      "The catalog should be hive ")

    val ss2 = SparkSession.builder().getOrCreate()
    assert(ss2.sharedState.externalCatalog.unwrapped.isInstanceOf[HiveExternalCatalog],
      "The catalog should be shared across sessions")
  }
```

Without this fix, the above test will fail.
You can apply it to `org.apache.spark.sql.hive.HiveSharedStateSuite`,
and run,
```sbt
./build/sbt  -Phadoop-2.7 -Phive  "hive/testOnly org.apache.spark.sql.hive.HiveSharedStateSuite"
```
to verify.

Closes apache#23709 from yaooqinn/SPARK-26794.

Authored-by: Kent Yao <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

[SPARK-24894][K8S] Make sure valid host names are created for executors.

Since the host name is derived from the app name, which can contain arbitrary
characters, it needs to be sanitized so that only valid characters are allowed.

On top of that, take extra care that truncation doesn't leave characters that
are valid except at the start of a host name.

Closes apache#23781 from vanzin/SPARK-24894.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-25394][CORE] Add an application status metrics source

- Exposes several metrics regarding application status as a source, useful to scrape them via jmx instead of mining the metrics rest api.  Example use case: prometheus + jmx exporter.
- Metrics are gathered when a job ends at the AppStatusListener side, could be more fine-grained but most metrics like tasks completed are also counted by executors. More metrics could be exposed in the future to avoid scraping executors in some scenarios.
- a config option `spark.app.status.metrics.enabled` is added to disable/enable these metrics, by default they are disabled.

This was manually tested with jmx source enabled and prometheus server on k8s:
![metrics](https://user-images.githubusercontent.com/7945591/45300945-63064d00-b518-11e8-812a-d9b4155ba0c0.png)
In the next pic the job delay is shown for repeated pi calculation (Spark action).
![pi](https://user-images.githubusercontent.com/7945591/45329927-89a1a380-b56b-11e8-9cc1-5e76cb83969f.png)

Closes apache#22381 from skonto/add_app_status_metrics.

Authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-25926][CORE] Move config entries in core module to internal.config.

[SPARK-26489][CORE] Use ConfigEntry for hardcoded configs for python/r categories

[SPARK-26445][CORE] Use ConfigEntry for hardcoded configs for driver/executor categories.

[SPARK-20327][CORE][YARN] Add CLI support for YARN custom resources, like GPUs

[SPARK-26239] File-based secret key loading for SASL.

[SPARK-26482][CORE] Use ConfigEntry for hardcoded configs for ui categories

[SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories.

[SPARK-24736][K8S] Let spark-submit handle dependency resolution.

[SPARK-26420][K8S] Generate more unique IDs when creating k8s resource names.

Using the current time as an ID is more prone to clashes than people generally
realize, so try to make things a bit more unique without necessarily using a
UUID, which would eat too much space in the names otherwise.

The implemented approach uses some bits from the current time, plus some random
bits, which should be more resistant to clashes.

Closes apache#23805 from vanzin/SPARK-26420.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

[K8S][MINOR] Log minikube version when running integration tests.

Closes apache#23893 from vanzin/minikube-version.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26995][K8S] Make ld-linux-x86-64.so.2 visible to snappy native library under /lib in docker image with Alpine Linux

[SPARK-27023][K8S] Make k8s client timeouts configurable

Make k8s client timeouts configurable. No test suite exists for the client factory class, happy to add one if needed

Closes apache#23928 from onursatici/os/k8s-client-timeouts.

Lead-authored-by: Onur Satici <[email protected]>
Co-authored-by: Onur Satici <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>

[SPARK-27061][K8S] Expose Driver UI port on driver service to access …

Expose Spark UI port on driver service to access logs from service.

The patch was tested using unit tests being contributed as a part of the PR

Closes apache#23990 from chandulal/SPARK-27061.

Authored-by: chandulal.kavar <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26343][K8S] Try to speed up running local k8s integration tests

Speed up running k8s integration tests locally by allowing folks to skip the tgz dist build and extraction

Run tests locally without a distribution of Spark, just a local build

Closes apache#23380 from holdenk/SPARK-26343-Speed-up-running-the-kubernetes-integration-tests-locally.

Authored-by: Holden Karau <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-26729][K8S] Make image names under test configurable

[SPARK-24793][K8S] Enhance spark-submit for app management

- supports `--kill` & `--status` flags.
- supports globs which is useful in general check this long standing [issue](kubernetes/kubernetes#17144 (comment)) for kubectl.

Manually against running apps. Example output:

Submission Id reported at launch time:

```
2019-01-20 23:47:56 INFO  Client:58 - Waiting for application spark-pi with submissionId spark:spark-pi-1548020873671-driver to finish...
```

Killing the app:

```
./bin/spark-submit --kill spark:spark-pi-1548020873671-driver --master  k8s://https://192.168.2.8:8443
2019-01-20 23:48:07 WARN  Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0)
2019-01-20 23:48:07 WARN  Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address

```

App terminates with 143 (SIGTERM, since we have tiny this should lead to [graceful shutdown](https://cloud.google.com/solutions/best-practices-for-building-containers)):

```
2019-01-20 23:48:08 INFO  LoggingPodStatusWatcherImpl:58 - State changed, new state:
	 pod name: spark-pi-1548020873671-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver
	 pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T21:47:55Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T21:47:55Z
	 phase: Running
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: running
		 container started at: 2019-01-20T21:48:00Z
2019-01-20 23:48:09 INFO  LoggingPodStatusWatcherImpl:58 - State changed, new state:
	 pod name: spark-pi-1548020873671-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver
	 pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T21:47:55Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T21:47:55Z
	 phase: Failed
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: terminated
		 container started at: 2019-01-20T21:48:00Z
		 container finished at: 2019-01-20T21:48:08Z
		 exit code: 143
		 termination reason: Error
2019-01-20 23:48:09 INFO  LoggingPodStatusWatcherImpl:58 - Container final statuses:
	 container name: spark-kubernetes-driver
	 container image: skonto/spark:k8s-3.0.0
	 container state: terminated
	 container started at: 2019-01-20T21:48:00Z
	 container finished at: 2019-01-20T21:48:08Z
	 exit code: 143
	 termination reason: Error
2019-01-20 23:48:09 INFO  Client:58 - Application spark-pi with submissionId spark:spark-pi-1548020873671-driver finished.
2019-01-20 23:48:09 INFO  ShutdownHookManager:58 - Shutdown hook called
2019-01-20 23:48:09 INFO  ShutdownHookManager:58 - Deleting directory /tmp/spark-f114b2e0-5605-4083-9203-a4b1c1f6059e

```

Glob scenario:

```
./bin/spark-submit --status spark:spark-pi* --master  k8s://https://192.168.2.8:8443
2019-01-20 22:27:44 WARN  Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0)
2019-01-20 22:27:44 WARN  Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address
Application status (driver):
	 pod name: spark-pi-1547948600328-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-f13f01702f0b4503975ce98252d59b94, spark-role -> driver
	 pod uid: c576e1c6-1c54-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T01:43:22Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T01:43:22Z
	 phase: Running
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: running
		 container started at: 2019-01-20T01:43:27Z
Application status (driver):
	 pod name: spark-pi-1547948792539-driver
	 namespace: spark
	 labels: spark-app-selector -> spark-006d252db9b24f25b5069df357c30264, spark-role -> driver
	 pod uid: 38375b4b-1c55-11e9-8215-a434d9270a65
	 creation time: 2019-01-20T01:46:35Z
	 service account name: spark-sa
	 volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm
	 node name: minikube
	 start time: 2019-01-20T01:46:35Z
	 phase: Succeeded
	 container status:
		 container name: spark-kubernetes-driver
		 container image: skonto/spark:k8s-3.0.0
		 container state: terminated
		 container started at: 2019-01-20T01:46:39Z
		 container finished at: 2019-01-20T01:46:56Z
		 exit code: 0
		 termination reason: Completed

```

Closes apache#23599 from skonto/submit_ops_extension.

Authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>

[SPARK-24902][K8S] Add PV integration tests

- Adds persistent volume integration tests
- Adds a custom tag to the test to exclude it if it is run against a cloud backend.
- Assumes default fs type for the host, AFAIK that is ext4.

Manually run the tests against minikube as usual:
```
[INFO] --- scalatest-maven-plugin:1.0:test (integration-test)  spark-kubernetes-integration-tests_2.12 ---
Discovery starting.
Discovery completed in 192 milliseconds.
Run starting. Expected test count is: 16
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Test PVs with local storage
```

Closes apache#23514 from skonto/pvctests.

Authored-by: Stavros Kontopoulos <[email protected]>
Signed-off-by: shane knapp <[email protected]>

[SPARK-27216][CORE][BACKPORT-2.4] Upgrade RoaringBitmap to 0.7.45 to fix Kryo unsafe ser/dser issue

Fix ImplicitCastInputTypes
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants