
Conversation

@hopper-signifyd commented Jul 2, 2020

What changes were proposed in this pull request?

This is a backport of the changes from SPARK-25262 (PR #22323).

I also plan to backport SPARK-28042 (PR #24879) once this lands. That PR will include a test confirming that the related issue SPARK-31666 is fixed.

I had originally submitted both as a single PR, but decided to close it and open two smaller ones per the recommendation of @dongjoon-hyun. See #28982 for more details.

Why are the changes needed?

When running Spark on Kubernetes, not being able to use mounted volumes as local storage causes issues that prevent Spark jobs from starting. I've seen this on AWS EKS, but I've also been able to reproduce it with a basic spark-submit command on a standard Kubernetes cluster. Upgrading to 3.0 just to fix this bug is more hassle than it's worth for some organizations.
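
For concreteness, here is a sketch (my own illustration, not text from the original PRs) of the kind of configuration this work unblocks. It uses the spark-local-dir- volume-name convention from SPARK-28042, so the full capability depends on this PR plus the follow-up backport mentioned above; the path and volume name are placeholders.

# Illustrative fragment only: a hostPath volume whose name starts with
# "spark-local-dir-" is used as the executor's local scratch space instead
# of the default emptyDir. This works on Spark 3; the backports aim to
# allow the same on 2.4.
      --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/data/spark-local \
      --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/data/spark-local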

Does this PR introduce any user-facing change?

Technically, yes. This backports the spark.kubernetes.local.dirs.tmpfs configuration option from Spark 3 to Spark 2.4. However, there are no breaking changes per se.
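
For illustration (my example, not part of this diff), the backported option would be used roughly like this; when enabled, Spark creates its local-dir emptyDir volumes with a memory-backed (tmpfs) medium, so scratch space counts against the pod's memory:

# Sketch only: same shape as the spark-submit command quoted later in this
# thread, with the backported tmpfs flag added. Cluster address and image
# are placeholders taken from that example.
bin/spark-submit \
      --master k8s://$K8S_MASTER \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.executor.instances=1 \
      --conf spark.kubernetes.container.image=spark/spark:v2.4.6 \
      --conf spark.kubernetes.local.dirs.tmpfs=true \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar 10000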

How was this patch tested?

The tests were backported as well. We've also been running a custom Spark 2.4.5 build with this patch applied at my organization for the past few months.

@dongjoon-hyun (Member) left a comment

So, how does this help SPARK-31666? Is there any unit test for that?

@hopper-signifyd (Author) commented

As best I can tell, both patches are required to fix SPARK-31666. My plan was to add the additional unit test in the PR that backports SPARK-28042; if I add it to this PR, it will fail.

@dongjoon-hyun (Member) left a comment

I investigated SPARK-31666 and left some comments.

In short, the claim "Cannot map hostPath volumes to container" is wrong, because SPARK-23529 works correctly, as the following shows.

# minikube ssh ls /data
SPARK-31666.txt
export HTTP2_DISABLE=true
bin/spark-submit \
      --master k8s://$K8S_MASTER \
      --deploy-mode cluster \
      --name spark-pi \
      --class org.apache.spark.examples.SparkPi \
      --conf spark.kubernetes.driverEnv.HTTP2_DISABLE=true \
      --conf spark.executor.instances=1 \
      --conf spark.kubernetes.container.image=spark/spark:v2.4.6 \
      --conf spark.kubernetes.executor.volumes.hostPath.data.mount.path=/data \
      --conf spark.kubernetes.executor.volumes.hostPath.data.options.path=/data \
      local:///opt/spark/examples/jars/spark-examples_2.11-2.4.6.jar 10000
# k exec po/spark-pi-1593729363998-exec-1 -- ls /data
SPARK-31666.txt

Please see your error message: Invalid value: "/tmp1": must be unique. The error occurs because spark-local-dir-x is already mounted as a volume by Spark itself; you should not use the same name. (A hypothetical configuration that triggers this is sketched after the pod status below.)

20/07/02 15:38:39 INFO LoggingPodStatusWatcherImpl: State changed, new state:
	 pod name: spark-pi-1593729518015-driver
	 namespace: default
	 labels: spark-app-selector -> spark-74b65a9a61cc46fd8bfc5e03e4b28bb8, spark-role -> driver
	 pod uid: d838532b-eaa9-4b11-8eba-655f66965580
	 creation time: 2020-07-02T22:38:39Z
	 service account name: default
	 volumes: spark-local-dir-1, spark-conf-volume, default-token-n5wwg
	 node name: N/A
	 start time: N/A
	 container images: N/A
	 phase: Pending
	 status: []
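
To make the collision concrete, here is a hypothetical configuration (my guess at the failing pattern, not the reporter's exact command) that produces this class of error on Spark 2.4: Spark already mounts its own emptyDir at the configured local dir, so a user-supplied volume that reuses the same path or the spark-local-dir- name is rejected by the Kubernetes API server.

# Hypothetical, for illustration only (paths and names are assumptions):
# Spark 2.4 itself creates an emptyDir volume "spark-local-dir-1" mounted at
# the configured local dir (/tmp1 here), so the extra hostPath volume below
# duplicates that mount and the pod spec fails validation.
      --conf spark.local.dir=/tmp1 \
      --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/tmp1 \
      --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/tmp1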

@dongjoon-hyun (Member) commented Jul 2, 2020

I closed SPARK-31666 as Not A Problem because Apache Spark 2.4's SPARK-23529 was not designed for duplicate volume names. It works as designed, and your configuration is not supported by Kubernetes. It seems that you are asking for a new feature rather than a bug fix.

@AmplabJenkins commented

Can one of the admins verify this patch?
