Initial Kubernetes cluster manager implementation #50
Includes the following initial feature set:
- Cluster mode with only Scala/Java jobs
- Spark-submit support
- Dynamic allocation

Does not include, most notably:
- Client mode support
- Proper testing on both the unit and integration level; integration tests are flaky
Actually we have to look up the container status for the container that is hosting the driver... will need to think about this. Alternatively we can just wait for all container statuses to be ready.
Should probably check if it's a file (which implicitly checks existence) to ensure it's not a directory
```scala
Executors.newCachedThreadPool(
  new ThreadFactoryBuilder()
    .setDaemon(true)
    .setNameFormat("kubernetes-executor-requests")
```
I think we want a name format of "kubernetes-executor-requests-%d"
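To illustrate why the `%d` matters: the format string is expanded with a per-thread counter, so each thread in the pool gets a distinct, numbered name. The sketch below is a minimal plain-Java stand-in for Guava's `ThreadFactoryBuilder` (the class name and helper are assumptions for illustration, not code from the PR):

```java
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedThreadFactoryDemo {
    // Minimal stand-in for Guava's ThreadFactoryBuilder: the "%d" in the
    // name format is filled with a per-thread counter, so each thread gets
    // a distinct name ("kubernetes-executor-requests-0", "-1", ...).
    static ThreadFactory daemonFactory(String nameFormat) {
        AtomicInteger counter = new AtomicInteger(0);
        return runnable -> {
            Thread t = new Thread(runnable);
            t.setDaemon(true);
            t.setName(String.format(nameFormat, counter.getAndIncrement()));
            return t;
        };
    }

    public static void main(String[] args) {
        ThreadFactory f = daemonFactory("kubernetes-executor-requests-%d");
        System.out.println(f.newThread(() -> {}).getName());
        System.out.println(f.newThread(() -> {}).getName());
        // prints:
        // kubernetes-executor-requests-0
        // kubernetes-executor-requests-1
    }
}
```

Without the `%d`, every thread in the pool would carry the identical name, which makes thread dumps much harder to read.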
```dockerfile
WORKDIR /opt/spark

# TODO support spark.executor.extraClassPath
CMD ${JAVA_HOME}/bin/java -Dspark.executor.port=$SPARK_EXECUTOR_PORT -Xmx$SPARK_EXECUTOR_MEMORY -cp ${SPARK_HOME}/jars/\* org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $(dbus-uuidgen) --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $HOSTNAME
```
Should set the min heap as well (`-Xms$SPARK_EXECUTOR_MEMORY`) to avoid a bunch of full GCs caused by JVM ergonomics dynamically resizing the heap.
```dockerfile
WORKDIR /opt/spark

CMD ${JAVA_HOME}/bin/java -Dspark.shuffle.service.port=$SPARK_SHUFFLE_SERVICE_PORT -Xmx1g -cp ${SPARK_HOME}/jars/\* org.apache.spark.deploy.ExternalShuffleService
```
Configurable heap? Also set min heap = max?
```scala
case KUBERNETES_EXPOSE_DRIVER_PORT =>
  value.split("=", 2).toSeq match {
    case Seq(k, v) => exposeDriverPorts(k) = v.toInt
```
Yes. We should probably wrap this in a try-catch, though, to produce a more favorable message than what I presume the default `NumberFormatException` will be.
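A sketch of the idea in Java (the class and method names here are hypothetical, not from the PR): catch the parse failure and rethrow with a message that names the offending key and value, instead of surfacing a bare `NumberFormatException`.

```java
public class PortMappingParser {
    // Hypothetical sketch: parse the value side of a "key=value" port
    // mapping, failing with a message that identifies the bad input.
    static int parsePort(String key, String rawValue) {
        try {
            return Integer.parseInt(rawValue);
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Port for " + key + " must be an integer, got: " + rawValue, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePort("driver-ui", "4040")); // prints 4040
        try {
            parsePort("driver-ui", "not-a-port");
        } catch (IllegalArgumentException e) {
            // prints: Port for driver-ui must be an integer, got: not-a-port
            System.out.println(e.getMessage());
        }
    }
}
```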
```java
protected final String KUBERNETES_APP_NAMESPACE = "--kubernetes-app-namespace";
protected final String KUBERNETES_CLIENT_CERT_FILE = "--kubernetes-client-cert-file";
protected final String KUBERNETES_CLIENT_KEY_FILE = "--kubernetes-client-key-file";
protected final String KUBERNETES_CA_CERT_FILE = "--kubernetes-ca-cert-file";
```
```xml
<artifactId>netty</artifactId>
<version>3.8.0.Final</version>
</dependency>
```
```scala
 * the locally-generated ID from the superclass.
 *
 * @return The application ID
```
- Run Spark shuffle service independently of a Spark job
- Run executor pods instead of replication controllers
@ash211 @mccheah Shall we move these changes to a branch in https://github.com/foxish/spark/ and continue the discussions there on the various points we discussed last week?

@foxish yep! Just got legal signoff this morning so we'll be sending it over shortly

Thanks!
A HostPath volume will be used by both the shuffle service and the executors that connect to it. The shuffle service picks up the files written by the executors via the shared host volume.
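For context, the sharing works because a `hostPath` volume mounts a directory from the node itself, so any pods on that node mounting the same path see the same files. The fragment below is an illustrative sketch only; the names, image, and path are assumptions, not taken from the PR:

```yaml
# Illustrative sketch: an executor pod mounting a hostPath directory.
# A shuffle-service pod on the same node would declare the same hostPath
# volume, and would therefore see the shuffle files the executor writes.
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example   # hypothetical name
spec:
  containers:
    - name: executor
      image: spark-executor:latest   # hypothetical image
      volumeMounts:
        - name: spark-shuffle-dir
          mountPath: /tmp/spark-shuffle
  volumes:
    - name: spark-shuffle-dir
      hostPath:
        path: /tmp/spark-shuffle   # shared node-local directory
```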
@foxish can I get permission to push to your fork?

Code transferred to foxish#7 with discussion now happening there instead
* Create README to better describe project purpose
* Add links to usage guide and dev docs
* Minor changes