
Commit c7c418d

Arthur Rand authored and susanxhuynh committed
[SPARK-638] Update install docs for strict mode (apache#260)
* update install docs for strict mode
* fix table formatting?
* small edits
* update troubleshooting for bootstrap workaround
* Update install.md: minor typo
* Update troubleshooting.md: added extra information on how to remove NO_BOOTSTRAP
* small change to command
* added docs for quota and strict
* small cleanup
* added option to enable bootstrap for IP detection in the dispatcher
* fix logic for using bootstrap
1 parent a29e54e commit c7c418d

File tree

6 files changed: +322 -18 lines changed

conf/spark-env.sh

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ MESOS_NATIVE_JAVA_LIBRARY=/opt/mesosphere/libmesos-bundle/lib/libmesos.so

# Unless explicitly directed, use bootstrap (defined on L55 of Dockerfile) to lookup the IP of the driver agent
# this should be LIBPROCESS_IP iff the driver is on the host network, $(hostname) when it's not (e.g. CNI).
-if [ -z ${NO_BOOTSTRAP} ]; then
+if [ -z ${SKIP_BOOTSTRAP_IP_DETECT} ]; then
    if [ -f ${BOOTSTRAP} ]; then
        echo "Using bootstrap to set SPARK_LOCAL_IP" >&2
        SPARK_LOCAL_IP=$($BOOTSTRAP --get-task-ip)

docs/install.md

Lines changed: 178 additions & 11 deletions
@@ -12,16 +12,16 @@ Spark is available in the Universe and can be installed by using either the GUI

**Prerequisites:**

-- [DC/OS and DC/OS CLI installed](https://docs.mesosphere.com/1.9/installing/).
-- Depending on your [security mode](https://docs.mesosphere.com/1.9/overview/security/security-modes/), Spark requires
-  service authentication for access to DC/OS. For more information, see [Configuring DC/OS Access for
-  Spark](https://docs.mesosphere.com/services/spark/spark-auth/).
-
-| Security mode | Service Account |
+- [DC/OS and DC/OS CLI installed](https://docs.mesosphere.com/1.10/installing/oss/).
+- Depending on your [security mode](https://docs.mesosphere.com/1.10/security/ent/#security-modes), Spark requires
+  service authentication for access to DC/OS. For more information:
+
+| Security mode | Service Account       |
|---------------|-----------------------|
-| Disabled | Not available |
-| Permissive | Optional |
-| Strict | Required |
+| Disabled      | Not available         |
+| Permissive    | Optional              |
+| Strict        | **Required**          |
+

# Default Installation
To install the DC/OS Apache Spark service, run the following command on the DC/OS CLI. This installs the Spark DC/OS
@@ -77,6 +77,7 @@ dcos package describe spark --config
```

## Customize Spark Distribution
+
DC/OS Apache Spark does not support arbitrary Spark distributions, but Mesosphere does provide multiple pre-built
distributions, primarily used to select Hadoop versions.

@@ -142,11 +143,177 @@ Install Spark with the options file specified:
dcos package install --options=multiple.json spark
```

-Alternatively, you can specify a Spark instance directly from the CLI. For example:
+To specify which instance of Spark to use, add `--name=<service_name>` to your CLI command. For example:

```bash
-dcos config set spark.app_id spark-dev
+$ dcos spark --name=spark-dev run ...
```

+# Installation for Strict mode (setting service authentication)
+
+If your cluster is set up for [strict](https://docs.mesosphere.com/1.10/security/ent/#strict) security, then you will need
+to follow these steps to install and run Spark.
+
+## Service Accounts and Secrets
+
+1. Install the `dcos-enterprise-cli` to get the CLI security commands (if you haven't already):
+
+   ```bash
+   $ dcos package install dcos-enterprise-cli
+   ```
+
+1. Create a key pair. The following command uses the Enterprise DC/OS CLI to create a 2048-bit RSA public-private key
+   pair and save each value into a separate file within the current directory.
+
+   ```bash
+   $ dcos security org service-accounts keypair <your-private-key>.pem <your-public-key>.pem
+   ```
+
+   For example:
+
+   ```bash
+   dcos security org service-accounts keypair private-key.pem public-key.pem
+   ```
+
+1. Create a new service account, `<service-account>` (e.g. `spark-principal`), containing the public key
+   `<your-public-key>.pem`.
+
+   ```bash
+   $ dcos security org service-accounts create -p <your-public-key>.pem -d "Spark service account" <service-account>
+   ```
+
+   For example:
+
+   ```bash
+   dcos security org service-accounts create -p public-key.pem -d "Spark service account" spark-principal
+   ```
+
+   In Mesos parlance, a `service-account` is called a `principal`, so we use the terms interchangeably here.
+
+   **Note:** You can verify your new service account using the following command.
+
+   ```bash
+   $ dcos security org service-accounts show <service-account>
+   ```
+
+1. Create a secret (e.g. `spark/<secret-name>`) containing your service account, `<service-account>`, and your private
+   key, `<your-private-key>.pem`.
+
+   ```bash
+   # permissive mode
+   $ dcos security secrets create-sa-secret <your-private-key>.pem <service-account> spark/<secret-name>
+   # strict mode
+   $ dcos security secrets create-sa-secret --strict <your-private-key>.pem <service-account> spark/<secret-name>
+   ```
+
+   For example, on a strict-mode DC/OS cluster:
+
+   ```bash
+   dcos security secrets create-sa-secret --strict private-key.pem spark-principal spark/spark-secret
+   ```
+
+   **Note:** You can verify that the secret was created with:
+
+   ```bash
+   $ dcos security secrets list /
+   ```
+
+## Assigning permissions
+
+Permissions must be created so that the Spark service can start Spark jobs, and so that the jobs themselves can launch
+the executors that perform the work on their behalf. There are a few points to keep in mind depending on your cluster:
+
+* RHEL/CentOS users cannot currently run Spark in strict mode as user `nobody`, but must run as user `root`. This is
+  due to how accounts are mapped to UIDs. CoreOS users are unaffected, and can run as user `nobody`. We designate the
+  user as `spark-user` below.
+
+* Spark runs by default under the Mesos default role, which is represented by the `*` symbol. You can deploy multiple
+  instances of Spark without modifying this default. If you want to override the default Spark role, you must modify
+  these code samples accordingly. We use `spark-service-role` to designate the role used below.
+
+Permissions can also be assigned through the UI.
+
+1. Run the following to create the required permissions for Spark:
+
+   ```bash
+   $ dcos security org users grant <service-account> dcos:mesos:master:task:user:<user> create --description "Allows the Linux user to execute tasks"
+   $ dcos security org users grant <service-account> dcos:mesos:master:framework:role:<spark-service-role> create --description "Allows a framework to register with the Mesos master using the Mesos default role"
+   $ dcos security org users grant <service-account> dcos:mesos:master:task:app_id:/<service_name> create --description "Allows reading of the task state"
+   ```
+
+   Note that above, `dcos:mesos:master:task:app_id:/<service_name>` will likely be `dcos:mesos:master:task:app_id:/spark`.
+
+   For example, continuing from above:
+
+   ```bash
+   dcos security org users grant spark-principal dcos:mesos:master:task:user:root create --description "Allows the Linux user to execute tasks"
+   dcos security org users grant spark-principal dcos:mesos:master:framework:role:* create --description "Allows a framework to register with the Mesos master using the Mesos default role"
+   dcos security org users grant spark-principal dcos:mesos:master:task:app_id:/spark create --description "Allows reading of the task state"
+   ```
+
+   Note that here we're using the service account `spark-principal` and the user `root`.
+
+1. If you are running the Spark service as `root` (as we are in this example), you will need to add an additional
+   permission for Marathon:
+
+   ```bash
+   dcos security org users grant dcos_marathon dcos:mesos:master:task:user:root create --description "Allow Marathon to launch containers as root"
+   ```
+
+## Install Spark with necessary configuration
+
+1. Before installing Spark, make a configuration file (e.g. `spark-strict-options.json`) with the following; these
+   settings can also be set through the UI:
+
+   ```json
+   {
+     "service": {
+       "service_account": "<service-account-id>",
+       "user": "<user>",
+       "service_account_secret": "spark/<secret_name>"
+     }
+   }
+   ```
+
+   A minimal example would be:
+
+   ```json
+   {
+     "service": {
+       "service_account": "spark-principal",
+       "user": "root",
+       "service_account_secret": "spark/spark-secret"
+     }
+   }
+   ```
+
+   Then install:
+
+   ```bash
+   $ dcos package install spark --options=spark-strict-options.json
+   ```
+
+## Add necessary configuration to your Spark jobs when submitting them
+
+* To run a job on a strict-mode cluster, you must add the `principal` to the command line. For example:
+
+  ```bash
+  $ dcos spark run --verbose --submit-args=" \
+  --conf spark.mesos.principal=<service-account> \
+  --conf spark.mesos.containerizer=mesos \
+  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
+  ```
+
+  If you want to use the [Docker Engine](/1.10/deploying-services/containerizers/docker-containerizer/) instead of the
+  [Universal Container Runtime](/1.10/deploying-services/containerizers/ucr/), you must specify the user through the
+  `SPARK_USER` environment variable:
+
+  ```bash
+  $ dcos spark run --verbose --submit-args="\
+  --conf spark.mesos.principal=<service-account> \
+  --conf spark.mesos.driverEnv.SPARK_USER=nobody \
+  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
+  ```

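As an aside, the strict-mode submit options compose with the `--name` flag shown earlier when you run more than one
Spark instance. The following is a hedged illustration only, reusing the `spark-dev` service name and `spark-principal`
account from the examples above; substitute your own names:

```bash
# Illustrative only: submit to a second Spark instance named "spark-dev",
# authenticating with the service account created above.
dcos spark --name=spark-dev run --verbose --submit-args=" \
  --conf spark.mesos.principal=spark-principal \
  --conf spark.mesos.containerizer=mesos \
  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
```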
[7]: #custom
[16]: https://github.com/mesosphere/dcos-vagrant

docs/job-scheduling.md

Lines changed: 86 additions & 0 deletions
@@ -47,6 +47,7 @@ mode above).
Quota for the Drivers allows the operator of the cluster to ensure that only a given number of Drivers are concurrently
running. As additional Drivers are submitted, they will be enqueued by the Spark Dispatcher. Below are the recommended
steps for setting Quota for the Drivers:
+
1. Set the Quota conservatively, keeping in mind that it will affect the number of jobs that can run concurrently.
1. Decide how much of your cluster's resources to allocate to running Drivers. These resources will only be used for
   the Spark Drivers, meaning that here we can decide roughly how many concurrent jobs we’d like to have running at a
@@ -142,6 +143,91 @@ job from consuming the entire Quota the max CPUs for that Spark job should be se
Quota’s resources. This ensures that the Spark job will get sufficient resources to make progress; setting the max CPUs
ensures it will not starve other Spark jobs of resources and gives predictable offer suppression semantics.

+## Permissions when using Quota with Strict mode
+
+Strict mode clusters (see [security modes](https://docs.mesosphere.com/1.10/security/ent/#security-modes)) require extra
+permissions to be set in order to use Quota. Follow the instructions in
+[installing](https://github.com/mesosphere/spark-build/blob/master/docs/install.md) and add the additional permissions
+for the roles you intend to use; a sketch of such grants is given below, followed by the Quota setup for the example
+above.
+
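The exact role-specific grants are not spelled out in this commit. Assuming they follow the same
`dcos security org users grant` pattern used in the install docs, a sketch for the `dispatcher` and `executor` roles of
this example might look like the following (illustrative only; verify the permission strings against your DC/OS
version):

```bash
# Illustrative sketch (not from the commit): allow the Spark principal to
# register frameworks under the custom roles used in this example.
dcos security org users grant spark-principal dcos:mesos:master:framework:role:dispatcher create --description "Register frameworks under the dispatcher role"
dcos security org users grant spark-principal dcos:mesos:master:framework:role:executor create --description "Register frameworks under the executor role"
```

With grants like these in place, the Quotas for the `dispatcher` and `executor` roles can be set as in the following
steps.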
+1. First set Quota for the Dispatcher's role (`dispatcher`):
+
+   ```bash
+   $ cat dispatcher-quota.json
+   {
+     "role": "dispatcher",
+     "guarantee": [
+       {
+         "name": "cpus",
+         "type": "SCALAR",
+         "scalar": { "value": 5.0 }
+       },
+       {
+         "name": "mem",
+         "type": "SCALAR",
+         "scalar": { "value": 5120.0 }
+       }
+     ]
+   }
+   ```
+
+   Then set the Quota *from your local machine*. This assumes you have downloaded the CA certificate, `dcos-ca.crt`, to
+   your local machine via the `https://<dcos_url>/ca/dcos-ca.crt` endpoint:
+
+   ```bash
+   curl -X POST --cacert dcos-ca.crt -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/quota -d @dispatcher-quota.json -H 'Content-Type: application/json'
+   ```
+
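   To double-check what was registered, the same endpoint should answer a plain GET with the current Quota status. This
   is a hedged example; it assumes the Mesos `/quota` endpoint is reachable through the same Admin Router path and
   token used above:

   ```bash
   # Illustrative: list the Quotas currently configured on the cluster.
   curl --cacert dcos-ca.crt -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/quota
   ```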
+1. Optionally set Quota for the executors as well; this works the same way as above:
+
+   ```bash
+   $ cat executor-quota.json
+   {
+     "role": "executor",
+     "guarantee": [
+       {
+         "name": "cpus",
+         "type": "SCALAR",
+         "scalar": { "value": 100.0 }
+       },
+       {
+         "name": "mem",
+         "type": "SCALAR",
+         "scalar": { "value": 409600.0 }
+       }
+     ]
+   }
+   ```
+
+   Then set the Quota from your local machine, again assuming you have `dcos-ca.crt` locally:
+
+   ```bash
+   curl -X POST --cacert dcos-ca.crt -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/quota -d @executor-quota.json -H 'Content-Type: application/json'
+   ```
+
+1. Install Spark with these minimal configurations:
+
+   ```json
+   {
+     "service": {
+       "service_account": "spark-principal",
+       "role": "dispatcher",
+       "user": "root",
+       "service_account_secret": "spark/spark-secret"
+     }
+   }
+   ```
+
+1. Now you're ready to run a Spark job using the principal and the roles you set:
+
+   ```bash
+   dcos spark run --verbose --submit-args=" \
+   --conf spark.mesos.principal=spark-principal \
+   --conf spark.mesos.role=executor \
+   --conf spark.mesos.containerizer=mesos \
+   --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
+   ```

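Tying this back to the guidance above about keeping any single job from consuming the entire Quota, you can also cap a
job's cores with `spark.cores.max`. The numbers below are illustrative; the cap just needs to stay below the executor
role's guarantee (100 CPUs in this example):

```bash
# Illustrative: same job as above, capped at 50 cores so it cannot exhaust the executor Quota.
dcos spark run --verbose --submit-args=" \
  --conf spark.mesos.principal=spark-principal \
  --conf spark.mesos.role=executor \
  --conf spark.mesos.containerizer=mesos \
  --conf spark.cores.max=50 \
  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
```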
# Setting `spark.cores.max`

To improve Spark job execution reliability, set the maximum number of cores consumed by any given job. This avoids

docs/troubleshooting.md

Lines changed: 45 additions & 5 deletions
@@ -9,19 +9,59 @@ menuWeight: 125

# Dispatcher

-The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may arise if the dispatcher does not receive the resources offers you expect from Mesos, or if driver submission is failing. To debug this class of issue, visit the Mesos UI at `http://<dcos-url>/mesos/` and navigate to the sandbox for the dispatcher.
+* The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may
+  arise if the dispatcher does not receive the resource offers you expect from Mesos, or if driver submission is
+  failing. To debug this class of issue, visit the Mesos UI at `http://<dcos-url>/mesos/` and navigate to the sandbox
+  for the dispatcher.
+
+* Spark has an internal mechanism for detecting the IP of the host. We use this method by default, but sometimes it
+  fails, returning errors like these:
+
+  ```
+  ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
+  java.net.UnknownHostException: ip-172-31-4-148: ip-172-31-4-148: Name or service not known
+      at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
+      at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
+      at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
+      at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
+      at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
+      at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
+      at scala.Option.getOrElse(Option.scala:121)
+      at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
+      at org.apache.spark.deploy.mesos.MesosClusterDispatcherArguments.<init>(MesosClusterDispatcherArguments.scala:27)
+      at org.apache.spark.deploy.mesos.MesosClusterDispatcher$.main(MesosClusterDispatcher.scala:103)
+      at org.apache.spark.deploy.mesos.MesosClusterDispatcher.main(MesosClusterDispatcher.scala)
+  Caused by: java.net.UnknownHostException: ip-172-31-4-148: Name or service not known
+      at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
+      at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
+      at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
+      at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
+      ... 10 more
+  18/01/25 17:42:57 INFO ShutdownHookManager: Shutdown hook called
+  ```
+
+  In this case, enable the `service.use_bootstrap_for_IP_detect` option in the Dispatcher config (via the UI, by
+  editing the task, or by setting it to `true` in the options.json; a sample snippet is shown below) and restart the
+  service. This will cause the DC/OS-specific `bootstrap` utility to detect the IP, which may allow the initialization
+  of the Spark service to complete.

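For reference, a minimal options file that enables this could look like the following sketch. It assumes the new flag
lives under the `service` section, as the name `service.use_bootstrap_for_IP_detect` above suggests:

```json
{
  "service": {
    "use_bootstrap_for_IP_detect": true
  }
}
```

Such a file can be passed at install time with `dcos package install spark --options=<file>.json`.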
# Jobs

-* DC/OS Apache Spark jobs are submitted through the dispatcher, which displays Spark properties and job state. Start here to verify that the job is configured as you expect.
+* DC/OS Apache Spark jobs are submitted through the dispatcher, which displays Spark properties and job state. Start
+  here to verify that the job is configured as you expect.
+
-* The dispatcher further provides a link to the job's entry in the history server, which displays the Spark Job UI. This UI shows the for the job. Go here to debug issues with scheduling and performance.
+* The dispatcher further provides a link to the job's entry in the history server, which displays the Spark Job UI.
+  This UI shows the details for the job. Go here to debug issues with scheduling and performance.
+
-* Jobs themselves log output to their sandbox, which you can access through the Mesos UI. The Spark logs will be sent to `stderr`, while any output you write in your job will be sent to `stdout`.
+* Jobs themselves log output to their sandbox, which you can access through the Mesos UI. The Spark logs will be sent
+  to `stderr`, while any output you write in your job will be sent to `stdout`.
+
+* To disable using the Mesosphere `bootstrap` utility for host IP detection in jobs, add
+  `spark.mesos.driverEnv.SKIP_BOOTSTRAP_IP_DETECT=true` to your job configuration (see the example below).

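As a concrete illustration, the variable can be passed on the submit line like any other driver environment variable;
the jar and class here are just the SparkPi example used elsewhere in these docs:

```bash
# Illustrative: run a job with bootstrap-based IP detection disabled for the driver.
dcos spark run --submit-args=" \
  --conf spark.mesos.driverEnv.SKIP_BOOTSTRAP_IP_DETECT=true \
  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
```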
# CLI

-The Spark CLI is integrated with the dispatcher so that they always use the same version of Spark, and so that certain defaults are honored. To debug issues with their communication, run your jobs with the `--verbose` flag.
+The Spark CLI is integrated with the dispatcher so that they always use the same version of Spark, and so that certain
+defaults are honored. To debug issues with their communication, run your jobs with the `--verbose` flag.

# HDFS Kerberos

universe/config.json

Lines changed: 5 additions & 0 deletions
@@ -60,6 +60,11 @@
      "type": "boolean",
      "description": "Launch the Dispatcher using the Universal Container Runtime (UCR)",
      "default": false
+    },
+    "use_bootstrap_for_IP_detect": {
+      "type": "boolean",
+      "description": "Use the bootstrap utility for detecting host IP as opposed to using Spark's internal mechanism, see troubleshooting.md.",
+      "default": false
    }
  }
},
