
Commit c7c418d

Arthur Rand authored and susanxhuynh committed
[SPARK-638] Update install docs for strict mode (apache#260)
* update install docs for strict mode
* fix table formatting?
* small edits
* update troubleshooting for bootstrap workaround
* Update install.md: minor typo
* Update troubleshooting.md: added extra information on how to remove NO_BOOTSTRAP
* small change to command
* added docs for quota and strict
* small cleanup
* added option to enable bootstrap for IP detection in the dispatcher
* fix logic for using bootstrap
1 parent a29e54e commit c7c418d

File tree

6 files changed: +322 -18 lines changed

conf/spark-env.sh

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ MESOS_NATIVE_JAVA_LIBRARY=/opt/mesosphere/libmesos-bundle/lib/libmesos.so

# Unless explicitly directed, use bootstrap (defined on L55 of Dockerfile) to lookup the IP of the driver agent
# this should be LIBPROCESS_IP iff the driver is on the host network, $(hostname) when it's not (e.g. CNI).
-if [ -z ${NO_BOOTSTRAP} ]; then
+if [ -z ${SKIP_BOOTSTRAP_IP_DETECT} ]; then
    if [ -f ${BOOTSTRAP} ]; then
        echo "Using bootstrap to set SPARK_LOCAL_IP" >&2
        SPARK_LOCAL_IP=$($BOOTSTRAP --get-task-ip)

docs/install.md

Lines changed: 178 additions & 11 deletions
@@ -12,16 +12,16 @@ Spark is available in the Universe and can be installed by using either the GUI

**Prerequisites:**

-- [DC/OS and DC/OS CLI installed](https://docs.mesosphere.com/1.9/installing/).
-- Depending on your [security mode](https://docs.mesosphere.com/1.9/overview/security/security-modes/), Spark requires
-  service authentication for access to DC/OS. For more information, see [Configuring DC/OS Access for
-  Spark](https://docs.mesosphere.com/services/spark/spark-auth/).
-
-| Security mode | Service Account |
+- [DC/OS and DC/OS CLI installed](https://docs.mesosphere.com/1.10/installing/oss/).
+- Depending on your [security mode](https://docs.mesosphere.com/1.10/security/ent/#security-modes), Spark requires
+  service authentication for access to DC/OS. For more information:
+
+| Security mode | Service Account       |
|---------------|-----------------------|
-| Disabled | Not available |
-| Permissive | Optional |
-| Strict | Required |
+| Disabled      | Not available         |
+| Permissive    | Optional              |
+| Strict        | **Required**          |
+

# Default Installation
To install the DC/OS Apache Spark service, run the following command on the DC/OS CLI. This installs the Spark DC/OS
@@ -77,6 +77,7 @@ dcos package describe spark --config
```

## Customize Spark Distribution
+
DC/OS Apache Spark does not support arbitrary Spark distributions, but Mesosphere does provide multiple pre-built
distributions, primarily used to select Hadoop versions.

@@ -142,11 +143,177 @@ Install Spark with the options file specified:
dcos package install --options=multiple.json spark
```

-Alternatively, you can specify a Spark instance directly from the CLI. For example:
+To specify which instance of Spark to use, add `--name=<service_name>` to your CLI command. For example:

```bash
-dcos config set spark.app_id spark-dev
+$ dcos spark --name=spark-dev run ...
```

+# Installation for Strict mode (setting service authentication)
+
+If your cluster is set up for [strict](https://docs.mesosphere.com/1.10/security/ent/#strict) security, then you will need
+to follow these steps to install and run Spark.
+
+## Service Accounts and Secrets
+
+1. Install the `dcos-enterprise-cli` to get the CLI security commands (if you haven't already):
+
+   ```bash
+   $ dcos package install dcos-enterprise-cli
+   ```
+
+1. Create a key pair. The following command uses the Enterprise DC/OS CLI to create a 2048-bit RSA public-private key
+   pair and save each value into a separate file within the current directory.
+
+   ```bash
+   $ dcos security org service-accounts keypair <your-private-key>.pem <your-public-key>.pem
+   ```
+
+   For example:
+
+   ```bash
+   dcos security org service-accounts keypair private-key.pem public-key.pem
+   ```
+
+1. Create a new service account, `<service-account>` (e.g. `spark-principal`), containing the public key
+   `<your-public-key>.pem`.
+
+   ```bash
+   $ dcos security org service-accounts create -p <your-public-key>.pem -d "Spark service account" <service-account>
+   ```
+
+   For example:
+
+   ```bash
+   dcos security org service-accounts create -p public-key.pem -d "Spark service account" spark-principal
+   ```
+
+   In Mesos parlance, a `service-account` is called a `principal`, so we use the terms interchangeably here.
+
+   **Note:** You can verify your new service account using the following command.
+
+   ```bash
+   $ dcos security org service-accounts show <service-account>
+   ```
+
+1. Create a secret (e.g. `spark/<secret-name>`) containing your service account, `<service-account>`, and your private
+   key, `<your-private-key>.pem`.
+
+   ```bash
+   # permissive mode
+   $ dcos security secrets create-sa-secret <your-private-key>.pem <service-account> spark/<secret-name>
+   # strict mode
+   $ dcos security secrets create-sa-secret --strict <your-private-key>.pem <service-account> spark/<secret-name>
+   ```
+
+   For example, on a strict-mode DC/OS cluster:
+
+   ```bash
+   dcos security secrets create-sa-secret --strict private-key.pem spark-principal spark/spark-secret
+   ```
+
+   **Note:** You can verify that the secret was created with:
+
+   ```bash
+   $ dcos security secrets list /
+   ```
+
+## Assigning permissions
+
+Permissions must be created so that the Spark service can start Spark jobs, and so that the jobs themselves can launch
+the executors that perform the work on their behalf. There are a few points to keep in mind depending on your cluster:
+
+* RHEL/CentOS users cannot currently run Spark in strict mode as user `nobody`, but must run as user `root`. This is
+  due to how accounts are mapped to UIDs. CoreOS users are unaffected, and can run as user `nobody`. We designate the
+  user as `spark-user` below.
+
+* Spark runs by default under the Mesos default role, which is represented by the `*` symbol. You can deploy multiple
+  instances of Spark without modifying this default. If you want to override the default Spark role, you must modify
+  these code samples accordingly. We use `spark-service-role` to designate the role used below.
+
+Permissions can also be assigned through the UI.
+
+1. Run the following to create the required permissions for Spark:
+
+   ```bash
+   $ dcos security org users grant <service-account> dcos:mesos:master:task:user:<user> create --description "Allows the Linux user to execute tasks"
+   $ dcos security org users grant <service-account> dcos:mesos:master:framework:role:<spark-service-role> create --description "Allows a framework to register with the Mesos master using the Mesos default role"
+   $ dcos security org users grant <service-account> dcos:mesos:master:task:app_id:/<service_name> create --description "Allows reading of the task state"
+   ```
+
+   Note that above, `dcos:mesos:master:task:app_id:/<service_name>` will likely be `dcos:mesos:master:task:app_id:/spark`.
+
+   For example, continuing from above:
+
+   ```bash
+   dcos security org users grant spark-principal dcos:mesos:master:task:user:root create --description "Allows the Linux user to execute tasks"
+   dcos security org users grant spark-principal dcos:mesos:master:framework:role:* create --description "Allows a framework to register with the Mesos master using the Mesos default role"
+   dcos security org users grant spark-principal dcos:mesos:master:task:app_id:/spark create --description "Allows reading of the task state"
+   ```
+
+   Note that here we're using the service account `spark-principal` and the user `root`.
+
+1. If you are running the Spark service as `root` (as we are in this example), you will need to add an additional
+   permission for Marathon:
+
+   ```bash
+   dcos security org users grant dcos_marathon dcos:mesos:master:task:user:root create --description "Allow Marathon to launch containers as root"
+   ```
+
+## Install Spark with necessary configuration
+
+1. Before installing Spark, make a configuration file (e.g. `spark-strict-options.json`) with the following; these
+   settings can also be set through the UI:
+
+   ```json
+   {
+     "service": {
+       "service_account": "<service-account-id>",
+       "user": "<user>",
+       "service_account_secret": "spark/<secret_name>"
+     }
+   }
+   ```
+
+   A minimal example would be:
+
+   ```json
+   {
+     "service": {
+       "service_account": "spark-principal",
+       "user": "root",
+       "service_account_secret": "spark/spark-secret"
+     }
+   }
+   ```
+
+   Then install:
+
+   ```bash
+   $ dcos package install spark --options=spark-strict-options.json
+   ```
+
+## Add necessary configuration to your Spark jobs when submitting them
+
+* To run a job on a strict-mode cluster, you must add the `principal` to the command line. For example:
+
+  ```bash
+  $ dcos spark run --verbose --submit-args=" \
+  --conf spark.mesos.principal=<service-account> \
+  --conf spark.mesos.containerizer=mesos \
+  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
+  ```
+
+  If you want to use the [Docker Engine](/1.10/deploying-services/containerizers/docker-containerizer/) instead of the
+  [Universal Container Runtime](/1.10/deploying-services/containerizers/ucr/), you must specify the user through the
+  `SPARK_USER` environment variable:
+
+  ```bash
+  $ dcos spark run --verbose --submit-args="\
+  --conf spark.mesos.principal=<service-account> \
+  --conf spark.mesos.driverEnv.SPARK_USER=nobody \
+  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
+  ```

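As an aside, the strict-mode submit options compose with the `--name` flag shown earlier when you run more than one
Spark instance. The following is a hedged illustration only, reusing the `spark-dev` service name and `spark-principal`
account from the examples above; substitute your own names:

```bash
# Illustrative only: submit to a second Spark instance named "spark-dev",
# authenticating with the service account created above.
dcos spark --name=spark-dev run --verbose --submit-args=" \
  --conf spark.mesos.principal=spark-principal \
  --conf spark.mesos.containerizer=mesos \
  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
```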
[7]: #custom
[16]: https://github.com/mesosphere/dcos-vagrant

docs/job-scheduling.md

Lines changed: 86 additions & 0 deletions
@@ -47,6 +47,7 @@ mode above).
Quota for the Drivers allows the operator of the cluster to ensure that only a given number of Drivers are concurrently
running. As additional Drivers are submitted, they will be enqueued by the Spark Dispatcher. Below are the recommended
steps for setting Quota for the Drivers:
+
1. Set the Quota conservatively, keeping in mind that it will affect the number of jobs that can run concurrently.
1. Decide how much of your cluster's resources to allocate to running Drivers. These resources will only be used for
   the Spark Drivers, meaning that here we can decide roughly how many concurrent jobs we’d like to have running at a
@@ -142,6 +143,91 @@ job from consuming the entire Quota the max CPUs for that Spark job should be se
Quota’s resources. This ensures that the Spark job will get sufficient resources to make progress; setting the max CPUs
ensures it will not starve other Spark jobs of resources and gives predictable offer suppression semantics.

+## Permissions when using Quota with Strict mode
+
+Strict mode clusters (see [security modes](https://docs.mesosphere.com/1.10/security/ent/#security-modes)) require extra
+permissions to be set in order to use Quota. Follow the instructions in
+[installing](https://github.com/mesosphere/spark-build/blob/master/docs/install.md) and add the additional permissions
+for the roles you intend to use; a sketch of such grants is given below, followed by the Quota setup for the example
+above.
+
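The exact role-specific grants are not spelled out in this commit. Assuming they follow the same
`dcos security org users grant` pattern used in the install docs, a sketch for the `dispatcher` and `executor` roles of
this example might look like the following (illustrative only; verify the permission strings against your DC/OS
version):

```bash
# Illustrative sketch (not from the commit): allow the Spark principal to
# register frameworks under the custom roles used in this example.
dcos security org users grant spark-principal dcos:mesos:master:framework:role:dispatcher create --description "Register frameworks under the dispatcher role"
dcos security org users grant spark-principal dcos:mesos:master:framework:role:executor create --description "Register frameworks under the executor role"
```

With grants like these in place, the Quotas for the `dispatcher` and `executor` roles can be set as in the following
steps.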
+1. First set Quota for the Dispatcher's role (`dispatcher`):
+
+   ```bash
+   $ cat dispatcher-quota.json
+   {
+     "role": "dispatcher",
+     "guarantee": [
+       {
+         "name": "cpus",
+         "type": "SCALAR",
+         "scalar": { "value": 5.0 }
+       },
+       {
+         "name": "mem",
+         "type": "SCALAR",
+         "scalar": { "value": 5120.0 }
+       }
+     ]
+   }
+   ```
+
+   Then set the Quota *from your local machine*. This assumes you have downloaded the CA certificate, `dcos-ca.crt`, to
+   your local machine via the `https://<dcos_url>/ca/dcos-ca.crt` endpoint:
+
+   ```bash
+   curl -X POST --cacert dcos-ca.crt -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/quota -d @dispatcher-quota.json -H 'Content-Type: application/json'
+   ```
+
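   To double-check what was registered, the same endpoint should answer a plain GET with the current Quota status. This
   is a hedged example; it assumes the Mesos `/quota` endpoint is reachable through the same Admin Router path and
   token used above:

   ```bash
   # Illustrative: list the Quotas currently configured on the cluster.
   curl --cacert dcos-ca.crt -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/quota
   ```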
+1. Optionally set Quota for the executors as well; this works the same way as above:
+
+   ```bash
+   $ cat executor-quota.json
+   {
+     "role": "executor",
+     "guarantee": [
+       {
+         "name": "cpus",
+         "type": "SCALAR",
+         "scalar": { "value": 100.0 }
+       },
+       {
+         "name": "mem",
+         "type": "SCALAR",
+         "scalar": { "value": 409600.0 }
+       }
+     ]
+   }
+   ```
+
+   Then set the Quota from your local machine, again assuming you have `dcos-ca.crt` locally:
+
+   ```bash
+   curl -X POST --cacert dcos-ca.crt -H "Authorization: token=$(dcos config show core.dcos_acs_token)" $(dcos config show core.dcos_url)/mesos/quota -d @executor-quota.json -H 'Content-Type: application/json'
+   ```
+
+1. Install Spark with these minimal configurations:
+
+   ```json
+   {
+     "service": {
+       "service_account": "spark-principal",
+       "role": "dispatcher",
+       "user": "root",
+       "service_account_secret": "spark/spark-secret"
+     }
+   }
+   ```
+
+1. Now you're ready to run a Spark job using the principal and the roles you set:
+
+   ```bash
+   dcos spark run --verbose --submit-args=" \
+   --conf spark.mesos.principal=spark-principal \
+   --conf spark.mesos.role=executor \
+   --conf spark.mesos.containerizer=mesos \
+   --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
+   ```

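Tying this back to the guidance above about keeping any single job from consuming the entire Quota, you can also cap a
job's cores with `spark.cores.max`. The numbers below are illustrative; the cap just needs to stay below the executor
role's guarantee (100 CPUs in this example):

```bash
# Illustrative: same job as above, capped at 50 cores so it cannot exhaust the executor Quota.
dcos spark run --verbose --submit-args=" \
  --conf spark.mesos.principal=spark-principal \
  --conf spark.mesos.role=executor \
  --conf spark.mesos.containerizer=mesos \
  --conf spark.cores.max=50 \
  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
```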
# Setting `spark.cores.max`

To improve Spark job execution reliability, set the maximum number of cores consumed by any given job. This avoids

docs/troubleshooting.md

Lines changed: 45 additions & 5 deletions
@@ -9,19 +9,59 @@ menuWeight: 125

# Dispatcher

-The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may arise if the dispatcher does not receive the resources offers you expect from Mesos, or if driver submission is failing. To debug this class of issue, visit the Mesos UI at `http://<dcos-url>/mesos/` and navigate to the sandbox for the dispatcher.
+* The Mesos cluster dispatcher is responsible for queuing, tracking, and supervising drivers. Potential problems may
+  arise if the dispatcher does not receive the resource offers you expect from Mesos, or if driver submission is
+  failing. To debug this class of issue, visit the Mesos UI at `http://<dcos-url>/mesos/` and navigate to the sandbox
+  for the dispatcher.
+
+* Spark has an internal mechanism for detecting the IP of the host. We use this method by default, but sometimes it
+  fails, returning errors like these:
+
+  ```
+  ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[main,5,main]
+  java.net.UnknownHostException: ip-172-31-4-148: ip-172-31-4-148: Name or service not known
+      at java.net.InetAddress.getLocalHost(InetAddress.java:1505)
+      at org.apache.spark.util.Utils$.findLocalInetAddress(Utils.scala:891)
+      at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress$lzycompute(Utils.scala:884)
+      at org.apache.spark.util.Utils$.org$apache$spark$util$Utils$$localIpAddress(Utils.scala:884)
+      at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
+      at org.apache.spark.util.Utils$$anonfun$localHostName$1.apply(Utils.scala:941)
+      at scala.Option.getOrElse(Option.scala:121)
+      at org.apache.spark.util.Utils$.localHostName(Utils.scala:941)
+      at org.apache.spark.deploy.mesos.MesosClusterDispatcherArguments.<init>(MesosClusterDispatcherArguments.scala:27)
+      at org.apache.spark.deploy.mesos.MesosClusterDispatcher$.main(MesosClusterDispatcher.scala:103)
+      at org.apache.spark.deploy.mesos.MesosClusterDispatcher.main(MesosClusterDispatcher.scala)
+  Caused by: java.net.UnknownHostException: ip-172-31-4-148: Name or service not known
+      at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
+      at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
+      at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
+      at java.net.InetAddress.getLocalHost(InetAddress.java:1500)
+      ... 10 more
+  18/01/25 17:42:57 INFO ShutdownHookManager: Shutdown hook called
+  ```
+
+  In this case, enable the `service.use_bootstrap_for_IP_detect` option in the Dispatcher config (via the UI, by
+  editing the task, or by setting it to `true` in the options.json; a sample snippet is shown below) and restart the
+  service. This will cause the DC/OS-specific `bootstrap` utility to detect the IP, which may allow the initialization
+  of the Spark service to complete.

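For reference, a minimal options file that enables this could look like the following sketch. It assumes the new flag
lives under the `service` section, as the name `service.use_bootstrap_for_IP_detect` above suggests:

```json
{
  "service": {
    "use_bootstrap_for_IP_detect": true
  }
}
```

Such a file can be passed at install time with `dcos package install spark --options=<file>.json`.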
# Jobs

-* DC/OS Apache Spark jobs are submitted through the dispatcher, which displays Spark properties and job state. Start here to verify that the job is configured as you expect.
+* DC/OS Apache Spark jobs are submitted through the dispatcher, which displays Spark properties and job state. Start
+  here to verify that the job is configured as you expect.
+
-* The dispatcher further provides a link to the job's entry in the history server, which displays the Spark Job UI. This UI shows the for the job. Go here to debug issues with scheduling and performance.
+* The dispatcher further provides a link to the job's entry in the history server, which displays the Spark Job UI.
+  This UI shows the details for the job. Go here to debug issues with scheduling and performance.
+
-* Jobs themselves log output to their sandbox, which you can access through the Mesos UI. The Spark logs will be sent to `stderr`, while any output you write in your job will be sent to `stdout`.
+* Jobs themselves log output to their sandbox, which you can access through the Mesos UI. The Spark logs will be sent
+  to `stderr`, while any output you write in your job will be sent to `stdout`.
+
+* To disable using the Mesosphere `bootstrap` utility for host IP detection in jobs, add
+  `spark.mesos.driverEnv.SKIP_BOOTSTRAP_IP_DETECT=true` to your job configuration (see the example below).

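As a concrete illustration, the variable can be passed on the submit line like any other driver environment variable;
the jar and class here are just the SparkPi example used elsewhere in these docs:

```bash
# Illustrative: run a job with bootstrap-based IP detection disabled for the driver.
dcos spark run --submit-args=" \
  --conf spark.mesos.driverEnv.SKIP_BOOTSTRAP_IP_DETECT=true \
  --class org.apache.spark.examples.SparkPi http://downloads.mesosphere.com/spark/assets/spark-examples_2.11-2.0.1.jar 100"
```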
# CLI

-The Spark CLI is integrated with the dispatcher so that they always use the same version of Spark, and so that certain defaults are honored. To debug issues with their communication, run your jobs with the `--verbose` flag.
+The Spark CLI is integrated with the dispatcher so that they always use the same version of Spark, and so that certain
+defaults are honored. To debug issues with their communication, run your jobs with the `--verbose` flag.

# HDFS Kerberos

universe/config.json

Lines changed: 5 additions & 0 deletions
@@ -60,6 +60,11 @@
      "type": "boolean",
      "description": "Launch the Dispatcher using the Universal Container Runtime (UCR)",
      "default": false
+    },
+    "use_bootstrap_for_IP_detect": {
+      "type": "boolean",
+      "description": "Use the bootstrap utility for detecting host IP as opposed to using Spark's internal mechanism, see troubleshooting.md.",
+      "default": false
    }
  }
},
