[SPARK-1395] Fix "local:" URI support in Yarn mode (again).
Recent changes ignored the fact that paths may be defined with "local:"
URIs, which means they need to be explicitly added to the classpath
everywhere a remote process is started. This change fixes that by:
- Using the correct methods to add paths to the classpath
- Creating SparkConf settings for the Spark jar itself and for the
user's jar
- Propagating those two settings to the remote processes where needed
This ensures that, in both client and cluster mode, the driver has
the necessary information to build the executor's classpath, and that things
still work when the paths contain "local:" references.
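
As an illustration only, here is roughly how the Spark jar setting might be supplied with a "local:" URI once this change is in place. The jar path and application details are placeholders; `spark.yarn.jar` is the setting documented further down, and the property can equally be placed in conf/spark-defaults.conf instead of being passed with --conf:

    # Reference a Spark assembly that is already installed on every node; with a
    # "local:" URI it is added to the classpath in place rather than uploaded.
    ./bin/spark-submit \
      --master yarn-cluster \
      --conf spark.yarn.jar=local:/opt/spark/lib/spark-assembly.jar \
      --class com.example.MyApp \
      /path/to/my-app.jar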
The change also fixes some confusion in ClientBase about whether
to use SparkConf or system properties to propagate config options to
the driver and executors, by standardizing on using data held by
SparkConf.
On the cleanup front, I removed the hacky way that log4j configuration
was being propagated to handle the "local:" case. It's much more cleanly
(and generically) handled by using spark-submit arguments (--files to
upload a config file, or setting spark.executor.extraJavaOptions to pass
JVM arguments and use a local file).
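
For illustration, a sketch of those two alternatives (all file paths are placeholders; the properties shown can also be set programmatically or on the spark-submit command line):

    # Option 1: upload a custom log4j.properties with the application
    ./bin/spark-submit --master yarn-cluster --files /path/to/log4j.properties ...

    # Option 2: in conf/spark-defaults.conf, point the driver/executor JVMs at a
    # config file that already exists locally on every node (note the "file:" scheme)
    spark.driver.extraJavaOptions   -Dlog4j.configuration=file:/etc/spark/log4j.properties
    spark.executor.extraJavaOptions -Dlog4j.configuration=file:/etc/spark/log4j.properties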
Author: Marcelo Vanzin <[email protected]>
Closes #560 from vanzin/yarn-local-2 and squashes the following commits:
4e7f066 [Marcelo Vanzin] Correctly propagate SPARK_JAVA_OPTS to driver/executor.
6a454ea [Marcelo Vanzin] Use constants for PWD in test.
6dd5943 [Marcelo Vanzin] Fix propagation of config options to driver / executor.
b2e377f [Marcelo Vanzin] Review feedback.
93c3f85 [Marcelo Vanzin] Fix ClassCastException in test.
e5c682d [Marcelo Vanzin] Fix cluster mode, restore SPARK_LOG4J_CONF.
1dfbb40 [Marcelo Vanzin] Add documentation for spark.yarn.jar.
bbdce05 [Marcelo Vanzin] [SPARK-1395] Fix "local:" URI support in Yarn mode (again).
docs/running-on-yarn.md (+25 -3)
@@ -95,10 +95,19 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
     The amount of off heap memory (in megabytes) to be allocated per driver. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc.
   </td>
 </tr>
+<tr>
+  <td><code>spark.yarn.jar</code></td>
+  <td>(none)</td>
+  <td>
+    The location of the Spark jar file, in case overriding the default location is desired.
+    By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be
+    in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't
+    need to be distributed each time an application runs. To point to a jar on HDFS, for example,
+    set this configuration to "hdfs:///some/path".
+  </td>
+</tr>
 </table>

-By default, Spark on YARN will use a Spark jar installed locally, but the Spark JAR can also be in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an application runs. To point to a JAR on HDFS, `export SPARK_JAR=hdfs:///some/path`.
-
 # Launching Spark on YARN

 Ensure that `HADOOP_CONF_DIR` or `YARN_CONF_DIR` points to the directory which contains the (client side) configuration files for the Hadoop cluster.
@@ -156,7 +165,20 @@ all environment variables used for launching each container. This process is useful for debugging
 classpath problems in particular. (Note that enabling this requires admin privileges on cluster
 settings and a restart of all node managers. Thus, this is not applicable to hosted clusters).

-# Important Notes
+To use a custom log4j configuration for the application master or executors, there are two options:
+
+- upload a custom log4j.properties using spark-submit, by adding it to the "--files" list of files
+  to be uploaded with the application.
+- add "-Dlog4j.configuration=<location of configuration file>" to "spark.driver.extraJavaOptions"
+  (for the driver) or "spark.executor.extraJavaOptions" (for executors). Note that if using a file,
+  the "file:" protocol should be explicitly provided, and the file needs to exist locally on all
+  the nodes.
+
+Note that for the first option, both executors and the application master will share the same
+log4j configuration, which may cause issues when they run on the same node (e.g. trying to write
+to the same log file).
+
+# Important notes

 - Before Hadoop 2.2, YARN does not support cores in container resource requests. Thus, when running against an earlier version, the numbers of cores given via command line arguments cannot be passed to YARN. Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
 - The local directories used by Spark executors will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored.
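
As a rough end-to-end sketch of the `spark.yarn.jar` option documented above (the HDFS path and local jar location are placeholders, not part of this change):

    # Publish the Spark assembly to a world-readable HDFS location once, so YARN
    # can cache it on the nodes instead of distributing it for every application.
    hadoop fs -mkdir -p hdfs:///some/path
    hadoop fs -put /opt/spark/lib/spark-assembly.jar hdfs:///some/path/spark-assembly.jar

    # conf/spark-defaults.conf
    spark.yarn.jar    hdfs:///some/path/spark-assembly.jar

Caching the assembly this way only changes where the Spark jar comes from; application submission is otherwise unchanged.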