docs/python-programming-guide.md
@@ -63,6 +63,11 @@ All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.
Standalone PySpark applications should be run using the `bin/pyspark` script, which automatically configures the Java and Python environment using the settings in `conf/spark-env.sh` (or `conf/spark-env.cmd` on Windows).
The script automatically adds the `pyspark` package to the `PYTHONPATH`.
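For example, a standalone application can be as small as the sketch below. The file name `simple_app.py`, the `local` master URL, and the job itself are illustrative placeholders, not part of this guide:

```python
# simple_app.py -- an illustrative standalone PySpark application.
# Run it with:  ./bin/pyspark simple_app.py
from pyspark import SparkContext

# "local" runs Spark in-process; any supported master URL works here.
sc = SparkContext("local", "Simple App")

# Count the even numbers in 0..999 from a parallelized collection.
evens = sc.parallelize(range(1000)).filter(lambda x: x % 2 == 0).count()
print("Even numbers: %d" % evens)

sc.stop()
```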
# Running PySpark on YARN
Running PySpark on a YARN-managed cluster requires a few extra steps. The client must reference a ZIP file containing PySpark and its dependencies. To create this file, run `make` inside the `python/` directory of the Spark source; this generates `pyspark-assembly.zip` under `python/build/`. Then set the `PYSPARK_ZIP` environment variable to point to this file. Lastly, set `MASTER=yarn-client`.
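Put together, the steps above might look like the following shell session. Paths are relative to the Spark source root, and `my_script.py` stands in for your own application file:

```bash
# Build the PySpark assembly; this writes python/build/pyspark-assembly.zip
cd python
make
cd ..

# Point the client at the assembly and at YARN
export PYSPARK_ZIP=python/build/pyspark-assembly.zip
export MASTER=yarn-client

# Launch the application as usual
./bin/pyspark my_script.py
```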
`pyspark-assembly.zip` can be placed either on local disk or on HDFS. If it is placed in a public location on HDFS, YARN can cache it on each node so that it doesn't need to be transferred each time an application is run.
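A sketch of the HDFS case follows, assuming (as the paragraph above implies) that `PYSPARK_ZIP` can point at an HDFS location; the `/user/public` directory is an illustrative world-readable path:

```bash
# Copy the assembly to a public HDFS location and make it world-readable,
# so YARN can cache it on each node.
hadoop fs -put python/build/pyspark-assembly.zip /user/public/pyspark-assembly.zip
hadoop fs -chmod a+r /user/public/pyspark-assembly.zip

# Reference the cached HDFS copy instead of a local file.
export PYSPARK_ZIP=hdfs:///user/public/pyspark-assembly.zip
```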