[SPARK-3772] Allow ipython to be used by Pyspark workers; IPython support improvements #2651
Conversation
- Fix the remaining uses of the '-u' flag, which IPython doesn't support.
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized.
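To illustrate the renamed variable, a hedged sketch (the use of PYSPARK_DRIVER_PYTHON and the "notebook" option below are assumptions for illustration, not taken verbatim from this patch):

```bash
# Hedged sketch: pass options to the driver's Python process only, via the
# renamed variable. PYSPARK_DRIVER_PYTHON and the "notebook" option are
# illustrative assumptions.
PYSPARK_DRIVER_PYTHON=ipython \
PYSPARK_DRIVER_PYTHON_OPTS="notebook" \
./bin/pyspark
```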
/cc @davies @cocoatomo @robbles for reviews / feedback.
QA tests have started for PR 2651 at commit
QA tests have finished for PR 2651 at commit
Test PASSed.
Before the 1.2 release, maybe it's time to rethink how to run the pyspark shell or scripts; using bin/pyspark or spark-submit is not so friendly for users, and maybe we could simplify it. Most of what pyspark does is set up SPARK_HOME and PYTHONPATH, so I did this in my .bashrc (a sketch of this setup appears after the usage example below). Then I can run any Python script that uses pyspark (most of them for testing), and I can easily choose which version of Python to use, such as ipython.

If the version of Python used for the driver is not binary-compatible with the default one, then I need to set PYSPARK_PYTHON. I think we could find the correct version to use for the worker automatically; PYSPARK_PYTHON is not needed in most cases. For example, if PYSPARK_PYTHON is not set, we could default it to the path of the Python used by the driver. We could also handle special cases: for ipython, we could use the python2.7 that IPython itself runs on.

Also, we could set up SPARK_HOME and PYTHONPATH for the user when Spark is installed. bin/pyspark could become sparkshell.py, so the user could easily choose whatever version of Python to run it with. bin/spark-submit is still useful for submitting jobs to a cluster or adding files.

At the same time, maybe we could introduce some default arguments for general pyspark scripts, such as:

$ ipython wc.py -h
Usage: wc.py [options] [args]

Options:
  -h, --help            show this help message and exit
  -q, --quiet
  -v, --verbose

PySpark Options:
  -m MASTER, --master=MASTER
  -p PARALLEL, --parallel=PARALLEL
                        number of processes
  -c CPUS, --cpus=CPUS  cpus used per task
  -M MEM, --mem=MEM     memory used per task
  --conf=CONF           path for configuration file
  --profile             do profiling
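A sketch of the .bashrc setup described above; the Spark install path and the py4j zip name are assumptions, not taken from the comment:

```bash
# Hedged sketch of the .bashrc setup davies describes: export SPARK_HOME
# and put the PySpark sources (plus the bundled py4j zip) on PYTHONPATH.
# The install path and py4j version are assumptions.
export SPARK_HOME="$HOME/spark"
export PYTHONPATH="$SPARK_HOME/python:$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH"
```

With those exports in place, any interpreter on the PATH can import pyspark directly (e.g. `python wc.py` or `ipython wc.py`), without going through bin/pyspark.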
@davies Good points; we should definitely discuss this before 1.2. I guess that the script also loads […]. Since we still need to support […], I'd still like to discuss the rest of your proposal, but I'd like to try to get the fixes here merged first, because the current master instructions are broken and we need to re-introduce backwards compatibility.
The current patch looks good to me. One question: should we figure out the version of IPython, then use the correct version of Python in the worker? In the most common case, users have IPython (on Python 2.7) but no Python 2.7 on the workers, and we need to tell users how to make that work.
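One way the suggestion could look, as a hedged shell sketch (not from the patch): ask the driver-side IPython which Python it runs on, and derive the worker executable from that.

```bash
# Hedged sketch: query the Python version underneath IPython (ipython
# accepts -c to run a statement, like python does), then pick a matching
# worker interpreter unless PYSPARK_PYTHON was set explicitly.
driver_py_version=$(ipython -c 'import sys; print("%d.%d" % sys.version_info[:2])')
PYSPARK_PYTHON="${PYSPARK_PYTHON:-python$driver_py_version}"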
We don't support IPython 1.0 anymore, so it seems reasonable to make Python 2.7 the default when using IPython (since IPython 2.0 requires at least Python 2.7). It seems like there's a growing list of use-cases that we'd like to support: […]

In Spark 1.1, we support 1 and 2 (via […]). For 3, we need a way to specify the driver's Python executable independently from the worker's executable. Currently, […]. As far as defaults are concerned, maybe we could try […]. Does this sound like a reasonable approach?
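For use-case 3 (independent driver and worker executables), the shape would be something like the following hedged sketch, using the two variables discussed in this thread:

```bash
# IPython drives the shell; workers run a separately chosen interpreter.
# The python2.7 choice here is illustrative.
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_PYTHON=python2.7 ./bin/pyspark
```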
Changed the title: [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes → [SPARK-3772] Allow ipython to be used by Pyspark workers; IPython support improvements
Updated based on offline discussion. The key changes:

- Introduce PYSPARK_DRIVER_PYTHON.
- Attempt to use python2.7 as the default Python version.
- Refuse to launch IPython without Python 2.7 if PYSPARK_PYTHON isn't set.
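A hedged sketch of the last bullet; the real check lives in bin/pyspark and may differ in detail:

```bash
# If the driver is IPython and no worker Python was chosen, default the
# workers to python2.7; refuse to launch if python2.7 isn't available.
if [[ "$PYSPARK_DRIVER_PYTHON" == *ipython* && -z "$PYSPARK_PYTHON" ]]; then
  if command -v python2.7 >/dev/null 2>&1; then
    PYSPARK_PYTHON=python2.7
  else
    echo "IPython requires Python 2.7+; please set PYSPARK_PYTHON" 1>&2
    exit 1
  fi
fi
```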
QA tests have started for PR 2651 at commit
LGTM, thanks!
QA tests have finished for PR 2651 at commit
Test PASSed.
Thanks for the review! I'm going to merge this. Fun discovery: I tried running some simple examples using IPython (Python 2.7) on the driver with PyPy on the workers, and it seemed to work (probably because we didn't use […]).
This pull request addresses a few issues related to PySpark's IPython support: it fixes the remaining uses of the '-u' flag, renames PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, and introduces PYSPARK_DRIVER_PYTHON, which allows the driver to use ipython while the workers use a different Python version. There are more details in a block comment in bin/pyspark.