
Conversation

@JoshRosen
Contributor

This pull request addresses a few issues related to PySpark's IPython support:

  • Fix the remaining uses of the '-u' flag, which IPython doesn't support (see SPARK-3772).
  • Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old name is reserved in case we ever want to allow the worker Python options to be customized (this variable was introduced in #2554 (SPARK-3706) and hasn't landed in a release yet, so this doesn't break any compatibility).
  • Introduce a PYSPARK_DRIVER_PYTHON option that allows the driver to use ipython while the workers use a different Python version.
  • Attempt to use Python 2.7 by default if PYSPARK_PYTHON is not specified.
  • Retain the old semantics for IPYTHON=1 and IPYTHON_OPTS (to avoid breaking existing example programs).

There are more details in a block comment in bin/pyspark.
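
For example (an illustrative usage sketch, not taken verbatim from the PR), a user could run the driver under IPython while the workers keep the default Python, and pass driver-only options such as launching the IPython notebook:

PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark
PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark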

- Fix the remaining uses of the '-u' flag, which IPython doesn't support.
- Change PYSPARK_PYTHON_OPTS to PYSPARK_DRIVER_PYTHON_OPTS, so that the old
  name is reserved in case we ever want to allow the worker Python options
  to be customized.
@JoshRosen
Contributor Author

/cc @davies @cocoatomo @robbles for reviews / feedback.

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have started for PR 2651 at commit c4f5778.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Oct 4, 2014

QA tests have finished for PR 2651 at commit c4f5778.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21284/

@davies
Contributor

davies commented Oct 5, 2014

Before the 1.2 release, maybe it's time to rethink how we run the PySpark shell or scripts; using bin/pyspark or spark-submit is not so friendly for users, so maybe we could simplify it.

Most of what bin/pyspark does is set up SPARK_HOME and PYTHONPATH, so I did this in .bashrc:

export SPARK_HOME=xxxx
export PYTHONPATH=${PYTHONPATH}:${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.8.2.1-src.zip

Then I can run any Python script that uses PySpark (most of them are for testing), and I can easily choose which Python to run it with, such as ipython:

ipython python/pyspark/tests.py

If the Python version I use for the driver is not binary compatible with the default one on the workers, then I need to set PYSPARK_PYTHON:

PYSPARK_PYTHON=pypy pypy rdd.py

I think we could figure out the correct Python version to use for the workers automatically; PYSPARK_PYTHON is not needed in most cases. For example, if PYSPARK_PYTHON is not set, we could default it to the path of the Python used by the driver. We could also handle some special cases: for ipython, we could use python2.7 on the workers, since that IPython runs on Python 2.7.

Also, we could set up SPARK_HOME and PYTHONPATH for the user when Spark is installed.

bin/pyspark could become something like sparkshell.py; then the user could easily choose whatever Python version to run it with.

bin/spark-submit is still useful for submitting jobs to a cluster or adding files. At the same time, maybe we could introduce some default arguments for general PySpark scripts, such as:

$ ipython wc.py -h
Usage: wc.py [options] [args]

Options:
  -h, --help            show this help message and exit
  -q, --quiet
  -v, --verbose

  PySpark Options:
    -m MASTER, --master=MASTER
    -p PARALLEL, --parallel=PARALLEL
                          number of processes
    -c CPUS, --cpus=CPUS  cpus used per task
    -M MEM, --mem=MEM     memory used per task
    --conf=CONF           path for configuration file
    --profile             do profiling
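
Under that proposal, a user-written script could be launched directly with any interpreter and configured through those shared flags, e.g. (a hypothetical invocation; the master URL and input path are made-up values):

ipython wc.py --master=local[4] --parallel=8 input.txt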

@JoshRosen
Contributor Author

@davies Good points; we should definitely discuss this before 1.2. I guess that the script also loads spark-env.sh, but this is less of an issue since we moved most of the configuration to SparkConf.

Since we still need to support spark-submit and maintain backwards-compatibility with existing uses of the pyspark script, what do you think about this PR's approach? At a minimum, we need to preserve the old behavior of the IPYTHON= setting. Does PYSPARK_DRIVER_PYTHON_OPTS seem like a reasonable thing to add to this script?

I'd still like to discuss the rest of your proposal, but I'd like to try to get the fixes here merged first because the current master instructions are broken and we need to re-introduce backwards-compatibility.

@davies
Contributor

davies commented Oct 7, 2014

The current patch looks good to me.

One question: should we figure out the version of IPython and then use the corresponding version of Python on the workers? In the most common case, users have IPython (on Python 2.7) but no Python 2.7 on the workers, so we need to tell them how to make it work.

@JoshRosen
Contributor Author

We don't support IPython 1.0 anymore, so it seems reasonable to make Python 2.7 the default when using IPython (since IPython 2.0 requires at least Python 2.7). It seems like there's a growing list of use-cases that we'd like to support:

  1. The same custom Python version (non-IPython) used everywhere (e.g. I want to use PyPy).
  2. Stock IPython (ipython) with a different Python version on the workers.
  3. Custom IPython (#2167, SPARK-3265) with a different Python version on the workers.
  4. IPython (custom or stock) everywhere, including the workers.

In Spark 1.1, we support 1 and 2 (via IPYTHON=1), but not 3 or 4. In #2167, we tried to add support for 3 (Custom IPython) but ended up actually providing 4 (which was broken until #2554 and this PR).

For 3, we need a way to specify the driver's Python executable independently from the worker's executable. Currently, PYSPARK_PYTHON affects all Python processes. What if we added a PYSPARK_DRIVER_PYTHON option that only affected the driver Python (and which defaults to PYSPARK_PYTHON if unset)? I think this would provide enough mechanism to support all four use-cases.
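
Concretely, the four cases above might then look roughly like this (an illustrative sketch; pypy, python2.7, and the custom ipython path are just example values):

PYSPARK_PYTHON=pypy ./bin/pyspark                                                      # case 1
PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark                                            # case 2
PYSPARK_DRIVER_PYTHON=/path/to/custom/ipython PYSPARK_PYTHON=python2.7 ./bin/pyspark   # case 3
PYSPARK_PYTHON=ipython ./bin/pyspark                                                   # case 4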

As far as defaults are concerned, maybe we could try which python2.7 to check whether Python 2.7 is installed, and fall back to python if it's not available (this would only be used if PYSPARK_PYTHON wasn't set). I've noticed that certain programs run way faster under 2.7 (programs that use json, for example), so we should probably try to make python2.7 the default if it's installed.

Does this sound like a reasonable approach?

@JoshRosen changed the title from "[SPARK-3772] Allow ipython to be used by Pyspark workers; IPython fixes:" to "[SPARK-3772] Allow ipython to be used by Pyspark workers; IPython support improvements:" on Oct 8, 2014
- Introduce PYSPARK_DRIVER_PYTHON
- Attempt to use python2.7 as the default Python version.
- Refuse to launch IPython without Python 2.7 if PYSPARK_PYTHON isn't set.
@JoshRosen
Contributor Author

Updated based on offline discussion. The key changes:

  • Introduce PYSPARK_DRIVER_PYTHON to allow the driver's Python executable to be different from the workers'.
  • Attempt to use python2.7 as the default Python version and fall back to python if it's not available.
  • Refuse to launch IPython without python2.7, unless PYSPARK_PYTHON is set.
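
Roughly, that selection behavior could be sketched as follows (a simplified illustration of the behavior described above, not the actual contents of bin/pyspark; DEFAULT_PYTHON is just a placeholder name):

# Prefer python2.7 when it is on the PATH, otherwise fall back to python.
if command -v python2.7 > /dev/null; then
  DEFAULT_PYTHON="python2.7"
else
  DEFAULT_PYTHON="python"
fi
# Refuse to launch IPython when Python 2.7 is unavailable and PYSPARK_PYTHON is unset.
if [[ "$PYSPARK_DRIVER_PYTHON" == *ipython* && "$DEFAULT_PYTHON" != "python2.7" && -z "$PYSPARK_PYTHON" ]]; then
  echo "IPython requires Python 2.7+; please install it or set PYSPARK_PYTHON" 1>&2
  exit 1
fi
PYSPARK_PYTHON="${PYSPARK_PYTHON:-$DEFAULT_PYTHON}"
PYSPARK_DRIVER_PYTHON="${PYSPARK_DRIVER_PYTHON:-$PYSPARK_PYTHON}"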

@SparkQA

SparkQA commented Oct 8, 2014

QA tests have started for PR 2651 at commit 7b8eb86.

  • This patch merges cleanly.

@davies
Contributor

davies commented Oct 8, 2014

LGTM, thanks!

@SparkQA

SparkQA commented Oct 9, 2014

QA tests have finished for PR 2651 at commit 7b8eb86.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/21490/

@JoshRosen
Contributor Author

Thanks for the review! I'm going to merge this.

Fun discovery: I tried running some simple examples using IPython (Python 2.7) on the driver with PyPy on the workers and it seemed to work (probably because we didn't use marshal):

PYSPARK_PYTHON=pypy PYSPARK_DRIVER_PYTHON=ipython ./bin/pyspark

@asfgit closed this in 4e9b551 on Oct 9, 2014
@JoshRosen deleted the SPARK-3772 branch on October 13, 2014 04:46