Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to remove the script's parent directory from `sys.path` during `shell.py` execution when IPython is used.

This is a general issue for the PySpark shell specifically with IPython: IPython temporarily adds the parent directory of the script to the Python path (`sys.path`), which results in searching for packages under the `pyspark` directory. For example, `import pandas` attempts to import `pyspark.pandas`.
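The shadowing can be reproduced in isolation (a self-contained sketch, not PySpark code; the stub directory stands in for `pyspark/pandas`): any package directory named `pandas` that sits directly under an entry on `sys.path` wins over the installed library.

```python
import os
import sys
import tempfile

# Simulate what IPython does: prepend the parent directory of the launched
# script to sys.path. Here a temp directory stands in for .../python/pyspark.
root = tempfile.mkdtemp()
stub = os.path.join(root, "pandas")  # plays the role of pyspark/pandas
os.makedirs(stub)
with open(os.path.join(stub, "__init__.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, root)

import pandas  # resolves to the stub package, not the real pandas library

print(getattr(pandas, "SHADOWED", False))  # → True
```

Because `sys.path` is searched in order, the prepended entry takes precedence over site-packages, which is exactly what bites the shell here.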

So far, PySpark's own import paths have not hit such a conflict, but Spark Connect now does via its dependency check (which attempts to `import pandas`), which exposes the actual problem.
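Conceptually, the change amounts to something like the following (a hypothetical sketch, not the exact upstream patch; the path is illustrative): drop the prepended entry again before any third-party import runs.

```python
import os
import sys

# What IPython effectively does: prepend the directory containing shell.py.
script_dir = "/opt/spark/python/pyspark"  # illustrative path, not real
sys.path.insert(0, script_dir)

# The fix: remove that entry before any third-party imports run, so that
# `import pandas` no longer finds the pyspark/pandas subpackage.
sys.path = [p for p in sys.path if os.path.abspath(p) != script_dir]

print(script_dir in sys.path)  # → False
```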

Running it with IPython can easily reproduce the error:

PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"

Why are the changes needed?

To make the PySpark shell properly import other packages even when their names conflict with subpackages (e.g., `pyspark.pandas` vs. `pandas`).

Does this PR introduce any user-facing change?

No for end users:

  • This path is only inserted for `shell.py` execution, and we have not had such a relative-import case so far.
  • It fixes the issue in Spark Connect, which is unreleased.

How was this patch tested?

Manually tested.

PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"

Before:

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 40, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__, __file__)
  File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
  File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
    from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
  File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...

After:

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Feb  1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]:

@HyukjinKwon
Member Author

cc @zhengruifeng @ueshin @grundprinzip FYI

@HyukjinKwon HyukjinKwon force-pushed the SPARK-42266 branch 2 times, most recently from 561d51e to 59fe05f Compare March 8, 2023 05:13
@HyukjinKwon
Member Author

Merged to master and branch-3.4.

HyukjinKwon added a commit that referenced this pull request Mar 8, 2023
…on when IPython is used

Closes #40327 from HyukjinKwon/SPARK-42266.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 8e83ab7)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…on when IPython is used


Closes apache#40327 from HyukjinKwon/SPARK-42266.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 8e83ab7)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-42266 branch January 15, 2024 00:52