Conversation

@HyukjinKwon
Member

What changes were proposed in this pull request?

This PR proposes to remove the script's parent directory from `sys.path` during `shell.py` execution when IPython is used.

This is a general issue for the PySpark shell specifically with IPython: IPython temporarily adds the parent directory of the script to the Python path (`sys.path`), which results in searching for packages under the `pyspark` directory. For example, `import pandas` attempts to import `pyspark.pandas`.
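The shadowing can be reproduced in isolation (a self-contained sketch, not PySpark code; the stub directory stands in for `pyspark/pandas`): any package directory named `pandas` that sits directly under an entry on `sys.path` wins over the installed library.

```python
import os
import sys
import tempfile

# Simulate what IPython does: prepend the parent directory of the launched
# script to sys.path. Here a temp directory stands in for .../python/pyspark.
root = tempfile.mkdtemp()
stub = os.path.join(root, "pandas")  # plays the role of pyspark/pandas
os.makedirs(stub)
with open(os.path.join(stub, "__init__.py"), "w") as f:
    f.write("SHADOWED = True\n")

sys.path.insert(0, root)

import pandas  # resolves to the stub package, not the real pandas library

print(getattr(pandas, "SHADOWED", False))  # → True
```

Because `sys.path` is searched in order, the prepended entry takes precedence over site-packages, which is exactly what bites the shell here.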

So far, PySpark's own import paths have not hit such a conflict, but Spark Connect now does via its dependency check (which attempts to `import pandas`), which exposes the actual problem.
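Conceptually, the change amounts to something like the following (a hypothetical sketch, not the exact upstream patch; the path is illustrative): drop the prepended entry again before any third-party import runs.

```python
import os
import sys

# What IPython effectively does: prepend the directory containing shell.py.
script_dir = "/opt/spark/python/pyspark"  # illustrative path, not real
sys.path.insert(0, script_dir)

# The fix: remove that entry before any third-party imports run, so that
# `import pandas` no longer finds the pyspark/pandas subpackage.
sys.path = [p for p in sys.path if os.path.abspath(p) != script_dir]

print(script_dir in sys.path)  # → False
```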

Running it with IPython can easily reproduce the error:

PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"

Why are the changes needed?

To make the PySpark shell properly import other packages even when their names conflict with subpackages (e.g., `pyspark.pandas` vs. `pandas`).

Does this PR introduce any user-facing change?

No for end users:

  • This path is only inserted for `shell.py` execution, and we have not had such a relative-import case so far.
  • It fixes the issue in Spark Connect, which is unreleased.

How was this patch tested?

Manually tested.

PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"

Before:

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
  warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
  File "/.../spark/python/pyspark/shell.py", line 40, in <module>
    spark = SparkSession.builder.getOrCreate()
  File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
    from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
  File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
    check_dependencies(__name__, __file__)
  File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
    import pandas
  File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
    from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
  File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
    require_minimum_pandas_version()
  File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...

After:

Python 3.9.16 | packaged by conda-forge | (main, Feb  1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.0.dev0
      /_/

Using Python version 3.9.16 (main, Feb  1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.

In [1]:

@HyukjinKwon
Member Author

cc @zhengruifeng @ueshin @grundprinzip FYI

@HyukjinKwon HyukjinKwon force-pushed the SPARK-42266 branch 2 times, most recently from 561d51e to 59fe05f Compare March 8, 2023 05:13
@HyukjinKwon
Member Author

Merged to master and branch-3.4.

HyukjinKwon added a commit that referenced this pull request Mar 8, 2023
…on when IPython is used

Closes #40327 from HyukjinKwon/SPARK-42266.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 8e83ab7)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request Jun 20, 2023
…on when IPython is used


Closes apache#40327 from HyukjinKwon/SPARK-42266.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 8e83ab7)
Signed-off-by: Hyukjin Kwon <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-42266 branch January 15, 2024 00:52