[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used #40327
Closed
Conversation
Member
Author
cc @zhengruifeng @ueshin @grundprinzip FYI
zhengruifeng approved these changes on Mar 8, 2023
561d51e to 59fe05f
Member
Author
Merged to master and branch-3.4.
HyukjinKwon added a commit that referenced this pull request on Mar 8, 2023
[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used
### What changes were proposed in this pull request?
This PR proposes to remove the parent directory in `shell.py` execution when IPython is used.
This is a general issue for the PySpark shell specifically with IPython: IPython temporarily adds the parent directory of the script to the Python path (`sys.path`), which results in packages being searched for under the `pyspark` directory. For example, `import pandas` attempts to import `pyspark.pandas`.
So far, we haven't had such a case in PySpark's own import code path, but Spark Connect now hits it via its dependency check (which attempts to import pandas), which exposes the actual problem.
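The change itself boils down to dropping that directory from `sys.path` before anything else is imported. A minimal sketch of the idea (illustrative only; the exact condition and placement in `shell.py` are assumptions, not the actual patch):
```python
# Illustrative sketch, not the exact change in shell.py: drop the pyspark
# package directory that IPython prepends to sys.path, so that top-level
# imports such as "import pandas" no longer resolve to pyspark.pandas.
import os
import sys

pyspark_dir = os.path.dirname(os.path.abspath(__file__))  # .../python/pyspark
if pyspark_dir in sys.path:
    sys.path.remove(pyspark_dir)
```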
Running it with IPython can easily reproduce the error:
```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
### Why are the changes needed?
To make the PySpark shell properly import other packages even when their names conflict with `pyspark` subpackages (e.g., `pyspark.pandas` vs. `pandas`).
### Does this PR introduce _any_ user-facing change?
No to end users:
- This path is only inserted for `shell.py` execution, and thankfully we haven't had such a relative import case within PySpark so far.
- It fixes the issue in the unreleased Spark Connect.
### How was this patch tested?
Manually tested.
```bash
PYSPARK_PYTHON=ipython bin/pyspark --remote "local[*]"
```
**Before:**
```
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
/.../spark/python/pyspark/shell.py:45: UserWarning: Failed to initialize Spark session.
warnings.warn("Failed to initialize Spark session.")
Traceback (most recent call last):
File "/.../spark/python/pyspark/shell.py", line 40, in <module>
spark = SparkSession.builder.getOrCreate()
File "/.../spark/python/pyspark/sql/session.py", line 437, in getOrCreate
from pyspark.sql.connect.session import SparkSession as RemoteSparkSession
File "/.../spark/python/pyspark/sql/connect/session.py", line 19, in <module>
check_dependencies(__name__, __file__)
File "/.../spark/python/pyspark/sql/connect/utils.py", line 33, in check_dependencies
require_minimum_pandas_version()
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 27, in require_minimum_pandas_version
import pandas
File "/.../spark/python/pyspark/pandas/__init__.py", line 29, in <module>
from pyspark.pandas.missing.general_functions import MissingPandasLikeGeneralFunctions
File "/.../spark/python/pyspark/pandas/__init__.py", line 34, in <module>
require_minimum_pandas_version()
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 37, in require_minimum_pandas_version
if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):
AttributeError: partially initialized module 'pandas' has no attribute '__version__' (most likely due to a circular import)
...
```
**After:**
```
Python 3.9.16 | packaged by conda-forge | (main, Feb 1 2023, 21:42:20)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.10.0 -- An enhanced Interactive Python. Type '?' for help.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/03/08 13:30:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 3.5.0.dev0
/_/
Using Python version 3.9.16 (main, Feb 1 2023 21:42:20)
Client connected to the Spark Connect server at localhost
SparkSession available as 'spark'.
In [1]:
```
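As a quick sanity check (illustrative only), one can confirm in the shell which `pandas` was actually imported after the fix:
```python
# Hypothetical verification in the PySpark shell: the real pandas should be
# imported, not the shadowed pyspark.pandas package.
import pandas
print(pandas.__file__)  # expected: .../site-packages/pandas/__init__.py, not .../pyspark/pandas/
```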
Closes #40327 from HyukjinKwon/SPARK-42266.
Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 8e83ab7)
Signed-off-by: Hyukjin Kwon <[email protected]>
snmvaughan pushed a commit to snmvaughan/spark that referenced this pull request on Jun 20, 2023
[SPARK-42266][PYTHON] Remove the parent directory in shell.py execution when IPython is used