suggested improvements to resolve java, hadoop, python and resource errors #34
Comments
Hi @Nuglar :-) Which versions of the pyspark package did you try, and with which versions did you experience the trouble? You only tested on the Windows platform, right? I extracted four major points from your text:
```python
>>> import pyspark
>>> spark = pyspark.sql.SparkSession.builder.getOrCreate()
22/07/25 13:24:28 WARN Utils: Your hostname ...
22/07/25 13:24:28 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/07/25 13:24:29 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>> spark.createDataFrame([(1,)], schema=['id']).show()
+---+
| id|
+---+
| 1|
+---+
```

Furthermore, I do not understand why you would need the findspark package.
I have to say I disagree quite strongly: the package is not functional without openjdk, and we should ship it. Regarding different Java flavours, those are definitely not the default, and shipping openjdk should IMO be opt-out, e.g. there could be an output message. Regarding wanting to use different Java versions, it is possible to package against different Java versions (cf. pyjnius). In the case where I just encountered this, the error message is not clear, particularly if the machine already has a JAVA_HOME that is somehow populated.
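For illustration, a check along these lines could surface the problem earlier. This is a hedged sketch of my own, not anything the feedstock ships:

```python
# Minimal sketch: validate JAVA_HOME before pyspark launches the JVM, so a
# stale or mispopulated value fails with a readable message.
import os
import shutil

java_home = os.environ.get("JAVA_HOME")
if java_home:
    java_bin = os.path.join(java_home, "bin", "java")
    # On Windows the executable is java.exe.
    if not (os.path.exists(java_bin) or os.path.exists(java_bin + ".exe")):
        raise RuntimeError(f"JAVA_HOME={java_home!r} does not contain bin/java")
elif shutil.which("java") is None:
    raise RuntimeError("Neither JAVA_HOME nor a `java` on PATH was found; "
                       "install a JDK (e.g. `conda install openjdk`) first")
```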
This is exacerbated by the fact that the new Java 17 LTS is only compatible with the very recent pyspark >= 3.3, so anyone installing an older pyspark and doing "just add openjdk" will run into confusing errors.
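A guard against that mismatch might look like the following; a hedged sketch, assuming the usual `java -version` banner format:

```python
# Sketch: fail fast when Java >= 17 is paired with pyspark < 3.3.
import re
import subprocess

import pyspark

# Most JDKs print the version banner on stderr, e.g. 'openjdk version "17.0.2"'.
banner = subprocess.run(["java", "-version"], capture_output=True, text=True).stderr
match = re.search(r'version "(\d+)(?:\.(\d+))?', banner)
major = int(match.group(1))
if major == 1:  # legacy scheme: "1.8.0_312" means Java 8
    major = int(match.group(2))

spark_version = tuple(int(x) for x in pyspark.__version__.split(".")[:2])
if major >= 17 and spark_version < (3, 3):
    raise RuntimeError(f"pyspark {pyspark.__version__} does not support Java {major}; "
                       "use Java 8 or 11, or upgrade to pyspark >= 3.3")
```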
Will this ever be considered for implementation? Having to change these environment variables in the system settings every time you want to use a new environment, and to copy winutils and hadoop into the newly installed pyspark instance, is especially painful.
If it doesn't break the average use-case, then of course it will be considered! PRs are always welcome. :)
That's possible to fix with activation scripts.
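For anyone who cannot use activation scripts, the same variables can also be set per process before pyspark is imported; a minimal sketch, with all paths as placeholders:

```python
# Sketch: prepare the environment in-process instead of in the system
# settings. Every path below is a placeholder for illustration only.
import os

os.environ["JAVA_HOME"] = r"C:\path\to\your\jdk"    # placeholder
os.environ["HADOOP_HOME"] = r"C:\path\to\hadoop"    # placeholder; bin\winutils.exe lives under it
os.environ["PATH"] = os.path.join(os.environ["HADOOP_HOME"], "bin") + os.pathsep + os.environ["PATH"]

import pyspark  # import only after the environment is prepared

spark = pyspark.sql.SparkSession.builder.getOrCreate()
```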
Thanks for the tip!
Solution to issue cannot be found in the documentation.
Issue
I have been unable to get conda-forge pyspark working out of the box, and have spent a couple of days figuring out what's going wrong. I am not versed enough to make a PR myself, nor confident enough that this problem affects everyone to merit one. Regardless, I hope the information I put here is useful to the devs, or at least to people like me who are having trouble getting it working.
My process for installing pyspark locally:
There are four main issues:
findspark is required to link Python to Spark. Without first running `import findspark; findspark.init()`, some pyspark commands throw the error "Python worker failed to connect back" (see the sketch after this list).
Spark version 2.4 (installed with pyspark) has a bug that makes some pyspark commands fail on Windows with a ModuleNotFoundError for the `resource` module (a Unix-only standard-library module).
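To make issue 1 concrete, here is a hedged sketch of the findspark workaround described above:

```python
# Sketch of the findspark workaround (issue 1): findspark locates
# SPARK_HOME and puts pyspark on sys.path before the import.
import findspark

findspark.init()

import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()
# Commands that spawn Python workers should now connect back correctly.
spark.createDataFrame([(1,)], schema=["id"]).show()
```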
I'm happy to elaborate or provide clearer errors/steps as needed.
Installed packages
Environment info