[SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential to use Jython for PySpark UDFs #13571
Conversation
…well together so limit the doctests. TODO: copy more tests in tests.py and update docstrings and doctests to be uniform and clearer about when Jython will and will not work. Also consider porting the wordcount example to Jython
…ing dill tests when dill is missing
…ll , py3 w/o dill)
…ping. Also cleanup broadcast on python object delete
So this is a WIP of what this could look like, but I'd really like your thoughts on the draft @davies - do you think this is heading in the right direction given the performance numbers from the benchmark?
Test build #61781 has finished for PR 13571 at commit
jenkins, retest this please.
Test build #61787 has finished for PR 13571 at commit
Test build #62528 has finished for PR 13571 at commit
Test build #62834 has finished for PR 13571 at commit
Test build #62843 has finished for PR 13571 at commit
Now that 2.0 is at the final RC, would @davies maybe have a chance to take a look and see if this is something that would be interesting?
    src = func
else:
    try:
        import dill
Currently it seems pyspark uses cloudpickle to serialize and deserialize otherwise non-serializable functions. What are the advantages of using dill here instead of cloudpickle?
So dill lets us get at the source, whereas cloudpickle doesn't. Since Jython is a different VM, we need to send the source - not the serialized function.
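For context, a minimal sketch of the difference (the tokenize function below is just an illustrative example): dill can hand back the textual definition of a function, which a separate Jython VM can re-execute, while cloudpickle produces a CPython-specific pickle of the function object.

```python
# Illustrative sketch only: recovering a function's source text with dill.
from dill.source import getsource

def tokenize(s):
    return s.split(" ")

src = getsource(tokenize)  # the original "def tokenize(s): ..." text
print(src)                 # this text, not a pickle, is what Jython would compile
```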
I've been thinking - I could change this so that the Jython jar is marked as provided, and add a flag to spark-submit to include Jython for people who want to use it (along with a check inside registerJython that tells people about the flag), if we are concerned about adding a hard dependency on Jython (though I'm not sure what people's thoughts are on taking on that dependency) - cc @davies & @yanboliang
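To make that concrete, a rough sketch of the kind of guard such a check could perform (the helper name, flag wording, and error text are hypothetical, not part of this patch):

```python
# Hypothetical sketch - names and messages are illustrative, not from the PR.
def _require_jython(sc):
    """Fail fast with a helpful message if the (provided) Jython jar is absent."""
    try:
        # PythonInterpreter is the entry point of the Jython standalone jar;
        # looking it up through py4j tells us whether the jar is on the classpath.
        sc._jvm.java.lang.Class.forName("org.python.util.PythonInterpreter")
    except Exception:
        raise ImportError(
            "Jython support requires the Jython standalone jar on the driver and "
            "executors; add it via spark-submit (e.g. --jars jython-standalone.jar) "
            "or the proposed flag that bundles it.")
```

registerJython could run a check like this up front, so users who omitted the flag get an actionable error instead of a ClassNotFoundException from deep inside the JVM.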
Test build #64011 has finished for PR 13571 at commit
Ping @yanboliang & @davies for thoughts
Test build #65120 has finished for PR 13571 at commit
…ter since the PR was originally created
Test build #65128 has finished for PR 13571 at commit
Test build #65730 has finished for PR 13571 at commit
Test build #66340 has finished for PR 13571 at commit
Thanks for the prototype and for sending the PR out - this looks interesting (a 20%-300% improvement is cool). I don't know how mature Jython currently is; I've never heard of a company that uses it. The last release took two years to go from beta to release, which may be a signal that the community behind Jython is not that active. The license Jython uses is unusual, and I don't know whether it is OK to package with Spark or not. Also, the standalone jar is 37M, which is pretty big for an experimental feature. This PR also introduces a public API; even if we mark it as experimental, it still requires some effort to maintain, and eventually to deprecate and remove. I'm not sure whether it's worth it or not. We could leave the JIRA and PR open to gather more feedback; it could be useful in case some people want to try it out.
@davies Jython is relatively mature and it has some applications (think Python UDFs with Pig), but that doesn't change the fact that it is years behind CPython (no released 3.x branch, for starters), slowish, and painful to use. Not to mention that for native libraries you need JyNI, which is still in alpha. Moreover, reasoning about PySpark dataflow is hard enough right now without adding another place where things can blow up.
So @rxin just commented with an explicit "Won't Fix" on the JIRA. @zero323 certainly Jython can be slow, but as the benchmark shows it can be much faster than our current Python UDF approach. @davies some people have been actively using it in a similar use case (Python UDFs with Pig, as @zero323 mentions). I'd rather try to find a way to expose this as an experimental API - but if the consensus is "Won't Fix" I'll put this on my back burner of things to contribute as a Spark Package (although the cost of maintaining Spark Packages is frustrating, so I'll also look at adding it to one of the meta packages like Bahir if there is interest there). In the meantime I'll do some poking with Arrow and also look at pip install-ability, since these seem to be of importance to much of the PySpark community.
Heavily borrows from apache/spark#13571
This is an early work in progress / RFC PR to see what interest exists / thoughts are around offering Jython for some PySpark UDF evaluation.
What changes were proposed in this pull request?
Transferring data from the JVM to the Python executor can be a substantial bottleneck. While Jython is not suitable for all UDFs or map functions, it may be suitable for some simple ones. An early draft of this, with a tokenization UDF, found the Jython UDF to be ~65% faster than the regular Python UDF and ~2% slower than a native Scala UDF over multiple runs. The first run with a Jython UDF involves starting the Jython interpreter on the workers, but even in that case it outperforms regular PySpark UDFs by ~20%.
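For illustration, a rough sketch of how this might look from the user's side (the Jython-specific method name and signature below are assumptions based on this description, not the final API proposed here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Regular Python UDF: the function is pickled and evaluated in a CPython worker,
# so every row is shipped from the JVM to the Python process and back.
spark.udf.register("tokenize_py", lambda s: s.split(" "), ArrayType(StringType()))

# Hypothetical Jython registration: the UDF is shipped as source text and
# evaluated in a Jython interpreter inside the executor JVM, avoiding the
# JVM <-> CPython round trip. The method name below is illustrative only.
# spark.udf.registerJython("tokenize_jy",
#                          "lambda s: s.split(' ')",
#                          ArrayType(StringType()))

spark.sql("SELECT tokenize_py('hello jython world')").show(truncate=False)
```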
How was this patch tested?
Unit tests, doc tests, and a benchmark (see https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing ).