
Conversation

@holdenk (Contributor) commented Jun 9, 2016

This is an early work-in-progress / RFC PR to see what interest exists in, and gather thoughts around, offering Jython for some PySpark UDF evaluation.

What changes were proposed in this pull request?

Transferring data from the JVM to the Python executor can be a substantial bottleneck. While Jython is not suitable for all UDFs or map functions, it may be suitable for some simple ones. An early draft of this, with a tokenization UDF, found the Jython UDF to be ~65% faster than the regular Python UDF and ~2% slower than a native Scala UDF across multiple runs. The first run with a Jython UDF involves starting the Jython interpreter on the workers, but even then it outperforms regular PySpark UDFs by ~20%.

How was this patch tested?

Unit tests, doctests, and a benchmark (see https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing ).

holdenk added 30 commits May 20, 2016 11:57
…well together so limit the doctests. TODO: copy more tests in tests.py and update docstrings and doctests to be uniform and clearer about when Jython will and won't work. Also consider porting the wordcount example to Jython
…ping. Also cleanup broadcast on python object delete
@holdenk (Contributor, Author) commented Jun 13, 2016

So this is a WIP of what this could look like, but I'd really like your thoughts on the draft, @davies - do you think this is heading in the right direction given the performance numbers from the benchmark?

@SparkQA commented Jul 5, 2016

Test build #61781 has finished for PR 13571 at commit 0244f34.

  • This patch fails build dependency tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Jul 5, 2016

jenkins, retest this please.

@SparkQA commented Jul 5, 2016

Test build #61787 has finished for PR 13571 at commit 0244f34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 19, 2016

Test build #62528 has finished for PR 13571 at commit 0244f34.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 25, 2016

Test build #62834 has finished for PR 13571 at commit bfa39e8.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jul 26, 2016

Test build #62843 has finished for PR 13571 at commit 06f753e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Jul 26, 2016

Now that 2.0 is in the final RC, would @davies maybe have a chance to take a look and see if this is something that would be interesting?

[Excerpt from the diff under review]

    src = func
    else:
        try:
            import dill
Contributor:

Currently it seems PySpark uses cloudpickle to serialize and deserialize otherwise non-serializable functions. What are the advantages of using dill here instead of cloudpickle?

Contributor (Author):

So dill lets us get at the source, which cloudpickle doesn't. Since Jython is a different VM, we need to send the source - not the serialized function.
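A minimal sketch of the source-recovery idea being discussed. This uses the stdlib `inspect` module for illustration; dill exposes similar functionality via `dill.source.getsource`, with extra fallbacks for functions that cloudpickle could only ship as bytecode. The `tokenize` function is a hypothetical example, not code from the PR.

```python
# Jython runs on a different VM, so the worker needs the function's source
# text rather than CPython bytecode. For a function defined in a file,
# inspect.getsource recovers that text.
import inspect

def tokenize(line):
    return line.split(" ")

src = inspect.getsource(tokenize)
print(src)
```

The recovered string is plain Python source, which a Jython interpreter on the worker could compile and run directly.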

@holdenk (Contributor, Author) commented Aug 2, 2016

I've been thinking: if we are concerned about adding a hard dependency on Jython, I could mark the Jython jar as provided and add a flag to spark-submit to include Jython for people who want to use it (along with a check inside registerJython that tells people about the flag). I'm not sure what people's thoughts are on taking on that dependency, though - cc @davies & @yanboliang
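With the jar marked as provided, users could instead supply Jython themselves at submit time. A hypothetical invocation might look like the following (`--jars` is a real spark-submit option; the jar filename and the application script are illustrative only, and the dedicated enable-Jython flag proposed above never existed):

```shell
# Ship a user-provided Jython standalone jar to the executors alongside the job.
spark-submit \
  --jars jython-standalone-2.7.0.jar \
  my_pyspark_job.py
```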

@SparkQA commented Aug 18, 2016

Test build #64011 has finished for PR 13571 at commit 7ab97b9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk (Contributor, Author) commented Sep 8, 2016

Ping @yanboliang & @davies for thoughts

@SparkQA commented Sep 8, 2016

Test build #65120 has finished for PR 13571 at commit 75404b7.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 9, 2016

Test build #65128 has finished for PR 13571 at commit fbe4549.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Sep 21, 2016

Test build #65730 has finished for PR 13571 at commit 6a127e5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Oct 5, 2016

Test build #66340 has finished for PR 13571 at commit c00d71c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@davies (Contributor) commented Oct 12, 2016

Thanks for the prototype and for sending the PR out; this looks interesting (a 20%-300% improvement is cool).

I don't know how mature Jython currently is; I've never heard of a company that uses it. The last release took two years to go from beta to final, which may be a signal that the community behind Jython is not that active.

The license Jython uses is unusual, and I don't know whether it is OK to package it with Spark. Also, the standalone jar is 37 MB, which is pretty big for an experimental feature.

This PR also introduces a public API; even if we mark it as experimental, it still requires effort to maintain, and eventually to deprecate and remove. I'm not sure whether it's worth it.

We could leave the JIRA and PR open to gather more feedback; it could be useful in case some people want to try it out.

@zero323 (Member) commented Oct 13, 2016

@davies Jython is relatively mature and it has some applications (think Python UDFs with Pig), but that doesn't change the fact that it is years behind CPython (no released 3.x branch, for starters), slowish, and painful to use. Not to mention that for native libraries you need JyNI, which is still in alpha.

Moreover, reasoning about PySpark dataflow is hard enough right now without adding another place where things can blow up.

@holdenk (Contributor, Author) commented Oct 13, 2016

So @rxin just commented with an explicit "Won't Fix" on the JIRA.

@zero323 Certainly Jython can be slow, but as the benchmark shows it can be much faster than our current Python UDF approach.

@davies Some people have been actively using it in a similar use case - Python UDFs with Pig - as @zero323 mentions.

I'd rather try to find a way to expose this as an experimental API - but if the consensus is "Won't Fix" I'll put this on my back burner of things to contribute as a Spark package (although the cost of maintaining Spark packages is frustrating, so I'll also take a look at adding it to one of the meta-packages like Bahir if there is interest there). In the meantime I'll do some poking with Arrow and also look at pip install-ability, since these seem to be important to much of the PySpark community.

@holdenk holdenk closed this Oct 13, 2016
mariusvniekerk added a commit to mariusvniekerk/spark-jython-udf that referenced this pull request Nov 30, 2016