[SPARK-15369][WIP][RFC][PySpark][SQL] Expose potential to use Jython for PySpark UDFs #13571
Conversation
…well together so limit the doctests. TODO: copy more tests in tests.py and update docstrings and doctests to be uniform and clearer about when Jython will and will not work. Also consider porting the wordcount example to Jython
…ing dill tests when dill is missing
…ll , py3 w/o dill)
…ping. Also cleanup broadcast on python object delete
So this is a WIP of what this could look like, but I'd really like your thoughts on the draft @davies - do you think this is heading in the right direction given the performance numbers from the benchmark?
Test build #61781 has finished for PR 13571 at commit
jenkins, retest this please.
Test build #61787 has finished for PR 13571 at commit
Test build #62528 has finished for PR 13571 at commit
Test build #62834 has finished for PR 13571 at commit
Test build #62843 has finished for PR 13571 at commit
Now that 2.0 is at the final RC, would @davies maybe have a chance to take a look and see if this is something that would be interesting?
    src = func
else:
    try:
        import dill
Currently it seems pyspark uses cloudpickle to serialize and deserialize otherwise non-serializable functions. What are the advantages of using dill here instead of cloudpickle?
So dill lets us get at the source, whereas cloudpickle doesn't. Since Jython is a different VM, we need to send the source - not the serialized function.
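For context, a minimal sketch of the difference (the tokenize function below is just an illustrative example): dill can hand back the textual definition of a function, which a separate Jython VM can re-execute, while cloudpickle produces a CPython-specific pickle of the function object.

```python
# Illustrative sketch only: recovering a function's source text with dill.
from dill.source import getsource

def tokenize(s):
    return s.split(" ")

src = getsource(tokenize)  # the original "def tokenize(s): ..." text
print(src)                 # this text, not a pickle, is what Jython would compile
```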
I've been thinking - I could change this so that the Jython jar is marked as provided, and add a flag to spark-submit to include Jython for people who want to use it (along with a check inside registerJython that tells people about the flag), if we are concerned about adding a hard dependency on Jython (though I'm not sure what people's thoughts are on taking on that dependency) - cc @davies & @yanboliang
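To make that concrete, a rough sketch of the kind of guard such a check could perform (the helper name, flag wording, and error text are hypothetical, not part of this patch):

```python
# Hypothetical sketch - names and messages are illustrative, not from the PR.
def _require_jython(sc):
    """Fail fast with a helpful message if the (provided) Jython jar is absent."""
    try:
        # PythonInterpreter is the entry point of the Jython standalone jar;
        # looking it up through py4j tells us whether the jar is on the classpath.
        sc._jvm.java.lang.Class.forName("org.python.util.PythonInterpreter")
    except Exception:
        raise ImportError(
            "Jython support requires the Jython standalone jar on the driver and "
            "executors; add it via spark-submit (e.g. --jars jython-standalone.jar) "
            "or the proposed flag that bundles it.")
```

registerJython could run a check like this up front, so users who omitted the flag get an actionable error instead of a ClassNotFoundException from deep inside the JVM.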
Test build #64011 has finished for PR 13571 at commit
Ping @yanboliang & @davies for thoughts
Test build #65120 has finished for PR 13571 at commit
…ter since the PR was originally created
Test build #65128 has finished for PR 13571 at commit
Test build #65730 has finished for PR 13571 at commit
Test build #66340 has finished for PR 13571 at commit
Thanks for the prototype and for sending the PR out - this looks interesting (a 20%-300% improvement is cool). I don't know how mature Jython currently is; I've never heard of a company that uses it. The last release took two years to go from beta to release, which may be a signal that the community behind Jython is not that active. The license Jython uses is unusual, and I don't know whether it is OK to package with Spark or not. Also, the standalone jar is 37M, which is pretty big for an experimental feature. This PR also introduces a public API; even if we mark it as experimental, it still requires some effort to maintain, and eventually to deprecate and remove. I'm not sure whether it's worth it or not. We could leave the JIRA and PR open to gather more feedback; it could be useful in case some people want to try it out.
@davies Jython is relatively mature and it has some applications (think Python UDFs with Pig), but that doesn't change the fact that it is years behind CPython (no released 3.x branch, for starters), slowish, and painful to use. Not to mention that for native libraries you need JyNI, which is still in alpha. Moreover, reasoning about PySpark dataflow is hard enough right now without adding another place where things can blow up.
So @rxin just commented with an explicit "Won't Fix" on the JIRA. @zero323 certainly Jython can be slow, but as the benchmark shows it can be much faster than our current Python UDF approach. @davies some people have been actively using it in a similar use case (Python UDFs with Pig, as @zero323 mentions). I'd rather try to find a way to expose this as an experimental API - but if the consensus is "Won't Fix" I'll put this on my back burner of things to contribute as a Spark Package (although the cost of maintaining Spark Packages is frustrating, so I'll also look at adding it to one of the meta packages like Bahir if there is interest there). In the meantime I'll do some poking with Arrow and also look at pip install-ability, since these seem to be of importance to much of the PySpark community.
Heavily borrows from apache/spark#13571
This is an early work in progress / RFC PR to see what interest exists / thoughts are around offering Jython for some PySpark UDF evaluation.
What changes were proposed in this pull request?
Transferring data from the JVM to the Python executor can be a substantial bottleneck. While Jython is not suitable for all UDFs or map functions, it may be suitable for some simple ones. An early draft of this, with a tokenization UDF, found the Jython UDF to be ~65% faster than the regular Python UDF and ~2% slower than a native Scala UDF over multiple runs. The first run with a Jython UDF involves starting the Jython interpreter on the workers, but even in that case it outperforms regular PySpark UDFs by ~20%.
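For illustration, a rough sketch of how this might look from the user's side (the Jython-specific method name and signature below are assumptions based on this description, not the final API proposed here):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Regular Python UDF: the function is pickled and evaluated in a CPython worker,
# so every row is shipped from the JVM to the Python process and back.
spark.udf.register("tokenize_py", lambda s: s.split(" "), ArrayType(StringType()))

# Hypothetical Jython registration: the UDF is shipped as source text and
# evaluated in a Jython interpreter inside the executor JVM, avoiding the
# JVM <-> CPython round trip. The method name below is illustrative only.
# spark.udf.registerJython("tokenize_jy",
#                          "lambda s: s.split(' ')",
#                          ArrayType(StringType()))

spark.sql("SELECT tokenize_py('hello jython world')").show(truncate=False)
```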
How was this patch tested?
Unit tests, doc tests, and a benchmark (see https://docs.google.com/document/d/1L-F12nVWSLEOW72sqOn6Mt1C0bcPFP9ck7gEMH2_IXE/edit?usp=sharing ).