Skip to content

Conversation

@davies
Copy link
Contributor

@davies davies commented Aug 26, 2014

After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:

PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py

The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:

Job CPython 2.7 PyPy 2.3.1 Speed up
Word Count 41s 15s 2.7x
Sort 46s 44s 1.05x
Stats 174s 3.6s 48x

Here is the code used for benchmark:

rdd = sc.textFile("text")
def wordcount():
    rdd.flatMap(lambda x:x.split('/'))\
        .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
    rdd.sortBy(lambda x:x, 1).count()
def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2144 at commit 25b4ca7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

Tests timed out after a configured wait of 120m.

@davies
Copy link
Contributor Author

davies commented Aug 26, 2014

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

QA tests have started for PR 2144 at commit 25b4ca7.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 26, 2014

Tests timed out after a configured wait of 120m.

@mateiz
Copy link
Contributor

mateiz commented Aug 27, 2014

This looks like it will be tricky to maintain without automated testing. Can you update dev/run-tests to also run the PySpark tests with PyPy maybe? You might need help from Patrick or others on installing pypy on the Jenkins machines.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2144 at commit 9986692.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

Tests timed out after a configured wait of 120m.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have started for PR 2144 at commit cb2d724.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 28, 2014

QA tests have finished for PR 2144 at commit cb2d724.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ItemGetterType(ctypes.Structure):
    • class AttrGetterType(ctypes.Structure):

@SparkQA
Copy link

SparkQA commented Aug 29, 2014

QA tests have started for PR 2144 at commit 42fb5fa.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Aug 29, 2014

QA tests have finished for PR 2144 at commit 42fb5fa.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ItemGetterType(ctypes.Structure):
    • class AttrGetterType(ctypes.Structure):
    • case class Sqrt(child: Expression) extends UnaryExpression
    • class TreeNodeRef(val obj: TreeNode[_])

@mateiz
Copy link
Contributor

mateiz commented Aug 30, 2014

@davies just curious, do all the unit tests run if you do run-tests with pypy? We should make sure they do, and add a command in there to test this in Jenkins (ask Patrick for help with this).

@davies
Copy link
Contributor Author

davies commented Aug 31, 2014

Yes, I will do that next week.

@JoshRosen
Copy link
Contributor

One concern with adding tests for pypy is that it might significantly increase the runtime of the Jenkins tests. We should test regularly with pypy to make sure that we don't break compatibility, but we might not need to run those tests with every commit.

@mateiz
Copy link
Contributor

mateiz commented Sep 4, 2014

How long do the Python tests run now?

Anyway, we could do PyPy only if Python code changed (but I'd still do Python all the time).

@davies
Copy link
Contributor Author

davies commented Sep 4, 2014

PyPy is fully compatible with CPython for pure Python code, so it's not necessary to test against every commit with PyPy.

Maybe we could have nightly tests (for performance or scalability), we could put PyPy in this kind of tests.

@davies
Copy link
Contributor Author

davies commented Sep 4, 2014

PyPy does not fully support NumPy right now, so MLlib can not run with PyPy.

@mateiz
Copy link
Contributor

mateiz commented Sep 5, 2014

So you guys should figure out a way to run this so that it doesn't get stale. For example it's fine to add some code to the script that runs all the tests except the MLlib ones. But there's little point merging it unless we also automatically test that it keeps working, otherwise we'll only notice breakage on each release (if we remember to test with PyPy).

@mattf
Copy link

mattf commented Sep 5, 2014

So you guys should figure out a way to run this so that it doesn't get stale. For example it's fine to add some code to the script that runs all the tests except the MLlib ones. But there's little point merging it unless we also automatically test that it keeps working, otherwise we'll only notice breakage on each release (if we remember to test with PyPy).

+1

@JoshRosen
Copy link
Contributor

Let's just have the PyPy tests run by default on Jenkins. If this causes build speed problems later down the road, we can revisit the issue of selectively running tests.

@davies
Copy link
Contributor Author

davies commented Sep 5, 2014

@mateiz @JoshRosen @mattf run-tests will try to run tests for spark core and sql with PyPy.

One known issue is that serialization of array in PyPy is similar to Python2.6, which is not supported by Pyrite, so one test cases has been skipped for them. I had added another one which do not depend on serialization of array.

Also I had added some refactor in cloudpickle to do it in more portable ways (which is also used by dill).

@davies
Copy link
Contributor Author

davies commented Sep 5, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Sep 5, 2014

QA tests have started for PR 2144 at commit fae8b19.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 6, 2014

QA tests have finished for PR 2144 at commit fae8b19.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Copy link
Contributor

I'm waiting to figure out the right procedure for installing pypy on our Jenkins boxes; once I have this figured out, I'll loop back to finish reviewing. From what I've seen so far, though, this looks good, although I should take a more careful look over the cloudpickle refactoring to make sure that I understand what's happening.

@JoshRosen
Copy link
Contributor

Thanks to @shaneknapp we now have pypy-2.0.2-1.el6.x86_64 on the Jenkins workers, so I'm going to try retesting this.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have started for PR 2144 at commit fae8b19.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2144 at commit fae8b19.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Dummy(object):

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have started for PR 2144 at commit b20ab3a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have started for PR 2144 at commit 9aed6c5.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2144 at commit b20ab3a.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Dummy(object):

@SparkQA
Copy link

SparkQA commented Sep 11, 2014

QA tests have finished for PR 2144 at commit 9aed6c5.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Dummy(object):

@JoshRosen
Copy link
Contributor

This looks good to me (Davies and I walked through the code offline). I'm going to merge this into master. Thanks!

@asfgit asfgit closed this in 71af030 Sep 13, 2014
@davies davies deleted the pypy branch September 15, 2014 22:19
asfgit pushed a commit that referenced this pull request Sep 24, 2014
function.func_code.co_names has all the names used in the function, including name of attributes. It will pickle some unnecessary globals if there is a global having the same name with attribute (in co_names).

There is a regression introduced by #2144, revert part of changes in that PR.

cc JoshRosen

Author: Davies Liu <[email protected]>

Closes #2522 from davies/globals and squashes the following commits:

dfbccf5 [Davies Liu] fix bug while pickle globals of function
@staslos
Copy link

staslos commented Jul 20, 2015

Sorry for the stupid question. Does this work on executors in YARN cluster? Or it's just for local mode only?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants