[SPARK-3478] [PySpark] Profile the Python tasks #2556

davies · 2014-09-27T04:32:02Z

This patch adds profiling support for PySpark. It shows the profiling results
before the driver exits. Here is one example:

============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)

Users can also show the profile results manually with sc.show_profiles() or dump them to disk
with sc.dump_profiles(path), for example:

>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
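The report layout above (the "Ordered by: internal time, cumulative time" header followed by the ncalls/tottime/cumtime columns) is what Python's standard cProfile and pstats modules print. As a rough illustration only, and not the code in this patch, the sketch below profiles a single function call and produces a report with the same sort order. The helper names profile_call and count_strings are hypothetical, and the example is written for Python 3, whereas the output above comes from Python 2.

```python
import cProfile
import io
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile; return (result, report text).

    Hypothetical helper for illustration; not part of PySpark.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # "time" sorts by internal time (tottime), "cumulative" by cumtime,
    # which matches the "Ordered by" header in the reports above.
    stats.sort_stats("time", "cumulative").print_stats()
    return result, buf.getvalue()


def count_strings(n):
    # Stand-in for a task body: convert n integers to str and count them.
    return sum(1 for _ in map(str, range(n)))


result, report = profile_call(count_strings, 100)
print(report)
```

Sorting by internal time first puts the functions that burn the most CPU in their own bodies at the top, which is usually the quickest way to spot a hot loop such as statcounter.py:44(merge) in the first report.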

Profiling is disabled by default and can be enabled with "spark.python.profile=true".

Users can also have the results dumped to disk automatically for later analysis by setting "spark.python.profile.dump=path_to_dump".
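For the "later analysis" step, one plausible workflow is to reload a dumped profile with pstats and re-sort it. The sketch below assumes the dump files are in the standard pstats binary format, which this description does not state, and it writes its own dump with plain cProfile so that it runs without Spark. The file name rdd_1.pstats is a made-up example.

```python
import cProfile
import os
import pstats
import tempfile

# Produce a profile dump with plain cProfile (stand-in for a dumped RDD profile).
profiler = cProfile.Profile()
profiler.runcall(sorted, range(1000, 0, -1))

dump_dir = tempfile.mkdtemp()
dump_path = os.path.join(dump_dir, "rdd_1.pstats")  # hypothetical file name
profiler.dump_stats(dump_path)

# Later, possibly in another process: reload the dump and inspect it.
loaded = pstats.Stats(dump_path)
# Show the five entries with the largest cumulative time.
loaded.sort_stats("cumulative").print_stats(5)
```

Because the dump is just data on disk, the same file can be re-sorted by different keys or merged with other dumps via pstats.Stats.add() without rerunning the job.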

This is a bugfix of #2351. cc @JoshRosen

Conflicts: docs/configuration.md

Conflicts: python/pyspark/worker.py

AmplabJenkins · 2014-09-27T04:42:16Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/20903/

SparkQA · 2014-09-27T16:39:21Z

QA tests have started for PR 2556 at commit e68df5a.

This patch merges cleanly.

SparkQA · 2014-09-27T17:46:06Z

QA tests have finished for PR 2556 at commit e68df5a.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2014-10-01T01:26:07Z

I've merged this. Thanks for the fix!

davies added 15 commits September 10, 2014 18:03

add profile for python

4b20494

fix Python UDF

0a5b6eb

address comment, add tests

4f8309d

add docs string and clear profiles after show or dump

dadee1a

add docs for two configs

15d6f18

Merge branch 'master' into profiler

c23865c

Merge branch 'master' into profiler

09d02c3

Conflicts: docs/configuration.md

Merge branch 'master' of github.com:apache/spark into profiler

116d52a

Conflicts: python/pyspark/worker.py

Merge branch 'master' of github.com:apache/spark into profiler

fb9565b

Conflicts: python/pyspark/worker.py

move show_profiles and dump_profiles to SparkContext

cba9463

bugfix

7a56c24

fix docs

2b0daf2

bugfix, add tests for show_profiles and dump_profiles()

7ef2aa0

compatible with python 2.6

858e74c

Merge branch 'master' of github.com:apache/spark into profiler

e68df5a

JoshRosen referenced this pull request Oct 1, 2014

Revert "[SPARK-3478] [PySpark] Profile the Python tasks"

f872e4f

This reverts commit 1aa549b.

asfgit closed this in c5414b6 Oct 1, 2014
