69 changes: 20 additions & 49 deletions python/docs/source/development/debugging.rst
@@ -215,13 +215,10 @@ Python/Pandas UDF
~~~~~~~~~~~~~~~~~

PySpark provides remote `memory_profiler <https://github.com/pythonprofilers/memory_profiler>`_ for
Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile.memory`` configuration to ``true``. That
can be used on editors with line numbers such as Jupyter notebooks. An example on a Jupyter notebook is as shown below.

.. code-block:: bash

pyspark --conf spark.python.profile.memory=true
Python/Pandas UDFs. It can be used with editors that show line numbers, such as Jupyter notebooks. UDFs with iterators as inputs/outputs are not supported.

The SparkSession-based memory profiler can be enabled by setting the `Runtime SQL configuration <https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration>`_
``spark.sql.pyspark.udf.profiler`` to ``memory``. An example in a Jupyter notebook is shown below.

.. code-block:: python

@@ -232,10 +229,11 @@
def add1(x):
return x + 1

spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")

added = df.select(add1("id"))
added.show()
sc.show_profiles()

spark.profile.show(type="memory")

The result profile is as shown below.

@@ -258,16 +256,18 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``ArrowEvalPython`` below.

added.explain()


.. code-block:: text

== Physical Plan ==
*(2) Project [pythonUDF0#11L AS add1(id)#3L]
+- ArrowEvalPython [add1(id#0L)#2L], [pythonUDF0#11L], 200
+- *(1) Range (0, 10, step=1, splits=16)
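
The UDF ID shown in the plan can then be used to inspect a single UDF's result; a small sketch, assuming ``show`` accepts the same ``id`` filter that ``clear`` takes below:

.. code-block:: python

    # Show only the memory profile of UDF id 2, i.e. add1(id#0L)#2L in the plan above.
    spark.profile.show(id=2, type="memory")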

This feature is not supported with registered UDFs or UDFs with iterators as inputs/outputs.
We can clear the result memory profile as shown below.

.. code-block:: python

spark.profile.clear(id=2, type="memory")
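
For reference, a minimal end-to-end sketch of the memory-profiling flow described above (assuming the standard ``pyspark.sql.functions.udf`` decorator, a running SparkSession named ``spark``, and the ``memory_profiler`` package installed):

.. code-block:: python

    from pyspark.sql.functions import udf

    # Enable the SparkSession-based memory profiler.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "memory")

    @udf("long")
    def add1(x):
        return x + 1

    df = spark.range(10)
    df.select(add1("id")).show()

    # Inspect the line-by-line memory profile of each UDF, then clear it.
    spark.profile.show(type="memory")
    spark.profile.clear(type="memory")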

Identifying Hot Loops (Python Profilers)
----------------------------------------
@@ -306,47 +306,14 @@ regular Python process unless you are running your driver program in another machine
276 0.000 0.000 0.002 0.000 <frozen importlib._bootstrap>:147(__enter__)
...
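
On the driver side, a standard Python profiler can be applied directly to the driver program, since it runs as a regular Python process; a minimal sketch using the standard library ``cProfile`` and ``pstats`` modules (``spark`` is assumed to be an active SparkSession):

.. code-block:: python

    import cProfile
    import pstats

    profiler = cProfile.Profile()
    profiler.enable()

    # Driver-side code to profile.
    spark.range(10).count()

    profiler.disable()
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)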

Executor Side
~~~~~~~~~~~~~

To use this on executor side, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
executor side, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.

.. code-block:: bash

pyspark --conf spark.python.profile=true


.. code-block:: python

>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
728 function calls (692 primitive calls) in 0.004 seconds

Ordered by: internal time, cumulative time

ncalls tottime percall cumtime percall filename:lineno(function)
12 0.001 0.000 0.001 0.000 serializers.py:210(load_stream)
12 0.000 0.000 0.000 0.000 {built-in method _pickle.dumps}
12 0.000 0.000 0.001 0.000 serializers.py:252(dump_stream)
12 0.000 0.000 0.001 0.000 context.py:506(f)
...

Python/Pandas UDF
~~~~~~~~~~~~~~~~~

To use this on Python/Pandas UDFs, PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.

.. code-block:: bash

pyspark --conf spark.python.profile=true
PySpark provides remote `Python Profilers <https://docs.python.org/3/library/profile.html>`_ for
Python/Pandas UDFs. UDFs with iterators as inputs/outputs are not supported.
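
For clarity, the unsupported "UDFs with iterators as inputs/outputs" are the iterator variants of Pandas UDFs, e.g. the following sketch, shown for illustration only:

.. code-block:: python

    from typing import Iterator

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # An iterator-of-Series to iterator-of-Series Pandas UDF: this is the
    # style of UDF that the note above refers to as not supported.
    @pandas_udf("long")
    def add1_iter(batches: Iterator[pd.Series]) -> Iterator[pd.Series]:
        for batch in batches:
            yield batch + 1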

The SparkSession-based performance profiler can be enabled by setting the `Runtime SQL configuration <https://spark.apache.org/docs/latest/configuration.html#runtime-sql-configuration>`_
``spark.sql.pyspark.udf.profiler`` to ``perf``. An example is shown below.

.. code-block:: python

@@ -358,14 +325,15 @@
...
>>> added = df.select(add1("id"))

>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
>>> added.show()
+--------+
|add1(id)|
+--------+
...
+--------+

>>> sc.show_profiles()
>>> spark.profile.show(type="perf")
============================================================
Profile of UDF<id=2>
============================================================
@@ -390,8 +358,11 @@ The UDF IDs can be seen in the query plan, for example, ``add1(...)#2L`` in ``ArrowEvalPython`` below.
+- ArrowEvalPython [add1(id#0L)#2L], [pythonUDF0#11L], 200
+- *(1) Range (0, 10, step=1, splits=16)

We can clear the result performance profile as shown below.

.. code-block:: python

This feature is not supported with registered UDFs.
>>> spark.profile.clear(id=2, type="perf")
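
For reference, a minimal end-to-end sketch of the performance-profiling flow described above, using a scalar Pandas UDF (assuming a running SparkSession named ``spark`` with ``pandas`` and ``pyarrow`` installed):

.. code-block:: python

    import pandas as pd
    from pyspark.sql.functions import pandas_udf

    # Enable the SparkSession-based performance profiler.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    @pandas_udf("long")
    def add1(s: pd.Series) -> pd.Series:
        return s + 1

    spark.range(10).select(add1("id")).show()

    # Inspect the cProfile-style per-UDF results, then clear them.
    spark.profile.show(type="perf")
    spark.profile.clear(type="perf")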

Common Exceptions / Errors
--------------------------
1 change: 1 addition & 0 deletions python/docs/source/reference/pyspark.sql/spark_session.rst
@@ -49,6 +49,7 @@ See also :class:`SparkSession`.
SparkSession.createDataFrame
SparkSession.getActiveSession
SparkSession.newSession
SparkSession.profile
Member

I think we should also have a dedicated section for profile.show, profile.dump.

Member Author

I hit

[autosummary] failed to import pyspark.sql.SparkSession.profile.dump.
Possible hints:
* AttributeError: 'property' object has no attribute 'dump'
* ImportError: 
* ModuleNotFoundError: No module named 'pyspark.sql.SparkSession'

The profile property returns a Profile class instance, Sphinx might have difficulty accessing it. Do you happen to know the best way to resolve that?

Member

Need

:template: autosummary/accessor_method.rst

?

See #44012 (comment)

Member Author

Hmm I was thinking the same but it kept failing with the error message..

Member Author

I think SparkSession.builder works because it is a classproperty whereas profile is a property of SparkSession.
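
A minimal sketch of the difference being described, with illustrative classes only (not the real PySpark implementation): accessing an instance property on the class yields the `property` object itself, which is why autosummary cannot find `dump` on it, whereas a classproperty descriptor resolves to the underlying object even when accessed on the class.

    class classproperty:
        """Simplified classproperty descriptor, for illustration."""
        def __init__(self, fget):
            self.fget = fget

        def __get__(self, obj, owner):
            # Resolves to the wrapped value even when accessed on the class.
            return self.fget(owner)


    class Builder:
        pass


    class Profile:
        def dump(self, path):
            pass


    class SparkSession:
        @classproperty
        def builder(cls):
            return Builder()

        @property
        def profile(self):
            return Profile()


    print(type(SparkSession.builder))  # -> Builder: class-level access resolves to the real object
    print(type(SparkSession.profile))  # -> property: the descriptor itself, so '.dump' is missing, matching the Sphinx error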

Member Author

I have a workaround 76e7387 by using autoclass, but it doesn't look consistent with the rest of the page, as shown below.

image

I'm wondering if we should have a follow-up designated for that part.

SparkSession.range
SparkSession.read
SparkSession.readStream
2 changes: 2 additions & 0 deletions python/pyspark/sql/connect/session.py
@@ -946,6 +946,8 @@ def _profiler_collector(self) -> ProfilerCollector:
def profile(self) -> Profile:
return Profile(self._client._profiler_collector)

profile.__doc__ = PySparkSession.profile.__doc__


SparkSession.__doc__ = PySparkSession.__doc__

12 changes: 12 additions & 0 deletions python/pyspark/sql/session.py
@@ -908,6 +908,18 @@ def dataSource(self) -> "DataSourceRegistration":

@property
def profile(self) -> Profile:
"""Returns a :class:`Profile` for performance/memory profiling.

.. versionadded:: 4.0.0

Returns
-------
:class:`Profile`

Notes
-----
Supports Spark Connect.
"""
return Profile(self._profiler_collector)

def range(