Skip to content

Conversation

@xinrong-meng
Copy link
Member

@xinrong-meng xinrong-meng commented Mar 5, 2024

What changes were proposed in this pull request?

Introduce spark.profile.clear for SparkSession-based profiling.

Why are the changes needed?

A straightforward and unified interface for managing and resetting profiling results for SparkSession-based profilers.

Does this PR introduce any user-facing change?

Yes. spark.profile.clear is supported as shown below.

Preparation:

>>> from pyspark.sql.functions import pandas_udf
>>> df = spark.range(3)
>>> @pandas_udf("long")
... def add1(x):
...   return x + 1
... 
>>> added = df.select(add1("id"))
>>> spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")
>>> added.show()
+--------+                                                                      
|add1(id)|
+--------+
...
+--------+
>>> spark.profile.show()
============================================================
Profile of UDF<id=2>
============================================================
         1410 function calls (1374 primitive calls) in 0.004 seconds
...

Example usage:

>>> spark.profile.profiler_collector._profile_results
{2: (<pstats.Stats object at 0x7ff6484d22e0>, None)}

>>> spark.profile.clear(1)  # id mismatch
>>> spark.profile.profiler_collector._profile_results
{2: (<pstats.Stats object at 0x7ff6484d22e0>, None)}

>>> spark.profile.clear(type="memory")  # type mismatch
>>> spark.profile.profiler_collector._profile_results
{2: (<pstats.Stats object at 0x7ff6484d22e0>, None)}

>>> spark.profile.clear()  # clear all
>>> spark.profile.profiler_collector._profile_results
{}
>>> spark.profile.show()
>>> 

How was this patch tested?

Unit tests.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Mar 5, 2024
@xinrong-meng xinrong-meng changed the title [WIP] Introduce spark.profile.clear for SparkSession-based profiling [SPARK-47276][PYTHON][CONNECT] Introduce spark.profile.clear for SparkSession-based profiling Mar 5, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review March 5, 2024 21:50
@xinrong-meng
Copy link
Member Author

Failed tests are irrelevant to changes proposed in this PR. Rerun failed tests https://github.com/xinrong-meng/spark/actions/runs/8162084262.

"""
Clear the perf profile results.
.. versionadded:: 4.0.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a user-facing API? If not, we don't need this version directive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a user-facing API, along with profile.show and profile.dump. We will also add it to API doc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually this is not. The clear in Profile should be a user-facing API.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems fine but cc @ueshin

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise, LGTM, pending tests.

},
)

def test_clear_memory_type(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, it seems we don't have a parity test for test_session. does it make sense to move SparkSessionProfileTests out of test_session and add parity test for it?

Copy link
Member Author

@xinrong-meng xinrong-meng Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea!

For now, all logic tested by SparkSessionProfileTests is directly imported in Spark Connect with no modification. But I do agree separating it later will improve readability and ensure future parity. I'll refactor later. Thanks!

@xinrong-meng
Copy link
Member Author

Merged to master, thank you all!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants