Skip to content

Conversation

@xinrong-meng
Copy link
Member

@xinrong-meng xinrong-meng commented Feb 27, 2024

What changes were proposed in this pull request?

Documentation for SparkSession-based Profilers.

Why are the changes needed?

For easier user onboarding and better usability.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Manual test. Screenshots of built htmls are as shown below.

image

Was this patch authored or co-authored using generative AI tooling?

No.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we add the API into python/docs/source/reference/pyspark.sql/spark_session.rst as well?

@xinrong-meng
Copy link
Member Author

I was looking for the API doc.. thank you @HyukjinKwon !

@xinrong-meng xinrong-meng changed the title [WIP] Documentation for SparkSession-based Profilers [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers Feb 28, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review February 28, 2024 19:37
SparkSession.createDataFrame
SparkSession.getActiveSession
SparkSession.newSession
SparkSession.profile
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should also have a dedicated section for profile.show, profile.dump.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hit

[autosummary] failed to import pyspark.sql.SparkSession.profile.dump.
Possible hints:
* AttributeError: 'property' object has no attribute 'dump'
* ImportError: 
* ModuleNotFoundError: No module named 'pyspark.sql.SparkSession'

The profile property returns a Profile class instance, Sphinx might have difficulty accessing it. Do you happen to know the best way to resolve that?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need

:template: autosummary/accessor_method.rst

?

See #44012 (comment)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm I was thinking the same but it kept failing with the error message..

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SparkSession.builder works because it is a classproperty whereas profile is a property of SparkSession.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a workaround 76e7387 by using autoclass, but it doesn't look consistent with the rest of the page, as shown below.

image

I'm wondering if we should have a follow-up designated for that part.

Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``.
Python/Pandas UDFs.

SparkContext-based
Copy link
Member

@HyukjinKwon HyukjinKwon Feb 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can just remove this, and just add one additional section called runtime profiler

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @ueshin do you have other thoughts?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about put the new doc to the first place?

  • Identifying Hot Loops (Python Profilers)
    • Driver Side
      ...
    • Executor Side
      • Python/Pandas UDF
        Show the new profiler usage
      • Legacy (for RDD or non-Spark Connect)
        Put the current doc here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe there are many existing users of SparkContext-based profilers. Shall we keep it in the debugging guide until SparkSession-based profilers gain more adoption and positive feedbacks? I'll adjust the order to show SparkSession-based profilers first as @ueshin suggested. What do you think @HyukjinKwon?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will remove "legacy" profilers for readability and clarity and start preparing migration guide.

@xinrong-meng xinrong-meng marked this pull request as draft March 5, 2024 20:00
@xinrong-meng xinrong-meng changed the title [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers Mar 5, 2024
@xinrong-meng
Copy link
Member Author

xinrong-meng commented Mar 5, 2024

Marked WIP to wait for #45378 merged first and then adjusted.

@xinrong-meng xinrong-meng changed the title [WIP][SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers [SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers Mar 7, 2024
@xinrong-meng xinrong-meng marked this pull request as ready for review March 7, 2024 21:40
@HyukjinKwon
Copy link
Member

Merged to master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants