[SPARK-47078][DOCS][PYTHON] Documentation for SparkSession-based Profilers #45269
HyukjinKwon
left a comment
Shall we add the API into python/docs/source/reference/pyspark.sql/spark_session.rst as well?
I was looking for the API doc. Thank you, @HyukjinKwon!
| SparkSession.createDataFrame | ||
| SparkSession.getActiveSession | ||
| SparkSession.newSession | ||
| SparkSession.profile |
I think we should also have a dedicated section for profile.show, profile.dump.
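For context, here is a minimal sketch of how `profile.show` and `profile.dump` are typically called on a session. It assumes a Spark release that ships the SparkSession-based profilers, the `spark.sql.pyspark.udf.profiler` configuration key, and the keyword-only `type` argument; the function name `profile_udf_sketch` and the `dump_dir` parameter are hypothetical and exist only for illustration.

```python
def profile_udf_sketch(spark, dump_dir=None):
    """Profile a small Python UDF on an existing SparkSession (illustrative only)."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql.functions import udf

    # Assumption: "perf" selects the performance profiler for Python/Pandas UDFs.
    spark.conf.set("spark.sql.pyspark.udf.profiler", "perf")

    @udf("long")
    def add1(x):
        return x + 1

    # Run a query that executes the UDF so profile results are collected.
    spark.range(10).select(add1("id")).collect()

    # Print the collected profile results.
    spark.profile.show(type="perf")

    # Optionally write the results to a directory (hypothetical path argument).
    if dump_dir is not None:
        spark.profile.dump(dump_dir, type="perf")
```

Note that both methods are reached through a session instance (`spark.profile`), not through the `SparkSession` class itself.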
I hit
[autosummary] failed to import pyspark.sql.SparkSession.profile.dump.
Possible hints:
* AttributeError: 'property' object has no attribute 'dump'
* ImportError:
* ModuleNotFoundError: No module named 'pyspark.sql.SparkSession'
The profile property returns a Profile class instance, so Sphinx might have difficulty accessing its methods through the class. Do you happen to know the best way to resolve that?
Hmm, I was thinking the same, but it kept failing with that error message.
I think SparkSession.builder works because it is a classproperty whereas profile is a property of SparkSession.
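The difference can be reproduced without Spark. The sketch below uses stand-in classes (`Session`, `Builder`, `Profile`, and a simplified `ClassProperty` descriptor are all hypothetical, not PySpark's actual implementation) to show why attribute lookup on the class succeeds for a class-level property but fails for an ordinary `property`, which matches the `AttributeError` that autosummary reported.

```python
class Builder:
    def config(self, key, value):
        return self


class Profile:
    def show(self):
        return "show"

    def dump(self, path):
        return f"dump:{path}"


class ClassProperty:
    """Simplified class-level property: resolves even when accessed on the class."""

    def __init__(self, fget):
        self.fget = fget

    def __get__(self, instance, owner=None):
        if owner is None:
            owner = type(instance)
        return self.fget(owner)


class Session:
    @ClassProperty
    def builder(cls):
        return Builder()

    @property
    def profile(self):
        return Profile()


# Accessed on the class, the class-level property yields a real Builder,
# so a dotted path such as Session.builder.config can be resolved.
builder_on_class = Session.builder

# Accessed on the class, an ordinary property yields the property object itself,
# which has no `dump` attribute; only an instance returns a Profile.
profile_on_class = Session.profile
profile_on_instance = Session().profile
```

Here `hasattr(profile_on_class, "dump")` is false while `hasattr(profile_on_instance, "dump")` is true, which is why a dotted path through the class cannot reach `dump` without documenting the returned class directly.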
I have a workaround 76e7387 by using autoclass, but it doesn't look consistent with the rest of the page, as shown below.
I'm wondering if we should have a follow-up designated for that part.
| Python/Pandas UDFs, which can be enabled by setting ``spark.python.profile`` configuration to ``true``. | ||
| Python/Pandas UDFs. | ||
| SparkContext-based |
I think you can just remove this, and just add one additional section called runtime profiler
cc @ueshin do you have other thoughts?
How about putting the new doc first?
- Identifying Hot Loops (Python Profilers)
  - Driver Side
    ...
  - Executor Side
    - Python/Pandas UDF
      Show the new profiler usage
    - Legacy (for RDD or non-Spark Connect)
      Put the current doc here
      - Python/Pandas UDF
      - Driver Side
I believe there are many existing users of SparkContext-based profilers. Shall we keep them in the debugging guide until SparkSession-based profilers gain more adoption and positive feedback? I'll adjust the order to show SparkSession-based profilers first, as @ueshin suggested. What do you think, @HyukjinKwon?
We will remove the "legacy" profilers for readability and clarity, and start preparing a migration guide.
Marked WIP to wait for #45378 to be merged first, and then adjusted.
Merged to master.

What changes were proposed in this pull request?
Documentation for SparkSession-based Profilers.
Why are the changes needed?
For easier user onboarding and better usability.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Manual test. Screenshots of the built HTML pages are shown below.
Was this patch authored or co-authored using generative AI tooling?
No.