
Conversation

@yliou
Contributor

@yliou yliou commented Mar 22, 2022

What changes were proposed in this pull request?

The proposed changes add a clickable field on the SQL UI showing the timing of Spark phases and rule timing information from QueryPlanningTracker, along with a separate REST endpoint exposing the timing statistics. Also adds a SQLConf to control the number of rules that show up on the UI and in the REST endpoint output (default value is 15).
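To make the effect of that conf concrete, here is a small self-contained sketch of the behavior it controls (the `RuleTiming` case class and `topRules` helper are illustrative stand-ins, not the actual code in this PR): only the slowest N rules are reported, defaulting to 15.

```scala
// Illustrative sketch: cap the number of rules whose timing is reported.
// RuleTiming and topRules are hypothetical names, not Spark's API.
case class RuleTiming(ruleName: String, timeMs: Double)

def topRules(all: Seq[RuleTiming], maxSize: Int = 15): Seq[RuleTiming] =
  all.sortBy(-_.timeMs).take(maxSize) // slowest rules first, truncated to maxSize
```

A caller would pass the SQLConf value as `maxSize`, so raising the conf shows more rules on the UI and in the REST output.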
UI Change: Screen Shot 2022-05-25 at 11 04 50 AM
New REST API endpoint output:
SPARK-38617RestOutput.txt

Why are the changes needed?

Currently a user cannot see Spark phase timing information, or which Spark rules take the most time, for a specific query execution on the UI. This PR adds that information.

Does this PR introduce any user-facing change?

Yes, on the UI.

How was this patch tested?

Tested locally and added a class to test the new endpoint.

@martin-g
Member

SPARK38617.txt is in rich text format. It would be easier to review if it was plain text.

Member

IMO it would be better to use a new method instead of overriding #toString().
toString might be called implicitly somewhere and cause a performance issue; I have seen this in the wild a few times.
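The reviewer's suggestion can be sketched as follows (class and field names here are illustrative, not the PR's actual code): expose the potentially expensive stats string through an explicitly named method, and keep `toString` cheap so implicit calls from logging or string interpolation cost nothing.

```scala
// Sketch: named accessor instead of an expensive toString override.
// CompileStats and its fields are hypothetical names.
class CompileStats(phases: Map[String, Long]) {
  // Callers opt in explicitly; nothing renders this by accident.
  def compileStatsString: String =
    phases.map { case (phase, ms) => s"$phase: ${ms}ms" }.mkString(", ")

  // toString stays cheap for implicit call sites (logs, interpolation).
  override def toString: String = s"CompileStats(${phases.size} phases)"
}
```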

Contributor Author

Renamed method to compileStatsString

Member

In what cases might this exception be thrown? I guess when #read() cannot find it.
It is weird that it does not return an Option or Either.
Also, planGraph() and executionsList() above do not try to recover, and def execution(executionId: Long) does the same as here.
IMO this class needs some love.
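The Option-returning shape the reviewer is asking for can be sketched like this (the `ExecutionStore` class is a stand-in for the actual KV-store wrapper, not Spark's code): absence becomes part of the return type, so callers pattern-match instead of wrapping calls in try/catch.

```scala
// Sketch: model "not found" with Option instead of an exception.
// ExecutionStore is a hypothetical stand-in for the store wrapper.
class ExecutionStore(data: Map[Long, String]) {
  // Throwing shape, as in the current code: every caller must catch.
  def read(id: Long): String =
    data.getOrElse(id, throw new NoSuchElementException(s"no execution $id"))

  // Option shape: the absence is visible in the signature.
  def find(id: Long): Option[String] = data.get(id)
}
```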

Contributor Author

I added this try/catch as a precaution after adding the if statement at SQLAppStatusListener.scala:393. Some unit tests were failing without the if statement. I can remove the try/catch if desired.

Member

Should offset and length be validated to be positive?
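The validation the reviewer has in mind might look like this minimal sketch (the `page` helper and its error messages are illustrative, not the actual SqlCompilerResource code): reject negative paging parameters up front instead of letting them silently produce an empty or odd slice.

```scala
// Sketch: validate paging parameters before slicing.
// `page` is a hypothetical helper, not part of this PR.
def page[T](items: Seq[T], offset: Int, length: Int): Seq[T] = {
  require(offset >= 0, s"offset must be non-negative, got $offset")
  require(length >= 0, s"length must be non-negative, got $length")
  items.slice(offset, offset + length)
}
```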

Contributor Author

It shouldn't be needed. I added SqlCompilerResource.scala based on SqlResource.scala.

Member

The filtering filters nothing with the two provided persons. Is this intentional?

Contributor Author

Yes

@ulysses-you
Contributor

#35856 has tried similar things, although at present it misses the details of the compile time for each phase. Its scope also includes AQE; I think it is good to go.

@yliou yliou force-pushed the SPARK-38617-SparkTimingMetrics branch from def7ee9 to 63cbefd Compare March 23, 2022 21:19
@yliou
Contributor Author

yliou commented Mar 23, 2022

SPARK38617.txt is in rich text format. It would be easier to review if it was plain text.

Attached new file with plain text format.

@yliou
Contributor Author

yliou commented Mar 24, 2022

#35856 has tried similar things, although at present it misses the details of the compile time for each phase. Its scope also includes AQE; I think it is good to go.

What does "good to go" mean here? I don't see an approval at #35856. From what I see, the changes at #35856 are different from and complementary to this PR: that PR adds information regarding AQE plan description changes, while this PR adds the timing of Spark phases and rule timing information. The changes in both PRs will help with troubleshooting.

@sigmod
Contributor

sigmod commented Mar 24, 2022

cc @jchen5

@ulysses-you
Contributor

@yliou I agree both can help with troubleshooting. And I have mentioned the difference: it misses the details of the compile time for each phase. That PR introduces a disk-based store, so I'm wondering if we can re-use the same code path. IMO it would be better to go with that PR first, at least.

@mridulm
Contributor

mridulm commented Mar 24, 2022

+CC @shardulm94, @thejdeep

@sigmod
Contributor

sigmod commented Mar 24, 2022

Can we use more meaningful JSON field names? E.g.,
value -> timeMs
name -> phase

{
   "phase" : "optimization",
   "timeMs" : "373"
 }, {
   "phase" : "analysis",
   "timeMs" : "12"
 } ],
 "ruleInfo" : [ {
   "ruleName" : "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation",
   "timeMs" : "32.659566",
   "numInvocations" : 5,
   "numEffectiveInvocations" : 3
 },
 ....
]

Also, can timeMs be a number instead of a string?
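The renamed, numeric-valued response could be modeled with plain case classes along these lines (the class names and the hand-rolled `toJson` are illustrative only; the real endpoint would go through Spark's API serialization):

```scala
// Sketch of the suggested response shape with a numeric timeMs field.
// PhaseTiming/RuleInfo and toJson are hypothetical, not the PR's code.
case class RuleInfo(ruleName: String, timeMs: Double,
                    numInvocations: Int, numEffectiveInvocations: Int)

case class PhaseTiming(phase: String, timeMs: Long) {
  // Minimal render just to show the numeric field unquoted in the output.
  def toJson: String = s"""{"phase":"$phase","timeMs":$timeMs}"""
}
```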

@yliou
Contributor Author

yliou commented Mar 24, 2022

@ulysses-you got it. I think it is possible to re-use the same code path but not sure how that'd work with ExecutionPage.scala to get the info on the UI.

@yliou
Contributor Author

yliou commented Mar 24, 2022

Can we use more meaningful JSON field names? E.g., value -> timeMs, name -> phase

{
   "phase" : "optimization",
   "timeMs" : "373"
 }, {
   "phase" : "analysis",
   "timeMs" : "12"
 } ],
 "ruleInfo" : [ {
   "ruleName" : "org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation",
   "timeMs" : "32.659566",
   "numInvocations" : 5,
   "numEffectiveInvocations" : 3
 },
 ....
]

Also, can timeMs be a number instead of a string?

Updated Json field names and changed timeMs to be a number

@linhongliu-db
Contributor

@ulysses-you @yliou adding debug information is always good. But I'm worried about two issues:

  1. We should be careful when adding information to the live UI's in-memory KV store; sooner or later, it will bring headaches.
  2. We should be careful when adding new REST API endpoints. Since both PRs are for troubleshooting, maybe we can use the same endpoint.

Contributor

The class names don't match: SqlCompilerResource vs SqlCompileResourceSuite (the r in Compiler; typo?)

Contributor Author

Thanks for catching the typo! I renamed the file.

@yliou yliou force-pushed the SPARK-38617-SparkTimingMetrics branch 2 times, most recently from ab89b29 to d47f6de Compare April 6, 2022 21:55
@yliou yliou force-pushed the SPARK-38617-SparkTimingMetrics branch 2 times, most recently from 851e44a to a492b63 Compare May 25, 2022 18:11
@yliou
Contributor Author

yliou commented May 25, 2022

Updated the PR to use more concise rule names so that Spark event logs aren't impacted in size as much as before.

@yliou yliou force-pushed the SPARK-38617-SparkTimingMetrics branch from c30d33e to 2a513a8 Compare August 23, 2022 22:03
@github-actions

github-actions bot commented Dec 2, 2022

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Dec 2, 2022
@github-actions github-actions bot closed this Dec 3, 2022
@yliou
Contributor Author

yliou commented Apr 4, 2023

@dependabot reopen

@yliou
Contributor Author

yliou commented Jun 12, 2023

Hi @ulysses-you @linhongliu-db now that #35856 got reverted and #38567 got merged, should I create another pull request for this feature to try to get it merged?



7 participants