
Conversation

@linhongliu-db (Contributor) commented Mar 15, 2022

What changes were proposed in this pull request?

In Spark, the UI lacks troubleshooting abilities. For example:

  • AQE plan changes are not available
  • the plan description of a large plan is truncated

This is because the live UI depends on an in-memory KV store, so we always have to
worry about stability when adding more information to it. Therefore, it's better to
add a disk-based store to hold the additional information.

This PR includes:

  • A disk-based KV store in AppStatusStore that allows adding information that does not fit in memory
  • A separate listener that collects diagnostic data and saves it to the disk store (a minimal sketch is shown below)
  • A new REST API endpoint to expose the diagnostics data (AQE plan changes, untruncated plan)
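For illustration, here is a minimal sketch of what such a listener could look like, assuming access to Spark's internal ElementTrackingStore; the class name, the record type, and its fields are assumptions for this sketch, not the PR's actual code:

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
import org.apache.spark.status.ElementTrackingStore

// Illustrative record type; the PR defines its own store classes.
case class PlanChangeRecord(executionId: Long, physicalPlan: String, updateTime: Long)

// Sketch of a diagnostics listener: capture each AQE plan update and persist
// it through the (disk-backed) ElementTrackingStore.
class DiagnosticListenerSketch(store: ElementTrackingStore) extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLAdaptiveExecutionUpdate =>
      // One record per AQE re-optimization of the physical plan.
      store.write(PlanChangeRecord(e.executionId, e.physicalPlanDescription,
        System.currentTimeMillis()))
    case _ => // ignore events this listener does not diagnose
  }
}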

Why are the changes needed?

Better troubleshooting ability is highly needed: without it, it's hard to
debug AQE-related issues. Once we solve the blockers, we can make a long-term plan to improve
observability.

Does this PR introduce any user-facing change?

Yes, a new REST API to expose more information about the application.
REST API endpoint: http://localhost:4040/api/v1/applications/local-1647312132944/diagnostics/sql/0
Example:

$ ./bin/spark-shell --conf spark.appStatusStore.diskStore.dir=/tmp/diskstore
spark-shell>
val df = sql(
  """SELECT t1.*, t2.c, t3.d
    |  FROM (SELECT 1 as a, 'b' as b) t1
    |  JOIN (SELECT 1 as a, 'c' as c) t2
    |  ON t1.a = t2.a
    |  JOIN (SELECT 1 as a, 'd' as d) t3
    |  ON t2.a = t3.a
    |""".stripMargin)
df.show()

Output:

{
  "id" : 0,
  "physicalPlan" : "<plan description string>",
  "submissionTime" : "2022-03-15T03:41:42.226GMT",
  "completionTime" : "2022-03-15T03:41:43.387GMT",
  "errorMessage" : "",
  "planChanges" : [ {
    "physicalPlan" : "<plan description string>",
    "updateTime" : "2022-03-15T03:41:42.268GMT"
  }, {
    "physicalPlan" : "<plan description string>",
    "updateTime" : "2022-03-15T03:41:43.262GMT"
  } ]
}

How was this patch tested?

Manually tested.

Member

We can use this even when the app status fits in memory, can't we?

Member

Do we have multiple configurations under the prefix, spark.appStatusStore.diskStore.?

Contributor Author

For now, no other config starts with spark.appStatusStore.diskStore.

Member

If there is no other config, the Apache Spark community's configuration naming guide is not to introduce a namespace; the trailing . is removed. In this case,

- spark.appStatusStore.diskStore.dir
+ spark.appStatusStore.diskStoreDir
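For reference, a hedged sketch of how the renamed config could be declared with Spark's internal ConfigBuilder; the doc text and version string are illustrative:

import org.apache.spark.internal.config.ConfigBuilder

val DISK_STORE_DIR = ConfigBuilder("spark.appStatusStore.diskStoreDir")
  .doc("Local directory where the disk-based status store keeps diagnostic data.")
  .version("3.4.0")
  .stringConf
  .createOptional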

Contributor Author

Done. Thanks for the guidance.

Member

I guess this should be 3.4.0, because today is the feature freeze date for Apache Spark 3.3.0 and this PR arrives a little late for review.

@linhongliu-db (Contributor Author), Mar 17, 2022

@dongjoon-hyun thanks for the review!
I'm wondering if it's possible to include this in 3.3.0. Here are my two cents:

  1. The community hasn't paid attention to Spark's troubleshooting ability for a while. If we can deliver this feature earlier, it could signal that the community is starting to improve debuggability, which may attract others to contribute (earlier).
  2. I know the timing is not good, but as you may see, this PR is designed to reduce the impact on the driver while introducing useful features (i.e. showing AQE plan changes): a separate listener, a separate event queue, a disk store instead of memory, and a REST API instead of the UI. I know you have concerns about the disk space; I think that's something we can resolve.

Hence, I think it's worth considering.

Member

Could you add some class description, please?

Contributor Author

will do

Member

Is this a dedicated queue for this listener?

Contributor Author

Yes. Collecting and saving the diagnostics could be slow, and I don't want that to impact other critical listeners, e.g. the UI listener.
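A hedged sketch of that wiring, as it would appear inside Spark (listenerBus and addToQueue are private[spark] APIs, and the queue name here is an assumption):

// Attach the diagnostics listener to its own event queue so that slow disk
// writes never block the queue feeding the live UI listeners.
val diagListener = new DiagnosticListenerSketch(store)
sc.listenerBus.addToQueue(diagListener, "diagnostics")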

Contributor

rename to sqlDiagnostics

Contributor Author

@mridulm
I put this listener in the SQL folder because I need to capture some SQL events. But in the future, this listener could capture other events to provide diagnostics for other components (e.g. the executor), so I think a general name may be better.
That said, sqlDiagnostics is also fine with me.

Member

Do you know the required disk size with Int.MaxValue? It could kill the driver pod due to OutOfDisk.

@linhongliu-db (Contributor Author), Mar 17, 2022

Good point. It might be dangerous to store the untruncated plan description of a very large plan, but it's also hard to estimate the upper bound, because theoretically there is no limit on the query plan size.
How about we add a flag to control plan truncation for the disk store? For example: spark.appStatusStore.diskStore.saveUntruncatedPlan=true
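A minimal sketch of how such a flag could gate what gets written to the disk store; the function name, the flag wiring, and the fallback cap are hypothetical:

// Keep the full plan only when the user opts in; otherwise fall back to a
// capped string. The cap below is illustrative, not the PR's actual limit.
def planDescriptionToStore(fullPlan: String, saveUntruncatedPlan: Boolean): String = {
  val maxChars = 100 * 1024
  if (saveUntruncatedPlan || fullPlan.length <= maxChars) fullPlan
  else fullPlan.take(maxChars) + s"... [truncated from ${fullPlan.length} chars]"
}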

Contributor Author

I tried very large plans locally with a 2GB-memory Spark driver. It turns out Spark itself OOMs long before the query plan string becomes too large (~10MB). In addition, there is a retained-count limit for diagnostic data (1000 by default), so the memory/disk consumption should be fine.
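For context, ElementTrackingStore supports count-based cleanup triggers, so the retained limit could be enforced roughly like this (the record class and the eviction body are illustrative assumptions):

// Once more than 1000 diagnostic records are retained, evict the oldest ones.
store.addTrigger(classOf[PlanChangeRecord], 1000) { count =>
  // delete the (count - 1000) oldest entries, e.g. by iterating an index
  // on updateTime and removing the excess keys from the store
  ()
}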

Contributor

Agree with @dongjoon-hyun, we should impose a limit here.

@shardulm94 can comment more on his observations with something similar in terms of the increase in cost; he did a streaming serialization implementation to get around the issue (he was writing to HDFS, so that solution won't apply directly here).

Contributor Author

done

Member

FYI, currently, all disk stores are broken on Apple Silicon.

Contributor Author

Thanks for letting me know. If the failure happens during initialization, I think we are safe here.

Member

If you want to add this in this PR, the REST API should be documented here.

Contributor Author

will do

@dongjoon-hyun (Member) left a comment

Hi, @linhongliu-db. Thank you for making a PR. I left some comments.
I'd recommend considering this as an Apache Spark 3.4 feature.

@ulysses-you (Contributor)

Hi @linhongliu-db, thank you for the feature!

I wonder if it is possible to give a more accurate time for the plan changes, in particular for the re-optimize phase and the query stage optimization in AQE. The updateTime by itself is only useful together with the previous stage's finish time, which is quite limited; otherwise it carries little meaning.

We have encountered some performance issues, like SPARK-38406 and SPARK-38401. This may be out of scope for this PR, but it would be very helpful if we could show more details. And as you mentioned, it would also be nice to put some summary into the UI.

@mridulm (Contributor) commented Mar 17, 2022

+CC @shardulm94, @thejdeep - since you worked on something similar recently.

@linhongliu-db (Contributor Author)

@dongjoon-hyun, I addressed all the comments; could you please review this PR one more time? I also changed the version number to 3.4.

@linhongliu-db (Contributor Author)

@ulysses-you sure, I'll consider it. But in this PR, I'd like to minimize the changes in catalyst and execution to keep the PR easy to review. Once it's accepted, we can keep improving and adding more diagnostic information.

@dongjoon-hyun (Member)

Thank you for your update, @linhongliu-db .

@linhongliu-db (Contributor Author)

@dongjoon-hyun, just a soft ping. Do we have anything else to do to move this PR forward?

Contributor

If we have enabled the disk store, thoughts on using it for everything at the driver?

Contributor Author

For now, we can't. The SQL UI needs to keep the task metrics in memory in order to render the UI quickly. But in the future, I think we can build a two-layer store.

Contributor

Move this under sql?

Contributor Author

How about:
/applications/[app-id]/diagnostics/sql/[execution-id]
Then, in the future, we can have:

  diagnostics/executor/
  diagnostics/streaming/
  diagnostics/environment/
  diagnostics/xxx


Contributor

nit: ${MAX_TO_STRING_FIELDS.key}

Contributor Author

done

Contributor

Int.MaxValue -> event.qe.sparkSession.sessionState.conf.maxToStringFieldsForDiagnostic

Contributor Author

done

Contributor

We should use event.qe.sparkSession.sessionState.conf rather than SQLConf.get, since the listener runs in another thread.
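To illustrate the thread-safety point with a hedged fragment (the surrounding listener code is assumed):

// SQLConf.get resolves a thread-local / active-session conf, which on the
// listener thread is not the query's session, so use the event's own session.
val conf = event.qe.sparkSession.sessionState.conf  // correct: the query's session conf
// val conf = SQLConf.get                           // wrong here: listener thread
val maxFields = conf.maxToStringFieldsForDiagnostic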

Contributor Author

done

@ulysses-you (Contributor), Mar 31, 2022

event.physicalPlanDescription uses the old maxToStringFields to build the explain string; do you want to do the same thing here, i.e. re-explain it?

Just a small concern: there is a cost to doing the explain if the plan is large, but I don't have a better idea than re-explaining it.

Contributor Author

I'm not too worried about the cost because a separate listener with a separate event queue won't slow down other listeners.
But I do want to make sure everything we add is necessary. Usually it's enough to output full fields only in the final plan, and the plan change history can keep truncated plans, because, IIUC, AQE changes the operators while the expressions usually stay unchanged.

Contributor Author

And if we need more fields in the plan change history, it's always easier to add more later than to remove something.

@dongjoon-hyun (Member) commented Apr 4, 2022

Here is an update from the master branch. Previously, all disk stores were unavailable on Apple Silicon:

> FYI, currently, all disk stores are broken on Apple Silicon.

After the following commits, RocksDB can now be used on all OSes for this PR:

  • [SPARK-38257][BUILD] Upgrade rocksdbjni to 7.0.3
  • [SPARK-38678][TESTS] Enable RocksDB tests on Apple Silicon on MacOS

Member

Why do we need to share the same kvStore?

Contributor Author

@dongjoon-hyun
I used the same approach as the in-memory kvStore (sketched below), which is:

  1. define a shared kvStore in SparkContext
  2. share the kvStore with the listeners that need to update the store (Jobs UI listener, SQL UI listener)
  3. share the kvStore with the components that need to read the store (web UI, REST API)
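A simplified sketch of that sharing pattern; the factory method is hypothetical, and the real field names in SparkContext differ:

import org.apache.spark.SparkConf
import org.apache.spark.sql.execution.ui.SQLAppStatusListener
import org.apache.spark.status.ElementTrackingStore

// One store instance is created by the driver and handed to every consumer.
val kvStore: ElementTrackingStore = createDiskBackedStore(conf) // hypothetical factory
val sqlUiListener = new SQLAppStatusListener(conf, kvStore, live = true)
val diagListener = new DiagnosticListenerSketch(kvStore)
// The same kvStore instance is later read by the web UI and the REST API.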

@linhongliu-db (Contributor Author) commented Apr 7, 2022

cc @cloud-fan to review as well.

@linhongliu-db (Contributor Author) commented Apr 7, 2022

> Here is an update from the master branch. Previously, all disk stores were unavailable on Apple Silicon.

@dongjoon-hyun I updated the code based on the latest master. Thanks for the comments.

import org.apache.spark.sql.internal.StaticSQLConf.UI_RETAINED_EXECUTIONS
import org.apache.spark.status.{ElementTrackingStore, KVUtils}

class DiagnosticListener(
Contributor

Can we add some classdoc?
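For example, a hedged sketch of the kind of classdoc that could be added; the wording is illustrative, not the PR's final doc:

/**
 * A SparkListener that collects diagnostic data (e.g. AQE plan-change history
 * and untruncated plan descriptions) and saves it to a disk-based KV store,
 * so the data can be served through the diagnostics REST API without growing
 * the in-memory store that backs the live UI.
 */
class DiagnosticListener( // ... constructor as in the PR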

@cloud-fan (Contributor)

LGTM, @dongjoon-hyun do you have more comments?

@dongjoon-hyun (Member)

Sorry for the delay, @linhongliu-db and @cloud-fan. I dismissed my previous review because it had already gone stale. Feel free to merge, @cloud-fan.

@cloud-fan (Contributor)

The test job: https://github.com/linhongliu-db/spark/actions/runs/2159086857

@cloud-fan (Contributor)

Thanks, merging to master!

@cloud-fan closed this in 4274fb8, Apr 14, 2022