
Conversation

@linhongliu-db (Contributor) commented Mar 15, 2022

What changes were proposed in this pull request?

In Spark, the UI lacks troubleshooting abilities. For example:

  • AQE plan changes are not available
  • the plan description of a large plan is truncated

This is because the live UI depends on an in-memory KV store, so we always have to
worry about stability when adding more information to it. Therefore, it's better to
add a disk-based store to hold the additional information.

This PR includes:

  • A disk-based KV store in AppStatusStore that allows adding information that does not fit in memory
  • A separate listener that collects diagnostic data and saves it to the disk store (a minimal sketch is shown below)
  • A new REST API endpoint to expose the diagnostics data (AQE plan changes, untruncated plan)
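For illustration, here is a minimal sketch of what such a listener could look like, assuming access to Spark's internal ElementTrackingStore; the class name, the record type, and its fields are assumptions for this sketch, not the PR's actual code:

import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
import org.apache.spark.sql.execution.ui.SparkListenerSQLAdaptiveExecutionUpdate
import org.apache.spark.status.ElementTrackingStore

// Illustrative record type; the PR defines its own store classes.
case class PlanChangeRecord(executionId: Long, physicalPlan: String, updateTime: Long)

// Sketch of a diagnostics listener: capture each AQE plan update and persist
// it through the (disk-backed) ElementTrackingStore.
class DiagnosticListenerSketch(store: ElementTrackingStore) extends SparkListener {
  override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
    case e: SparkListenerSQLAdaptiveExecutionUpdate =>
      // One record per AQE re-optimization of the physical plan.
      store.write(PlanChangeRecord(e.executionId, e.physicalPlanDescription,
        System.currentTimeMillis()))
    case _ => // ignore events this listener does not diagnose
  }
}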

Why are the changes needed?

Better troubleshooting ability is highly needed: without it, it's hard to
debug AQE-related issues. Once we solve the blockers, we can make a long-term plan to improve
observability.

Does this PR introduce any user-facing change?

Yes, a new REST API to expose more information about the application.
REST API endpoint: http://localhost:4040/api/v1/applications/local-1647312132944/diagnostics/sql/0
Example:

$ ./bin/spark-shell --conf spark.appStatusStore.diskStore.dir=/tmp/diskstore
spark-shell>
val df = sql(
  """SELECT t1.*, t2.c, t3.d
    |  FROM (SELECT 1 as a, 'b' as b) t1
    |  JOIN (SELECT 1 as a, 'c' as c) t2
    |  ON t1.a = t2.a
    |  JOIN (SELECT 1 as a, 'd' as d) t3
    |  ON t2.a = t3.a
    |""".stripMargin)
df.show()

Output:

{
  "id" : 0,
  "physicalPlan" : "<plan description string>",
  "submissionTime" : "2022-03-15T03:41:42.226GMT",
  "completionTime" : "2022-03-15T03:41:43.387GMT",
  "errorMessage" : "",
  "planChanges" : [ {
    "physicalPlan" : "<plan description string>",
    "updateTime" : "2022-03-15T03:41:42.268GMT"
  }, {
    "physicalPlan" : "<plan description string>",
    "updateTime" : "2022-03-15T03:41:43.262GMT"
  } ]
}

How was this patch tested?

Manually tested.

Member

We can use this even when the app status fits in memory, can't we?

Member

Do we have multiple configurations under the prefix, spark.appStatusStore.diskStore.?

Contributor Author

For now, no other config starts with spark.appStatusStore.diskStore.

Member

If there is no other config, the Apache Spark community's configuration naming guide is not to introduce a namespace; the trailing . is removed. In this case,

- spark.appStatusStore.diskStore.dir
+ spark.appStatusStore.diskStoreDir
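For reference, a hedged sketch of how the renamed config could be declared with Spark's internal ConfigBuilder; the doc text and version string are illustrative:

import org.apache.spark.internal.config.ConfigBuilder

val DISK_STORE_DIR = ConfigBuilder("spark.appStatusStore.diskStoreDir")
  .doc("Local directory where the disk-based status store keeps diagnostic data.")
  .version("3.4.0")
  .stringConf
  .createOptional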

Contributor Author

Done. Thanks for the guidance.

Member

I guess this should be 3.4.0, because today is the feature freeze date for Apache Spark 3.3.0 and this PR arrives a little late for review.

@linhongliu-db (Contributor Author), Mar 17, 2022

@dongjoon-hyun thanks for the review!
I'm wondering if it's possible to include this in 3.3.0. Here are my two cents:

  1. The community hasn't paid attention to Spark's troubleshooting ability for a while. If we can deliver this feature earlier, it could signal that the community is starting to improve debuggability, which may attract others to contribute (earlier).
  2. I know the timing is not good, but as you may see, this PR is designed to reduce the impact on the driver while introducing useful features (i.e. showing AQE plan changes): a separate listener, a separate event queue, a disk store instead of memory, and a REST API instead of the UI. I know you have concerns about the disk space; I think that's something we can resolve.

Hence, I think it's worth considering.

Member

Could you add some class description, please?

Contributor Author

will do

Member

Is this a dedicated queue for this listener?

Contributor Author

Yes. Collecting and saving the diagnostics could be slow, and I don't want that to impact other critical listeners, e.g. the UI listener.
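A hedged sketch of that wiring, as it would appear inside Spark (listenerBus and addToQueue are private[spark] APIs, and the queue name here is an assumption):

// Attach the diagnostics listener to its own event queue so that slow disk
// writes never block the queue feeding the live UI listeners.
val diagListener = new DiagnosticListenerSketch(store)
sc.listenerBus.addToQueue(diagListener, "diagnostics")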

Contributor

rename to sqlDiagnostics

Contributor Author

@mridulm
I put this listener in the SQL folder because I need to capture some SQL events. But in the future, this listener could capture other events to provide diagnostics for other components (e.g. the executor), so I think a general name may be better.
That said, sqlDiagnostics is also fine with me.

Member

Do you know the required disk size with Int.MaxValue? It could kill the driver pod due to OutOfDisk.

@linhongliu-db (Contributor Author), Mar 17, 2022

Good point. It might be dangerous to store the untruncated plan description of a very large plan, but it's also hard to estimate the upper bound, because theoretically there is no limit on the query plan size.
How about we add a flag to control plan truncation for the disk store? For example: spark.appStatusStore.diskStore.saveUntruncatedPlan=true
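A minimal sketch of how such a flag could gate what gets written to the disk store; the function name, the flag wiring, and the fallback cap are hypothetical:

// Keep the full plan only when the user opts in; otherwise fall back to a
// capped string. The cap below is illustrative, not the PR's actual limit.
def planDescriptionToStore(fullPlan: String, saveUntruncatedPlan: Boolean): String = {
  val maxChars = 100 * 1024
  if (saveUntruncatedPlan || fullPlan.length <= maxChars) fullPlan
  else fullPlan.take(maxChars) + s"... [truncated from ${fullPlan.length} chars]"
}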

Contributor Author

I tried very large plans locally with a 2GB-memory Spark driver. It turns out Spark itself OOMs long before the query plan string becomes too large (~10MB). In addition, there is a retained-count limit for diagnostic data (1000 by default), so the memory/disk consumption should be fine.
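For context, ElementTrackingStore supports count-based cleanup triggers, so the retained limit could be enforced roughly like this (the record class and the eviction body are illustrative assumptions):

// Once more than 1000 diagnostic records are retained, evict the oldest ones.
store.addTrigger(classOf[PlanChangeRecord], 1000) { count =>
  // delete the (count - 1000) oldest entries, e.g. by iterating an index
  // on updateTime and removing the excess keys from the store
  ()
}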

Contributor

Agree with @dongjoon-hyun, we should impose a limit here.

@shardulm94 can comment more on his observations with something similar in terms of the increase in cost; he did a streaming serialization implementation to get around the issue (he was writing to HDFS, so that solution won't apply directly here).

Contributor Author

done

Member

FYI, currently, all disk stores are broken on Apple Silicon.

Contributor Author

Thanks for letting me know. If the failure happens during initialization, I think we are safe here.

Member

If you want to add this in this PR, the REST API should be documented here.

Contributor Author

will do

@dongjoon-hyun (Member) left a comment

Hi, @linhongliu-db. Thank you for making a PR. I left some comments.
I'd recommend considering this as an Apache Spark 3.4 feature.

@ulysses-you (Contributor)

Hi @linhongliu-db, thank you for the feature!

I wonder if it is possible to give a more accurate time for the plan changes, in particular for the re-optimize phase and the query stage optimization in AQE. The updateTime by itself is only useful together with the previous stage's finish time, which is quite limited; otherwise it carries little meaning.

We have encountered some performance issues, like SPARK-38406 and SPARK-38401. This may be out of scope for this PR, but it would be very helpful if we could show more details. And as you mentioned, it would also be nice to put some summary into the UI.

@mridulm (Contributor) commented Mar 17, 2022

+CC @shardulm94, @thejdeep - since you worked on something similar recently.

@linhongliu-db (Contributor Author)

@dongjoon-hyun, I addressed all the comments; could you please review this PR one more time? I also changed the version number to 3.4.

@linhongliu-db (Contributor Author)

@ulysses-you sure, I'll consider it. But in this PR, I'd like to minimize the changes in catalyst and execution to keep the PR easy to review. Once it's accepted, we can keep improving and adding more diagnostic information.

@dongjoon-hyun (Member)

Thank you for your update, @linhongliu-db .

@linhongliu-db (Contributor Author)

@dongjoon-hyun, just a soft ping. Do we have anything else to do to move this PR forward?

Contributor

If we have enabled the disk store, thoughts on using it for everything at the driver?

Contributor Author

For now, we can't. The SQL UI needs to keep the task metrics in memory in order to render the UI quickly. But in the future, I think we can build a two-layer store.

Contributor

Move this under sql?

Contributor Author

How about:
/applications/[app-id]/diagnostics/sql/[execution-id]
Then, in the future, we can have:

  diagnostics/executor/
  diagnostics/streaming/
  diagnostics/environment/
  diagnostics/xxx


Contributor

nit: ${MAX_TO_STRING_FIELDS.key}

Contributor Author

done

Contributor

Int.MaxValue -> event.qe.sparkSession.sessionState.conf.maxToStringFieldsForDiagnostic

Contributor Author

done

Contributor

We should use event.qe.sparkSession.sessionState.conf rather than SQLConf.get, since the listener runs in another thread.
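To illustrate the thread-safety point with a hedged fragment (the surrounding listener code is assumed):

// SQLConf.get resolves a thread-local / active-session conf, which on the
// listener thread is not the query's session, so use the event's own session.
val conf = event.qe.sparkSession.sessionState.conf  // correct: the query's session conf
// val conf = SQLConf.get                           // wrong here: listener thread
val maxFields = conf.maxToStringFieldsForDiagnostic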

Contributor Author

done

@ulysses-you (Contributor), Mar 31, 2022

event.physicalPlanDescription uses the old maxToStringFields to build the explain string; do you want to do the same thing here, i.e. re-explain it?

Just a small concern: there is a cost to doing the explain if the plan is large, but I don't have a better idea than re-explaining it.

Contributor Author

I'm not too worried about the cost because a separate listener with a separate event queue won't slow down other listeners.
But I do want to make sure everything we add is necessary. Usually it's enough to output full fields only in the final plan, and the plan change history can keep truncated plans, because, IIUC, AQE changes the operators while the expressions usually stay unchanged.

Contributor Author

And if we need more fields in the plan change history, it's always easier to add more later than to remove something.

@dongjoon-hyun (Member) commented Apr 4, 2022

Here is an update from the master branch. Previously, all disk stores were unavailable on Apple Silicon:

> FYI, currently, all disk stores are broken on Apple Silicon.

After the following commits, RocksDB can now be used on all OSes for this PR:

  • [SPARK-38257][BUILD] Upgrade rocksdbjni to 7.0.3
  • [SPARK-38678][TESTS] Enable RocksDB tests on Apple Silicon on MacOS

Member

Why do we need to share the same kvStore?

Contributor Author

@dongjoon-hyun
I used the same approach as the in-memory kvStore (sketched below), which is:

  1. define a shared kvStore in SparkContext
  2. share the kvStore with the listeners that need to update the store (Jobs UI listener, SQL UI listener)
  3. share the kvStore with the components that need to read the store (web UI, REST API)
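A simplified sketch of that sharing pattern; the factory method is hypothetical, and the real field names in SparkContext differ:

import org.apache.spark.SparkConf
import org.apache.spark.sql.execution.ui.SQLAppStatusListener
import org.apache.spark.status.ElementTrackingStore

// One store instance is created by the driver and handed to every consumer.
val kvStore: ElementTrackingStore = createDiskBackedStore(conf) // hypothetical factory
val sqlUiListener = new SQLAppStatusListener(conf, kvStore, live = true)
val diagListener = new DiagnosticListenerSketch(kvStore)
// The same kvStore instance is later read by the web UI and the REST API.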

@linhongliu-db (Contributor Author) commented Apr 7, 2022

cc @cloud-fan to review as well.

@linhongliu-db (Contributor Author) commented Apr 7, 2022

> Here is an update from the master branch. Previously, all disk stores were unavailable on Apple Silicon.

@dongjoon-hyun I updated the code based on the latest master. Thanks for the comments.

import org.apache.spark.sql.internal.StaticSQLConf.UI_RETAINED_EXECUTIONS
import org.apache.spark.status.{ElementTrackingStore, KVUtils}

class DiagnosticListener(
Contributor

Can we add some classdoc?
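For example, a hedged sketch of the kind of classdoc that could be added; the wording is illustrative, not the PR's final doc:

/**
 * A SparkListener that collects diagnostic data (e.g. AQE plan-change history
 * and untruncated plan descriptions) and saves it to a disk-based KV store,
 * so the data can be served through the diagnostics REST API without growing
 * the in-memory store that backs the live UI.
 */
class DiagnosticListener( // ... constructor as in the PR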

@cloud-fan (Contributor)

LGTM, @dongjoon-hyun do you have more comments?

@dongjoon-hyun (Member)

Sorry for the delay, @linhongliu-db and @cloud-fan. I dismissed my previous review because it had already gone stale. Feel free to merge, @cloud-fan.

@cloud-fan (Contributor)

The test job: https://github.com/linhongliu-db/spark/actions/runs/2159086857

@cloud-fan (Contributor)

Thanks, merging to master!

@cloud-fan closed this in 4274fb8, Apr 14, 2022