Skip to content

Conversation

@pan3793
Copy link
Member

@pan3793 pan3793 commented Oct 31, 2025

What changes were proposed in this pull request?

Escalate display ordering priority of connect server execution threads in the thread dump page by defining a custom threadInfoOrdering.

For connect server runs in local deploy mode, tasks also run in driver, in driver thread dump page, task threads display first, then connect operation threads.

For connect server runs in other deploy modes (YARN, K8s, Standalone), in driver thread dump page, connect operation threads display first.

Why are the changes needed?

Currently, Spark executor displays the task threads first on the thread dump page, this does improve the user experience in troubleshooting "task stuck" issues.

There are a lot of similar stuck issues on the driver's side too, e.g. driver may be stuck at the HMS/HDFS RPC call on the query planning phase, displaying the connect operation threads at the top on the driver thread dump pages makes the users easy to diagnose driver stuck issues for the Connect server.

Does this PR introduce any user-facing change?

Yes, it affects the live UI thread dump page.

How was this patch tested?

Add UT. Also manually tested.

Start a connect server in local mode, and use a client to run some queries.

Note, since it runs in local mode, the task also executes at the driver side, the page display task threads first, then connect operation threads.

Xnip2025-10-31_18-14-16

Was this patch authored or co-authored using generative AI tooling?

No.

@pan3793
Copy link
Member Author

pan3793 commented Oct 31, 2025

cc @sarutak @LuciferYang @HyukjinKwon, could you please take a look?

@pan3793
Copy link
Member Author

pan3793 commented Oct 31, 2025

cc @grundprinzip

@HyukjinKwon
Copy link
Member

cc @hvanhovell @grundprinzip

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you revise the following section of the PR description? The AS-IS sentences seem to advertise the current Spark executor behavior, @pan3793 .

Why are the changes needed?

Currently, Spark executor displays the task threads first on the thread dump page, this does improve the user experience in troubleshooting "task stuck" issues. There are a lot of similar stuck issues on the driver's side too, e.g. driver may be stuck at the HMS RPC call

@pan3793
Copy link
Member Author

pan3793 commented Nov 3, 2025

@dongjoon-hyun, I mean the executor's behavior is good, this PR replicates a similar behavior on the driver side, for the Connect server case. I modified the PR description to explain more, I hope it is clear now

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for revisions. I got your intention.

Could you revise it a little more correctly by mentioning the full ordering priority, e.g., task threads > connect operation threads > normal thread ?

I mean the executor's behavior is good, this PR replicates a similar behavior on the driver side, for the Connect server case.

As you mentioned, the current title is misleading because it's not true in some cases.

Note, since it runs in local mode, the task also executes at the driver side, the page display task threads first, then connect operation threads.

For instance, Apache Spark 3.4+ officially provides local mode in the small production jobs as a single-pod job to eliminate the overhead of the executor pod resource allocation and coordination.

In short, I hope the PR title should mention more correctly instead of connect server execution first phrase.

@pan3793 pan3793 changed the title [SPARK-54115][CORE][UI][CONNECT] Display connect server execution threads first in thread dump page [SPARK-54115][CORE][UI][CONNECT] Escalate display ordering priority of connect server execution threads in thread dump page Nov 4, 2025
@pan3793
Copy link
Member Author

pan3793 commented Nov 4, 2025

@dongjoon-hyun I updated the PR title and description, could you please take another look?

@pan3793 pan3793 changed the title [SPARK-54115][CORE][UI][CONNECT] Escalate display ordering priority of connect server execution threads in thread dump page [SPARK-54115][CORE][UI][CONNECT] Escalate display ordering priority of connect server operation threads in thread dump page Nov 4, 2025
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you.

@dongjoon-hyun dongjoon-hyun changed the title [SPARK-54115][CORE][UI][CONNECT] Escalate display ordering priority of connect server operation threads in thread dump page [SPARK-54115][CORE][UI] Escalate display ordering priority of connect server operation threads in thread dump page Nov 4, 2025
dongjoon-hyun pushed a commit that referenced this pull request Nov 4, 2025
… server operation threads in thread dump page

### What changes were proposed in this pull request?

Escalate display ordering priority of connect server execution threads in the thread dump page by defining a custom `threadInfoOrdering`.

For connect server runs in local deploy mode, tasks also run in driver, in driver thread dump page, task threads display first, then connect operation threads.

For connect server runs in other deploy modes (YARN, K8s, Standalone), in driver thread dump page, connect operation threads display first.

### Why are the changes needed?

Currently, Spark executor displays the task threads first on the thread dump page, this does improve the user experience in troubleshooting "task stuck" issues.

There are a lot of similar stuck issues on the driver's side too, e.g. driver may be stuck at the HMS/HDFS RPC call on the query planning phase, displaying the connect operation threads at the top on the driver thread dump pages makes the users easy to diagnose driver stuck issues for the Connect server.

### Does this PR introduce _any_ user-facing change?

Yes, it affects the live UI thread dump page.

### How was this patch tested?

Add UT. Also manually tested.

Start a connect server in local mode, and use a client to run some queries.

Note, since it runs in local mode, the task also executes at the driver side, the page display task threads first, then connect operation threads.

<img width="1561" height="1038" alt="Xnip2025-10-31_18-14-16" src="https://github.com/user-attachments/assets/0c48412a-c6cd-4422-8126-226888ce2c56" />

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52816 from pan3793/SPARK-54115.

Authored-by: Cheng Pan <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit bdac5cd)
Signed-off-by: Dongjoon Hyun <[email protected]>
@dongjoon-hyun
Copy link
Member

Merged to master/4.1. Thank you again, @pan3793 and @LuciferYang .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants