Capture thread dump on ShardLockObtainFailedException #93458
Merged
elasticsearchmachine
merged 3 commits into
elastic:main
from
DaveCTurner:2023-02-02-hot-threads-on-shard-lock-failure
Feb 2, 2023
Conversation
We sometimes see a `ShardLockObtainFailedException` when a shard failed to shut down as fast as we expected, often because a node left and rejoined the cluster. Sometimes this is because it was held open by ongoing scrolls or PITs, but other times it may be because the shutdown process itself is too slow. With this commit we add the ability to capture and log a thread dump at the time of the failure to give us more information about where the shutdown process might be running slowly. Relates elastic#93226
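The idea described above can be sketched in plain Java as follows. This is only an illustration, not the code from this PR: it uses the JDK's `ThreadMXBean` rather than any Elasticsearch utility, and the names `ThreadDumpSketch`, `LockAttempt`, and `tryWithDiagnostics` are hypothetical. A `TimeoutException` stands in for `ShardLockObtainFailedException`.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.TimeoutException;
import java.util.function.Consumer;

// Hypothetical sketch: capture a thread dump at the moment a lock attempt fails.
final class ThreadDumpSketch {

    /** Hypothetical stand-in for a shard lock attempt that gives up after a timeout. */
    public interface LockAttempt {
        void run() throws TimeoutException;
    }

    /** Renders the name, state, and stack of every live thread via the JDK's ThreadMXBean. */
    public static String captureThreadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder dump = new StringBuilder();
        // false, false: skip locked-monitor and synchronizer details to keep the capture cheap
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info == null) {
                continue; // defensive: skip any thread that has no info available
            }
            dump.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
        }
        return dump.toString();
    }

    /**
     * Runs the lock attempt; on failure, passes the failure message plus a
     * thread dump to the given log sink and reports that the lock was not obtained.
     */
    public static boolean tryWithDiagnostics(LockAttempt attempt, Consumer<String> log) {
        try {
            attempt.run();
            return true;
        } catch (TimeoutException e) {
            log.accept("lock not obtained: " + e.getMessage() + "\n" + captureThreadDump());
            return false;
        }
    }
}
```

Taking the dump inside the `catch` block matters: the point is to see what other threads are doing while the lock is still held, so the capture has to happen at the time of the failure rather than later. A real implementation would presumably also limit how often a dump is logged, since capturing one is not free.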
Collaborator
Pinging @elastic/es-distributed (Team:Distributed)
Contributor
Author
This will help investigate #93226, but is also more generally useful.
fcofdez
approved these changes
Feb 2, 2023
Contributor
fcofdez
left a comment
LGTM, this will be really useful 👍
Contributor
Author
@elasticmachine please run elasticsearch-ci/part-1
mark-vieira
pushed a commit
to mark-vieira/elasticsearch
that referenced
this pull request
Feb 2, 2023
DaveCTurner
added a commit
to DaveCTurner/elasticsearch
that referenced
this pull request
Dec 10, 2023
Since elastic#93458 we capture and log the local node's hot threads when something is holding on to a shard lock for longer than expected. In fact there are various other reasons we might want to automatically capture and log the local node's hot threads. This commit extracts a utility method to do this.
DaveCTurner
added a commit
to DaveCTurner/elasticsearch
that referenced
this pull request
Dec 11, 2023
DaveCTurner
added a commit
that referenced
this pull request
Dec 11, 2023
Since #93458 we capture and log the local node's hot threads when something is holding on to a shard lock for longer than expected. In fact there are various other reasons we might want to automatically capture and log the local node's hot threads. This commit extracts a utility method to do this.
Labels
auto-merge-without-approval
Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!)
:Distributed Indexing/Recovery
Anything around constructing a new shard, either from a local or a remote source.
>non-issue
Supportability
Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better.
Team:Distributed (Obsolete)
Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.
v8.7.0