@DaveCTurner
Contributor

We sometimes see a ShardLockObtainFailedException when a shard failed to shut down as fast as we expected, often because a node left and rejoined the cluster. Sometimes this is because it was held open by ongoing scrolls or PITs, but other times it may be because the shutdown process itself is too slow. With this commit we add the ability to capture and log a thread dump at the time of the failure to give us more information about where the shutdown process might be running slowly.

Relates #93226
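
To illustrate the idea, here is a minimal sketch in plain Java of logging a thread dump when a lock cannot be obtained within a timeout. It is not the code in this PR: the class `ThreadDumpOnLockTimeout`, its methods, and the `Consumer<String>` log sink are hypothetical names, and a `ReentrantLock` stands in for the shard lock. It only assumes the standard `java.lang.management` API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Consumer;

// Hypothetical helper, for illustration only; not part of Elasticsearch.
final class ThreadDumpOnLockTimeout {

    // Render the name, state, and stack trace of every live thread in this JVM.
    static String captureThreadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder dump = new StringBuilder();
        // false, false: skip locked monitors and ownable synchronizers to keep the dump cheap.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            dump.append('"').append(info.getThreadName()).append("\" ")
                .append(info.getThreadState()).append('\n');
            for (StackTraceElement frame : info.getStackTrace()) {
                dump.append("    at ").append(frame).append('\n');
            }
        }
        return dump.toString();
    }

    // Try to take the lock within the timeout. On failure, pass a message that
    // includes a thread dump to the log sink, so the reader can see what the
    // current holder (for example, a slow shutdown) was doing at that moment.
    static boolean tryLockOrDump(ReentrantLock lock, long timeoutMillis, Consumer<String> log) {
        try {
            if (lock.tryLock(timeoutMillis, TimeUnit.MILLISECONDS)) {
                return true;
            }
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the lock as not obtained.
            Thread.currentThread().interrupt();
            return false;
        }
        log.accept("lock not obtained within " + timeoutMillis + "ms; thread dump:\n" + captureThreadDump());
        return false;
    }
}
```

The dump is taken only on the failure path, so the normal case pays nothing beyond the timed `tryLock`. On a busy node a full dump can be large, so a real implementation would probably want to limit how often it is logged.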

@DaveCTurner DaveCTurner added >non-issue :Distributed Indexing/Recovery Anything around constructing a new shard, either from a local or a remote source. Supportability Improve our (devs, SREs, support eng, users) ability to troubleshoot/self-service product better. v8.7.0 labels Feb 2, 2023
@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Feb 2, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@DaveCTurner
Contributor Author

This will help investigate #93226, but is also more generally useful.

Contributor

@fcofdez fcofdez left a comment

LGTM, this will be really useful 👍

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Feb 2, 2023
@DaveCTurner
Contributor Author

@elasticmachine please run elasticsearch-ci/part-1

@elasticsearchmachine elasticsearchmachine merged commit 4c68382 into elastic:main Feb 2, 2023
@DaveCTurner DaveCTurner deleted the 2023-02-02-hot-threads-on-shard-lock-failure branch February 2, 2023 16:18
mark-vieira pushed a commit to mark-vieira/elasticsearch that referenced this pull request Feb 2, 2023
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 10, 2023
Since elastic#93458 we capture and log the local node's hot threads when
something is holding on to a shard lock for longer than expected. In
fact there are various other reasons we might want to automatically
capture and log the local node's hot threads. This commit extracts a
utility method to do this.
DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this pull request Dec 11, 2023
DaveCTurner added a commit that referenced this pull request Dec 11, 2023