Detect and down suspended process, #30323 #30335
base: main
Conversation
if (suspendDetector.wasSuspended(suspendTimeout)) {
  // note that suspend detection is running on all nodes
  suspendDetected()
} else {
Changes here are mostly indentation; the suspended case was added above the existing if.
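For context, a minimal sketch of the idea behind such a `wasSuspended` check, assuming the detector simply records when its periodic tick last ran; the names and structure below are illustrative, not the actual `akka.dispatch.SuspendDetector`:

```scala
import scala.concurrent.duration._

// Illustrative sketch only, not the actual SuspendDetector implementation.
final class SuspendCheckSketch {
  // updated by a periodically scheduled task while the process is running normally
  @volatile private var lastAliveNanos: Long = System.nanoTime()

  def markAlive(): Unit =
    lastAliveNanos = System.nanoTime()

  // if much more wall-clock time than expected has passed since the last tick,
  // nothing in the JVM was scheduled for that long, e.g. the process was suspended
  def wasSuspended(suspendTimeout: FiniteDuration): Boolean =
    (System.nanoTime() - lastAliveNanos) > suspendTimeout.toNanos
}
```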
log.warning(
  ClusterLogMarker.sbrInstability,
  "SBR detected that the process was suspended too long and will down itself.")
actOnDecision(DowningStrategy.DownSelfSuspended)
This is what the logs from a kill -STOP followed by kill -CONT look like:
[2021-06-21 12:17:18,570] [WARN] [akka.dispatch.SuspendDetector] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-32] - Process was suspended for [21 seconds]
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.ClusterHeartbeat] [akka://[email protected]:2551] [akkaHeartbeatStarvation] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://[email protected]:2551] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [21634] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, CPU overload, or GC.
[2021-06-21 12:17:18,584] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://[email protected]:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://[email protected]:2551, Up), Member(akka://[email protected]:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://[email protected]:2551,-2763929823204594617)]
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://[email protected]:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://[email protected]:2551, Up), Member(akka://[email protected]:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [akkaMemberChanged] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://[email protected]:2551] - Marking node [akka://[email protected]:2551] as [Down]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://[email protected]:2551,-2763929823204594617)]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sharding.ShardRegion] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ShoppingCart: Self downed, stopping ShardRegion [akka://ShoppingCartService/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Self downed, stopping
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Cluster Node [akka://[email protected]:2551] - event MemberDowned(Member(akka://[email protected]:2551, Down))
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Singleton manager stopping singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton]
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://[email protected]:2551] - event LeaderChanged(Some(akka://[email protected]:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - This node is not the leader any more and not responsible for taking SBR decisions.
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://[email protected]:2551] - event RoleLeaderChanged(dc-default,Some(akka://[email protected]:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ClusterSingletonManager state change [Oldest -> Stopping]
[2021-06-21 12:17:18,588] [DEBUG] [shopping.cart.ShoppingCart$] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - PostStop cart [cart2]
[2021-06-21 12:17:18,589] [DEBUG] [akka.cluster.sharding.Shard] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Shard [506] shutting down
[2021-06-21 12:17:18,591] [DEBUG] [akka.cluster.sharding.DDataShardCoordinator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Received termination message.
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Unwatching: [Actor[akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton/coordinator#1190468820] -> Actor[akka://[email protected]:2552/system/sharding/ShoppingCart#279229579]]
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Cleanup self watch of [akka://[email protected]:2552/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,593] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [akkaClusterSingletonTerminated] [ShoppingCartService-akka.actor.default-dispatcher-32] - Singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton] was terminated
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - is no longer leader
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR took decision DownSelfSuspended and is downing [akka://[email protected]:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://[email protected]:2551, Up), Member(akka://[email protected]:2552, Up)], full reachability status: []
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR is downing [UniqueAddress(akka://[email protected]:2551,-2763929823204594617)]
[2021-06-21 12:17:19,576] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - Node has been marked as DOWN. Shutting down myself
[2021-06-21 12:17:19,577] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - Shutting down...
[2021-06-21 12:17:19,577] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-14] - ClusterReceptionist [akka://[email protected]:2551] - terminated/removed
[2021-06-21 12:17:19,578] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - Successfully shut down
[2021-06-21 12:17:19,579] [DEBUG] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-21] - Publish AddressTerminated [akka://[email protected]:2552]
[2021-06-21 12:17:19,580] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-22] - Cancel all timers
[2021-06-21 12:17:19,583] [INFO] [akka.actor.LocalActorRef] [akka://[email protected]:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.ClusterUserAction$Down] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Running CoordinatedShutdown with reason [ClusterDowningReason]
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-service-unbind] with [0] tasks
[2021-06-21 12:17:19,584] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-unbind] with [1] tasks.
[2021-06-21 12:17:19,586] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [http-unbind-/127.0.0.1:8101] in CoordinatedShutdown phase [service-unbind]
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing phase [service-requests-done] with [1] tasks.
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing task [http-terminate-/127.0.0.1:8101] in CoordinatedShutdown phase [service-requests-done]
[2021-06-21 12:17:19,589] [INFO] [akka.actor.LocalActorRef] [akka://[email protected]:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://[email protected]:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [2] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,590] [INFO] [akka.actor.LocalActorRef] [akka://[email protected]:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://[email protected]:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [3] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-stop] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-cluster-shutdown] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-sharding-shutdown-region] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [region-shutdown] in CoordinatedShutdown phase [cluster-sharding-shutdown-region]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-leave] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [leave] in CoordinatedShutdown phase [cluster-leave]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting] with [4] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-singleton-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-2] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-1] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting-done] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [exiting-completed] in CoordinatedShutdown phase [cluster-exiting-done]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-shutdown] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-shutdown] in CoordinatedShutdown phase [cluster-shutdown]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-actor-system-terminate] with [0] tasks
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [actor-system-terminate] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [terminate-system] in CoordinatedShutdown phase [actor-system-terminate]
[2021-06-21 12:17:19,594] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Shutting down remote daemon.
[2021-06-21 12:17:19,595] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports.
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://[email protected]:2552], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://[email protected]:2552], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,639] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remoting shut down.
[2021-06-21 12:17:19,646] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown initiated...
[2021-06-21 12:17:19,650] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown completed.
One thing to note is that sharding already acts quickly on the downing; see:
ShoppingCart: Self downed, stopping ShardRegion
PostStop cart [cart2]
import SuspendDetector.SuspendDetected

// FIXME config
private val tickInterval = 100.millis
This might not have to be this frequent, because I think the scheduled task will trigger immediately when the process is woken up.
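A rough sketch of that scheduled tick, under the assumption that it records the time of the last run and flags a suspend when the observed gap is much larger than the interval; the detection threshold is an assumption, and only the field names mirror the diff further down:

```scala
import scala.concurrent.duration._
import akka.actor.ActorSystem

// Sketch under stated assumptions; the threshold factor is hypothetical.
class SuspendTickSketch(system: ActorSystem, tickInterval: FiniteDuration) {
  import system.dispatcher

  @volatile private var aliveTime = System.nanoTime()
  @volatile private var suspendDetectedTime = aliveTime - 1.day.toNanos

  system.scheduler.scheduleWithFixedDelay(tickInterval, tickInterval) { () =>
    val now = System.nanoTime()
    val gap = (now - aliveTime).nanos
    // the first tick after kill -CONT (or a resumed VM) observes the whole
    // suspended period as one large gap, so detection happens right away
    // even if tickInterval is fairly long
    if (gap > tickInterval * 5) // factor is an assumption
      suspendDetectedTime = now
    aliveTime = now
  }
  // wasSuspended(timeout) can then compare suspendDetectedTime against the timeout window
}
```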
Test FAILed.
johanandren left a comment
Looks good, I guess it's mostly just missing config for the various timeouts?
Yeah, I wanted to throw out the idea before doing too much. I'm also thinking about whether there are more places where it should check. In persistent actors, or should that be in the plugins? It could abort any updates in flight (that haven't been sent to the db yet).
Hmm, yeah, for persistence it would be to avoid delayed writes when the actor has already been started on another node? I can't think of any other place where it is immediately obvious to me that it would be good. I think I'll have to "marinate" the idea for a while and see what comes up.
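Regarding the missing config for the various timeouts mentioned above, a hedged sketch of how they could be read from settings; the config paths below are made up for illustration and are not existing Akka settings:

```scala
import scala.concurrent.duration.FiniteDuration
import scala.jdk.DurationConverters._
import akka.actor.ActorSystem

// Hypothetical config paths, purely illustrative:
//
//   akka.cluster.split-brain-resolver.suspend-detection {
//     tick-interval = 100ms
//     suspend-timeout = 10s
//   }
object SuspendDetectionSettingsSketch {
  def apply(system: ActorSystem): (FiniteDuration, FiniteDuration) = {
    val c = system.settings.config.getConfig("akka.cluster.split-brain-resolver.suspend-detection")
    (c.getDuration("tick-interval").toScala, c.getDuration("suspend-timeout").toScala)
  }
}
```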
@volatile private var aliveTime = System.nanoTime()
@volatile private var suspendDetectedTime = aliveTime - 1.day.toNanos

system.scheduler.scheduleWithFixedDelay(tickInterval, tickInterval) { () =>
Wouldn't this approach (scheduling checkTime on the scheduler and dispatcher) yield a false positive in cases of thread starvation within the application? Thread starvation like that is definitely https://www.youtube.com/watch?v=MpUWrl3-mc8, but the response of pre-emptively tearing things down might be out of proportion, so we would want a very loose deadline.
It seems like the intent is to capture scenarios where nothing in the JVM process gets scheduled (suspended process/VM, but also sufficiently-high levels of host load or priority inversions etc.), so maybe having the check run in a thread that is nearly always sleeping would be more effective and could have a faster reaction time? My understanding is that the kernel schedulers tend to be eager to give CPU to processes that haven't been runnable in a long time (e.g. because they're sleeping).
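A sketch of that alternative, assuming a dedicated daemon thread that is asleep nearly all of the time and measures how much longer than requested each sleep actually took; the names, threshold factor, and callback are all illustrative:

```scala
import scala.concurrent.duration._

// Illustrative sketch of the dedicated-thread idea, not actual Akka code.
final class SleepingSuspendDetector(tickInterval: FiniteDuration)(onSuspend: FiniteDuration => Unit) {
  @volatile private var running = true

  private val thread = new Thread(() => {
    while (running) {
      val before = System.nanoTime()
      Thread.sleep(tickInterval.toMillis)
      val slept = (System.nanoTime() - before).nanos
      // a thread that is almost always sleeping should be scheduled promptly
      // on wake-up, so a large overshoot suggests the whole process was not running
      if (slept > tickInterval * 5) // factor is an assumption
        onSuspend(slept - tickInterval)
    }
  }, "suspend-detector") // hypothetical thread name
  thread.setDaemon(true)
  thread.start()

  def stop(): Unit = running = false
}
```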
Early draft for discussion...
References #30323