
Conversation

@patriknw
Contributor

Early draft for discussion...

References #30323

@akka-ci added the validating label (PR is currently being validated by Jenkins) on Jun 21, 2021
if (suspendDetector.wasSuspended(suspendTimeout)) {
  // note that suspend detection is running on all nodes
  suspendDetected()
} else {

@patriknw (author) commented:

The changes here are mostly indentation; what's added is the suspended case in the if above.

log.warning(
  ClusterLogMarker.sbrInstability,
  "SBR detected that the process was suspended too long and will down itself.")
actOnDecision(DowningStrategy.DownSelfSuspended)

@patriknw (author) commented:

This is what the logs from a kill -STOP followed by kill -CONT look like:

[2021-06-21 12:17:18,570] [WARN] [akka.dispatch.SuspendDetector] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-32] - Process was suspended for [21 seconds]
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,575] [WARN] [akka.cluster.ClusterHeartbeat] [akka://[email protected]:2551] [akkaHeartbeatStarvation] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://[email protected]:2551] - Scheduled sending of heartbeat was delayed. Previous heartbeat was sent [21634] ms ago, expected interval is [1000] ms. This may cause failure detection to mark members as unreachable. The reason can be thread starvation, CPU overload, or GC.
[2021-06-21 12:17:18,584] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://[email protected]:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://[email protected]:2551, Up), Member(akka://[email protected]:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://[email protected]:2551,-2763929823204594617)]
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:18,585] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR took decision DownSelfSuspended and is downing [akka://[email protected]:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://[email protected]:2551, Up), Member(akka://[email protected]:2552, Up)], full reachability status: []
[2021-06-21 12:17:18,585] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [akkaMemberChanged] [ShoppingCartService-akka.actor.default-dispatcher-32] - Cluster Node [akka://[email protected]:2551] - Marking node [akka://[email protected]:2551] as [Down]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-32] - SBR is downing [UniqueAddress(akka://[email protected]:2551,-2763929823204594617)]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.sharding.ShardRegion] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ShoppingCart: Self downed, stopping ShardRegion [akka://ShoppingCartService/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Self downed, stopping
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Cluster Node [akka://[email protected]:2551] - event MemberDowned(Member(akka://[email protected]:2551, Down))
[2021-06-21 12:17:18,586] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - Singleton manager stopping singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton]
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://[email protected]:2551] - event LeaderChanged(Some(akka://[email protected]:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - This node is not the leader any more and not responsible for taking SBR decisions.
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - Cluster Node [akka://[email protected]:2551] - event RoleLeaderChanged(dc-default,Some(akka://[email protected]:2552))
[2021-06-21 12:17:18,587] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-33] - ClusterSingletonManager state change [Oldest -> Stopping]
[2021-06-21 12:17:18,588] [DEBUG] [shopping.cart.ShoppingCart$] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-31] - PostStop cart [cart2]
[2021-06-21 12:17:18,589] [DEBUG] [akka.cluster.sharding.Shard] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Shard [506] shutting down
[2021-06-21 12:17:18,591] [DEBUG] [akka.cluster.sharding.DDataShardCoordinator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - ShoppingCart: Received termination message.
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Unwatching: [Actor[akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton/coordinator#1190468820] -> Actor[akka://[email protected]:2552/system/sharding/ShoppingCart#279229579]]
[2021-06-21 12:17:18,592] [DEBUG] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Cleanup self watch of [akka://[email protected]:2552/system/sharding/ShoppingCart]
[2021-06-21 12:17:18,593] [INFO] [akka.cluster.singleton.ClusterSingletonManager] [akka://[email protected]:2551] [akkaClusterSingletonTerminated] [ShoppingCartService-akka.actor.default-dispatcher-32] - Singleton actor [akka://ShoppingCartService/system/sharding/ShoppingCartCoordinator/singleton] was terminated
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrInstability] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR detected that the process was suspended too long and will down itself.
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - is no longer leader
[2021-06-21 12:17:19,575] [WARN] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowning] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR took decision DownSelfSuspended and is downing [akka://[email protected]:2551] including myself,, [0] unreachable of [2] members, all members in DC [Member(akka://[email protected]:2551, Up), Member(akka://[email protected]:2552, Up)], full reachability status: []
[2021-06-21 12:17:19,575] [INFO] [akka.cluster.sbr.SplitBrainResolver] [akka://[email protected]:2551] [akkaSbrDowningNode] [ShoppingCartService-akka.actor.default-dispatcher-35] - SBR is downing [UniqueAddress(akka://[email protected]:2551,-2763929823204594617)]
[2021-06-21 12:17:19,576] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - Node has been marked as DOWN. Shutting down myself
[2021-06-21 12:17:19,577] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - Shutting down...
[2021-06-21 12:17:19,577] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-14] - ClusterReceptionist [akka://[email protected]:2551] - terminated/removed
[2021-06-21 12:17:19,578] [INFO] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Cluster Node [akka://[email protected]:2551] - Successfully shut down
[2021-06-21 12:17:19,579] [DEBUG] [akka.cluster.Cluster] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-21] - Publish AddressTerminated [akka://[email protected]:2552]
[2021-06-21 12:17:19,580] [DEBUG] [akka.cluster.typed.internal.receptionist.ClusterReceptionist] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.internal-dispatcher-22] - Cancel all timers
[2021-06-21 12:17:19,583] [INFO] [akka.actor.LocalActorRef] [akka://[email protected]:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.ClusterUserAction$Down] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [1] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Running CoordinatedShutdown with reason [ClusterDowningReason]
[2021-06-21 12:17:19,583] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-service-unbind] with [0] tasks
[2021-06-21 12:17:19,584] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-unbind] with [1] tasks.
[2021-06-21 12:17:19,586] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [http-unbind-/127.0.0.1:8101] in CoordinatedShutdown phase [service-unbind]
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing phase [service-requests-done] with [1] tasks.
[2021-06-21 12:17:19,588] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Performing task [http-terminate-/127.0.0.1:8101] in CoordinatedShutdown phase [service-requests-done]
[2021-06-21 12:17:19,589] [INFO] [akka.actor.LocalActorRef] [akka://[email protected]:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://[email protected]:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [2] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,590] [INFO] [akka.actor.LocalActorRef] [akka://[email protected]:2551] [akkaDeadLetter] [ShoppingCartService-akka.actor.default-dispatcher-35] - Message [akka.cluster.GossipEnvelope] from Actor[akka://[email protected]:2552/system/cluster/core/daemon#1146845275] to Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] was not delivered. [3] dead letters encountered. If this is not an expected behavior then Actor[akka://ShoppingCartService/system/cluster/core/daemon#572391440] may have terminated unexpectedly. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [service-stop] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-cluster-shutdown] with [0] tasks
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-sharding-shutdown-region] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [region-shutdown] in CoordinatedShutdown phase [cluster-sharding-shutdown-region]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-leave] with [1] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [leave] in CoordinatedShutdown phase [cluster-leave]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting] with [4] tasks.
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-singleton-exiting] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-2] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [singleton-exiting-1] in CoordinatedShutdown phase [cluster-exiting]
[2021-06-21 12:17:19,591] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-exiting-done] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [exiting-completed] in CoordinatedShutdown phase [cluster-exiting-done]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [cluster-shutdown] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [wait-shutdown] in CoordinatedShutdown phase [cluster-shutdown]
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [before-actor-system-terminate] with [0] tasks
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing phase [actor-system-terminate] with [1] tasks.
[2021-06-21 12:17:19,592] [DEBUG] [akka.actor.CoordinatedShutdown] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - Performing task [terminate-system] in CoordinatedShutdown phase [actor-system-terminate]
[2021-06-21 12:17:19,594] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Shutting down remote daemon.
[2021-06-21 12:17:19,595] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remote daemon shut down; proceeding with flushing remote transports.
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://[email protected]:2552], control stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,631] [WARN] [akka.stream.Materializer] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - [outbound connection to [akka://[email protected]:2552], message stream] Upstream failed, cause: StreamTcpException: The connection has been aborted
[2021-06-21 12:17:19,639] [INFO] [akka.remote.RemoteActorRefProvider$RemotingTerminator] [akka://[email protected]:2551] [] [ShoppingCartService-akka.actor.default-dispatcher-3] - Remoting shut down.
[2021-06-21 12:17:19,646] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown initiated...
[2021-06-21 12:17:19,650] [INFO] [com.zaxxer.hikari.HikariDataSource] [] [] [ShoppingCartService-akka.actor.default-dispatcher-35] - db - Shutdown completed.

@patriknw (author) commented:

One thing to note is that sharding already acts quickly on the downing, see

ShoppingCart: Self downed, stopping ShardRegion
PostStop cart [cart2]

import SuspendDetector.SuspendDetected

// FIXME config
private val tickInterval = 100.millis

@patriknw (author) commented:

This might not have to run this frequently, because I think the scheduled task will trigger immediately when the process is woken up.
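
The // FIXME config above, and the review note below about missing timeout configuration, would presumably end up as settings. A minimal sketch of how the values could be read, assuming a hypothetical config section akka.cluster.suspend-detector (the key names are not part of this PR):

import java.util.concurrent.TimeUnit
import scala.concurrent.duration._
import akka.actor.ActorSystem

// Sketch only: the config section and key names below are assumptions, not part of this PR.
final class SuspendDetectorSettings(system: ActorSystem) {
  private val cfg = system.settings.config.getConfig("akka.cluster.suspend-detector")

  // How often the detector samples the clock.
  val tickInterval: FiniteDuration =
    cfg.getDuration("tick-interval", TimeUnit.MILLISECONDS).millis

  // How long a detected gap should keep counting as "the process was suspended".
  val suspendTimeout: FiniteDuration =
    cfg.getDuration("suspend-timeout", TimeUnit.MILLISECONDS).millis
}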

@akka-ci added the needs-attention label (indicates a PR validation failure, set by CI infrastructure) and removed the validating label on Jun 21, 2021

@akka-ci commented Jun 21, 2021

Test FAILed.

@johanandren (Contributor) left a comment

Looks good, I guess it's mostly just missing config for the various timeouts?

@patriknw (author) commented Jun 21, 2021

> I guess it's mostly just missing config for the various timeouts?

Yeah, I wanted to throw the idea out there before doing too much. Also thinking about whether there are more places where it should check. In persistent actors, or should that be in the plugins? Thinking it could abort any updates in flight (that haven't been sent to the db yet).

@johanandren (Contributor)

Hmm, yeah, for persistence it would be to avoid delayed writes when the actor has already been started on another node?
If we want it for persistence I think it should be in the Akka Persistence journal infra and not in each individual journal plugin.

I can't think of any other place where it is immediately obvious to me that it would be good, I think I'll have to "marinate" the idea for a while and see what comes up.
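
To make the persistence idea concrete, the kind of guard being discussed would sit in the shared journal write path rather than in each plugin. A purely hypothetical sketch: guardedWrite, recentlySuspended and the failure message are placeholders, not Akka Persistence APIs.

import scala.concurrent.Future

// Hypothetical guard in the shared journal infrastructure (not an existing Akka API):
// drop writes queued before a detected suspension, since the persistent actor may
// already have been restarted on another node while this process was frozen.
object JournalSuspendGuard {
  def guardedWrite[T](recentlySuspended: () => Boolean)(writeToPlugin: => Future[T]): Future[T] =
    if (recentlySuspended())
      Future.failed(new IllegalStateException("Write aborted: process was recently suspended"))
    else
      writeToPlugin
}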

@volatile private var aliveTime = System.nanoTime()
@volatile private var suspendDetectedTime = aliveTime - 1.day.toNanos

system.scheduler.scheduleWithFixedDelay(tickInterval, tickInterval) { () =>
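
The excerpt above stops at the start of the scheduled callback. For context, a self-contained sketch of how the scheduled check and the wasSuspended query could be built on top of the aliveTime/suspendDetectedTime fields shown above; the factor-of-3 threshold is an assumption and the actual PR code may differ.

import scala.concurrent.duration._

// Sketch only, not the PR's implementation.
final class SuspendDetectorSketch(tickInterval: FiniteDuration) {
  @volatile private var aliveTime = System.nanoTime()
  @volatile private var suspendDetectedTime = aliveTime - 1.day.toNanos

  // Called from the scheduled task: if far more wall-clock time passed than the tick
  // interval, nothing in this process was running (suspended, or severely starved).
  def checkTime(): Unit = {
    val now = System.nanoTime()
    if (now - aliveTime > (tickInterval * 3).toNanos) // factor 3 is an assumed threshold
      suspendDetectedTime = now
    aliveTime = now
  }

  // Callers such as the SBR ask whether a suspension was detected within the given window.
  def wasSuspended(within: FiniteDuration): Boolean =
    System.nanoTime() - suspendDetectedTime <= within.toNanos
}

In this sketch the SBR would call wasSuspended(suspendTimeout) on each of its own ticks, as in the excerpt at the top of this thread.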

A contributor commented:

Wouldn't this approach (scheduling checkTime on the scheduler and dispatcher) yield a false positive in cases of thread starvation within the application? Thread starvation like that is definitely https://www.youtube.com/watch?v=MpUWrl3-mc8, but the response of pre-emptively tearing things down might be out of proportion, so we'd want a very loose deadline.

It seems like the intent is to capture scenarios where nothing in the JVM process gets scheduled (suspended process/VM, but also sufficiently-high levels of host load or priority inversions etc.), so maybe having the check run in a thread that is nearly always sleeping would be more effective and could have a faster reaction time? My understanding is that the kernel schedulers tend to be eager to give CPU to processes that haven't been runnable in a long time (e.g. because they're sleeping).
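
The alternative suggested here, decoupled from the scheduler and dispatcher, could look roughly like the following sketch; the thread name, the InterruptedException handling for shutdown, and the factor-of-3 threshold are assumptions.

import scala.concurrent.duration._

// Sketch of the suggested alternative: a dedicated, almost-always-sleeping daemon thread,
// so that detection does not compete with application work on the dispatcher.
final class SleepingThreadSuspendDetector(tickInterval: FiniteDuration) {
  @volatile private var suspendDetectedTime = System.nanoTime() - 1.day.toNanos

  private val thread = new Thread("suspend-detector") {
    setDaemon(true)
    override def run(): Unit =
      try {
        var last = System.nanoTime()
        while (true) {
          Thread.sleep(tickInterval.toMillis)
          val now = System.nanoTime()
          // If the sleep took far longer than requested, the whole process was likely suspended.
          if (now - last > (tickInterval * 3).toNanos) // factor 3 is an assumed threshold
            suspendDetectedTime = now
          last = now
        }
      } catch {
        case _: InterruptedException => // stop quietly on shutdown
      }
  }
  thread.start()

  def wasSuspended(within: FiniteDuration): Boolean =
    System.nanoTime() - suspendDetectedTime <= within.toNanos
}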
