Differentiate between initial and reconnect RCS connections#134415
Differentiate between initial and reconnect RCS connections#134415JeremyDahlgren merged 10 commits intoelastic:mainfrom
Conversation
Adds a connection attempt counter to RemoteConnectionStrategy, with info logging on connection success and warning logging on connection failure, and 30 secs between repeat failure attempt logging. This change will be used in a follow up PR where we will increment either an initial connection failure metric or a reconnection attempt failure metric. Resolves: ES-12694
|
Hi @JeremyDahlgren, I've created a changelog YAML for you. |
|
Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination) |
ywangd
left a comment
There was a problem hiding this comment.
Thanks for working on this. I left one "composite" comment.
| } else { | ||
| if (lastFailedConnectionAttemptWarningTimeMillis == -1L | ||
| || nowMillis - lastFailedConnectionAttemptWarningTimeMillis >= CONNECTION_FAILURE_WARN_INTERVAL.getMillis()) { | ||
| logger.warn(msgSupplier, e); | ||
| lastFailedConnectionAttemptWarningTimeMillis = nowMillis; | ||
| } | ||
| // TODO: ES-12695: Increment either the initial (connectionAttempts == 1) or retry connection failure metric. |
There was a problem hiding this comment.
I have a few comments:
- I think we don't really need them. One AtomicBoolean should be sufficient to differentiate between initial and subsequent connections? Does not seem useful to keep track of the number of attempts. We could increment an APM metric counter (in separate PR) instead if needed.
- The counting fields probably won't work correctly without being volatile or something similar since I think this method can be called from different threads.
- We can always log the warning message if connection fails entirely, we do that already in Proxy and Sniff strategies. If we want to throttle them, the interval should be configurable instead of hardcoded. But I don't think its needs throttling at this point. I will be fine if we remove the warning logs from the individual strategy and let the new one replace them.
- I'd prefer the log message more clearly to say whether it is the initial connection or subsequent connection instead of relying on the number of attempts.
There was a problem hiding this comment.
- Ok - I refactored to use the
AtomicBooleanin e57e238. - I'd like to understand this better, from my reading of the code it appeared that the synchronization and use of the
listenerslist inconnect()ensured that there was only ever a singleAbstractRunnableexecutingconnectImpl(), with the listener that would invoke the newconnectionAttemptCompleted()method. - I removed the throttling and removed the relevant
warnlog line in the strategy implementation classes. Please verify that this is what you had in mind. - I updated the message in e57e238.
Thank you for the review Yang!
There was a problem hiding this comment.
Thanks for addressing the comment!
For 2, as discussed separately, you might be right on this but it's also not 100% certain to me in all cases. The synchronization and single thread actually proceeds with building connection, plus the usage of executor.submit to the manage threadpool should ensure visibility in the request path. But the final listener can be invoked in a few different threads in either success or failure cases, e.g. generic, management and transport_worker (can be a different one from the one opened the initial connection). I am not fully sure that all these threads in all cases will have the right visibility.
server/src/main/java/org/elasticsearch/transport/RemoteConnectionStrategy.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/RemoteConnectionStrategy.java
Outdated
Show resolved
Hide resolved
server/src/main/java/org/elasticsearch/transport/RemoteConnectionStrategy.java
Outdated
Show resolved
Hide resolved
| } else { | ||
| if (lastFailedConnectionAttemptWarningTimeMillis == -1L | ||
| || nowMillis - lastFailedConnectionAttemptWarningTimeMillis >= CONNECTION_FAILURE_WARN_INTERVAL.getMillis()) { | ||
| logger.warn(msgSupplier, e); | ||
| lastFailedConnectionAttemptWarningTimeMillis = nowMillis; | ||
| } | ||
| // TODO: ES-12695: Increment either the initial (connectionAttempts == 1) or retry connection failure metric. |
There was a problem hiding this comment.
Thanks for addressing the comment!
For 2, as discussed separately, you might be right on this but it's also not 100% certain to me in all cases. The synchronization and single thread actually proceeds with building connection, plus the usage of executor.submit to the manage threadpool should ensure visibility in the request path. But the final listener can be invoked in a few different threads in either success or failure cases, e.g. generic, management and transport_worker (can be a different one from the one opened the initial connection). I am not fully sure that all these threads in all cases will have the right visibility.
This change registers counters for initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteConnectionStrategy constructor. This change builds on the work done in elastic#134415. Resolves: ES-12695
This change registers a counter to track initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteClusterService constructor. This change builds on the work done in #134415. Resolves: ES-12695
…37406) This change registers a counter to track initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteClusterService constructor. This change builds on the work done in elastic#134415. Resolves: ES-12695
Adds a connection attempt counter to
RemoteConnectionStrategy, with info logging on connection success and warning logging on connection failure, and 30 secs between repeat failure attempt logging. This change will be used in a follow up PR where we will increment either an initial connection failure metric or a reconnection attempt failure metric.Resolves: ES-12694
Relates: ES-12695