Differentiate between initial and reconnect RCS connections by JeremyDahlgren · Pull Request #134415 · elastic/elasticsearch

JeremyDahlgren · 2025-09-10T01:35:04Z

Adds a connection attempt counter to RemoteConnectionStrategy. Connection success is logged at info level and connection failure at warning level, with at least 30 seconds between repeated failure warnings. A follow-up PR will use this change to increment either an initial connection failure metric or a reconnection attempt failure metric.

Resolves: ES-12694
Relates: ES-12695
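
The throttled failure logging described above can be sketched in isolation. This is a minimal illustration, not the code from this PR: the class name `ThrottledWarning` and the method `shouldWarn` are hypothetical, and the sketch assumes the caller supplies the current time in milliseconds. It uses an `AtomicLong` so the decision stays correct if attempts complete on different threads.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, for illustration only; not part of Elasticsearch.
final class ThrottledWarning {
    private static final long NEVER = -1L;

    private final long intervalMillis;
    // Time of the last emitted warning, or NEVER if none has been emitted yet.
    private final AtomicLong lastWarnMillis = new AtomicLong(NEVER);

    ThrottledWarning(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    /** Returns true if a failure at nowMillis should produce a warning. */
    public boolean shouldWarn(long nowMillis) {
        long last = lastWarnMillis.get();
        boolean due = last == NEVER || nowMillis - last >= intervalMillis;
        // compareAndSet lets only one of several racing callers claim the slot.
        return due && lastWarnMillis.compareAndSet(last, nowMillis);
    }

    public static void main(String[] args) {
        ThrottledWarning warning = new ThrottledWarning(30_000L);
        System.out.println(warning.shouldWarn(0L));      // true: first failure
        System.out.println(warning.shouldWarn(10_000L)); // false: within 30 s
        System.out.println(warning.shouldWarn(30_000L)); // true: interval elapsed
    }
}
```

A caller would wrap its `logger.warn(...)` call in `if (warning.shouldWarn(nowMillis))`. Making the interval a constructor argument keeps it configurable rather than hardcoded.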


elasticsearchmachine · 2025-09-10T01:35:29Z

Hi @JeremyDahlgren, I've created a changelog YAML for you.

elasticsearchmachine · 2025-09-10T01:49:11Z

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

ywangd

Thanks for working on this. I left one "composite" comment.

ywangd · 2025-09-10T06:05:23Z

server/src/main/java/org/elasticsearch/transport/RemoteConnectionStrategy.java

+        } else {
+            if (lastFailedConnectionAttemptWarningTimeMillis == -1L
+                || nowMillis - lastFailedConnectionAttemptWarningTimeMillis >= CONNECTION_FAILURE_WARN_INTERVAL.getMillis()) {
+                logger.warn(msgSupplier, e);
+                lastFailedConnectionAttemptWarningTimeMillis = nowMillis;
+            }
+            // TODO: ES-12695: Increment either the initial (connectionAttempts == 1) or retry connection failure metric.


I have a few comments:

1. I think we don't really need them. One AtomicBoolean should be sufficient to differentiate between initial and subsequent connections? It does not seem useful to keep track of the number of attempts. We could increment an APM metric counter (in a separate PR) instead if needed.

2. The counting fields probably won't work correctly without being volatile or something similar, since I think this method can be called from different threads.

3. We can always log the warning message if the connection fails entirely; we do that already in the Proxy and Sniff strategies. If we want to throttle them, the interval should be configurable instead of hardcoded. But I don't think it needs throttling at this point. I will be fine if we remove the warning logs from the individual strategies and let the new one replace them.

4. I'd prefer the log message to say more clearly whether it is the initial connection or a subsequent connection, instead of relying on the number of attempts.
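
The single-flag approach suggested in the comments above can be sketched as follows. This is a hedged illustration, not the code that was merged: the class `ConnectionAttemptTracker` and the field `initialAttemptPending` are hypothetical names, and the sketch assumes that "initial" means the first attempt to complete, whether it succeeds or fails. Because `AtomicBoolean` operations are atomic and visible across threads, no extra `volatile` field is needed.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical class, for illustration only; not the merged implementation.
final class ConnectionAttemptTracker {
    // True until the first connection attempt has completed.
    private final AtomicBoolean initialAttemptPending = new AtomicBoolean(true);

    /**
     * Describes a completed attempt. A null exception means the attempt succeeded.
     */
    public String connectionAttemptCompleted(Exception e) {
        // getAndSet flips the flag atomically, so exactly one caller sees "initial".
        boolean initial = initialAttemptPending.getAndSet(false);
        String kind = initial ? "initial connection" : "reconnect";
        return e == null ? kind + " succeeded" : kind + " failed: " + e.getMessage();
    }

    public static void main(String[] args) {
        ConnectionAttemptTracker tracker = new ConnectionAttemptTracker();
        // prints "initial connection failed: connection refused"
        System.out.println(tracker.connectionAttemptCompleted(new RuntimeException("connection refused")));
        // prints "reconnect succeeded"
        System.out.println(tracker.connectionAttemptCompleted(null));
    }
}
```

The message names the kind of attempt directly, so a reader of the log does not have to infer it from an attempt count.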

JeremyDahlgren · Sep 10, 2025

1. Ok - I refactored to use the AtomicBoolean in e57e238.

2. I'd like to understand this better. From my reading of the code, it appeared that the synchronization and use of the listeners list in connect() ensured that there was only ever a single AbstractRunnable executing connectImpl(), with the listener that would invoke the new connectionAttemptCompleted() method.

3. I removed the throttling and removed the relevant warn log line in the strategy implementation classes. Please verify that this is what you had in mind.

4. I updated the message in e57e238.

Thank you for the review Yang!


ywangd

LGTM

ywangd · 2025-09-11T02:40:17Z

server/src/main/java/org/elasticsearch/transport/RemoteConnectionStrategy.java

+        } else {
+            if (lastFailedConnectionAttemptWarningTimeMillis == -1L
+                || nowMillis - lastFailedConnectionAttemptWarningTimeMillis >= CONNECTION_FAILURE_WARN_INTERVAL.getMillis()) {
+                logger.warn(msgSupplier, e);
+                lastFailedConnectionAttemptWarningTimeMillis = nowMillis;
+            }
+            // TODO: ES-12695: Increment either the initial (connectionAttempts == 1) or retry connection failure metric.


Thanks for addressing the comment!

For 2, as discussed separately, you might be right on this, but I'm not 100% certain it holds in all cases. The synchronization, the fact that a single thread actually proceeds with building the connection, and the use of executor.submit to the management thread pool should together ensure visibility in the request path. But the final listener can be invoked on a few different threads in either the success or failure case, e.g. generic, management, and transport_worker (which can be a different one from the one that opened the initial connection). I am not fully sure that all these threads will have the right visibility in all cases.

This change registers counters for initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteConnectionStrategy constructor. This change builds on the work done in elastic#134415. Resolves: ES-12695

This change registers a counter to track initial and reconnect attempt failures. The change also required minor refactoring to make the metrics registry available from the TransportService that is passed to the RemoteClusterService constructor. This change builds on the work done in #134415. Resolves: ES-12695


JeremyDahlgren added the >enhancement, :Distributed/Network, Team:Distributed Coordination (obsolete), and v9.2.0 labels Sep 10, 2025

Update docs/changelog/134415.yaml

aae60f9

JeremyDahlgren requested a review from ywangd September 10, 2025 01:48

JeremyDahlgren marked this pull request as ready for review September 10, 2025 01:48

ywangd reviewed Sep 10, 2025

View reviewed changes

JeremyDahlgren and others added 3 commits September 10, 2025 16:06

Use boolean instead of attempts counter, remove throttling

e57e238

Remove strategy impl warn msg in favor of more detailed base class warning

d940014

[CI] Auto commit changes from spotless

02d7cfd

JeremyDahlgren requested a review from ywangd September 10, 2025 20:14

ywangd approved these changes Sep 11, 2025

View reviewed changes

JeremyDahlgren added 5 commits September 11, 2025 09:01

add @nullable to exception param in connectionAttemptCompleted()

af918bf

ammend the log msg in connectionAttemptCompleted()

52ed56b

change success msg log level to debug in connectionAttemptCompleted()

4a5683b

Merge branch 'main' into es-12694-rcs-initial-conn

285727d

Merge branch 'main' into es-12694-rcs-initial-conn

82f0b3d

JeremyDahlgren merged commit 355920e into elastic:main Sep 11, 2025
34 checks passed

JeremyDahlgren mentioned this pull request Oct 30, 2025

Add connection failure metrics in RemoteConnectionStrategy #137406

Merged

JeremyDahlgren merged 10 commits into elastic:main from JeremyDahlgren:es-12694-rcs-initial-conn
