Differentiate between initial and reconnect RCS connections #134415

Merged
JeremyDahlgren merged 10 commits into elastic:main from
JeremyDahlgren:es-12694-rcs-initial-conn
Sep 11, 2025

@JeremyDahlgren
Contributor

Adds a connection attempt counter to RemoteConnectionStrategy, with info logging on connection success and warning logging on connection failure, and at least 30 seconds between repeated failure warnings. This change will be used in a follow-up PR, where we will increment either an initial connection failure metric or a reconnection attempt failure metric.

Resolves: ES-12694
Relates: ES-12695
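
For illustration, a minimal sketch of the behavior described above: a per-strategy attempt counter, an info line on success, and a warning on failure that is emitted at most once per 30 seconds. `ConnectionAttemptTracker`, `attemptCompleted`, and the in-memory `log` list are hypothetical names used only for this sketch; they are not taken from the PR, and the real code uses the Elasticsearch logger.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the logic described above; not the actual RemoteConnectionStrategy code.
final class ConnectionAttemptTracker {
    // 30 seconds between repeated failure warnings, per the description.
    static final long WARN_INTERVAL_MILLIS = 30_000L;

    private long connectionAttempts = 0;
    private long lastWarnMillis = -1L; // -1 means no warning has been logged yet

    // Assumption: log lines are collected in a list so the sketch needs no logging framework.
    final List<String> log = new ArrayList<>();

    // Called once per completed attempt; failure is null when the attempt succeeded.
    synchronized void attemptCompleted(Exception failure, long nowMillis) {
        connectionAttempts++;
        if (failure == null) {
            log.add("INFO connected on attempt " + connectionAttempts);
        } else if (lastWarnMillis == -1L || nowMillis - lastWarnMillis >= WARN_INTERVAL_MILLIS) {
            log.add("WARN attempt " + connectionAttempts + " failed: " + failure.getMessage());
            lastWarnMillis = nowMillis;
        }
        // A failure inside the throttle window is counted but not logged.
    }
}
```

Passing the clock value in as `nowMillis` keeps the throttle deterministic in tests; whether the counter should reset after a success is left open here.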

@JeremyDahlgren JeremyDahlgren added the >enhancement, :Distributed/Network, Team:Distributed Coordination (obsolete), and v9.2.0 labels Sep 10, 2025
@elasticsearchmachine
Collaborator

Hi @JeremyDahlgren, I've created a changelog YAML for you.

@JeremyDahlgren JeremyDahlgren marked this pull request as ready for review September 10, 2025 01:48
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed-coordination (Team:Distributed Coordination)

Member

@ywangd ywangd left a comment

Thanks for working on this. I left one "composite" comment.

Comment on lines +232 to +238
    } else {
        if (lastFailedConnectionAttemptWarningTimeMillis == -1L
            || nowMillis - lastFailedConnectionAttemptWarningTimeMillis >= CONNECTION_FAILURE_WARN_INTERVAL.getMillis()) {
            logger.warn(msgSupplier, e);
            lastFailedConnectionAttemptWarningTimeMillis = nowMillis;
        }
        // TODO: ES-12695: Increment either the initial (connectionAttempts == 1) or retry connection failure metric.
Member

I have a few comments:

  1. I don't think we really need these counting fields. A single AtomicBoolean should be sufficient to differentiate between initial and subsequent connections. Keeping track of the number of attempts does not seem useful; if needed, we could increment an APM metric counter (in a separate PR) instead.
  2. The counting fields probably won't work correctly without being volatile or something similar, since I think this method can be called from different threads.
  3. We can always log the warning message if the connection fails entirely; we already do that in the Proxy and Sniff strategies. If we want to throttle the warnings, the interval should be configurable instead of hardcoded, but I don't think it needs throttling at this point. I would be fine with removing the warning logs from the individual strategies and letting the new one replace them.
  4. I'd prefer the log message to say more clearly whether it is the initial connection or a subsequent connection, instead of relying on the number of attempts.
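
To make points 1 and 4 concrete, here is a rough sketch of how a single AtomicBoolean could distinguish the initial attempt from later ones and name the attempt type in the log message. `InitialConnectionFlag` and `describeAttempt` are hypothetical names, the message wording is invented for the example, and the sketch assumes that only the first completed attempt counts as "initial"; the merged change may define this differently.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper for illustration only; not code from the PR.
final class InitialConnectionFlag {
    // True until the first attempt has completed.
    private final AtomicBoolean initialAttemptPending = new AtomicBoolean(true);

    // Builds a log message that states explicitly whether this was the
    // initial connection or a reconnection; failure is null on success.
    String describeAttempt(String clusterAlias, Exception failure) {
        // compareAndSet(true, false) succeeds exactly once, so only the first
        // caller observes "initial", even if callers run on different threads.
        boolean initial = initialAttemptPending.compareAndSet(true, false);
        String kind = initial ? "initial connection" : "reconnection";
        String outcome = failure == null ? "succeeded" : "failed: " + failure.getMessage();
        return kind + " to remote cluster [" + clusterAlias + "] " + outcome;
    }
}
```

Because the flag is atomic, this also sidesteps the visibility question raised in point 2 for this one piece of state.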

Contributor Author

  1. Ok - I refactored to use the AtomicBoolean in e57e238.
  2. I'd like to understand this better. From my reading of the code, it appeared that the synchronization and the use of the listeners list in connect() ensure that there is only ever a single AbstractRunnable executing connectImpl(), with the listener that invokes the new connectionAttemptCompleted() method.
  3. I removed the throttling and removed the relevant warn log line in the strategy implementation classes. Please verify that this is what you had in mind.
  4. I updated the message in e57e238.

Thank you for the review, Yang!

Member

Thanks for addressing the comment!

For 2, as discussed separately, you might be right on this, but I am not 100% certain it holds in all cases. The synchronization, together with the fact that only a single thread actually proceeds with building the connection, plus the use of executor.submit to the management thread pool, should ensure visibility in the request path. But the final listener can be invoked on a few different threads in either the success or the failure case, e.g. generic, management and transport_worker (which can be a different one from the one that opened the initial connection). I am not fully sure that all these threads will have the right visibility in all cases.
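
As a rough illustration of the concern, the sketch below updates shared state from two separate single-threaded executors, standing in for the case where the final listener runs on different thread pools. Using an atomic type (or a volatile field) makes each update visible to whichever thread runs next without relying on a happens-before edge from elsewhere. `CrossThreadAttemptCounter` and `runOnSeparateThreads` are hypothetical names, and the executors named `generic` and `management` are plain JDK executors, not the Elasticsearch thread pools.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; not code from the PR.
final class CrossThreadAttemptCounter {
    // AtomicLong gives both atomic increments and cross-thread visibility.
    private final AtomicLong attempts = new AtomicLong();

    long recordAttempt() {
        return attempts.incrementAndGet();
    }

    long attempts() {
        return attempts.get();
    }

    // Records perThread attempts from each of two independent threads and
    // returns the total observed afterwards.
    static long runOnSeparateThreads(int perThread) {
        CrossThreadAttemptCounter counter = new CrossThreadAttemptCounter();
        ExecutorService generic = Executors.newSingleThreadExecutor();
        ExecutorService management = Executors.newSingleThreadExecutor();
        try {
            Runnable work = () -> {
                for (int i = 0; i < perThread; i++) {
                    counter.recordAttempt();
                }
            };
            Future<?> first = generic.submit(work);
            Future<?> second = management.submit(work);
            first.get();  // wait for both tasks before reading the total
            second.get();
            return counter.attempts();
        } catch (Exception e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            generic.shutdown();
            management.shutdown();
        }
    }
}
```

With a plain `long` field and `attempts++` instead, the total could come up short, since neither atomicity nor visibility would be guaranteed between the two threads.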

Member

@ywangd ywangd left a comment

LGTM

@JeremyDahlgren JeremyDahlgren merged commit 355920e into elastic:main Sep 11, 2025
34 checks passed
JeremyDahlgren added a commit to JeremyDahlgren/elasticsearch that referenced this pull request Oct 30, 2025
This change registers counters for initial and reconnect
attempt failures.  The change also required minor refactoring
to make the metrics registry available from the TransportService
that is passed to the RemoteConnectionStrategy constructor.
This change builds on the work done in elastic#134415.

Resolves: ES-12695
JeremyDahlgren added a commit that referenced this pull request Nov 8, 2025
This change registers a counter to track initial and reconnect
attempt failures.  The change also required minor refactoring
to make the metrics registry available from the TransportService
that is passed to the RemoteClusterService constructor.
This change builds on the work done in #134415.

Resolves: ES-12695
Kubik42 pushed a commit to Kubik42/elasticsearch that referenced this pull request Nov 10, 2025
…37406)

This change registers a counter to track initial and reconnect
attempt failures.  The change also required minor refactoring
to make the metrics registry available from the TransportService
that is passed to the RemoteClusterService constructor.
This change builds on the work done in elastic#134415.

Resolves: ES-12695