Skip to content

Conversation

@Vladsz83
Copy link
Contributor

@Vladsz83 Vladsz83 commented May 22, 2020

This PR is first step of improvement and quickening of node failure detection. We should obtain simple, predictable and configurable node pinging.

Fixes:

  1. Connection failure is kept within IgniteConfiguration.failureDetectionTimeout instead of 500ms + IgniteConfiguration.failureDetectionTimeout.

  2. Interval of connection checking in TCP discovery made rely on configured failure detection timeout. Previous 500ms is the minimal interval at now. This is done to get robust node pinging and keep failure detection timeout accurate.

  3. Removed additional connection checking. This premature node ping relied also on any received message. Imagine: if node 2 receives no message from previous node 1 within some time, it decides to do extra ping next node 3 not waiting for regular ping. This brought mess, confusion and gave no considerable guaranties.

Behavior changes:

  1. TcpDiscoveryConnectionCheckMessage is not sent if there is a message traffic within actual failure detection timeout because any message checks connection.

  2. Failure detection timeout is now overal timeout since last message sent. Not a timeout on current message exchange.

@rkondakov
Copy link
Contributor

Please check the correctness of the jira issue number in the PR heading. It looks like it is incorrect. IGNITE-13021 is about the new SQL engine, not about the connectivity.

@Vladsz83 Vladsz83 changed the title IGNITE-13021 Make node connection checking rely on the configuration. Simplify node ping routine. IGNITE-13012 Make node connection checking rely on the configuration. Simplify node ping routine. May 22, 2020
@Vladsz83
Copy link
Contributor Author

Please check the correctness of the jira issue number in the PR heading. It looks like it is incorrect. IGNITE-13021 is about the new SQL engine, not about the connectivity.

Fixed on IGNITE-13012. Thanks!

@Vladsz83
Copy link
Contributor Author

Vladsz83 commented Jun 10, 2020

@sergey-chugunov-1985 , I find you have good experience in TcpDiscoverySpi. Could you take a look at this ticket too?

hasRemoteSrvNodes = ring.hasRemoteServerNodes();

if (hasRemoteSrvNodes) {
if (hasRemoteSrvNodes)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to call updateLastSentMessageTime method here as well?

Copy link
Contributor Author

@Vladsz83 Vladsz83 Jun 23, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to call updateLastSentMessageTime method here as well?

We hasn't successfully sent message here, we hasn't received RES_OK.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you can see, we call updateLastSentMessageTime() after successful reading spi.readReceipt or proper TcpDiscoveryHandshakeResponse. These are the places where we are sure the message was sent and connection is OK.

@anton-vinogradov anton-vinogradov merged commit 87aad40 into apache:master Jun 24, 2020
@Vladsz83 Vladsz83 deleted the IGNITE-13012 branch June 24, 2020 10:22
kartiksomani pushed a commit to kartiksomani/ignite that referenced this pull request Sep 15, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants