Skip to content

Do not report failure after connections are made#99117

Merged
elasticsearchmachine merged 5 commits intoelastic:mainfrom
ywangd:es-99113
Sep 12, 2023
Merged

Do not report failure after connections are made#99117
elasticsearchmachine merged 5 commits intoelastic:mainfrom
ywangd:es-99113

Conversation

@ywangd
Copy link
Member

@ywangd ywangd commented Sep 1, 2023

Today, when the number of attempts is exhausted, ProxyConnectionStrategy checks the number of connections before returns. It reports connection failure if the number of connections is zero at the time of checking. However, this behaviour is incorrect. In rare cases, a connection can be dropped right after it is initially established and before the number checking. From the perspective of the openConnections method, it should not care whether or when opened connections are subsequently closed. As long as connections have been initially established, it should report success instead of failure.

This PR adjusts the code to report success in above situation.

Relates: #94998
Resolves: #99113

This PR removes a bogus assertion from ProxyConnectionStrategy. The
statement tries to assert that all connection exceptions are caught in
the if branch and we should have at least one connection when the code
reaches the else branch.

This is simply not true because the process of checking whether we
should open more connections and subsequently openning connections are
*not* atomic. It is possible that the number of connections changes,
e.g. remote end drops the connection, in between which makes the code
flow go to the else branch and trips the bogus assertion.

Relates: elastic#94998
Resolves: elastic#99113
@ywangd ywangd added >test Issues or PRs that are addressing/adding tests :Distributed/Network Http and internode communication implementations v8.11.0 labels Sep 1, 2023
@ywangd ywangd requested a review from DaveCTurner September 1, 2023 06:26
@elasticsearchmachine elasticsearchmachine added the Team:Distributed Meta label for distributed team. label Sep 1, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@ywangd
Copy link
Member Author

ywangd commented Sep 1, 2023

@DaveCTurner Could you please help review this PR since you were involved in the PR (#94998) that introduced the bogus assertion? Thanks!

It is a rare failure only happens if connections are dropped quickly enough. For the failed test, it reaches the end of the test and starts to cleanup connections once the addressResolver is called. This can cause connections to be closed when the openConnections is still running. For the two places (1, 2) where we call openConnections, we assume we have at least one connection (connectionManager.size() > 0). This assumption can be false when connections are dropped in between.

I considered whether we could check attempNumber before the call to openConnections in the onResponse callback.

if (shouldOpenMoreConnections()) {
openConnections(finished, attemptNumber + 1);

But it does not help the overall situation where number of connections can change right after we check it. Since the else branch already handles the case of zero connection, I think it is reasonable and easier to just remove the overly optimistic assertion.

@ywangd
Copy link
Member Author

ywangd commented Sep 11, 2023

@DaveCTurner Ping for awareness. Thank you!

@DaveCTurner
Copy link
Contributor

Ah yes this is not a valid assertion. But why check the number of open connections at all? We may as well just report success at this point, because it makes no difference if the connections have closed before or after we read connectionManager.size().

@ywangd
Copy link
Member Author

ywangd commented Sep 11, 2023

Checking the number of open connections is an existing behaviour before the assertion was added. I can see the argument that we don't need to check the connection again in the else branch because we can safely assume that at least one connection has been successful when the code reaches here. Whether the connection subsequently gets disconnected is a separate issue and the disconnetion can happen anytime after the connection is initially established.

That said, can I address it in a separate PR because it requires changing production code and writing tests? For this PR, I'd prefer to resolve the CI failure first. Thanks!

@DaveCTurner
Copy link
Contributor

Can we just mute the test until it's fixed? IMO the behaviour we have today is a genuine bug.

@ywangd ywangd changed the title [Test] Remove bogus assertion ProxyConnectionStrategy no longer reports failure after connections are made Sep 12, 2023
@ywangd ywangd changed the title ProxyConnectionStrategy no longer reports failure after connections are made Do not report failure after connections are made Sep 12, 2023
@ywangd ywangd added >bug and removed >test Issues or PRs that are addressing/adding tests labels Sep 12, 2023
@elasticsearchmachine
Copy link
Collaborator

Hi @ywangd, I've created a changelog YAML for you.

@ywangd
Copy link
Member Author

ywangd commented Sep 12, 2023

@DaveCTurner I have turned this PR into an actual fix as you suggested. I did not mute the failed test because (1) it is a very rare failure and (2) the failure in theory can happen to any tests that use proxy connection strategy and it is rather inconvenient to mute all of them. So hopefully we can get this PR merged before the next failure happens. Thanks!

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM thanks Yang

@ywangd ywangd added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Sep 12, 2023
@elasticsearchmachine elasticsearchmachine merged commit 4c8888d into elastic:main Sep 12, 2023
@ywangd ywangd deleted the es-99113 branch September 12, 2023 08:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >bug :Distributed/Network Http and internode communication implementations Team:Distributed Meta label for distributed team. v8.11.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[CI] ProxyConnectionStrategyTests testProxyStrategyWillResolveAddressesEachConnect failing

3 participants