
Conversation

@xinlian12 (Member) commented Oct 6, 2020

This PR is mostly a duplicate of the ConnectionStateListener change included in PR #14697, with a few differences.

Implementation:
Each RntbdServiceEndpoint has a RntbdConnectionStateListener that tracks the set of PartitionKeyRangeIdentity values routed through that endpoint. A new PartitionKeyRangeIdentity is added whenever RntbdServiceEndpoint.request() is called.

When we detect that a server is going down, we remove all of the affected PartitionKeyRangeIdentity entries from the gateway address cache. Currently, only ClosedChannelException triggers onConnectionEvent, since it is the most reliable signal that the server is going down.
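
A minimal sketch of the tracking-and-eviction idea, assuming a hypothetical ConnectionStateListenerSketch class and a Consumer-based eviction callback (the real RntbdConnectionStateListener differs in its exact shape):

```java
// Illustrative sketch only; not the SDK's actual RntbdConnectionStateListener.
import java.nio.channels.ClosedChannelException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

import com.azure.cosmos.implementation.PartitionKeyRangeIdentity;

public final class ConnectionStateListenerSketch {

    // partition key ranges that have been routed through this endpoint
    private final Set<PartitionKeyRangeIdentity> partitionAddressCache =
        ConcurrentHashMap.newKeySet();

    // callback that evicts the gateway address cache entry for one partition key range
    private final Consumer<PartitionKeyRangeIdentity> addressCacheRemover;

    public ConnectionStateListenerSketch(Consumer<PartitionKeyRangeIdentity> addressCacheRemover) {
        this.addressCacheRemover = addressCacheRemover;
    }

    // Called from the endpoint's request path: track the partition key range of each request.
    public void onBeforeSendRequest(PartitionKeyRangeIdentity pkRangeIdentity) {
        this.partitionAddressCache.add(pkRangeIdentity);
    }

    // Only a closed channel is treated as a reliable "server going down" signal.
    public void onException(Throwable error) {
        if (error instanceof ClosedChannelException) {
            this.onConnectionEvent();
        }
    }

    private void onConnectionEvent() {
        // Evict the cached addresses of every partition key range served through this
        // endpoint so the next request triggers an address refresh from the gateway.
        this.partitionAddressCache.forEach(this.addressCacheRemover);
    }
}
```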

Workflow:
Three major workflows can be triggered (a handler sketch follows the list).

  1. The server starts a graceful connection closure (FIN).
    The normal Netty channel close flow is triggered (closeFuture, inactive, unregister). RntbdRequestManager completes all pending requests on the channel with ClosedChannelException.

  2. The server closes a connection ungracefully (RST).
    This triggers Netty's exceptionCaught() flow. `RntbdRequestManager.exceptionCaught()` is called and completes all pending requests on the channel with IOException. The channel is then closed.

  3. The server is down and the client tries to open a new connection.
    The connection attempt eventually fails with ConnectTimeoutException.
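
Below is a minimal Netty handler sketch of the first two failure paths (the third happens at connect time, before a channel exists). It is only an illustration under assumed names; PendingRequestHandlerSketch and its pendingRequests map are hypothetical and much simpler than the SDK's RntbdRequestManager.

```java
// Minimal Netty handler sketch of the two channel-level failure paths described above.
import io.netty.channel.ChannelHandlerContext;
import io.netty.channel.ChannelInboundHandlerAdapter;

import java.io.IOException;
import java.nio.channels.ClosedChannelException;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

public final class PendingRequestHandlerSketch extends ChannelInboundHandlerAdapter {

    // requestId -> future awaiting the response for that request
    private final Map<Long, CompletableFuture<Object>> pendingRequests = new ConcurrentHashMap<>();

    // Workflow 1 (FIN): a graceful close ends in channelInactive; fail every pending
    // request with ClosedChannelException, which is the signal the listener reacts to.
    @Override
    public void channelInactive(ChannelHandlerContext context) throws Exception {
        this.completeAllPendingRequests(new ClosedChannelException());
        super.channelInactive(context);
    }

    // Workflow 2 (RST): an abortive close surfaces through exceptionCaught; fail the
    // pending requests with an IOException and close the channel.
    @Override
    public void exceptionCaught(ChannelHandlerContext context, Throwable cause) {
        this.completeAllPendingRequests(new IOException("channel failed", cause));
        context.close();
    }

    private void completeAllPendingRequests(Exception error) {
        this.pendingRequests.values().forEach(pending -> pending.completeExceptionally(error));
        this.pendingRequests.clear();
    }
}
```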

Test:
Test33:
Captured TCP packets during a package upgrade and a manual VM restart. Based on the TCP traces, the first and third workflows were triggered.

Test Results from benchmark run:
Pending

Test Results from CTL run:
Pending

Annie Liang added 2 commits October 6, 2020 10:31
@xinlian12 (Member Author):

/azp run java - cosmos - tests

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@moderakh (Contributor) left a comment

RntbdConnectionStateListener::partitionAddressCache is a local cache and can be stale. As a result, RntbdConnectionStateListener may remove addresses of unrelated partition key ranges, causing unnecessary address refreshes.

Consider this scenario (a toy illustration follows the list):

  1. PKR1, PKR2, and PKR3 are hosted on the same physical node.
  2. request1 (PKR1), request2 (PKR2), and request3 (PKR3) are sent; as a result, RntbdConnectionStateListener::partitionAddressCache is populated with PKR1, PKR2, and PKR3.
  3. Due to partition movement, PKR1 moves off the physical node, but PKR2 and PKR3 stay.
  4. The node hosting PKR2 and PKR3 now shuts down.
  5. RntbdConnectionStateListener still thinks PKR1 is on this physical node and therefore removes the addresses of PKR1, PKR2, and PKR3, even though PKR1's addresses should not have been removed.
  6. This results in an unnecessary address refresh for PKR1.
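
A toy illustration of the over-invalidation, reusing the hypothetical ConnectionStateListenerSketch from the sketch in the PR description above and assuming PartitionKeyRangeIdentity's (collectionRid, partitionKeyRangeId) constructor:

```java
// Toy illustration of the stale-cache scenario (hypothetical names, not SDK code).
import java.nio.channels.ClosedChannelException;

import com.azure.cosmos.implementation.PartitionKeyRangeIdentity;

public final class StaleCacheScenario {
    public static void main(String[] args) {
        ConnectionStateListenerSketch listener = new ConnectionStateListenerSketch(
            pkRange -> System.out.println("address refresh forced for " + pkRange));

        PartitionKeyRangeIdentity pkr1 = new PartitionKeyRangeIdentity("collectionRid", "1");
        PartitionKeyRangeIdentity pkr2 = new PartitionKeyRangeIdentity("collectionRid", "2");
        PartitionKeyRangeIdentity pkr3 = new PartitionKeyRangeIdentity("collectionRid", "3");

        // Step 2: requests for PKR1, PKR2, and PKR3 all go through the same endpoint.
        listener.onBeforeSendRequest(pkr1);
        listener.onBeforeSendRequest(pkr2);
        listener.onBeforeSendRequest(pkr3);

        // Step 3: PKR1 moves to another node, but the local set is never trimmed.

        // Steps 4-6: the node goes down; the listener evicts PKR1 along with PKR2 and
        // PKR3, forcing an unnecessary address refresh for PKR1.
        listener.onException(new ClosedChannelException());
    }
}
```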

@moderakh (Contributor) left a comment

LGTM (please make sure we run CTL).

@xinlian12, great work.

Please track the following as an improvement outside of the scope of this PR:

RntbdConnectionStateListener::partitionAddressCache is a local cache and can be stale; as a result, RntbdConnectionStateListener may remove addresses of unrelated partition key ranges, causing unnecessary address refreshes (see the scenario described in the earlier review comment).

@moderakh (Contributor) left a comment

The PR looks good code-wise.

There are a few improvements that Annie is tracking outside of the scope of this PR.

For this PR, however, we are waiting on:

  1. CTL results.
  2. Perf testing (there was some discussion that enabling this feature on the .NET side resulted in a perf impact); we need to measure whether that is also the case in Java.

@xinlian12 (Member Author):

/azp run java - cosmos - tests

@azure-pipelines:

Azure Pipelines successfully started running 1 pipeline(s).

@xinlian12 (Member Author) commented Oct 19, 2020

Test Results:

1. Adobe Tests (default config):
WestUS2 database, WestUS2 VM (16 cores)
(screenshots)

2. Benchmark Runner - ReadLatency:
Test33, 10,000 documents, 1-1024 concurrency, each concurrency level running for 5 minutes:

P95: (chart)
P99: (chart)
P999: (chart)
Throughput: (chart)

3. CTL Run:
Test33
(screenshots)

Test results from a 4-core CPU:

Adobe Tests: (screenshots)
Benchmark test results: (screenshots)
@xinlian12 xinlian12 merged commit f0445e4 into Azure:master Oct 21, 2020
@xinlian12 xinlian12 deleted the ConnectionStateListener branch February 3, 2021 22:13