Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework client, connect, to simplify and prepare for simpler healing. #1083

Merged
merged 20 commits into from
Oct 13, 2021

Conversation

edwarnicke
Copy link
Member

@edwarnicke edwarnicke commented Sep 20, 2021

Previously we had a single 'client' per grpc.ClientConnInterface.
This created a lot of complexity, particularly in connect.

This rework of client moves to a single 'client' that can handle
multiple grpc.ClientConnInterfaces.

This vastly simplified both the client chain and connect chain.

Finally, heal has been temporarily removed in preparation for a rework
of heal focused on back propogating Connection 'DOWN' events via
monitor. The net net at the end of the process should be:

  1. Refresh is only originated by actual clients, not the 'client' part
    of passthrough NSEs
  2. Heal will only be originated by actual clients, not the 'client' part
    pf passthrough NSEs.

Signed-off-by: Ed Warnicke [email protected]

Description

Issue link

How Has This Been Tested?

  • Added unit testing to cover
  • Tested manually
  • Tested by integration testing
  • Have not tested

Types of changes

  • Bug fix
  • New functionallity
  • Documentation
  • Refactoring
  • CI

edwarnicke added a commit to edwarnicke/sdk-vpp that referenced this pull request Sep 20, 2021
edwarnicke added a commit to edwarnicke/cmd-forwarder-vpp that referenced this pull request Sep 20, 2021
@@ -57,7 +57,6 @@ func DialOptions(options ...DialOption) []grpc.DialOption {
),
grpc.WithBlock(),
grpc.WithDefaultCallOptions(
grpc.WaitForReady(true),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bolodya1997 I removed this line because without it

func (s *nsmgrSuite) Test_ConnectToDeadNSEUsecase() {
t := s.T()
ctx, cancel := context.WithTimeout(context.Background(), time.Second*5)
defer cancel()
nsReg, err := s.nsRegistryClient.Register(ctx, defaultRegistryService())
require.NoError(t, err)
nseReg := defaultRegistryEndpoint(nsReg.Name)
counter := new(count.Server)
nse := s.domain.Nodes[0].NewEndpoint(ctx, nseReg, sandbox.GenerateTestToken, counter)
nsc := s.domain.Nodes[0].NewClient(ctx, sandbox.GenerateTestToken)
request := defaultRequest(nsReg.Name)
reqCtx, reqCancel := context.WithTimeout(ctx, time.Second)
defer reqCancel()
conn, err := nsc.Request(reqCtx, request.Clone())
require.NoError(t, err)
require.NotNil(t, conn)
require.Equal(t, 1, counter.Requests())
require.Equal(t, 5, len(conn.Path.PathSegments))
nse.Cancel()
// Simulate refresh from client
refreshRequest := request.Clone()
refreshRequest.Connection = conn.Clone()
refreshReqCtx, refreshReqCancel := context.WithTimeout(ctx, time.Second)
defer refreshReqCancel()
_, err = nsc.Request(refreshReqCtx, refreshRequest)
require.Error(t, err)
require.NoError(t, reqCtx.Err())
// Close
_, _ = nsc.Close(ctx, conn)
// Endpoint unregister
_, err = s.domain.Nodes[0].NSMgr.NetworkServiceEndpointRegistryServer().Unregister(ctx, nseReg)
require.NoError(t, err)
}
fails. In digging into it, it was failing because we had WaitForReady set to true... no idea why it didn't fail before. Any insights? Anything I missed here?

Copy link
Member

@denis-tingaikin denis-tingaikin Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the error do you see in Test_ConnectToDeadNSEUsecase with grpc.WaitForReady(true) ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that there are some reasons why we may want to use WaitForReady flag, but I am not actually sure because we most probably had started using it before I have started working on NSM :)
There was an issue with Request hanging in such case and so it is the reason why do we have such test.
We solved this issue in connect several times with several approaches during the heal development. Currently connect uses heal client (or some other monitoring client) to understand that remote side is dead and so close the gRPC connection to prevent Request hanging with WaitForReady flag set.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edwarnicke
Any update for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've restored the grpc.WaitForReady(true)

Copy link
Member

@denis-tingaikin denis-tingaikin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like parts with simplifications related to dial/clientconn, but it this PR also removes heal. So it seems to me to merge this PR we also need to do one of these options:

  1. Remove/Disable all heal tests.
  2. Wait for heal rework.

// See the License for the specific language governing permissions and
// limitations under the License.

package nsmgr_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need to remove heal tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we don't have the heal replacement yet :) Will revive them when that comes in.

)

// ExampleForwarder - example of how to use the connect chain element in a forwarder
func ExampleNewServer() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like idea on adding golang examples,
Do we need to add some asserts and outputs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would make it more of a test wouldn't it?

// See the License for the specific language governing permissions and
// limitations under the License.

package connect_test
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests cover many corner cases. Can we keep them?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree with Denis, most of our connect tests have been motivated by some painful issues. Removing them will most probably lead us back to the old problems.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bolodya1997 Added the tests back... and yes, they caught some issues :)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added few missed goleak checks to connect tests (#1087), please can you rebase your tests?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -57,7 +57,6 @@ func DialOptions(options ...DialOption) []grpc.DialOption {
),
grpc.WithBlock(),
grpc.WithDefaultCallOptions(
grpc.WaitForReady(true),
Copy link
Member

@denis-tingaikin denis-tingaikin Sep 21, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What the error do you see in Test_ConnectToDeadNSEUsecase with grpc.WaitForReady(true) ?

}

// WithoutRefresh disables refresh
func WithoutRefresh() Option {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we have different token timeout for the different applications?

NSC(1h) -> NSMgr(30m) -> Forwarder(15m) -> NSMgr(30m) -> NSE

If we use such option on NSMgr, Forwarder we should probably compute the least token timeout in refresh on NSC to become sure that there will be no timeouts in the chain.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably it can be some rare case for the local scenario, because most probably all of the applications share the same spire server with same SVID TTL for all of them, but it also can be an interdomain scenario with different spire federations and so different SVID TTLs for local and remote applications.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Bolodya1997 You are completely correct we need to watch out for this. We could do as you suggested and compute that in refresh. I'm tempted though to do it in updatetoken by clamping down expiretime to the smallest in the path on the return direction of the path.

edwarnicke added a commit to edwarnicke/sdk that referenced this pull request Sep 21, 2021
edwarnicke added a commit to edwarnicke/sdk that referenced this pull request Sep 21, 2021
edwarnicke added a commit to edwarnicke/sdk that referenced this pull request Sep 21, 2021
edwarnicke added a commit to edwarnicke/sdk-sriov that referenced this pull request Oct 13, 2021
edwarnicke added a commit to edwarnicke/sdk-sriov that referenced this pull request Oct 13, 2021
edwarnicke added a commit to edwarnicke/sdk-vpp that referenced this pull request Oct 13, 2021
edwarnicke added a commit to edwarnicke/cmd-nsmgr-proxy that referenced this pull request Oct 13, 2021
edwarnicke added a commit to edwarnicke/cmd-nsmgr-proxy that referenced this pull request Oct 13, 2021
edwarnicke added a commit to edwarnicke/cmd-forwarder-vpp that referenced this pull request Oct 13, 2021
Bolodya1997 pushed a commit to Bolodya1997/cmd-nsc-vpp that referenced this pull request Oct 14, 2021
Bolodya1997 pushed a commit to Bolodya1997/cmd-nse-firewall-vpp that referenced this pull request Oct 14, 2021
Bolodya1997 pushed a commit to Bolodya1997/cmd-nsc that referenced this pull request Oct 14, 2021
Bolodya1997 pushed a commit to Bolodya1997/cmd-nsc that referenced this pull request Oct 14, 2021
Bolodya1997 pushed a commit to Bolodya1997/sdk-ovs that referenced this pull request Oct 14, 2021
denis-tingaikin added a commit to networkservicemesh/sdk-sriov that referenced this pull request Oct 14, 2021
denis-tingaikin pushed a commit to networkservicemesh/sdk-vpp that referenced this pull request Oct 14, 2021
denis-tingaikin added a commit to networkservicemesh/cmd-nsmgr-proxy that referenced this pull request Oct 14, 2021
nsmbot pushed a commit to networkservicemesh/cmd-nse-icmp-responder-vpp that referenced this pull request Oct 14, 2021
…k-vpp@main

PR link: networkservicemesh/sdk-vpp#386

Commit: d52a3eb
Author: Ed Warnicke
Date: 2021-10-14 01:04:19 -0500
Message:
  - Adapt to networkservicemesh/sdk#1083 (#386)
Signed-off-by: NSMBot <[email protected]>
nsmbot pushed a commit to networkservicemesh/cmd-nsc-vpp that referenced this pull request Oct 14, 2021
…k-vpp@main

PR link: networkservicemesh/sdk-vpp#386

Commit: d52a3eb
Author: Ed Warnicke
Date: 2021-10-14 01:04:19 -0500
Message:
  - Adapt to networkservicemesh/sdk#1083 (#386)
Signed-off-by: NSMBot <[email protected]>
denis-tingaikin added a commit to networkservicemesh/cmd-nse-firewall-vpp that referenced this pull request Oct 14, 2021
pperiyasamy pushed a commit to networkservicemesh/sdk-ovs that referenced this pull request Oct 14, 2021
nsmbot pushed a commit to networkservicemesh/cmd-forwarder-ovs that referenced this pull request Oct 14, 2021
…k-ovs@main

PR link: networkservicemesh/sdk-ovs#43

Commit: a199406
Author: Vladimir Popov
Date: 2021-10-14 13:59:29 +0700
Message:
  - Manually update to networkservicemesh/sdk#1083 (#43)
Signed-off-by: NSMBot <[email protected]>
denis-tingaikin pushed a commit to networkservicemesh/cmd-nsc that referenced this pull request Oct 14, 2021
denis-tingaikin pushed a commit to networkservicemesh/cmd-nsc-vpp that referenced this pull request Oct 14, 2021
nsmbot pushed a commit to networkservicemesh/deployments-k8s that referenced this pull request Oct 14, 2021
…d-nsc@main

PR link: https://github.com/networkservicemesh/cmd-nsc/pull/

Commit: adc8748
Author: Vladimir Popov
Date: 2021-10-14 16:23:34 +0700
Message:
  - Manually update to networkservicemesh/sdk#1083 (#290)
Signed-off-by: NSMBot <[email protected]>
nsmbot pushed a commit to networkservicemesh/deployments-k8s that referenced this pull request Oct 14, 2021
…d-nsc-vpp@main

PR link: https://github.com/networkservicemesh/cmd-nsc-vpp/pull/

Commit: 4371e4d
Author: Vladimir Popov
Date: 2021-10-14 16:23:46 +0700
Message:
  - Manually update to networkservicemesh/sdk#1083 (#267)
Signed-off-by: NSMBot <[email protected]>
nsmbot pushed a commit to networkservicemesh/deployments-k8s that referenced this pull request Oct 14, 2021
…d-nsc-vpp@main

PR link: https://github.com/networkservicemesh/cmd-nsc-vpp/pull/

Commit: 4371e4d
Author: Vladimir Popov
Date: 2021-10-14 16:23:46 +0700
Message:
  - Manually update to networkservicemesh/sdk#1083 (#267)
Signed-off-by: NSMBot <[email protected]>
edwarnicke added a commit to edwarnicke/cmd-forwarder-vpp that referenced this pull request Oct 14, 2021
denis-tingaikin pushed a commit to networkservicemesh/cmd-forwarder-vpp that referenced this pull request Oct 14, 2021
nsmbot pushed a commit to networkservicemesh/deployments-k8s that referenced this pull request Oct 14, 2021
…d-forwarder-vpp@main

PR link: https://github.com/networkservicemesh/cmd-forwarder-vpp/pull/

Commit: 5494d34
Author: Ed Warnicke
Date: 2021-10-14 10:05:42 -0500
Message:
  - Adapt to Adapt to networkservicemesh/sdk#1083 (#342)
Signed-off-by: NSMBot <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants