Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xds: add Outlier Detection Balancer #5435

Merged
merged 22 commits into from
Sep 6, 2022
Merged

Conversation

zasweq
Copy link
Contributor

@zasweq zasweq commented Jun 17, 2022

This PR adds the Outlier Detection Balancer as part of the xDS configured tree of balancers (see https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md).

Contains #5371.

RELEASE NOTES:

  • xDS: add support for Outlier Detection

@zasweq zasweq requested review from easwars and dfawley June 17, 2022 01:52
@zasweq zasweq assigned easwars and dfawley and unassigned easwars Jun 17, 2022
@zasweq zasweq added this to the 1.48 Release milestone Jun 17, 2022
@zasweq zasweq added the Type: Feature New features or improvements in behavior label Jun 17, 2022
@dfawley
Copy link
Member

dfawley commented Jun 17, 2022

It seems this still needs to be rebased due to outlierdetection files.

@zasweq zasweq force-pushed the od-bal-rebased-again branch 2 times, most recently from aeb3c19 to d8e42c2 Compare July 1, 2022 22:32
@dfawley dfawley modified the milestones: 1.48 Release, 1.49 Release Jul 1, 2022
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just skimmed through. Expect more comments as I spend more time on this.

Comment on lines +29 to +42
v3clusterpb "github.com/envoyproxy/go-control-plane/envoy/config/cluster/v3"
v3endpointpb "github.com/envoyproxy/go-control-plane/envoy/config/endpoint/v3"
v3listenerpb "github.com/envoyproxy/go-control-plane/envoy/config/listener/v3"
v3routepb "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
"google.golang.org/grpc"
"google.golang.org/grpc/credentials/insecure"
"google.golang.org/grpc/internal"
"google.golang.org/grpc/internal/envconfig"
"google.golang.org/grpc/internal/stubserver"
"google.golang.org/grpc/internal/testutils/xds/e2e"
testgrpc "google.golang.org/grpc/test/grpc_testing"
testpb "google.golang.org/grpc/test/grpc_testing"
"google.golang.org/protobuf/types/known/durationpb"
"google.golang.org/protobuf/types/known/wrapperspb"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: please group renamed proto imports separately.

// TestOutlierDetection tests an xDS configured ClientConn with an Outlier
// Detection present in the system which is a logical no-op. An RPC should
// proceed as normal.
func (s) TestOutlierDetection(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe TestOutlierDetection_EmptyConfig or TestOutlierDetection_NoopConfig to be explicit about what is being tested here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chose no-op config, empty management server update but od config is better termed "no-op". Thanks for suggestion.

"google.golang.org/protobuf/types/known/wrapperspb"
)

// TestOutlierDetection tests an xDS configured ClientConn with an Outlier
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about?

// TestOutlierDetectionXxx tests the scenario where the outlier detection feature
// is enabled on the gRPC client, but it receives no outlier detection configuration
// from the management server. This should result in a no-op outlier configuration
// being used. This test verifies that an RPC is able to proceed normally with this
// configuration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (with a little rewording).

}
}

// defaultClientResourcesSpecifyingMultipleBackendsAndOutlierDetection returns
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And the award for the author with the longest identifier name in our codebase goes to ...... lol

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol. This Java book I read called clean code told me to be verbose and explanatory in my variable names. Switched to "defaultClientResourcesMultipleBackendsAndOD". Is that better or too verbose/if so do you have a better suggestion?

Comment on lines 111 to 113
Interval: &durationpb.Duration{
Nanos: 500000000,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: single line
And probably a comment with a human readable value for this time duration.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added comment // .5 seconds.

}
}

type object struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

object is not a very descriptive name for a type which is quite central to the implementation. Maybe something like subConnState or subConnInfo or subConnRunTimeInfo or subConnRunTimeState or something better. And a comment for the type describing its purpose would be useful too.

Also, this type does not contain any mutexes. It would be helpful for the above comment to talk about how synchronization is handled for this type.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct represents information about a certain address, not a SubConn. Each SubConn does hold a ref to this though. As such, I chose the name "addressInfo".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment for addressInfo explaining what it is and also explaining how it is synchronized. Let me know if it's too verbose or if you like it.

// The call result counter object
callCounter *callCounter

// The latest ejection timestamp, or null if the address is currently not
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with
// The latest ejection timestamp, or zero if the address is currently not ejected.
And get rid of the // We represent the branching logic on the null with a time.Zero() value.

Please terminate comment sentences with periods. go/go-style/decisions#comment-sentences

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good merging. Switched. Added periods to every comment about each field in this struct.

type subConnWrapper struct {
balancer.SubConn

// "The subchannel wrappers created by the outlier_detection LB policy will
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please consider not quoting the gRFC verbatim. The gRFC text makes sense in its context. But here (and everywhere else in this PR), please make comments self-explanatory and self-contained. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched as many as I could over. I left the one for swap() though, since that defines an implementation detail that I didn't follow with a subsequent explanation. I also left the ones in NewSubConn() as I thought they made sense, and the actual Outlier Detection Interval algorithm itself as it's explaining the algorithm step by step.

Comment on lines 45 to 66
// eject(): "The wrapper will report a state update with the TRANSIENT_FAILURE
// state, and will stop passing along updates from the underlying subchannel."
func (scw *subConnWrapper) eject() {
scw.scUpdateCh.Put(&ejectedUpdate{
scw: scw,
ejected: true,
})
}

// uneject(): "The wrapper will report a state update with the latest update
// from the underlying subchannel, and resume passing along updates from the
// underlying subchannel."
func (scw *subConnWrapper) uneject() {
scw.scUpdateCh.Put(&ejectedUpdate{
scw: scw,
ejected: false,
})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these have to be methods on the subConnWrapper type?

Can callsites of scw.eject() be replaced by:

  b.scUpdateCh.Put(&ejectedUpdate{scw: scw, ejected: true})

and similarly for uneject().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't have to be, technically, but A50 clearly states to have methods on the subChannelWrapper type: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md#subchannel-wrapper. To keep implementations similar, I think it's best to do it as such. Plus, there's 8 callsites for both of these methods and I think it's cleaner for scw.eject()/scw.uneject() instead of putting stuff on update channel for each call site. However, main reason is that is explicitly defined in gRFC.

@easwars easwars assigned zasweq and unassigned easwars Jul 6, 2022
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments :D!

// TestOutlierDetection tests an xDS configured ClientConn with an Outlier
// Detection present in the system which is a logical no-op. An RPC should
// proceed as normal.
func (s) TestOutlierDetection(t *testing.T) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chose no-op config, empty management server update but od config is better termed "no-op". Thanks for suggestion.

"google.golang.org/protobuf/types/known/wrapperspb"
)

// TestOutlierDetection tests an xDS configured ClientConn with an Outlier
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (with a little rewording).

}
}

// defaultClientResourcesSpecifyingMultipleBackendsAndOutlierDetection returns
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol. This Java book I read called clean code told me to be verbose and explanatory in my variable names. Switched to "defaultClientResourcesMultipleBackendsAndOD". Is that better or too verbose/if so do you have a better suggestion?

Comment on lines 111 to 113
Interval: &durationpb.Duration{
Nanos: 500000000,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Added comment // .5 seconds.

Comment on lines 45 to 66
// eject(): "The wrapper will report a state update with the TRANSIENT_FAILURE
// state, and will stop passing along updates from the underlying subchannel."
func (scw *subConnWrapper) eject() {
scw.scUpdateCh.Put(&ejectedUpdate{
scw: scw,
ejected: true,
})
}

// uneject(): "The wrapper will report a state update with the latest update
// from the underlying subchannel, and resume passing along updates from the
// underlying subchannel."
func (scw *subConnWrapper) uneject() {
scw.scUpdateCh.Put(&ejectedUpdate{
scw: scw,
ejected: false,
})
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They don't have to be, technically, but A50 clearly states to have methods on the subChannelWrapper type: https://github.com/grpc/proposal/blob/master/A50-xds-outlier-detection.md#subchannel-wrapper. To keep implementations similar, I think it's best to do it as such. Plus, there's 8 callsites for both of these methods and I think it's cleaner for scw.eject()/scw.uneject() instead of putting stuff on update channel for each call site. However, main reason is that is explicitly defined in gRFC.

// The call result counter object
callCounter *callCounter

// The latest ejection timestamp, or null if the address is currently not
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, good merging. Switched. Added periods to every comment about each field in this struct.

}
}

type object struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This struct represents information about a certain address, not a SubConn. Each SubConn does hold a ref to this though. As such, I chose the name "addressInfo".

}
}

type object struct {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added comment for addressInfo explaining what it is and also explaining how it is synchronized. Let me know if it's too verbose or if you like it.

Comment on lines 176 to 177
odAddrs map[string]*object
odCfg *LBConfig
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted od prefix. Kept outlierDetectionBalancer though, as clusterImplBalancer had prefix on Balancer type. GracefulSwitch doesn't though so let me know if you want me to switch outlierDetectionBalancer too.

type subConnWrapper struct {
balancer.SubConn

// "The subchannel wrappers created by the outlier_detection LB policy will
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched as many as I could over. I left the one for swap() though, since that defines an implementation detail that I didn't follow with a subsequent explanation. I also left the ones in NewSubConn() as I thought they made sense, and the actual Outlier Detection Interval algorithm itself as it's explaining the algorithm step by step.

@zasweq zasweq assigned easwars and dfawley and unassigned dfawley and zasweq Jul 6, 2022

}

func (b *outlierDetectionBalancer) successRateAlgorithm() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: ejectLowSuccessRateBackends? or something like that to indicate more precisely what the function is doing. Same with the other Algorithm functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummmmmmm this is the configuration defines the algorithm as and A50, I agree that would be nicer, I'll write a comment for each xAlgorithm function.

return numAddrs
}

// meanAndStdDevOfSucceseesAtLeastRequestVolume returns the mean and std dev of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Succesees?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops. Switched.

// Reject whole config if any errors, don't persist it for later
bb := balancer.Get(lbCfg.ChildPolicy.Name)
if bb == nil {
return fmt.Errorf("balancer %q not registered", lbCfg.ChildPolicy.Name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Include the name of this LB policy, please.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to "outlier detection: child balancer...".

if b.child != nil {
b.child.Close()
}
// What if this is nil? Seems fine
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

// Note that 0 addresses is a valid update/state for a SubConn to be in.
// This is correctly handled by this algorithm (handled as part of a non singular
// old address/new address).
if len(scw.addresses) == 1 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would probably be a little clearer as:

if len(scw.addresses == 1) && len(addrs) == 1 {
	// single address to single address
} else if len(scw.addresses == 1) {
	// single address to multiple/no addresses
} else if len(addrs) == 1 {
	// multiple address to single address
} // else multiple/no addresses to multiple/no addresses; ignore

or:

switch {
case len(scw.addresses == 1) && len(addrs) == 1:
case len(scw.addresses == 1):
case len(addrs) == 1:
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, great clean suggestion. I chose option two, Easwar has made me use switches for scenarios like this and he knows go style so I'll go with option 2 thanks.

obj := b.appendIfPresent(addrs[0].Addr, scw)
// 3. Relay state with eject() recalculated (using the corresponding
// map entry to see if it's currently ejected).
if obj == nil { // uneject unconditionally because could have come from an ejected address
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This says "uneject unconditionally" but goes on to eject?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Whoops, good catch. Switched to uneject().

// ejection_timestamp + min(base_ejection_time (type: time.Time) *
// multiplier (type: int), max(base_ejection_time (type: time.Time),
// max_ejection_time (type: time.Time))), un-eject the address.
if !obj.latestEjectionTimestamp.IsZero() && now().After(obj.latestEjectionTimestamp.Add(time.Duration(min(b.cfg.BaseEjectionTime.Nanoseconds()*obj.ejectionTimeMultiplier, max(b.cfg.BaseEjectionTime.Nanoseconds(), b.cfg.MaxEjectionTime.Nanoseconds()))))) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Even with the use of max and min, this is still too long to follow. Please simplify into multiple statements or temporary variables if possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried my best. Refactored. Let me know if it looks off still.

Comment on lines 611 to 612
// This conditional only for testing (since the interval timer algorithm is
// called manually), will never hit in production.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe the test can set this field manually first, then?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixed a race condition where the intervalTimer algorithm would go off at an arbitrary point in the future if not manually calling Stop() here. (Since we call the intervalTimerAlgorithm manually in test (much cleaner than setting field in test, since it'll get written to in UpdateClientConnState(), there would still be a configured timer from the UpdateClientConnState() that would leak if not closed here).

Comment on lines 828 to 829
obj, ok := b.addrs[addr]
if !ok { // Shouldn't happen
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we pass b.addrs[addr] here (and unejectAddress) instead to guarantee this won't happen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I just realized at the call site you have access to the addrInfo (for addr, addrInfo := range b.addrs. Passed that instead.

@zasweq zasweq removed their assignment Aug 27, 2022
od, tcc, _ := setup(t)
defer internal.UnregisterOutlierDetectionBalancerForTesting()

// This first config update should a child to be built and forwarded it's
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing verb between should and a child?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh lol, sorry. Added "should".

Comment on lines 369 to 385
BalancerConfig: &LBConfig{
Interval: 10 * time.Second,
BaseEjectionTime: 30 * time.Second,
MaxEjectionTime: 300 * time.Second,
MaxEjectionPercent: 10,
SuccessRateEjection: &SuccessRateEjection{
StdevFactor: 1900,
EnforcementPercentage: 100,
MinimumHosts: 5,
RequestVolume: 100,
},
FailurePercentageEjection: &FailurePercentageEjection{
Threshold: 85,
EnforcementPercentage: 5,
MinimumHosts: 5,
RequestVolume: 50,
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably reduce some lines in the test by making a variable for the configuration without the child policy config.

var baseCfgWithoutChildConfig = BalancerConfig: &LBConfig{
	Interval:           10 * time.Second,
	BaseEjectionTime:   30 * time.Second,
	MaxEjectionTime:    300 * time.Second,
	MaxEjectionPercent: 10,
	SuccessRateEjection: &SuccessRateEjection{
		StdevFactor:           1900,
		EnforcementPercentage: 100,
		MinimumHosts:          5,
		RequestVolume:         100,
	},
	FailurePercentageEjection: &FailurePercentageEjection{
		Threshold:             85,
		EnforcementPercentage: 5,
		MinimumHosts:          5,
		RequestVolume:         50,
	},
}

And change just the child policy config here and down below when calling UpdateClientConnState.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummmm sorry I don't like this suggestion, it feels hacky and takes away readability in my opinion, since it's appending to something declared earlier in the funciton (think your comment on the other file about moving var declarations nearer to where they're used). I'll switch it to noop config though, since it doesn't need all the other fields, similar to your comment down below.

}
scw2, err = bd.ClientConn.NewSubConn([]resolver.Address{{Addr: "address2"}}, balancer.NewSubConnOptions{})
if err != nil {
t.Fatalf("error in od.NewSubConn call: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just FYI: Calling t.Fatal from goroutines other than the main test goroutine only causes that particular goroutine to exit. The main test goroutine will keep running. So, it is better to call t.Error in these cases, to make it clear to the reader of the test.

Also, the err != nil check is repeated here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. I actually did not know that. Switched to error and deleted duplicate err != nil check.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually breaks the package documentation: "FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. This isn't created in a new goroutine, this is called in the main test goroutine. The calls down to the child all happen inline from the test goroutine, which then calls this. Switched regardless.

Comment on lines 692 to 711
Interval: 8 * time.Second,
BaseEjectionTime: 30 * time.Second,
MaxEjectionTime: 300 * time.Second,
MaxEjectionPercent: 10,
SuccessRateEjection: &SuccessRateEjection{
StdevFactor: 1900,
EnforcementPercentage: 100,
MinimumHosts: 5,
RequestVolume: 100,
},
FailurePercentageEjection: &FailurePercentageEjection{
Threshold: 85,
EnforcementPercentage: 5,
MinimumHosts: 5,
RequestVolume: 50,
},
ChildPolicy: &internalserviceconfig.BalancerConfig{
Name: t.Name(),
Config: emptyChildConfig{},
},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the complete config required for this test case? Can some fields be omitted?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Deleted everything but interval and SuccessRateEjection, since that causes it to be a non noop config. For this and the next UpdateClientConnState() below.

},
},
BalancerConfig: &LBConfig{
Interval: 1<<63 - 1, // so the interval will never run unless called manually in test.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use maxInt here as well. Thanks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and everywhere else.

Comment on lines 215 to 218
okAddresses := []resolver.Address{
{Addr: addresses[0]},
{Addr: addresses[1]},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move this definition closer to where it is used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

// The full list of addresses.
fullAddresses := []resolver.Address{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

defer mr.Close()

sc := internal.ParseServiceConfig.(func(string) *serviceconfig.ParseResult)(test.odscJSON)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nix newline here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 290 to 301
// The addresses which don't return errors.
okAddresses := []resolver.Address{
{Addr: addresses[0]},
{Addr: addresses[1]},
}

// The full list of addresses.
fullAddresses := []resolver.Address{
{Addr: addresses[0]},
{Addr: addresses[1]},
{Addr: addresses[2]},
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 312 to 316
"childPolicy": [
{
"round_robin": {}
}
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And this one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and everywhere else.

@easwars easwars assigned zasweq and unassigned easwars and dfawley Aug 29, 2022
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comments :D

//
// Returns a non-nil error if context deadline expires before RPCs start to get
// roundrobined across the given backends.
func checkRoundRobinRPCs(ctx context.Context, client testpb.TestServiceClient, addrs []resolver.Address) error {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, unfortunately this can't. I had to switch the logic for this one, because I start bad backends which return errors that I know will get called. Yours expect the EmptyCall RPC to return a non nil error.

od, tcc, _ := setup(t)
defer internal.UnregisterOutlierDetectionBalancerForTesting()

// This first config update should a child to be built and forwarded it's
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh lol, sorry. Added "should".

},
},
BalancerConfig: &LBConfig{
Interval: 1<<63 - 1, // so the interval will never run unless called manually in test.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and everywhere else.

},
},
BalancerConfig: &LBConfig{
Interval: 1<<63 - 1, // so the interval will never run unless called manually in test.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done here and everywhere else.

if err := backend.StartServer(); err != nil {
t.Fatalf("Failed to start backend: %v", err)
}
t.Logf("Started good TestService backend at: %q", backend.Address)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lol, good point. Switched.

Comment on lines 369 to 385
BalancerConfig: &LBConfig{
Interval: 10 * time.Second,
BaseEjectionTime: 30 * time.Second,
MaxEjectionTime: 300 * time.Second,
MaxEjectionPercent: 10,
SuccessRateEjection: &SuccessRateEjection{
StdevFactor: 1900,
EnforcementPercentage: 100,
MinimumHosts: 5,
RequestVolume: 100,
},
FailurePercentageEjection: &FailurePercentageEjection{
Threshold: 85,
EnforcementPercentage: 5,
MinimumHosts: 5,
RequestVolume: 50,
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ummmm sorry I don't like this suggestion, it feels hacky and takes away readability in my opinion, since it's appending to something declared earlier in the funciton (think your comment on the other file about moving var declarations nearer to where they're used). I'll switch it to noop config though, since it doesn't need all the other fields, similar to your comment down below.

Comment on lines 692 to 711
Interval: 8 * time.Second,
BaseEjectionTime: 30 * time.Second,
MaxEjectionTime: 300 * time.Second,
MaxEjectionPercent: 10,
SuccessRateEjection: &SuccessRateEjection{
StdevFactor: 1900,
EnforcementPercentage: 100,
MinimumHosts: 5,
RequestVolume: 100,
},
FailurePercentageEjection: &FailurePercentageEjection{
Threshold: 85,
EnforcementPercentage: 5,
MinimumHosts: 5,
RequestVolume: 50,
},
ChildPolicy: &internalserviceconfig.BalancerConfig{
Name: t.Name(),
Config: emptyChildConfig{},
},
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Deleted everything but interval and SuccessRateEjection, since that causes it to be a non noop config. For this and the next UpdateClientConnState() below.

}
scw2, err = bd.ClientConn.NewSubConn([]resolver.Address{{Addr: "address2"}}, balancer.NewSubConnOptions{})
if err != nil {
t.Fatalf("error in od.NewSubConn call: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting. I actually did not know that. Switched to error and deleted duplicate err != nil check.

}
scw2, err = bd.ClientConn.NewSubConn([]resolver.Address{{Addr: "address2"}}, balancer.NewSubConnOptions{})
if err != nil {
t.Fatalf("error in od.NewSubConn call: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This actually breaks the package documentation: "FailNow must be called from the goroutine running the test or benchmark function, not from other goroutines created during the test."

}
scw2, err = bd.ClientConn.NewSubConn([]resolver.Address{{Addr: "address2"}}, balancer.NewSubConnOptions{})
if err != nil {
t.Fatalf("error in od.NewSubConn call: %v", err)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait. This isn't created in a new goroutine, this is called in the main test goroutine. The calls down to the child all happen inline from the test goroutine, which then calls this. Switched regardless.

@zasweq zasweq assigned easwars and dfawley and unassigned zasweq Aug 29, 2022
@easwars easwars removed their assignment Aug 29, 2022
pr.Done(di)
}
}
// Shouldn't happen, defensive programming.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is illegal (though it should be handled gracefully here, of course). SubConn is required to be valid or the channel will re-pick after a new picker is provided. Can you make the tests not set it to nil?

pickerUpdateCh: buffer.NewUnbounded(),
}
b.logger = prefixLogger(b)
b.logger.Infof("Created")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this show the config too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm I wouldn't mind that, but I just triaged the rest of the xDS balancer tree (which I based this on in the first place), and every balancer logs this word exactly. Thus, for the sake of consistency, I would prefer to keep it as such. Of course, if you feel strongly about this I'll go ahead and switch it over.

return nil
b := &outlierDetectionBalancer{
cc: cc,
bOpts: bOpts,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This field seems to be unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Graceful Switch gets constructed with this so we are good here. Removed this field.

err := b.child.SwitchTo(bb)
if err != nil {
b.childMu.Unlock()
return err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should wrap this error so there is more context whenever it's printed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to (pun unintended): return fmt.Errorf("outlier detection: error switching to child of type %q: %v", lbCfg.ChildPolicy.Name, err)

xds/internal/balancer/outlierdetection/balancer.go Outdated Show resolved Hide resolved
xds/internal/balancer/outlierdetection/callcounter.go Outdated Show resolved Hide resolved
Copy link
Contributor Author

@zasweq zasweq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the pass bossman!

xds/internal/balancer/outlierdetection/balancer.go Outdated Show resolved Hide resolved
xds/internal/balancer/outlierdetection/callcounter.go Outdated Show resolved Hide resolved
return nil
b := &outlierDetectionBalancer{
cc: cc,
bOpts: bOpts,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Graceful Switch gets constructed with this so we are good here. Removed this field.

pickerUpdateCh: buffer.NewUnbounded(),
}
b.logger = prefixLogger(b)
b.logger.Infof("Created")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Umm I wouldn't mind that, but I just triaged the rest of the xDS balancer tree (which I based this on in the first place), and every balancer logs this word exactly. Thus, for the sake of consistency, I would prefer to keep it as such. Of course, if you feel strongly about this I'll go ahead and switch it over.

err := b.child.SwitchTo(bb)
if err != nil {
b.childMu.Unlock()
return err
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switched to (pun unintended): return fmt.Errorf("outlier detection: error switching to child of type %q: %v", lbCfg.ChildPolicy.Name, err)

xds/internal/balancer/outlierdetection/balancer.go Outdated Show resolved Hide resolved
xds/internal/balancer/outlierdetection/balancer.go Outdated Show resolved Hide resolved
@dfawley dfawley assigned zasweq and unassigned dfawley Sep 1, 2022
Comment on lines 173 to 180
// childMu protects the closing of the child and also guarantees updates to
// the child are sent synchronously (to uphold the balancer.Balancer API
// guarantee of synchronous calls).
//
// For example, run() could read that the child is not nil while processing
// SubConn updates, and then Close() could write to the the child, clearing
// the child, making it nil, then you try and update a cleared and already
// closed child, which breaks the balancer.Balancer API.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about:

// childMu guards calls into child and also synchronizes reads of child in run() and the write in Close().

If you make sure b.run() exits before calling b.child.Close(), though, you don't need that last part of the sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, waited for b.run() to exit. Changed docstring to just the first part of that sentence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also removed the clearing of the child as an invariant of close, and the reading of the child (determining whether it's been cleared or not).

@zasweq zasweq assigned dfawley and unassigned zasweq Sep 1, 2022
Copy link
Member

@dfawley dfawley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

@dfawley dfawley assigned zasweq and unassigned dfawley Sep 6, 2022
@zasweq zasweq merged commit f7d2036 into grpc:master Sep 6, 2022
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 6, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Type: Feature New features or improvements in behavior
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants