Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OFSwitch connection check to Agent's liveness probes #4126

Merged
merged 1 commit into from
Oct 18, 2022

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Aug 17, 2022

This helps automatic recovery if some issues cause OFSwitch reconnection to
not work properly.

It also fixes a race condition between the IsConnected and SwitchConnected
methods of OFBridge and makes necessary changes to the constructor of
APIServer to allow testing.

For #4092

Signed-off-by: Quan Tian [email protected]

wenyingd
wenyingd previously approved these changes Aug 17, 2022
Copy link
Contributor

@wenyingd wenyingd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@codecov
Copy link

codecov bot commented Aug 17, 2022

Codecov Report

Merging #4126 (834dc34) into main (52db699) will decrease coverage by 3.96%.
The diff coverage is 76.78%.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #4126      +/-   ##
==========================================
- Coverage   65.96%   61.99%   -3.97%     
==========================================
  Files         304      309       +5     
  Lines       46625    47736    +1111     
==========================================
- Hits        30754    29596    -1158     
- Misses      13461    15802    +2341     
+ Partials     2410     2338      -72     
Flag Coverage Δ
e2e-tests 39.35% <86.20%> (?)
integration-tests 35.02% <81.25%> (+0.06%) ⬆️
kind-e2e-tests 42.82% <89.65%> (-6.46%) ⬇️
unit-tests 44.10% <73.21%> (-0.85%) ⬇️
Impacted Files Coverage Δ
cmd/antrea-agent/agent.go 0.00% <0.00%> (ø)
cmd/antrea-agent/options.go 3.49% <82.35%> (ø)
pkg/agent/apiserver/apiserver.go 71.27% <100.00%> (+4.25%) ⬆️
pkg/ovs/openflow/ofctrl_bridge.go 76.09% <100.00%> (-0.42%) ⬇️
pkg/agent/proxy/endpointslicecache.go 0.00% <0.00%> (-83.60%) ⬇️
pkg/agent/secondarynetwork/cnipodcache/cache.go 0.00% <0.00%> (-77.56%) ⬇️
pkg/agent/controller/networkpolicy/packetin.go 21.62% <0.00%> (-54.73%) ⬇️
pkg/controller/egress/store/egressgroup.go 1.72% <0.00%> (-54.32%) ⬇️
pkg/agent/secondarynetwork/podwatch/controller.go 0.00% <0.00%> (-39.02%) ⬇️
...g/agent/controller/serviceexternalip/controller.go 35.83% <0.00%> (-33.79%) ⬇️
... and 67 more

@tnqn tnqn requested a review from jianjuns August 17, 2022 16:51
@tnqn
Copy link
Member Author

tnqn commented Aug 17, 2022

/test-all

jianjuns
jianjuns previously approved these changes Aug 17, 2022
path: /livez
port: api
scheme: HTTPS
initialDelaySeconds: 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question - why we increase it to 10?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It typically takes more than 2 seconds to connect to OVS from agent logs in my testbed. I think 5 seconds may cause a liveness failure at startup in production clusters when nodes are under resource pressure. The liveness is a backup solution just in case reconnection doesn't work as expected so it doesn't need to be too aggresive.

Besides, all Kubernetes components use 10 seconds initialDelaySeconds.

@tnqn tnqn added action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Aug 17, 2022
@tnqn tnqn dismissed stale reviews from jianjuns and wenyingd via 9d1be1e August 24, 2022 12:09
@tnqn tnqn force-pushed the add-liveness-probe branch 5 times, most recently from 1f23fcf to 2071cbc Compare August 24, 2022 16:56
@tnqn
Copy link
Member Author

tnqn commented Aug 24, 2022

Unit test failures of Test_ofPacketOutBuilder_Done were because the test relies on specific pseudo-random numbers. Created #4148 to fix it.

@tnqn tnqn force-pushed the add-liveness-probe branch 3 times, most recently from 78839f5 to f184dc0 Compare August 25, 2022 04:03
This helps automatic recovery if some issues cause OFSwitch reconnection to
not work properly.

It also fixes a race condition between the IsConnected and SwitchConnected
methods of OFBridge and makes necessary changes to the constructor of
APIServer to allow testing.

For antrea-io#4092

Signed-off-by: Quan Tian <[email protected]>
@tnqn
Copy link
Member Author

tnqn commented Aug 25, 2022

/test-all

@tnqn tnqn requested review from wenyingd and jianjuns August 25, 2022 10:33
secureServing.BindAddress = bindAddress
secureServing.BindPort = o.config.APIPort
secureServing.CipherSuites = o.tlsCipherSuites
secureServing.MinTLSVersion = o.config.TLSMinVersion
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In original code, secureServing.MinTLSVersion should be ipher.TLSVersionMap[o.config.TLSMinVersion], it is different from the current version?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are different. The original code sets the GenericAPIServer's attributes s.SecureServingInfo.CipherSuites and s.SecureServingInfo.MinTLSVersion directly after the server is created, which is bad from the perspective of encapsulation. The right way to configure the GenericAPIServer should be setting the options used to create the GenericAPIServer, which is SecureServingOptions.CipherSuites and SecureServingOptions.MinTLSVersion in this case. The options will be translated to s.SecureServingInfo.CipherSuites and s.SecureServingInfo.MinTLSVersion when creating GenericAPIServer.
https://github.com/kubernetes/kubernetes/blob/8206c9d458e321d7ad22ea9fc2e21a890790fc09/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L274-L286

Basically the previous code passed empty options when creating APIServer, translated the options itself with duplicate code in Antrea, and mutated APIServer directly.

if err != nil {
return fmt.Errorf("invalid TLSMinVersion: %v", err)
}
trimmedTLSCipherSuites := strings.ReplaceAll(o.config.TLSCipherSuites, " ", "")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 186-192 is similar to cipher.GenerateCipherSuitesList(), the difference is new code has a check with trimmedTLSCipherSuites != "", why not use this version to re-write GenerateCipherSuitesList?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As explained above, the code in pkg/util/cipher/ is duplicated with the code in https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/component-base/cli/flag/ciphersuites_flag.go and https://github.com/kubernetes/kubernetes/blob/8206c9d458e321d7ad22ea9fc2e21a890790fc09/staging/src/k8s.io/apiserver/pkg/server/options/serving.go#L274-L286. The right way to set TLS configurations is setting the SecureServingOptions with raw strings instead of calculating the results and mutating APIServer directly. The difference is not only the check, but also the data type of the slice. We are passing []string to SecureServingOptions, not []int16 to GenericAPIServer.

pkg/util/cipher/ will be removed after all APIServers are changed to use upstream code, I don't want to touch antrea-controller and flow-aggregator code in this PR so keep them for now.

@tnqn tnqn added this to the Antrea v1.9 release milestone Oct 11, 2022
Copy link
Contributor

@wenyingd wenyingd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

heanlan pushed a commit to heanlan/antrea that referenced this pull request Mar 29, 2023
)

This helps automatic recovery if some issues cause OFSwitch reconnection to
not work properly.

It also fixes a race condition between the IsConnected and SwitchConnected
methods of OFBridge and makes necessary changes to the constructor of
APIServer to allow testing.

For antrea-io#4092

Signed-off-by: Quan Tian <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants