Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows]Check HostNamespace value when removing HcnEndpoint #2306

Merged
merged 1 commit into from
Jun 29, 2021

Conversation

wenyingd
Copy link
Contributor

@wenyingd wenyingd commented Jun 24, 2021

For docker on Windows, no HostComputeNamespace is created before calling CNI to add network configuration. But containerd would do it. If a HnsEndpoint is created without HostComputeNamespace, a default value "00000000-0000-0000-0000-000000000000" should be set. HCN API should return an Error if a HnsEndpoint is removed from host Namespace "00000000-0000-0000-0000-000000000000". Hence, we skip invocation of "hcn.RemoveNamespaceEndpoint" for docker case.

Signed-off-by: wenyingd [email protected]

Fixes #2301

@wenyingd wenyingd requested review from lzhecheng and tnqn June 24, 2021 12:33
@wenyingd
Copy link
Contributor Author

/test-all

@codecov-commenter
Copy link

codecov-commenter commented Jun 24, 2021

Codecov Report

Merging #2306 (b96fdac) into main (ef14ad1) will increase coverage by 3.43%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #2306      +/-   ##
==========================================
+ Coverage   62.05%   65.48%   +3.43%     
==========================================
  Files         277      277              
  Lines       21588    21588              
==========================================
+ Hits        13396    14137     +741     
+ Misses       6792     6008     -784     
- Partials     1400     1443      +43     
Flag Coverage Δ
e2e-tests 55.81% <ø> (?)
kind-e2e-tests 52.41% <ø> (+0.06%) ⬆️
unit-tests 41.87% <ø> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...gent/controller/networkpolicy/status_controller.go 75.34% <0.00%> (-2.74%) ⬇️
pkg/ovs/openflow/ofctrl_bridge.go 50.36% <0.00%> (+0.36%) ⬆️
pkg/agent/openflow/network_policy.go 76.41% <0.00%> (+0.59%) ⬆️
pkg/agent/cniserver/pod_configuration.go 54.68% <0.00%> (+0.78%) ⬆️
pkg/agent/flowexporter/exporter/exporter.go 79.58% <0.00%> (+0.91%) ⬆️
pkg/controller/egress/store/egressgroup.go 1.38% <0.00%> (+1.38%) ⬆️
pkg/ovs/openflow/ofctrl_flow.go 47.36% <0.00%> (+1.75%) ⬆️
pkg/flowaggregator/certificate.go 79.27% <0.00%> (+2.70%) ⬆️
pkg/agent/route/route_linux.go 47.87% <0.00%> (+3.19%) ⬆️
pkg/agent/openflow/cookie/allocator.go 67.74% <0.00%> (+3.22%) ⬆️
... and 39 more

@antoninbas
Copy link
Contributor

@wenyingd please backport to 0.13, 1.0 and 1.1 once this is merged (see https://github.com/antrea-io/antrea/blob/main/docs/contributors/cherry-picks.md for directions)

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wenyingd I feel the commit message is a bit hard to understand.

If a HnsEndpoint is created without HostComputeNamespace, a default
value "00000000-0000-0000-0000-000000000000" should be set.

In which situation the if condition could happen? does it depend on user input or it's always like this. And do you mean "If ..., a default value ... will be set."?

But we can't find and HnsEndpoint in this namespace.

s/and/the/ ?

@@ -363,6 +363,10 @@ func (ic *ifConfigurator) removeHNSEndpoint(endpoint *hcsshim.HNSEndpoint, conta
return nil
}

func isValidHostNamespace(hostNamespace string) bool {
return hostNamespace != "" && hostNamespace != "00000000-0000-0000-0000-000000000000"
Copy link
Member

@tnqn tnqn Jun 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In which case hostNamespace will be "00000000-0000-0000-0000-000000000000"? Does it apply to all Pods we create or it's because the pod is removed in some specific order?
Is it safe to not call RemoveNamespaceEndpoint at all in this case?

Copy link
Contributor Author

@wenyingd wenyingd Jun 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"00000000-0000-0000-0000-000000000000" is used if we use "docker" to create container on Windows. It is because docker doesn't create host namespace when creating the sandbox container. But for "containerd", creating host namespace is a step before calling CNI to configure container network, so there should be a valid value.
To support containerd, we have a check here to ensure we could remove the HcnEndpoint(HnsEndpoint) from the host namespace.
However, if we remove HcnEndpoint from host namespace "00000000-0000-0000-0000-000000000000", Windows HCN API might return an Error, and it will break the continual steps (deleting HndEndpoint), and kept the HnsEndpoint on the host.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation. could we also add the comment to the code for future reference? and maybe add the runtime information to #2301 so people know it's specific to docker.

@wenyingd wenyingd requested a review from tnqn June 28, 2021 03:52
@wenyingd
Copy link
Contributor Author

/test-windows-conformance
/test-ipv6-conformance
/test-ipv6-e2e
/test-ipv6-networkpolicy
/test-ipv6-only-conformance
/test-ipv6-only-e2e
/test-ipv6-only-networkpolicy

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

Do you think it's possible to have a windows version of TestDeletePod to ensure all resources associated with a Pod is deleted after the Pod is deleted to avoid regression in the future?

func TestDeletePod(t *testing.T) {

If a HnsEndpoint is created without HostComputeNamespace, a default
value "00000000-0000-0000-0000-000000000000" should be set. But we
can't find and HnsEndpoint in this namespace.

Signed-off-by: wenyingd <[email protected]>
@wenyingd
Copy link
Contributor Author

/test-all

@wenyingd
Copy link
Contributor Author

Do you think it's possible to have a windows version of TestDeletePod to ensure all resources associated with a Pod is deleted after the Pod is deleted to avoid regression in the future?

func TestDeletePod(t *testing.T) {

That would be great. But I would like to add it in next release. Besides, I am just confused do we have Windows e2e testbed now?

Copy link
Member

@tnqn tnqn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

Do you think it's possible to have a windows version of TestDeletePod to ensure all resources associated with a Pod is deleted after the Pod is deleted to avoid regression in the future?

func TestDeletePod(t *testing.T) {

That would be great. But I would like to add it in next release. Besides, I am just confused do we have Windows e2e testbed now?

There are already some Windows specific e2e tests. I believe test-windows-e2e runs e2e on windows testbed?

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

/test-windows-e2e

3 similar comments
@lzhecheng
Copy link
Contributor

/test-windows-e2e

@lzhecheng
Copy link
Contributor

/test-windows-e2e

@lzhecheng
Copy link
Contributor

/test-windows-e2e

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

/test-windows-networkpolicy
/test-windows-conformance

@wenyingd
Copy link
Contributor Author

/test-e2e

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

/test-windows-e2e

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

windows-e2e failed once but I don't think it's related to this PR, and the same test passed in normal e2e test: https://jenkins.antrea-ci.rocks/job/antrea-e2e-for-pull-request/2780/.

=== CONT  TestClusterIP
    service_test.go:59: 
        	Error Trace:	service_test.go:59
        	            				service_test.go:81
        	Error:      	Received unexpected error:
        	            	nc stdout: <>, stderr: <nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	nc: 10.99.89.26 (10.99.89.26:80): Connection timed out
        	            	>, err: <command terminated with exit code 1>
        	Test:       	TestClusterIP
        	Messages:   	Pod client-on-same-node should be able to connect 10.99.89.26:80, but was not able to connect
=== CONT  TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service
    testing.go:1103: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
=== CONT  TestClusterIP
    fixtures.go:220: Exporting test logs to '/var/lib/jenkins/workspace/antrea-windows-e2e-for-pull-request/antrea-test-logs/TestClusterIP/beforeTeardown.Jun29-09-18-08'
    fixtures.go:324: Error when exporting kubelet logs: error when running journalctl on Node 'a-ms-0005-0', is it available? Error: <nil>
    fixtures.go:345: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestClusterIP (74.15s)
    --- FAIL: TestClusterIP/ClusterIP (0.00s)
        --- PASS: TestClusterIP/ClusterIP/Same_Linux_Node_can_access_the_Service (0.75s)
        --- PASS: TestClusterIP/ClusterIP/Different_Linux_Node_can_access_the_Service (0.90s)
        --- PASS: TestClusterIP/ClusterIP/Windows_host_can_access_the_Service (5.51s)
        --- PASS: TestClusterIP/ClusterIP/Linux_Pod_on_different_Node_can_access_the_Service (18.12s)
        --- FAIL: TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service (36.10s)

Merging it to start backporting, will watch latest windows-e2e test for this PR to see if the failure is consistent.

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

Weird, windows-e2e failed another time for same reason:

=== CONT  TestClusterIP
    service_test.go:59: 
        	Error Trace:	service_test.go:59
        	            				service_test.go:81
        	Error:      	Received unexpected error:
        	            	nc stdout: <>, stderr: <nc: 10.106.134.164 (10.106.134.164:80): Connection timed out
        	            	nc: 10.106.134.164 (10.106.134.164:80): Connection timed out
        	            	nc: 10.106.134.164 (10.106.134.164:80): Connection timed out
        	            	nc: 10.106.134.164 (10.106.134.164:80): Connection timed out
        	            	nc: 10.106.134.164 (10.106.134.164:80): Connection timed out
        	            	>, err: <command terminated with exit code 1>
        	Test:       	TestClusterIP
        	Messages:   	Pod client-on-same-node should be able to connect 10.106.134.164:80, but was not able to connect
=== CONT  TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service
    testing.go:1048: test executed panic(nil) or runtime.Goexit: subtest may have called FailNow on a parent test
=== CONT  TestClusterIP
    fixtures.go:220: Exporting test logs to '/var/lib/jenkins/workspace/antrea-windows-e2e-for-pull-request/antrea-test-logs/TestClusterIP/beforeTeardown.Jun29-11-59-00'
    fixtures.go:324: Error when exporting kubelet logs: error when running journalctl on Node 'a-ms-0001-0', is it available? Error: <nil>
    fixtures.go:345: Deleting 'antrea-test' K8s Namespace
--- FAIL: TestClusterIP (73.73s)
    --- FAIL: TestClusterIP/ClusterIP (0.00s)
        --- PASS: TestClusterIP/ClusterIP/Same_Linux_Node_can_access_the_Service (0.74s)
        --- PASS: TestClusterIP/ClusterIP/Windows_host_can_access_the_Service (5.23s)
        --- PASS: TestClusterIP/ClusterIP/Different_Linux_Node_can_access_the_Service (0.60s)
        --- PASS: TestClusterIP/ClusterIP/Linux_Pod_on_different_Node_can_access_the_Service (20.13s)
        --- FAIL: TestClusterIP/ClusterIP/Linux_Pod_on_same_Node_can_access_the_Service (36.16s)

@tnqn
Copy link
Member

tnqn commented Jun 29, 2021

Figured out the reason why it failed in windows-e2e only, filed #2330 to track the issue. It's a regression when Egress and AntreaProxy are enabled together.

@wenyingd wenyingd deleted the issue_2301 branch October 8, 2021 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

HNS Endpoint is still left on the host after Windows Pod is deleted
5 participants