Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Windows] Reconcile host-network Pods after agent is restarted #6944

Merged
merged 1 commit into from
Feb 3, 2025

Conversation

wenyingd
Copy link
Contributor

This change is to support reconciling the host-network Pods on Windows because k8s expects to let CNI manage such Pods as long as they are not using host-process containers.

Antrea has received the CmdAdd request for such Pods when they are created, so they should be included in the Pod reconcile list after agent is restarted.

Fix: #6943

@wenyingd wenyingd requested a review from tnqn January 20, 2025 08:18
@luolanzone
Copy link
Contributor

@wenyingd there is a golang-lint issue, please fix it.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 547d964 to da952ef Compare January 20, 2025 08:36
@wenyingd
Copy link
Contributor Author

@wenyingd there is a golang-lint issue, please fix it.

updated.

@wenyingd wenyingd requested a review from luolanzone January 20, 2025 08:37
@luolanzone luolanzone added the area/OS/windows Issues or PRs related to the Windows operating system. label Jan 20, 2025
@wenyingd
Copy link
Contributor Author

/test-all
/test-windows-all

pkg/agent/cniserver/server.go Outdated Show resolved Hide resolved
@@ -98,3 +99,13 @@ func getInfraContainer(containerID, netNS string) string {
func (c *CNIConfig) getInfraContainer() string {
return getInfraContainer(c.ContainerId, c.Netns)
}

// getPodsListOptions returns the Pods running on the current Node. Note, the host-network Pods are not filtered
// out on Windows because they are also managed by antrea as long as "spec.SecurityContext.windowsOptions.hostProcess"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what will happen if "spec.SecurityContext.windowsOptions.hostProcess" is configured and the Pod is not filtered out?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing was performed since we didn't create OVS interfaces for such Pods.

Having offline sync with @tnqn , we would not perform other filters on Windows in reconcile logic except for the node name to get local Pods and resourceVersion=0 to avoid performance issue, this is to try our best to ensure the reconciled Pod set can cover those Antrea have handled.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from da952ef to 065af4a Compare January 20, 2025 08:55
@wenyingd
Copy link
Contributor Author

/test-all
/test-windows-all

Copy link
Contributor

@luolanzone luolanzone left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one nit.

@tnqn @antoninbas can you take a look? thanks.

pkg/agent/cniserver/server_linux.go Outdated Show resolved Hide resolved
@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 065af4a to 1193dc7 Compare January 21, 2025 02:25
@wenyingd
Copy link
Contributor Author

/test-all
/test-windows-all

@XinShuYang
Copy link
Contributor

/test-windows-e2e
/test-windows-networkpolicy

@wenyingd
Copy link
Contributor Author

/test-windows-networkpolicy
/test-windows-e2e

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add an e2e test for this as well?

@@ -741,11 +758,19 @@ func TestReconcile(t *testing.T) {
go cniServer.podConfigurator.Run(stopCh)

// Re-install Pod1 flows
podFlowsInstalled := make(chan string, 2)
expReInstalledPodCount := 3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: expectedReinstalledPodCount

reinstall is one word

Copy link
Contributor Author

@wenyingd wenyingd Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.
I will add an e2e case to verify the host-network Pod traffic is working even if antrea-agent is restarted on Windows.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 1193dc7 to f6c3466 Compare January 22, 2025 03:26
@wenyingd
Copy link
Contributor Author

/test-windows-all

@@ -98,3 +99,15 @@ func getInfraContainer(containerID, netNS string) string {
func (c *CNIConfig) getInfraContainer() string {
return getInfraContainer(c.ContainerId, c.Netns)
}

Copy link
Contributor

@XinShuYang XinShuYang Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
| Pod Type | Spec.HostNetwork | windowsOptions.hostProcess | Kubelet CMDADD | OVS interface needed? | Reconcile needed? |
| 1. Normal Pod (non-HostNetwork) | false | false / not set | Yes | Yes | Yes |
| 2. Windows HostNetwork Pod | true | false / not set | Yes | Yes | Yes |
| 3. Windows HostProcess Pod | (any) | true | No | No | No (no real network config) |

@wenyingd Can we add a table to the code comments to explain the different behaviors of Windows Pods for developers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from f6c3466 to c3d9724 Compare January 22, 2025 05:34
@wenyingd
Copy link
Contributor Author

/test-windows-e2e

1 similar comment
@wenyingd
Copy link
Contributor Author

/test-windows-e2e

@wenyingd
Copy link
Contributor Author

/test-windows-all

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from c3d9724 to 127aaf4 Compare January 22, 2025 09:47
@wenyingd
Copy link
Contributor Author

/test-windows-e2e

assert.NoErrorf(t, err, "The Openflow entries should be consistent after Antrea agent restarts")

data.runPingMesh(t, podInfos, toolboxContainerName, true)
time.Sleep(10 * time.Minute)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you had this for debugging purposes, that's a very long sleep :)

Copy link
Contributor Author

@wenyingd wenyingd Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will remove it after finishing the debug on the new e2e case.

// This case only run on Windows. The Pods in the test includes both host-network Pod and none host-network Pod,
// because CNI on Windows is also responsible for the host-network Pod's networking as long as it is not using
// host-process containers.
func testWindowsPodConnectivityAfterAntreaRestart(t *testing.T, data *TestData) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess one issue with the test as currently written is that if there is a failure, it's not clear which one of the 2 Pods may be responsible (assuming only 1 Pod is broken after restart).
Maybe we could have 1 Linux Pod as the client, and then use it to ping each Windows Pod independently? This way we could split this test into 2 subtests, one for the regular Pod and one for the hostNetwork Pod. Would that make sense?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, updated.

test/e2e/connectivity_test.go Outdated Show resolved Hide resolved
@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 127aaf4 to 1cf8dcf Compare January 23, 2025 02:19
@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 1cf8dcf to 6234815 Compare January 23, 2025 02:30
@wenyingd
Copy link
Contributor Author

/test-windows-all

@wenyingd wenyingd force-pushed the windows_reconcile_hostnetwork branch from 6234815 to 471d2a7 Compare January 23, 2025 04:16
@wenyingd
Copy link
Contributor Author

/test-windows-all

@wenyingd
Copy link
Contributor Author

/test-all

@wenyingd
Copy link
Contributor Author

The new added e2e is succeeded, the failed case in windows-e2e is TestFQDNCacheMinTTL by issue #6891

See: http://10.164.243.223/view/Windows/job/antrea-windows-e2e-for-pull-request/89

XinShuYang
XinShuYang previously approved these changes Jan 24, 2025
Copy link
Contributor

@XinShuYang XinShuYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple nits, otherwise LGTM

if err := data.createToolboxPodOnNode(clientPod.Name, clientPod.Namespace, clientPod.NodeName, false); err != nil {
t.Fatalf("Error when creating Pod '%s': %v", clientPod.Name, err)
}
defer deletePodWrapper(t, data, clientPod.Namespace, clientPod.Name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

feel free to omit this, especially for a simple Linux Pod like this one, the Pod will be automatically deleted when the test Namespace is deleted at the end of the test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would keep this code since personally I prefer to clean up the new resources created dedicated for the case after the test.

data.runPingMesh(t, testPodInfos, toolboxContainerName, true)

// Count the OVS flows.
oriOVSFlows := data.dumpOVSFlows(t, winWorkerNode)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

spell out original
originalOVSFlows or initialOVSFlows

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, updated accordingly.

@@ -121,6 +125,100 @@ func (data *TestData) runPingMesh(t *testing.T, podInfos []PodInfo, ctrname stri
}
}

// verifyWindowsPodConnectivity checks the Pods' connectivity after antrea-agent is restarted on Windows.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

checks Pod connectivity

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

@@ -121,6 +125,100 @@ func (data *TestData) runPingMesh(t *testing.T, podInfos []PodInfo, ctrname stri
}
}

// verifyWindowsPodConnectivity checks the Pods' connectivity after antrea-agent is restarted on Windows.
// The Pods in the test includes both host-network Pod and the generic Pod, because CNI on Windows is also
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We test both the generic Pod case and the host-network Pod case, because ...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated.

This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea has received the CmdAdd request for such Pods when they are created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <[email protected]>
@wenyingd
Copy link
Contributor Author

wenyingd commented Feb 3, 2025

/test-all
/test-windows-all

Copy link
Contributor

@antoninbas antoninbas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@antoninbas antoninbas merged commit 71c4290 into antrea-io:main Feb 3, 2025
61 of 62 checks passed
@antoninbas antoninbas added kind/bug Categorizes issue or PR as related to a bug. action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. labels Feb 3, 2025
@antoninbas
Copy link
Contributor

@wenyingd will you backport this?

@wenyingd
Copy link
Contributor Author

wenyingd commented Feb 5, 2025

@wenyingd will you backport this?

Sure, I will back port it to release-2.1 and release-2.2.

wenyingd added a commit to wenyingd/antrea that referenced this pull request Feb 5, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <[email protected]>
wenyingd added a commit to wenyingd/antrea that referenced this pull request Feb 5, 2025
This change is to support reconciling the host-network Pods on Windows because
k8s expects to let CNI manage such Pods as long as they are not using
host-process containers.

Antrea received the CmdAdd request for such Pods when they were created, so
they should be included in the Pod reconcile list after agent is restarted.

Signed-off-by: Wenying Dong <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
action/backport Indicates a PR that requires backports. action/release-note Indicates a PR that should be included in release notes. area/OS/windows Issues or PRs related to the Windows operating system. kind/bug Categorizes issue or PR as related to a bug.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Windows doesn't reconcile "hostNetwork" Pod after agent is restarted
4 participants