
update deploy file: use install_multus bin to update cni file #1213

Open · wants to merge 5 commits into k8snetworkplumbingwg:master from 41tair:fix/use-multus-install
Conversation

@41tair commented Jan 19, 2024

The install CNI init container sometimes can't execute.

A sample error log looks like:

cp: cannot create regular file '/host/opt/cni/bin/multus-shim': Text file busy

Fixes #1221
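
For context on the "Text file busy" error: overwriting a binary that is currently being executed fails with ETXTBSY, whereas writing a new file next to it and renaming it over the target succeeds, because rename swaps the directory entry instead of the file contents. Below is a minimal Go sketch of that pattern, purely illustrative; the helper name and paths are assumptions, not the actual install_multus code.

    package main

    import (
        "fmt"
        "io"
        "os"
        "path/filepath"
    )

    // replaceBinary copies src over dst atomically: it writes to a temporary
    // file in dst's directory and renames it into place. Unlike `cp src dst`,
    // this does not fail with ETXTBSY when dst is a running executable.
    func replaceBinary(src, dst string) error {
        in, err := os.Open(src)
        if err != nil {
            return err
        }
        defer in.Close()

        tmp, err := os.CreateTemp(filepath.Dir(dst), ".multus-shim-*")
        if err != nil {
            return err
        }
        defer os.Remove(tmp.Name()) // harmless no-op once the rename has succeeded

        if _, err := io.Copy(tmp, in); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Chmod(0o755); err != nil {
            tmp.Close()
            return err
        }
        if err := tmp.Close(); err != nil {
            return err
        }
        // rename(2) atomically replaces the directory entry, even while the
        // old binary is still mapped by a running process.
        return os.Rename(tmp.Name(), dst)
    }

    func main() {
        // Hypothetical paths mirroring the thick-plugin init container layout.
        if err := replaceBinary("/usr/src/multus-cni/bin/multus-shim", "/host/opt/cni/bin/multus-shim"); err != nil {
            fmt.Fprintln(os.Stderr, "install failed:", err)
            os.Exit(1)
        }
    }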

@coveralls commented Feb 1, 2024

Coverage Status

coverage: 63.051% (-0.4%) from 63.432% when pulling efa51cf on 41tair:fix/use-multus-install into 3477c9c on k8snetworkplumbingwg:master.

@s1061123 (Member) left a comment


Hi @41tair, thank you for the PR!

Once we use install_multus for the thick plugin, we could change the container base image to distroless, from

FROM debian:stable-slim

to something like the thin Dockerfile uses:

FROM gcr.io/distroless/base-debian11:latest

Could you please take care of that?

@41tair 41tair requested a review from s1061123 February 21, 2024 06:02
@kfox1111

Anything left to do on this? Thick mode seems broken/risky without it.

@dougbtv (Member) left a comment


LGTM! Thanks for the update

@kfox1111

Reluctant to deploy a new instance of multus until this bug is fixed.

@dustinrouillard

Reluctant to deploy a new instance of multus until this bug is fixed.

Would love to see this fixed.

@hansbogert

As an end user, I'm wondering what's blocking this? I'm kind of stuck: 4.0.2 is not working for my cluster, but neither is master, because of #1221.

#1221 mentions that the issue should be fixed by this change.

@iiiceoo commented Aug 13, 2024

/cc @s1061123 May I ask why this PR is blocked? It actually fixes a critical bug.

@sandj1438

@dougbtv can we merge this and release a new image? It is a critical bug fix, right?

@dustinrouillard

Hey maintainers, can we please get some action on this? People are tired of manually patching and fixing their installs. Merge. The. Fix.

@kub3let commented Nov 24, 2024

@s1061123 sorry to bother you, but please give a statement about this; it's breaking a lot of clusters.

What has to be done to get this merged?

@dougbtv (Member) commented Dec 5, 2024

Please join the maintainers' call (see the k8snetworkplumbingwg/community GitHub page for details) and raise this if it's causing you problems! We'd also love tests to prove that it works.

In the meanwhile, we're re-running the tests in #1364 (and... Tomo also figured out how to kick the tests again, so I appreciate it).

Note that the install yaml is meant as a basis for your own environments and comes with no guarantees (other than that it passes the e2e tests, so we want to see this change tested that way too), which is what slowed down this process and made it difficult to get maintainer feedback. My apologies to those impacted by the usage of this yaml.

@kfox1111 commented Dec 5, 2024

@dougbtv kubevirt/cluster-network-addons-operator#1810 is the same issue too. They seem to just be following this repo's lead and running into the same issue. :/

@dougbtv (Member) commented Jan 28, 2025

Anyone willing to look into why there's a failing test? That'll help us move this along, thanks!

I appreciate the frustration, but we'd love to have passing tests, and it'll be a "no question about it" merge during the maintainers' call if there are both passing tests and an additional e2e test that validates this functionality.

@dims commented Jan 31, 2025

@dougbtv I looked into one of the logs and spotted this. It looks like portmap is mandating a newer GLIBC:

Dec 05 15:00:10 kind-worker2 containerd[167]: 2024-12-05T15:00:10Z [error] CmdDel (shim): CNI request failed with status 400: 'ContainerID:"59a6abee1e17eac1d3f833c90fcd4ef107be97902c5c21d4d85ba262cc23b795" Netns:"/var/run/netns/cni-54886520-6d07-c76b-a9f0-4e093f2bcb5d" IfName:"eth0" Args:"K8S_POD_INFRA_CONTAINER_ID=59a6abee1e17eac1d3f833c90fcd4ef107be97902c5c21d4d85ba262cc23b795;K8S_POD_UID=c6fd5fae-26cc-437a-86b4-bf6075a38f2f;IgnoreUnknown=1;K8S_POD_NAMESPACE=default;K8S_POD_NAME=simple-centos1" Path:"" ERRORED: DelegateDel: error invoking ConflistDel - "kindnet": conflistDel: error in getting result from DelNetworkList: plugin type="portmap" failed (delete): netplugin failed: "/host/opt/cni/bin/portmap: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.34' not found (required by /host/opt/cni/bin/portmap)\n/host/opt/cni/bin/portmap: /lib/x86_64-linux-gnu/libc.so.6: version `GLIBC_2.32' not found (required by /host/opt/cni/bin/portmap)\n"

@iiiceoo commented Feb 5, 2025

@dims Thank you for providing the logs. This indicates that portmap (and in fact every other CNI binary in the kindest/node image) is a dynamically linked executable and relies on the GLIBC of the multus-ds container's base image.

Generally speaking, a regular K8s cluster (rather than a kind cluster) typically uses statically linked CNI binaries:

$ ldd /opt/cni/bin/portmap
    not a dynamic executable

That's why I adopted the PR's modification early on but didn't encounter the same issue seen in e2e testing. So, in fact, the error in e2e is a less critical one.
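
The same check can be done programmatically by looking for a PT_INTERP segment in the ELF headers, which is what ldd keys off. A small Go sketch, offered only as an illustration of the static-versus-dynamic distinction discussed above:

    package main

    import (
        "debug/elf"
        "fmt"
        "os"
    )

    // isDynamicallyLinked reports whether the ELF binary at path requests a
    // runtime loader (PT_INTERP), i.e. whether it depends on a shared libc
    // such as the GLIBC shipped in the container's base image.
    func isDynamicallyLinked(path string) (bool, error) {
        f, err := elf.Open(path)
        if err != nil {
            return false, err
        }
        defer f.Close()

        for _, p := range f.Progs {
            if p.Type == elf.PT_INTERP {
                return true, nil
            }
        }
        return false, nil
    }

    func main() {
        // The same binary the ldd example above inspects.
        path := "/opt/cni/bin/portmap"
        dynamic, err := isDynamicallyLinked(path)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        fmt.Printf("%s dynamically linked: %v\n", path, dynamic)
    }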

@evilhamsterman

Since it looks like the e2e failure is not an issue with this change, and a lot of people have been manually patching the copy command for almost a year already, I think we can be pretty confident it doesn't have issues. Can we get this merged?

@dims commented Feb 10, 2025

@dougbtv I filed a fix in kind itself (kubernetes-sigs/kind#3859). Given that this failure is limited to the CI jobs with Dockerfile.thick, we should be safe to land this, in addition to what @evilhamsterman and @iiiceoo said above.

@BenTheElder

I'm not that familiar with multus ... are you attempting to invoke the host's CNI from within your container? That's not portable, regardless of if we ship the kind change. Can you install your own CNI binaries?

Also:
https://kind.sigs.k8s.io/docs/design/node-image/

DO NOT DEPEND ON THE INTERNALS OF THE NODE IMAGES.

We only guarantee that these can create Kubernetes, please try not to depend on what is available on the host versus using conformant Kubernetes APIs ...

Generally speaking, a regular K8s cluster, rather than a kind cluster, typically uses statically linked executable CNI binaries:

That seems like a rather fragile assumption. This workload probably needs to either:
a) fully install CNI itself
b) run on the host
c) stick to a very stable glibc track

@kfox1111

I'm not that familiar with multus ... are you attempting to invoke the host's CNI from within your container? That's not portable, regardless of if we ship the kind change. Can you install your own CNI binaries?

The issue is simply about copying the file from the container onto the host so the host can use it. If it's currently in use, that blocks updating. In thick mode, the shim tries to start and blocks until the service is up. But the service won't start until the init container copies the driver onto the host, so it won't auto-recover. Simply always overwriting it is one way to fix it.

@BenTheElder

Again, I'm not sure if this is relevant for multus (apologies), but many "cni" daemonset "installers" will run an init container to copy out the binaries they need to the host, which is a little more portable (because so many do this it's virtually required to not customize the usual predictable CNI host directory).

I merged @dims's PR because it makes sense anyhow, but we are probably going to delete unused binaries in the future as well, and may or may not break this setup down the road (certainly not intentionally or out of malice, but this is not something we're actively trying to guarantee either; bring-your-own-CNI is generally "best of luck, we don't have the bandwidth to test all of these").

😅

@dougbtv (Member) commented Feb 13, 2025

Definitely bring this to the maintainer's call to have a real-time chat about it. Like I said, we'd love to discuss it.

There are two main reasons:

  1. We want passing tests to merge a PR; if a PR breaks a test, we need a fix for it. One way or another, we can't break the rest of CI just to land what is currently an untested fix.
  2. I don't think we test an upgrade situation right now. I think that's why people are hitting this issue, but we're not seeing it in CI.

The fastest path to a merge here is:

  1. Have passing end-to-end tests.
  2. Have replication of the issue in the end-to-end tests (possibly an upgrade test), and a fix for it.

That being said, this needs to be tackled from a few directions... and I appreciate the help I've gotten there (thanks, Dims, for the build).

  1. We do already install our own copy of the reference CNI plugins, so potentially the order in which we install these matters; we should check that.
  2. I don't grok why this change makes us hit that error here while we're not seeing it in other e2e tests. That is baffling to me. I'd love to know why.
  3. I will try to run it with the updated kind; if that works and CI can pass, great.

The maintainers' stance is that the yaml is provided as a quick start, for end-to-end testing, and as a reference. We do advise maintaining your own yaml for production deployments and customizing it to your systems.

@kfox1111 commented Feb 13, 2025

It's not an upgrade issue, it's a reboot race issue. It works the first time: the driver is installed on the node by the init container, and everything starts up fine.

Then, reboot the node:

Kubelet comes up and tries to start pods. CNI is called, which runs the multus binary in client mode and tries to contact the multus server. The server is not available yet, and at this point the CNI driver blocks rather than returning like most CNI drivers do. The process is actively using the multus CNI driver file and has locked out writes to that file.

Next, the multus pod tries to start on the node. It runs the init container, which tries to copy the driver onto the host (it's already there). The copy currently fails because it tries to overwrite the data inside the binary file rather than roll a new file into place; the process already running in memory blocks that from happening.

Because the init container fails, it never starts the server, so the CNI instance on the host can never finish and get out of the copy command's way. The init container will never pass, so the server can never start.

Deadlock/broken cluster.
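
To make the "blocks rather than returns" part concrete: a thick-mode shim that waits for its daemon's unix socket keeps its own executable busy for as long as the wait lasts, which is exactly what blocks the plain cp in the init container. A simplified, hypothetical Go sketch; the socket path and retry policy are assumptions, not multus's actual code:

    package main

    import (
        "fmt"
        "net"
        "os"
        "time"
    )

    // waitForDaemon retries a unix-socket dial until the server-side daemon
    // is up. While this loop runs, the shim binary on the host is a running
    // executable, so overwriting it with `cp` fails with ETXTBSY.
    func waitForDaemon(socketPath string, timeout time.Duration) (net.Conn, error) {
        deadline := time.Now().Add(timeout)
        for {
            conn, err := net.DialTimeout("unix", socketPath, time.Second)
            if err == nil {
                return conn, nil
            }
            if time.Now().After(deadline) {
                return nil, fmt.Errorf("daemon not reachable at %s: %w", socketPath, err)
            }
            time.Sleep(time.Second)
        }
    }

    func main() {
        // Hypothetical socket path, for illustration only.
        conn, err := waitForDaemon("/run/multus/daemon.sock", 5*time.Minute)
        if err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1) // a bounded wait at least lets kubelet retry the CNI call
        }
        defer conn.Close()
        fmt.Println("connected to multus daemon")
    }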

@evilhamsterman

What @kfox1111 said: this is a long-running issue with regular operations, not just upgrades. Take a look at the linked issue that this is supposed to fix. People do maintain their own deployment files and have been applying this exact patch for almost a year now. But most people don't discover this issue until they reboot a node and hit the problem; then they spend hours trying to figure it out while managers breathe down their necks asking when it'll be fixed.

It's a simple change that basically does the same thing the thin version does. It's been identified that the failing tests are due to the tests themselves, not the change. If you want to skip the change they made to the Dockerfile, that's fine; I believe that was done due to a request in the issue.

But otherwise, please just merge this. It's a simple bug with a simple fix that's causing a lot of headaches.

@evilhamsterman

Also, because this is a race condition, by its nature it will be difficult to replicate. But the patch resolves it.

@@ -7,7 +7,7 @@ ADD . /usr/src/multus-cni
 RUN cd /usr/src/multus-cni && \
 ./hack/build-go.sh

-FROM debian:stable-slim
+FROM gcr.io/distroless/base-debian11:latest

I don't grok why this test changes and makes it so we hit that error, but we're not seeing it in other e2e tests? That is baffling to me. I'd love to know why.

In the e2e test, multus-ds mounts the CNI binaries provided by the kind node. These binaries are dynamically linked (compiled with CGO_ENABLED=1). When these CNI binaries (e.g., portmap) are called, they rely on the GLIBC of the multus-ds container, which is the GLIBC of the base image gcr.io/distroless/base-debian11:latest.

Perhaps changing back to debian:stable-slim would allow the e2e test to pass, but that would not meet @s1061123's suggestion.

The reason I've explained so much is simply to convey that the failure of this e2e test is solely an issue with the CI job, not a problem with the changes made in this PR.

Successfully merging this pull request may close these issues.

Race issue after node reboot