update deploy file: use install_multus bin to update cni file #1213
base: master
Conversation
Hi @41tair, thank you for the PR!
Once we use install_multus for the thick plugin, we could change the container base image to distroless. In
multus-cni/images/Dockerfile.thick (line 10 in ba18cf5):
FROM debian:stable-slim
could become something like the thin Dockerfile (line 11 in ba18cf5):
FROM gcr.io/distroless/base-debian11:latest
Could you please take care of that?
Anything left to do on this? Thick mode seems broken/risky without it.
LGTM! Thanks for the update
Reluctant to deploy a new instance of multus until this bug is fixed.
Would love to see this fixed.
/cc @s1061123 May I ask why this PR is blocked? It actually fixes a critical bug.
@dougbtv can we merge this and release a new image? It is a critical bug fix, right?
Hey maintainers, can we please get some action on this?! People are tired of manually patching and fixing their installs. Merge. The. Fix.
@s1061123 sorry to bother you, but please give a statement about this; it's breaking a lot of clusters. What has to be done to get this merged?
Please join the maintainer's call (see the k8snetworkplumbingwg/community GitHub page for details) and raise this if it's causing you problems! We'd also love tests to prove that it works. In the meanwhile, we're re-running the tests in #1364 (and Tomo also figured out how to kick the tests again, so I appreciate it).
Note that the install yaml is meant as a basis for your own environments and comes with no guarantees (other than that it passes the e2e tests, which is why we want to see this change tested that way too). That's what slowed down this process and made it difficult to get maintainer feedback. My apologies to those impacted by the usage of this yaml.
@dougbtv kubevirt/cluster-network-addons-operator#1810 is the same issue too. They seem to just be following this repo's lead and running into the same issue. :/
Anyone willing to look into why there's a failing test? That'll help us move this along, thanks! I appreciate the frustration, but we'd love to have passing tests, and it'll be a "no question about it" merge during the maintainer's call if there are both passing tests and an additional e2e test that validates this functionality, too.
@dougbtv looked into one of the logs and spotted this. Looks like:
@dims Thank you for providing the logs. This indicates that the CNI binaries provided by the kind node are dynamically linked. Generally speaking, a regular K8s cluster, rather than a kind cluster, typically uses statically linked CNI binaries:
$ ldd /opt/cni/bin/portmap
not a dynamic executable
That's why I adopted the PR modification early on but didn't encounter the same issue in e2e testing. So in fact, the error in e2e is a less critical one.
Since it looks like e2e is not an issue, and a lot of people have been manually patching the copy command for almost a year already, I think we can be pretty confident it doesn't have issues. Can we get this merged?
@dougbtv filed a fix in
I'm not that familiar with multus... are you attempting to invoke the host's CNI from within your container? That's not portable, regardless of whether we ship the kind change. Can you install your own CNI binaries? Also:
We only guarantee that these can create Kubernetes clusters; please try not to depend on what is available on the host versus using conformant Kubernetes APIs...
That seems like a rather fragile assumption. This workload probably needs to either:
The issue is simply about copying the file from the container onto the host so the host can use it. If it's currently in use, that blocks updating. In thick mode, the shim tries to start and blocks until the service is up. But the service won't start until the init container copies the driver onto the host, so it won't auto-recover. Simply always overwriting it is one way to fix it.
Again, I'm not sure if this is relevant for multus (apologies), but many "CNI" daemonset "installers" run an init container to copy out the binaries they need to the host, which is a little more portable (because so many do this, it's virtually required not to customize the usual predictable CNI host directory). I merged @dims's PR because it makes sense anyhow, but we will probably delete unused binaries in the future as well, and may or may not break this setup in the future (certainly not intentionally out of malice, but this is not something we're actively trying to guarantee either; bring-your-own-CNI is generally "best of luck, we don't have the bandwidth to test all of these"). 😅
Definitely bring this to the maintainer's call to have a real-time chat about it. Like I said, we'd love to discuss it. There are two main reasons:
The fastest path to a merge here is:
That being said, this needs to be tackled from a few directions.... And I appreciate the help I've gotten there (thanks Dims for the build
The maintainers' stance is that the yaml is provided as a quick start, for end-to-end testing, and as a reference. We do advise maintaining your own yaml for production deployment and customizing it to your systems.
It's not an upgrade issue, it's a reboot race issue. It works the first time: the driver is installed on the node by the init container, and everything starts up fine. Then, reboot the node:
Kubelet comes up and tries to start pods. CNI is called, which runs the multus binary in client mode and tries to contact the multus server. The server is not available yet, and the CNI driver blocks rather than returning like most CNI drivers do at this point. The process is now actively using the multus CNI driver file and has locked out writes to that file.
Next, the multus pod tries to start on the node. It runs the init container, which tries to copy the driver onto the host (it's already there). The copy currently fails because it tries to overwrite the data inside the binary file rather than rolling a new file into place; the process already running in memory blocks that from happening.
Because the init container fails, it never starts the server, so the CNI instance on the host can never finish and get out of the copy command's way. The init container will never pass, so the server can never start. Deadlock/broken cluster.
What @kfox1111 said: this is a long-running issue with regular operations, not just upgrades. Take a look at the linked issue that this is supposed to fix. People do maintain their own deployment files and have been applying this exact patch for almost a year now. But most people don't discover this issue until they reboot a node and hit the problem; then they spend hours trying to figure it out while managers are breathing down their necks asking when it'll be fixed. It's a simple change that basically does the same thing the thin version does. It's been identified that the failing tests are due to the tests themselves, not the change. If you want to skip the change they made to the Dockerfile, that's fine; I believe that was done due to a request in the issue. But otherwise, please just merge this. It's a simple bug with a simple fix that's causing a lot of headaches.
Also, because it's a race condition, it will by its nature be difficult to replicate. But the patch resolves it.
@@ -7,7 +7,7 @@ ADD . /usr/src/multus-cni
 RUN cd /usr/src/multus-cni && \
     ./hack/build-go.sh

-FROM debian:stable-slim
+FROM gcr.io/distroless/base-debian11:latest
I don't grok why this test change makes it so we hit that error, while we're not seeing it in other e2e tests. That is baffling to me. I'd love to know why.
In the e2e test, multus-ds mounts the CNI binaries provided by the kind node. These binaries are dynamically linked (compiled with CGO_ENABLED=1). When these CNI binaries (e.g., portmap) are called, they rely on the GLIBC of the multus-ds container, which is the GLIBC of the base image gcr.io/distroless/base-debian11:latest.
Perhaps changing back to debian:stable-slim would allow the e2e test to pass, but it would not meet @s1061123's suggestions.
The reason I've explained so much is simply to convey that the failure of this e2e test is solely an issue with the CI job, not a problem with the changes made in this PR.
install cni init container can't execute sometimes
A sample error log looks like:
cp: cannot create regular file '/host/opt/cni/bin/multus-shim': Text file busy
Fixes #1221