-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
connectivity: Add pod-to-pod-no-frag to check MTU misconfigurations #2610
Conversation
To include iproute2 [1]. [1]: cilium/alpine-curl#123 Signed-off-by: Martynas Pumputis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
couple of nits
To pass "do not fragment" and to specify ping payload size in a subsequent commit. Signed-off-by: Martynas Pumputis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My ci-structure review:
I'm going to repost a previous review comment from another PR because I am seeing the same pattern being repeated here: #2568 (comment)
Is this really the best place to have this kind of test? I don't think I understand the point of moving away from Ginkgo-style tests if we are going to port the same logic and testing pattern into the CLI.
@christarazi I get your concerns regarding flakiness. Unfortunately, with Ginkgo deprecation we don't have any other place to add such e2e tests. There was an effort to create a E2E framework during the hyperjump, but it turned to be a massive investment. Maybe this topic can be picked again by some working group. At the same time, IMO such tests are helpful for end users to diagnose non-obvious Cilium connectivity issues (which misconfigured MTU can lead to). And what I hear from users that they extensively use (Ginkgo/gomega as a testing framework is / was a terrible foundation for Cilium's E2E tests. The argumentation https://www.reddit.com/r/golang/comments/1azj63h/comment/ks1srp2/ resonates well with us). |
The new test case is intended to check whether MTU is properly set in pod netns. The check sends an ICMP Echo with DF set and payload. The payload size is derived from pod MTU. No reply means that MTU is misconfigured somewhere in the packet path. Signed-off-by: Martynas Pumputis <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get your concerns regarding flakiness. Unfortunately, with Ginkgo deprecation we don't have any other place to add such e2e tests. There was an effort to create a E2E framework during the hyperjump, but it turned to be a massive investment. Maybe this topic can be picked again by some working group.
I am mostly worried about lines 198-230 in connectivity/tests/pod.go
because it seems to me that checking for the MTU on the node itself for the pod doesn't necessarily need to be done at this level. Couldn't it be validated at endpoint creation time that the appropriate MTU value was passed to the container?
The lines beyond the above lines mentioned is fine because IMO that is testing potential fragmentation issues with MTU. Ideally, that should be done in a BPF program unit test IIUC, but this is better than issuing ad-hoc commands to validate a parameter value (i.e. MTU).
It would be nice to have an issue to track the work to migrate to a better testing style, otherwise this code will just sit here until it causes us so many flakes that we then have to look at it again. (I realize I didn't mention this in the previous PR I posted a review comment like this to; will do that there as well).
For now I will approve to unblock because it seems the tradeoff of not having this sort of coverage is worse than not having it at all while we build better tests. Ideally we aren't adding more tests with this pattern going forward.
Unfortunately, noup. In-between pod-to-pod there are multiple network devices involved (some managed by Cilium), which MTU misconfiguration might not be reflected in an endpoint's MTU.
The BPF unit tests do not involve netdevs. Perhaps, we should start thinking about introducing integration tests, in which cilium-agent loads BPF progs via pkg/datapath and then we inject a packet (e.g., via libpcap). That could be a replacement for the test cases from this PR. Maybe a topic for the next hive time (if yes, which WG?)? |
CI run - cilium/cilium#33286. Only ci-ipsec and ci-eks failures are relevant. I will fix them in a PR for cilium/cilium, which bumps CLI vsn.