BPF MASQ for veth mode and ip-masq-agent #11148
Conversation
test-me-please
Force-pushed from 7067619 to 0e9781c
test-me-please
Just minor nits in addition to previous feedback, LGTM.
daemon/cmd/daemon_main.go (outdated diff)
if option.Config.EnableIPMasqAgent {
	if err := ipmasq.Start(option.Config.IPMasqAgentConfigPath,
		option.Config.IPMasqAgentSyncPeriod); err != nil {
		log.WithError(err).Fatal("ip-masq-agent failed to start")
I note that the inability to restore will cause a daemon crash here. On the flip side, if there's a problem with the config, we will instead just put the controller into an error state and still successfully run the daemon. Not sure if that's specifically thought through or just the most convenient way to represent those errors?
I can't think of a specific failure case that would cause this to be a problem, just seems slightly mismatched in the representation of the errors. Probably restore won't fail anyway, it just requires the ability to create the map (incl. the resource limits stuff) + ability to dump a (potentially empty) map.
That's intended.
If the restore fails, then it's a good indicator that something is wrong with the system (e.g. the resource limit), and it's OK for cilium-agent not to start. Otherwise, we might risk introducing subtle changes in behavior which could be difficult for a user to debug.
Meanwhile, crashing a cluster network due to a syntax error in an ip-masq-agent config file or an inability to update the related BPF map is IMHO too dangerous.
Overall I think this can come down to two different user approaches: There are some users that would prefer things crash/break as soon as any failure occurs so that they can deal with them. Other users will prefer that things just work, and log a message describing why things are not 100% so that they can follow up later.
In general I think in Cilium if there is a pure user input mismatch (eg daemon config options) we are crashing out early to signal to the user that Cilium is misconfigured and will not operate correctly. I think this is fine as it is most likely to occur when the user first configures Cilium or first attempts an upgrade; and it's purely functional on a set of provided parameters. This state from the configmap is probably similar "user input" which we should validate clearly, but I'm still mulling it over. The weird part I point out above is that if a user customizes the configmap while Cilium is running and misconfigures it, then Cilium will continue running happily (admittedly with a warning log). Next time Cilium is restarted, it will go into CrashLoopBackOff because of this input validation. If they don't test properly then they may inadvertently set Cilium up to crash out one day, then only find out days/weeks later when restarting Cilium (or worst case when Cilium crashes). At that point a misconfiguration can lead to a wider Cilium outage.
When we get into potential transient errors particularly for areas like this which are not pure connectivity or pure security, it gets a little less clear when Cilium should crash out. Crashing the Cilium daemon can cause DNS connectivity issues for instance if the user runs FQDN policy. So if we're getting away from "this is the first time the user runs Cilium in this setup" then I think we should err on the side of reporting an error through normal mechanisms (eg log a warning; that will also show up in metrics via the 'number of warnings/errors' metric) and continuing with functionality disabled.
> If they don't test properly then they may inadvertently set Cilium up to crash out one day, then only find out days/weeks later when restarting Cilium
That's a very valid concern.
> I think we should err on the side of reporting an error through normal mechanisms (eg log a warning)
Agreed. I've changed restore() to log an error instead of crashing.
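For illustration, here is a minimal, self-contained sketch of that behaviour; the types and the always-failing restore() are stand-ins for illustration only, not the code in pkg/ipmasq (which uses logrus and dumps the actual BPF map):

```go
package main

import (
	"errors"
	"log"
)

// agent is a stand-in for the ip-masq-agent state kept in pkg/ipmasq.
type agent struct {
	nonMasqCIDRsInMap map[string]struct{}
}

// restore stands in for reading back the BPF map contents on startup.
// It always fails here, purely to exercise the error path.
func (a *agent) restore() error {
	return errors.New("dump of ip-masq-agent BPF map failed")
}

func main() {
	a := &agent{nonMasqCIDRsInMap: map[string]struct{}{}}

	// As agreed above: a restore failure is logged, not fatal, so a
	// transient problem does not take the whole daemon down. The agent
	// continues with an empty in-memory view and resyncs on the next
	// periodic update.
	if err := a.restore(); err != nil {
		log.Printf("ip-masq-agent failed to restore BPF map state: %v", err)
	}
}
```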
Force-pushed from 0e9781c to 262fd09
test-me-please
1 similar comment
test-me-please
Hmm, hitting |
test-me-please
Looks good!
# destinations outside of the cluster.
masquerade: true

# bpfMasquerade enables masquerading with BPF instead of iptables
bpfMasquerade: false
Why aren't we enabling this by default for new installations?
Once we have enough confidence about this feature, I'm happy to enable this by default;-)
@brb so when will we have confidence about it? Does this mean the feature is in beta? If it's not enabled for new installations will anyone use it?
Let's discuss it during the next meeting.
{{- if .Values.global.egressMasqueradeInterfaces }}
egress-masquerade-interfaces: {{ .Values.global.egressMasqueradeInterfaces }}
{{- end }}
{{- if and .Values.global.ipMasqAgent .Values.global.ipMasqAgent.enabled }}
This requires a warning somewhere in the docs or in helm generation: the ip-masq-agent config map needs to be created in the same namespace where Cilium is running.
Good point, I will mention it in the docs.
// config represents the ip-masq-agent configuration file encoded as YAML
type config struct {
	NonMasqCIDRs []Ipnet `yaml:"nonMasqueradeCIDRs"`
What about JSON?
Will add it in the follow-up!
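One possible shape for that follow-up, sketched under the assumption that JSON support is added by trying JSON first and falling back to YAML; this is not the PR's code, and plain strings replace the Ipnet type to keep the example self-contained:

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"

	"gopkg.in/yaml.v2"
)

// config mirrors the struct above, with plain strings instead of Ipnet.
type config struct {
	NonMasqCIDRs []string `json:"nonMasqueradeCIDRs" yaml:"nonMasqueradeCIDRs"`
}

// parseConfig accepts either JSON or YAML: try JSON first, fall back to YAML.
func parseConfig(raw []byte) (*config, error) {
	cfg := &config{}
	if err := json.Unmarshal(raw, cfg); err == nil {
		return cfg, nil
	}
	if err := yaml.Unmarshal(raw, cfg); err != nil {
		return nil, fmt.Errorf("config is neither valid JSON nor YAML: %w", err)
	}
	return cfg, nil
}

func main() {
	yamlCfg := []byte("nonMasqueradeCIDRs:\n- 10.0.0.0/8\n")
	jsonCfg := []byte(`{"nonMasqueradeCIDRs": ["172.16.0.0/12"]}`)

	for _, raw := range [][]byte{yamlCfg, jsonCfg} {
		cfg, err := parseConfig(raw)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(cfg.NonMasqCIDRs)
	}
}
```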
manager.UpdateController("ip-masq-agent",
	controller.ControllerParams{
		DoFunc: func(ctx context.Context) error {
			return a.Update()
		},
		RunInterval: syncPeriod,
	},
)
[2] Use something similar to cilium/pkg/clustermesh/config.go, lines 33 to 57 at 47531ad:
type configDirectoryWatcher struct {
	watcher   *fsnotify.Watcher
	lifecycle clusterLifecycle
	path      string
	stop      chan struct{}
}

func createConfigDirectoryWatcher(path string, lifecycle clusterLifecycle) (*configDirectoryWatcher, error) {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return nil, err
	}

	if err := watcher.Add(path); err != nil {
		watcher.Close()
		return nil, err
	}

	return &configDirectoryWatcher{
		watcher:   watcher,
		path:      path,
		lifecycle: lifecycle,
		stop:      make(chan struct{}),
	}, nil
}
Both you and @joestringer asked me about this (https://github.com/cilium/cilium/pull/10878/files#r409979845). I could implement it in the follow-up, but do you really think it's worth increasing the complexity of ip-masq-agent? I expect the config files to be relatively simple.
Yes, it's one less flag that we have, and we won't have to wait at least 60 seconds until changes are enforced. They will be applied as soon as kubelet has the files locally.
Is it OK if I do it in the follow-up?
Added to the follow-up items.
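A rough sketch of what the fsnotify-based follow-up could look like; the function name, the config path, and the callback are assumptions for illustration (the real code would call the agent's Update() and would live in pkg/ipmasq):

```go
package main

import (
	"log"

	"github.com/fsnotify/fsnotify"
)

// watchConfig triggers onChange whenever the ip-masq-agent config file
// changes, instead of polling it on a fixed sync period.
func watchConfig(path string, onChange func(), stop <-chan struct{}) error {
	watcher, err := fsnotify.NewWatcher()
	if err != nil {
		return err
	}
	if err := watcher.Add(path); err != nil {
		watcher.Close()
		return err
	}

	go func() {
		defer watcher.Close()
		for {
			select {
			case ev := <-watcher.Events:
				// kubelet refreshes mounted ConfigMaps via renames, so
				// react to writes, creates and renames alike.
				if ev.Op&(fsnotify.Write|fsnotify.Create|fsnotify.Rename) != 0 {
					onChange()
				}
			case err := <-watcher.Errors:
				log.Printf("ip-masq-agent config watcher error: %v", err)
			case <-stop:
				return
			}
		}
	}()
	return nil
}

func main() {
	stop := make(chan struct{})

	if err := watchConfig("/etc/config/ip-masq-agent", func() {
		log.Println("config changed, resyncing the ip-masq BPF map")
	}, stop); err != nil {
		log.Fatal(err)
	}
	select {} // block forever in this sketch
}
```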
flags.Duration(option.IPMasqAgentSyncPeriod, 60*time.Second, "ip-masq-agent configuration file synchronization period")
option.BindEnv(option.IPMasqAgentSyncPeriod) |
This won't be required if [2] is done instead.
Force-pushed from 16e5182 to 3d8d178
Assuming all follow-up items are done.
pkg/ipmasq/ipmasq.go (outdated diff)
for cidrStr, cidr := range a.nonMasqCIDRsInMap {
	if _, ok := a.nonMasqCIDRsFromConfig[cidrStr]; !ok {
		log.WithField(logfields.CIDR, cidrStr).Info("Removing CIDR")
		a.ipMasqMap.Delete(cidr)
		delete(a.nonMasqCIDRsInMap, cidrStr)
	}
}

for cidrStr, cidr := range a.nonMasqCIDRsFromConfig {
Oh, one more point here. Did you consider the ordering between add/delete here?
Recent, similar BPF map update changes on the policy side have shown that the ordering can impact datapath availability; see #10936.
EDIT: To be explicit, I mean to say: If someone updates the configmap, will we cause connectivity issues / misrouted traffic temporarily during the update because of the ordering between deletes and adds?
I changed the order so that adds happen before deletes just to make new CIDRs available a bit quicker.
The ordering won't have any effect on established connections. However, if there is a connection which previously was masqueraded, and after the CIDR update it's no longer subject to masquerading, then the egressing packets from the pod after the update won't be masqueraded, and thus they will be dropped. If that's an issue, we can always add a lookup in the BPF (S)NAT map when deciding whether masquerading is needed. But that would come with a performance penalty in the fastpath.
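To make the ordering concrete, here is a self-contained toy version of the sync loop; Go maps stand in for the BPF LPM map, and this is not the PR's actual code:

```go
package main

import "fmt"

// syncCIDRs applies the add-before-delete ordering described above: CIDRs
// from the new config are installed first, and only then are stale entries
// removed, so new exclusions take effect as soon as possible.
func syncCIDRs(fromConfig, inMap map[string]struct{}) {
	// Phase 1: add CIDRs present in the config but missing from the map.
	for cidr := range fromConfig {
		if _, ok := inMap[cidr]; !ok {
			fmt.Println("adding CIDR", cidr)
			inMap[cidr] = struct{}{}
		}
	}

	// Phase 2: remove CIDRs that are in the map but no longer in the config.
	for cidr := range inMap {
		if _, ok := fromConfig[cidr]; !ok {
			fmt.Println("removing CIDR", cidr)
			delete(inMap, cidr)
		}
	}
}

func main() {
	inMap := map[string]struct{}{"192.168.0.0/16": {}}
	fromConfig := map[string]struct{}{"10.0.0.0/8": {}}
	syncCIDRs(fromConfig, inMap)
	// Output:
	// adding CIDR 10.0.0.0/8
	// removing CIDR 192.168.0.0/16
}
```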
Just small things on the BPF datapath side; once resolved, LGTM. IPv6 support is missing and should probably be added to the follow-ups. The SNAT engine supports IPv6 and we do it in NodePort SNAT as well, so we should also do it for the masq agent.
Currently, it is broken, and it requires a re-work. The removal will help us to refactor the BPF SNAT engine. Also, this allows us to remove the SNAT dead code, and to inline the helper macros. Signed-off-by: Martynas Pumputis <[email protected]>
Currently, the flag is a no-op, except that it disables provisioning of iptables MASQUERADE rules when set. The flag is going to be used to control BPF MASQUERADE'ing. By default, it is set to false, so that users who upgrade to Cilium v1.8 will keep the existing behaviour (= MASQ is done by iptables). Signed-off-by: Martynas Pumputis <[email protected]>
This commit enables MASQUERADE'ing packets sent from local endpoints to outside (WORLD_ID). The MASQ is enabled only when --enable-bpf-masquerade is set. Signed-off-by: Martynas Pumputis <[email protected]>
This commit fixes DSR and BPF MASQ when both are enabled: - For egress path (local_endpoint -> outside) do not skip (S)NAT if L4 protocol uses DSR. - For ingress do not avoid trying to do the reverse SNAT xlation if L4 protocol uses DSR. Currently, the rev-xlaton happens inside the CILIUM_CALL_IPV4_NODEPORT_NAT tailcall. Signed-off-by: Martynas Pumputis <[email protected]>
The param controls whether BPF MASQ should be enabled. Currently, it defaults to "false", and cilium-agent defaults to "false" as well, so that existing users who upgrade to v1.8 will preserve the existing behavior (iptables). Signed-off-by: Martynas Pumputis <[email protected]>
Otherwise, the map creation in the probe (sometimes) fails with EPERM. See [1] for details re rlimit. [1]: https://docs.cilium.io/en/v1.7/bpf/#miscellaneous Signed-off-by: Martynas Pumputis <[email protected]>
This commit introduces the following ip-masq-agent related flags: * --enable-ip-masq-agent * --ip-masq-agent-config-path * --ip-masq-agent-sync-period Currently, the flags are no-ops. Signed-off-by: Martynas Pumputis <[email protected]>
The LPM map is going to be used by Cilium's ip-masq-agent to store CIDRs which should bypass masquerading. Signed-off-by: Martynas Pumputis <[email protected]>
This commit introduces the "ip-masq-agent" controller which is going to be used to sync the ipmasq BPF maps with config files updated by users. Also, enable ip-masq-agent when --enable-ip-masq-agent is set to true. Signed-off-by: Martynas Pumputis <[email protected]>
It makes an integration via ConfigMap much easier, as a user provided ConfigMap will be mounted as a file to cilium-agent pod, and its content is encoded as YAML. Signed-off-by: Martynas Pumputis <[email protected]>
This commit extends the SNAT engine in a way that a packet sent from a local endpoint to outside is NOT masqueraded if a dst IP addr of the packet matches the one in the ipmasq map. Signed-off-by: Martynas Pumputis <[email protected]>
The cmd lists all CIDRs in the ip-masq-agent BPF LPM map. Signed-off-by: Martynas Pumputis <[email protected]>
The BPF ip-masq-agent can be enabled with global.ipMasqAgent.enabled. When enabled, the ip-masq-agent configmap (created by a user) is mounted into /etc/config/ip-masq-agent of the cilium-agent pod. Signed-off-by: Martynas Pumputis <[email protected]>
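To make the SNAT-bypass commit above concrete, here is a userspace Go toy of the decision it describes; in the datapath the check is an LPM lookup in the BPF map written in C, not a Go loop:

```go
package main

import (
	"fmt"
	"net"
)

// snatRequired reports whether a packet from a local endpoint to dst should
// be masqueraded: it should, unless dst falls into one of the
// non-masquerade CIDRs configured via ip-masq-agent.
func snatRequired(dst net.IP, nonMasqCIDRs []*net.IPNet) bool {
	for _, cidr := range nonMasqCIDRs {
		if cidr.Contains(dst) {
			return false
		}
	}
	return true
}

func main() {
	_, private, err := net.ParseCIDR("10.0.0.0/8")
	if err != nil {
		panic(err)
	}
	nonMasq := []*net.IPNet{private}

	fmt.Println(snatRequired(net.ParseIP("10.1.2.3"), nonMasq)) // false: skip MASQ
	fmt.Println(snatRequired(net.ParseIP("1.1.1.1"), nonMasq))  // true: masquerade
}
```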
Force-pushed from 8290499 to 892cefb
Hmm, good question. I assumed that new deployments will have it enabled. If it's disabled, perhaps it's safe to check for
The removal of the check should not cause DSR traffic to be SNAT'd. After my PR is merged,
And it won't be called for the following packet which would be subject to SNAT:
Also, if we missed a case and DSR would have been SNAT'd, I believe that the CI tests would catch it, as we do the source IP addr validation.
test-me-please
Yeah, I think |
CI hit the #10763 flake.
test-me-please
CI failed:
test-me-please
1 similar comment
test-me-please
Code has approvals for all codeowners, and the GKE check is not required. Merging.
This PR is the same as #10878, but minus the multi-dev NodePort and the tests. I've created it for easier reviewing. Once it's merged, I will open a PR for the rest.
The follow-up TODO: #11018.
Fix #10229
Fix #9922
@joestringer I believe you have already reviewed all the changes in this PR.