Fix: all traffic ingress rule triggers fatal nil dereference #697
For anything besides tcp, udp, icmp, and icmpv6 there is no applicable notion of "port range." AWS omits FromPort and ToPort in its responses, causing a fatal nil dereference when attempting to read any security groups with e.g. an "all traffic" rule.
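A minimal sketch of the failure mode and a nil-safe read, for anyone following along. `IpPermission` and `IngressRule` here are hypothetical stand-ins reduced to the fields under discussion, not the SDK or provider types, and `toIngressRule` is an illustrative name rather than the function changed in this PR. The only assumption about AWS is that an "all traffic" rule comes back with protocol `"-1"` and no port fields.

```go
package main

import "fmt"

// IpPermission is a hypothetical stand-in for the SDK's permission type.
// Pointer fields are nil when the API response omits them.
type IpPermission struct {
	IpProtocol *string
	FromPort   *int64
	ToPort     *int64
}

// IngressRule is a hypothetical stand-in for the provider's rule type.
type IngressRule struct {
	Protocol string
	FromPort int64
	ToPort   int64
}

// toIngressRule copies the port range only when both pointers are set,
// rather than dereferencing them unconditionally.
func toIngressRule(p IpPermission) IngressRule {
	rule := IngressRule{}
	if p.IpProtocol != nil {
		rule.Protocol = *p.IpProtocol
	}
	if p.FromPort != nil && p.ToPort != nil {
		rule.FromPort = *p.FromPort
		rule.ToPort = *p.ToPort
	}
	return rule
}

func main() {
	// An "all traffic" rule: protocol "-1", no FromPort or ToPort.
	allTraffic := "-1"
	fmt.Printf("%+v\n", toIngressRule(IpPermission{IpProtocol: &allTraffic}))
}
```

Reading `*p.FromPort` directly on that last permission is what panics; the guard leaves the ports at their zero value instead.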
|
Hi @sethp-nr. Thanks for your PR. I'm waiting for a kubernetes-sigs or kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
vincepri
left a comment
some comments, otherwise LGTM
IpProtocol: aws.String(string(i.Protocol)),
FromPort: aws.Int64(i.FromPort),
ToPort: aws.Int64(i.ToPort),
var res *ec2.IpPermission
This could be declared as a named return value.
ToPort: *v.ToPort,
var res *v1alpha1.IngressRule
switch *v.IpProtocol {
case "tcp", "udp", "icmp", "58" /* ICMPv6 */ :
Should we move all of these to be package constants with descriptions?
You don't like the "58" hanging out there, apropos of nothing? 😆
I'll move 'em on in
That for sure! 😂 Main reasoning was that these variables are probably going to be reused in other places too so it might be worthwhile to declare them somewhere else.
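For illustration, the package-constant idea could look roughly like this. The constant names and the `hasPortRange` helper are hypothetical, chosen for this sketch; only the four string values come from the `case` line in the diff.

```go
package main

import "fmt"

// Protocol values as they appear in a security group rule's IpProtocol field.
// Names are hypothetical; values are taken from the switch statement under review.
const (
	// ProtocolTCP is the Transmission Control Protocol.
	ProtocolTCP = "tcp"
	// ProtocolUDP is the User Datagram Protocol.
	ProtocolUDP = "udp"
	// ProtocolICMP is the Internet Control Message Protocol for IPv4.
	ProtocolICMP = "icmp"
	// ProtocolICMPv6 is ICMP for IPv6, identified by IANA protocol number 58.
	ProtocolICMPv6 = "58"
)

// hasPortRange reports whether FromPort and ToPort are expected to be
// present for the given protocol.
func hasPortRange(protocol string) bool {
	switch protocol {
	case ProtocolTCP, ProtocolUDP, ProtocolICMP, ProtocolICMPv6:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(hasPortRange(ProtocolTCP), hasPortRange("-1"))
}
```

With the values named, the bare "58" no longer needs an inline comment at each use site.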
This commit cleans up and clarifies a few of the less obvious components of the previous work.
|
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: sethp-nr, vincepri. The full list of commands accepted by this bot can be found here. The pull request process is described here. Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
I've been ruminating on what to do about the trouble with
What do you think? |
|
I'd also like to know your preference on trying to fix that issue in this PR vs. fixing the crasher and opening a new issue/PR for the LoadBalancer thing. |
For both ICMP and ICMPv6, from/to port will actually correspond to the ICMP types allowed.
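A small sketch of how those fields read for ICMP, assuming the EC2 convention as I understand it: `FromPort` carries the ICMP type, `ToPort` carries the ICMP code, and `-1` means "all". The `icmpRule` type and `describe` helper are hypothetical and exist only for this illustration.

```go
package main

import "fmt"

// icmpRule is a hypothetical illustration type, not part of any SDK.
type icmpRule struct {
	Protocol string
	FromPort int64 // ICMP type, or -1 for all types (assumed convention)
	ToPort   int64 // ICMP code, or -1 for all codes (assumed convention)
}

// describe renders the rule in ICMP terms rather than port terms.
func describe(r icmpRule) string {
	icmpType, icmpCode := "any", "any"
	if r.FromPort != -1 {
		icmpType = fmt.Sprint(r.FromPort)
	}
	if r.ToPort != -1 {
		icmpCode = fmt.Sprint(r.ToPort)
	}
	return fmt.Sprintf("%s type=%s code=%s", r.Protocol, icmpType, icmpCode)
}

func main() {
	// ICMP type 8 is an echo request ("ping").
	fmt.Println(describe(icmpRule{Protocol: "icmp", FromPort: 8, ToPort: -1}))
}
```

So for these two protocols the fields are still populated, which is why they stay in the same branch as tcp and udp.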
Hmm, given AWS still hasn't updated the limit of 5 security groups per network interface (raised to 10 by request), I guess we need to coalesce our rules into the smallest number of SGs possible, as to maximise the number of possible LB attachments. So, 1 security group per node, regardless of type, and the merger of control plane & worker node rules for the control plane instance. This only helps us in the case that the cloud provider doesn't try to inject rules into an SG managed by Cluster API. This relates to #608 (comment) where it was found that the provider doesn't set Feels like if we do something in this repo, it's a hack, but getting changes made to the provider may take time. |
Address linter failures.
Yeah, that 5 limit is a lot lower than I'd like. I guess my question is whether "smallest number" of SGs in this case is one or two.
Cluster-api tells the cloud provider it owns the resource: https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/pkg/cloud/aws/tags/tags.go#L94 It seems like either CAPA should allow the provider-injected rules to exist in the shared group, or it should create a separate group for the cloud provider to inject its rules into. I'm presently leaning toward the latter for its simplicity, but that does mean the carrying cost of a CAPA cluster is 2/5 security group slots. |
Usage needs to match declaration. Computers are sticklers about that sort of thing.
Don't think that was ever our intent, though maybe @detiber can clarify. We're tagging that cluster-api-provider-aws is managing the resource ( Either we make the AWS cloud provider aware of tooling that uses |
Add clarifying comment to serializer function.
|
/retest |
|
lgtm from me |
|
/lgtm |
…tes-sigs#697) * fix: respect all traffic security group rules (and others) For anything besides tcp, udp, icmp, and icmpv6 there is no applicable notion of "port range." AWS omits FromPort and ToPort in its responses, causing a fatal nil dereference when attempting to read any security groups with e.g. an "all traffic" rule. * fix: omit description when empty string * fix: handle more security groups without crashing This commit cleans up and clarifies a few of the less obvious components of the previous work. * fix: handle more security groups without crashing Address linter failures. * fix: handle more security groups without crashing Usage needs to match declaration. Computers are sticklers about that sort of thing. * fix: handle more security groups without crashing Add clarifying comment to serializer function.
* Update the releasing docs (#689) * Add error reason to output if fail to checkout an account from boskos (#698) * Temporary workaround a data issue in boskos service (#699) * Update checkout_account.py to not reuse connections (#700) * Fix checkout_account.py (#702) * Make hack/checkin_account.py executable (#703) * Fix: all traffic ingress rule triggers fatal nil dereference (#697) * fix: respect all traffic security group rules (and others) For anything besides tcp, udp, icmp, and icmpv6 there is no applicable notion of "port range." AWS omits FromPort and ToPort in its responses, causing a fatal nil dereference when attempting to read any security groups with e.g. an "all traffic" rule. * fix: omit description when empty string * fix: handle more security groups without crashing This commit cleans up and clarifies a few of the less obvious components of the previous work. * fix: handle more security groups without crashing Address linter failures. * fix: handle more security groups without crashing Usage needs to match declaration. Computers are sticklers about that sort of thing. * fix: handle more security groups without crashing Add clarifying comment to serializer function. * Fixes a bug and adds tests for kubeadm defaults (#707) The pointers were not working as expected so the API is changing to be more functional and leverage kubernetes' DeepCopy function. * Update listed v1.14 AMIs to v1.14.1 (#708) * Update listed v1.14 AMIs to v1.14.1 * Update README with list of published AMIs/Kubernetes versions * GZIP user-data (#710) Signed-off-by: Vince Prignano <vincepri@vmware.com> * Make sure Calico can talk IP-in-IP (#701) * MAke sure Calico can talk IP-in-IP * Add IP in IP protocol to the control plane security group * Add IPv4 protocol definition and make sure it's handled properly. * Make port ranges AWS complient and security groups more restrictive. 
* Fix security groups * Adds tests to kubeadm defaults (#709) Attempt at documenting the assumptions made in the kubeadm defaults code. Signed-off-by: Chuck Ha <chuckh@vmware.com> * Logging (#713) * Adds logr as dependency Signed-off-by: Chuck Ha <chuckh@vmware.com> * Use logr in the cluster actuator This only creates the logger. Does not yet swap out actual klog calls. Signed-off-by: Chuck Ha <chuckh@vmware.com> * update bazel Signed-off-by: Chuck Ha <chuckh@vmware.com> * update Signed-off-by: Chuck Ha <chuckh@vmware.com> * Switch dep to use release-0.1 branch instead of version (#715) * Adds logr as dependency (#714) Adds context for logs and removes excessive logging Signed-off-by: Chuck Ha <chuckh@vmware.com> * Ensure `make manifests` generates machines file for HA control plane too. (#720) * Add HA machines template * Introduce HA machines file in `make manifests` target * Add clusterawsadm as make dependency to manifests make target. (#721) Ensures manifests are generated from the current state of the source. Assuming $GOPATH/bin is in the $PATH * Update to Go 1.12 (#719) Signed-off-by: Vince Prignano <vincepri@vmware.com> * Add ability to override Organization ID for image lookups (#723) * Add ability to override Organization ID for image lookups * Update pkg/cloud/aws/services/ec2/ami.go Co-Authored-By: detiber <detiberusj@vmware.com> * Add updated generated crd * feat: support customizing root device size (#718) * feat: support customizing root device size * chore: re-generate CRDs * fix: update formatting * chore: add comment describing Service.sdkToInstance * chore: make service.SDKToInstance public * Rename BUILD -> BUILD.bazel for consistency (#724) find . 
-type file -name BUILD -not -path "./vendor/*" | xargs -n1 -I{} -- git mv {} {}.bazel Preferred build name changed in 3788fb1 Fixes #722 * Adds retry-on-conflict during updates (#725) * Adds retry-on-conflict during updates Signed-off-by: Chuck Ha <chuckh@vmware.com> * adds note about status update caveat Signed-off-by: Chuck Ha <chuckh@vmware.com> * clarify errors/comments Signed-off-by: Chuck Ha <chuckh@vmware.com> * Add the HA machines configuration to bazel (#733) Signed-off-by: Chuck Ha <chuckh@vmware.com> * Ensure bazel is the correct version (#731) Signed-off-by: Chuck Ha <chuckh@vmware.com> * Update OWNERS_ALIASES and SECURITY_CONTACTS (#712) * Fix the prow jobs (#735) Signed-off-by: Chuck Ha <chuckh@vmware.com> * Fix markdown formatting (#736) * extract fmt from release tool (#738) Signed-off-by: Chuck Ha <chuckh@vmware.com> * Use DEFAULT_REGION as the default and REGION as the supplied (#739) Signed-off-by: Chuck Ha <chuckh@vmware.com> * e2e testing improvement (#743) * Bump kind version * Remove docker load in favor of kind load for e2e cluster Signed-off-by: Chuck Ha <chuckh@vmware.com> * fix: Don't try to update root size when it's unset (#726) * fix: Don't try to update root size when it's unset This commit looks for empty RootDeviceSize in the spec and ignores it. Otherwise, none of our control plane machines were updating with this error: ``` E0418 23:07:48.250925 1 controller.go:214] Error updating machine "ns/controlplane-2": found attempt to change immutable state for machine "controlplane-2": ["Root volume size cannot be mutated from 8 to 0"] ``` * fix: updates without specifying a root volume size Add unit test. * fix: updates without specifying a root volume size Fix gofmt. 
* Scope nodeRef to workload cluster (#744) Signed-off-by: Vince Prignano <vincepri@vmware.com> * Fix NPE on delete bastion host (#746) Signed-off-by: Vince Prignano <vincepri@vmware.com> * Documentation for creating a new cluster on a different AWS account (#728) * Initial draft of documentation for Cluster creation using cross account role assumption * Update roleassumption.md Complete the document. * cleanup the documentation for roleassumption * Resolved the comments: role assumption documentation. * Fix minor issues - roleassumption.md * resolve more comments to roleassumption.md * Resolve more comments - roleassumption.md * include machines-ha.yaml.template in release artifacts (#741) * Update AWS sdk, improve log in machine actuator delete (#747) Signed-off-by: Vince Prignano <vincepri@vmware.com> * Fixes the infinite reconcile loop (#748) * Uses patch for updating the cluster and machine specs - patch does not cause a re-reconcile in the capi controller * Uses update for updating the cluster and machine status - update for status is ok since it does not update any of the metadata no re-reconcile is necessary for the capi controller Signed-off-by: Chuck Ha <chuckh@vmware.com> * Update Gopkg.lock and cleanup Makefile (#751) * Update cluster-api release-0.1 vendor (#750) Signed-off-by: Vince Prignano <vincepri@vmware.com> * Reduce the number of re-reconciles (#752) Signed-off-by: Chuck Ha <chuckh@vmware.com>
What this PR does / why we need it:
This change does two things:
Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged):
Special notes for your reviewer:
These patches fix the crasher, but the story isn't yet fully told. We found ourselves in this situation after spinning up a Service of `Type: LoadBalancer`, which caused the cloud provider to helpfully add a rule to the nodes' security group for "all traffic" coming from the ELB.
This change will enable the management tooling to handle that new rule, decide it's not spec'd, and clean it up properly. At which point, I expect, the cloud provider will put it back. We're still investigating what to do about that tug of war, but I wanted to open the PR for collaboration.
Release note: