Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐛 Make VPC creation idempotent to avoid indefinite creation of new VPCs if storage of the ID fails #4723

Merged
merged 1 commit into from
Feb 2, 2024

Conversation

AndiDog
Copy link
Contributor

@AndiDog AndiDog commented Jan 5, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

In case of a repeating storage error, for example

"Reconciler error" err="failed to reconcile network for AWSManagedControlPlane mynamespace/mycluster: failed to patch conditions: AWSManagedControlPlane.controlplane.cluster.x-k8s.io \"mycluster\" is invalid: spec.network.subnets[7]: Duplicate value: map[string]interface {}{\"id\":\"\"}" [...]

which was caused by something totally different, CAPA would create an indefinite number of VPCs (and potentially other resources depending how far reconciliation succeeds!), filling up the AWS account to its limits and requiring manual cleanup of the whole mess (the one case I had wasn't fun 😆).

This is terrifying but based on a very simple, horrible bug: VPC creation wasn't idempotent. This PR introduces the typical look-up-else-create logic.

Unfortunately, VPC creation tests were faulty because the mockCtrl object was shared across all test cases, so the expected mock calls of test case A could be "used" by a unrelated test case B, potentially making a test case pass that normally should fail. Or in short, we weren't testing what we thought we were. Also, some test cases had a description differing from the test, and made no sense overall, so I removed those. The tests now describe that VPC creation should happen only once. On top, since some tests were related to IPv6, I added a check to see if the AWS VPC IPv6 support matches our spec, or fail on mismatch. This made all tests green again. To ensure that we test the right thing in the future, test cases now check for specific errors.

Checklist:

  • squashed commits
  • includes documentation
  • includes emojis
  • adds unit tests
  • adds or updates e2e tests

Release note:

Make VPC creation idempotent to avoid indefinite creation of new VPCs if storage of the ID fails

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-priority size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jan 5, 2024
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 5, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

1 similar comment
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 8, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

Comment on lines 117 to 119
if !vpc.Tags.HasOwned(s.scope.Name()) {
return errors.Errorf("an unmanaged VPC %s already exists, refusing to create another managed VPC", vpc.ID)
}
Copy link
Member

@vincepri vincepri Jan 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The VPC today is created with tags within the same call, the error message is confusing to users given that if there are no tags, there isn't another managed VPC

Suggested change
if !vpc.Tags.HasOwned(s.scope.Name()) {
return errors.Errorf("an unmanaged VPC %s already exists, refusing to create another managed VPC", vpc.ID)
}
if !vpc.Tags.HasOwned(s.scope.Name()) {
return errors.Errorf("found VPC named %s which cannot be managed by CAPA due to lack of tags, either tag the VPC manually [add tag needed here], or provide the `vpc.id` field instead if you wish to bring your own VPC (link to the doc?)", vpc.ID)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

}

if s.scope.VPC().IsIPv6Enabled() != (vpc.IPv6 != nil) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

split this in two lines for readability?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done in both locations

@AndiDog AndiDog force-pushed the vpc-recon-idempotence branch from 2b579fc to 811ef3e Compare January 10, 2024 15:57
@AndiDog AndiDog requested a review from vincepri January 18, 2024 08:11
Comment on lines 57 to 60
actualIPv6Enabled := (vpc.IPv6 != nil)
if s.scope.VPC().IsIPv6Enabled() != actualIPv6Enabled {
return errors.Errorf("IPv6 support of found unmanaged VPC %s differs from desired spec. Changing IP family is currently not supported.", vpc.ID)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems a net-new check (repeated in the other function as well); should we remove this for now and add it in a different PR? Generally these checks would be best in a webhook at Update time

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The webhook already checks that IPv6 can't be toggled later:

if oldAWSManagedControlplane.Spec.NetworkSpec.VPC.IsIPv6Enabled() != r.Spec.NetworkSpec.VPC.IsIPv6Enabled() {
allErrs = append(allErrs,
field.Invalid(field.NewPath("spec", "networkSpec", "vpc", "enableIPv6"), r.Spec.NetworkSpec.VPC.IsIPv6Enabled(), "changing IP family is not allowed after it has been set"))
}

IIRC, my change was to get a clear error message in unit tests, since they can simulate mismatching situations such as someone manually having turned on IPv6 via AWS Console. In such scenarios, the webhook wouldn't be called, but we still want an error instead of being silent about not reconciling the difference. I don't think this would be a user-facing breakage since normally, the reconciled state (here: IPv6 support on/off in the VPC) matches the spec.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @vincepri here, it's better to do it in separate PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the 2 occurrences of the check, and the new test case for that. I'll take note to re-add this in a separate PR, potentially targeting a later milestone.

@richardcase
Copy link
Member

/milestone v2.4.0

@k8s-ci-robot k8s-ci-robot added this to the v2.4.0 milestone Jan 23, 2024
@AndiDog AndiDog force-pushed the vpc-recon-idempotence branch from 811ef3e to 0e78bb0 Compare January 24, 2024 07:24
@AndiDog
Copy link
Contributor Author

AndiDog commented Jan 24, 2024

E2E tests didn't work some weeks ago, it seems. Rebased onto main to see if that fixes the problem.

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

@Ankitasw
Copy link
Member

Ankitasw commented Feb 1, 2024

/test pull-cluster-api-provider-aws-e2e

@AndiDog AndiDog force-pushed the vpc-recon-idempotence branch from 0e78bb0 to af9ffa5 Compare February 1, 2024 12:55
@vincepri
Copy link
Member

vincepri commented Feb 1, 2024

/test pull-cluster-api-provider-aws-e2e
/test pull-cluster-api-provider-aws-e2e-eks

Copy link
Member

@vincepri vincepri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/assign @Ankitasw

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 1, 2024
@Ankitasw
Copy link
Member

Ankitasw commented Feb 2, 2024

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Ankitasw

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 2, 2024
@k8s-ci-robot k8s-ci-robot merged commit 030638a into kubernetes-sigs:main Feb 2, 2024
19 checks passed
@AndiDog AndiDog deleted the vpc-recon-idempotence branch February 13, 2024 19:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/bug Categorizes issue or PR as related to a bug. lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-priority release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants