Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automated cherry pick of #5955: Persist TLS certificate and key of antrea-controller (#5955) #6004: Fix race condition in pkg/apiserver/certificate unit tests #6080

Conversation

tnqn
Copy link
Member

@tnqn tnqn commented Mar 7, 2024

Cherry pick of #5955 #6004 on release-1.13.

#5955: Persist TLS certificate and key of antrea-controller (#5955)
#6004: Fix race condition in pkg/apiserver/certificate unit tests

For details on the cherry pick process, see the cherry pick requests page.

tnqn and others added 2 commits March 7, 2024 22:06
In clusters where upgrade is performed with rolling update of Nodes and
images of new versions are only available on new Nodes, the deployment
strategy of antrea-controller is set to RollingUpdate to prevent
antrea-controller Pod from being deleted immediately when the deployment
is updated, leading to a period in which no antrea-controller is
running. However, it also causes two instances of antrea-controller to
run simultaneously in a short time, making it possible that the old
instance overrides the CA bundle stored in antrea-ca ConfigMap,
APIServices, and Webhooks, while the new instance won't update them
again.

The commit makes two changes to fix the problem:

1. CACertController will periodically sync the CA cert to improve the
   robustness.

2. Self-signed TLS certificate and key of antrea-controller will be
   stored in a Secret and will be reused after restarting controller.
   This makes running multiple antrea-controller instances
   simultaneously possible and makes restart of antrea-controller
   smoother as antrea-agents don't need to retrieve a new CA bundle most
   of the time.

Besides, the change is helpful for implementing high-availability of
antrea-controller in the future.

Signed-off-by: Quan Tian <[email protected]>
…#6004)

There was some "interference" between TestSelfSignedCertProviderRotate
and TestSelfSignedCertProviderRun. The root cause is that the
certutil.GenerateSelfSignedCertKey does not support a custom clock
implementation and always calls time.Now() to determine the current
time. It then adds a year to the current time to set the expiration time
of the certificate. This means that when rotateSelfSignedCertificate()
is called as part of TestSelfSignedCertProviderRotate, the new
certificate is already expired, and rotateSelfSignedCertificate() will
be called immediately a second time. By this time however,
TestSelfSignedCertProviderRotate has already exited, and we are already
running the next test, TestSelfSignedCertProviderRun. This creates a
race condition because the next test will overwrite
generateSelfSignedCertKey with a mock version, right as it is called by
the second call to rotateSelfSignedCertificate() from the previous
test's provider.

To avoid this race condition, we make generateSelfSignedCertKey a member
of selfSignedCertProvider.

Fixes antrea-io#5977

Signed-off-by: Antonin Bas <[email protected]>
Co-authored-by: Quan Tian <[email protected]>
@tnqn tnqn added the kind/cherry-pick Categorizes issue or PR as related to the cherry-pick of a bug fix from the main branch to a release label Mar 7, 2024
@tnqn
Copy link
Member Author

tnqn commented Mar 7, 2024

@antoninbas I backport this as it's requested by release-1.13 consumers. Given that the risky scenario only happens when users have changed deployment strategy (which shouldn't be common) and we will release a patch release for all minor releases >=1.13, I think it should be fine. Please let me know if you have any concern.

@antoninbas
Copy link
Contributor

@tnqn I know you mentioned before that "upgrading" from a patched controller to an unpatched one could create an issue (when using the RollingUpdate strategy), but I don't recall what the issue would be exactly. While I see that there could be a race condition between the old and new controllers, it doesn't seem any worse than the situation before your patch (for example, upgrading from 1.12.0 to 1.13.0). Or do we increase the likelihood of the old controller overriding the CA bundle because of the periodic sync?

I am surprised that we are getting a request to backport. As you pointed out originally, the odds of the old controller overriding the new CA are quite low:

Even for the latest report, we only know there was an old controller overwriting the CA bundle with its own one generated two days ago, but this shouldn't happen as overwriting the ConfigMap is one-time job and an old instance will only do it again after 1/2 of the valid duration (half a year) or it never succeeded to update all resources' CA bundle in the last two days.

That being said, I am fine with backporting this patch, for the reason you mentioned. Let's just make sure we publish patch releases in the correct order: 1.15.x -> 1.14.x -> 1.13.x

@tnqn
Copy link
Member Author

tnqn commented Mar 8, 2024

Or do we increase the likelihood of the old controller overriding the CA bundle because of the periodic sync?

Yes, I think it increases the likelihood a bit. When it's upgraded from unpatched to unpatched, the issue should only happen when there is already something wrong causing the old controller to keep retrying overwriting the CA cert. When it's upgraded from patched to unpatched, even if everything works fine, there is a chance that old controller could overwrite it due to the periodical update.

Let's just make sure we publish patch releases in the correct order: 1.15.x -> 1.14.x -> 1.13.x

Sure, I was thinking the same.

@tnqn
Copy link
Member Author

tnqn commented Mar 8, 2024

/test-all

@tnqn tnqn merged commit d129266 into antrea-io:release-1.13 Mar 8, 2024
48 of 51 checks passed
@tnqn tnqn deleted the automated-cherry-pick-of-#5955-#6004-upstream-release-1.13 branch March 8, 2024 09:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/cherry-pick Categorizes issue or PR as related to the cherry-pick of a bug fix from the main branch to a release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants