-
Notifications
You must be signed in to change notification settings - Fork 588
HDDS-8605. Implement the ability to update the ServiceInfo object with the new rootCA #5009
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest use "interval" instead of "frequency" here and all other places, to be consistent with existing other properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's a better word, I'm going to fix it
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fapifta for the patch.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OzoneManager.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this is only used by ServiceInfoProvider, which already handles PEM encoding for the CA list, I think this method should be omitted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this makes testing a bit more complex though I think you are right, a simpler API for the CertificateClient interface is more important, I haven't realized this one before, thank you for catching it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are a few more tests that need to be updated:
https://github.com/fapifta/hadoop-ozone/actions/runs/5459036651/jobs/9934647451#step:5:3792
https://github.com/fapifta/hadoop-ozone/actions/runs/5459036651/jobs/9934647503#step:5:3764
https://github.com/fapifta/hadoop-ozone/actions/runs/5459036651/jobs/9934647503#step:5:3810
Sorry for the trouble. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I have adjusted the code to overcome these.
...op-ozone/ozone-manager/src/test/java/org/apache/hadoop/ozone/om/TestServiceInfoProvider.java
Outdated
Show resolved
Hide resolved
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/ServiceInfoProvider.java
Outdated
Show resolved
Hide resolved
hadoop-hdds/common/src/test/java/org/apache/hadoop/hdds/security/x509/package-info.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method could possibly used in other places, (ReloadingTrustManager is what I know by heart, but it's going to change with Sammi's change), could you check once we get there, if there are other places we could use this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually after the new changes, I realized that this method is not needed anymore, and I should use the new methods that we have in the certificate client to get the list of rootCA certificates. I have reworked the relevant code, and related tests.
Galsza
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @fapifta for working on this. I have left 2 comments, one of them is minor, but I'm curious about your opinion on the suggestion of avoiding the race condition between the client's cert renewal and the ServiceInfoProvider's. I hope I didn't misunderstand anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there is a race condition here with the certificateClient saving its own new rootCaCertificate. On root ca rotation the client will schedule a task to send a CSR and only after receiving the signed certificate will it save and update the new RootCA/SubCA. However when the ServiceInfoProvider asks for it it still might not have reached that point and therefore would receive the old certificate data. (And this notification won't repeat because we already think that the handlers successfully rotated the root ca)
I have a suggestion which could resolve this issue: instead of registering this to the rootCaRotationListeners we could register it to just the CertificateNotation receivers. They will get notified each cert rotation including the client renewing their known root ca.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is that, it is a race condition, but the problem is not the race...
The problem is more of a question instead, around what we want to do with 1.1 or older clients, as all 1.2 or newer client versions work based off of the caCertPemList instead of the caCertPem value.
So newer clients rightfully get the list of the old and the new rootCA certificate, and will work with the cluster fine during the rotation, whichever root of trust the certificates of cluster services have old or new.
But with older clients we have a similar tradeoff as we had once we have introduce SCM HA. With SCM HA old clients got the sub-CA certificate that OM has read from its filesystem as the signer of its certificate, and if the post 1.2 DN that the pre 1.2 DN trying to connect to has its certificate signed by the same cert, then the client was able to read from the DN. Now in our case, we again need a decision, either we give back the old root CA certificate, which is returned by the certClient at this point, or we return the new certificate.
Based on your question I am happy that I thought this through more, and I think we should return the new rootCA certificate here. With that, we will have a glitch in old clients, as we don't know how many DNs have already renewed their certificates with the new rootCA and its subCAs, but for sure the number will be all DNs eventually, while less and less DNs will use their old certificates. So I need to change this code accordingly. Thank you for pointing this out!
5f7fa25 to
aed245b
Compare
|
I think all the review comments were addressed, @Galsza @ChenSammi @adoroszlai please take a look once you have the time. I have rebased the patch onto the tip of the master branch, as it has all the related commits. During the process I squashed my commits that were added to address the review comments, but in the meantime, I think I have improved the readability of the code. Let me know if you see any further things to address. |
|
@fapifta Thank you Pifta, there are a few elegant solutions here that I could learn of. LGTM +1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please avoid call certClient.getAllRootCaCerts() twice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does not have any significant impact, the getter itself provides an unmodifiable set, and that is basically just a wrapper around the internal data structure inside the CertificateClient.
I am fine with both approach so I have updated the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"certClient != null" will have the same effect as "new SecurityConfig(config).isSecurityEnabled()" and avoid create a new SecurityConfig object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realized that the configuration itself was needed just for this one. So instead I pass on a SecurityConfig object from OzoneManager directly. I know the null check is essentially the same effect, but this way the code is better expressing what it does, and why.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fapifta , I think we should not hide the exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hiding this is not really a problem. The following logic is applied on the client side: if there are a list of CA certs, then the list is being used, if there is no element in the list or the list is null, then the sole caCertPEM is being used. This logic is there to ensure compatibility between the new (1.2+) client and an older server side, so we need to keep that semantic.
If any/all certificates in the list fails to translate to PEM format, then we have a second chance.
I have added an error level log message to sign that there is a problem, but chances are that this will never happen, as we get the in memory X509Certificate already from a PEM formatted string, so we should fail earlier if there is a problem with this conversion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fapifta , thanks for the explanation.
It cannot purely sure a checked type exception will not happen. Let's not make that assumption.
Besides, an originally empty caCertPEMList, and an empty caCertPEMList because of certificate convert failure are two different cases. We need to distinguish them. Not sure if the backward comparability handling on client side can cover them both. Even if it can, it's not a recommended behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at it once more I relaized, that this local optimization does not really mean a thing, as other methods might throw an exception and still fail early. So instead of returning null, the exception is re-thrown as runtimeexception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the inline statement directly, instead of calling a one statement function, to save one time function call overhead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JIT would inline the method call anyways, and probably it would do more. Though this note was useful because it made me check what I was doing, and realize that I have a way better solution to solve the readability problem I wanted to solve with the expiry method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
info -> error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need a way to synchronize the fetch and update of caCertPem and caCertPermList.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The easiest seems to synchronize on this when the values are updated in onRootCAChange, and then assign these to variables in a synchronized block so that we can provide a copy to the ServiceInfoEx constructor, and we do not need to care about further synchronization if I think it correclty. Adjusted the code.
|
Hi @fapifta , thanks for working on this. Could you add some file put/get operation in robot test, after the root certificate is rotated and first root certificate is expired? You can leverage test-root-ca-rotation.sh |
Galsza
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left one more comment for the createSelfSignedCert helper method.
(It came from copying it to my own fork temporarily and trying to use it. I've encountered the problem of trying to store multiple generated certs, but the id was the same)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This actually generates the same id for the certificate over and over again and hides it away from the invoker. I think for a utility method like this it would be better to generate a new id or explicitly specify that the same keypair will generate the same id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is a utility method, I could design it only for the use case I have at hand. I am adjusting the documentation, but I think we should not overengineer it here, once someone needs some other functionality, we can add it at that point in time either via an overload or via a new parameter and adjusted code where it is used.
|
Thank you for the additional reviews @ChenSammi @Galsza. I tried to address all your comments properly, if you can take an other look, I would be glad. This PR in the current stage relies on #5050 and #5052 to be added, until that the ozonesecure and ozonesecure-ha tests are expected to fail. |
|
|
||
| #transfer leader to scm4.org | ||
| execute_robot_test scm4.org -v "TARGET_SCM:scm4.org" scmha/scm-leader-transfer.robot | ||
| execute_robot_test scm4.org -v PREFIX:"rootca" certrotation/root-ca-rotation-client-checks.robot |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rootca -> rootca2
I checked the failed robot test. After leader is transferred to scm4.org, leader OM will fail to allocate block for key upload since the three OM doesn't know scm4.org. We can move this root-ca-rotation-client-checks.robot test ahead of leader transfer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updates, thank you for checking into this.
| */ | ||
| public static X509Certificate createSelfSignedCert( | ||
| KeyPair keys, String commonName | ||
| ) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would like to discuss the function declare code style here. It's better to keep the consistent style with other majority codes, putting as many as possible in one line,
public static X509Certificate createSelfSignedCert(
KeyPair keys, String commonName) throws Exception {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also updated the constructor of ServiceInfoProvider to conform this style.
| */ | ||
| public static X509Certificate createSelfSignedCert( | ||
| KeyPair keys, String commonName, Duration expiresIn | ||
| ) throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above comment.
|
@fapifta , thanks for update the patch. I have left four new comments, including the one under toPEMEncodedStringUnsafe. |
|
Thank you for the review, I have pushed changes to address your comments @ChenSammi, thank you for the continous reviews! |
|
Looking at it further after the change in ServiceInfoProvider.toPEMEncodedStringUnsafe, the method felt unnecessary, so I did some more code changes around that, and removed the method. |
|
@fapifta thank you for the changes. They are looking good overall. There are a couple of CI failures, but they were successful on your branch as far as I can tell. |
|
@fapifta , I checked the failed UT, looks like it's because "hdds.x509.ca.rotation.enabled" default is false now, so RootCaRotationPoller is not instantiated, so does the failed acceptance test. |
…s the 4th root ca before checking client operations.
…view discussion, and then removed the method as well.
|
Oh well... thank you @ChenSammi to take a look at, I am verifying the fix for this quickly on the failed tests, and push the fix. |
…ion is not enabled.
|
Thanks @fapifta , LGTM +1. |
What changes were proposed in this pull request?
During the rotation of the rootCA certificate, we need to update the cached PEM format representation of the CA certificates in Ozone Manager, so that via the ServiceInfoEx object we are able to provide those to the clients.
In this PR, I have separated a class called ServiceInfoProvider, into which I have pushed out all the relevant code.
It attaches a listener to the OM's certificate client that manages the polling for the rootCA certificate, and then notifies this provider to update its cached data.
OM every time creates a ServiceInfoEx object once it is requested via the OzoneManagerProtocol, so does the provider.
The provider is needed to separate away the logic and to be able to test the handling of the renew separately and effectively.
In order to utilize some pre-written utilities in the test for the Provider, I have moved the relevant methods from TestInterSCMProtocolService to a test utility class under hdds-common. With that I had to move the HDDSKeyGenerator utility class in the production code as well, so that it is available for the test utilities. Both of these are general utilities, so they are rightfully placed in hdds-common I believe.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8605
How was this patch tested?
Added UT for the new functionality.