Skip to content

Conversation

@fapifta
Copy link
Contributor

@fapifta fapifta commented Jul 12, 2023

What changes were proposed in this pull request?

The PR proposes a change in how the RootCaRotationPoller is being instantiated, as the shared rootCaCertificates set from the DefaultCertificateClient is causing a problem.
The problem was found as part of the testing done for HDDS-8605, it is causing a problem in the case of the following events in order:

  • rootCA certificates are being rotated
  • OM certificate is being renewed with the new CA certs
  • RootCaRotationPoller is running after the certificate renewal

The cause it the shared rootCaCertificates set, which is updated by the certificate renewal process within the DefaultCertificateClient, so that the poller once it gets to poll the new certificates, it does not recognize that there is an unkown certificate (as it already knows it via the change in the data structure during renewal), therefore it does not update its listeners about the rootCA change in this scenario.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8998

How was this patch tested?

In #5009 I am working on to add filesystem client calls towards the DataNodes that would rely on the ServiceInfoEx update in that patch. The ServiceInfoEx update happens within OM, once the poller sends a notification to the ServiceInfoProvider (again in that patch). This test bought up the problem, and with the proposed fix, that test also runs fine. I have not shared that test so far, I will push it up shortly to the PR so reviewers can see, once this gets merged I will rebase that PR again.

@Galsza
Copy link
Contributor

Galsza commented Jul 12, 2023

@fapifta thank you for spotting this error, I have missed this during my implementation.
LGTM+1

@Galsza
Copy link
Contributor

Galsza commented Jul 12, 2023

@fapifta nit: what do you think about fixing this on the poller side to make a copy of the passed in set instead of having to copy the set when instantiating it? (I know it wouldn't matter in the long run because the poller would just be instantiated from the DefaultCertificateClient, but afaik it would be a better practice)

@fapifta fapifta requested review from ChenSammi and adoroszlai July 12, 2023 10:13
@fapifta
Copy link
Contributor Author

fapifta commented Jul 12, 2023

@Galsza I think the responsibility is on the DefaultCertificateClient side whether it serves its internal state or not. The poller should not be bothered by internals of the DefaultCertificateClient, that is why I do the copy on instantiation.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @fapifta for the fix.

@fapifta
Copy link
Contributor Author

fapifta commented Jul 12, 2023

Thank you for the review @adoroszlai, I am tracking down something that has the same symptom in the ha-secure test set once I add the file operations. Let's wait until the cause is clear for that one, I am not sure but it might be related to something similar, and we might fix those together, but I need some time to figure it out.

@adoroszlai
Copy link
Contributor

I am tracking down something that has the same symptom in the ha-secure test set

https://issues.apache.org/jira/browse/HDDS-8983
https://issues.apache.org/jira/browse/HDDS-8992

One of these or yet another failure?

@ChenSammi
Copy link
Contributor

@fapifta, the fix looks good, +1.

OM new CSR will be rejected if root ca rotation is in progress. The sequence looks like this,

  1. Root CA has been rotated.
  2. OM certificate enters grace period, then it starts and finish the certificate rotation.
  3. RootCaRotationPoller runs, since the set is shared, then RootCaRotationPoller cannot find the difference.

@ChenSammi
Copy link
Contributor

I am tracking down something that has the same symptom in the ha-secure test set

https://issues.apache.org/jira/browse/HDDS-8983 https://issues.apache.org/jira/browse/HDDS-8992

One of these or yet another failure?

@adoroszlai, thanks for reporting the issue. HDDS-8983 and HDDS-8992 are the same cause, different behavior.

I submitted #5052, please help to have a review @fapifta @Galsza @adoroszlai , thanks.

@fapifta
Copy link
Contributor Author

fapifta commented Jul 13, 2023

The second problem I ran into was because of a missing config as it turned out at the end.

Thank you for the reviews, I am merging this one as it is for now.

@fapifta fapifta merged commit 8c8c6be into apache:master Jul 13, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants