Skip to content

AGENT-903: monitor-add-nodes should only show CSRs matching node#8376

Merged
openshift-merge-bot[bot] merged 11 commits intoopenshift:masterfrom
rwsu:AGENT-903
May 17, 2024
Merged

AGENT-903: monitor-add-nodes should only show CSRs matching node#8376
openshift-merge-bot[bot] merged 11 commits intoopenshift:masterfrom
rwsu:AGENT-903

Conversation

@rwsu
Copy link
Contributor

@rwsu rwsu commented May 9, 2024

The first and second CSRs pending approval have the node name (hostname) embedded in their specs. monitor-add-nodes should only show CSRs pending approval for a specific node. Currently it shows all CSRs pending approval for all nodes.

If the IP address of the node cannot be resolved to a hostname, we will not be able to determine if there are any CSRs pending approval for that node. The monitoring command will skip showing CSRs pending approval. In this case, users can still approve the CSRs, and the monitoring command will continue to check if the node has joined the cluster and has become Ready.

rwsu and others added 9 commits May 2, 2024 16:02
The function now requires kubeconfig file path, rendezvousIP, and
sshKey as parameters. Previously it had a single parameter, assetStore,
and it searched the asset store to determine the three parameters
above.
Adds the ability to monitor a node being added during day2.

The command is:

node-joiner monitor-add-nodes --kubeconfig <kubeconfig-file-path>
<IP-address-of-node-to-monitor>

Both the kubeconfig file and IP address are required.

Multi node monitoring will be added in a future PR.
NewCluster needs both assetDir for install workflow and
kubeconfigPath for addnodes workflow.

Cluster.assetDir should only be initialized for the install
workflow.
Co-authored-by: Andrea Fasano <afasano@redhat.com>
@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented May 9, 2024

@rwsu: This pull request references AGENT-903 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

The first and second CSRs pending approval have the node name (hostname) embedded in their specs. monitor-add-nodes should only show CSRs pending approval for a specific node. Currently it shows all CSRs pending approval for all nodes.

We can assume the hostname is provided in node-config.yaml. This allows us to tie a hostname to an IP address as both are required in node-config.yaml.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 9, 2024
@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 9, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2024
@rwsu rwsu marked this pull request as ready for review May 9, 2024 02:23
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 9, 2024
@openshift-ci openshift-ci bot requested review from andfasano and barbacbd May 9, 2024 02:24
@zaneb
Copy link
Member

zaneb commented May 9, 2024

We can assume the hostname is provided in node-config.yaml.

I don't think we can assume that.

The hostname is optional, and node-config.yaml should not be an input to monitor-add-nodes.

@andfasano
Copy link
Contributor

We can assume the hostname is provided in node-config.yaml.

I don't think we can assume that.

The hostname is optional, and node-config.yaml should not be an input to monitor-add-nodes.

The main issue here is the pending csr matching heuristic. Not sure if there's a better way, but the hostname is required to identify the host the csr is referring to, so we need to find a way to get it somehow. An alternative approach be trying to use net.LookupAddr on the specified IP(s) from the command line?

@zaneb
Copy link
Member

zaneb commented May 10, 2024

How would this normally be done? e.g. if you scaled out a machineset? I know in IPI that there's a nodelink controller that links a Node to a Machine by matching the IP address of the Node to one of the IPs of the Machine, and successful linkage results in the CSR being approved. So my thought was that we can look up the Node with the IP matching the one the user gave us and then sign the matching CSR. Does that only apply to the second CSR? How does the other one get approved?
Can we ask assisted-service what the hostname is before it goes away?

@zaneb
Copy link
Member

zaneb commented May 10, 2024

There's pretty good docs here: https://github.com/openshift/cluster-machine-approver?tab=readme-ov-file#node-client-csr-approval-workflow
The client appears to be the hard one, because we presumably don't have a Machine? The cluster-machine-approver checks:

A Machine must exist with a NodeInternalDNS address in its Status that matches the future name of the Node, as found in the CSR.

The NodeInternalDNS is the hostname obtained from inspection on baremetal.

The server one is what I described above. It's slightly hard than I remembered, though: all of the alternative names in the certificate must match IPs or hostnames in the Machine. We'll only get one IP from the user. Usually the ones in the Machine are populated by inspection.

Does assisted-service post anything back to the cluster (e.g. create a Machine object) that could help us? If not maybe we need it to. We could post a ConfigMap to a special namespace, and embed client creds with the ability to do that in the ISO. There's a reason we delayed actually approving CSRs beyond the initial preview release 🙂

For now if we're just filtering CSRs to decide which ones to tell the user about (and letting them approve manually), I think looking up the Node name in DNS as you suggested and checking the IP matches (for the client cert), and checking that one of the alternative names matches the IP (for the server cert) is sufficient.

@openshift-ci-robot
Copy link
Contributor

openshift-ci-robot commented May 14, 2024

@rwsu: This pull request references AGENT-903 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.16.0" version, but no target version was set.

Details

In response to this:

The first and second CSRs pending approval have the node name (hostname) embedded in their specs. monitor-add-nodes should only show CSRs pending approval for a specific node. Currently it shows all CSRs pending approval for all nodes.

If the IP address of the node cannot be resolved to a hostname, we will not be able to determine if there are any CSRs pending approval for that node. The monitoring command will skip showing CSRs pending approval. In this case, users can still approve the CSRs, and the monitoring command will continue to check if the node has joined the cluster and has become Ready.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@rwsu rwsu force-pushed the AGENT-903 branch 2 times, most recently from 789eb50 to 3f022a9 Compare May 15, 2024 03:03
Copy link
Contributor

@andfasano andfasano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of minor points, but the overall approach sounds definitely more accurate. Thanks @rwsu

/approve

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the http.Get(url) is successful (no err), then shouldn't isKubeletRunningOnNode() return true?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: could be worth resolving the hostname once (at the begininng, in the newAddNodeMonitor() and store the result in as a field in the struct, rather than trying to resolve it every time?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging this condition as an error may sound a little bit confusing for the user. Since it's a condition that does not prevent the monitor workflow to proceed, I think it could be logged with a message indicating that the hostname cannot be resolved, and thus the CSR checks are skipped.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than returning an empty list, I'd suggest to refactor this portion of code so that if the hostname cannot be resolved then clusterHasFirstCSRPending and clusterHasSecondCSRPending would return true, to clearly indicate that these checks are skipped

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 15, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andfasano

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 15, 2024
rwsu added 2 commits May 15, 2024 11:35
The first and second CSRs pending approval have the node name
(hostname) embedded in their specs. monitor-add-nodes should only
show CSRs pending approval for a specific node. Currently it shows
all CSRs pending approval for all nodes.

If the IP address of the node cannot be resolved to a hostname,
we will not be able to determine if there are any CSRs pending
approval for that node. The monitoring command will skip showing
CSRs pending approval. In this case, users can still approve the
CSRs, and the monitoring command will continue to check if the node
has joined the cluster and has become Ready.
Previously, the code was resolving each time the CSRs were checked.

Changed error logging to improve clarity.
@pawanpinjarkar
Copy link
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 16, 2024
@rwsu
Copy link
Contributor Author

rwsu commented May 16, 2024

/label acknowledge-critical-fixes-only

@openshift-ci openshift-ci bot added the acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. label May 16, 2024
@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 294318e and 2 for PR HEAD 5ad9955 in total

@rwsu
Copy link
Contributor Author

rwsu commented May 16, 2024

/override ci/prow/e2e-agent-compact-ipv4

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 16, 2024

@rwsu: Overrode contexts on behalf of rwsu: ci/prow/e2e-agent-compact-ipv4

Details

In response to this:

/override ci/prow/e2e-agent-compact-ipv4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci-robot
Copy link
Contributor

/retest-required

Remaining retests: 0 against base HEAD 48241a0 and 1 for PR HEAD 5ad9955 in total

@rwsu
Copy link
Contributor Author

rwsu commented May 16, 2024

/override ci/prow/e2e-agent-compact-ipv4

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 16, 2024

@rwsu: Overrode contexts on behalf of rwsu: ci/prow/e2e-agent-compact-ipv4

Details

In response to this:

/override ci/prow/e2e-agent-compact-ipv4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 17, 2024

@rwsu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/okd-e2e-aws-ovn-upgrade 5ad9955 link false /test okd-e2e-aws-ovn-upgrade
ci/prow/okd-e2e-agent-compact-ipv4 5ad9955 link false /test okd-e2e-agent-compact-ipv4
ci/prow/e2e-agent-ha-dualstack 5ad9955 link false /test e2e-agent-ha-dualstack
ci/prow/e2e-agent-compact-ipv4-appliance 5ad9955 link false /test e2e-agent-compact-ipv4-appliance

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@rwsu
Copy link
Contributor Author

rwsu commented May 17, 2024

/retest-required

@rwsu
Copy link
Contributor Author

rwsu commented May 17, 2024

/override ci/prow/e2e-agent-compact-ipv4

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 17, 2024

@rwsu: Overrode contexts on behalf of rwsu: ci/prow/e2e-agent-compact-ipv4

Details

In response to this:

/override ci/prow/e2e-agent-compact-ipv4

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rwsu
Copy link
Contributor Author

rwsu commented May 17, 2024

/override ci/prow/e2e-aws-ovn
The failures were in the e2e tests unrelated to this PR. The job has passed previously.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented May 17, 2024

@rwsu: Overrode contexts on behalf of rwsu: ci/prow/e2e-aws-ovn

Details

In response to this:

/override ci/prow/e2e-aws-ovn
The failures were in the e2e tests unrelated to this PR. The job has passed previously.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-merge-bot openshift-merge-bot bot merged commit 52e2540 into openshift:master May 17, 2024
@openshift-bot
Copy link
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build ose-installer-altinfra-container-v4.16.0-202405170342.p0.g52e2540.assembly.stream.el9 for distgit ose-installer-altinfra.
All builds following this will include this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants