-
Notifications
You must be signed in to change notification settings - Fork 1.5k
OCPBUGS-10767: Fix and improve locking session and AWS Metadata access #7070
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OCPBUGS-10767: Fix and improve locking session and AWS Metadata access #7070
Conversation
3359216 to
38dcd6f
Compare
|
@sadasu: An error was encountered querying GitHub for users with public email ([email protected]) for bug OCPBUGS-10767 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details. Full error message.
non-200 OK status code: 500 Internal Server Error body: ""
Please contact an administrator to resolve this issue, then request a bug refresh with DetailsIn response to this: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
/jira refresh |
|
@sadasu: This pull request references Jira Issue OCPBUGS-10767, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd rather have this function be dumber and just populate the subnets. Then we can drop the check if the subnets are populated and the mutex locking and move that to the calling functions. For example:
func (m *Metadata) PublicSubnets(ctx context.Context) (map[string]Subnet, error) {
m.mutex.Lock()
defer m.mutex.Unlock()
if len(m.publicSubnets) == 0 {
err := m.populateSubnets(ctx)
if err != nil {
return nil, errors.Wrap(err, "retrieving Public Subnets")
}
}
return m.publicSubnets, nil
}With that change, all exported functions (InstanceTypes, VPC, *Subnets, AvailabilityZones, Session) will behave and look consistently. I think that's more desirable than a bit of code "duplication".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had considered that. I am making some changes. Let us see where we land.
38dcd6f to
5b449a2
Compare
6cca123 to
77506b5
Compare
|
@sadasu since you updated the error messages, the unit tests also need to be updated. |
4ed48c6 to
5791245
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mtulio adding m.edgeSubnets to this tests causes the unit tests to fail. Looks like you specifically added a test to prevent that. I don't see why we can't check for len(m.edgeSubnets) > 0 here. Some special way EdgeSubnets are handled?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sadasu , sorry by missed that message. I am on it validating it to recall the scenario that drove me to create that unit blocking. Ideally, calling subnets() at once should populate all *Subnets, including edge.
I am looking at the failed unit job on this PR history, overall we required that edge subnets exist only when public or private have been provided, we should never accept the edge subnets only (as CP and regular workers must exist alongside LZ nodes) - this is a required defined on EP due AWS limitations of network resources (NLB/NGW, etc)
Would you mind sharing the current unit failure to help to dig in? I can see changes after the job I checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also sharing on slack that I have installed successfully the cluster with Local Zones/edge + PHZ on this version
commit 1fced156cc5540a540a20ac208d79812ef72eae2 (HEAD -> pr_7070)
Author: Sandhya Dasu <[email protected]>
Date: Wed Apr 5 12:42:59 2023 -0400
install log:
$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="$RELEASE" $INSTALLER version
./openshift-install_pr7070 unreleased-master-7977-g1fced156cc5540a540a20ac208d79812ef72eae2-dirty
built from commit 1fced156cc5540a540a20ac208d79812ef72eae2
release image registry.ci.openshift.org/origin/release:4.13
release architecture amd64
(...)
INFO Time elapsed: 26m46s
(...)
$ oc --kubeconfig $INSTALL_DIR/auth/kubeconfig get machines -n openshift-machine-api
NAME PHASE TYPE REGION ZONE AGE
ocp-lz14-h9rmj-edge-us-east-1-nyc-1a-x9pkz Running c5d.2xlarge us-east-1 us-east-1-nyc-1a 50m
|
/retest |
|
/approve Collapsing the mutexes into a single mutex in the common It would be good to include to a commit message explaining the change according to the contributing guidelines: https://github.com/openshift/installer/blob/master/CONTRIBUTING.md#commit-message-format |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: patrickdillon The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
4dd154b to
99d62e9
Compare
|
/retest-required Remaining retests: 0 against base HEAD ebabc74 and 2 for PR HEAD 4dd154bc0e965d7167ab3a5c3e18b1ca98efda75 in total |
1fced15 to
2d39342
Compare
This fix eliminates the need for mutexSubnets to update subnet information within AWS metadata. It also updates populateSubnets to take care of getting VPC and subnets once for the installation eliminating duplicate code that could be prone to errors. Co-authored-by: Rafael F. <[email protected]>
2d39342 to
6cb715e
Compare
|
/test verify-vendor |
|
/test e2e-aws-ovn |
|
/retest-required |
|
/retest |
|
/hold |
mtulio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
validated with PHZ + LZ subnet
/lgtm
|
/hold cancel |
|
/hold |
|
/hold cancel |
|
@sadasu: Jira Issue OCPBUGS-10767: Some pull requests linked via external trackers have merged: The following pull requests linked via external trackers have not merged:
These pull request must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with Jira Issue OCPBUGS-10767 has not been moved to the MODIFIED state. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
|
@sadasu: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
|
/cherry-pick release-4.13 |
|
@sadasu: new pull request created: #7129 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This fix eliminates the need for mutexSubnets to update subnet
information within AWS metadata. It also updates populateSubnets
to take care of getting VPC and subnets once for the installation
eliminating duplicate code that could be prone to errors.