Conversation

@sadasu
Contributor

@sadasu sadasu commented Apr 5, 2023

This fix eliminates the need for mutexSubnets to update subnet
information within AWS metadata. It also updates populateSubnets
to fetch the VPC and subnets once per installation, eliminating
duplicate code that could be prone to errors.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2023
@sadasu sadasu changed the title WIP: Cleanup session setup and mutext use in aws metedata WIP: Cleanup session setup and mutex use in aws metedata Apr 5, 2023
@openshift-ci openshift-ci bot requested review from mtulio and patrickdillon April 5, 2023 13:46
@sadasu sadasu changed the title WIP: Cleanup session setup and mutex use in aws metedata WIP: Cleanup session setup and mutex use in aws metadata Apr 5, 2023
@sadasu sadasu force-pushed the cleanup-aws-metedata branch 4 times, most recently from 3359216 to 38dcd6f Compare April 5, 2023 16:44
@sadasu sadasu changed the title WIP: Cleanup session setup and mutex use in aws metadata OCPBUGS-10767: Fix and improve locking session and AWS Metadata access Apr 5, 2023
@openshift-ci-robot
Contributor

@sadasu: An error was encountered querying GitHub for users with public email ([email protected]) for bug OCPBUGS-10767 on the Jira server at https://issues.redhat.com/. No known errors were detected, please see the full error message for details.

Full error message. non-200 OK status code: 500 Internal Server Error body: ""

Please contact an administrator to resolve this issue, then request a bug refresh with /jira refresh.

Details

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@sadasu
Contributor Author

sadasu commented Apr 5, 2023

/jira refresh

@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 5, 2023
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Apr 5, 2023
@openshift-ci-robot
Contributor

@sadasu: This pull request references Jira Issue OCPBUGS-10767, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.14.0) matches configured target version for branch (4.14.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira ([email protected]), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

/jira refresh

Comment on lines 115 to 121
Contributor

I'd rather have this function be dumber and just populate the subnets. Then we can drop both the already-populated check and the mutex locking from it, and move them to the calling functions. For example:

func (m *Metadata) PublicSubnets(ctx context.Context) (map[string]Subnet, error) {
	m.mutex.Lock()
	defer m.mutex.Unlock()

	if len(m.publicSubnets) == 0 {
		err := m.populateSubnets(ctx)
		if err != nil {
			return nil, errors.Wrap(err, "retrieving Public Subnets")
		}
	}

	return m.publicSubnets, nil
}

With that change, all exported functions (InstanceTypes, VPC, *Subnets, AvailabilityZones, Session) will look and behave consistently. I think that's more desirable than avoiding a bit of code "duplication".

Contributor Author

@sadasu sadasu Apr 6, 2023

I had considered that. I am making some changes. Let us see where we land.

@sadasu sadasu force-pushed the cleanup-aws-metedata branch from 38dcd6f to 5b449a2 Compare April 5, 2023 18:06
@sadasu sadasu force-pushed the cleanup-aws-metedata branch 2 times, most recently from 6cca123 to 77506b5 Compare April 6, 2023 17:14
@r4f4
Contributor

r4f4 commented Apr 6, 2023

@sadasu since you updated the error messages, the unit tests also need to be updated.

@sadasu sadasu force-pushed the cleanup-aws-metedata branch 3 times, most recently from 4ed48c6 to 5791245 Compare April 6, 2023 19:12
Contributor Author

@mtulio adding m.edgeSubnets to this test causes the unit tests to fail. It looks like you specifically added a test to prevent that. I don't see why we can't check for len(m.edgeSubnets) > 0 here. Is there some special way EdgeSubnets are handled?

Contributor

@sadasu, sorry, I missed that message. I am validating it now to recall the scenario that drove me to add that unit test check. Ideally, calling subnets() once should populate all *Subnets, including edge.

I am looking at the failed unit job in this PR's history. Overall, we require that edge subnets exist only when public or private subnets have been provided; we should never accept edge subnets alone (as CP and regular workers must exist alongside LZ nodes). This is a requirement defined in the EP due to AWS limitations on network resources (NLB/NGW, etc.).

Would you mind sharing the current unit test failure to help me dig in? I can see changes after the job I checked.

Contributor

Also sharing on Slack that I have successfully installed a cluster with Local Zones/edge + PHZ on this version:

commit 1fced156cc5540a540a20ac208d79812ef72eae2 (HEAD -> pr_7070)
Author: Sandhya Dasu <[email protected]>
Date:   Wed Apr 5 12:42:59 2023 -0400

install log:

$ OPENSHIFT_INSTALL_RELEASE_IMAGE_OVERRIDE="$RELEASE" $INSTALLER version
./openshift-install_pr7070 unreleased-master-7977-g1fced156cc5540a540a20ac208d79812ef72eae2-dirty
built from commit 1fced156cc5540a540a20ac208d79812ef72eae2
release image registry.ci.openshift.org/origin/release:4.13
release architecture amd64
(...)
INFO Time elapsed: 26m46s     
(...)                    
$ oc --kubeconfig $INSTALL_DIR/auth/kubeconfig get machines -n openshift-machine-api
NAME                                         PHASE     TYPE          REGION      ZONE               AGE
ocp-lz14-h9rmj-edge-us-east-1-nyc-1a-x9pkz   Running   c5d.2xlarge   us-east-1   us-east-1-nyc-1a   50m

@sadasu
Contributor Author

sadasu commented Apr 11, 2023

/retest

@patrickdillon
Contributor

/approve

Collapsing the mutexes into a single mutex in the common populateSubnets function makes sense to me.

It would be good to include a commit message explaining the change, per the contributing guidelines: https://github.com/openshift/installer/blob/master/CONTRIBUTING.md#commit-message-format

@openshift-ci
Contributor

openshift-ci bot commented Apr 12, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: patrickdillon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2023
@sadasu sadasu force-pushed the cleanup-aws-metedata branch from 4dd154b to 99d62e9 Compare April 12, 2023 20:53
@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Apr 12, 2023
@openshift-ci-robot
Contributor

/retest-required

Remaining retests: 0 against base HEAD ebabc74 and 2 for PR HEAD 4dd154bc0e965d7167ab3a5c3e18b1ca98efda75 in total

@sadasu sadasu force-pushed the cleanup-aws-metedata branch 2 times, most recently from 1fced15 to 2d39342 Compare April 13, 2023 03:28
This fix eliminates the need for mutexSubnets to update subnet
information within AWS metadata. It also updates populateSubnets
to take care of getting VPC and subnets once for the installation
eliminating duplicate code that could be prone to errors.

Co-authored-by: Rafael F. <[email protected]>
@sadasu sadasu force-pushed the cleanup-aws-metedata branch from 2d39342 to 6cb715e Compare April 13, 2023 14:46
@sadasu
Contributor Author

sadasu commented Apr 13, 2023

/test verify-vendor

@sadasu
Contributor Author

sadasu commented Apr 13, 2023

/test e2e-aws-ovn

@sadasu
Contributor Author

sadasu commented Apr 13, 2023

/retest-required

@sadasu
Contributor Author

sadasu commented Apr 14, 2023

/retest

@sadasu
Contributor Author

sadasu commented Apr 14, 2023

/hold
Hold off on non-critical merges until branch stabilizes.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 14, 2023
Contributor

@mtulio mtulio left a comment

validated with PHZ + LZ subnet
/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Apr 14, 2023
@sadasu
Contributor Author

sadasu commented Apr 17, 2023

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2023
@sadasu
Contributor Author

sadasu commented Apr 17, 2023

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 17, 2023
@patrickdillon
Contributor

/hold cancel
/retest

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Apr 18, 2023
@openshift-merge-robot openshift-merge-robot merged commit ed661cc into openshift:master Apr 18, 2023
@openshift-ci-robot
Contributor

@sadasu: Jira Issue OCPBUGS-10767: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-10767 has not been moved to the MODIFIED state.

@openshift-ci
Contributor

openshift-ci bot commented Apr 18, 2023

@sadasu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                          Commit   Details  Required  Rerun command
ci/prow/e2e-aws-ovn-fips           6cb715e  link     false     /test e2e-aws-ovn-fips
ci/prow/okd-e2e-aws-ovn            6cb715e  link     false     /test okd-e2e-aws-ovn
ci/prow/e2e-aws-ovn-proxy          6cb715e  link     false     /test e2e-aws-ovn-proxy
ci/prow/e2e-aws-ovn-workers-rhel8  6cb715e  link     false     /test e2e-aws-ovn-workers-rhel8

Full PR test history. Your PR dashboard.

@sadasu
Contributor Author

sadasu commented Apr 25, 2023

/cherry-pick release-4.13

@openshift-cherrypick-robot

@sadasu: new pull request created: #7129

Details

In response to this:

/cherry-pick release-4.13

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. lgtm Indicates that a PR is ready to be merged.

8 participants