
@abhinavdahiya
Contributor

@abhinavdahiya abhinavdahiya commented Mar 11, 2020

enhancement: openshift/enhancements#163

work items

  • Define custom service endpoints for install-config
  • Configure the installer to use service endpoints when provided
  • Configure the terraform to use service endpoints when provided
  • Configure the destroy code to use service endpoints
  • Configure the aws cloud provider config with service overrides
  • Import openshift/api with updated Infra api
  • Configure the new infrastructure config/v1 object with service endpoints for AWS
  • Distinguish known vs. unknown regions: allow all regions to be specified using the install-config, but only show known regions in the TUI
  • Add validation that unknown regions have specific service endpoints set
  • Set up AMI import/copy from us-east-1

Details

pkg/types/aws: add service endpoints to aws platform

Adds a list of service endpoints to the AWS platform. Each service endpoint entry consists of the service name and the endpoint URL.

Only one service endpoint can be provided per service, i.e. duplicates are not allowed.
The endpoint URL must be a hostname or an http(s) URL with no request path or query.
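For illustration, a minimal sketch of what such a type and its validation could look like; the field names and the helper name here are assumptions, not necessarily what the PR uses:

```go
package aws

import (
	"fmt"
	"net/url"
	"strings"
)

// ServiceEndpoint is a custom endpoint override for an AWS service.
type ServiceEndpoint struct {
	// Name is the AWS service name, e.g. "ec2" or "s3".
	Name string `json:"name"`
	// URL is the endpoint, e.g. "https://ec2.example.internal" or a bare hostname.
	URL string `json:"url"`
}

// validateServiceEndpoints enforces the rules above: no duplicate service
// names, and each URL is either a bare hostname or an http(s) URL with no
// request path or query.
func validateServiceEndpoints(endpoints []ServiceEndpoint) error {
	seen := map[string]bool{}
	for _, ep := range endpoints {
		if seen[ep.Name] {
			return fmt.Errorf("duplicate service endpoint for %q", ep.Name)
		}
		seen[ep.Name] = true

		u, err := url.Parse(ep.URL)
		if err != nil {
			return fmt.Errorf("invalid endpoint URL for %q: %v", ep.Name, err)
		}
		switch {
		case u.Scheme == "" && !strings.ContainsAny(ep.URL, "/?"):
			// A bare hostname such as "ec2.example.internal" is accepted as-is.
		case u.Scheme == "http" || u.Scheme == "https":
			if u.RawQuery != "" || (u.Path != "" && u.Path != "/") {
				return fmt.Errorf("endpoint for %q must not include a request path or query", ep.Name)
			}
		default:
			return fmt.Errorf("endpoint for %q must be a hostname or an http(s) URL", ep.Name)
		}
	}
	return nil
}
```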

asset/installconfig/aws/session.go: allow creating session with region and service overrides

Adds GetSessionWithOptions, which allows creating an aws.Session with options. Two options are currently provided:

WithRegion configures the aws.Session for a specific region.

WithServiceEndpoints configures the aws.Session with a new service resolver that uses the provided list of service endpoints to override the endpoints hard-coded into the AWS SDK for those services.
It uses an internal awsResolver that can be configured with a list of service endpoints, so that when the AWS SDK tries to fetch the endpoint for a service in a region, the configured endpoint is returned. If no endpoint was configured for the service, the internal resolver falls back to the AWS SDK's default resolver, which uses the hard-coded list.
NOTE: the internal resolver returns the same endpoint URL for all regions of a service. This is acceptable because the service endpoints provided to the installer do not specify a region, and since clusters are installed to a single region, only the endpoint for that region (or the global one) is ever needed.
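A rough sketch of how that option pattern and the fallback resolver can be wired up on top of aws-sdk-go; the Option type and signatures here are simplified assumptions, and the actual PR may thread the options through session.Options instead:

```go
package awsconfig

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

// awsResolver returns the user-provided endpoint for a service when one was
// configured, and falls back to the SDK's built-in resolver otherwise.
type awsResolver struct {
	overrides map[string]string // service name => endpoint URL
}

func (r *awsResolver) EndpointFor(service, region string, optFns ...func(*endpoints.Options)) (endpoints.ResolvedEndpoint, error) {
	if url, ok := r.overrides[service]; ok {
		// The same URL is returned for every region; the installer only ever
		// needs the endpoint for the cluster's single region (or the global one).
		return endpoints.ResolvedEndpoint{URL: url, SigningRegion: region}, nil
	}
	return endpoints.DefaultResolver().EndpointFor(service, region, optFns...)
}

// Option mutates the session configuration before the session is created.
type Option func(*aws.Config)

// WithRegion pins the session to a specific region.
func WithRegion(region string) Option {
	return func(cfg *aws.Config) { cfg.Region = aws.String(region) }
}

// WithServiceEndpoints installs the override resolver built from the
// install-config service endpoints.
func WithServiceEndpoints(overrides map[string]string) Option {
	return func(cfg *aws.Config) {
		cfg.EndpointResolver = &awsResolver{overrides: overrides}
	}
}

// GetSessionWithOptions creates a session with shared credentials plus any
// caller-supplied options applied.
func GetSessionWithOptions(opts ...Option) (*session.Session, error) {
	cfg := aws.Config{}
	for _, opt := range opts {
		opt(&cfg)
	}
	return session.NewSessionWithOptions(session.Options{
		Config:            cfg,
		SharedConfigState: session.SharedConfigEnable,
	})
}
```

The metadata and destroy code paths described below can then build their sessions through this one constructor and pick up the endpoint overrides without any extra plumbing.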

asset/installconfig: use service endpoints when creating the session

The aws metadata asset now provides consumers a session initialized with the region and the AWS service overrides set in the install-config.
The clients in the various assets therefore transparently use the correct endpoints without extra configuration, while still being able to override them when initializing a client if needed.

asset: Use common session when fetching public zone ID for AWS

Previously, while generating the DNS asset, the asset would invoke the GetPublicZone function to fetch the public Route53 zone ID. This function would create its own AWS session instead of using the session already provided by the aws metadata. That has various downsides: the region is not configured and the service endpoint overrides are not applied. Switching to the already-initialized session lets it reuse that setup.

aws: configure terraform with service endpoints

The installer now passes the service endpoints set in the install-config to terraform as a map of service name => service endpoint URL.

  • Why the map lookup when setting endpoints for aws provider

The endpoints for the aws provider are a schema.Set type, so an input map object cannot be assigned to them directly; instead, we look up specific keys in the map and set the corresponding endpoints.

There are two major concerns with this,

First, what if we need multiple endpoints for a service, for example per region?
Terraform only supports providing ONE service endpoint per service. This is fine because the installer uses terraform to create resources in only one region today; if we ever had to create resources across regions, we might have to extend the installer to invoke terraform multiple times. There is no plan or requirement for that anytime soon.

Second, why is the list hard coded? That's not good, right?
The installer only creates a SPECIFIC set of resources using terraform, so the hard-coded list should be fine, and it can be extended when we need new types of resources. The list is hard coded because of terraform's current limitations.

destroy/aws: use the service overrides for AWS apis

Currently the destroy code path for the installer is driven entirely by the metadata.json. This allows users to clean up clusters as long as they possess or can create a metadata.json for the cluster. Therefore, the service endpoints for the cluster need to be propagated to the metadata.json for later use.

Adds a list of service endpoints, mirroring the install-config, to the metadata.json definition. The metadata asset uses the install-config to set the service endpoints when they are provided by the user.

Updates destroy/aws.New to create the session using GetSessionWithOptions, like the other assets, reusing the region and service endpoint options. This makes the destroy code a lot less complicated.

destroy/aws's ClusterUninstaller allows users to provide no AWS session. In that case, destroy continues to create an AWS session using SDK defaults with the specified region and no service endpoint overrides. Requiring users who call the destroy code directly to build a special session and pass it along seems sane, similar to the installer's metadata.json-powered code path.

manifests/aws: create cloud provider config when service endpoints are set

The cloud provider config is imported sparsely from upstream. The cloud provider config functions are imported from previous cloudproviderconfig.go and cloudproviderconfig_test.go but have been slightly modified for changed field names.

!IMPORTANT: All this code can be removed from the installer when the cloud-config-stitching controller is available based on enhancement

The AWS kube cloud controller uses a special configuration. For OpenShift, we need to configure the service endpoints for the various AWS services. Generating this configuration will eventually be handled by the controller defined in 5, but for the time being it is done once here as phase one.

data/aws/main.tf: load sts service endpoint override

Terraform uses the sts service to find details about the user's credentials. These details are used to compute various things like the account ID, the partition, and the DNS suffix.
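For illustration, this is roughly the information terraform derives from that call; a small standalone equivalent using aws-sdk-go (not code from the PR):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws/arn"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	sess := session.Must(session.NewSession())
	// If an sts endpoint override is configured on the session, this call goes
	// through it; otherwise the SDK's built-in endpoint for the region is used.
	out, err := sts.New(sess).GetCallerIdentity(&sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	callerARN, err := arn.Parse(*out.Arn)
	if err != nil {
		log.Fatal(err)
	}
	// The account ID and partition (aws, aws-cn, aws-us-gov, ...) drive things
	// like the DNS suffix used for other service endpoints.
	fmt.Println("account:", *out.Account, "partition:", callerARN.Partition)
}
```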

aws: use SDK and rhcos AMIs to calculate known regions

With respect to the installer, there are four kinds of regions that a user can provide.

First, a commercial AWS region where the RHCOS AMIs are published and known to the installer.
Second, a commercial AWS region where the RHCOS AMIs are not published or not known to the installer at that version.
Third, a restricted or sovereign AWS region where RHCOS AMIs are not published and which does not allow cross-region traffic.
Fourth, a custom region only available to the user.

For commercial regions in the first category, the installer is expected to use the AMI embedded in the binary. For commercial regions in the second category, the installer is expected to import the AMI from one of the first-category regions. For restricted AWS regions, the installer cannot import the AMI, so the user is expected to provide the AMI that needs to be used. Similarly, for custom regions available only to the user, the user is expected to provide the AMI.

Now, which regions fall into the first vs. the second category depends on the regions for which an AMI exists in `rhcos{-amd64}.json`. Since the contents of that file are embedded into the binary at build time using `shurcooL/vfsgen`, there are two ways to provide this list to the Go code:

  • Either provide an API as part of the `rhcos` pkg that reads the rhcos.json from the embedded assets.
    Since the file asset is not available to read unless the binary is built or the `OPENSHIFT_INSTALL_DATA` env is set, running unit tests with `go test` would fail. Losing the ability to run unit tests without setting any env or building a release binary severely limits what the unit tests can cover.

  • Or check in a hard-coded list of the regions based on the rhcos.json contents.
    This has the advantage that go tests just work without any extra configuration, but the checked-in list now needs to be kept in sync with rhcos.json.
    To help with that, we can use go generate to run a generator that takes rhcos.json as input and outputs a Go file with the list of regions as a variable. That generated file needs to be checked into version control,

and the only step needed when rhcos.json is bumped or changed is running `go generate ./pkg/rhcos/...`

To follow the second approach, a new executable Go file `ami_regions_generate.go` is added, which can be executed as

go run ami_regions_generate.go <package of the generated file> <source rhcos.json path> <destination go file path>

The generator provides a variable `AMIRegions`, which is a string slice of all the regions that are part of `rhcos.json`.

`ami.go` includes a go:generate directive so that `go generate` can execute the generator and update the list, as sketched below.
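A sketch of what such a generator can look like; the exact rhcos.json layout (assumed here to have an "amis" map keyed by region) and the output formatting may differ from the PR:

```go
// ami_regions_generate.go (sketch): go run ami_regions_generate.go <package> <rhcos.json> <out.go>
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"sort"
)

func main() {
	if len(os.Args) != 4 {
		log.Fatalf("usage: %s <package> <rhcos.json> <out.go>", os.Args[0])
	}
	pkg, in, out := os.Args[1], os.Args[2], os.Args[3]

	raw, err := ioutil.ReadFile(in)
	if err != nil {
		log.Fatal(err)
	}
	// Only the region keys of the "amis" map matter; everything else is ignored.
	var meta struct {
		AMIs map[string]json.RawMessage `json:"amis"`
	}
	if err := json.Unmarshal(raw, &meta); err != nil {
		log.Fatal(err)
	}

	regions := make([]string, 0, len(meta.AMIs))
	for region := range meta.AMIs {
		regions = append(regions, region)
	}
	sort.Strings(regions)

	f, err := os.Create(out)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fmt.Fprintf(f, "// Code generated by ami_regions_generate.go; DO NOT EDIT.\n\npackage %s\n\n", pkg)
	fmt.Fprintln(f, "// AMIRegions are the AWS regions with a published RHCOS AMI.")
	fmt.Fprintln(f, "var AMIRegions = []string{")
	for _, region := range regions {
		fmt.Fprintf(f, "\t%q,\n", region)
	}
	fmt.Fprintln(f, "}")
}
```

The go:generate directive in `ami.go` then points at this generator with the repository-specific paths, so regenerating after an rhcos.json bump is a single command.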

The installconfig asset package now provides a public function to check whether a region is `Known`, i.e. one of the regions in the first category.

asset/installconfig/aws/platform.go: use the known regions for region list for tui

The terminal prompts must only offer users the regions where the installer has a RHCOS AMI published. The prompts are designed for getting-started scenarios, and unknown regions usually end up requiring more configuration than the TUI can support.

aws/validation/platform.go: drop region validation from install-config

Previously the installer only allowed users to specify certain regions in the install-config, but now users should be allowed to specify any region in the install-config.yaml.

Various validations can be added to make sure users provide AMIs and service endpoints for custom regions so that installs can succeed.

Since the AWS SDK only has endpoints for most services in the regions it knows about internally, we require that users provide service endpoints for some required services when using a custom region.

The required services are: "ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", and "tagging".

Users also need to specify an AMI for these custom regions.
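A hedged sketch of that validation, reusing the ServiceEndpoint type sketched earlier; knownToSDK stands in for however the installer consults the SDK's built-in region list, and the real function and error wording in the PR may differ:

```go
package aws

import (
	"fmt"
	"strings"
)

// requiredServices are the AWS services the installer itself talks to; for a
// custom region the SDK has no endpoints for them, so the user must provide overrides.
var requiredServices = []string{"ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", "tagging"}

// validateCustomRegion checks an install-config that targets a region the AWS
// SDK has no built-in endpoints for.
func validateCustomRegion(region, amiID string, endpoints []ServiceEndpoint, knownToSDK func(string) bool) error {
	if knownToSDK(region) {
		return nil // SDK-known regions need neither an AMI nor endpoint overrides here
	}

	provided := map[string]bool{}
	for _, ep := range endpoints {
		provided[ep.Name] = true
	}
	var missing []string
	for _, svc := range requiredServices {
		if !provided[svc] {
			missing = append(missing, svc)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("region %q requires service endpoints for: %s", region, strings.Join(missing, ", "))
	}
	if amiID == "" {
		return fmt.Errorf("region %q requires an AMI to be specified", region)
	}
	return nil
}
```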

aws: copy AMI from us-east-1 for SDK regions when not available

For the second category of regions, commercial AWS regions that don't have a RHCOS AMI published but that allow copying AMIs in, the installer is expected to copy the AMI into the user's account from one of the regions where RHCOS is published.

So the installer picks the AMI published for the us-east-1 region and copies it to the chosen region for the cluster to use.

Currently, the rhcos.Image asset's string content is extended to `<ami-id>,<ami-id-region>` when the AMI to be used comes from a region other than the cluster's. The machinesets and machine objects are updated to use AWS tags to find the copied AMI.

The user is still allowed to provide an AMI if they want to reuse one, in which case the installer will use that AMI as-is and will not copy/import it.
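Since the extended format is just a comma-separated pair, the parsing can live in a single helper; a tiny sketch (the PR itself may inline this where it is needed, as a review comment below notes):

```go
package rhcos // illustrative placement

import "strings"

// SplitImage splits the extended rhcos.Image string. For "ami-0abc123,us-east-1"
// it returns ("ami-0abc123", "us-east-1"); for a plain "ami-0abc123" the region
// is empty, meaning the AMI already lives in the cluster's region.
func SplitImage(image string) (amiID, amiRegion string) {
	if idx := strings.Index(image, ","); idx >= 0 {
		return image[:idx], image[idx+1:]
	}
	return image, ""
}
```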

/hold because 30f06675df19f6f6d88b12897a9426ac3061bdb0 has a vendor for forked openshift/api

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 11, 2020
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 12, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from 83c87ff to aea8d03 Compare March 12, 2020 20:26
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 12, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch 2 times, most recently from 7bc8e17 to dd36ada Compare March 13, 2020 16:12
@abhinavdahiya
Contributor Author

cred-minter PR openshift/cloud-credential-operator#167 required.

@abhinavdahiya abhinavdahiya changed the title [WIP] AWS: support custom regions and custom endpoints. AWS: support custom regions and custom endpoints. Mar 17, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2020
@abhinavdahiya
Contributor Author

So waiting on openshift/api#599, #3295 and #3293 to make final changes.

Will update the commit messages with details next.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from 0012139 to 2193f27 Compare March 24, 2020 18:50
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch 2 times, most recently from 385a50f to c92d109 Compare March 24, 2020 21:08
Comment on lines 49 to 51
Contributor Author

maybe we should only configure the resolver when the list is non-empty.

@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from c92d109 to 806f0ec Compare March 25, 2020 03:08
@abhinavdahiya
Contributor Author

/hold because 30f0667 has a vendor for forked openshift/api

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2020
Contributor

@jstuever jstuever left a comment

I had a few questions about specific code. Also, do we plan to revert back to github.com/openshift/api after it has all the required changes?

Contributor

I don't see services being used in this function.

Contributor Author

good catch, will fix.

Contributor Author

fixed

Contributor

What happens if defaultRegion does not exist in the new endpoint, will we always panic? Maybe this would be better as an else if to line 39 if defaultRegionPointer != nil && *defaultRegionPointer != "" {

Contributor Author

The default region picked in code should always have an AMI.

Line 39 uses the user's environment to pick a default region, which can actually be one of the unknown regions.

Therefore a panic makes sense in the first case, since that is not a runtime error but a developer misconfiguration, while the second case should be handled.

Contributor

Do you think it's possible for an endpoint to have a region with the same name, but not be the same and therefore not have the RHCOS AMI?

Contributor Author

that would be odd, i don't think that would be possible...

Contributor

I don't understand this... `service := service`. I suspect a limit of my knowledge. Is it to copy the object to a new memory location?

Contributor Author

Go loops reuse the memory for the loop variable:
https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables

This is just a way to create a new variable whose address can be stored in the resolver.
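A minimal illustration of why the shadowing matters (this reflects pre-Go 1.22 loop semantics, which is what applied at the time of this PR):

```go
package main

import "fmt"

func main() {
	services := []string{"ec2", "s3", "route53"}

	bad := map[string]*string{}
	good := map[string]*string{}
	for _, service := range services {
		bad[service] = &service // pre-1.22: every entry aliases the single loop variable

		service := service // fresh variable per iteration
		good[service] = &service
	}

	fmt.Println(*bad["ec2"], *bad["s3"])   // prints "route53 route53" (pre-Go 1.22)
	fmt.Println(*good["ec2"], *good["s3"]) // prints "ec2 s3"
}
```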

Contributor

Pulling the AMIID out of the rhcosImage string seems to happen in three different locations. Does it make sense for this to be a function?

Member

@jhixson74 jhixson74 left a comment

Other than what has already been pointed out, everything else looks good to me.

@abhinavdahiya
Contributor Author

/test e2e-metal

@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from 806f0ec to 149433c Compare April 16, 2020 01:09
@abhinavdahiya
Contributor Author

/hold cancel

as the vendor now includes the correct update

The only test defined is `TestGetDefaultInstanceClass`. That test was useful for installer developers to make sure the default instance class used for a region is accurate and available in that region.

But this test is no longer required since the installer moved to using AWS APIs to find the correct default instance type [1].

[1]: openshift@d45e881
Since the AWS SDK is missing the Route53 endpoints for the AWS China regions, users would otherwise be forced to add that service endpoint.
But there is already a well-known endpoint for that service; the SDK is just trailing a bit, and the terraform aws provider already uses this same endpoint.
So adding the endpoint as a known default should help our users a little bit...

Also, the destroy code needs to perform the same switching to the AWS China partition as we do for AWS commercial regions for the resource tagging API w.r.t. the route53 object from [1]

[1]: openshift@e24c7dc
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from adeb689 to 61ea70e Compare April 17, 2020 15:17
@abhinavdahiya
Contributor Author

rebased onto master for vendor conflicts.

@abhinavdahiya
Contributor Author

/retest

1 similar comment
@patrickdillon
Contributor

/retest

Comment on lines +219 to +222
// this is a list of known default endpoints for specific regions that would
// otherwise require user to set the service overrides.
// it's a map of region => service => resolved endpoint
// this is only used when the user hasn't specified a override for the service in that region.
Contributor

nit: This is the same comment as lines 179-182. It should probably only appear once.

Contributor Author

hmm.. personally I want to keep it the same for now, I think the internal one at 179-182 will change in the future but this function is going to stay the same.


variable "aws_master_root_volume_size" {
type = string
type = string
Contributor

seems like something weird is going on with tf-fmt, or I'm missing something... normally tf-fmt likes to have the = vertically aligned. I encountered something similar.

Contributor Author

yeah, i think we are using an old version of tf that had some weird bug... we should fix it and use a newer version..

@patrickdillon
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2020
@abhinavdahiya
Contributor Author

/test e2e-gcp
/test e2e-azure
/retest

@abhinavdahiya
Contributor Author

/test e2e-vsphere

@openshift-ci-robot
Contributor

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-steps 806f0ec89dba0fba17f3f5429dab05614d8e9119 link /test e2e-aws-steps
ci/prow/e2e-vsphere 6ced6ad link /test e2e-vsphere
ci/prow/e2e-aws-scaleup-rhel7 6ced6ad link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-azure 6ced6ad link /test e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor Author

No failures for existing CI. I think we can iterate on this.

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2020
@openshift-merge-robot openshift-merge-robot merged commit 70447fe into openshift:master Apr 18, 2020
wking added a commit to wking/openshift-installer that referenced this pull request Jul 7, 2020
The guts were removed by 805a108 (platformtests: drop aws as no
longer required, 2020-03-11, openshift#3277), citing d45e881 (aws: pick
instance types based on selected availability zones, 2020-02-03, openshift#3051).
No need to keep the useless directory around now that it holds no
code.
