
@abhinavdahiya
Contributor

@abhinavdahiya abhinavdahiya commented Mar 11, 2020

enhancement: openshift/enhancements#163

work items

  • Define custom service endpoints for install-config
  • Configure the installer to use service endpoints when provided
  • Configure the terraform to use service endpoints when provided
  • Configure the destroy code to use service endpoints
  • Configure the aws cloud provider config with service overrides
  • Import openshift/api with updated Infra api
  • Configure the new infrastructure config/v1 object with service endpoints for AWS
  • Distinguish known vs. unknown regions: allow all regions to be specified using the install-config, but only show known regions in the TUI
  • Add validation that unknown regions have specific service endpoints set
  • Set up AMI import/copy from us-east-1

Details

pkg/types/aws: add service endpoints to aws platform

Adds a list of service endpoints to the AWS platform. Each service endpoint entry consists of the service name and the endpoint URL.

Only one service endpoint can be provided per service, i.e. duplicates are not allowed.
The endpoint URL must be a hostname or an http(s) URL with no request path or query.
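For illustration, a minimal sketch of what such a type and its validation could look like; the field names and the helper name here are assumptions, not necessarily what the PR uses:

```go
package aws

import (
	"fmt"
	"net/url"
	"strings"
)

// ServiceEndpoint is a custom endpoint override for an AWS service.
type ServiceEndpoint struct {
	// Name is the AWS service name, e.g. "ec2" or "s3".
	Name string `json:"name"`
	// URL is the endpoint, e.g. "https://ec2.example.internal" or a bare hostname.
	URL string `json:"url"`
}

// validateServiceEndpoints enforces the rules above: no duplicate service
// names, and each URL is either a bare hostname or an http(s) URL with no
// request path or query.
func validateServiceEndpoints(endpoints []ServiceEndpoint) error {
	seen := map[string]bool{}
	for _, ep := range endpoints {
		if seen[ep.Name] {
			return fmt.Errorf("duplicate service endpoint for %q", ep.Name)
		}
		seen[ep.Name] = true

		u, err := url.Parse(ep.URL)
		if err != nil {
			return fmt.Errorf("invalid endpoint URL for %q: %v", ep.Name, err)
		}
		switch {
		case u.Scheme == "" && !strings.ContainsAny(ep.URL, "/?"):
			// A bare hostname such as "ec2.example.internal" is accepted as-is.
		case u.Scheme == "http" || u.Scheme == "https":
			if u.RawQuery != "" || (u.Path != "" && u.Path != "/") {
				return fmt.Errorf("endpoint for %q must not include a request path or query", ep.Name)
			}
		default:
			return fmt.Errorf("endpoint for %q must be a hostname or an http(s) URL", ep.Name)
		}
	}
	return nil
}
```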

asset/installconfig/aws/session.go: allow creating session with region and service overrides

Adds GetSessionWithOptions, which allows creating an aws.Session with options. Two options are currently provided:

WithRegion configures the aws.Session for a specific region.

WithServiceEndpoints configures the aws.Session with a new service resolver that uses the provided list of service endpoints to override the endpoints hard-coded into the AWS SDK for those services.
It uses an internal awsResolver that can be configured with a list of service endpoints, so that when the AWS SDK tries to fetch the endpoint for a service in a region, the configured endpoint is returned. If no endpoint was configured for the service, the internal resolver falls back to the AWS SDK's default resolver, which uses the hard-coded list.
NOTE: the internal resolver returns the same endpoint URL for all regions of a service. This is acceptable because the service endpoints provided to the installer do not specify a region, and since clusters are installed to a single region, only the endpoint for that region (or the global one) is ever needed.
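A rough sketch of how that option pattern and the fallback resolver can be wired up on top of aws-sdk-go; the Option type and signatures here are simplified assumptions, and the actual PR may thread the options through session.Options instead:

```go
package awsconfig

import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

// awsResolver returns the user-provided endpoint for a service when one was
// configured, and falls back to the SDK's built-in resolver otherwise.
type awsResolver struct {
	overrides map[string]string // service name => endpoint URL
}

func (r *awsResolver) EndpointFor(service, region string, optFns ...func(*endpoints.Options)) (endpoints.ResolvedEndpoint, error) {
	if url, ok := r.overrides[service]; ok {
		// The same URL is returned for every region; the installer only ever
		// needs the endpoint for the cluster's single region (or the global one).
		return endpoints.ResolvedEndpoint{URL: url, SigningRegion: region}, nil
	}
	return endpoints.DefaultResolver().EndpointFor(service, region, optFns...)
}

// Option mutates the session configuration before the session is created.
type Option func(*aws.Config)

// WithRegion pins the session to a specific region.
func WithRegion(region string) Option {
	return func(cfg *aws.Config) { cfg.Region = aws.String(region) }
}

// WithServiceEndpoints installs the override resolver built from the
// install-config service endpoints.
func WithServiceEndpoints(overrides map[string]string) Option {
	return func(cfg *aws.Config) {
		cfg.EndpointResolver = &awsResolver{overrides: overrides}
	}
}

// GetSessionWithOptions creates a session with shared credentials plus any
// caller-supplied options applied.
func GetSessionWithOptions(opts ...Option) (*session.Session, error) {
	cfg := aws.Config{}
	for _, opt := range opts {
		opt(&cfg)
	}
	return session.NewSessionWithOptions(session.Options{
		Config:            cfg,
		SharedConfigState: session.SharedConfigEnable,
	})
}
```

The metadata and destroy code paths described below can then build their sessions through this one constructor and pick up the endpoint overrides without any extra plumbing.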

asset/installconfig: use service endpoints when creating the session

The aws metadata asset now provides consumers a session initialized with the region and the AWS service overrides set in the install-config.
The clients in the various assets therefore transparently use the correct endpoints without extra configuration, while still being able to override them when initializing a client if needed.

asset: Use common session when fetching public zone ID for AWS

Previously, while generating the DNS asset, the asset would invoke the GetPublicZone function to fetch the public Route53 zone ID. This function would create its own AWS session instead of using the session already provided by the aws metadata. That has various downsides: the region is not configured and the service endpoint overrides are not applied. Switching to the already-initialized session lets it reuse that setup.

aws: configure terraform with service endpoints

The installer now passes the service endpoints set in the install-config to terraform as a map of service name => service endpoint URL.

  • Why the map lookup when setting endpoints for aws provider

The endpoints for the aws provider are a schema.Set type, so an input map object cannot be assigned to them directly; instead, we look up specific keys in the map and set the corresponding endpoints.

There are two major concerns with this,

First, what if we need multiple endpoints for a service, for example per region?
Terraform only supports providing ONE service endpoint per service. This is fine because the installer uses terraform to create resources in only one region today; if we ever had to create resources across regions, we might have to extend the installer to invoke terraform multiple times. There is no plan or requirement for that anytime soon.

Second, why is the list hard coded? That's not good, right?
The installer only creates a SPECIFIC set of resources using terraform, so the hard-coded list should be fine, and it can be extended when we need new types of resources. The list is hard coded because of terraform's current limitations.

destroy/aws: use the service overrides for AWS apis

Currently the destroy code path for the installer is driven entirely by the metadata.json. This allows users to clean up clusters as long as they possess or can create a metadata.json for the cluster. Therefore, the service endpoints for the cluster need to be propagated to the metadata.json for later use.

Adds a list of service endpoints, mirroring the install-config, to the metadata.json definition. The metadata asset uses the install-config to set the service endpoints when they are provided by the user.

Updates destroy/aws.New to create the session using GetSessionWithOptions, like the other assets, reusing the region and service endpoint options. This makes the destroy code a lot less complicated.

destroy/aws's ClusterUninstaller allows users to provide no AWS session. In that case, destroy continues to create an AWS session using SDK defaults with the specified region and no service endpoint overrides. Requiring users who call the destroy code directly to build a special session and pass it along seems sane, similar to the installer's metadata.json-powered code path.

manifests/aws: create cloud provider config when service endpoints are set

The cloud provider config is imported sparsely from upstream. The cloud provider config functions are imported from previous cloudproviderconfig.go and cloudproviderconfig_test.go but have been slightly modified for changed field names.

!IMPORTANT: All this code can be removed from the installer when the cloud-config-stitching controller is available based on enhancement

The AWS kube cloud controller uses a special configuration. For OpenShift, we need to configure the service endpoints for the various AWS services. Generating this configuration will eventually be handled by the controller defined in 5, but for the time being it is done once here as phase one.

data/aws/main.tf: load sts service endpoint override

Terraform uses the sts service to find details about the user's credentials. These details are used to compute various things like the account ID, the partition, and the DNS suffix.
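For illustration, this is roughly the information terraform derives from that call; a small standalone equivalent using aws-sdk-go (not code from the PR):

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws/arn"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/sts"
)

func main() {
	sess := session.Must(session.NewSession())
	// If an sts endpoint override is configured on the session, this call goes
	// through it; otherwise the SDK's built-in endpoint for the region is used.
	out, err := sts.New(sess).GetCallerIdentity(&sts.GetCallerIdentityInput{})
	if err != nil {
		log.Fatal(err)
	}
	callerARN, err := arn.Parse(*out.Arn)
	if err != nil {
		log.Fatal(err)
	}
	// The account ID and partition (aws, aws-cn, aws-us-gov, ...) drive things
	// like the DNS suffix used for other service endpoints.
	fmt.Println("account:", *out.Account, "partition:", callerARN.Partition)
}
```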

aws: use SDK and rhcos AMIs to calculate known regions

With respect to the installer, there are four kinds of regions that a user can provide.

First, a commercial AWS region where the RHCOS AMIs are published and known to the installer.
Second, a commercial AWS region where the RHCOS AMIs are not published or not known to the installer at that version.
Third, a restricted or sovereign AWS region where RHCOS AMIs are not published and which does not allow cross-region traffic.
Fourth, a custom region only available to the user.

For commercial regions in the first category, the installer is expected to use the AMI embedded in the binary. For commercial regions in the second category, the installer is expected to import the AMI from one of the first-category regions. For restricted AWS regions, the installer cannot import the AMI, so the user is expected to provide the AMI that needs to be used. Similarly, for custom regions available only to the user, the user is expected to provide the AMI.

Now, which regions fall into the first vs. the second category depends on the regions for which an AMI exists in `rhcos{-amd64}.json`. Since the contents of that file are embedded into the binary at build time using `shurcooL/vfsgen`, there are two ways to provide this list to the Go code:

  • Either provide an API as part of the `rhcos` pkg that reads the rhcos.json from the embedded assets.
    Since the file asset is not available to read unless the binary is built or the `OPENSHIFT_INSTALL_DATA` env is set, running unit tests with `go test` would fail. Losing the ability to run unit tests without setting any env or building a release binary severely limits what the unit tests can cover.

  • Or check in a hard-coded list of the regions based on the rhcos.json contents.
    This has the advantage that go tests just work without any extra configuration, but the checked-in list now needs to be kept in sync with rhcos.json.
    To help with that, we can use go generate to run a generator that takes rhcos.json as input and outputs a Go file with the list of regions as a variable. That generated file needs to be checked into version control,

and the only step needed when rhcos.json is bumped or changed is running `go generate ./pkg/rhcos/...`

To follow the second approach, a new executable Go file `ami_regions_generate.go` is added, which can be executed as

go run ami_regions_generate.go <package of the generated file> <source rhcos.json path> <destination go file path>

The generator provides a variable `AMIRegions`, which is a string slice of all the regions that are part of `rhcos.json`.

`ami.go` includes a go:generate directive so that `go generate` can execute the generator and update the list, as sketched below.
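A sketch of what such a generator can look like; the exact rhcos.json layout (assumed here to have an "amis" map keyed by region) and the output formatting may differ from the PR:

```go
// ami_regions_generate.go (sketch): go run ami_regions_generate.go <package> <rhcos.json> <out.go>
package main

import (
	"encoding/json"
	"fmt"
	"io/ioutil"
	"log"
	"os"
	"sort"
)

func main() {
	if len(os.Args) != 4 {
		log.Fatalf("usage: %s <package> <rhcos.json> <out.go>", os.Args[0])
	}
	pkg, in, out := os.Args[1], os.Args[2], os.Args[3]

	raw, err := ioutil.ReadFile(in)
	if err != nil {
		log.Fatal(err)
	}
	// Only the region keys of the "amis" map matter; everything else is ignored.
	var meta struct {
		AMIs map[string]json.RawMessage `json:"amis"`
	}
	if err := json.Unmarshal(raw, &meta); err != nil {
		log.Fatal(err)
	}

	regions := make([]string, 0, len(meta.AMIs))
	for region := range meta.AMIs {
		regions = append(regions, region)
	}
	sort.Strings(regions)

	f, err := os.Create(out)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	fmt.Fprintf(f, "// Code generated by ami_regions_generate.go; DO NOT EDIT.\n\npackage %s\n\n", pkg)
	fmt.Fprintln(f, "// AMIRegions are the AWS regions with a published RHCOS AMI.")
	fmt.Fprintln(f, "var AMIRegions = []string{")
	for _, region := range regions {
		fmt.Fprintf(f, "\t%q,\n", region)
	}
	fmt.Fprintln(f, "}")
}
```

The go:generate directive in `ami.go` then points at this generator with the repository-specific paths, so regenerating after an rhcos.json bump is a single command.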

The installconfig asset package now provides a public function to check whether a region is `Known`, i.e. one of the regions in the first category.

asset/installconfig/aws/platform.go: use the known regions for region list for tui

The terminal prompts must only offer users the regions where the installer has a RHCOS AMI published. The prompts are designed for getting-started scenarios, and unknown regions usually end up requiring more configuration than the TUI can support.

aws/validation/platform.go: drop region validation from install-config

Previously the installer only allowed users to specify certain regions in the install-config, but now users should be allowed to specify any region in the install-config.yaml.

Various validations can be added to make sure users provide AMIs and service endpoints for custom regions so that installs can succeed.

Since the AWS SDK only has endpoints for most services in the regions it knows about internally, we require that users provide service endpoints for some required services when using a custom region.

The required services are: "ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", and "tagging".

Users also need to specify an AMI for these custom regions.
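A hedged sketch of that validation, reusing the ServiceEndpoint type sketched earlier; knownToSDK stands in for however the installer consults the SDK's built-in region list, and the real function and error wording in the PR may differ:

```go
package aws

import (
	"fmt"
	"strings"
)

// requiredServices are the AWS services the installer itself talks to; for a
// custom region the SDK has no endpoints for them, so the user must provide overrides.
var requiredServices = []string{"ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", "tagging"}

// validateCustomRegion checks an install-config that targets a region the AWS
// SDK has no built-in endpoints for.
func validateCustomRegion(region, amiID string, endpoints []ServiceEndpoint, knownToSDK func(string) bool) error {
	if knownToSDK(region) {
		return nil // SDK-known regions need neither an AMI nor endpoint overrides here
	}

	provided := map[string]bool{}
	for _, ep := range endpoints {
		provided[ep.Name] = true
	}
	var missing []string
	for _, svc := range requiredServices {
		if !provided[svc] {
			missing = append(missing, svc)
		}
	}
	if len(missing) > 0 {
		return fmt.Errorf("region %q requires service endpoints for: %s", region, strings.Join(missing, ", "))
	}
	if amiID == "" {
		return fmt.Errorf("region %q requires an AMI to be specified", region)
	}
	return nil
}
```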

aws: copy AMI from us-east-1 for SDK regions when not available

For the second category of regions, commercial AWS regions that don't have a RHCOS AMI published but that allow copying AMIs in, the installer is expected to copy the AMI into the user's account from one of the regions where RHCOS is published.

So the installer picks the AMI published for the us-east-1 region and copies it to the chosen region for the cluster to use.

Currently, the rhcos.Image asset's string content is extended to `<ami-id>,<ami-id-region>` when the AMI to be used comes from a region other than the cluster's. The machinesets and machine objects are updated to use AWS tags to find the copied AMI.

The user is still allowed to provide an AMI if they want to reuse one, in which case the installer will use that AMI as-is and will not copy/import it.
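Since the extended format is just a comma-separated pair, the parsing can live in a single helper; a tiny sketch (the PR itself may inline this where it is needed, as a review comment below notes):

```go
package rhcos // illustrative placement

import "strings"

// SplitImage splits the extended rhcos.Image string. For "ami-0abc123,us-east-1"
// it returns ("ami-0abc123", "us-east-1"); for a plain "ami-0abc123" the region
// is empty, meaning the AMI already lives in the cluster's region.
func SplitImage(image string) (amiID, amiRegion string) {
	if idx := strings.Index(image, ","); idx >= 0 {
		return image[:idx], image[idx+1:]
	}
	return image, ""
}
```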

/hold because 30f06675df19f6f6d88b12897a9426ac3061bdb0 has a vendor for forked openshift/api

@openshift-ci-robot openshift-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 11, 2020
@openshift-ci-robot openshift-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 12, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from 83c87ff to aea8d03 Compare March 12, 2020 20:26
@openshift-ci-robot openshift-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Mar 12, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch 2 times, most recently from 7bc8e17 to dd36ada Compare March 13, 2020 16:12
@abhinavdahiya
Contributor Author

cred-minter PR openshift/cloud-credential-operator#167 required.

@abhinavdahiya abhinavdahiya changed the title [WIP] AWS: support custom regions and custom endpoints. AWS: support custom regions and custom endpoints. Mar 17, 2020
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 17, 2020
@abhinavdahiya
Contributor Author

So waiting on openshift/api#599, #3295 and #3293 to make final changes.

Will update the commit messages with details next.

@openshift-ci-robot openshift-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 18, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from 0012139 to 2193f27 Compare March 24, 2020 18:50
@openshift-ci-robot openshift-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 24, 2020
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch 2 times, most recently from 385a50f to c92d109 Compare March 24, 2020 21:08
Comment on lines 49 to 51
Contributor Author

maybe we should only configure the resolver when the list is non-empty.

@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from c92d109 to 806f0ec Compare March 25, 2020 03:08
@abhinavdahiya
Contributor Author

/hold because 30f0667 has a vendor for forked openshift/api

@openshift-ci-robot openshift-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 25, 2020
Contributor

@jstuever jstuever left a comment

I had a few questions about specific code. Also, do we plan to revert back to github.com/openshift/api after it has all the required changes?

Contributor

I don't see services being used in this function.

Contributor Author

good catch, will fix.

Contributor Author

fixed

Contributor

What happens if defaultRegion does not exist in the new endpoint, will we always panic? Maybe this would be better as an else if to line 39 if defaultRegionPointer != nil && *defaultRegionPointer != "" {

Contributor Author

The default region picked in code should always have an AMI.

Line 39 uses the user's environment to pick a default region, which can actually be one of the unknown regions.

Therefore a panic makes sense in the first case, since that is not a runtime error but a developer misconfiguration, while the second case should be handled.

Contributor

Do you think it's possible for an endpoint to have a region with the same name, but not be the same and therefore not have the RHCOS AMI?

Contributor Author

that would be odd, i don't think that would be possible...

Contributor

I don't understand this... `service := service`. I suspect a limit of my knowledge. Is it to copy the object to a new memory location?

Contributor Author

Go loops reuse the memory for the loop variable:
https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables

This is just a way to create a new variable whose address can be stored in the resolver.
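A minimal illustration of why the shadowing matters (this reflects pre-Go 1.22 loop semantics, which is what applied at the time of this PR):

```go
package main

import "fmt"

func main() {
	services := []string{"ec2", "s3", "route53"}

	bad := map[string]*string{}
	good := map[string]*string{}
	for _, service := range services {
		bad[service] = &service // pre-1.22: every entry aliases the single loop variable

		service := service // fresh variable per iteration
		good[service] = &service
	}

	fmt.Println(*bad["ec2"], *bad["s3"])   // prints "route53 route53" (pre-Go 1.22)
	fmt.Println(*good["ec2"], *good["s3"]) // prints "ec2 s3"
}
```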

Contributor

Pulling the AMIID out of the rhcosImage string seems to happen in three different locations. Does it make sense for this to be a function?

Member

@jhixson74 jhixson74 left a comment

Other than what has already been pointed out, everything else looks good to me.

@abhinavdahiya
Contributor Author

/test e2e-metal

@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from 806f0ec to 149433c Compare April 16, 2020 01:09
@abhinavdahiya
Contributor Author

/hold cancel

as the vendor now includes the correct update

The only test defined is `TestGetDefaultInstanceClass`. That test was useful for installer developers to make sure the default instance class used for a region is accurate and available in that region.

But this test is no longer required since the installer moved to using AWS APIs to find the correct default instance type [1].

[1]: openshift@d45e881
Since the AWS SDK is missing the Route53 endpoints for the AWS China regions, users would otherwise be forced to add that service endpoint.
But there is already a well-known endpoint for that service; the SDK is just trailing a bit, and the terraform aws provider already uses this same endpoint.
So adding the endpoint as a known default should help our users a little bit...

Also, the destroy code needs to perform the same switching to the AWS China partition as we do for AWS commercial regions for the resource tagging API w.r.t. the route53 object from [1]

[1]: openshift@e24c7dc
@abhinavdahiya abhinavdahiya force-pushed the aws_service_endpoints branch from adeb689 to 61ea70e Compare April 17, 2020 15:17
@abhinavdahiya
Contributor Author

rebased onto master for vendor conflicts.

@abhinavdahiya
Contributor Author

/retest

1 similar comment
@patrickdillon
Contributor

/retest

Comment on lines +219 to +222
// this is a list of known default endpoints for specific regions that would
// otherwise require user to set the service overrides.
// it's a map of region => service => resolved endpoint
// this is only used when the user hasn't specified a override for the service in that region.
Contributor

nit: This is the same comment as lines 179-182. It should probably only appear once.

Contributor Author

hmm.. personally I want to keep it the same for now, I think the internal one at 179-182 will change in the future but this function is going to stay the same.


variable "aws_master_root_volume_size" {
type = string
type = string
Contributor

seems like something weird is going on with tf-fmt, or I'm missing something... normally tf-fmt likes to have the = vertically aligned. I encountered something similar.

Contributor Author

yeah, i think we are using an old version of tf that had some weird bug... we should fix it and use a newer version..

@patrickdillon
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Apr 17, 2020
@abhinavdahiya
Contributor Author

/test e2e-gcp
/test e2e-azure
/retest

@abhinavdahiya
Contributor Author

/test e2e-vsphere

@openshift-ci-robot
Contributor

@abhinavdahiya: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/e2e-aws-steps 806f0ec89dba0fba17f3f5429dab05614d8e9119 link /test e2e-aws-steps
ci/prow/e2e-vsphere 6ced6ad link /test e2e-vsphere
ci/prow/e2e-aws-scaleup-rhel7 6ced6ad link /test e2e-aws-scaleup-rhel7
ci/prow/e2e-azure 6ced6ad link /test e2e-azure

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@abhinavdahiya
Contributor Author

No failures for existing CI. I think we can iterate on this.

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Apr 18, 2020
@openshift-merge-robot openshift-merge-robot merged commit 70447fe into openshift:master Apr 18, 2020
wking added a commit to wking/openshift-installer that referenced this pull request Jul 7, 2020
The guts were removed by 805a108 (platformtests: drop aws as no
longer required, 2020-03-11, openshift#3277), citing d45e881 (aws: pick
instance types based on selected availability zones, 2020-02-03, openshift#3051).
No need to keep the useless directory around now that it holds no
code.
