AWS: support custom regions and custom endpoints. #3277
Conversation
force-pushed from 83c87ff to aea8d03
force-pushed from 7bc8e17 to dd36ada
cred-minter PR openshift/cloud-credential-operator#167 required.

So waiting on openshift/api#599, #3295 and #3293 to make final changes. Will update the commit messages with details next.
force-pushed from 0012139 to 2193f27
force-pushed from 385a50f to c92d109
maybe we should configure resolver when the list is non-empty.
force-pushed from c92d109 to 806f0ec
/hold because 30f0667 has a vendor for forked openshift/api
jstuever left a comment:
I had a few questions about specific code. Also, do we plan to revert back to github.com/openshift/api after it has all the required changes?
I don't see services being used in this function.
good catch, will fix.
fixed
What happens if defaultRegion does not exist in the new endpoint, will we always panic? Maybe this would be better as an `else if` to line 39: `if defaultRegionPointer != nil && *defaultRegionPointer != "" {`
The default region picked in code should always have an AMI.
Line 39 uses the user's environment to pick a default region, which can actually be one of the unknown regions.
Therefore a panic makes sense in the first case, as that is not a runtime error but a developer misconfiguration, while the second case should be handled as an error.
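A minimal sketch of the two paths being discussed; the names `knownRegionAMIs`, `amiForDefaultRegion`, and `amiForUserRegion` are illustrative, not the installer's actual identifiers:

```go
package main

import "fmt"

// knownRegionAMIs stands in for the embedded AMI list; contents are fake.
var knownRegionAMIs = map[string]string{
	"us-east-1": "ami-0123456789abcdef0",
}

// amiForDefaultRegion covers the code-chosen default: a missing AMI here
// is developer misconfiguration, so panicking is acceptable.
func amiForDefaultRegion(defaultRegion string) string {
	ami, ok := knownRegionAMIs[defaultRegion]
	if !ok {
		panic(fmt.Sprintf("no AMI for built-in default region %s", defaultRegion))
	}
	return ami
}

// amiForUserRegion covers a region picked up from the user's environment,
// which may legitimately be unknown, so it returns an error instead.
func amiForUserRegion(region string) (string, error) {
	if ami, ok := knownRegionAMIs[region]; ok {
		return ami, nil
	}
	return "", fmt.Errorf("no published AMI for region %s", region)
}

func main() {
	fmt.Println(amiForDefaultRegion("us-east-1"))
	if _, err := amiForUserRegion("xx-custom-1"); err != nil {
		fmt.Println("handled:", err)
	}
}
```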
Do you think it's possible for an endpoint to have a region with the same name, but not be the same and therefore not have the RHCOS AMI?
That would be odd; I don't think that would be possible...
I don't understand this `service := service` — I suspect a limit of my knowledge. Is it to copy the object to a new memory location?
Go loops reuse the same memory for the loop variable:
https://github.com/golang/go/wiki/CommonMistakes#using-goroutines-on-loop-iterator-variables
This is just a way to create a new variable that can be stored in the resolver.
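A small self-contained illustration of the pitfall and the `service := service` fix (the `endpoint` type and the data are made up for the example; note that Go 1.22+ changed loop-variable semantics so the "buggy" form only misbehaves on older toolchains such as the one this PR targets):

```go
package main

import "fmt"

type endpoint struct{ name, url string }

// collectBuggy stores &service from the loop variable directly; on
// toolchains before Go 1.22 every map entry ends up pointing at the
// single reused loop variable, i.e. the last element.
func collectBuggy(services []endpoint) map[string]*endpoint {
	out := map[string]*endpoint{}
	for _, service := range services {
		out[service.name] = &service
	}
	return out
}

// collectFixed uses `service := service` to create a fresh variable per
// iteration, so each stored pointer refers to its own copy.
func collectFixed(services []endpoint) map[string]*endpoint {
	out := map[string]*endpoint{}
	for _, service := range services {
		service := service
		out[service.name] = &service
	}
	return out
}

func main() {
	services := []endpoint{
		{"ec2", "https://ec2.example.com"},
		{"s3", "https://s3.example.com"},
	}
	fmt.Println(collectFixed(services)["ec2"].name)
}
```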
pkg/asset/machines/master.go
Pulling the AMIID out of the rhcosImage string seems to happen in three different locations. Does it make sense for this to be a function?
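A sketch of the suggested helper, assuming the rhcos.Image string format described elsewhere in this PR (`<ami-id>` or `<ami-id>,<ami-id-region>`); the name `splitAMI` is hypothetical:

```go
package main

import (
	"fmt"
	"strings"
)

// splitAMI extracts the AMI ID, and the region it lives in when present,
// from an rhcos image string of the form "<ami-id>" or
// "<ami-id>,<ami-id-region>".
func splitAMI(rhcosImage string) (amiID, amiRegion string) {
	parts := strings.SplitN(rhcosImage, ",", 2)
	amiID = parts[0]
	if len(parts) == 2 {
		amiRegion = parts[1]
	}
	return amiID, amiRegion
}

func main() {
	id, region := splitAMI("ami-0abcd1234,us-east-1")
	fmt.Println(id, region)
	id, region = splitAMI("ami-0abcd1234")
	fmt.Println(id, region)
}
```

Factoring this into one function would let the three call sites share the parsing instead of each re-implementing the split.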
jhixson74 left a comment:
Other than what has already been pointed out, everything else looks good to me.
/test e2e-metal
force-pushed from 806f0ec to 149433c
/hold cancel as the vendor now includes the correct update
The only test defined is `TestGetDefaultInstanceClass`. That test was useful for installer developers to make sure the default instance class used for a region is accurate and available in that region. But these tests are no longer required since the installer moved to using AWS APIs to find the correct default instance type [1]. [1]: openshift@d45e881
With respect to the installer, there are four kinds of regions that a user can provide.
First, a commercial AWS region where the RHCOS AMIs are published and known to the installer.
Second, a commercial AWS region where the RHCOS AMIs are not published, or not known to the installer at that version.
Third, a restricted or sovereign AWS region where RHCOS AMIs are not published and which also does not allow cross-region traffic.
Fourth, a custom region only available to the user.
For commercial regions in the first category, the installer is expected to use the AMI embedded in the binary. For commercial regions in the second, the installer is expected to `import` the AMI from one of the regions in the first. For the restricted AWS regions, the installer cannot `import` the AMI, so the user is expected to provide the installer with the AMI to be used. Similarly for custom user-only regions, the users are expected to provide the AMI.
Now, the list of regions that fall into the first vs. the second category depends on the regions for which an AMI exists in `rhcos{-amd64}.json`. Since the contents of that file are embedded into the binary at build time using `shurcooL/vfsgen`, there are two ways to provide this list to the Go code:
- either provide an API as part of the `rhcos` pkg that reads the rhcos.json from the embedded assets.
  Since the file asset is not available to `read` unless the binary is built or the `OPENSHIFT_INSTALL_DATA` env is set, running unit tests with a plain `go test` would fail. Losing the ability to run unit tests without setting any env or building a release binary severely hinders the extent of unit testing that is possible.
- or check in a hard-coded list of the regions based on the rhcos.json contents.
  This has the advantage that go tests just work without any extra configuration, but now the file needs to be kept in sync with rhcos.json.
To help with automatic management, we can use go generate to create a Go file that takes rhcos.json as input and outputs a Go file with the list of regions as a variable. This file needs to be checked into version control, and the only step required when rhcos.json is bumped or changed is running `go generate ./pkg/rhcos/...`.
To follow the second method, a new executable Go file `ami_regions_generate.go` is added, which can be executed as
```
go run ami_regions_generate.go <packae of the generated file> <source rhcos.json path> <destination go file path>
```
The generator provides a variable `AMIRegions`, a string slice of all the regions that are part of `rhcos.json`.
`ami.go` includes a go:generate tag to allow go generate to execute the generator and update the list.
The installconfig asset package now provides a public function to check whether a region is `Known`, i.e. in the first category.
… list for tui
The terminal prompts must only offer users a list of regions where the installer has a RHCOS AMI published. The terminal prompts are designed for getting-started regions, and unknown regions usually end up requiring more configuration than the TUI can support.
Previously the installer only allowed users to specify certain regions in the install-config. But now users should be allowed to specify any region in the install-config.yaml; various validations can be added to make sure users provide AMIs and service endpoints for custom regions to allow installs to succeed.
…ain services for custom regions
Since the AWS SDK only has endpoints for most of the services in the regions known internally to the SDK, we require that users provide service endpoints for some required services when using a custom region. The required services are: "ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", and "tagging". The users also need to specify an AMI for these custom regions.
For the second category of regions, commercial AWS regions that don't have a RHCOS AMI published but that allow AMI import, the installer is expected to copy the AMI to the user's account from one of the RHCOS-published regions. So the installer picks the AMI published for the us-east-1 region and copies it to the chosen region for the cluster to use. Currently, the rhcos.Image asset's string content is extended to `<ami-id>,<ami-id-region>` when the AMI to be used is from a different region than the cluster's, and the machinesets and machine objects are updated to use AWS tags to find the generated AMI. The user is still allowed to provide an AMI if they want to re-use one, in which case the installer uses the AMI as-is and does not copy/import it.
Since the AWS SDK is missing the route53 endpoints for the AWS China regions, users would be forced to add that service endpoint. But there is already a known endpoint for that service and the SDK is just trailing a bit; the terraform aws provider is already using this same endpoint. So adding the endpoint as a known default should help our users a little. Also, the destroy code needs to perform the same switching for the AWS China partition as we do for AWS commercial regions for the resource tagging API w.r.t. the route53 object from [1]. [1]: openshift@e24c7dc
force-pushed from adeb689 to 61ea70e
rebased onto master for vendor conflicts.

/retest

/retest
```go
// this is a list of known default endpoints for specific regions that would
// otherwise require the user to set the service overrides.
// it's a map of region => service => resolved endpoint
// this is only used when the user hasn't specified an override for the service in that region.
```
nit: This is the same comment as lines 179-182. It should probably only appear once.
Hmm, personally I want to keep it the same for now. I think the internal one at 179-182 will change in the future, but this function is going to stay the same.
```
variable "aws_master_root_volume_size" {
  type = string
```
Seems like something weird is going on with tf-fmt, or I'm missing something... normally tf-fmt likes to have the `=` vertically aligned. I encountered something similar.
Yeah, I think we are using an old version of terraform that had some weird bug... we should fix it and use a newer version.
/lgtm

/test e2e-gcp

/test e2e-vsphere
@abhinavdahiya: The following tests failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
No failures for existing CI. I think we can iterate on this.
/approve

[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: abhinavdahiya
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
The guts were removed by 805a108 (platformtests: drop aws as no longer required, 2020-03-11, openshift#3277), citing d45e881 (aws: pick instance types based on selected availability zones, 2020-02-03, openshift#3051). No need to keep the useless directory around now that it holds no code.
enhancement: openshift/enhancements#163
work items
pkg/types/aws: add service endpoints to aws platform
Adds a list of service endpoints to the AWS platform. A service endpoint entry consists of the service name and the endpoint URL.
Only one service endpoint per service can be provided, i.e. duplicates are not allowed.
The endpoint URL must be a hostname or an http(s) URL with no request path or query parameters.
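The URL rule above can be sketched as a small validator; this is an illustrative re-implementation using only the standard library, not the installer's actual validation code:

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// validateEndpointURL enforces the stated rule: the value must be a bare
// hostname or an http(s) URL with no request path and no query.
func validateEndpointURL(raw string) error {
	if !strings.Contains(raw, "://") {
		// Treat a bare hostname as an https URL so url.Parse sets Host.
		raw = "https://" + raw
	}
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if (u.Path != "" && u.Path != "/") || u.RawQuery != "" {
		return fmt.Errorf("endpoint must not include a path or query: %s", raw)
	}
	return nil
}

func main() {
	fmt.Println(validateEndpointURL("ec2.us-gov-west-1.amazonaws.com"))
	fmt.Println(validateEndpointURL("https://ec2.example.com/v2?x=1"))
}
```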
asset/installconfig/aws/session.go: allow creating session with region and service overrides
Adds `GetSessionWithOptions`, which allows creating an aws.Session with options. Currently there are two options: `WithRegion`, which configures the aws.Session to a specific region, and `WithServiceEndpoints`, which configures the aws.Session with a new service resolver that uses the provided list of service endpoints to override the endpoints hard-coded into the AWS SDK for the specific service.
It uses an internal `awsResolver` that can be configured with a list of service endpoints such that when the AWS SDK tries to fetch the endpoint for a service in a region, it returns the configured endpoint. If no such endpoint was configured for this internal resolver, it falls back to the AWS SDK's default resolver that uses the hard-coded list.
NOTE: the internal resolver returns the same endpoint URL for all regions for a service. This is acceptable as service endpoints provided to the installer don't include a region. Also, since clusters are installed to a single region, one should only need the endpoint for that region or the global one.
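A dependency-free sketch of the resolver behaviour described above: overrides are keyed by service only (so the same endpoint is returned for every region), and anything not overridden falls back to a default resolver. In the real code this would implement the aws-sdk-go endpoints resolver interface; the `resolver` type here is an assumption for illustration:

```go
package main

import "fmt"

// resolver returns a configured override for a service when one exists,
// and otherwise defers to a fallback that stands in for the SDK's
// built-in, hard-coded endpoint table.
type resolver struct {
	overrides map[string]string                   // service => endpoint URL
	fallback  func(service, region string) string // SDK default stand-in
}

func (r resolver) resolve(service, region string) string {
	if ep, ok := r.overrides[service]; ok {
		// Same endpoint for every region, as noted above.
		return ep
	}
	return r.fallback(service, region)
}

func main() {
	r := resolver{
		overrides: map[string]string{"ec2": "https://ec2.internal.example.com"},
		fallback: func(service, region string) string {
			return fmt.Sprintf("https://%s.%s.amazonaws.com", service, region)
		},
	}
	fmt.Println(r.resolve("ec2", "us-east-1"))
	fmt.Println(r.resolve("s3", "us-east-1"))
}
```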
asset/installconfig: use service endpoints when creating the session
The aws metadata now provides consumers a session initialized with the region and the AWS service overrides set in the install-config.
The clients in the various assets therefore transparently use the correct endpoints without extra configuration, while still being able to override them if needed when initializing the client.
asset: Use common session when fetching public zone ID for AWS
Previously, while generating the DNS asset, the asset would invoke the GetPublicZone function to fetch the public Route53 zone ID. This function would create its own AWS session instead of using the session already provided by the aws metadata. This has various downsides: the region is not configured and the service endpoint overrides are not handled. Switching to the already-initialized session allows it to reuse that setup.
aws: configure terraform with service endpoints
The installer now passes on the service endpoints set in the install-config as a map of service name => service endpoint URL.
The endpoints for the aws provider are a `schema.Set` type, so an input map object cannot be directly assigned to the endpoints; therefore, we look up the map object for each specific key and set the corresponding endpoint. There are two major concerns with this:
First, what if we need multiple endpoints for a service, e.g. per region?
Terraform only supports providing ONE service endpoint per service, and this is fine as the installer uses terraform to create resources in only one region today. If there were a case where we had to create resources across regions, we might have to extend the installer to invoke terraform multiple times; there is no plan/requirement for that anytime soon.
Secondly, why is the list hard-coded? That's not good, right?
The installer only creates a SPECIFIC set of resources using terraform, so the hard-coded list should be fine, and it can be extended when we need new types of resources. The list is hard-coded today because of terraform's current limitations.
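The lookup-and-set pattern can be sketched as follows; the `terraformServices` list mirrors the required-services list quoted earlier in this PR, and the function name and output shape are assumptions for illustration:

```go
package main

import "fmt"

// terraformServices is the hard-coded set of services the installer's
// terraform config actually uses (mirrors the list in the text).
var terraformServices = []string{
	"ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", "tagging",
}

// endpointsForTerraform walks the known service list and keeps only the
// user-provided overrides that terraform can consume, since the provider
// accepts at most one endpoint per known service.
func endpointsForTerraform(overrides map[string]string) map[string]string {
	out := map[string]string{}
	for _, svc := range terraformServices {
		if ep, ok := overrides[svc]; ok {
			out[svc] = ep
		}
	}
	return out
}

func main() {
	fmt.Println(endpointsForTerraform(map[string]string{
		"ec2":     "https://ec2.example.com",
		"unknown": "https://ignored.example.com", // not a terraform-known service
	}))
}
```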
destroy/aws: use the service overrides for AWS apis
Currently the destroy code-path for the isntaller is driven completely based on the
metadata.json. This allows users to clean up the clusters as long as they possess/create ametadata.jsonfor the cluster. Therefore, the service endpoints for the cluster need to be propagated to themetadata.jsonfor later use.Adds a list of service endpoints similar to the install-config to the metadata.json's definition. The metadata assets uses the install-config to set the service endpoints when provided by the user.
Updates destroy/aws.New to create the session using `GetSessionWithOptions`, similar to the other assets, to re-use the region and service endpoint options. This makes the destroy code a lot less complicated. destroy/awsClusterInstaller allows users to provide no aws session; in that case, destroy continues to create an aws session using SDK defaults with the specified region and no service endpoint overrides. Requiring the user to handle a special session and pass it along seems sane for users using the destroy code directly, similar to the installer's metadata.json-powered code-path.
manifests/aws: create cloud provider config when service endpoints are set
The cloud provider config is imported sparsely from upstream. The cloud provider config functions are imported from the previous cloudproviderconfig.go and cloudproviderconfig_test.go but have been slightly modified for changed field names.
!IMPORTANT: All this code can be removed from the installer when the cloud-config-stitching controller is available based on the enhancement.
The aws kube-cloud-controller uses special configuration. For OpenShift, we need to configure the service endpoints for the various AWS services. This configuration generation will be handled by the controller defined in 5, but for the time being it is done once as phase one.
data/aws/main.tf: load sts service endpoint override
terraform uses the sts service to find details about the user credentials. These details are used to compute various things like the account ID, partition ID, or the DNS suffix.
aws: use SDK and rhcos AMIs to calculate known regions
With respect to the installer, there are four kinds of regions that a user can provide.
First, a commercial AWS region where the RHCOS AMIs are published and known to the installer.
Second, a commercial AWS region where the RHCOS AMIs are not published, or not known to the installer at that version.
Third, a restricted or sovereign AWS region where RHCOS AMIs are not published and which also does not allow cross-region traffic.
Fourth, a custom region only available to the user.
For commercial regions in the first category, the installer is expected to use the AMI embedded in the binary. For commercial regions in the second, the installer is expected to `import` the AMI from one of the regions in the first. For the restricted AWS regions, the installer cannot `import` the AMI, so the user is expected to provide the installer with the AMI to be used. Similarly for custom user-only regions, the users are expected to provide the AMI.

Now, the list of regions that fall into the first vs. the second category depends on the regions for which an AMI exists in `rhcos{-amd64}.json`. Since the contents of that file are embedded into the binary at build time using `shurcooL/vfsgen`, there are two ways to provide this list to the Go code:

- either provide an API as part of the `rhcos` pkg that reads the rhcos.json from the embedded assets.
  Since the file asset is not available to `read` unless the binary is built or the `OPENSHIFT_INSTALL_DATA` env is set, running unit tests with a plain `go test` would fail. Losing the ability to run unit tests without setting any env or building a release binary severely hinders the extent of unit testing that is possible.
- or check in a hard-coded list of the regions based on the rhcos.json contents.
  This has the advantage that go tests just work without any extra configuration, but now the file needs to be kept in sync with rhcos.json.

To help with automatic management, we can use go generate to create a Go file that takes rhcos.json as input and outputs a Go file with the list of regions as a variable. This file needs to be checked into version control, and the only step required when rhcos.json is bumped or changed is running `go generate ./pkg/rhcos/...`.

To follow the second method, a new executable Go file `ami_regions_generate.go` is added, which can be executed as shown earlier. The generator provides a variable `AMIRegions`, a string slice of all the regions that are part of `rhcos.json`. `ami.go` includes a go:generate tag to allow go generate to execute the generator and update the list.

The installconfig asset package now provides a public function to check whether a region is `Known`, i.e. in the first category.

asset/installconfig/aws/platform.go: use the known regions for the region list for tui
The terminal prompts must only offer users a list of regions where the installer has a RHCOS AMI published. The terminal prompts are designed for getting-started regions, and unknown regions usually end up requiring more configuration than the TUI can support.
aws/validation/platform.go: drop region validation from install-config
Previously the installer only allowed users to specify certain regions in the install-config. But now users should be allowed to specify any region in the install-config.yaml; various validations can be added to make sure users provide AMIs and service endpoints for custom regions to allow installs to succeed.
Since the AWS SDK only has endpoints for most of the services in the regions known internally to the SDK, we require that users provide service endpoints for some required services when using a custom region.
The required services are: "ec2", "elasticloadbalancing", "iam", "route53", "s3", "sts", and "tagging".
The users also need to specify an AMI for these custom regions.
aws: copy AMI from us-east-1 for SDK regions when not available
For the second category of regions, commercial AWS regions that don't have a RHCOS AMI published but that allow AMI import, the installer is expected to copy the AMI to the user's account from one of the RHCOS-published regions.
So the installer picks the AMI published for the us-east-1 region and copies it to the chosen region for the cluster to use.
Currently, the rhcos.Image asset's string content is extended to `<ami-id>,<ami-id-region>` when the AMI to be used is from a different region than the cluster's. The machinesets and machine objects are updated to use AWS tags to find the generated AMI.
The user is still allowed to provide an AMI if they want to re-use one, in which case the installer uses the AMI as-is and does not copy/import it.