installer: AWS custom region and endpoints #163
Conversation
> #### AWS API endpoints for cluster operators
>
> Since almost all of the cluster operators that communicate with the AWS API need to use the API endpoints provided by the user, the `Infrastructure` global configuration seems like the best place to store and discover this information.
>
> And in contrast to the current configuration in the `Infrastructure` object, like `region`, which is not editable as a day-2 operation, users should be able to edit the AWS endpoints day-2; therefore this configuration best belongs in the `spec` section.
cc @derekwaynecarr @deads2k @csrwng
Because we previously had pushback against adding configuration to the global configuration.
@abhinavdahiya sgtm from a hosted deployment point of view
> 2. Configure each individual cluster operator
>
>    There are five cluster operators that would need to be configured, namely cluster-kube-controller-manager, cluster-ingress-operator, cluster-machine-api-operator, cluster-image-registry-operator, and cluster-credential-operator. There might be more operators, like cluster-network-operator, that might require access to the AWS APIs in the future to control the security group rules. Configuring all of these separately is not a great UX for the installer or for a user who wants to modify the cluster to use API endpoints as a day-2 operation.
>
> #### Cluster Kube Controller Manager Operator
cc @deads2k
for changes in cluster-kube-controller-manager-operator
> If the `.spec.cloudConfig` already has the service overrides set, the operator should fail to roll out the configuration.
>
> #### Other Cluster Operators
Adding a Jira for CCO.
I assume that if an S3 endpoint is not defined, the image-registry-operator will not be able to configure storage and will report itself Degraded.
These API endpoints are optional, not required, so we continue to use the default ones unless these overrides are specified.
This all seems reasonable to me so far, and the work described for the ingress operator and KCM seems accurate to me.
> Currently the cluster-kube-controller-manager-operator uses the configmap `.spec.cloudConfig`, which stores the Kubernetes cloud-provider configuration, to configure the kube-controller-manager for various clouds.
>
> But since the API endpoints are now part of the `.spec.platformSpec.aws.apiEndpoints[]`, the kube-controller-manager will have to take the cloud configuration from the `.spec.cloudConfig` and add the [service overrides for AWS][k8s-aws-cloud-provider-service-overrides].
Just a quick skim, it seems like the cloudconfig should match reality. I'm not clear on why individual operators would be left trying to stitch and understand the vagaries of various cloud providers.
This appears to be the second example of a need for an infrastructure- or config-level operator to perform stitching/maintenance concerns. The first was proxy, which was shoehorned into the networking operator. I don't think we should continue to shy away from creating an operator at the level we need it.
> Just a quick skim, it seems like the cloudconfig should match reality.

The cloud config is the k8s cloud provider configuration, which I personally think is wrong for our operators to respect:
- it's designed and maintained by upstream
- it contains various fields which I don't think our operators should get.
- there are no major compatibility guarantees for this configuration...

> I'm not clear on why individual operators would be left trying to stitch and understand the vagaries of various cloud providers.

If the operators are going to communicate with cloud APIs, they need to understand the vagaries of the various cloud providers. There is no one thing that can do it all, so I don't see a way our operators can shy away from understanding the complexities of the cloud.

> This appears to be the second example of a need for an infrastructure- or config-level operator to perform stitching/maintenance concerns. The first was proxy, which was shoehorned into the networking operator. I don't think we should continue to shy away from creating an operator at the level we need it.

I don't think adding another operator here helps with stitching/maintenance, as there is no one thing that can do it all. The KCM-O manages the k8s cloud provider, and it is the best place to keep the knowledge of configuring the cloud provider.
> I'm not clear on why individual operators would be left trying to stitch and understand the vagaries of various cloud providers.

> If the operators are going to communicate with cloud APIs, they need to understand the vagaries of the various cloud providers.

Following that thought, operators using cloud API clients already implicitly manage connection details via hard-coded cloud API library default values. Externalizing and centralizing that configuration seems like an improvement to me in that regard.

@deads2k, I'm not clear what alternative you're proposing. I think you allude to it here, but maybe it presupposes some background on this topic I'm missing:

> I don't think we should continue to shy away from creating an operator at the level we need it.
Would you be willing to elaborate a bit more on the specifics of your preferred solution?
Cloud provider configuration files are used by the kube-apiserver, kubelet, kube-controller-manager, and cloud-controller-manager, and that's just the Kubernetes components. It is a mistake to make each of these consumers stitch together an existing cloudconf file with additional side-channel information. The cloudconf file that we tell components to load needs to be fully valid. If that means that an operator (not the kcm-o, but one for configuration or cloud providers) needs to stitch user input with infrastructure information so that the information is codified in one spot, so be it.
Doing this in one spot instead of a half-dozen (with more potentially coming) makes a lot more sense than doing it everywhere. Doing it in a spot that can be logically ordered (if not actually ordered) also makes sense. Stitching as described in this enhancement looks like an antipattern.
Are you proposing a higher-level cloud provider configuration object which would subsume the AWS-specific endpoint details proposed in this EP (as well as future endpoint configurations for other cloud providers)?
/cc @joelddiaz
> // override existing defaults of AWS Services.
> // Currently Supports - EC2, IAM, ELB, S3 and Route53.
> type AWSCustomEndpoint struct {
>     Service string `json:"service"`
Make this a string-based type? Perhaps not since we'd have to update this like crazy after each re:Invent conference...
    type AWSService string

    const (
        IAM AWSService = "IAM"
        EC2 AWSService = "EC2"
        S3  AWSService = "S3"
        ...
    )
> #### Destroy Cluster
>
> The `metadata.json` needs to store the custom endpoints from the install-config.yaml so that the users can delete the clusters without any previous state.
Is this reused by hive?
Hive launches a deprovision in a way that doesn't use metadata.json: https://github.com/openshift/hive/blob/f12a23ba3ee502ed8e71761e1d1af3c080e3adad/pkg/install/generate.go#L435-L451
But Hive would need to save these custom endpoints so that a deprovision could succeed.
CC @dgoodwin
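For context, a rough sketch of the shape the AWS section of `metadata.json` could take once endpoints are recorded; these type and field names are illustrative assumptions, not the installer's actual types:

```go
// Hypothetical AWS metadata the destroy code would read back to configure its
// SDK session before looking up and deleting cluster resources.
type AWSClusterMetadata struct {
	Region           string            `json:"region"`
	ServiceEndpoints []ServiceEndpoint `json:"serviceEndpoints,omitempty"`
}

// ServiceEndpoint pairs an AWS service name with the custom URL to use for it.
type ServiceEndpoint struct {
	Name string `json:"name"`
	URL  string `json:"url"`
}
```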
> Infrastructure global configuration already performs the function of tracking infrastructure-related configuration, and another global configuration that stores a part of the information doesn't seem like a great option. But it might allow validations and status observation by an independent controller.
>
> 2. Configure each individual cluster operator
>
>    There are five cluster operators that would need to be configured, namely cluster-kube-controller-manager, cluster-ingress-operator, cluster-machine-api-operator, cluster-image-registry-operator, and cluster-credential-operator. There might be more operators, like cluster-network-operator, that might require access to the AWS APIs in the future to control the security group rules. Configuring all of these separately is not a great UX for the installer or for a user who wants to modify the cluster to use API endpoints as a day-2 operation.
I believe OLM operators like Metering (and any other operator leveraging AWS services) would need to likewise be configured.
> #### Installing to custom region
>
> #### Installing to new regions without native RHCOS support in the region
This item isn't just relevant for RHCOS support - components like the image registry need their SDKs upgraded as new regions are added.
In the case of the image registry, we have a regionEndpoint override in our operator config which lets admins override the S3 endpoint the registry talks to. In implementing this enhancement, the registry operator would likely read the CustomEndpoint for S3 and apply it to the regionEndpoint override. In theory customers can add the new region endpoints as CustomEndpoint values, and the operator would take care of the rest.
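As an illustration of that flow, a minimal sketch of how the registry operator might pick the S3 override out of the `Infrastructure` object and feed it into its `regionEndpoint` override; the helper name is hypothetical, and the field paths assume the `ServiceEndpoints` list of name/URL pairs discussed in this enhancement:

```go
import (
	"strings"

	configv1 "github.com/openshift/api/config/v1"
)

// s3Endpoint returns the user-provided S3 service endpoint from the
// Infrastructure object, or "" when no override is set so that callers fall
// back to the SDK defaults. Sketch only, not the operator's actual code.
func s3Endpoint(infra *configv1.Infrastructure) string {
	if infra.Status.PlatformStatus == nil || infra.Status.PlatformStatus.AWS == nil {
		return ""
	}
	for _, ep := range infra.Status.PlatformStatus.AWS.ServiceEndpoints {
		if strings.EqualFold(ep.Name, "s3") {
			return ep.URL
		}
	}
	return ""
}
```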
Updated the document based on previous comments. PTAL. ping @deads2k @ironcladlou @adambkaplan @enxebre @openshift/openshift-team-installer
> * elb2
> * iam
> * route53
> * resourcegrouptaggingapi
Also needs s3 (image registry)
> // service endpoint of AWS Services.
> // There must be only one ServiceEndpoint for a service.
> // +optional
> ServiceEndpoints []AWSServiceEndpoint `json:"serviceEndpoints,omitempty"`
Shouldn't this be a map-based type? I also recommend that we have a string-based type to hold the constants for AWS services that need overrides.
    type AWSService string

    const (
        IAM     AWSService = "iam"
        EC2     AWSService = "ec2"
        ELB     AWSService = "elb"
        Route53 AWSService = "route53"
        ...
    )
> I also recommend that we have a string-based type to hold the constants for AWS services that need overrides.

Maintaining a separate list is too much overhead, and there are effectively infinite AWS services. If OLM-based operators start using this service endpoints list, we will be stuck adding types for those operators.

I think we can run validations, especially when we move from spec to status, to see if the service name is unknown. That too is going to be tough unless the AWS SDK has a list ready :/

> Shouldn't this be a map-based type?

I find the list format to be more ergonomic... but I think we can use maps vs. lists based on which one makes it easier to show validation errors and which one might be easier for patching/updating.

    ec2:
      url: ...
    s3:
      url: ...
    ec2:
      url: ...

I'm not sure if the parser itself will error out or just use the later key's value...
> * ec2
> * elb
> * elb2
Does elb2 actually exist as a separate service?
> There are 2 ways to configure the AWS SDK,
>
> 1. Use the SDK [resolver][aws-sdk-endpoints-resolver], like the Kubernetes AWS [cloud-provider][k8s-aws-cloud-provider-get-resolver] does, so that the AWS SDK's default resolver is used for AWS APIs and overridden with the user-provided ones when available.
I think this can work for the machine-api decently.
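For reference, a rough sketch of that resolver approach with aws-sdk-go v1; the `overrides` map (keyed by SDK service ID such as `ec2`) and the function names are assumptions for illustration:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/endpoints"
	"github.com/aws/aws-sdk-go/aws/session"
)

// newResolver wraps the SDK's default resolver: services present in overrides
// resolve to the user-provided URL, everything else falls back to AWS defaults.
func newResolver(overrides map[string]string) endpoints.ResolverFunc {
	defaults := endpoints.DefaultResolver()
	return func(service, region string, opts ...func(*endpoints.Options)) (endpoints.ResolvedEndpoint, error) {
		if url, ok := overrides[service]; ok {
			return endpoints.ResolvedEndpoint{URL: url, SigningRegion: region}, nil
		}
		return defaults.EndpointFor(service, region, opts...)
	}
}

// newSession builds a session whose clients all honor the overrides; clients
// for services without an override keep using the normal AWS endpoints.
func newSession(region string, overrides map[string]string) (*session.Session, error) {
	return session.NewSession(aws.NewConfig().
		WithRegion(region).
		WithEndpointResolver(newResolver(overrides)))
}
```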
> To support custom endpoints for AWS APIs, the install-config would allow the users to provide a list of endpoints for various services. When the custom endpoints are set, the installer will validate that the endpoints are reachable. The endpoints will be used by the installer to call AWS APIs for performing various actions like validations, creating the MachineSet objects, and also configuring the terraform AWS provider.
>
> When custom endpoints are provide for AWS APIs, the cluster operators also need the discover the information, therefore, the installer will make these available using the `config.openshift.io/v1` `Infrastructure` object. The cluster operators can then use the custom endpoints from the object to configure the AWS SDK to use the corresponding endpoints for communications.
typo provide/provided
> When custom endpoints are provide for AWS APIs, the cluster operators also need the discover the information, therefore, the installer will make these available using the `config.openshift.io/v1` `Infrastructure` object. The cluster operators can then use the custom endpoints from the object to configure the AWS SDK to use the corresponding endpoints for communications.
>
> Since various kubernetes components like the kube-apiserver, kubelet (machine-config-operator), kube-controller-manager, cloud-controller-managers use the `.spec.cloudConfig` Config Map reference for cloud provider specific configurations, a new controller `cluster-kube-cloud-config-operator` will be introduced to perform the task of stitching the custom endpoints with the rest of the cloud config, such that all the kubernetes components can continue to directly consume a Config Map for configuration. This controller will perform the specialized stitching on the bootstrap host for control-plane kubelet and also actively reconcile the state in the cluster.
to make sure I understand, would the cluster-kube-cloud-config-operator permanently watch changes to the custom endpoints and reconcile the cloud provider ConfigMap with them?
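For illustration only, a minimal sketch of the stitching step such a controller might perform, assuming the `[ServiceOverride]` sections understood by the in-tree AWS cloud provider as the output format; the `AWSServiceEndpoint` name/URL shape comes from this proposal, while the function name and everything else here are assumptions:

```go
import (
	"fmt"
	"strings"

	configv1 "github.com/openshift/api/config/v1"
)

// appendServiceOverrides takes the user-supplied cloud.conf contents and
// appends one [ServiceOverride "n"] section per custom endpoint, producing a
// single complete file for kubelet, KCM, and friends to consume.
func appendServiceOverrides(cloudConf, region string, eps []configv1.AWSServiceEndpoint) string {
	var b strings.Builder
	b.WriteString(cloudConf)
	for i, ep := range eps {
		fmt.Fprintf(&b, "\n[ServiceOverride %q]\n", fmt.Sprint(i+1))
		fmt.Fprintf(&b, "Service = %s\n", ep.Name)
		fmt.Fprintf(&b, "Region = %s\n", region)
		fmt.Fprintf(&b, "URL = %s\n", ep.URL)
		fmt.Fprintf(&b, "SigningRegion = %s\n", region)
	}
	return b.String()
}
```

The controller would then publish the result to a managed ConfigMap and keep reconciling it as the endpoint list in the `Infrastructure` object changes.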
> 3. The URL for the service endpoint must be `https` and the host should trust the certificate.
>
> #### Configuring the AWS SDK
cc @alexander-demichev for aws actuator
@abhinavdahiya For QE's testing, do we need some extra configuration on the installer client for this section? Or is this section mainly for developers?
> // AWSServiceEndpoint store the configuration of a custom url to
> // override existing defaults of AWS Services.
> // Currently Supports - EC2, IAM, ELB, S3 and Route53.
> type AWSServiceEndpoint struct {
For route53 and the resource tagging API, when you create the AWS client you need to set the API endpoint, and you also need to set a specific region: for global you need to set us-east-1, and for cn-north-1 and cn-northwest-1 you need to set cn-northwest-1. Do you think it's needed here to add another region parameter, so that it matches how you create the AWS client?
The reason you have to set the region is so that the SDK can find the correct endpoint. Since we are already asking the user to give us the endpoint, there is no value in also asking for the region.
Before this merges, I think the related enhancement needs to reach closure. Creating an API that requires stitching in at least four different operators seems like a bad pattern to create without buy-in from those components. In addition, the user input here would result in either an improper cloud.conf being provided to those components or information duplication. If you provided an operator/cluster-config-operator control loop to perform the stitching, it would make this much more palatable.
@deads2k the comment seems dated with respect to the updated enhancement. Can you take another look at the contents and see if it fixes your concerns? @deads2k https://github.com/openshift/enhancements/pull/163/files#diff-b6a7e34af40d81a977a31247b107348eR213 already documents the control loop that stitches for components.
No, the control loop will be the safeguard.
deads2k left a comment:
Having a common piece of code to validate, reconcile, and stitch together configuration for the rest of the cluster to use makes this much easier to consume.
I would like the destination more completely described, a few error conditions clarified, and one alternative explained in detail. But in principle, if we agree that an existing cloud conf file cannot serve our purpose, I'm ok with this.
> When custom endpoints are provided for AWS APIs, the cluster operators also need the discover the information, therefore, the installer will make these available using the `config.openshift.io/v1` `Infrastructure` object. The cluster operators can then use the custom endpoints from the object to configure the AWS SDK to use the corresponding endpoints for communications.
>
> Since various kubernetes components like the kube-apiserver, kubelet (machine-config-operator), kube-controller-manager, cloud-controller-managers use the `.spec.cloudConfig` Config Map reference for cloud provider specific configurations, a new controller `cluster-kube-cloud-config-operator` will be introduced to perform the task of stitching the custom endpoints with the rest of the cloud config, such that all the kubernetes components can continue to directly consume a Config Map for configuration. This controller will perform the specialized stitching on the bootstrap host for control-plane kubelet and also actively reconcile the state in the cluster.
- Line break on sentences of 120 characters please.
- I think this will require moving the authoritative source of cloud config into openshift-config-managed so that inputs and outputs are clear and ownership is well defined.
- This control loop should fail if there are conflicts in values and clearly indicate degraded status.
> ### Non-Goals
>
> 1. There are no plans to allow self-signed server certificates for endpoints specifically.
This seems important. Other cloud configs allow specifying custom CA bundles to provide scoped trust for cloud provider calls without risking the security of other calls where its use would be unexpected. Why would we exclude this and what is the substitute? Cloud providers are different than system trust.
This is not what we are trying to solve for today. Most definitely we might have to solve it later on.
> // PlatformSpec holds some configuration to the underlying infrastructure provider
> // of the current cluster. It is supposed that only one of the spec structs is set.
> type PlatformSpec struct {
needs a union discriminator
> }
>
> #### Global configuration Alternatives
I think there's another alternative. Does AWS have a cloud.conf file similar to the openstack one: https://kubernetes.io/docs/concepts/cluster-administration/cloud-providers/#cloud-conf? This is logically already an API we support since we allow updates when these values are specified. This file format already represents the canonical representation of this information.
Since we have that canonical source and it's already an API we support, it's worth explaining why we would choose not to use it.
Minor tweaks of tweaks, but the direction looks reasonable to me. /approve
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: abhinavdahiya, deads2k, sdodson.
The ingress operator uses route53, tagging, and elb clients for DNS management. The tagging client must match the region of route53 in order to look up the hosted zone ID of the alias target for the wildcard DNS record. Since route53 is non-regionalized, it's based out of us-gov-west-1: https://docs.aws.amazon.com/govcloud-us/latest/UserGuide/using-govcloud-endpoints.html. Therefore, the client must be configured to use us-gov-west-1 even when the infrastructure/cluster region is us-gov-east-1. See openshift/cluster-ingress-operator#460 for the implementation details. cc: @abhinavdahiya
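A minimal sketch of that region pinning with aws-sdk-go v1; the GovCloud endpoint URL shown is illustrative, and in practice the operator would read any override from the `Infrastructure` object:

```go
import (
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/resourcegroupstaggingapi"
	"github.com/aws/aws-sdk-go/service/route53"
)

// newDNSClients pins the route53 and tagging clients to us-gov-west-1 even
// when the cluster itself runs in us-gov-east-1, applying a custom endpoint
// to the route53 client when one is configured.
func newDNSClients() (*route53.Route53, *resourcegroupstaggingapi.ResourceGroupsTaggingAPI) {
	sess := session.Must(session.NewSession())
	r53cfg := aws.NewConfig().
		WithRegion("us-gov-west-1").
		WithEndpoint("https://route53.us-gov.amazonaws.com") // illustrative override
	tagCfg := aws.NewConfig().WithRegion("us-gov-west-1")
	return route53.New(sess, r53cfg), resourcegroupstaggingapi.New(sess, tagCfg)
}
```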
cc @yunjiang29
FYI, this enhancement was implemented without the storage team noticing and resulted in https://bugzilla.redhat.com/show_bug.cgi?id=2074706. Please poke @openshift/storage when anything regarding cloud config changes.
ACK. Thanks for bringing this up. I think we are getting better at understanding all the teams/components affected by changes such as this, but this is a good example of an avoidable problem.
/cc @patrickdillon @jstuever @wking
/assign @sdodson
xref: https://issues.redhat.com/browse/CORS-1271