Ingress router resource config #1877

Conversation
This enhancement proposes adding the ability to configure resource limits and requests for the ingress-operator deployment containers via a new v1alpha1 API field in the IngressController custom resource. This addresses the need for:

- Setting resource limits for QoS guarantees
- Compliance requirements for resource constraints
- Scaling operator resources for large deployments

Relates to: RFE-1476
Hi @joseorpa. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
> ### Alternative 2: Modify v1 API directly
>
> Add `operatorResourceRequirements` field directly to stable v1 API.
>
> **Pros**:
> - No need for v1alpha1 version
> - Simpler for users (one API version)
>
> **Cons**:
> - Changes stable API (breaking compatibility promise)
> - Cannot iterate on design easily
> - Difficult to remove if issues found
> - Against OpenShift API stability guarantees
>
> **Decision**: Rejected - Use v1alpha1 for new features as per OpenShift conventions
Where is this v1alpha1 convention coming from? Can we introduce v1alpha1 when we already have v1?

The usual approach is to add the field directly to the existing v1 API (see the sketch below):

- Define a new featuregate, initially in the TPNU feature set (but not Default).
- Add a field to the v1 API, using the new featuregate (as you've done using the `// +openshift:enable:FeatureGate` marker).
- Implement the feature and write tests.
- Add the featuregate to the Default feature set when it's ready.
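For illustration, a minimal sketch of such a gated v1 field, assuming a hypothetical featuregate name (`RouterResourceRequirements`) and field name; neither is settled in this PR:

```go
package v1

import corev1 "k8s.io/api/core/v1"

// IngressControllerSpec excerpt. The featuregate and field names below are
// hypothetical placeholders, not settled API.
type IngressControllerSpec struct {
	// routerResourceRequirements defines resource requests and limits for the
	// router container. The field is only served while the featuregate is
	// enabled and is ignored otherwise.
	//
	// +openshift:enable:FeatureGate=RouterResourceRequirements
	// +optional
	RouterResourceRequirements *corev1.ResourceRequirements `json:"routerResourceRequirements,omitempty"`
}
```

Starting the gate in TechPreviewNoUpgrade (TPNU) keeps the field invisible to default clusters until it graduates.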
This v1alpha1 convention comes from openshift/api#2485 (review)
@JoelSpeed can you help here?
There is a difference between adding a field to an already stable API (which Miciah has pointed out) and adding a completely new API.
The PR I reviewed, and left feedback on, was introducing a completely new API type, and as such, starting as alpha is correct per our latest guidelines.
If you think this should just be a field on an existing v1 API, then that's a different discussion.
> Create a new v1alpha1 API version for IngressController in the
> `operator.openshift.io` group, following the pattern set, for example, by
> [cluster monitoring v1alpha1 configuration](https://github.com/openshift/api/blob/94481d71bb6f3ce6019717ea7900e6f88f42fa2c/config/v1alpha1/types_cluster_monitoring.go#L172-L193).
Can we use a shared type for all operators?
Do you mean core Kubernetes `corev1.ResourceRequirements`? I've seen there are a lot of types in the `operator.openshift.io` group.
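For reference, a small example populating that shared type; the quantities are illustrative only, not proposed defaults:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// routerResources shows the shared corev1.ResourceRequirements type in use.
var routerResources = corev1.ResourceRequirements{
	Requests: corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("100m"),
		corev1.ResourceMemory: resource.MustParse("256Mi"),
	},
	Limits: corev1.ResourceList{
		corev1.ResourceMemory: resource.MustParse("512Mi"),
	},
}
```

Reusing this type would give users the same requests/limits semantics as any pod spec, with no new validation rules to define.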
> - Maintain backward compatibility with existing IngressController v1 API
> - Use v1alpha1 API version for this Tech Preview feature
> - Provide sensible defaults that work for most deployments
> - Support both the ingress-operator and kube-rbac-proxy containers
Why kube-rbac-proxy? Is that only for QoS?
I'm correcting this as well; it is for QoS, but I agree it's not directly related to the router pods.
> A new controller (`operator-deployment-controller`) in the cluster-ingress-operator
> watches the default IngressController CR and reconciles the operator's own deployment
> when `operatorResourceRequirements` is specified.
>
> **Controller responsibilities:**
> 1. Watch IngressController resources (v1alpha1)
> 2. Reconcile `ingress-operator` Deployment in `openshift-ingress-operator` namespace
> 3. Update container resource specifications
> 4. Handle error cases gracefully (invalid values, conflicts, etc.)
This won't work; CVO manages the ingress-operator deployment. You can't have cluster-ingress-operator update its own deployment.
I'm updating this as well; the ingress operator would control just the deployment of the router pods.
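A minimal sketch of what that could look like in the operator's deployment reconciliation, assuming the router container is named `router` and a hypothetical spec field; this is not cluster-ingress-operator's actual code:

```go
package controller

import (
	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
)

// setRouterResources copies user-specified requirements (from a hypothetical
// IngressController spec field) onto the router container of the deployment
// the operator is about to apply. When the field is unset, the operator's
// built-in defaults are left in place.
func setRouterResources(desired *appsv1.Deployment, requirements *corev1.ResourceRequirements) {
	if requirements == nil {
		return // no override requested; keep the operator's defaults
	}
	containers := desired.Spec.Template.Spec.Containers
	for i := range containers {
		if containers[i].Name == "router" { // assumed container name
			containers[i].Resources = *requirements
		}
	}
}
```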
> **Mitigation**:
> - Controller reconciliation loop detects and corrects drift
> - Document that configuration should be via IngressController CR, not direct deployment edits
> - Admission webhooks prevent direct deployment modifications
Are you proposing adding an admission webhook to block updates to the ingress-operator deployment?
I'm correcting this and changing it to a conversion webhook for the different API versions.
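For context, a minimal sketch of the controller-runtime conversion pattern such a webhook would typically build on; the types here are stand-ins, not the actual openshift/api types, and the field mappings are illustrative:

```go
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"sigs.k8s.io/controller-runtime/pkg/conversion"
)

// IngressController stands in for the hypothetical v1alpha1 type.
type IngressController struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

// HubIngressController stands in for the served v1 type, which acts as the
// conversion hub.
type HubIngressController struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
}

// Hub marks the v1 stand-in as the hub version.
func (*HubIngressController) Hub() {}

// DeepCopyObject satisfies runtime.Object; a shallow copy is enough for a sketch.
func (in *HubIngressController) DeepCopyObject() runtime.Object {
	out := *in
	return &out
}

// ConvertTo converts this (spoke) v1alpha1 object to the v1 hub version.
func (src *IngressController) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*HubIngressController)
	dst.ObjectMeta = src.ObjectMeta
	// Spec fields would be copied here; v1alpha1-only fields are dropped.
	return nil
}

// ConvertFrom converts the v1 hub version into this v1alpha1 object.
func (dst *IngressController) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*HubIngressController)
	dst.ObjectMeta = src.ObjectMeta
	// Spec fields would be copied here; v1alpha1-only fields get defaults.
	return nil
}
```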
> This enhancement proposes adding the ability to configure resource limits and
> requests for the ingress-operator deployment containers via a new v1alpha1 API
> field in the IngressController custom resource.
If this is for the ingress-operator deployment, it doesn't make sense to put this in the IngressController CRD, which describes configuration for router pods.
I agree; I will update the content of this part of the enhancement as well.
> 1. **Q**: Should we support auto-scaling (VPA) in the future?
>    - **A**: Out of scope for initial implementation, but API should not preclude it
Autoscaling the operator?
This should be the router pod for sure; updating this too.
> 3. **Q**: Should this apply to all IngressControllers or only the default?
>    - **A**: Initial implementation only default, but API supports any IngressController
Does the configuration apply to IngressControllers (router) pods at all, or only to the ingress-operator pod?
If you mean it applies only to the ingress-operator pod, are you saying that resource requests and limits for the ingress-operator pod are read from the "default" IngressController, and resource request and limits specified on other IngressController CRs are ignored? Putting configuration for the operator in the IngressController CRD is confusing (see #1877 (comment)).
If you actually mean resource requests and limits for router pods, then it seems to me that it is simplest and least surprising to respect the configuration for all IngressControllers, not only for the default. Does respecting configuration for other IngressControllers pose some problem?
It will be for all router pods.
> 4. **Q**: How do we handle the operator modifying its own deployment safely?
>    - **A**: Use owner references carefully, reconcile loop with backoff
Can you elaborate on this point? How do you avoid conflicts with CVO?
Changed it to the router pods controlled by the ingress controller.
> - [ ] Sufficient field testing (2+ minor releases in Tech Preview)
> - [ ] No major bugs reported for 2 consecutive releases
This is an unusual requirement for OpenShift. For a feature like this, we would usually introduce as Tech Preview and graduate to GA in the same release development cycle.
> - [ ] No major bugs reported for 2 consecutive releases
> - [ ] Performance impact assessed and documented
> - [ ] API design validated by diverse user scenarios
> - [ ] At least 10 production users providing positive feedback
Do you believe you will be able to find 10 production users of this feature?
> - Simpler to implement
> - No API version changes needed
> - Easy to update without CRD changes
You would need a CRD change to add a reference to the ConfigMap... unless you would have the operator just check for a ConfigMap in openshift-config with some hard-coded name?
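A sketch of that hard-coded-name variant, assuming a hypothetical namespace/name/key contract (`openshift-config` / `ingress-operator-resources` / `resources.yaml`); none of this is specified in the enhancement:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// readResourceOverrides polls a ConfigMap at a well-known, hard-coded
// location instead of following a reference declared in the CRD.
func readResourceOverrides(ctx context.Context, client kubernetes.Interface) (string, error) {
	cm, err := client.CoreV1().ConfigMaps("openshift-config").
		Get(ctx, "ingress-operator-resources", metav1.GetOptions{})
	if err != nil {
		return "", fmt.Errorf("reading resource overrides: %w", err)
	}
	return cm.Data["resources.yaml"], nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	overrides, err := readResourceOverrides(context.Background(), kubernetes.NewForConfigOrDie(cfg))
	if err != nil {
		panic(err)
	}
	fmt.Println(overrides)
}
```

The trade-off the comment points out still holds: without a CRD field referencing the ConfigMap, the name becomes an undiscoverable side contract.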
> ### Alternative 3: Separate CRD for operator configuration
>
> Create a new OperatorConfiguration CRD (similar to how cluster monitoring works).
>
> **Pros**:
> - Separation of concerns
> - Can configure multiple operators uniformly
>
> **Cons**:
> - Increases API surface unnecessarily
> - IngressController is the logical place for ingress-operator configuration
> - More CRDs to manage
> - Inconsistent with how other operators handle self-configuration
>
> **Decision**: Rejected - IngressController CR is the appropriate configuration location
If you really do mean for this EP to be specifically for the ingress-operator pod (and not router pods), then I really like this alternative. Have you considered a variant: adding configuration for resource requests and limits to the ClusterVersion CRD (alongside the existing component overrides)? This makes a lot of sense for a few reasons (a sketch of the existing override mechanism follows the list):
- CVO is the thing that manages the deployment right now; trying to have cluster-ingress-operator update the deployment that CVO manages is asking for trouble.
- The resource requests and limits configuration logically fits under CVO configuration, not the IngressController API.
- The configuration logically fits in with component overrides.
- The resource requests and limits configuration could apply to any operator, not just cluster-ingress-operator; putting the configuration under the ClusterVersion CRD would provide a centralized, consistent way to configure it for multiple operators.
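For context, a sketch of the existing component-override mechanism this variant would sit alongside; the resource-requirements extension itself is hypothetical and not shown:

```go
package main

import (
	"fmt"

	configv1 "github.com/openshift/api/config/v1"
)

func main() {
	// Today's ClusterVersion override knob can only tell CVO to stop
	// managing a resource entirely. A resource-requirements override
	// would be a new, hypothetical sibling of the Unmanaged field.
	override := configv1.ComponentOverride{
		Kind:      "Deployment",
		Group:     "apps",
		Namespace: "openshift-ingress-operator",
		Name:      "ingress-operator",
		Unmanaged: true,
	}
	fmt.Printf("%+v\n", override)
}
```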
> **Cons**:
> - Not GitOps friendly
> - Requires direct deployment modification
> - Not discoverable via API
> - Doesn't follow OpenShift declarative configuration patterns
> - Difficult to audit and version control
Also, it would require a CVO override.
> ## Design Details
>
> ### Open Questions
Can you address this point from https://github.com/openshift/enhancements/blob/master/CONVENTIONS.md#resources-and-limits?
> We do not want cluster components to be restarted based on their resource consumption (for example, being killed due to an out-of-memory condition). We need to detect and handle those cases more gracefully, without degrading cluster performance.
/assign
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
/cc @alebedev87 |
Enhancement: Ingress Operator Resource Configuration via v1alpha1 API

This enhancement proposes adding the ability to configure resource limits
and requests for the ingress-operator deployment containers via a new
v1alpha1 API field in the IngressController custom resource.

This addresses the need for:

- Setting resource limits for QoS guarantees
- Compliance requirements for resource constraints
- Scaling operator resources for large deployments

Relates to: RFE-1476